## Introduction

I’ve always been inspired by the `simdjson`

project. Their commitment to speed, but also modern development practices and ease-of-use are … well … inspiring!

I watched a talk given by Daniel Lemire on how `simdjson`

is so fast. In the presentation he mentioned that they are not writing their own bespoke assembly as I had assumed, but rather using “intrinsic functions”.

### ciso8601

I am the maintainer of `ciso8601`

the fastest ISO 8601 timestamp parser for Python.

Currently, `ciso8601`

is written in extremely boring C code. This is great from a maintenance perspective since it is simple to audit and doesn’t force me to do anything too strange in order to work with any C compiler.

But I’ve always wondered whether we could go faster if we were willing to manage multiple bespoke implementations for each of the common architectures.

“Intrinsic functions” sounded like it might strike a manageable compromise.

### Intrinsic functions

Intrinsic functions are functions provided by C/C++/Rust compilers that you can use to access architecture-specific functionality. Typically, they are used to access vector CPU instructions with performance in mind.

Vector instructions allow you to perform operations across many bytes with a single instruction, speeding up CPU-bound computation. For example, with the AVX-512 instructions on modern Intel CPUs, you can manipulate 64 bytes at a time!

Vector instructions can be useful when parsing because you no longer have to parse byte-by-byte; sometimes you can use them to parse many bytes at once.

### My use case: RFC 3339

In my case, I was trying to parse RFC 3339 timestamps as fast as possible. RFC 3339 can be [roughly] thought of as strict subset of the timestamp formats allowed by ISO 8601.

The first 19 bytes of a RFC 3339 timestamp are all mandatory, and are always in the same positions:

`YYYY-MM-DDThh:mm:ss`

(After those 19 bytes, you have a variable length fractional second, and a variable-length UTC offset)

My non-vectorized code to parse these timestamps looks something like this:

```
char *str = "2014-01-09T21:48:56.12345-05:30";
char *c = str;
// Parse year
int year = 0;
for (i = 0; i < 4; i++) {
if (*c >= '0' && *c <= '9') {
year = 10 * year + *c++ - '0';
}
else {
return format_unexpected_character_exception("year", c, (c - str) / sizeof(char), 4 - i);
}
}
// Parse month
// ...
```

It is parsing each byte, one at a time. This is still pretty fast. But can it be faster? (**Spoilers:** yes)

## A naive vector implementation

The parsing of a RFC 3339 string involves:

- Verifying that the all the characters are within acceptable ranges
- Parsing the 9 fields (year, month, day, hour, minute, second, fractional second, and UTC offset) into 9 integers.
- Validating the integers form a valid datetime
- (e.g., That the day integer is valid for the given month; there is no February 30!)

My development machine is a x86 Intel CPU (“Coffee Lake”), so I started with the oldest (and presumably most widespread) SSE instrinsic function set.

SSE instructions operate on 128 bits at a time, or 16 bytes. Conveniently, the first 16 bytes are aligned with the first 5 fields of an RFC 3339 timestamp:

`2014-01-09T21:48`

(year, month, day, hour, and minute)

So I set out to parse the first 16 bytes with SSE vector instructions, then use my existing non-vectorized implementation for any remaining bytes.

### Verifying character bounds

First step is to verify that all the bytes are within bounds:

```
const __m128i characters = _mm_loadu_si128((__m128i_u *)str);
const __m128i lower_bound = _mm_setr_epi8(
48, 48, 48, 48, // Year; 48 = ASCII '0'
45, // ASCII '-'
48, 48, // Month
45, // ASCII '-'
48, 48, // Day
84, // ASCII 'T'
48, 48, // Hour
58, // ASCII ':'
48, 48 // Minute
);
const __m128i upper_bound = _mm_setr_epi8(
57, 57, 57, 57, // Year; 57 = ASCII '9'
45, // ASCII '-'
57, 57, // Month
45, // ASCII '-'
57, 57, // Day
84, // ASCII 'T'
57, 57, // Hour
58, // ASCII ':'
57, 57 // Minute
);
const __m128i all_ones = _mm_set1_epi8(-1); // 128 bits of 1s
const __m128i too_low = _mm_cmpgt_epi8(lower_bound, characters);
const __m128i too_high = _mm_cmpgt_epi8(characters, upper_bound);
const __m128i out_of_bounds = _mm_or_si128(too_low, too_high);
const int within_range = _mm_test_all_zeros(out_of_bounds, all_ones);
```

`within_range`

will be `1`

if and only if all characters are within their bounds.
Note that this doesn’t check that the values are valid dates, just that the separators are correct and the rest are digits.

Now that we’ve got that out of the way, it’s time to do the actual parse. But what’s the fastest way to parse ASCII digits into integers?

### Parsing ASCII digits into integers

During his talk, Daniel Lemire presents a slide that parses 8 ASCII digit characters into an integer. At first it seems wildly magical:

```
uint32_t parse_eight_digits_unrolled(const char *chars) {
uint64_t val;
memcpy(&val, chars, sizeof(uint64_t));
val = (val & 0x0F0F0F0F0F0F0F0F) * 2561 >> 8;
val = (val & 0x00FF00FF00FF00FF) * 6553601 >> 16;
return (val & 0x0000FFFF0000FFFF) * 42949672960001 >> 32);
}
```

Looking for the equivalent code in the `simdjson`

codebase, I found a comment that mentioned that they got the idea from this blog post by Johnny Lee (who in turn got it from this Russian coding site). After bashing my head against it for a while (my bit manipulation skills are rusty), I came to understand this line:

`(val & 0x0F0F0F0F0F0F0F0F) * 2561 >> 8;`

I promise you that this turns the pairs of adjacent ASCII digits in `val`

into 8-bit integers. Let’s break it down.

Let’s start with the ASCII string `12345678`

which is `31 32 33 34 35 36 37 38`

in hexadecimal. On a little endian machine, the bytes will be reversed in memory:

`38 37 36 35 34 33 32 31`

The bitwise-AND with `0x0F0F0F0F0F0F0F0F`

turns ASCII digits in 8-bit integers. This is a magic constant that masks the top half of the digits, relying on the fact that all the ASCII digits are less than `0x0F`

. We could equivalently (and more legibly) subtract ASCII ‘0’ (`0x30`

) from each byte instead.

`08 07 06 05 04 03 02 01`

Now for the multiplication. Multiplying by the magic constant 2561 is better understood as `2561x = (256x * 10) + 1x`

:

Multiplying by 256 (since 256 is 2^8) is equivalent to bit shifting the number 8 bits to the left (i.e., `<< 8`

).
When compared to the original string, this is similar to moving the “tens” digit over top of the “ones” digit:

```
07 06 05 04 03 02 01 00 // 256x
08 07 06 05 04 03 02 01 // x
```

Multiply the number by 10 makes the “tens” integers into their actual values:

```
46 3C 32 28 1E 14 0A 00 // In decimal: 70 60 50 40 30 20 10 00
```

Then adding the additional 1x has the effect of adding the “ones” integers to the “tens” integers:

```
46 3C 32 28 1E 14 0A 00 // 2560x
08 07 06 05 04 03 02 01 // x
4E 43 38 2D 22 17 0C 01 // 2560x + x = 2561x
```

Which when interpreted in decimal, gives us: `78 67 56 45 34 23 12 01`

. Note that each byte is the result of adjacent pairs of digits parsed as integers.

Swapping it back to big-endian and masking every other byte makes it more obvious: `__ 12 __ 34 __ 56 __ 78`

.

So thats how we come to the magic constant of 2561. The single multiplication operation is a way to implicitly have a bit shift by 8, a multiplication by 10, and an addition. Neat!

Finally, they perform a bit shift in the other direction (`>> 8`

) to get the bytes into the right spots for the next steps of their algorithm.
The next lines of their algorithm are remixes on this line, the second shifting and combining sets of 4 digits, and the third combining all 8 digits.

### Vectorized parse

Now that we understand the magic constants, let’s adapt them for our use case.
We’re parsing 16 bytes, not 8, so our mask needs to be twice as long: `0x0F0F0F0F0F0F0F0F0F0F0F0F0F0F0F0F`

.
We also have separator characters, not just digits, so we should change the mask to zero those out: `0x0F0F0F0F000F0F000F0F000F0F000F0F`

.

Unlike the example code where they were combining 8 digits, we only need to combining pairs of digits, so we only have to do the `* 2561`

line and we don’t have to do the bit shift back.

```
const __m128i ascii_digit_mask = _mm_setr_epi8(15, 15, 15, 15, 0, 15, 15, 0, 15, 15, 0, 15, 15, 0, 15, 15); // 15 = 0x0F
const __m128i integers = _mm_and_si128(characters, ascii_digit_mask);
const __m128i parsed = integers * 2561;
char *result = (char *)parsed;
year = (result[1] * 100) + result[3];
month = result[6];
day = result[9];
hour = result[12];
minute = result[15];
```

Note that since year is 4 digits, we combined the two pairs of parsed digits with normal multiplication and addition. But all the other fields are immediately parsed and ready for use!

## Results

The performance results are noticeable.

In my non-vectorized implementation, it takes ~76.5 nanoseconds to parse the timestamp and initialize a Python datetime with the result.

In this vectorized implementation, it takes ~61.8 nanoseconds.

This is a significant improvement, especially given that the initialization of the Python datetime object is at least 55 nanoseconds of that time.

Previously our code was roughly 21 nanoseconds of the total, and now it’s roughly 7 nanoseconds, or at least a 3x speedup!

## Next steps

There are so many things that would need to be done to make this production ready:

- I’d need to be able to reliably detect when SSE instructions are available and fall back to a non-vectorized implementation when they are not.
- We lost a lot of detail in our error messages. We can no longer say which bytes were out of range for example, just that at least one byte is out of range. We may want to fallback and redo the parse with the non-vectorized version in case of validation errors, so we can have detailed messages in the error case.
- There are probably ways to improve this even further, especially if I start using the more modern AVX/AVX2/AVX-512 instruction sets.
- I’d probably want an ARM implementation as well.
- I’d need to overhaul my performance measurement scripts to be more precise. We’re now at the point where it would be hard to separate legitimate performance improvements/regressions from sampling noise.
- I might pull out this implementation into a stand-alone C library, and then use it as a dependency of
`ciso8601`

. This would allow others to benefit from this work.

It’s exciting to see what’s possible. Now the real work begins 🚀