When every nanosecond matters

By Anatoly 'Aen Sidhe' Popov

When I promised a new release of the tarantool driver at the end of July, I didn't realize how much work it would require, especially once you take into account changing jobs and other things in my personal life. Today I'm happy to announce the release of a new library, msgpack.spec. Its purpose is maximum performance and memory efficiency; in fact, the library should not use any additional memory at all. That goal was met: the library allocates nothing on write and rents some memory from a pool on read, but that memory is handed to the user, who releases it when it's no longer needed. The library is 2 times slower than the native ones. I wrote on my Facebook that it's 4 times slower, but I was wrong.

Let's look at the benchmark. It's quite simple: pack an array of one hundred big int32 numbers. We'll use the "high-level" method for that, so there will be some ifs and other overhead.

C# code:

private const ushort length = 100;
private const int baseInt = 1 << 30; // int, so baseInt - i stays an int for WriteInt32
private readonly byte[] _buffer = ArrayPool<byte>.Shared.Rent(short.MaxValue); // needs using System.Buffers;

[Benchmark]
public void MsgPackSpecArrayMinus()
{
    var buffer = _buffer.AsSpan();
    // array16 header: 1 code byte + 2-byte big-endian length
    var wroteSize = MsgPackSpec.WriteArray16Header(buffer, length);
    for (var i = 0; i < length; i++)
        wroteSize += MsgPackSpec.WriteInt32(buffer.Slice(wroteSize), baseInt - i);
}

C code, using msgpuck:

#include <msgpuck.h>
char buf[65535];
extern void serializeIntArrayMinus()
{
    const uint32_t size = 100;
    const int64_t base = 1L << 30;
    char *w = buf;
    w = mp_encode_array(w, size);
    int64_t idx = 0;
    for (; idx < size; ++idx)
        w = mp_encode_uint(w, base-idx);
}

C++ code, using msgpack-c:

#include <msgpack.hpp>
using namespace msgpack;
extern "C" void serializeIntArrayMinus()
{
    const uint32_t size = 100;
    const int64_t base = 1L << 30;
    sbuffer buffer;
    packer<sbuffer> pk(&buffer);
    pk.pack_array(size);
    int64_t idx = 0;
    for (; idx < size; ++idx)
        pk.pack(base-idx);
}

You can see the results below. Empty is a benchmark of a bare P/Invoke call; subtracting its ~10 ns of call overhead means the native code does its actual work in 82-84 ns.
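The body of the Empty benchmark isn't shown in this post; roughly, it's just a P/Invoke call into a native function that does nothing. A minimal sketch — the library name and entry point here are illustrative, not the post's actual code:

// needs using System.Runtime.InteropServices;
[DllImport("nativebench")] // assumed name of the compiled native benchmark library
private static extern void empty();

[Benchmark]
public void Empty() => empty();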

                Method |      Mean |     Error |    StdDev |        Q3 | Scaled | ScaledSD | Allocated |
---------------------- |----------:|----------:|----------:|----------:|-------:|---------:|----------:|
 MsgPackSpecArrayMinus | 342.31 ns | 3.2070 ns | 2.8429 ns | 343.53 ns |   3.62 |     0.11 |       0 B |
           CArrayMinus |  92.95 ns | 1.1212 ns | 0.9363 ns |  93.77 ns |   0.98 |     0.03 |       0 B |
                 Empty |  10.15 ns | 0.2168 ns | 0.2028 ns |  10.33 ns |   0.11 |     0.00 |       0 B |
         CppArrayMinus |  94.17 ns | 2.0049 ns | 3.0009 ns |  95.35 ns |   1.00 |     0.04 |       0 B |

Frankly speaking, I was disappointed. Twice as slow would be acceptable, but 4 times? @EgorBo told me that Span<T> is not free and maybe we're paying for safety here, and that if we use unsafe code and pointers we will be as fast as native code. OK, let's rewrite it with pointers:

[Benchmark]
public unsafe void Pointer()
{
    fixed (byte* pointer = &_buffer[0]) // bounds check, pinning of pointer
    {
        // array16 header: 1 code byte + 2-byte length (written in native byte order here)
        pointer[0] = DataCodes.Array16;
        Unsafe.WriteUnaligned(ref pointer[1], length); // Unsafe is from System.Runtime.CompilerServices
        for (var i = 0u; i < length; i++)
        {
            // each element: 1 code byte + 4-byte payload
            pointer[3 + 5 * i] = DataCodes.UInt32;
            Unsafe.WriteUnaligned(ref pointer[3 + 5 * i + 1], baseInt);
        }
    }
}

The results are promising: we're slower by only 25% (native takes ~80 ns, we take ~100 ns). That can be explained by the bounds check while acquiring the pointer and by pinning the array's address so that the GC will not move it around during heap compaction. Not that heap compaction is likely to happen in our case, but the GC and the runtime don't know that, so they pin every time.
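(An aside, and my assumption rather than the post's code: taking fixed over the array itself instead of over &_buffer[0] removes the explicit element access, leaving only a null/empty-array check, while the pinning cost stays the same.)

// hypothetical variant: fix the array, not its first element
fixed (byte* pointer = _buffer)
{
    // ... same writes as in Pointer() above ...
}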

                Method |      Mean |     Error |    StdDev |        Q3 | Scaled | ScaledSD | Allocated |
---------------------- |----------:|----------:|----------:|----------:|-------:|---------:|----------:|
 MsgPackSpecArrayMinus | 342.31 ns | 3.2070 ns | 2.8429 ns | 343.53 ns |   3.62 |     0.11 |       0 B |
               Pointer |  99.37 ns | 1.1726 ns | 1.0969 ns | 100.32 ns |   1.05 |     0.03 |       0 B |

Then I noticed that I'm writing the ints as-is. But my CPU is made by Intel, so it uses little-endian byte order, while msgpack is big-endian. So maybe it's the byte reversal that is so slow?
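The PointerBigEndian body isn't shown in the post; presumably it's the Pointer benchmark with each value byte-reversed before the unaligned write. A sketch under that assumption (ReverseEndianness comes from System.Buffers.Binary):

[Benchmark]
public unsafe void PointerBigEndian()
{
    fixed (byte* pointer = &_buffer[0])
    {
        pointer[0] = DataCodes.Array16;
        // msgpack is big-endian, x86 is little-endian: reverse before writing
        Unsafe.WriteUnaligned(ref pointer[1], BinaryPrimitives.ReverseEndianness(length));
        for (var i = 0u; i < length; i++)
        {
            pointer[3 + 5 * i] = DataCodes.UInt32;
            Unsafe.WriteUnaligned(ref pointer[3 + 5 * i + 1], BinaryPrimitives.ReverseEndianness(baseInt));
        }
    }
}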

                Method |      Mean |     Error |    StdDev |        Q3 | Scaled | ScaledSD | Allocated |
---------------------- |----------:|----------:|----------:|----------:|-------:|---------:|----------:|
 MsgPackSpecArrayMinus | 342.31 ns | 3.2070 ns | 2.8429 ns | 343.53 ns |   3.62 |     0.11 |       0 B |
      PointerBigEndian | 103.97 ns | 1.0718 ns | 1.0026 ns | 104.85 ns |   1.13 |     0.03 |       0 B |

No, not that slow. On to the next hypothesis: Span<T> is expensive, or the cost is in slicing without setting the length. Replace all the pointers with spans!
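Again, a sketch under my assumptions of what the span variant looks like — the same writes routed through Span<T> indexing instead of raw pointers:

[Benchmark]
public void SpanBigEndian()
{
    var buffer = _buffer.AsSpan();
    buffer[0] = DataCodes.Array16;
    Unsafe.WriteUnaligned(ref buffer[1], BinaryPrimitives.ReverseEndianness(length));
    for (var i = 0; i < length; i++)
    {
        buffer[3 + 5 * i] = DataCodes.UInt32;
        Unsafe.WriteUnaligned(ref buffer[3 + 5 * i + 1], BinaryPrimitives.ReverseEndianness(baseInt));
    }
}

The SpanLengthBigEndian variant presumably differs only in slicing the span to the exact output size up front, e.g. _buffer.AsSpan(0, 3 + 5 * length).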

                Method |      Mean |     Error |    StdDev |        Q3 | Scaled | ScaledSD | Allocated |
---------------------- |----------:|----------:|----------:|----------:|-------:|---------:|----------:|
 MsgPackSpecArrayMinus | 342.31 ns | 3.2070 ns | 2.8429 ns | 343.53 ns |   3.62 |     0.11 |       0 B |
         SpanBigEndian | 155.05 ns | 2.3034 ns | 2.1546 ns | 157.00 ns |   1.64 |     0.05 |       0 B |
   SpanLengthBigEndian | 153.94 ns | 1.6384 ns | 1.5326 ns | 155.42 ns |   1.63 |     0.05 |       0 B |
      PointerBigEndian | 103.97 ns | 1.0718 ns | 1.0026 ns | 104.85 ns |   1.13 |     0.03 |       0 B |

Hm, yeah. Span<T> is expensive. We're slower by 100% than the native code and by 50% than the C#-with-pointers code. Setting the span's length seems to help a bit; later I'll set it everywhere. Now let's go back to packing via the high-level methods of the BinaryPrimitives class.
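Presumably SpanBigEndianBinaryPrimitive replaces the manual reverse-and-write pairs with the bounds-checked big-endian writers from System.Buffers.Binary.BinaryPrimitives. A sketch under the same assumed layout:

[Benchmark]
public void SpanBigEndianBinaryPrimitive()
{
    var buffer = _buffer.AsSpan();
    buffer[0] = DataCodes.Array16;
    BinaryPrimitives.WriteUInt16BigEndian(buffer.Slice(1), length);
    for (var i = 0; i < length; i++)
    {
        buffer[3 + 5 * i] = DataCodes.UInt32;
        BinaryPrimitives.WriteInt32BigEndian(buffer.Slice(3 + 5 * i + 1), baseInt);
    }
}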

                       Method |      Mean |     Error |    StdDev |        Q3 | Scaled | ScaledSD | Allocated |
----------------------------- |----------:|----------:|----------:|----------:|-------:|---------:|----------:|
        MsgPackSpecArrayMinus | 342.31 ns | 3.2070 ns | 2.8429 ns | 343.53 ns |   3.62 |     0.11 |       0 B |
          SpanLengthBigEndian | 153.94 ns | 1.6384 ns | 1.5326 ns | 155.42 ns |   1.63 |     0.05 |       0 B |
 SpanBigEndianBinaryPrimitive | 162.62 ns | 2.3434 ns | 2.1920 ns | 164.92 ns |   1.72 |     0.06 |       0 B |

More high-level code brings more checks; that costs us about 10 ns. Still, some 200 ns remain unaccounted for. And then I saw that in the pointer code I serialize baseInt, but in the baseline code it's baseInt - i. Maybe that affects the timing? Let's find out.
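MsgPackSpecArray should be just MsgPackSpecArrayMinus with the subtraction dropped, so every element is the same constant:

[Benchmark]
public void MsgPackSpecArray()
{
    var buffer = _buffer.AsSpan();
    var wroteSize = MsgPackSpec.WriteArray16Header(buffer, length);
    for (var i = 0; i < length; i++)
        wroteSize += MsgPackSpec.WriteInt32(buffer.Slice(wroteSize), baseInt); // no "- i"
}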

                Method |      Mean |     Error |    StdDev |        Q3 | Scaled | ScaledSD | Allocated |
---------------------- |----------:|----------:|----------:|----------:|-------:|---------:|----------:|
 MsgPackSpecArrayMinus | 342.31 ns | 3.2070 ns | 2.8429 ns | 343.53 ns |   3.62 |     0.11 |       0 B |
      MsgPackSpecArray | 164.12 ns | 3.2278 ns | 3.9640 ns | 166.59 ns |   1.74 |     0.07 |       0 B |

Woohoo. Does the subtraction affect the native code in the same way?

                       Method |      Mean |     Error |    StdDev |        Q3 | Scaled | ScaledSD | Allocated |
----------------------------- |----------:|----------:|----------:|----------:|-------:|---------:|----------:|
        MsgPackSpecArrayMinus | 342.31 ns | 3.2070 ns | 2.8429 ns | 343.53 ns |   3.62 |     0.11 |       0 B |
             MsgPackSpecArray | 164.12 ns | 3.2278 ns | 3.9640 ns | 166.59 ns |   1.74 |     0.07 |       0 B |
                       CArray |  94.60 ns | 1.9332 ns | 2.9523 ns |  96.06 ns |   1.00 |     0.00 |       0 B |
                  CArrayMinus |  92.95 ns | 1.1212 ns | 0.9363 ns |  93.77 ns |   0.98 |     0.03 |       0 B |
                     CppArray |  84.76 ns | 1.3228 ns | 1.2374 ns |  85.66 ns |   0.90 |     0.03 |       0 B |
                CppArrayMinus |  94.17 ns | 2.0049 ns | 3.0009 ns |  95.35 ns |   1.00 |     0.04 |       0 B |

It seems that it doesn't. More in part 2.
