Three Variations on Asynchronous IO in Fredis.net

In an attempt to improve the performance of Fredis.net, and to bring it as close as possible to that of the Microsoft Open Tech version of Redis, I implemented three different versions of async message processing. Their performance differs widely. The three implementations are:

1. 100% async using F# computation expressions: all socket/stream reads and writes are async.

2. Hybrid ‘async at the borders’: the first socket read of an incoming message and the final write/flush of a reply are async, and all other reads and writes are synchronous.

3. 100% async using SocketAsyncEventArgs, adapted to work with F# Async computation expressions.

The graphs below show the number of requests per second Fredis.net can process for the PING_INLINE, PING_BULK, GET, SET, INCR and MSET commands, for each type of async IO. The number of clients ranges from 1 to 1024. The data was generated by redis-benchmark running on the same machine as Fredis.net.

 

Surprisingly, to me at least, the hybrid async/sync option was faster than fully asynchronous SocketAsyncEventArgs (except for PING_INLINE, which I think is a special case, as it requires only three read/write ops). I suspect that the first async read pulls in more bytes than were asked for, so subsequent synchronous reads are fast: the data is already available, and sync reads do not pay the costs of async. Similarly, sync writes may be buffered but not sent until an async flush triggers the socket write op. Because the code is async ‘at the borders’, no thread is blocked while waiting for an incoming client message.
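
The ‘async at the borders’ shape can be sketched roughly as below. This is a minimal illustration, not the Fredis.net code: the names readLine and handleOneMessage are hypothetical, the message is assumed to be a single CRLF-terminated line, the reply simply echoes it, and MemoryStreams stand in for the socket stream. It also assumes the rest of the message is already buffered once the first byte arrives; if it is not, the synchronous reads would block.

```fsharp
open System.IO
open System.Text

// Hypothetical helper: synchronously read up to and including CRLF,
// returning the text before it. firstByte has already been read.
let readLine (strm: Stream) (firstByte: byte) =
    let sb = StringBuilder()
    let rec loop (prev: int) (cur: int) =
        if cur < 0 then sb.ToString()                   // end of stream
        elif prev = 13 && cur = 10 then                 // found CRLF
            sb.Remove(sb.Length - 1, 1).ToString()      // drop the CR appended earlier
        else
            sb.Append(char cur) |> ignore
            loop cur (strm.ReadByte())
    loop -1 (int firstByte)

// Hypothetical handler for one message.
let handleOneMessage (input: Stream) (output: Stream) = async {
    let first = Array.zeroCreate<byte> 1
    // Border 1: wait for the start of a message without blocking a thread.
    let! n = input.AsyncRead(first, 0, 1)
    if n = 0 then
        return None                                     // client closed the connection
    else
        // Middle: plain synchronous reads and writes.
        let line = readLine input first.[0]
        let reply = Encoding.ASCII.GetBytes("+" + line + "\r\n")
        output.Write(reply, 0, reply.Length)
        // Border 2: the flush that actually sends the reply is async.
        do! output.FlushAsync() |> Async.AwaitTask
        return Some line }

// Usage: in-memory streams stand in for the network.
let input = new MemoryStream(Encoding.ASCII.GetBytes("PING\r\n"))
let output = new MemoryStream()
let result = handleOneMessage input output |> Async.RunSynchronously
let replyText = Encoding.ASCII.GetString(output.ToArray())
```

With a real socket, input and output would be the same stream, and the handler would be called in a loop for each incoming message.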

Async function calls do more work than the corresponding sync calls due to their thread-hopping, continuation-calling nature. To quantify async overhead* I used BenchmarkDotNet and wrote a simple program that compares Stream.AsyncRead, which returns an F# Async, with Stream.Read. I also benchmarked Stream.ReadAsync, which returns a TPL Task, and C# async/await because why not. This benchmark is not intended to measure the advantages of async IO; no IO is performed. An array of bytes was written to a MemoryStream, then the MemoryStream sync and async read functions were timed by BenchmarkDotNet. (The code is at the end of this article.)

F# benchmark results
Type=BenchmarkSyncVsAsync Mode=Throughput

    Method | ArraySize | Median     | StdDev    |
---------- |---------- |----------- |---------- |
 Read      | 256       | 3.3425 ns  | 0.0395 ns |
 AsyncRead | 256       | 15.6435 ns | 0.1872 ns |
 ReadAsync | 256       | 12.5468 ns | 0.2811 ns |
 Read      | 1024      | 3.3362 ns  | 0.0280 ns |
 AsyncRead | 1024      | 15.5780 ns | 0.1480 ns |
 ReadAsync | 1024      | 12.3052 ns | 0.0501 ns |
 Read      | 4096      | 3.3496 ns  | 0.0329 ns |
 AsyncRead | 4096      | 15.6718 ns | 0.1052 ns |
 ReadAsync | 4096      | 12.3786 ns | 0.0985 ns |
 Read      | 16384     | 3.3665 ns  | 0.0347 ns |
 AsyncRead | 16384     | 15.7203 ns | 0.5591 ns |
 ReadAsync | 16384     | 12.4000 ns | 0.3800 ns |
 Read      | 65536     | 3.3617 ns  | 0.0399 ns |
 AsyncRead | 65536     | 15.7170 ns | 0.1364 ns |
 ReadAsync | 65536     | 12.4067 ns | 0.0872 ns |


C# async/await stream read benchmark results
Type=CsBenchmarkAsyncAwait Mode=Throughput

Method        | ArraySize | Median     | StdDev    |
 ------------ |---------- |----------- |---------- |
 CsAsyncRead  | 256       | 13.3116 ns | 0.2075 ns |
 CsAsyncRead  | 1024      | 13.2593 ns | 1.6684 ns |
 CsAsyncRead  | 4096      | 13.2188 ns | 0.1407 ns |
 CsAsyncRead  | 16384     | 13.2381 ns | 0.1144 ns |
 CsAsyncRead  | 65536     | 13.2687 ns | 0.2302 ns |

 

The benchmark shows that sync reads are roughly 3.7x to 4.7x faster than async reads (about 3.3 ns versus 12.3 to 15.7 ns per call) for x64 applications, which might explain why the hybrid async/sync approach is faster.

Notes

BenchmarkDotNet system config output

 BenchmarkDotNet=v0.9.7.0
 OS=Microsoft Windows NT 6.2.9200.0
 Processor=Intel(R) Core(TM) i7-4960HQ CPU 2.60GHz, ProcessorCount=8
 Frequency=2533209 ticks, Resolution=394.7562 ns, Timer=TSC
 HostCLR=MS.NET 4.0.30319.42000, Arch=64-bit RELEASE [RyuJIT]
 JitModules=clrjit-v4.6.1080.0

 

F# Read vs AsyncRead vs ReadAsync benchmark code

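open System.IO
open BenchmarkDotNet.Attributes
open BenchmarkDotNet.Running

type BenchmarkSyncVsAsync () =

    let memStrm:MemoryStream = new MemoryStream()
    let mutable dst:byte array = null

    [<Params(256, 1024, 4096, 16384, 65536)>]
    member val public ArraySize = 0 with get, set

    // Write ArraySize random bytes to the stream and allocate the destination buffer.
    // The stream position is left at the end of the written data; it is not rewound.
    [<Setup>]
    member this.Setup () =
        let arr = Array.zeroCreate<byte> this.ArraySize
        let rnd = System.Random()
        rnd.NextBytes arr
        memStrm.Write(arr, 0, this.ArraySize)
        dst <- Array.zeroCreate<byte> this.ArraySize

    // Synchronous Stream.Read.
    [<Benchmark>]
    member this.Read () =
        memStrm.Read( dst, 0, this.ArraySize )

    // F# Stream.AsyncRead; the method returns the Async<int> workflow without starting it.
    [<Benchmark>]
    member this.AsyncRead () =
        async{
           return! memStrm.AsyncRead ( dst, 0, this.ArraySize )
        }

    // TPL Stream.ReadAsync; blocks until the Task completes.
    [<Benchmark>]
    member this.ReadAsync () =
        let tsk = memStrm.ReadAsync ( dst, 0, this.ArraySize )
        tsk.Wait()
        tsk.Result

[<EntryPoint>]
let Main args =
    BenchmarkRunner.Run<BenchmarkSyncVsAsync>() |> ignore
    0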

C# async/await benchmark code

using System;
using System.IO;
using System.Threading.Tasks;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class CsBenchmarkAsyncAwait
{
    [Params(256, 1024, 4096, 16384, 65536)]
    public int ArraySize { get; set; }

    private byte[] dst;
    private MemoryStream memStrm = new MemoryStream();

    // Write ArraySize random bytes to the stream and allocate the destination buffer.
    [Setup]
    public void Setup()
    {
        dst = new byte[ArraySize];
        var src = new byte[ArraySize];
        var rnd = new Random();
        rnd.NextBytes(src);
        memStrm.Write(src, 0, ArraySize);
    }

    // async/await wrapper around Stream.ReadAsync.
    private async Task<int> ReadAsync()
    {
        var tsk = memStrm.ReadAsync(dst, 0, ArraySize);
        var numBytes = await tsk;
        return numBytes;
    }

    // Calls Stream.ReadAsync directly and blocks until the Task completes.
    [Benchmark]
    public int CsAsyncRead()
    {
        var tsk = memStrm.ReadAsync(dst, 0, ArraySize);
        tsk.Wait();
        return tsk.Result;
    }
}

class Program
{
    static void Main(string[] args)
    {
        BenchmarkRunner.Run<CsBenchmarkAsyncAwait>();
    }
}

*Disclaimer: computation expressions and async IO are wonderful; saying they have a cost does not mean I am against their use.

 


Can a program written in a managed functional language approach the performance of a program written in C?

Performance tests say ‘maybe’.

Tests captured the number of GET, SET, INCR and MSET ‘requests per second’ vs ‘number of clients’. Tests were initiated using the ‘redis-benchmark’ utility running on the same machine as the instance of Redis/Fredis.net being tested. See ‘Testing setup’ below for further details.

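[Chart: GET and SET requests per second vs number of clients]

[Chart: MSET and INCR requests per second vs number of clients]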

Obviously Microsoft Redis trounces both OSX Redis (3.0.7) and Fredis.net. However, Fredis.net is in the same ballpark as OSX Redis, which is not insignificant when comparing a program written in a managed, functional language against one written in C with a reputation for speed. I need to be careful about what I say I’m comparing: these tests also compare IO completion ports on Windows against kqueue on OSX. Also, Fredis.net uses a simple .NET Dictionary to store string data (Redis strings are really byte arrays); Redis functionality not implemented by Fredis.net may require something more sophisticated than a simple Dictionary, which would come with a cost. What can be said is that you are not shooting yourself in the foot by choosing a managed functional language for performance-critical applications.

Fredis.net does consume more CPU than either Microsoft Redis or Redis on OSX. It also consumes much more memory, as can be seen when running the performance tests. As an experiment I added a GC.Collect call to Fredis.net’s FLUSHDB processing, after which memory use was comparable to either version of Redis. (I removed the GC.Collect after this experiment; manually triggering GC collections is usually the dumb thing to do.)

Future Fredis.net development

The next version of Fredis.net will, probably, use SocketAsyncEventArgs, which should improve Fredis.net performance, and maybe CPU and memory usage/GC pressure.

Testing setup

Tests were run on a late 2013 MacBook Pro with 16 GB of RAM and a Core i7-4960HQ CPU. Microsoft Redis and Fredis.net were run on Windows 10 via Boot Camp. OSX Redis was run on El Capitan. Both Microsoft and official Redis had persistence disabled.

A key space (the number of distinct keys) of 64 and a message size of 1K were chosen for no other reason than they were neither very small nor large. Tests were run for 1, 2, 4, 8 … 1024 clients.

Output from a script like the one below was piped to a text file, which was then processed by a small F# program that used FSharp.Charting to generate the graphs.

redis-cli flushdb
echo "redis-benchmark -r 64 -d 1024 -t ping,set,get,incr,mset -n 100000 -q -c 1"
redis-benchmark -r 64 -d 1024 -t ping,set,get,incr,mset -n 100000 -q -c 1

redis-cli flushdb
echo "redis-benchmark -r 64 -d 1024 -t ping,set,get,incr,mset -n 100000 -q -c 2"
redis-benchmark -r 64 -d 1024 -t ping,set,get,incr,mset -n 100000 -q -c 2
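
The parsing half of the post-processing step can be sketched as below. This is a guess at the shape of such a program, not the one actually used: the function name parse is hypothetical, the sample figures are made up for illustration, and it assumes the quiet-mode (-q) output format of one "NAME: value requests per second" line per test. The charting step is left out.

```fsharp
open System.Text.RegularExpressions

// The echoed command line tells us the client count for the results that follow.
let clientsRx = Regex(@"^\W*redis-benchmark .*-c (\d+)")
// Assumed -q output format: "SET: 52631.58 requests per second".
let resultRx = Regex(@"^(.+?): ([\d.]+) requests per second")

// Returns (clients, command, requestsPerSecond) rows in file order.
let parse (lines: seq<string>) =
    let folder (clients, acc) (line: string) =
        let c = clientsRx.Match(line)
        let r = resultRx.Match(line)
        if c.Success then
            (int c.Groups.[1].Value, acc)
        elif r.Success then
            (clients, (clients, r.Groups.[1].Value, float r.Groups.[2].Value) :: acc)
        else
            (clients, acc)                     // e.g. the "OK" printed by flushdb
    lines |> Seq.fold folder (0, []) |> snd |> List.rev

// Made-up sample input, for illustration only.
let sample =
    [ "OK"
      "redis-benchmark -r 64 -d 1024 -t ping,set,get,incr,mset -n 100000 -q -c 1"
      "SET: 52631.58 requests per second"
      "GET: 55555.56 requests per second"
      "OK"
      "redis-benchmark -r 64 -d 1024 -t ping,set,get,incr,mset -n 100000 -q -c 2"
      "SET: 71428.57 requests per second" ]

let rows = parse sample
```

Grouping rows by command then gives one series per command, with client count on the x axis and requests per second on the y axis.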


The state of Fredis.net as-of Feb 2016

All Redis string functions have been implemented except SETEX (set with expiry), specifically:

APPEND, BITCOUNT, BITOP, BITPOS, DECR, DECRBY, FLUSHDB, GET, GETBIT, GETRANGE, GETSET, INCR, INCRBY, INCRBYFLOAT, MGET, MSET, MSETNX, PING, SET, SETBIT, SETNX, SETRANGE, STRLEN

Fredis.net uses the same RESP protocol used by Redis, so tools such as redis-cli and redis-benchmark will work with Fredis.net.

Redis features not supported

  • transactions
  • multiple databases indexed by number
  • persistence
  • sharding
  • master/slave instances

Fredis.net uses F# async workflows, and therefore IO completion ports and the .NET thread pool, to convert RESP messages received from clients into Redis commands. Commands from different thread pool threads are multiplexed down to a single command-executing thread by sending them to an F# mailbox (MailboxProcessor). Replies are sent back to the client using only async socket calls. This is all vanilla F#; I did not need to go to extreme lengths to coax performance out of Fredis.net.
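
The mailbox multiplexing can be sketched as below. This is a minimal illustration under assumptions, not the Fredis.net code: the Command type and its two cases are hypothetical, and values are stored as strings rather than byte arrays to keep the example short.

```fsharp
open System.Collections.Generic

// Hypothetical command type; a real one would cover every supported command.
type Command =
    | SetCmd of key: string * value: string
    | GetCmd of key: string * reply: AsyncReplyChannel<string option>

// A single agent owns the dictionary. Messages are processed one at a time,
// so the dictionary needs no locking even though many threads post to it.
let agent =
    MailboxProcessor.Start(fun inbox ->
        let store = Dictionary<string, string>()
        let rec loop () = async {
            let! cmd = inbox.Receive()
            match cmd with
            | SetCmd (k, v) -> store.[k] <- v
            | GetCmd (k, reply) ->
                match store.TryGetValue k with
                | true, v -> reply.Reply(Some v)
                | _ -> reply.Reply None
            return! loop () }
        loop ())

// Simulate many clients posting commands from thread pool threads.
[ for i in 1 .. 100 -> async { agent.Post(SetCmd(sprintf "key%d" i, string i)) } ]
|> Async.Parallel
|> Async.RunSynchronously
|> ignore

// The mailbox is first-in first-out, so this GET is handled after the SETs above.
let value = agent.PostAndReply(fun ch -> GetCmd("key42", ch))
```

The trade-off is that command execution is serialised on one logical thread, which mirrors Redis's own single-threaded command loop.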

Client input can be received in either a partially or fully async manner. ‘Partially’ means that Fredis.net waits for new input on a Stream.ReadAsync call (so no thread is blocked), then makes subsequent reads synchronously until the current RESP message has been read. Partially async input message processing is more performant than fully async. This may be due to the multiple nested callbacks associated with many fine-grained async actions.

Client input is encoded in the Redis RESP protocol, some elements of which are length prefixed, while others are delimited by CRLF. Reading delimited RESP from streams is done one byte at a time while searching for the delimiter, which could be inefficient, so a BufferedStream is used to wrap the socket’s network stream.
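
The two kinds of read can be sketched for a RESP bulk string, which has a CRLF-delimited length followed by a length-prefixed payload. This is an illustration only: readDelimited, readExactly and readBulkString are hypothetical names, null bulk strings ("$-1") are not handled, and a MemoryStream stands in for the socket’s network stream.

```fsharp
open System.IO
open System.Text

// Delimited read: one byte at a time until CRLF, returning the text before it.
let readDelimited (strm: Stream) =
    let sb = StringBuilder()
    let rec loop () =
        match strm.ReadByte() with
        | -1 -> failwith "unexpected end of stream"
        | 13 ->                                       // CR must be followed by LF
            if strm.ReadByte() <> 10 then failwith "expected LF after CR"
            sb.ToString()
        | b ->
            sb.Append(char b) |> ignore
            loop ()
    loop ()

// Length-prefixed read: loop because Stream.Read may return fewer bytes than asked for.
let readExactly (strm: Stream) (count: int) =
    let buf = Array.zeroCreate<byte> count
    let rec loop offset =
        if offset < count then
            let n = strm.Read(buf, offset, count - offset)
            if n = 0 then failwith "unexpected end of stream"
            loop (offset + n)
    loop 0
    buf

// Bulk string layout: "$<length>\r\n<payload>\r\n".
let readBulkString (strm: Stream) =
    if strm.ReadByte() <> int '$' then failwith "expected '$'"
    let len = int (readDelimited strm)
    let payload = readExactly strm len
    readExactly strm 2 |> ignore                      // consume the trailing CRLF
    payload

let raw = new MemoryStream(Encoding.ASCII.GetBytes("$5\r\nhello\r\n"))
let buffered = new BufferedStream(raw)                // cheap ReadByte calls come from the buffer
let text = Encoding.ASCII.GetString(readBulkString buffered)
```

Wrapping the stream in a BufferedStream means most ReadByte calls are served from memory instead of each one reaching the underlying stream.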

Fredis.net is written in F# 4.0, and where possible in a functional style e.g. with algebraic data types instead of classes.

Fredis.net is a vehicle for scalability experiments (and not a replacement for Redis). Please feel free to point out any mistakes or things which could be improved.
