Recent discussions on Comet server performance have focused on sockets and IO:
- A Million-user Comet Application with Mochiweb, Part 1, Part 2 and most interestingly Part 3
- A Gazillion-user Comet Server with libevent
- And here on Comet Daily: Is it Raining Comets and Threads?
Liberator was designed with performance in mind from Day 1. In fact, before Liberator we had a win32 C++ server which started life as a proof of concept. Liberator came about when we realised that we needed better performance.
I have talked about Benchmarking Comet Servers before and described how there are many variables involved, the key ones being the number of users and the amount of data. However, different usage profiles can produce very different results.
A key part of the recent discussion has been low-level socket handling: in Java with IO or NIO, and in C (or Erlang with C underneath) using techniques supported by libevent. Liberator has an event abstraction similar to libevent, with different implementations all providing the typical read, write and timed events with callbacks. This event manager started life using select(), but quite quickly moved to poll(), which showed some improvement. The select() implementation was kept for the win32 build (the event manager is part of a library shared by Liberator and our other products, although Liberator itself no longer has a win32 build).
A few years ago we moved on to epoll on Linux and /dev/poll on Solaris. Both provide a more efficient API, as they allow the OS to hand you back a handle when an event occurs, as opposed to select() or poll(), where you generally have to loop through all your open sockets to find which one the event matches. This is clearly going to be better for high numbers of sockets.
One quirk we found during testing resulted in Liberator's default configuration using epoll for client-side sockets, where there can be high numbers, and standard poll for the server side (Data Sources), where there are typically only a few sockets.
Liberator runs an asynchronous event loop per thread, which means it can take advantage of multi-CPU or multi-core machines. There is a thread per server-side connection (of which there are typically only a few) and a configurable number of threads to handle clients. The right numbers depend on the usage profile, but typically you would have one client-handling thread per CPU core. With very high rates of server-side updates you may want fewer client-handling threads, as the server-side threads require more CPU time; the operating system can obviously cope either way, but fine tuning can improve performance. Multiple sources of data can be configured either for performance reasons, for example load balancing, or to partition data, for example across different back-end applications or data feeds.
Memory and Abstractions
One important area of a C application is memory management, and Liberator has evolved over the years in this area. In the beginning, internal and external message handling aimed for a zero-copy architecture, but testing made it clear that, in our case, the extra management burden was not a good trade-off.
There are also areas where memory is pre-allocated and cached on a per-thread basis, which can avoid the mutexes used by most malloc libraries. However, this has never actually been proven to provide any significant benefit under testing.
There are still various parts of Liberator that avoid unnecessary copying and unnecessary allocation of memory, which leads on to abstractions. As mentioned, Liberator was designed for performance, and this led to a design that does not hide too much functionality behind interfaces, giving direct access where beneficial. There are still internal APIs and abstractions, but working in those areas requires intimate knowledge of the code.
Protocol and Bandwidth
There are various protocols used between clients and Comet servers. I have posted about bandwidth before, as this is a critical area when considering server performance, especially with high numbers of clients. Liberator's protocol has always aimed to be as brief as possible.
This has been achieved in two areas specifically. Firstly, the subject of a subscription is not sent with every update; it is mapped at subscription time onto a shorter identifier, which can considerably decrease the size of update messages, especially in the common case where the payload is not particularly big. Secondly, within the payload itself, for simple name/value pair messages, the field names are also mapped onto shorter identifiers. This is not as simple for Comet servers that aim to let payloads be completely free-form, i.e. a JSON data type, but for common usage a streamlined name/value pair data type saves bandwidth. Some other Comet servers have implemented similar techniques.
Batching and Throttling
Two techniques that help out with performance are batching and throttling (or conflation).
Batching is fairly common in network programming. When lots of small messages are to be sent, batching them together into one write can be beneficial in a number of ways. Although you introduce latency by delaying the send of the first messages, overall latency can improve due to better use of TCP/IP. Operating systems actually do this for you, using Nagle's algorithm; however, application-level knowledge usually means improvements can be made when batching is implemented by the application itself.
Batching sends all of the same data, just packaged differently, but throttling reduces the data you actually send. Again, this is not really possible with completely free-form messages, but in Liberator the record data type is intended for use with financial market data. If the stock price of MSFT is updating 10 times a second, the client may not need to receive every one of those updates, so the updates are conflated: for example, every half a second the latest value of any field that has changed is sent out. For benchmarking purposes, or in scenarios where the client needs each individual update, throttling can simply be turned off. The time period can also be configured on a subject-space basis, so some data is treated differently from other data.
There are some basics of publish/subscribe servers that you have to get right, and I would hope everyone does. For example, when a message needs to be sent to subscribers, you don't want to loop through all the users to find which ones are interested; you should be able to traverse a list of subscribers directly. Over the years, various other data structures in Liberator have been changed as performance issues arose. When customers are let loose on a product they can do things you didn't really think they would, and this generally comes back to the 'variables' involved in benchmarking. A simple case of removing a subscription can be an expensive operation when the user has thousands of subscriptions if the right data structure is not used, but if you never test a scenario like that, you might not realise you have an inefficient operation.
This is something that Google is famously hot on: choosing the right algorithms and data structures means you can scale. Optimising specific areas of code can help too, but often yields insignificant gains once you scale things to large numbers.
There are many aspects to the performance of a Comet server; I have touched on a few of them and on how Liberator tackles them.
Richard Jones' million-user Comet experiment interested me a lot. Liberator has focused on smaller numbers of users, relative to Richard's goals, but with much higher update rates to each user. I would like to test Liberator with a million users, but that does require a fair amount of time and effort. Since the core basics of Richard's test are similar to Liberator, I would expect it to cope in a similar way.