
Latency: Long Polling vs Forever Frame

by Greg Wilkins, December 18th, 2007

One of the oft-cited advantages of forever frame over long polling is that it does not suffer from the 3x max latency issue. This arises when an event occurs the instant after a long poll response is sent to a client: the event must wait for that response to arrive and for the subsequent long poll request before a response containing the event can be sent. Thus while the average latency of long polling is very good, the theoretical max latency is 3x the average latency, where the average is simply the time taken to transit the network one way.
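
To make that re-poll window concrete, here is a minimal sketch of the client side of long polling, in TypeScript; the /events endpoint and JSON payload are assumptions for illustration, not from any particular implementation:

```typescript
// A minimal long-polling client loop (illustrative sketch).
// The worst case arises when an event occurs just after the server
// flushes a response: it must wait one network transit for that
// response to arrive, one for the next poll request, and one for the
// new response carrying the event: 3x the one-way transit time.
async function longPollLoop(handle: (events: unknown[]) => void): Promise<void> {
  while (true) {
    // The server holds this request open until events are available.
    const res = await fetch("/events");
    handle(await res.json());
    // An event arriving between the server's flush above and the next
    // fetch below falls into the "re-poll window" described in the post.
  }
}
```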

Forever frame is said not to suffer from this issue, as it can push data to the client at any time, even the instant after a previous event has been sent. Strictly speaking, that is not always the case, as forever frame implementations also need to terminate responses and issue new requests, at the very least to prevent memory leaks on the client. But for the purposes of this musing, let’s assume that it is true.
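
For contrast, a rough sketch of the server side of forever frame, assuming Node's http module as a stand-in for any comet server; the parent.onEvent callback and the event payload are hypothetical. The response is never ended, and each event is written as a <script> block that a hidden iframe executes as it arrives:

```typescript
// Sketch of a forever-frame response: one HTTP response that is
// written to indefinitely, rather than completed per event.
import * as http from "http";

http.createServer((req, res) => {
  res.writeHead(200, { "Content-Type": "text/html" });
  res.write("<html><body>"); // open the never-ending document

  const timer = setInterval(() => {
    // Each chunk invokes a callback defined by the parent page.
    const event = JSON.stringify({ t: Date.now() }); // hypothetical payload
    res.write(`<script>parent.onEvent(${event});</script>`);
  }, 1000);

  // As noted above, real implementations must eventually terminate the
  // response and re-request, e.g. to bound memory use on the client.
  req.on("close", () => clearInterval(timer));
}).listen(8080);
```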

Does this theoretical lowering of the maximum latency actually enable any applications to be developed that would be impossible with the 3x max latency? For example, could forever frame be used to implement a first-person shooter that would be unplayable with long polling occasionally injecting 3x latency (normally just as you charge into a room full of enemy guns, lagging…)?

Unfortunately, I think not. The problem is that comet will never be suitable for any application that cannot accept a bit of jitter in the application latency. Comet can achieve great average latency, often <100ms over the internet, but it is always going to suffer from the possibility of an occasional long delay.

The reason is that TCP/IP is, by definition, the transport that any comet implementation adhering to open standards will use, and TCP/IP is simply not a protocol that can guarantee constant low latency. Like long polling, TCP/IP gives very good average latency, but all it takes is one dropped packet and you incur a TCP/IP timeout and resend, which by definition costs at least 3x the network traversal time (the sender must wait at least 2x the network time before deciding that the ack will never come, then it must resend). Sure, TCP/IP has lots of tricks and optimizations designed to help with latency for missed packets (e.g. fast retransmit, piggybacked acks), but they rely on other traffic being sent in order to quickly detect the dropped packet. If a lone event is sent in a single packet, then at least 3x latency will result. One could even argue that the client’s need to send a new poll request with long polling provides a convenient data packet on which an ack can piggyback, and could improve latency in some situations.
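
As a back-of-envelope check on that lower bound (a sketch; the 50ms transit figure is an assumption for illustration):

```typescript
// Minimum cost of losing a lone packet, per the reasoning above.
const transit = 50;              // assumed one-way network time, ms
const waitForAck = 2 * transit;  // sender's minimum wait before giving up on the ack
const retransmit = transit;      // the resent packet's journey
console.log(`lone-packet loss costs at least ${waitForAck + retransmit}ms`); // 3x transit
```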

So any application that cannot tolerate 3x max latency is an application that should not be considered for comet. Comet is ideal for applications that thrive on good average latency, but that can tolerate the odd delay. For such applications, long polling is a good match and the theoretical latency gains of forever frame are probably just that—theoretical.


10 Responses to “Latency: Long Polling vs Forever Frame”

  1. Ben Scherrey Says:

    I agree that, with a request/response protocol like http, event-based turnaround times are only going to get so good, and comet is probably taking it as far as it can go. So, accepting your premise that most apps will rarely ever benefit from the theoretical improvement in latency, is there any downside in selecting ‘forever frame’ over ‘long polling’? Long polling just doesn’t smell as architecturally elegant a solution, although the nature of http makes any comet implementation a bit of a hack, so “elegance” is clearly a relative concept here. Thanx for the informative post!

  2. Martin Tyler Says:

    I’m afraid I disagree with this. It’s true that if absolute guaranteed low latency is needed then comet, and indeed the Internet, is not for you. However, a full streaming connection reduces this jitter and reduces the bandwidth considerably.

    I have had a customer doing extensive monitoring of latency and compiling stats. He complained that he got the occasional blip in latency; in this case it was roughly 1 in every hundred thousand updates. It turned out in the end to be a backup cron job on their servers. Anyway, a long polling solution would simply not be acceptable to him or to most of our customers.

    Are you suggesting that applications that want the lowest latency possible have two options: 1. Give up, or 2. Put up with more latency than necessary?

    The gains of a streaming connection are in no way theoretical.

  3. GregWilkins Says:

    Ben,

    there are indeed downsides to using forever frame.

    Firstly, it is not strictly legal HTTP. A proxy is entitled to buffer or cache the entire response before forwarding it to a client. A client is entitled to buffer the entire response before acting on any part of it. Forever frame only works because implementations, NOT specifications, allow it to.

    Also, the server must allocate output buffers while sending forever frame responses. This can be a large memory commitment that long polling does not have to make: long polling needs to allocate resources for a response only when there is a response to send.

  4. GregWilkins Says:

    Martin,

    I hear what you are saying: i.e. that achieved latency can be very, very good.

    But I don’t see why long polling would not also be acceptable on such networks. The blips that long polling introduces occur ONLY if two events come close enough together to fall into the re-poll window, but not so close together that they end up in the same response anyway.

    So many, many load profiles would not suffer from any blip at all. Those that do will get a blip of on average only 2x the network traversal, i.e. probably < 100ms later than it would otherwise have been delivered. What applications can’t accept an occasional 100ms blip in delivery latency, but can accept the potential 1000ms latency of a TCP/IP timeout?

  5. Martin Tyler Says:

    “What applications can’t accept an occasional 100ms blip in delivery latency, but can accept the potential 1000ms latency of a TCP/IP timeout?”

    For high frequency data in financial applications the long polling blip is more than a blip; it will occur very often. Blips caused by the network are something you have to accept if you use the Internet, but you don’t have to accept long polling unless a proxy is not happy with streaming (which is rare).

    However, even with high frequency data, or rather even more so with high frequency data, you might want to batch updates together which can help with CPU and network. Long polling enforces this since you will obviously be batching updates while you wait for the next request to hit. This is fine if your desired batch time is greater than or equal to the round trip latency. In my opinion it probably is, but unfortunately our customers would not agree and like to configure much lower batch times to get the lowest possible latency.

    This aside, the extra bandwidth of long polling is a killer for some applications.

    Interestingly, one prospective customer a few years ago was very concerned with blips in latency over the Internet, and we developed a proof of concept ‘dual channel’ system, where each channel took a different network route to the server and the application chose the message that got there first for each sequence number. We did some analysis and found there weren’t that many blips at all. The dual channel side of things worked, but since there weren’t that many blips, the overhead of setting up routes, ISPs, etc. was not worth it.

  6. GregWilkins Says:

    Martin,

    I completely agree. Specifically, the natural batching of high frequency data changes is probably close to what you would want for efficiency anyway.

    The applications you appear to be describing don’t sound like user interfaces, but data transfer between financial applications. These are the type of application on which the regular blips of long polling could have a systematic effect - and thus not something that long polling is suitable for. I’m not advocating comet long polling for arbitrary data transfer, but for the last mile to a UI running in a web browser. With UIs it is often the case that after displaying an event (e.g. a price change), it is pointless displaying another until the human eye has perceived the change.

    If a price was changing 20 times per second, you could not display each price change as the result would literally be a blur. Far better to display the price 5 times a second, and give the eye a chance to see something. This display latency occurs naturally with long polling and thus it is a good fit for UIs.
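
    As a sketch of that idea (the render() hook and the Price shape are hypothetical, not from any implementation discussed here), incoming updates can simply overwrite the latest value while the UI repaints at roughly 5 times a second:

    ```typescript
    // Coalesce rapid price updates and repaint at ~5Hz.
    type Price = { symbol: string; value: number };

    const latest = new Map<string, number>();

    // Incoming updates just overwrite the newest value; no rendering yet.
    function onPrice(p: Price): void {
      latest.set(p.symbol, p.value);
    }

    // Render at most 5 times per second, showing only the newest price of
    // each instrument. Intermediate changes are deliberately dropped.
    setInterval(() => {
      for (const [symbol, value] of latest) {
        render(symbol, value); // hypothetical UI hook
      }
      latest.clear();
    }, 200);

    declare function render(symbol: string, value: number): void;
    ```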

    For non UI applications or UIs that themselves contain logic that consumes data streams, then I agree that long polling can introduce significant issues.

    But I would also caution against relying on the usually reliable latency of TCP/IP. It is the case that packets are only infrequently dropped, but it remains true that packets are mostly carried on infrastructure that relies on dropping packets when there is congestion. Thus the moment the stock market melts down, or there is some international financial crisis, is probably precisely the time that congestion will happen and TCP/IP will introduce some huge speed humps.

    cheers

  7. Martin Tyler Says:

    Greg,

    They are user interfaces. A single item updating 20 times a second is not necessary, you are right, but you’d be surprised how many different instruments people view on a single screen, so 20 times a second can be very common for the whole UI.

  8. Jörn Zaefferer Says:

    Could someone provide a few more details about long-polling and batching? The only time I heard something about that was when DWR’s batching features were mentioned.

  9. GregWilkins Says:

    Jörn,

    I will try to write more about it, but briefly: batching is a good technique for a latency vs throughput trade off. Most software can be more efficient if it processes jobs/messages/tasks in batches, as it can do so uninterrupted, with a suitable level of parallelism. Throughput is improved by batching, but latency increases, as some jobs/messages/tasks need to wait longer than others.

    Long polling has natural batching. Messages queue until a poll is woken up to deliver them to the client. If the server is busy or the long poll is currently with the client, then the wakeup can take longer and more messages queue. But this gives a natural batch and reduces comms and CPU overhead for the server. So when idle, batches are small, and when busy, batches get bigger. It is a great automatic latency vs throughput trade off… so long as the latency never grows larger than what is acceptable for your webapp.
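
    A minimal sketch of that natural batching on the server side; it assumes a single client, and publish() and onLongPoll() are hypothetical names, not from any comet server’s API:

    ```typescript
    // Messages queue while no poll is parked; the next poll drains the
    // whole queue, so batches grow automatically under load.
    const queue: string[] = [];
    let parkedPoll: ((batch: string[]) => void) | null = null;

    function publish(msg: string): void {
      queue.push(msg);
      if (parkedPoll) {
        parkedPoll(queue.splice(0)); // a poll is waiting: deliver at once
        parkedPoll = null;
      }
      // Otherwise the message waits, batching with whatever arrives next.
    }

    function onLongPoll(deliver: (batch: string[]) => void): void {
      if (queue.length > 0) {
        deliver(queue.splice(0)); // drain everything queued since the last poll
      } else {
        parkedPoll = deliver; // hold the response open until publish()
      }
    }
    ```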

    Take Martin’s example of a UI with lots of separate instruments on the screen (something I have been doing for a client just this month). If there are 20 price changes per second, then there are two extremes:

    0) Send 1 price update every 50ms. This puts maximal asynchronous load on your server, network and client, but has the best latency for getting prices in front of the user’s eyes. You would need a streaming transport to implement this.

    1) Send 20 price updates every 1000ms. This is a batch that will save considerable server and network resources, but the UI experience will be a bit more jerky and some prices may be delayed by up to a second. That can be an issue, as some traders still try to match their reflexes against computer trading :-)

    I would maintain that long polling can give a happy medium between these two. If the network ping time is 150ms (pretty slow for these networks), then the network transit time is 75ms. Give 25ms for a poll turnaround, and that means you can do 10 long poll cycles per second. Each long poll would return on average 2 prices, and the max additional latency caused by long polling would be 150ms, but on average it would be much less than that. I can imagine traders complaining about prices being 1s late, but I can’t imagine complaints at 150ms!

    In fact, I have been asked to increase the latency and reduce the request rate. Setting a poll interval of 200ms on our Bayeux implementation means that after a long poll returns, the next is not sent for 200ms. For a busy system this tends to give 5 long poll cycles per second with on average 4 prices per poll. The request rate has been halved and the max additional latency has grown to 350ms (average ~175ms). The UI is still very responsive without the clunk clunk clunk feel of changes every second, but server resources are controlled.
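
    The arithmetic of the last two paragraphs, written out as a sketch using only the figures quoted above (the cycle model follows the same accounting: one network transit plus the poll turnaround):

    ```typescript
    const ping = 150;             // round-trip time, ms
    const transit = ping / 2;     // 75ms one-way
    const turnaround = 25;        // poll turnaround, ms
    const updatesPerSecond = 20;  // price changes per second

    const cyclesPerSecond = 1000 / (transit + turnaround);    // 10
    const pricesPerPoll = updatesPerSecond / cyclesPerSecond; // 2
    const maxExtraLatency = 2 * transit;                      // 150ms

    // With the 200ms poll interval configured on the Bayeux implementation:
    const interval = 200;
    const throttledCycles = 1000 / interval;                    // ~5 when busy
    const throttledPrices = updatesPerSecond / throttledCycles; // 4
    const throttledMaxExtra = maxExtraLatency + interval;       // 350ms (average ~175ms)

    console.log({ cyclesPerSecond, pricesPerPoll, maxExtraLatency,
                  throttledCycles, throttledPrices, throttledMaxExtra });
    ```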

    Of course there is still the issue of imagined latency. If you ask a trader: “is it OK to delay the price update by N ms?”, the answer will probably be NO! no matter what the value of N is. Some clients will just want the lowest possible latency at any cost, and for them streaming is the solution. But long polling gives acceptable and scalable results for most reasonable expectations.

  10. Alessandro Alinone Says:

    Batching is actually very useful in streaming mode too, because it helps to create larger TCP packets, reducing both CPU usage and network resource consumption. So batching is one of the key elements in a classic trade-off of real-time systems: latency vs. scalability (that is, you can scale to more users if you accept higher latencies).

    For example, with Lightstreamer the maximum batching time can be configured, and it applies to both streaming and long polling. That configured time can be dynamically increased by the natural batching of long polling, as mentioned by Greg, or by network congestion in streaming mode.
