A scientific theory says the oceans were created by millions of comets and asteroids colliding with the Earth many millions of years ago.
A long-polling application is very similar, like a rain of:
- … and Threads??
These requests arrive to stay in the server, like water in the ocean; but do these Comets contain threads?
Threads and ItsNat
I thought that, because this approach followed the standard, worked on any servlet container, and the threads would be asleep most of the time, there would be no problem with having so many sleeping threads!
My happiness was short-lived; I soon realized the ItsNat approach did not scale to many concurrent users, because too many threads were needed.
There is significant momentum, in people and products, behind asynchronous servlet containers and behind techniques that avoid threads in general: IceFaces offers a custom asynchronous servlet container, while GlassFish, Tomcat, and Jetty are pushing container solutions that promise very few threads servicing many clients. In my opinion, this is the reason NIO was invented: to avoid threads blocking while reading and writing streams. In the typical pattern of classic IO, if a thread is blocked when a new connection arrives, a new thread must be created or this connection will stall too; and of course thread synchronization is time consuming and hurts performance. Everybody seemed to know that NIO was superior to IO, in performance and scalability, because of threads.
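To make the classic pattern concrete, here is a minimal sketch of a thread-per-connection blocking echo server. The class name and the use of an ephemeral port are my own choices for illustration; this is not code from ItsNat or any container, just the generic pattern the paragraph describes.

```java
import java.io.*;
import java.net.*;

// Sketch of the classic thread-per-connection pattern: the accept loop
// blocks until a client arrives, then hands the socket to a new thread
// that blocks on read/write for that single client.
public class EchoServer {
    public static void main(String[] args) throws IOException {
        ServerSocket server = new ServerSocket(0); // ephemeral port for the demo
        System.out.println("listening on " + server.getLocalPort());
        while (true) {
            Socket client = server.accept();          // blocks waiting for a connection
            new Thread(() -> handle(client)).start(); // one thread per connection
        }
    }

    public static void handle(Socket client) {
        try (Socket s = client;
             BufferedReader in = new BufferedReader(new InputStreamReader(s.getInputStream()));
             PrintWriter out = new PrintWriter(s.getOutputStream(), true)) {
            String line;
            while ((line = in.readLine()) != null) { // blocks until data arrives
                out.println(line);                   // echo it back
            }
        } catch (IOException e) {
            // client went away; the thread simply ends
        }
    }
}
```

Note how the per-connection state (where we are in the conversation) lives implicitly in each thread's stack; this is exactly what makes the code simple, and what NIO-style designs must reproduce by hand.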
So at this point, while the ItsNat approach might satisfy the needs of many Comet applications (not every web site/application is Google or eBay), it failed with too many concurrent requests. Detecting and using the new servlet containers became a priority on my “TODO” list (in spite of current solutions not being standardized). For a while, I left this scalability problem alone and simply noted it in the ItsNat documentation about our long-polling Comet implementation.
As a frequent reader of javalobby.org, one day I found an article on avoiding NIO to get better throughput. Dan Dyer cited the blog and a paper by Paul Tyma, a software engineer at Google. In his research paper, Paul demolishes, with humor and with some tests, the well-established myths about the superior performance and scalability of NIO over IO. I must admit I was shocked by Paul’s findings, and soon realized their relevance to long-polling.
The origins of the NIO approach and why threads were “bad”
To understand Paul’s insights, first we need some background on thread programming. First of all, in any system, threads are basically pieces of memory: stacks and register state. There is no one-to-one correspondence between threads and hardware. Furthermore, any processor (or core) is essentially a single-thread processor; in other words, a very small number of running threads completely fulfills the “real” parallelism of any average computer. The parallelism of threading is an illusion, and thread management is basically a software problem with some necessary hardware support.
Managing tons of threads is not a hardware problem but a software one. For example, consider a single-core processor saturated with two threads doing non-blocking tasks. If on a single-real-thread processor one task needs X time to complete, Y threads doing the same need at least X*Y, and that is under ideal circumstances (where the threading cost is zero); but the sensation of parallelism is important and justifies threading.
The problem of many threads is that the performance cost is significant:
- Thread switching has a cost, and this may increase when more threads are added.
- Uncontended synchronization (calling a synchronized method that is not blocked) has a cost.
- Contended synchronization has a cost.
- The thread stack occupies memory.
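To get a feel for the first two costs in the list, here is a naive timing sketch, not a rigorous benchmark (no JIT warm-up control, no statistical rigor), comparing a plain counter with an uncontended synchronized one; the class and method names are illustrative.

```java
// Naive illustration of the cost of uncontended synchronization:
// a plain counter vs. a synchronized counter, both called from a
// single thread, so the lock is never contended.
public class SyncCost {
    private long plain;
    private long locked;

    public void incPlain() { plain++; }
    public synchronized void incLocked() { locked++; } // uncontended lock

    public long plainCount() { return plain; }
    public long lockedCount() { return locked; }

    public static void main(String[] args) {
        SyncCost c = new SyncCost();
        int n = 10_000_000;
        long t0 = System.nanoTime();
        for (int i = 0; i < n; i++) c.incPlain();
        long t1 = System.nanoTime();
        for (int i = 0; i < n; i++) c.incLocked();
        long t2 = System.nanoTime();
        System.out.printf("plain: %d ms, synchronized: %d ms%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
    }
}
```

On a modern JVM the gap is usually small, which foreshadows Paul's results below, but the point here is only that the cost exists and is measurable.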
NIO was invented as an alternative to thread-based communications (one socket per thread, blocking when necessary). The basic idea of NIO is to avoid threads: one single thread iterates over all open connections, reading and writing new data without blocking. In other words, if a connection has no data pending to read, or data cannot be written, nothing is done and the next connection is tried. This way the illusion of parallelism is preserved (every effective read/write is a small buffer) and the thread costs are gone. This idea is heavily based on Matt Welsh’s Ph.D. thesis and the SEDA project.
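The single-threaded loop described above can be sketched with the Java NIO selector API. This is a deliberately simplified echo handler, not a production design (a real server would also register interest in writes instead of writing inline); the class name and the `connectionsToHandle` parameter are mine, added so the loop can terminate in a demo.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.*;
import java.util.Iterator;

// Sketch of the single-threaded NIO pattern: one selector watches every
// connection, and the loop only touches a channel when it is actually
// ready, so nothing ever blocks on a socket.
public class SelectorEcho {
    public static void serve(ServerSocketChannel server, int connectionsToHandle) throws IOException {
        Selector selector = Selector.open();
        server.configureBlocking(false);
        server.register(selector, SelectionKey.OP_ACCEPT);
        ByteBuffer buf = ByteBuffer.allocate(1024);
        int closed = 0;
        while (closed < connectionsToHandle) {
            selector.select();                       // wait until some channel is ready
            Iterator<SelectionKey> it = selector.selectedKeys().iterator();
            while (it.hasNext()) {
                SelectionKey key = it.next();
                it.remove();
                if (key.isAcceptable()) {            // new client: register it, non-blocking
                    SocketChannel client = server.accept();
                    client.configureBlocking(false);
                    client.register(selector, SelectionKey.OP_READ);
                } else if (key.isReadable()) {       // data waiting: read and echo it back
                    SocketChannel client = (SocketChannel) key.channel();
                    buf.clear();
                    int n = client.read(buf);
                    if (n < 0) {                     // client closed the connection
                        key.cancel();
                        client.close();
                        closed++;
                    } else {
                        buf.flip();
                        while (buf.hasRemaining()) client.write(buf);
                    }
                }
            }
        }
        selector.close();
    }
}
```

Notice that any per-client conversation state would have to be attached to the `SelectionKey` by hand; the thread stack no longer keeps it for us, which is the trade-off discussed later in the article.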
Welsh’s thesis was right for its time; the SEDA project started around 2001, and the scalability of threading in the Linux 2.4 kernel was bad, really bad: a few hundred threads were enough to stall an average Linux system. NIO was introduced in Java 1.4 in 2002, heavily based on Welsh’s ideas.
As time passed, NIO in the real world came to use some threads (very few) to deal with unavoidable locks and to make better use of multi-core systems (NIO is not thread-free); the JVM got better at managing threads, reducing their cost; and, most importantly, Linux threads were re-engineered in version 2.6 of the kernel as NPTL:
In tests, NPTL succeeded in starting 100,000 threads on a IA-32 in two seconds. In comparison, this test under a kernel without NPTL would have taken around 15 minutes
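A small experiment in the spirit of the NPTL numbers quoted above can be run from Java: start a batch of threads that immediately park on a latch, and time their creation. The class name and thread count are my own choices; on a modern kernel and JVM this completes in a fraction of a second.

```java
import java.util.concurrent.CountDownLatch;

// Start `count` threads that park on a latch, measure how long
// creation takes, then release them all so the JVM can exit.
public class ThreadSpawn {
    public static long spawnAndRelease(int count) throws InterruptedException {
        CountDownLatch started = new CountDownLatch(count);
        CountDownLatch release = new CountDownLatch(1);
        long t0 = System.nanoTime();
        for (int i = 0; i < count; i++) {
            new Thread(() -> {
                started.countDown();
                try { release.await(); } catch (InterruptedException ignored) {}
            }).start();
        }
        started.await();                 // every thread is alive and parked
        long elapsed = System.nanoTime() - t0;
        release.countDown();             // let them all finish
        return elapsed;
    }

    public static void main(String[] args) throws InterruptedException {
        int n = 10_000;
        System.out.printf("started %d threads in %d ms%n",
                n, spawnAndRelease(n) / 1_000_000);
    }
}
```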
Paul Tyma’s investigations at Google about NIO and IO
In spite of living in the multi-core age (some threading is absolutely necessary in any server, IO or NIO), and in spite of Linux NPTL and the JVM threading improvements, the conclusion of NIO's superior performance was not questioned. That is, until Paul Tyma took a look and found he was not alone in questioning this assumption:
For a tangential purpose I was benchmarking NIO and IO (simple “blast data” benchmark) I could only get NIO to transfer data up to about 75% of IO’s speed, asynchronous (blocking NIO was just as fast), I blamed myself because we all know asynchronous is faster than synchronous right? Started doing some emailing and googling looking for benchmarks, experiential reports. Yes, everyone knew NIO was faster. No, no one had actually personally tested it. Started formalizing my benchmark then found: http://www.theserverside.com/discussions/thread.tss?thread_id=26700 and http://www.realityinteractive.com/rgrzywinski/archives/000096.
The TheServerSide.com thread revealed the same initial conclusions as Paul’s, in this case from Rahul Bhargava, CTO of Rascal Systems:
As you can see the last line indicates vanilla blocking server (thread per connection) produced the best throughput even with 1700 threads active in the JVM.
No one commented on this post. To be fair, the Discussions section of TheServerSide.com was never very popular compared to the News section, but still, not a single person commented to agree or disagree with this comment.
When Paul realized that the NIO presuppositions could be false, or at least not so definitive, new tests were run on Linux (kernel 2.6), Windows XP, and a very big multi-core system from Azul (a beast of 768 cores), trying to validate or debunk some well-established presuppositions against classic blocking, thread-based IO:
- Asynchronous I/O is faster
- Thread context switching is slow and synchronization among threads will kill your performance
- Threads take up too much memory
- One thread per connection does not scale
The results of Paul’s tests, using up to 1000 threads, show the contrary:
- IO outperforms asynchronous IO in a single client/server scenario (including on Windows XP)
- The cost of thread context switching, idle threads, and uncontended synchronization is near zero. Some explicit synchronization can be avoided by using the non-blocking concurrency utilities; in fact, the IO methods use these concurrency utilities under the hood when possible. Furthermore, NIO is not free of context-switching costs when querying and managing the client data associated with the keys returned by the selector.
- The Java stack can be reduced to 48KB (the minimum), so 41666 stacks (threads) fill 2GB of memory (obviously more space is needed for the OS, applications, etc.). This stack “automatically” saves the state of the communication. In the NIO approach, the client state is associated with the selector keys and must also be saved to keep track of the communication flow and phases.
- Performance is largely independent of the number of threads (up to 1000). In the case of Azul, more threads clearly meant more performance.
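On the memory point: the `Thread` constructor accepts a `stackSize` hint (the default is set with `-Xss`), so threads whose only job is to park on a connection can request a small stack. This sketch is mine, with illustrative names, and I request 64KB rather than the 48KB minimum mentioned above because the VM is free to round the hint up to a platform minimum, or ignore it entirely.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Start `count` threads, each asking the VM for a small stack via the
// stackSize constructor argument (a hint the VM may adjust or ignore),
// run a trivial task in each, and report how many completed.
public class SmallStacks {
    public static int run(int count, long stackBytes) throws InterruptedException {
        AtomicInteger done = new AtomicInteger();
        Thread[] ts = new Thread[count];
        for (int i = 0; i < count; i++) {
            ts[i] = new Thread(null, done::incrementAndGet, "small-" + i, stackBytes);
            ts[i].start();
        }
        for (Thread t : ts) t.join();
        return done.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(1000, 64 * 1024) + " threads ran with 64KB stacks requested");
    }
}
```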
And finally Paul states two advantages of classic IO:
- Coding is much simpler: the code contains the communication flow and states.
- It makes better use of multi-core systems.
Finally, Paul tells “The story of Rob Van Behren (as remembered by me from a lunch with Rob)”.
Conclusions: Not So Fast…
First of all, Paul’s study and conclusions are based on raw socket connections, not on HTTP and servlets. A concrete study applied to HTTP, standard servlets, and the modern asynchronous approaches would be necessary to validate Paul’s conclusions at these higher levels.
For me, the most important conclusion is that the NIO (asynchronous in general) and IO approaches are comparable, and there is no longer one definitive correct answer. There may be scenarios where NIO or IO is better, and others where the choice is irrelevant. And of course there is room for technological improvement.
This case reminds me of the old performance battle between Java and C. Java is no longer the snail technology it once was; many agree that C and Java performance is very similar today, and speed is no longer an important criterion when selecting between these technologies in most typical scenarios.
In the case of ItsNat, supporting the new asynchronous servlet containers is desirable, but no longer an absolute priority for scaling, and no longer problematic. Because ItsNat is based on Java 1.4 and the new containers are compiled with Java 1.5, reflection will be the path we take; and a new asynchronous servlet standard is around the corner!