History
About a month ago I wondered whether Erlang's promise of massive concurrency was actually true.
Rather than take everyone's word for it, and inspired by Richard Jones' C1M experiment, I decided to whip together a benchmark of my own.
This benchmark pitted a Cowboy-based WebSocket echo server against five echo servers implemented on other platforms. The platforms I chose, such as Netty, Node.js, and gevent, are known for their ability to handle c10k.
I ran the benchmark of these servers on a pair of m1.medium EC2 instances. The outcome was surprising: all the servers but the Erlang/Cowboy implementation had massive connection timeouts. At 3am I sent out an off-the-cuff tweet that was heard around the world.
I woke up that morning to cries of foul play and anger from people who felt their framework of choice was poorly represented or that my methodology was faulty.
The Upgrade
After that initial benchmark, I spent the past month refining the process. I sifted through the trolls, accepted pull requests, and improved my methodology with the help of many people.
I automated the test to make it easier to benchmark the 19 servers that make up the suite.
After many small-scale benchmarks on AWS amidst data-center failures, I found that a majority of the servers performed similarly even on a multi-core EC2 instance. I decided that I needed actual hardware, so I picked up an AMD Phenom 9600 quad-core (2.3 GHz) with 2 GB of memory off Craigslist to compare the results.
Methodology
Before I get into the data, I'll briefly explain the testing methodology.
Each server needed to accept a connection and echo back every message sent to it. The benchmark process created connections as fast as the server would allow, and each client connection sent a 33-byte WebSocket message every second (33 bytes is the size of a binary-encoded ref() value). Connection initiation, WebSocket handshakes, message sends, message receives, and any errors were recorded with timestamps in a LevelDB event log on the client.
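To make that concrete, here is a minimal sketch of what each simulated client does. It uses the Python `websocket-client` package purely for illustration; this is not the benchmark's actual client code, and the server address is a placeholder.

```python
# Minimal sketch of a single simulated client; the benchmark's real
# client differs, and the websocket-client package is an assumption
# made for illustration (pip install websocket-client).
import time
from websocket import create_connection

PAYLOAD = b"x" * 33  # 33 bytes: the size of a binary-encoded ref() value

events = []  # the real harness recorded these in a LevelDB event log

start = time.time()
ws = create_connection("ws://server:8000/")  # hypothetical server address
events.append(("handshake", time.time() - start))

for _ in range(300):  # five minutes at one message per second
    sent = time.time()
    ws.send_binary(PAYLOAD)
    ws.recv()  # the server must echo every message back
    events.append(("latency", time.time() - sent))
    time.sleep(1)

ws.close()
```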
I used two separate machines, a client and a server. On the server I ran supervisord, which I used to start and stop the servers under test via its XML-RPC interface.
For each server, the client would do the following (a sketch of this loop appears after the list):
- Start the server
- Warm up the server by doing 10% of the test: 1,000 clients for 30 seconds
- Do the full test: 10,000 clients for 5 minutes
- Cool down the machine by waiting 15 seconds after stopping the server
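A rough sketch of that driver loop, assuming supervisord's standard XML-RPC API; `run_clients()` is a hypothetical stand-in for the client-side benchmark harness, not a real function from the project.

```python
# Per-server driver loop; supervisor.startProcess/stopProcess are
# standard supervisord XML-RPC methods, but run_clients() is a
# hypothetical placeholder for the client-side benchmark harness.
import time
from xmlrpc.client import ServerProxy

supervisor = ServerProxy("http://server:9001/RPC2").supervisor

for name in ("erlang-cowboy", "go-gonet", "node-ws"):  # ...all 19 servers
    supervisor.startProcess(name)            # start the server under test
    run_clients(clients=1000, seconds=30)    # warm-up: 10% of the test
    run_clients(clients=10000, seconds=300)  # full test: 10k clients, 5 min
    supervisor.stopProcess(name)             # stop the server
    time.sleep(15)                           # cool down before the next one
```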
For the EC2 benchmark I used two m1.large instances because they are multi-core.
For the local benchmark I used the quad-core AMD Phenom 9600 machine as the server and my Mac Mini with a 2.0 GHz Core 2 Duo, running Ubuntu 12.04 on the metal (dual-booted), as the client.
Results
EC2:
Implementation | Handshake Time (mean) | Handshake Time (σ) | Handshake Time (median) | Handshake Time (95%) | Handshake Time (99%) | Handshake Time (99.9%) | Latency (mean) | Latency (σ) | Latency (median) | Latency (95%) | Latency (99%) | Latency (99.9%) | Connection Timeouts |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
erlang-cowboy | 46.435 | 101.107 | 18.974 | 147.261 | 352.168 | 1063.972 | 294.412 | 596.273 | 64.735 | 1273.218 | 3120.932 | 4288.831 | 0 |
go-gonet | 56.979 | 140.531 | 15.68 | 185.868 | 1003.776 | 1236.265 | 1044.942 | 1137.411 | 1055.743 | 3283.068 | 4368.498 | 5537.2 | 0 |
pypy-tornado-N | 64.072 | 118.957 | 21.854 | 246.67 | 423.039 | 1103.369 | 1327.666 | 1823.508 | 1028.256 | 4354.553 | 8401.357 | 16029.463 | 0 |
node-ws-cluster | 249.623 | 508.366 | 50.147 | 1249.241 | 2211.754 | 3985.56 | 30833.332 | 32986.663 | 14660.698 | 100549.366 | 114553.006 | 121371.939 | 0 |
scala-play | 908.584 | 1393.172 | 141.461 | 3733.631 | 5936.176 | 8190.456 | 366.273 | 609.687 | 79.203 | 1330.0 | 3117.377 | 3476.734 | 0 |
python-twisted-1 | 2076.194 | 2100.532 | 1166.352 | 5596.974 | 6058.901 | 6316.446 | 93796.664 | 61363.174 | 94101.715 | 185265.599 | 197313.297 | 199846.245 | 0 |
java-webbit | 52.947 | 123.325 | 16.228 | 161.269 | 1003.017 | 1060.396 | 210.729 | 453.46 | 56.76 | 1103.055 | 1357.181 | 6149.677 | 1 |
python-twisted-N | 755.72 | 1136.685 | 187.814 | 3434.43 | 4442.401 | 4889.723 | 54210.334 | 38623.448 | 50500.447 | 117697.486 | 131026.009 | 140935.195 | 1 |
pypy-twisted-1 | 149.641 | 352.28 | 21.637 | 1102.857 | 1355.267 | 1479.419 | 1084.287 | 1558.89 | 274.745 | 4321.805 | 6739.464 | 9197.006 | 9 |
python-gevent-websocket-N | 481.563 | 829.23 | 95.228 | 2439.405 | 3439.794 | 4204.543 | 22003.975 | 13912.214 | 21685.753 | 44442.627 | 52024.362 | 64726.842 | 10 |
pypy-twisted-N | 193.31 | 1033.237 | 36.382 | 783.271 | 1132.858 | 16287.595 | 2338.537 | 2966.061 | 1244.591 | 7651.461 | 13346.289 | 22984.759 | 11 |
python-tornado-N | 607.127 | 786.601 | 246.747 | 2164.769 | 3179.422 | 4381.209 | 56953.943 | 37114.033 | 56394.833 | 118502.284 | 130582.121 | 134621.715 | 15 |
clojure-aleph | 267.895 | 514.649 | 37.883 | 1382.268 | 2158.25 | 4674.043 | 1865.476 | 1265.293 | 2028.303 | 4141.692 | 5156.172 | 7294.652 | 320 |
haskell-snap | 427.79 | 962.598 | 44.693 | 2051.212 | 5218.067 | 7650.281 | 60924.311 | 39791.393 | 62482.556 | 124664.402 | 131380.959 | 134375.903 | 494 |
pypy-tornado-1 | 518.101 | 829.457 | 66.107 | 2624.818 | 2890.471 | 3940.884 | 6477.86 | 5016.109 | 5672.344 | 15301.014 | 23690.165 | 29351.708 | 543 |
python-tornado-1 | 2214.414 | 2177.561 | 1543.396 | 6886.381 | 8047.27 | 24362.644 | 88912.451 | 51015.442 | 89075.187 | 166195.222 | 173810.909 | 177687.522 | 829 |
node-ws | 1795.977 | 1716.92 | 1068.45 | 5041.783 | 6172.564 | 8210.864 | 113848.38 | 70812.913 | 127665.168 | 209871.774 | 224936.342 | 233250.464 | 1425 |
python-gevent-websocket-1 | 3755.372 | 4884.994 | 817.072 | 11084.85 | 20798.476 | 26905.43 | 33670.264 | 23107.803 | 30398.543 | 74106.901 | 81290.792 | 89833.786 | 2830 |
perl-ev | 2229.378 | 4027.54 | 639.591 | 10472.944 | 20696.707 | 29750.725 | 1191.099 | 2025.464 | 408.571 | 5899.183 | 9190.275 | 17235.574 | 4371 |
Local:
Implementation | Handshake Time (mean) | Handshake Time (σ) | Handshake Time (median) | Handshake Time (95%) | Handshake Time (99%) | Handshake Time (99.9%) | Latency (mean) | Latency (σ) | Latency (median) | Latency (95%) | Latency (99%) | Latency (99.9%) | Connection Timeouts |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
node-ws-cluster | 13.213 | 9.284 | 12.626 | 29.475 | 37.785 | 49.817 | 10.281 | 6.076 | 9.055 | 21.966 | 31.651 | 38.85 | 0 |
pypy-tornado-N | 13.323 | 9.099 | 12.701 | 27.625 | 35.48 | 54.795 | 10.283 | 6.025 | 9.048 | 21.828 | 32.21 | 44.626 | 0 |
python-tornado-N | 13.564 | 9.352 | 12.68 | 30.095 | 39.922 | 56.357 | 13.91 | 8.865 | 11.853 | 29.213 | 45.431 | 89.296 | 0 |
python-gevent-websocket-N | 13.657 | 9.702 | 12.775 | 29.703 | 36.899 | 76.452 | 10.931 | 6.3 | 9.836 | 21.572 | 32.288 | 53.608 | 0 |
erlang-cowboy | 13.874 | 8.697 | 13.64 | 28.067 | 35.509 | 51.35 | 11.183 | 7.911 | 9.332 | 24.767 | 40.981 | 67.124 | 0 |
python-twisted-N | 14.129 | 9.586 | 13.24 | 30.999 | 40.043 | 55.277 | 17.946 | 13.381 | 14.658 | 43.116 | 68.38 | 110.879 | 0 |
go-gonet | 14.278 | 14.903 | 12.595 | 28.756 | 59.599 | 169.155 | 13.824 | 43.38 | 9.274 | 22.799 | 40.163 | 710.677 | 0 |
pypy-twisted-N | 15.672 | 17.321 | 13.027 | 33.088 | 75.013 | 193.445 | 12.449 | 7.772 | 10.935 | 25.944 | 41.256 | 72.259 | 0 |
java-webbit | 17.547 | 55.258 | 13.121 | 30.681 | 52.433 | 1004.434 | 11.422 | 7.918 | 9.717 | 28.734 | 43.628 | 55.844 | 0 |
pypy-twisted-1 | 27.441 | 80.393 | 13.516 | 41.03 | 564.404 | 756.465 | 13.872 | 13.44 | 11.223 | 26.465 | 87.42 | 149.827 | 0 |
pypy-tornado-1 | 29.905 | 84.353 | 14.086 | 57.936 | 501.039 | 772.808 | 43.032 | 66.326 | 19.413 | 226.929 | 308.387 | 351.444 | 0 |
scala-play | 37.897 | 129.066 | 13.425 | 67.791 | 1011.026 | 1025.334 | 50.354 | 109.526 | 11.429 | 332.623 | 535.78 | 642.44 | 0 |
node-ws | 390.509 | 503.25 | 100.412 | 1421.621 | 1784.192 | 1994.033 | 13681.632 | 4458.928 | 15166.09 | 18194.522 | 19270.177 | 20861.466 | 0 |
python-twisted-1 | 583.673 | 987.51 | 32.343 | 3018.707 | 3397.069 | 3697.019 | 28912.227 | 13276.581 | 29386.427 | 48560.488 | 50650.745 | 52461.485 | 0 |
clojure-aleph | 126.745 | 299.223 | 15.478 | 1005.36 | 1015.957 | 1428.287 | 1121.035 | 486.259 | 1013.214 | 2020.782 | 3020.174 | 3037.376 | 187 |
python-tornado-1 | 1639.882 | 1451.978 | 1200.693 | 4856.938 | 6088.798 | 6448.096 | 65803.397 | 39733.592 | 66941.409 | 128679.909 | 135929.904 | 137092.752 | 265 |
haskell-snap | 176.855 | 383.722 | 20.219 | 1010.429 | 1742.526 | 2042.309 | 24428.761 | 14931.595 | 24404.437 | 47912.411 | 50257.414 | 50702.091 | 275 |
python-gevent-websocket-1 | 1218.935 | 1822.901 | 239.617 | 5067.439 | 6033.99 | 6535.422 | 11203.601 | 6301.863 | 11600.681 | 20927.817 | 23976.174 | 25890.765 | 538 |
perl-ev | 1555.812 | 3017.7 | 479.488 | 6982.455 | 16591.74 | 27522.464 | 35.84 | 17.569 | 34.168 | 66.911 | 82.117 | 91.16 | 3955 |
Rows are sorted by connection timeouts, then by handshake time and standard deviation, then by message latency and standard deviation.
Starting with the EC2 benchmark (times are in milliseconds), you will notice the peculiar behavior that EC2 exhibited:
[Charts: handshake elapsed time vs. time, and message latency vs. time]
There is what looks like O(log N) growth in connection times and linear growth in the message latency.
When compared to the non-virtualized, local setup, the data takes a much different shape.
If you look at the box plots of the EC2 results:
You will notice that while Erlang did the best, 5% of the messages sent took more than 1.3 seconds, which is likely unacceptable. It is a bittersweet victory for Erlang on EC2.
On my local hardware, all the servers did ridiculously better:
So much better that the top 5 servers are nearly a tie when it comes to time and consistency.
My Conclusion
This test shows the baseline overhead of the servers at ten thousand concurrent client connections. On physical hardware, save a few outliers, the frameworks did comparably well.
Testing these on EC2, on the other hand, tells a much different story. All of the servers had wide deviations from the median. A single node on any of these platforms will have trouble reaching c10k on an m1.large instance, so you will need to load balance your service to get there. Even the leader in the EC2 benchmark (Erlang) reached unacceptable message latencies of more than 1 second. I am sure there exists an EC2 instance type that is able to hit c10k on a single node, but it would cost more than I am willing to spend on this test.
The moral of this story is that you must test your assumptions and do some capacity planning. If you're tempted to stick that Node.js prototype up on EC2 without load testing to show your boss that Node.js scales, you are likely to end up with egg on your face and out of a job.
Code, Charts, Stats and Raw data
Here are links to more detailed charts and stats, along with the raw CSV data. Be kind to Dropbox and only download the raw data if you really plan on using it.
Plea for support
I used Amazon affiliate fees earned through my Books Every Self-Taught Computer Scientist Should Read post to fund this project. The server I bought off Craigslist and the numerous EC2 instances spun up for this test were all made possible by those funds. If you are in the market for any of these books, I would appreciate the kickback. You get a book, I get some money, and it only cuts into Amazon's bottom line (which is a really large line).
I only recommend books that I have actually read and enjoyed. I hope to put the money to good use. (I think a Raspberry Pi-powered FAWN cluster written in Erlang is next, but no promises.)