Websocket Demo Results V2

History

About a month ago, I wondered whether Erlang's promise of massive concurrency was actually true.

Rather than take everyone's word for it, and inspired by Richard Jones' C1M experiment, I decided to whip together a benchmark of my own.

This benchmark pitted a WebSocket echo server implemented with Cowboy against five echo servers implemented on other platforms. The platforms I chose, such as Netty, Node.js and gevent, were known for their ability to handle c10k.

I ran the benchmark of these servers on a pair of m1.medium EC2 instances. The outcome was surprising: all the servers but the Erlang/Cowboy implementation suffered massive connection timeouts. At 3am I sent out an off-the-cuff tweet that was heard around the world.

I woke up that morning to cries of foul play and anger from people who felt that their framework of choice had been poorly represented or that my methodology was faulty.

The Upgrade

After that initial benchmark, I spent the past month refining the process. I sifted through the trolls, accepted pull requests and refined my methodology with the help of many people.

I automated the process to make it easier to benchmark the 19 servers that composed the test.

After many small-scale benchmarks on AWS amidst data-center failures, I found that even on a multi-core EC2 instance, a majority of the servers performed similarly. I decided that I needed actual hardware, so I picked up an AMD Phenom 9600 quad-core (2.3 GHz) with 2 GB of memory off Craigslist to compare the results.

Methodology

Before I get into the data, I'll briefly explain the testing methodology.

Each server needed to accept a connection and echo back every message sent to it. The benchmark process created connections as fast as the server would allow, and each client connection sent a 33-byte WebSocket message every second (33 bytes is the size of a binary-encoded Erlang ref() value). Connection initiation, WebSocket handshake, message send, message receive and any errors were recorded, with timestamps, in a leveldb event log on the client.
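
To make that concrete, here is a minimal sketch of what one benchmark client connection does, written with Python's websockets library. This is only illustrative; the actual client lives in the repo and records its events in leveldb rather than an in-memory list.

    import asyncio
    import os
    import time

    import websockets  # third-party asyncio WebSocket library, used here for illustration

    async def run_client(url, events, duration=300):
        events.append(("connect_start", time.time()))   # connection initiation
        async with websockets.connect(url) as ws:       # TCP connect + WebSocket handshake
            events.append(("handshake_done", time.time()))
            payload = os.urandom(33)                    # 33 bytes: the size of a binary-encoded ref()
            deadline = time.time() + duration
            while time.time() < deadline:
                sent = time.time()
                await ws.send(payload)                  # message send
                await ws.recv()                         # message receive: the server echoes it back
                events.append(("message_latency", time.time() - sent))
                await asyncio.sleep(1)                  # one message per second, per client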

I used two separate machines, a client and a server. On the server I ran supervisord, which I used to start and stop the servers via its XML-RPC interface.

For each server, the client would do the following (sketched in code after the list):

  1. Start the server
  2. Warm up the server by running 10% of the test: 1,000 clients for 30 seconds
  3. Run the full test: 10,000 clients for 5 minutes
  4. Cool down the machine by waiting 15 seconds after stopping the server
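
The driver amounts to a small loop over supervisord's XML-RPC interface. A rough Python sketch, assuming supervisord listens on its conventional port 9001 (supervisor.startProcess and supervisor.stopProcess are standard supervisord XML-RPC methods; run_test is a hypothetical stand-in for the benchmark client):

    import time
    import xmlrpc.client

    # supervisord exposes its XML-RPC API at /RPC2 by convention.
    rpc = xmlrpc.client.ServerProxy("http://server-host:9001/RPC2")

    def benchmark(name):
        rpc.supervisor.startProcess(name)           # 1. start the server
        run_test(name, clients=1000, seconds=30)    # 2. warm-up: 10% of the test
        run_test(name, clients=10000, seconds=300)  # 3. full test
        rpc.supervisor.stopProcess(name)
        time.sleep(15)                              # 4. cool down before the next server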

For the EC2 benchmark I used two m1.large instances because they are multi-core.

For the local benchmark I used the quad-core AMD Phenom 9600 machine as the server and, as the client, my Mac Mini with a 2.0 GHz Core 2 Duo running Ubuntu 12.04 on the metal (dual-booted).

Results

EC2 (all times in milliseconds; HS = handshake time, Lat = message latency):

| Implementation | HS mean | HS σ | HS median | HS 95% | HS 99% | HS 99.9% | Lat mean | Lat σ | Lat median | Lat 95% | Lat 99% | Lat 99.9% | Timeouts |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| erlang-cowboy | 46.435 | 101.107 | 18.974 | 147.261 | 352.168 | 1063.972 | 294.412 | 596.273 | 64.735 | 1273.218 | 3120.932 | 4288.831 | 0 |
| go-gonet | 56.979 | 140.531 | 15.68 | 185.868 | 1003.776 | 1236.265 | 1044.942 | 1137.411 | 1055.743 | 3283.068 | 4368.498 | 5537.2 | 0 |
| pypy-tornado-N | 64.072 | 118.957 | 21.854 | 246.67 | 423.039 | 1103.369 | 1327.666 | 1823.508 | 1028.256 | 4354.553 | 8401.357 | 16029.463 | 0 |
| node-ws-cluster | 249.623 | 508.366 | 50.147 | 1249.241 | 2211.754 | 3985.56 | 30833.332 | 32986.663 | 14660.698 | 100549.366 | 114553.006 | 121371.939 | 0 |
| scala-play | 908.584 | 1393.172 | 141.461 | 3733.631 | 5936.176 | 8190.456 | 366.273 | 609.687 | 79.203 | 1330.0 | 3117.377 | 3476.734 | 0 |
| python-twisted-1 | 2076.194 | 2100.532 | 1166.352 | 5596.974 | 6058.901 | 6316.446 | 93796.664 | 61363.174 | 94101.715 | 185265.599 | 197313.297 | 199846.245 | 0 |
| java-webbit | 52.947 | 123.325 | 16.228 | 161.269 | 1003.017 | 1060.396 | 210.729 | 453.46 | 56.76 | 1103.055 | 1357.181 | 6149.677 | 1 |
| python-twisted-N | 755.72 | 1136.685 | 187.814 | 3434.43 | 4442.401 | 4889.723 | 54210.334 | 38623.448 | 50500.447 | 117697.486 | 131026.009 | 140935.195 | 1 |
| pypy-twisted-1 | 149.641 | 352.28 | 21.637 | 1102.857 | 1355.267 | 1479.419 | 1084.287 | 1558.89 | 274.745 | 4321.805 | 6739.464 | 9197.006 | 9 |
| python-gevent-websocket-N | 481.563 | 829.23 | 95.228 | 2439.405 | 3439.794 | 4204.543 | 22003.975 | 13912.214 | 21685.753 | 44442.627 | 52024.362 | 64726.842 | 10 |
| pypy-twisted-N | 193.31 | 1033.237 | 36.382 | 783.271 | 1132.858 | 16287.595 | 2338.537 | 2966.061 | 1244.591 | 7651.461 | 13346.289 | 22984.759 | 11 |
| python-tornado-N | 607.127 | 786.601 | 246.747 | 2164.769 | 3179.422 | 4381.209 | 56953.943 | 37114.033 | 56394.833 | 118502.284 | 130582.121 | 134621.715 | 15 |
| clojure-aleph | 267.895 | 514.649 | 37.883 | 1382.268 | 2158.25 | 4674.043 | 1865.476 | 1265.293 | 2028.303 | 4141.692 | 5156.172 | 7294.652 | 320 |
| haskell-snap | 427.79 | 962.598 | 44.693 | 2051.212 | 5218.067 | 7650.281 | 60924.311 | 39791.393 | 62482.556 | 124664.402 | 131380.959 | 134375.903 | 494 |
| pypy-tornado-1 | 518.101 | 829.457 | 66.107 | 2624.818 | 2890.471 | 3940.884 | 6477.86 | 5016.109 | 5672.344 | 15301.014 | 23690.165 | 29351.708 | 543 |
| python-tornado-1 | 2214.414 | 2177.561 | 1543.396 | 6886.381 | 8047.27 | 24362.644 | 88912.451 | 51015.442 | 89075.187 | 166195.222 | 173810.909 | 177687.522 | 829 |
| node-ws | 1795.977 | 1716.92 | 1068.45 | 5041.783 | 6172.564 | 8210.864 | 113848.38 | 70812.913 | 127665.168 | 209871.774 | 224936.342 | 233250.464 | 1425 |
| python-gevent-websocket-1 | 3755.372 | 4884.994 | 817.072 | 11084.85 | 20798.476 | 26905.43 | 33670.264 | 23107.803 | 30398.543 | 74106.901 | 81290.792 | 89833.786 | 2830 |
| perl-ev | 2229.378 | 4027.54 | 639.591 | 10472.944 | 20696.707 | 29750.725 | 1191.099 | 2025.464 | 408.571 | 5899.183 | 9190.275 | 17235.574 | 4371 |

Local (all times in milliseconds; HS = handshake time, Lat = message latency):

| Implementation | HS mean | HS σ | HS median | HS 95% | HS 99% | HS 99.9% | Lat mean | Lat σ | Lat median | Lat 95% | Lat 99% | Lat 99.9% | Timeouts |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| node-ws-cluster | 13.213 | 9.284 | 12.626 | 29.475 | 37.785 | 49.817 | 10.281 | 6.076 | 9.055 | 21.966 | 31.651 | 38.85 | 0 |
| pypy-tornado-N | 13.323 | 9.099 | 12.701 | 27.625 | 35.48 | 54.795 | 10.283 | 6.025 | 9.048 | 21.828 | 32.21 | 44.626 | 0 |
| python-tornado-N | 13.564 | 9.352 | 12.68 | 30.095 | 39.922 | 56.357 | 13.91 | 8.865 | 11.853 | 29.213 | 45.431 | 89.296 | 0 |
| python-gevent-websocket-N | 13.657 | 9.702 | 12.775 | 29.703 | 36.899 | 76.452 | 10.931 | 6.3 | 9.836 | 21.572 | 32.288 | 53.608 | 0 |
| erlang-cowboy | 13.874 | 8.697 | 13.64 | 28.067 | 35.509 | 51.35 | 11.183 | 7.911 | 9.332 | 24.767 | 40.981 | 67.124 | 0 |
| python-twisted-N | 14.129 | 9.586 | 13.24 | 30.999 | 40.043 | 55.277 | 17.946 | 13.381 | 14.658 | 43.116 | 68.38 | 110.879 | 0 |
| go-gonet | 14.278 | 14.903 | 12.595 | 28.756 | 59.599 | 169.155 | 13.824 | 43.38 | 9.274 | 22.799 | 40.163 | 710.677 | 0 |
| pypy-twisted-N | 15.672 | 17.321 | 13.027 | 33.088 | 75.013 | 193.445 | 12.449 | 7.772 | 10.935 | 25.944 | 41.256 | 72.259 | 0 |
| java-webbit | 17.547 | 55.258 | 13.121 | 30.681 | 52.433 | 1004.434 | 11.422 | 7.918 | 9.717 | 28.734 | 43.628 | 55.844 | 0 |
| pypy-twisted-1 | 27.441 | 80.393 | 13.516 | 41.03 | 564.404 | 756.465 | 13.872 | 13.44 | 11.223 | 26.465 | 87.42 | 149.827 | 0 |
| pypy-tornado-1 | 29.905 | 84.353 | 14.086 | 57.936 | 501.039 | 772.808 | 43.032 | 66.326 | 19.413 | 226.929 | 308.387 | 351.444 | 0 |
| scala-play | 37.897 | 129.066 | 13.425 | 67.791 | 1011.026 | 1025.334 | 50.354 | 109.526 | 11.429 | 332.623 | 535.78 | 642.44 | 0 |
| node-ws | 390.509 | 503.25 | 100.412 | 1421.621 | 1784.192 | 1994.033 | 13681.632 | 4458.928 | 15166.09 | 18194.522 | 19270.177 | 20861.466 | 0 |
| python-twisted-1 | 583.673 | 987.51 | 32.343 | 3018.707 | 3397.069 | 3697.019 | 28912.227 | 13276.581 | 29386.427 | 48560.488 | 50650.745 | 52461.485 | 0 |
| clojure-aleph | 126.745 | 299.223 | 15.478 | 1005.36 | 1015.957 | 1428.287 | 1121.035 | 486.259 | 1013.214 | 2020.782 | 3020.174 | 3037.376 | 187 |
| python-tornado-1 | 1639.882 | 1451.978 | 1200.693 | 4856.938 | 6088.798 | 6448.096 | 65803.397 | 39733.592 | 66941.409 | 128679.909 | 135929.904 | 137092.752 | 265 |
| haskell-snap | 176.855 | 383.722 | 20.219 | 1010.429 | 1742.526 | 2042.309 | 24428.761 | 14931.595 | 24404.437 | 47912.411 | 50257.414 | 50702.091 | 275 |
| python-gevent-websocket-1 | 1218.935 | 1822.901 | 239.617 | 5067.439 | 6033.99 | 6535.422 | 11203.601 | 6301.863 | 11600.681 | 20927.817 | 23976.174 | 25890.765 | 538 |
| perl-ev | 1555.812 | 3017.7 | 479.488 | 6982.455 | 16591.74 | 27522.464 | 35.84 | 17.569 | 34.168 | 66.911 | 82.117 | 91.16 | 3955 |

Rows are sorted by connection timeouts, then by handshake time and its standard deviation, then by message latency and its standard deviation.
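
Every column in the tables is a summary statistic over the per-event timings pulled from the leveldb log. A minimal sketch of how such a summary can be computed in Python (a simple nearest-rank percentile is assumed here; the actual analysis code ships with the benchmark):

    import statistics

    def summarize(samples):
        """Summarize a list of millisecond timings into the table's columns."""
        xs = sorted(samples)
        pct = lambda p: xs[min(len(xs) - 1, int(p * len(xs)))]  # nearest-rank percentile
        return {
            "mean": statistics.mean(xs),
            "sigma": statistics.pstdev(xs),  # population standard deviation
            "median": statistics.median(xs),
            "95%": pct(0.95),
            "99%": pct(0.99),
            "99.9%": pct(0.999),
        }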

Starting with the EC2 benchmark, you will notice some peculiar behavior:

Handshake elapsed time vs time and message latencies vs time:

[Charts: EC2 handshake times; EC2 latency times]

There is what looks like O(log N) growth in connection times and linear growth in the message latency.

When compared to the non-virtualized, local setup, the data takes a much different shape.

[Charts: Local handshake times; Local latency times]

If you look at the box plots of the EC2 results:

[Charts: EC2 handshake times box plot; EC2 latency times box plot]

You will notice that while Erlang did the best, 5% of the messages took longer than 1.3 seconds, which is likely unacceptable. It is a bittersweet victory for Erlang on EC2.

On my local hardware, all the servers did ridiculously better:

[Charts: Local handshake times; Local latency times]

So much better that the top five servers are nearly tied in both speed and consistency.

My Conclusion

This test shows the baseline overhead of the servers at ten thousand concurrent client connections. On physical hardware, save a few outliers, the frameworks did comparably well.

Testing these on EC2, on the other hand, tells a much different story. All of the servers had wide deviations from the median. A single node on any platform will have trouble reaching c10k on an m1.large instance, so you will need to load balance your service to get there. Even the leader in the EC2 benchmark (Erlang) reached unacceptable message latencies of more than 1 second. I am sure there exists an EC2 instance type that can hit c10k on a single node, but it would cost more than I am willing to spend on this test.

The moral of this story is that you must test your assumptions and do some capacity planning. If you're tempted to stick that Node.js prototype up on EC2 without load testing, just to show your boss that Node.js scales, you are likely to end up with egg on your face and out of a job.

Code, Charts, Stats and Raw data

Here are links to more detailed charts and stats along with the raw CSV data. Try to be kind to Dropbox and only download the raw data if you really plan on using it.

Plea for support

I used Amazon affiliate fees earned through my Books Every Self-Taught Computer Scientist Should Read post to fund this project. The server I bought off Craigslist and the numerous EC2 instances spun up for this test were all paid for with those funds. If you are in the market for any of these books, I would appreciate the kick-back: you get a book, I get some money, and it only cuts into Amazon's bottom line (which is a really large line).

I only recommend books that I have actually read and enjoyed. I hope to put the money to good use. (I think a Raspberry Pi-powered FAWN cluster written in Erlang is next, but no promises.)

Eric Moritz