History
About a month ago I wondered whether Erlang's promise of massive concurrency was actually true.
Rather than take everyone's word for it, and inspired by Richard Jones' C1M experiment, I decided to whip together a benchmark of my own.
This benchmark pitted a Cowboy-based WebSocket echo server against five echo servers implemented on other platforms. The platforms I chose, such as Netty, Node.js, and gevent, are known for their ability to handle c10k.
I ran the benchmark of these servers on a pair of m1.medium EC2 instances. The outcome was surprising: all the servers but the Erlang/Cowboy implementation had massive connection timeouts. At 3am I sent out an off-the-cuff tweet that was heard around the world.
I woke up that morning to cries of foul play and anger from people who felt their framework of choice was poorly represented or that my methodology was faulty.
The Upgrade
After that initial benchmark, I spent the past month refining the process. I sifted through the trolls, accepted pull requests, and improved my methodology with the help of many people.
I automated the test to make it easier to benchmark the 19 servers that make up the suite.
After many small-scale benchmarks on AWS amidst data-center failures, I found that a majority of the servers performed similarly even on a multi-core EC2 instance. I decided that I needed actual hardware, so I picked up an AMD Phenom 9600 quad-core (2.3 GHz) with 2 GB of memory off Craigslist to compare the results.
Methodology
Before I get into the data, I'll briefly explain the testing methodology.
Each server needed to accept a connection and echo back every message sent to it. The benchmark process created connections as fast as the server would allow, and each client connection sent a 33-byte WebSocket message every second (33 bytes is the size of a binary-encoded ref() value). Connection initiation, WebSocket handshakes, message sends, message receives, and any errors were recorded with timestamps in a LevelDB event log on the client.
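To make that concrete, here is a minimal sketch of what each simulated client does. It uses the Python `websocket-client` package purely for illustration; this is not the benchmark's actual client code, and the server address is a placeholder.

```python
# Minimal sketch of a single simulated client; the benchmark's real
# client differs, and the websocket-client package is an assumption
# made for illustration (pip install websocket-client).
import time
from websocket import create_connection

PAYLOAD = b"x" * 33  # 33 bytes: the size of a binary-encoded ref() value

events = []  # the real harness recorded these in a LevelDB event log

start = time.time()
ws = create_connection("ws://server:8000/")  # hypothetical server address
events.append(("handshake", time.time() - start))

for _ in range(300):  # five minutes at one message per second
    sent = time.time()
    ws.send_binary(PAYLOAD)
    ws.recv()  # the server must echo every message back
    events.append(("latency", time.time() - sent))
    time.sleep(1)

ws.close()
```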
I used two separate machines, a client and a server. On the server I ran supervisord, which I used to start and stop the servers under test via its XML-RPC interface.
For each server, the client would do the following (a sketch of this loop appears after the list):
- Start the server
- Warm up the server by doing 10% of the test: 1,000 clients for 30 seconds
- Do the full test: 10,000 clients for 5 minutes
- Cool down the machine by waiting 15 seconds after stopping the server
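A rough sketch of that driver loop, assuming supervisord's standard XML-RPC API; `run_clients()` is a hypothetical stand-in for the client-side benchmark harness, not a real function from the project.

```python
# Per-server driver loop; supervisor.startProcess/stopProcess are
# standard supervisord XML-RPC methods, but run_clients() is a
# hypothetical placeholder for the client-side benchmark harness.
import time
from xmlrpc.client import ServerProxy

supervisor = ServerProxy("http://server:9001/RPC2").supervisor

for name in ("erlang-cowboy", "go-gonet", "node-ws"):  # ...all 19 servers
    supervisor.startProcess(name)            # start the server under test
    run_clients(clients=1000, seconds=30)    # warm-up: 10% of the test
    run_clients(clients=10000, seconds=300)  # full test: 10k clients, 5 min
    supervisor.stopProcess(name)             # stop the server
    time.sleep(15)                           # cool down before the next one
```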
For the EC2 benchmark I used two m1.large instances because they are multi-core.
For the local benchmark I used the quad-core AMD Phenom 9600 machine as the server and my Mac Mini with a 2.0 GHz Core 2 Duo, running Ubuntu 12.04 on the metal (dual-booted), as the client.
Results
EC2:
Implementation | Handshake Time (mean) | Handshake Time (σ) | Handshake Time (median) | Handshake Time (95%) | Handshake Time (99%) | Handshake Time (99.9%) | Latency (mean) | Latency (σ) | Latency (median) | Latency (95%) | Latency (99%) | Latency (99.9%) | Connection Timeouts |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
erlang-cowboy | 46.435 | 101.107 | 18.974 | 147.261 | 352.168 | 1063.972 | 294.412 | 596.273 | 64.735 | 1273.218 | 3120.932 | 4288.831 | 0 |
go-gonet | 56.979 | 140.531 | 15.68 | 185.868 | 1003.776 | 1236.265 | 1044.942 | 1137.411 | 1055.743 | 3283.068 | 4368.498 | 5537.2 | 0 |
pypy-tornado-N | 64.072 | 118.957 | 21.854 | 246.67 | 423.039 | 1103.369 | 1327.666 | 1823.508 | 1028.256 | 4354.553 | 8401.357 | 16029.463 | 0 |
node-ws-cluster | 249.623 | 508.366 | 50.147 | 1249.241 | 2211.754 | 3985.56 | 30833.332 | 32986.663 | 14660.698 | 100549.366 | 114553.006 | 121371.939 | 0 |
scala-play | 908.584 | 1393.172 | 141.461 | 3733.631 | 5936.176 | 8190.456 | 366.273 | 609.687 | 79.203 | 1330.0 | 3117.377 | 3476.734 | 0 |
python-twisted-1 | 2076.194 | 2100.532 | 1166.352 | 5596.974 | 6058.901 | 6316.446 | 93796.664 | 61363.174 | 94101.715 | 185265.599 | 197313.297 | 199846.245 | 0 |
java-webbit | 52.947 | 123.325 | 16.228 | 161.269 | 1003.017 | 1060.396 | 210.729 | 453.46 | 56.76 | 1103.055 | 1357.181 | 6149.677 | 1 |
python-twisted-N | 755.72 | 1136.685 | 187.814 | 3434.43 | 4442.401 | 4889.723 | 54210.334 | 38623.448 | 50500.447 | 117697.486 | 131026.009 | 140935.195 | 1 |
pypy-twisted-1 | 149.641 | 352.28 | 21.637 | 1102.857 | 1355.267 | 1479.419 | 1084.287 | 1558.89 | 274.745 | 4321.805 | 6739.464 | 9197.006 | 9 |
python-gevent-websocket-N | 481.563 | 829.23 | 95.228 | 2439.405 | 3439.794 | 4204.543 | 22003.975 | 13912.214 | 21685.753 | 44442.627 | 52024.362 | 64726.842 | 10 |
pypy-twisted-N | 193.31 | 1033.237 | 36.382 | 783.271 | 1132.858 | 16287.595 | 2338.537 | 2966.061 | 1244.591 | 7651.461 | 13346.289 | 22984.759 | 11 |
python-tornado-N | 607.127 | 786.601 | 246.747 | 2164.769 | 3179.422 | 4381.209 | 56953.943 | 37114.033 | 56394.833 | 118502.284 | 130582.121 | 134621.715 | 15 |
clojure-aleph | 267.895 | 514.649 | 37.883 | 1382.268 | 2158.25 | 4674.043 | 1865.476 | 1265.293 | 2028.303 | 4141.692 | 5156.172 | 7294.652 | 320 |
haskell-snap | 427.79 | 962.598 | 44.693 | 2051.212 | 5218.067 | 7650.281 | 60924.311 | 39791.393 | 62482.556 | 124664.402 | 131380.959 | 134375.903 | 494 |
pypy-tornado-1 | 518.101 | 829.457 | 66.107 | 2624.818 | 2890.471 | 3940.884 | 6477.86 | 5016.109 | 5672.344 | 15301.014 | 23690.165 | 29351.708 | 543 |
python-tornado-1 | 2214.414 | 2177.561 | 1543.396 | 6886.381 | 8047.27 | 24362.644 | 88912.451 | 51015.442 | 89075.187 | 166195.222 | 173810.909 | 177687.522 | 829 |
node-ws | 1795.977 | 1716.92 | 1068.45 | 5041.783 | 6172.564 | 8210.864 | 113848.38 | 70812.913 | 127665.168 | 209871.774 | 224936.342 | 233250.464 | 1425 |
python-gevent-websocket-1 | 3755.372 | 4884.994 | 817.072 | 11084.85 | 20798.476 | 26905.43 | 33670.264 | 23107.803 | 30398.543 | 74106.901 | 81290.792 | 89833.786 | 2830 |
perl-ev | 2229.378 | 4027.54 | 639.591 | 10472.944 | 20696.707 | 29750.725 | 1191.099 | 2025.464 | 408.571 | 5899.183 | 9190.275 | 17235.574 | 4371 |
Local:
Implementation | Handshake Time (mean) | Handshake Time (σ) | Handshake Time (median) | Handshake Time (95%) | Handshake Time (99%) | Handshake Time (99.9%) | Latency (mean) | Latency (σ) | Latency (median) | Latency (95%) | Latency (99%) | Latency (99.9%) | Connection Timeouts |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
node-ws-cluster | 13.213 | 9.284 | 12.626 | 29.475 | 37.785 | 49.817 | 10.281 | 6.076 | 9.055 | 21.966 | 31.651 | 38.85 | 0 |
pypy-tornado-N | 13.323 | 9.099 | 12.701 | 27.625 | 35.48 | 54.795 | 10.283 | 6.025 | 9.048 | 21.828 | 32.21 | 44.626 | 0 |
python-tornado-N | 13.564 | 9.352 | 12.68 | 30.095 | 39.922 | 56.357 | 13.91 | 8.865 | 11.853 | 29.213 | 45.431 | 89.296 | 0 |
python-gevent-websocket-N | 13.657 | 9.702 | 12.775 | 29.703 | 36.899 | 76.452 | 10.931 | 6.3 | 9.836 | 21.572 | 32.288 | 53.608 | 0 |
erlang-cowboy | 13.874 | 8.697 | 13.64 | 28.067 | 35.509 | 51.35 | 11.183 | 7.911 | 9.332 | 24.767 | 40.981 | 67.124 | 0 |
python-twisted-N | 14.129 | 9.586 | 13.24 | 30.999 | 40.043 | 55.277 | 17.946 | 13.381 | 14.658 | 43.116 | 68.38 | 110.879 | 0 |
go-gonet | 14.278 | 14.903 | 12.595 | 28.756 | 59.599 | 169.155 | 13.824 | 43.38 | 9.274 | 22.799 | 40.163 | 710.677 | 0 |
pypy-twisted-N | 15.672 | 17.321 | 13.027 | 33.088 | 75.013 | 193.445 | 12.449 | 7.772 | 10.935 | 25.944 | 41.256 | 72.259 | 0 |
java-webbit | 17.547 | 55.258 | 13.121 | 30.681 | 52.433 | 1004.434 | 11.422 | 7.918 | 9.717 | 28.734 | 43.628 | 55.844 | 0 |
pypy-twisted-1 | 27.441 | 80.393 | 13.516 | 41.03 | 564.404 | 756.465 | 13.872 | 13.44 | 11.223 | 26.465 | 87.42 | 149.827 | 0 |
pypy-tornado-1 | 29.905 | 84.353 | 14.086 | 57.936 | 501.039 | 772.808 | 43.032 | 66.326 | 19.413 | 226.929 | 308.387 | 351.444 | 0 |
scala-play | 37.897 | 129.066 | 13.425 | 67.791 | 1011.026 | 1025.334 | 50.354 | 109.526 | 11.429 | 332.623 | 535.78 | 642.44 | 0 |
node-ws | 390.509 | 503.25 | 100.412 | 1421.621 | 1784.192 | 1994.033 | 13681.632 | 4458.928 | 15166.09 | 18194.522 | 19270.177 | 20861.466 | 0 |
python-twisted-1 | 583.673 | 987.51 | 32.343 | 3018.707 | 3397.069 | 3697.019 | 28912.227 | 13276.581 | 29386.427 | 48560.488 | 50650.745 | 52461.485 | 0 |
clojure-aleph | 126.745 | 299.223 | 15.478 | 1005.36 | 1015.957 | 1428.287 | 1121.035 | 486.259 | 1013.214 | 2020.782 | 3020.174 | 3037.376 | 187 |
python-tornado-1 | 1639.882 | 1451.978 | 1200.693 | 4856.938 | 6088.798 | 6448.096 | 65803.397 | 39733.592 | 66941.409 | 128679.909 | 135929.904 | 137092.752 | 265 |
haskell-snap | 176.855 | 383.722 | 20.219 | 1010.429 | 1742.526 | 2042.309 | 24428.761 | 14931.595 | 24404.437 | 47912.411 | 50257.414 | 50702.091 | 275 |
python-gevent-websocket-1 | 1218.935 | 1822.901 | 239.617 | 5067.439 | 6033.99 | 6535.422 | 11203.601 | 6301.863 | 11600.681 | 20927.817 | 23976.174 | 25890.765 | 538 |
perl-ev | 1555.812 | 3017.7 | 479.488 | 6982.455 | 16591.74 | 27522.464 | 35.84 | 17.569 | 34.168 | 66.911 | 82.117 | 91.16 | 3955 |
Rows are sorted by connection timeouts, then by handshake time and standard deviation, then by message latency and standard deviation.
Starting with the EC2 benchmark (times are in milliseconds), you will notice the peculiar behavior that EC2 exhibited:
[Charts: handshake elapsed time vs. time, and message latency vs. time]
There is what looks like O(log N) growth in connection times and linear growth in the message latency.
When compared to the non-virtualized, local setup, the data takes a much different shape.
If you look at the box plots of the EC2 results:
You will notice that while Erlang did the best, 5% of the messages sent took more than 1.3 seconds, which is likely unacceptable. It is a bittersweet victory for Erlang on EC2.
On my local hardware, all the servers did ridiculously better:
So much better that the top 5 servers are nearly a tie when it comes to time and consistency.
My Conclusion
This test shows the baseline overhead of the servers at ten thousand concurrent client connections. On physical hardware, save a few outliers, the frameworks did comparably well.
Testing these on EC2, on the other hand, tells a much different story. All of the servers had wide deviations from the median. A single node on any of these platforms will have trouble reaching c10k on an m1.large instance, so you will need to load balance your service to get there. Even the leader in the EC2 benchmark (Erlang) reached unacceptable message latencies of more than 1 second. I am sure there exists an EC2 instance type that is able to hit c10k on a single node, but it would cost more than I am willing to spend on this test.
The moral of this story is that you must test your assumptions and do some capacity planning. If you're tempted to stick that Node.js prototype up on EC2 without load testing to show your boss that Node.js scales, you are likely to end up with egg on your face and out of a job.
Code, Charts, Stats and Raw data
Here are links to more detailed charts and stats, along with the raw CSV data. Be kind to Dropbox and only download the raw data if you really plan on using it.
Plea for support
I used Amazon affiliate fees earned through my Books Every Self-Taught Computer Scientist Should Read post to fund this project. The server I bought off Craigslist and the numerous EC2 instances spun up for this test were all made possible by those funds. If you are in the market for any of these books, I would appreciate the kickback. You get a book, I get some money, and it only cuts into Amazon's bottom line (which is a really large line).
I only recommend books that I have actually read and enjoyed. I hope to put the money to good use. (I think a Raspberry Pi-powered FAWN cluster written in Erlang is next, but no promises.)