Will Amazon SQS scale?

Amazon SQS put to the test in a full stack environment

SQS is a hosted message queue service provided by Amazon. You can start using it without any startup fee; you only pay for the requests and the bandwidth you use. This post tries to answer three questions:

  1. How much traffic can it handle?
  2. What is it intended for?
  3. How should you build your applications to get the most out of this service?

A production test

Ok, let's try this out in a full stack production environment. I have previously built a Node.js web application to handle analytics tracking from web pages. This application is open source and called pixel-pong.

Amazon setup

First, I created an image I could run inside EC2 instances. The image is based on Ubuntu 14.04. I then installed Node.js and my application pixel-pong.

Secondly, I created an Amazon Elastic Load Balancer to distribute the traffic over several EC2 instances (c3.large). This lets me simulate our current production environment. I want to test the full stack and see how everything plays together. It's important for me to see where the bottleneck is.

Thirdly, I signed up for a LoadImpact account to be able to perform a distributed load test as close to human behavior as possible. I configured my test with the maximum number of users (10,000) allowed by this subscription. Each user requests a new "page" every 2-3 seconds.

This should generate a peak traffic of about 3500 req/s in my first test without auto scaling enabled.
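As a rough sanity check on that estimate, the expected peak can be derived from the number of users and the pause between page requests. This is only a sketch: it assumes every simulated user fires exactly one request per interval and ignores response time and ramp-up, and the function name is made up for illustration.

```python
def request_rate(users: int, seconds_between_requests: float) -> float:
    """Steady-state requests per second if every user sends one request per interval."""
    return users / seconds_between_requests

USERS = 10_000
fastest = request_rate(USERS, 2.0)  # every user requests a page every 2 seconds
slowest = request_rate(USERS, 3.0)  # every user requests a page every 3 seconds

print(f"expected peak: {slowest:.0f} to {fastest:.0f} req/s")
```

The estimate of about 3500 req/s sits near the lower end of that range.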

LoadImpact test results without auto scaling

[Figure: LoadImpact]

My load test was configured to ramp up from 0 to 10,000 users in about 1:45:00. Around 10:32 am we hit some kind of limit, but LoadImpact continued to increase the traffic.

[Figure: Amazon ELB request count]

At the peak the total traffic from LoadImpact was 2.39K req/sec.

[Figure: Amazon SQS count]

Peak message insert into our SQS queue was 1046 msg/sec (314K msg/5min).

[Figure: Server free memory]

All servers ran out of memory at the same time as we hit the wall.

[Figure: Server CPU idle]

And of course all the resources were drained at the exact same time the servers started swapping.

Summary

  • Max rate 2.39K req/sec (about 206 million requests/day).
  • Peaking at about 1046 messages/sec (about 90 million messages/day).
  • The servers seem to stack up with a lot of IO wait. All this waiting consumes all the resources on the 3 servers. This is strange...
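The per-second, per-5-minute and per-day figures in this test are related by plain multiplication. The small sketch below redoes that arithmetic with the peak rates quoted above (helper names are made up for illustration). Note that 206 million per day corresponds to the 2.39K req/sec request rate, while 1046 msg/sec works out to roughly 90 million messages per day.

```python
SECONDS_PER_5_MIN = 5 * 60
SECONDS_PER_DAY = 24 * 60 * 60

def per_5_min(rate_per_sec: float) -> float:
    """Volume over a 5-minute CloudWatch-style window at a constant rate."""
    return rate_per_sec * SECONDS_PER_5_MIN

def per_day(rate_per_sec: float) -> float:
    """Volume over 24 hours if the peak rate were sustained all day."""
    return rate_per_sec * SECONDS_PER_DAY

sqs_rate = 1046  # peak SQS inserts, msg/sec
elb_rate = 2390  # peak ELB traffic, req/sec

print(per_5_min(sqs_rate))  # 313800, i.e. about 314K msg/5min
print(per_day(sqs_rate))    # 90374400, i.e. about 90 million msg/day
print(per_day(elb_rate))    # 206496000, i.e. about 206 million req/day
```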

We need to enable auto scaling and try again!

LoadImpact test results with auto scaling

This test has been done several times to be sure the results are correct.

[Figure: LoadImpact 2]

My load test was configured to ramp up from 0 to 10,000 users in about 50 minutes. Between 15:13:45 and 15:24:00 we hit some kind of limit. LoadImpact then continued to increase the traffic, and Amazon SQS seems to have scaled up.

[Figure: Amazon ELB request count 2]

At the peak the total traffic from LoadImpact was 6.14K req/sec.

[Figure: Amazon SQS sent messages]

Peak message insert into our SQS queue was 2236 msg/sec (671K msg/5min).

[Figure: Server free memory]

All servers are running low on memory after a short while. This indicates that there are a lot of processes in IO wait state.

[Figure: Server CPU idle]

CPU idle is also low due to the amount of IO going on.

[Figure: Amazon EC2 server count]

Our auto scaling is set to add 3 servers every time the average 1-minute load goes above 60%. In total we ended up with 18 servers up and running, which is far too many for this kind of traffic.
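To see how quickly such a step policy grows the fleet, here is a minimal sketch. It assumes the fleet started at the minimal setup of 3 servers and that no scale-in happened during the test; both function names are hypothetical.

```python
def servers_after(scale_outs: int, initial: int = 3, step: int = 3) -> int:
    """Fleet size after a number of scale-out events, assuming no scale-in."""
    return initial + step * scale_outs

def scale_outs_needed(target: int, initial: int = 3, step: int = 3) -> int:
    """Number of scale-out events needed to reach at least `target` servers."""
    return -(-(target - initial) // step)  # ceiling division

print(scale_outs_needed(18))  # 5
print(servers_after(5))       # 18
```

Under those assumptions, the load alarm only has to fire five times to reach 18 servers.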

Summary

  • Max rate 6.14K req/sec.
  • Peaking at about 2236 messages/sec (about 193 million messages/day).
  • Same as the test above: the servers seem to stack up with a lot of IO wait, and all this waiting consumes all the resources. SQS inserts seem to peak at around 2.2K msg/sec.

Raw performance test

As a control check I'm going to test one web server and its integration against the SQS queue to try to find the limit of 1 instance.

Apache Benchmark test

SSH into the web server and install Apache Benchmark:

$ sudo apt-get install apache2-utils

Running the ab test, 1 million requests with 30 simultaneous connections:

ab -n 1000000 -c 30 'http://localhost:80/pulse?url=http%3A%2F%2Fpluss.vg.no%2F2014%2F09%2F30%2F1777%2F1777_23306617&uid=a33ba410-51e4-4061-a523-3f437a676e71&sid=c455f7bf-0592-435a-af24-780fbb2d992a&a=1412152071778&t=1&spid=304985&did=X9ilkcXmsmldvX88s74b&cid=4ef1cfb0e962dd2e0d8d0000&ti=Fiffens%20trening%20-%20VG%2B&ref=http%3A%2F%2Fpluss.vg.no%2Fauth%2Fauthenticate%3FredirectTo%3D%2F2014%2F09%2F30%2F1777%2F1777_23306617%26code%3D20e98fb0f62d2582e26af6f68eac5b4290a4abae&vs=1464x1131&ss=2560x1440&ps=0x0&mti=undefined&md=Her%20koster%20det%20opp%20mot%2011%20500%20kroner%20i%20m%C3%A5neden%20for%20%C3%A5%20trene.%20Stadig%20flere%20%20bruker%20mange%20tusen%20kroner%20p%C3%A5%20personlig%20trener.%20%E2%80%93%20Un%C3%B8dvendig%2C%20mener%20%20eksperter.&mta=undefined&moti=undefined&moty=undefined&mou=undefined&moi=undefined&modes=undefined&moa=undefined&modet=undefined&mol=undefined&mola=undefined&mosn=undefined&mov=undefined&cust=%7B%22publication%22%3A%22Vgpluss%22%2C%22articleId%22%3A%2223306617%22%2C%22articlePublishDate%22%3A%222014-09-30T20%3A35%3A52%2B02%3A00%22%2C%22conversion%22%3A%22lhfzBgaPRkZ640faSUaspyCgZw1TgXDkuQdPQkxCbUQ%3D%22%7D&name=page_entry&r=1412152072424'

Apache Benchmark test results

[Figure: Amazon SQS sent messages 2]

A peak of 224K messages over a 5-minute period (about 747 msg/sec).

[Figure: Server load]

Server load of the box running both the ab test and the tracking server.

Apache Benchmark results:

Concurrency Level:      30
Time taken for tests:   1346.329 seconds
Complete requests:      1000000
Failed requests:        0
Total transferred:      292000000 bytes
HTML transferred:       43000000 bytes
Requests per second:    742.76 [#/sec] (mean)
Time per request:       40.390 [ms] (mean)
Time per request:       1.346 [ms] (mean, across all concurrent requests)
Transfer rate:          211.80 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       3
Processing:     1   40  16.4     40     177
Waiting:        0   40  16.3     40     176
Total:          1   40  16.4     40     177

Summary

  • Running on one server instance, our Node.js application integrated with SQS is able to handle about 740 req/s (in other words: 1 million messages in 1346 seconds).
  • Mean response time for all requests is 40.39 milliseconds.
  • Our minimal setup contains 3 servers with an estimated capacity of about 2200 req/s, but as we saw above, SQS seems to need time to scale up, and even after 50 minutes it couldn't handle our insert rate.
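The numbers in the Apache Benchmark report are internally consistent, which can be checked with a few lines of arithmetic. The sketch below only uses values from the report above; the variable names are my own.

```python
complete_requests = 1_000_000
time_taken_s = 1346.329
concurrency = 30

# Throughput: total requests divided by wall-clock time.
rps = complete_requests / time_taken_s

# With 30 requests in flight, each one takes concurrency / throughput seconds.
time_per_request_ms = concurrency / rps * 1000

# Naive linear estimate for the minimal 3-server setup.
three_servers = 3 * rps

print(f"{rps:.2f} req/s")               # 742.76 req/s
print(f"{time_per_request_ms:.3f} ms")  # 40.390 ms
print(f"{three_servers:.0f} req/s")     # 2228 req/s
```

The linear estimate assumes the servers do not share a bottleneck, which is exactly what the load tests above call into question.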

SQS benchmarking

After planning to launch a fleet of tracking servers and have them run our raw Apache Benchmark test simultaneously, I stopped for a second and turned to Google.

A guy called Adam Warski had already done this and written a great blog post about it called Benchmarking SQS.

To summarize the results from his blog post, these are the numbers from his tests with 25 threads on each node (messages per second):

Number of nodes                      1           2           4           8
Sender per node & thread        354.15      338.52      305.03      317.33
Sender total                  8,853.75   16,925.83   30,503.33   63,466.00
Receiver per node & thread      166.38      159.13      170.09      174.26
Receiver total                4,159.50    7,956.33   17,008.67   34,851.33
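The totals in this table are simply the per-thread rate multiplied by 25 threads and the number of nodes, which shows that throughput grows close to linearly with the number of nodes. A small sketch that recomputes the totals from the per-thread figures (values copied from the table, variable names are my own):

```python
THREADS_PER_NODE = 25

# nodes -> (rate per node & thread, reported total), messages per second
senders = {1: (354.15, 8853.75), 2: (338.52, 16925.83),
           4: (305.03, 30503.33), 8: (317.33, 63466.00)}
receivers = {1: (166.38, 4159.50), 2: (159.13, 7956.33),
             4: (170.09, 17008.67), 8: (174.26, 34851.33)}

def recomputed_total(per_thread: float, nodes: int) -> float:
    """Total rate implied by the per-thread rate."""
    return per_thread * THREADS_PER_NODE * nodes

for name, rows in (("sender", senders), ("receiver", receivers)):
    for nodes, (per_thread, reported) in rows.items():
        estimate = recomputed_total(per_thread, nodes)
        print(f"{name:8} {nodes} node(s): {estimate:10.2f} vs reported {reported:10.2f}")
```

The recomputed values differ from the reported totals by less than one message per second, which is just rounding in the per-thread column.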

[Figure: SQS performance]

The highest results he managed to get were:

  • 108k msgs/second sent when using 50 threads and 8 nodes
  • 35k msgs/second received when using 25 threads and 8 nodes

For comparison: on Aug 3, 2013, Twitter had a peak of 143K tweets/sec.

Summary

Back to the questions I started this post with:

1. How much traffic can it handle?

In theory Amazon SQS seems to scale to whatever amount of traffic you want. It's built upon EC2 instances with auto scaling. We just have to remember that auto scaling needs time to scale up. But why do my tests experience problems? Does SQS need more time to scale up?

With a total of 18 servers, each capable of handling 742 req/s, I should be able to handle 13,356 req/s.
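Put next to the measured peak from the auto scaling test, this theoretical capacity shows how far below it the fleet actually ran. A quick sketch, assuming capacity adds up linearly per server (variable names are my own):

```python
servers = 18
per_server_rps = 742      # from the single-server Apache Benchmark run
observed_peak_rps = 6140  # ELB peak in the auto scaling test

capacity_rps = servers * per_server_rps
utilization = observed_peak_rps / capacity_rps

print(capacity_rps)          # 13356
print(f"{utilization:.0%}")  # 46%
```

So the 18 servers peaked at less than half of what the single-server benchmark suggests they could do.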

2. What is it intended for?

Amazon Simple Queue Service is intended as a simple queue for messages as they travel between computers. SQS queues can be subscribed to Amazon SNS topics, and access to the queues is controlled by roles or direct user access.

3. How should you build your applications to get the most out of this service?

Use horizontal scaling as described in this appendix.
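On the sending side, horizontal scaling usually means many senders working in parallel and, where possible, batching messages (SQS accepts up to 10 messages per SendMessageBatch call). The sketch below shows the idea with a thread pool. It is not the code used in the tests above: `send_all`, `chunk` and `fake_send_batch` are hypothetical names, and the fake sender only counts messages so the example runs without AWS credentials. In a real setup `send_batch` could wrap a call such as boto3's `send_message_batch`.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List

SQS_MAX_BATCH = 10  # SendMessageBatch accepts at most 10 entries per call

def chunk(messages: List[str], size: int = SQS_MAX_BATCH) -> Iterable[List[str]]:
    """Yield consecutive slices of at most `size` messages."""
    for start in range(0, len(messages), size):
        yield messages[start:start + size]

def send_all(messages: List[str],
             send_batch: Callable[[List[str]], int],
             workers: int = 25) -> int:
    """Send all batches concurrently and return how many messages were reported sent."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(send_batch, chunk(messages)))

def fake_send_batch(batch: List[str]) -> int:
    # Hypothetical stand-in for a real SQS call; it only counts the messages.
    return len(batch)

events = [f"event-{i}" for i in range(1000)]
print(send_all(events, fake_send_batch))  # 1000
```

The default of 25 workers mirrors the thread count in the benchmark above; the right number for a given application is something to measure, not assume.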

Pros and cons

Pros:

  • Amazon SQS is simple to use.
  • It's cost effective and reliable.
  • No setup and maintenance cost.
  • Messages are replicated.
  • It's a hosted service.

Cons:

  • In my tests it seemed to have a problem handling more than 2.3K message inserts/sec.
  • It has limitations in message size, retention and other settings.
  • It's not a streaming service.
  • It's not a replacement for Kafka or Amazon Kinesis.

Links