What I've learned working with Amazon SQS

Keep your retention high ("Message Retention Period")

 -----------       -----       ----------------       ----------
| web front | --> | SQS | --> | dispatcher app | --> | database |
 -----------       -----       ----------------       ----------

Shit happens, and retention is your friend. If your dispatcher application stops working, the SQS queue will keep all your messages until you're back online. You don't have to worry about losing any data while you're fixing the problem.

Oh, what about the web fronts? If they are down, nothing is working and you're not losing any data ;)
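
Raising the retention period is a single API call. A minimal sketch, assuming Python and boto3; the region and queue URL are placeholders:

  import boto3

  sqs = boto3.client("sqs", region_name="us-east-1")
  queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # hypothetical

  # 1209600 seconds = 14 days, the maximum retention SQS allows.
  sqs.set_queue_attributes(
      QueueUrl=queue_url,
      Attributes={"MessageRetentionPeriod": "1209600"},
  )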

Use a correct "Default Visibility Timeout"

If this setting is too low, duplicates can occur. The visibility timeout is the window after you fetch a message during which it stays hidden from other consumers; if you don't delete it within that window, it becomes visible in the queue again. In other words, if your application fetches messages and then crashes, all of those messages become visible again after the timeout expires.
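
A sketch of tuning it, again assuming boto3 and a placeholder queue URL; the 300/600-second values are examples, not recommendations:

  import boto3

  sqs = boto3.client("sqs")
  queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # hypothetical

  # Default Visibility Timeout for the whole queue, in seconds (passed as a string).
  sqs.set_queue_attributes(
      QueueUrl=queue_url,
      Attributes={"VisibilityTimeout": "300"},  # 5 minutes per processing attempt
  )

  # For a single slow message, extend its timeout instead of letting it
  # reappear and get processed twice.
  response = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1)
  for message in response.get("Messages", []):
      sqs.change_message_visibility(
          QueueUrl=queue_url,
          ReceiptHandle=message["ReceiptHandle"],
          VisibilityTimeout=600,  # give this message 10 more minutes
      )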

Use batch to insert and fetch

Amazon SQS is a service where you pay per request and for the bandwidth you use. To save cost, send and fetch messages in batches of 10.

Cost example: 100 million messages (2 KB each) sent and fetched:

 Method            Request cost   Bandwidth cost   Total
 Single messages   $100.00        $45.78           $145.78
 Batch of 10       $10.00         $45.78           $55.78

A potential cost saving of roughly 62%.
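
A minimal sketch of batched sends, receives, and deletes with boto3 (the queue URL and payloads are placeholders):

  import boto3

  sqs = boto3.client("sqs")
  queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # hypothetical

  # Send 10 messages with a single request instead of 10 requests.
  entries = [{"Id": str(i), "MessageBody": f"payload-{i}"} for i in range(10)]
  sqs.send_message_batch(QueueUrl=queue_url, Entries=entries)

  # Fetch up to 10 messages per request; long polling also cuts down on
  # empty (but still billed) receives.
  response = sqs.receive_message(
      QueueUrl=queue_url,
      MaxNumberOfMessages=10,
      WaitTimeSeconds=20,
  )
  messages = response.get("Messages", [])

  # Delete them with one batched request as well.
  if messages:
      sqs.delete_message_batch(
          QueueUrl=queue_url,
          Entries=[
              {"Id": m["MessageId"], "ReceiptHandle": m["ReceiptHandle"]}
              for m in messages
          ],
      )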

Max number of messages in flight

Messages in flight are messages your application has received but not yet deleted. SQS allows at most 120,000 messages in flight per queue at the same time. You have to keep this in mind if you're working with large queues; I have experienced strange errors after exceeding this limit.

This limit also has to be handled inside your application: you can't have too many servers polling the same queue at the same time unless they delete messages quickly enough.
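
One way to stay clear of the limit is to check the approximate in-flight count before polling. This is only a sketch; the safety margin and sleep interval are arbitrary, and the queue URL is a placeholder:

  import time

  import boto3

  sqs = boto3.client("sqs")
  queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # hypothetical

  IN_FLIGHT_LIMIT = 120000
  SAFETY_MARGIN = 10000  # arbitrary headroom, tune for your own workload

  def safe_to_poll():
      """Return False when we are getting close to the in-flight limit."""
      attrs = sqs.get_queue_attributes(
          QueueUrl=queue_url,
          AttributeNames=["ApproximateNumberOfMessagesNotVisible"],
      )
      in_flight = int(attrs["Attributes"]["ApproximateNumberOfMessagesNotVisible"])
      return in_flight < IN_FLIGHT_LIMIT - SAFETY_MARGIN

  while True:
      if not safe_to_poll():
          time.sleep(5)  # back off and let the dispatchers catch up
          continue
      response = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=10)
      # ... process and delete the messages here ...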

Scales well horizontally, but remember duplicates

SQS queues seem to scale well as long as you give them time to warm up.

If you have multiple servers working against the same queue, you have to handle duplicates. They will occur due to race conditions. To handle duplicates we use the key-value store Redis.

 -----------       -----       ----------------       ----------
| web front | --> | SQS | --> | dispatcher app | --> | database |
 -----------       -----       ----------------       ----------
                                      ^
                                      |
                                      v
                                   -------
                                  | Redis |
                                   -------

At the moment we are inserting approximately 200 messages/second (17.3 million/day). To dispatch the data we run 1-2 dispatchers per queue, depending on the growth rate.
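
The deduplication itself can be as simple as an atomic SET NX in Redis before each message is written to the database. A sketch, assuming boto3 and redis-py; keying on the SQS MessageId, the "seen:" prefix, and the 24-hour TTL are all choices you'd tune yourself:

  import boto3
  import redis

  sqs = boto3.client("sqs")
  queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # hypothetical
  cache = redis.Redis(host="localhost", port=6379)

  response = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=10)
  for message in response.get("Messages", []):
      # SET NX only succeeds for the first dispatcher that claims this message.
      # The TTL keeps old keys from piling up.
      first_time = cache.set(
          "seen:" + message["MessageId"], 1, nx=True, ex=24 * 60 * 60
      )
      if not first_time:
          # A duplicate: another dispatcher already handled (or is handling) it.
          continue
      # ... write the message to the database here ...
      sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])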

Cost examples

Cost example: 518.4 million messages/month (17.3 million/day, 2 KB each) sent and fetched:

 Method            Request cost   Bandwidth cost   Total
 Single messages   $518.40        $237.30          $755.70
 Batch of 10       $51.84         $237.30          $289.14

Pros

  • Easy to get started.
  • Easy to use.
  • No manual maintenance.
  • Good SDKs.
  • Very good monitoring via Amazon CloudWatch or external tools such as Datadog.

Cons

  • AWS can be overwhelming.
  • Duplicate messages can occur.
  • Expensive if traffic is huge.