Amazon ELB latency problems

Our setup

For a while we have been experiencing huge latencies with our server setup at Amazon. The image below gives you an impression of how our setup looks. Not all details have been included, but you get the picture.

Our Amazon setup

  1. We’re using Amazon Route 53 to manage our domain.
  2. All our requests are then DNS load balanced to one of the available zones (1a, 1b, 1c).
  3. Inside every zone there is an Amazon Elastic Load Balancer (ELB) to distribute the requests.
  4. Behind each ELB there are two Amazon EC2 instances to handle the requests.
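To see what this looks like from the outside, here is a quick sketch (the domain name is a placeholder for our real domain) that prints the A records Route 53 is currently handing out; it just wraps dig, in the same spirit as the debugging scripts further down:

#!/usr/bin/perl
# Quick sketch: list the IP addresses our domain currently resolves to.
# 'mydomain.com' is a placeholder for our real domain.
use strict;
use warnings;

my $domain = 'mydomain.com';
for my $ip (grep { /^\d/ } split /\n/, `dig +short $domain A`) {
    print "Resolves to: $ip\n";
}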

Huge latencies

Ok, so when you’re trying to fetch a page from our servers it goes like this:

  1. The client resolves our DNS and connects to the resulting IP address of the randomly selected zone and ELB.
  2. The request is received by an ELB and distributed to one of the available backends.
  3. The request is then received by an EC2 instance and handled, and the response is sent back to the client.
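A quick way to get a rough feeling for where the time goes per step is curl’s built-in timers, the same trick the debugging scripts below rely on. A minimal sketch (the URL is a placeholder):

#!/usr/bin/perl
# Minimal sketch: break one request down into DNS lookup, TCP connect,
# time to first byte and total time, using curl's -w timers.
# The URL is a placeholder for one of our real pages.
use strict;
use warnings;

my $url    = 'http://mydomain.com/mypage';
my $format = 'DNS:%{time_namelookup} connect:%{time_connect} ttfb:%{time_starttransfer} total:%{time_total}';
print `curl -s -o /dev/null -w "$format\n" $url`;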

According to Pingdom.com we experienced these response times:

Huge latencies

An average response time of about 5-6 seconds!

Not good!

But what is the cause of the unacceptable latency?

Debugging

Application responses

The first thing to check is the application. I did some manual testing to verify the obvious handling of requests:

  • 200: OK
  • 404: Not found
  • 500: Server error

All requests were handled the way they were supposed to be. I didn’t suspect this would fail, but it is a good place to start the debugging process: check the most basic stuff first.
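For the record, checks like these are trivial to script as well. A minimal sketch using LWP::UserAgent (the URLs and expected codes are just placeholders, not our real pages):

#!/usr/bin/perl
# Minimal sketch of the manual checks: request a few known URLs and
# compare the returned status codes. URLs are placeholders.
use strict;
use warnings;
use LWP::UserAgent;

my %expected = (
    'http://mydomain.com/mypage'       => 200,
    'http://mydomain.com/no-such-page' => 404,
);

my $ua = LWP::UserAgent->new(timeout => 10);
for my $url (sort keys %expected) {
    my $code = $ua->get($url)->code;
    printf "%s => %d (expected %d)\n", $url, $code, $expected{$url};
}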

Application refactoring and testing

During the manual tests above I realized that we were lacking code coverage with unit and integration tests. We were in too much of a hurry when we developed this application. So the natural next step was to fix this.

Refactoring

After some refactoring I was able to get sufficient code coverage with unit and integration tests. I could hold my head up high and say: “I’m sure my application is working fine.”
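Just to illustrate the kind of tests that were added (the application code is not part of this post, so MyApp::Handler and its interface below are hypothetical stand-ins):

#!/usr/bin/perl
# Illustration only: a unit test in the style added during the refactoring.
# MyApp::Handler and handle_request() are hypothetical stand-ins.
use strict;
use warnings;
use Test::More tests => 2;

use_ok('MyApp::Handler');
my $res = MyApp::Handler->new->handle_request('/mypage');
is($res->{status}, 200, 'known page returns 200');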

Load testing

Ok, so far no strange bugs. The next step was to run some load tests to see if I could provoke the systems and reproduce the latency in our back ends.

I chose ApacheBench (ab) for the job. Yes, I know this is not the optimal load testing tool, but at least it shows the raw power of each server without the network latency.

Run the load tests:

ab -n 10000 -c 100 'http://mydomain.com/mypage'

The results:

Concurrency Level:      100
Time taken for tests:   12.066 seconds
Complete requests:      10000
Failed requests:        0
Total transferred:      2920000 bytes
HTML transferred:       430000 bytes
Requests per second:    828.77 [#/sec] (mean)
Time per request:       120.660 [ms] (mean)
Time per request:       1.207 [ms] (mean, across all concurrent requests)
Transfer rate:          236.33 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.5      0       8
Processing:    14  120  45.3    117     296
Waiting:       14  120  45.3    117     296
Total:         14  120  45.3    117     297

Percentage of the requests served within a certain time (ms)
  50%    117
  66%    137
  75%    149
  80%    156
  90%    180
  95%    200
  98%    224
  99%    237
 100%    297 (longest request)

After several tests on every EC2 instance I found nothing. Performance seemed ok. Nothing is fine-tuned, but there was still no sign of our latency problems. Every server is able to handle approximately 800 req/s with the real application.

Head scratching

What now? All the other components in the system are services provided by Amazon.

Bugs inside

So the latency has to be inside one of the Amazon services:

Route 53

Route 53 just points our domain to the ELB domain provided by Amazon. There shouldn’t be any issues with this part.

ELB

An ELB is automatically assigned a domain name by Amazon. This domain is set up using DNS load balancing. When you do a DNS lookup you get something like this:

dig my-load-balancer.eu-west-1.elb.amazonaws.com

[snip]
;; ANSWER SECTION:
my-load-balancer.eu-west-1.elb.amazonaws.com. 60 IN A 54.172.xxx.xxx
my-load-balancer.eu-west-1.elb.amazonaws.com. 60 IN A 54.171.xxx.xxx
my-load-balancer.eu-west-1.elb.amazonaws.com. 60 IN A 54.176.xxx.xxx
[snip]

Ok, then we have to test all the resolved IP addresses several times to see how they behave.

I threw together a couple of Perl scripts for this purpose.

First the dig output parser ‘dig_parse.pl’:

#!/usr/bin/perl
use strict;
use warnings;

# curl -w timers: name lookup, connect, SSL handshake, pretransfer,
# redirect, time to first byte and total time (all in seconds).
my $format = 'TIMING:[%{time_namelookup};%{time_connect};%{time_appconnect};%{time_pretransfer};%{time_redirect};%{time_starttransfer};%{time_total}]';

# Read dig output on STDIN and curl every A record in the answer section.
my $in_answer = 0;
while (<>) {
    $in_answer = 0 if ($in_answer && /^$/);
    if ($in_answer) {
        # Answer line: "<name> <ttl> IN A <ip>"
        my @a  = split(/\s+/, $_);
        my $ip = $a[4];
        my $cmd  = 'curl -w "'.$format.'" --connect-timeout 3 -I http://'.$ip.'/pulse 2>/dev/null';
        my $curl = `$cmd`;
        my ($timing)    = $curl =~ m,TIMING:\[(.+?)\],s;
        my @t           = split(/;/, ($timing || ''));
        my ($http_code) = $curl =~ m,HTTP/1\.1 (\d+),s;
        # One tab separated line per IP: ip, connect, starttransfer, total, HTTP code
        print join "\t", $ip, ($t[1]||0), ($t[5]||0), ($t[6]||0), ($http_code||'xxx');
        print "\n";
    }
    $in_answer = 1 if (/^;; ANSWER SECTION/);
}

Then a loop and statistics counter, ‘dig_loop.pl’, which repeats the dig 30 times and tallies the responses per IP address:

#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

BEGIN { $| = 1 }    # unbuffered output, so the progress dots show up right away

# Resolve the name given on the command line 30 times and collect, per IP
# address, how many responses of each HTTP code we got, the total time used
# (sum of curl's time_total) and the number of attempts.
my $ip_stat = {};
for (1..30) {
    print '.';
    my $res = `dig $ARGV[0] | perl ./dig_parse.pl`;
    chomp($res);
    foreach my $l (split(/\n/, $res)) {
        # Fields from dig_parse.pl: ip, connect, starttransfer, total, HTTP code
        my @line = split(/\t/, $l);
        $ip_stat->{$line[0]}->{$line[4]}++;
        $ip_stat->{$line[0]}->{time_used} += $line[3];
        $ip_stat->{$line[0]}->{cnt}++;
    }
}
print "\n";
print Dumper($ip_stat);

Then I ran it this way:

perl dig_loop.pl my-load-balancer.eu-west-1.elb.amazonaws.com

Guess what… I found a bug in the DNS setup!

Found a bug

$VAR1 = {
          '54.172.xxx.xxx' => {
                                '200' => 30,
                                'time_used' => '3.734',
                                'cnt' => 30
                              },
          '54.171.xxx.xxx' => {
                                '200' => 30,
                                'time_used' => '3.685',
                                'cnt' => 30
                              },
          '54.176.xxx.xxx' => {
                                'time_used' => '900.615',
                                'xxx' => 30,
                                'cnt' => 30
                              }
        };

One of the resolved ELB IP addresses did not answer: not a single response, even with a timeout of 30 seconds (‘time_used’ above is the sum of curl’s total time over the 30 attempts). I’m not sure why this is happening, or why Pingdom.com reports high latency rather than downtime.

I’m going to send a bug report to both Amazon and Pingdom and update this blog post when and if I get the answers.

The solution

After I found this bug I set up a new load balancer with the same settings as the old one, waited for it to become available (about 10-15 minutes) and ran the same Perl script to test it:

perl dig_loop.pl my-new-load-balancer.eu-west-1.elb.amazonaws.com

Now everything seems to be in perfect order:

$VAR1 = {
          '54.276.xxx.xxx' => {
                                '200' => 30,
                                'time_used' => '3.417',
                                'cnt' => 30
                              },
          '54.272.xxx.xxx' => {
                                '200' => 30,
                                'time_used' => '3.41',
                                'cnt' => 30
                              },
          '54.271.xxx.xxx' => {
                                '200' => 30,
                                'time_used' => '3.339',
                                'cnt' => 30
                              }
        };

Then it was time to move the live production traffic over to the new ELB using Route 53. TTLs are only 60 seconds, so this is a quick operation.
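Because of the short TTL it is easy to watch the switch take effect from the outside. A rough sketch (the domain name is a placeholder), again just wrapping dig:

#!/usr/bin/perl
# Rough sketch: print what the domain resolves to once a minute, to watch
# the Route 53 change propagate. The domain name is a placeholder.
use strict;
use warnings;

$| = 1;
my $domain = 'mydomain.com';
for (1..10) {
    my @ips = sort grep { /^\d/ } split /\n/, `dig +short $domain A`;
    print scalar(localtime), "  ", join(', ', @ips), "\n";
    sleep 60;
}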

After a few days with the new ELB, the response time graph from Pingdom.com looks like this:

From huge to almost no latency

Victory

Afterword

These findings show that all services and systems can fail in the most mysterious ways. You should always investigate all parts of a system, even the services provided by a well-known company. We all make mistakes…