I have been running workloads on AWS since 2008, and in that time have experienced a few AZ- or Region-based outages. Although rare, various issues can arise that cause one or more services to have an outage, or as AWS typically terms it, “some customers are experiencing increased error rates…” or something like that.
The first such, ah, issue that I personally experienced was about 10 years ago, when a thunderstorm rolled through Northern Virginia, where AWS has its flagship region and where we ran 100% of our operations at the time. One of the AZs in that region had issues with newly installed generators that weren’t able to sustain the load. And although AWS tries to soften the sound of the impact with flowery yet vague descriptions, and by calculating percentages of affected users across the entire region, it’s still a huge impact on most affected customers, particularly if you aren’t properly or fully using multiple AZs, or even multiple regions.
Beyond the impact on running systems, perhaps the bigger problem was that the control plane for various services was seriously affected, preventing customers (like me) from starting replacement systems or switching over to unaffected AZs.
The next major issue that directly affected me was in early 2017, when a “command and control” system behind S3 was being updated and someone (inadvertently, I’m sure) specified too large a number of systems to be restarted at once. Effectively, for about 47 minutes that day, we (and our millions of customers) weren’t able to access content in our S3 buckets in the N. VA region, breaking our widely distributed video player.
While AWS does preach not to rely on a single AZ – deploy everything multi-AZ, or even multi-region – and reminds us that components/systems/etc. fail regularly, it is easy to get complacent when things run so smoothly for so long. For example, I’ve been running IT systems for a good friend of mine for decades, and we moved his systems to the Oregon region nearly 10 years ago. Mainly to save costs, we run all systems (~20 EC2 instances) in a single AZ, and have had virtually zero problems during that time. Until yesterday….
Around noon PDT yesterday he and his users reported they were having “Internet” issues. Since that could mean anything (or nothing), I dug a bit deeper. Soon I realized that outbound traffic from our hosted RDS servers was either very slow – pages loading only partially – or failing altogether. The first place to look was our various monitoring data points, including those for our newly commissioned NAT Gateway (we had been using a NAT instance for years….). Nothing seemed amiss there. When I tried to run a simple speed test, the page wouldn’t even load.
Not yet sure of the exact cause, I started running DNS checks with nslookup. At this point I was wondering if it was a DNS issue, but what I found was that nslookup would retry, then fail (after a few retries), simply trying to retrieve the name servers for common domain names (amazon.com, google.com, and microsoft.com, for example). Yet these same lookups were all successful from my local computer in my home office.
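For reference, a rough sketch of the kind of checks I ran; the second batch, querying a public resolver directly, is a quick way to separate a VPC resolver problem from a general outbound connectivity problem:

```shell
# Try to fetch NS records for a few common domains.
# From the affected AZ these timed out or failed after retries;
# from my home office they succeeded immediately.
nslookup -type=ns amazon.com
nslookup -type=ns google.com
nslookup -type=ns microsoft.com

# Same query, but against a public resolver instead of the
# VPC-provided one, to see whether DNS itself was the problem
# or just one symptom of broken outbound traffic.
nslookup -type=ns amazon.com 8.8.8.8
```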
Coincidentally I was attending a training class and running some labs for another product out of an AZ in Oregon and I, along with a few others in the class, had been having some intermittent issues connecting to the Internet from systems in Oregon. I started to suspect a culprit, and took a look at the AWS Service Health Dashboard. To my slight surprise I discovered there was indeed an issue which explained our problems.
According to AWS the problem was affecting only one Availability Zone. After a quick look at the Console I was able to easily confirm that our systems were running in that affected AZ. Although it’s somewhat new that AWS lets us know which physical AZ (usw2-az2) a particular logical AZ (us-west-2a) maps to, that info is now easily obtainable through the Console, CLI, etc. It turns out my systems, and those in the lab, were all sitting in the affected AZ.
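The logical-to-physical mapping can be pulled from the CLI as well; a quick sketch, assuming only that you have credentials configured:

```shell
# List each logical AZ name alongside its physical zone ID
# for the Oregon region. Note that the same logical name
# (us-west-2a) can map to different physical zones (e.g.
# usw2-az2) in different AWS accounts.
aws ec2 describe-availability-zones \
    --region us-west-2 \
    --query 'AvailabilityZones[].[ZoneName,ZoneId]' \
    --output text
```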
I run my NAT Gateway in the same AZ as all of my EC2 instances. This makes sense, as we would otherwise incur additional costs for inter-AZ traffic. One recommendation from AWS is to create things like NAT Gateways in each AZ, and route outbound traffic from systems in each AZ through the respective NAT Gateway. So, I simply created a new, if temporary, NAT Gateway in another AZ, then updated the route tables for our few private subnets to route through this new NAT Gateway.
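The workaround boils down to three CLI calls; the resource IDs below (subnet, allocation, route table) are placeholders for illustration, not my actual values:

```shell
# 1. Allocate a fresh Elastic IP for the temporary NAT Gateway.
aws ec2 allocate-address --domain vpc
#    -> note the AllocationId (eipalloc-...) in the output

# 2. Create a NAT Gateway in a public subnet in a healthy AZ,
#    then wait for its state to become "available".
aws ec2 create-nat-gateway \
    --subnet-id subnet-XXXXXXXX \
    --allocation-id eipalloc-XXXXXXXX
#    -> note the NatGatewayId (nat-...) in the output

# 3. Point each private subnet's default route at the new gateway.
aws ec2 replace-route \
    --route-table-id rtb-XXXXXXXX \
    --destination-cidr-block 0.0.0.0/0 \
    --nat-gateway-id nat-XXXXXXXX
```

Switching back later is just another replace-route pointing 0.0.0.0/0 at the original NAT Gateway.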
This solved the immediate issue of outbound connectivity for my users and most services – a few things were still broken, as they rely on a specific IP address that is bound to my original NAT Gateway. Since you cannot detach an Elastic IP from a NAT Gateway, and I didn’t want to delete my original one, we just dealt with that minor inconvenience for a couple of hours until Amazon fixed the problem. Once they did, I switched my private subnet routes back to the original NAT Gateway and everything was back to normal.
Again, I’ve been running these systems in a single AZ for nearly a decade now, and this is the first significant problem we’ve had with a large-scale AWS issue in this AZ in Oregon. We will continue to run this way, and if a problem like this arises again we can work around it easily. If it’s a bigger problem, we’ll deploy our DR strategy of starting replacement instances in another AZ in that region, or in another region (thanks to cross-region replication)!