Implementing Cloud Design Patterns for AWS (Second Edition)
Fault tolerance

Power outages, hardware failures, and data center upgrades are just a few of the many problems that still bubble up to the engineering teams responsible for systems. Data center upgrades are common, and given enough time at AWS, your product team will get an email or notification stating that some servers will shut down or experience brownouts (short, partial losses of power). We've shown that the best way to handle these events is to span data centers (AZs) so that, if a single location experiences issues, the systems continue to respond. Your services should be deployed in an N+1 configuration: if a single frontend is enough to carry the load, run two. Spanning AZs gives us further protection from large-scale outages while keeping latency low. This setup absorbs hiccups and brownouts, as well as an influx of traffic into the system, with minimal impact on the end users.
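The N+1 rule and the spread across AZs can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name plan_n_plus_one and the zone names are hypothetical, and a real deployment would hand this decision to something like an Auto Scaling group rather than compute it by hand.

```python
from typing import Dict, List


def plan_n_plus_one(required: int, zones: List[str]) -> Dict[str, int]:
    """Spread required + 1 instances as evenly as possible across zones.

    `required` is the number of instances needed to carry normal load (N).
    The zone names are illustrative; nothing here calls an AWS API.
    """
    if required < 1:
        raise ValueError("required must be at least 1")
    if not zones:
        raise ValueError("at least one zone is needed")

    total = required + 1  # N+1: one spare beyond what the load needs
    base, extra = divmod(total, len(zones))
    # The first `extra` zones each take one additional instance.
    return {zone: base + (1 if i < extra else 0) for i, zone in enumerate(zones)}


if __name__ == "__main__":
    # One frontend is enough for the load, so two are placed, one per AZ.
    print(plan_n_plus_one(1, ["us-east-1a", "us-east-1b"]))
```

Note that N+1 on its own only covers the loss of a single instance. If one AZ ends up holding more than one instance, losing that whole AZ can still leave fewer than N running, so the spare count may need to be larger when full AZ loss must be tolerated without degradation.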

An example of this architecture can be seen in the reference architecture for Cloud Foundry (http://www.cloudfoundry.org). Each subnet is in a different AZ, and components are deployed on each subnet to provide fault tolerance. A complete loss of two Amazon data centers would slow the system down, but it would remain available. See https://docs.pivotal.io/pivotalcf/2-1/plan/aws/aws_ref_arch.html.

In that reference architecture, DNS is used for global traffic management, and a set of load balancers creates a facade for local traffic management (LTM).
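The claim that losing two data centers slows the system down but leaves it available can be checked with a small model. The sketch below is hypothetical: the placement, the zone names, and the three status labels are assumptions made for illustration, not part of the Cloud Foundry reference architecture.

```python
from typing import Dict, Set


def surviving_capacity(placement: Dict[str, int], failed_zones: Set[str]) -> int:
    """Count the instances that remain in zones that have not failed."""
    return sum(count for zone, count in placement.items() if zone not in failed_zones)


def service_status(placement: Dict[str, int], failed_zones: Set[str], required: int) -> str:
    """Classify the service after the given zones are lost.

    `required` is the instance count needed to serve normal load.
    """
    remaining = surviving_capacity(placement, failed_zones)
    if remaining == 0:
        return "unavailable"
    if remaining < required:
        return "degraded"  # still answering requests, but slower
    return "healthy"


if __name__ == "__main__":
    # Hypothetical layout: two instances in each of three AZs, four needed for full load.
    placement = {"az-a": 2, "az-b": 2, "az-c": 2}
    print(service_status(placement, set(), required=4))
    print(service_status(placement, {"az-a"}, required=4))
    print(service_status(placement, {"az-a", "az-b"}, required=4))
```

With this layout, the loss of one AZ leaves enough instances for full load, and the loss of two leaves the service running at reduced capacity, which matches the behavior described above.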