

# Awesome Scalability, Availability, and Stability Back-end Design Patterns
A curated list of selected readings to illustrate Scalability, Availability, and Stability Design Patterns in Back-end Development.
#### What if your backend went slow?
> Work out which problem you have, a performance problem (slow for a single user) or a scalability problem (fast for a single user but slow under heavy load), by reviewing some [basic design concepts](#basic).
#### What if your backend went down?
> "Even if you lose all one day, you can build all over again if you retain your calm!" - Thuan Pham, CTO at Uber Technologies Inc.
## Contributing
Please take a look at the [contribution guidelines]( first.
Contributions are always welcome!
## Contents
- [Basic](#basic)
- [Scalability](#scalability)
- [Availability](#availability)
- [Stability](#stability)
- [Other Design Aspects](#others)
- [Awesome Books](#books)
- [Awesome Talks](#talks)
## Basic
* [CAP theorem and the trade-offs](
* [Scaling Up versus Scaling Out](
* [Best Practices for Scaling Out](
* [How to deal with latency](
* [Dropbox case: Striving for maximal throughput with acceptable latency](
* [What is ACID?](
* [Architecture issues: Bottlenecks, Database, CPU, IO](
* [How eBay's Shopping Cart used compression techniques to solve network I/O bottlenecks](
* [Performance and Scalability patterns](
* [Advantages and drawbacks of Microservices](
* [Avoid Overengineering](
* [Don't Repeat Yourself (DRY)](
* [DRY in Django](
* [Design for Loose-coupling](
* [Design for Resiliency](
* [Design for Self-healing when failures occur](
* [Design for Scale out](
* [Design for Scale: Three best practices](
* [Design for Evolution](
## Scalability
* [Microservices](
* [Thinking Inside the Container - Riot Games (8 part series)](
* [Containerization at Pinterest](
* [The Evolution of Container Usage at Netflix](
* [Dockerizing MySQL at Uber](
* [Distributed Caching](
* [Write-behind and Write-through](
* [Eviction Policies](
* [Peer-To-Peer Caching](
* [Distributed Caching at Netflix with EVCache](
* [Robust Memcache Traffic Analyzer at](
* [How Etsy caches: Consistent Hashing and Cache Smearing](
* [Distributed Logging & Tracing](
* [Building DistributedLog at Twitter: High-performance replicated log service](
* [Distributed tracing at Pinterest with Pintrace](
* [Scalable and reliable log ingestion at Pinterest](
* [CERN Accelerator Logging Service with Spark](
* [Logging and Aggregation at Quora](
* [Distributed Messaging](
* [Understanding When to use RabbitMQ or Apache Kafka](
* [Delaying Asynchronous Message Processing with RabbitMQ at Indeed](
* [Yelp's Real-time Data Pipeline with Kafka](
* [Real-time Deduping at Scale with Kafka-based Pipeline at Tapjoy](
* [Storage](
* [In-memory Storage](
* [Optimizing Memcached Efficiency at Quora](
* [Real-Time Data Warehouse with MemSQL on Cisco UCS](
* [Moving to MemSQL for solving problems at Tapjoy: horizontally scalable, ACID compliant, MySQL compatibility](
* [Durable Storage (S3)](
* [Reasons for Choosing S3 over HDFS at Databricks](
* [S3 in the Data Infrastructure at Airbnb](
* [Quantcast File System on Amazon S3](
* [Using S3 in Netflix Chukwa](
* [NoSQL](
* [Key-Value Databases (DynamoDB, Voldemort, Manhattan)](
* [Scaling Mapbox infrastructure with DynamoDB Streams](
* [Manhattan: Twitter's distributed key-value database](
* [Column Databases (Cassandra, HBase, Vertica, Sybase IQ)](
* [Consistent Hashing in Cassandra](
* [When NOT to use Cassandra?](
* [Storing Images in Cassandra at Walmart Scale](
* [Cassandra at Instagram](
* [How Yelp Scaled Ad Analytics with Cassandra](
* [How Discord Stores Billions of Messages with Cassandra](
* [Document Databases (MongoDB, CouchDB)](
* [eBay: Building Mission-Critical Multi-Data Center Applications with MongoDB](
* [MongoDB at Baidu: Multi-Tenant Cluster Storing 200+ Billion Documents across 160 Shards](
* [The AWS and MongoDB Infrastructure of Parse (acquired by Facebook)](
* [Graph Databases (Neo4j)](
* [Neo4j at Airbnb](
* [Datastructure Databases (Redis, Hazelcast)](
* [How Twitter Uses Redis To Scale](
* [How Twitter Uses Redis To Scale - Video](
* [Scaling Slack's Job Queue with Redis](
* [Moving persistent data out of Redis at Github](
* [RDBMS (MySQL, PostgreSQL)](
* [Why SQL is beating NoSQL, and what this means for the future of data](
* [Sharding MySQL at Pinterest](
* [How Airbnb Partitioned Main MySQL Database in Two Weeks](
* [Replication is the Key for Scalability & High Availability](
* [How Twitch uses PostgreSQL](
* [Scaling MySQL-based financial reporting system at Airbnb](
* [Scaling to 100M at Wix: MySQL is a Better NoSQL](
* [Why Uber Engineering Switched from Postgres to MySQL](
* [Handling Growth with Postgres at Instagram](
* [HTTP Caching](
* [Reverse Proxy (Nginx, Varnish, Squid, rack-cache)](
* [CDN (Akamai, Amazon CloudFront)](
* [NASA - Streaming 4K Live from the International Space Station Using CloudFront](
* [Concurrency](
* [Message-Passing Concurrency](
* [Software Transactional Memory](
* [Dataflow Concurrency](
* [Shared-State Concurrency](
* [Event-Driven Architecture](
* [Messaging](
* [Publish-Subscribe](
* [Autoscaling Pub/Sub Consumers at Spotify](
* [Point-to-Point](
* [Store-Forward](
* [Request-Reply](
* [Actors: Fire-forget and Fire-Receive-Eventually](
* [Enterprise Service Bus](
* [Domain Events](
* [Event Stream Processing](
* [Kafka Streams on Heroku](
* [Event Sourcing](
* [Command & Query Responsibility Segregation (CQRS)](
* [Load Balancing](
* [Round-robin Allocation](
* [Random Allocation](
* [Weighted Allocation](
* [Dynamic Load Balancing](
* [Work Stealing](!searchin/mechanical-sympathy/http/mechanical-sympathy/CWyAD-oF9Uw/ycO0vxGqMvsJ)
* [Consistent Hashing](
* [UDP Load Balancing](
* [Cloud Load Balancing](
* [AWS ELB: Application Load Balancer, Network Load Balancer, Classic Load Balancer](
* [AWS ELB issues at Asana in 2012](
* [Google Cloud Load Balancing](
* [Parallel Computing](
* [SPMD (Single Program Multiple Data): The Genetic Pattern](
* [Master/Worker Pattern](
* [Loop Parallelism Pattern: Extracting parallel tasks from loops](
* [Fork/Join Pattern: Good for recursive data processing](
* [MapReduce Pattern: Born for Big Data](
* [Parallelize the rendering of web pages: Use case of](
* [Distributed Machine Learning](
* [Scalable Deep Learning Platform On Spark In Baidu](
* [Horovod: Uber's Open Source Distributed Deep Learning Framework for TensorFlow](
* [Scaling Gradient Boosted Trees for Click-Through-Rate Prediction at Yelp](
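
Consistent hashing appears in several entries above (cache placement, Cassandra, load balancing). As a rough illustration only, the sketch below shows a hash ring with virtual nodes in Python. The class name `ConsistentHashRing`, the `replicas` default, and the use of MD5 are assumptions made for this example and are not taken from any of the linked articles.

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Illustrative hash ring with virtual nodes (names and defaults are hypothetical)."""

    def __init__(self, nodes=(), replicas=100):
        self.replicas = replicas  # virtual nodes per physical node
        self._keys = []           # sorted hash positions on the ring
        self._owners = {}         # hash position -> physical node
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(value):
        # MD5 is used only to spread positions evenly, not for security.
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def add(self, node):
        # Place `replicas` virtual nodes for this physical node on the ring.
        for i in range(self.replicas):
            h = self._hash(f"{node}#{i}")
            if h not in self._owners:
                bisect.insort(self._keys, h)
            self._owners[h] = node

    def remove(self, node):
        # Drop only the positions that still belong to this node.
        for i in range(self.replicas):
            h = self._hash(f"{node}#{i}")
            if self._owners.get(h) == node:
                del self._owners[h]
                self._keys.remove(h)

    def get(self, key):
        # Walk clockwise to the first virtual node at or after the key's hash.
        if not self._keys:
            raise LookupError("ring is empty")
        idx = bisect.bisect(self._keys, self._hash(key)) % len(self._keys)
        return self._owners[self._keys[idx]]
```

With this layout, removing a node only remaps the keys that node owned; keys owned by the remaining nodes stay where they were, which is the property that makes the technique attractive for caches and partitioned stores.
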
## Availability
* [Fail-over](
* [Replication](
* [Master-Slave](
* [Tree Replication](
* [Master-Master](
* [Buddy Replication](
## Stability
* [Circuit Breaker](
* [Always use timeouts (if possible)](
* [Let it crash/Supervisors: Embrace failure as a natural state in the life-cycle of the application](
* [Crash early: An error now is better than a response tomorrow](
* [Bulkheads: Partition and tolerate failure in one part](
* [Steady state: Always put logs on separate disk](
* [Throttling: Maintain a steady pace](
* [Multi-clustering: Improving Resiliency and Stability of a Large-scale Monolithic API Service at LinkedIn](
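
To make the first pattern in this list concrete, here is a minimal circuit breaker sketch in Python. It is an assumption-laden illustration, not the implementation from any linked article: the names `CircuitBreaker` and `CircuitOpenError`, the thresholds, and the injectable `clock` parameter are all hypothetical choices for this example.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Illustrative breaker: closed -> open after repeated failures -> half-open after a timeout."""

    def __init__(self, max_failures=3, reset_timeout=30.0, clock=time.monotonic):
        self.max_failures = max_failures    # consecutive failures before opening
        self.reset_timeout = reset_timeout  # seconds to wait before a trial call
        self._clock = clock                 # injectable for testing
        self._failures = 0
        self._opened_at = None              # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Fail fast instead of piling load onto a struggling dependency.
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, so let this one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.max_failures:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each remote call, for example `breaker.call(fetch_profile, user_id)` where `fetch_profile` is a placeholder for your own function, and treat `CircuitOpenError` as a signal to serve a fallback. Combining this with a timeout on the wrapped call keeps slow dependencies from holding threads while the breaker is still closed.
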
## Others
* [Distributed Git Server at Palantir](
* [Seagull: Distributed system that helps run > 20 million tests per day at Yelp](
* [Scalable Gaming Patterns on AWS (Sep 2017)](
* [Building a Modern Bank Backend at Monzo](
* [Selecting a Cloud Provider at Etsy](
## Books
* [The Art of Scalability](
* [Designing Data-Intensive Applications](
* [Web Scalability for Startup Engineers](
* [Scalability Rules: 50 Principles for Scaling Web Sites](
## Talks
* [Harvard CS75 - Lecture 9: Scalability](
* [How We've Scaled Dropbox - Kevin Modzelewski, Back-end Engineer at Dropbox](
* [Lessons of Scale at Facebook - Bobby Johnson, Director of Engineering at Facebook](
* [Scaling Instagram Infrastructure - Lisa Guo, Instagram Engineering](
* [Scaling Pinterest - Marty Weiner, Pinterest's founding engineer](
* [Designing for Failure: Scaling Uber's Backend by Breaking Everything - Matt Ranney, Chief Systems Architect at Uber](
## Special Thanks
* Jonas Bonér, CTO at Lightbend, for the [original inspiration](