# Awesome Scalability, Availability, and Stability Backend [![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)](https://github.com/sindresorhus/awesome)
A curated list of selected readings to illustrate Scalability, Availability, and Stability design patterns in backend development.
#### What if your backend went slow?
> Work out which problem you have, a performance problem (slow for a single user) or a scalability problem (fast for a single user but slow under heavy load), by reviewing some [basic design concepts](#basic).
#### What if your backend went down?
> "Even if you lose all one day, you can build all over again if you retain your calm!" - Thuan Pham, CTO at Uber Technologies Inc.
## Contributing
Please take a look at the [contribution guidelines](CONTRIBUTING.md) first.
Contributions are always welcome!
## Contents
- [Basic](#basic)
- [Scalability](#scalability)
- [Availability](#availability)
- [Stability](#stability)
## Basic
* [CAP theorem and the trade-offs](http://robertgreiner.com/2014/08/cap-theorem-revisited/)
* [Scale up vs Scale out](https://blogs.technet.microsoft.com/admoore/2015/02/17/scaling-out-vs-scaling-up/)
* [How to deal with latency](http://highscalability.com/latency-everywhere-and-it-costs-you-sales-how-crush-it)
    * [Dropbox case: Striving for maximal throughput with acceptable latency](https://blogs.dropbox.com/tech/2017/09/optimizing-web-servers-for-high-throughput-and-low-latency/)
* [What is ACID?](http://highscalability.com/drop-acid-and-think-about-data)
* [Architecture issues: Bottlenecks, Database, CPU, IO](http://highscalability.com/blog/2014/5/12/4-architecture-issues-when-scaling-web-applications-bottlene.html)
* [Advantages and drawbacks of Microservices](https://cloudacademy.com/blog/microservices-architecture-challenge-advantage-drawback/)
* [Avoid Overengineering](https://hackernoon.com/how-to-accept-over-engineering-for-what-it-really-is-6fca9a919263)
* [Don't Repeat Yourself (DRY)](https://softwareengineering.stackexchange.com/questions/103233/why-is-dry-important)
    * [DRY in Django](https://www.webforefront.com/django/designprinciples.html)
* [Designing for loose coupling](https://dzone.com/articles/the-importance-of-loose-coupling-in-rest-api-desig)
* [Designing for resiliency](http://highscalability.com/blog/2012/12/31/designing-for-resiliency-will-be-so-2013.html)
## Scalability
* [Distributed Caching](https://www.wix.engineering/single-post/scaling-to-100m-to-cache-or-not-to-cache)
    * [Write-behind and Write-through](https://docs.oracle.com/cd/E15357_01/coh.360/e15723/cache_rtwtwbra.htm#COHDG5177)
    * [Eviction Policies](http://highscalability.com/blog/2016/1/25/design-of-a-modern-cache.html)
    * [Peer-To-Peer Caching](https://en.wikipedia.org/wiki/P2P_caching)
* [Distributed Logging & Tracing](https://blog.treasuredata.com/blog/2016/08/03/distributed-logging-architecture-in-the-container-era/)
    * [Building DistributedLog at Twitter: High-performance replicated log service](https://blog.twitter.com/engineering/en_us/topics/infrastructure/2015/building-distributedlog-twitter-s-high-performance-replicated-log-servic.html)
    * [Distributed tracing at Pinterest with Pintrace](https://medium.com/@Pinterest_Engineering/distributed-tracing-at-pinterest-with-new-open-source-tools-a4f8a5562f6b)
    * [Scalable and reliable log ingestion at Pinterest](https://medium.com/@Pinterest_Engineering/scalable-and-reliable-data-ingestion-at-pinterest-b921c2ee8754)
* [Distributed Messaging](https://arxiv.org/pdf/1704.00411.pdf)
    * [Understanding When to use RabbitMQ or Apache Kafka](https://content.pivotal.io/blog/understanding-when-to-use-rabbitmq-or-apache-kafka)
* [Storage](http://highscalability.com/blog/2011/11/1/finding-the-right-data-solution-for-your-application-in-the.html)
    * [In-memory Storage](https://medium.com/@denisanikin/what-an-in-memory-database-is-and-how-it-persists-data-efficiently-f43868cff4c1)
        * [Optimizing Memcached Efficiency at Quora](https://engineering.quora.com/Optimizing-Memcached-Efficiency)
        * [Real-Time Data Warehouse with MemSQL on Cisco UCS](https://blogs.cisco.com/datacenter/memsql)
    * [Durable Storage (S3)](https://aws.amazon.com/s3/)
        * [Reasons for Choosing S3 over HDFS at Databricks](https://databricks.com/blog/2017/05/31/top-5-reasons-for-choosing-s3-over-hdfs.html)
        * [S3 in the Data Infrastructure at Airbnb](https://medium.com/airbnb-engineering/data-infrastructure-at-airbnb-8adfb34f169c)
        * [Quantcast File System on Amazon S3](https://www.quantcast.com/blog/quantcast-file-system-on-amazon-s3/)
        * [Using S3 in Netflix Chukwa](https://medium.com/netflix-techblog/evolution-of-the-netflix-data-pipeline-da246ca36905)
    * [NoSQL](https://www.thoughtworks.com/insights/blog/nosql-databases-overview)
        * [Key-Value Databases (DynamoDB, Voldemort, Manhattan)](http://highscalability.com/anti-rdbms-list-distributed-key-value-stores)
            * [Scaling Mapbox infrastructure with DynamoDB Streams](https://blog.mapbox.com/scaling-mapbox-infrastructure-with-dynamodb-streams-d53eabc5e972)
            * [Manhattan: Twitter’s distributed key-value database](https://blog.twitter.com/engineering/en_us/a/2014/manhattan-our-real-time-multi-tenant-distributed-database-for-twitter-scale.html)
        * [Column Databases (Cassandra, Vertica, Sybase IQ)](https://msdn.microsoft.com/en-us/magazine/hh547103.aspx)
            * [Consistent Hashing in Cassandra](https://blog.imaginea.com/consistent-hashing-in-cassandra/)
            * [When NOT to use Cassandra?](https://stackoverflow.com/questions/2634955/when-not-to-use-cassandra)
            * [Storing Images in Cassandra at Walmart Scale](https://medium.com/walmartlabs/building-object-store-storing-images-in-cassandra-walmart-scale-a6b9c02af593)
            * [Cassandra at Instagram](https://www.slideshare.net/DataStax/cassandra-at-instagram-2016)
            * [How Yelp Scaled Ad Analytics with Cassandra](https://engineeringblog.yelp.com/2016/08/how-we-scaled-our-ad-analytics-with-cassandra.html)
            * [How Discord Stores Billions of Messages with Cassandra](https://blog.discordapp.com/how-discord-stores-billions-of-messages-7fa6ec7ee4c7)
        * [Document and Graph Databases (MongoDB, CouchDB, Neo4j)](https://neo4j.com/blog/neo4j-scalability-infographic/)
            * [eBay: Building Mission-Critical Multi-Data Center Applications with MongoDB](https://www.mongodb.com/blog/post/ebay-building-mission-critical-multi-data-center-applications-with-mongodb)
            * [MongoDB at Baidu: Multi-Tenant Cluster Storing 200+ Billion Documents across 160 Shards](https://www.mongodb.com/blog/post/mongodb-at-baidu-powering-100-apps-across-600-nodes-at-pb-scale)
        * [Datastructure Databases (Redis, Hazelcast)](https://db-engines.com/en/system/Hazelcast%3BMemcached%3BRedis)
            * [How Twitter Uses Redis To Scale](http://highscalability.com/blog/2014/9/8/how-twitter-uses-redis-to-scale-105tb-ram-39mm-qps-10000-ins.html)
            * [How Twitter Uses Redis To Scale - Video](https://www.youtube.com/watch?v=QznaOSk20nU)
            * [Redis in Slack job queue](https://slack.engineering/scaling-slacks-job-queue-687222e9d100)
            * [Moving persistent data out of Redis at GitHub](https://githubengineering.com/moving-persistent-data-out-of-redis/)
    * [RDBMS](https://www.mysql.com/products/cluster/scalability.html)
        * [Why SQL is beating NoSQL, and what this means for the future of data](https://blog.timescale.com/why-sql-beating-nosql-what-this-means-for-future-of-data-time-series-database-348b777b847a)
        * [Sharding MySQL at Pinterest](https://medium.com/@Pinterest_Engineering/sharding-pinterest-how-we-scaled-our-mysql-fleet-3f341e96ca6f)
        * [How Airbnb Partitioned Main MySQL Database in Two Weeks](https://medium.com/airbnb-engineering/how-we-partitioned-airbnb-s-main-database-in-two-weeks-55f7e006ff21)
        * [Replication is the Key for Scalability & High Availability](http://basho.com/posts/technical/replication-is-the-key-for-scalability-high-availability/)
        * [How Twitch uses PostgreSQL](https://blog.twitch.tv/how-twitch-uses-postgresql-c34aa9e56f58)
        * [Scaling MySQL-based financial reporting system at Airbnb](https://medium.com/airbnb-engineering/tracking-the-money-scaling-financial-reporting-at-airbnb-6d742b80f040)
        * [Scaling to 100M at Wix: MySQL is a Better NoSQL](https://www.wix.engineering/single-post/scaling-to-100m-mysql-is-a-better-nosql)
        * [Why Uber Engineering Switched from Postgres to MySQL](https://eng.uber.com/mysql-migration/)
        * [Handling Growth with Postgres at Instagram](https://engineering.instagram.com/handling-growth-with-postgres-5-tips-from-instagram-d5d7e7ffdfcb)
* [HTTP Caching](https://developer.mozilla.org/en-US/docs/Web/HTTP/Caching)
    * [Reverse Proxy (Nginx, Varnish, Squid, rack-cache)](https://www.mertech.com/overview-reverse-proxying/)
    * [CDN (Akamai, Amazon CloudFront)](https://building.coursera.org/blog/2015/07/09/improving-coursera-global-site-performance-a-head-to-head-cdn-battle-with-production-traffic/)
        * [NASA - Streaming 4K Live from the International Space Station Using CloudFront](https://live.awsevents.com/nasa4k)
* [Concurrency](https://lambda.grofers.com/open-sourcing-codon-workflow-framework-for-building-aggregator-apis-f8e591a158b4)
    * [Message-Passing Concurrency](https://link.springer.com/chapter/10.1007/978-3-642-35170-9_11)
    * [Software Transactional Memory](https://dl.acm.org/citation.cfm?id=3037750)
    * [Dataflow Concurrency](http://www.marketwired.com/press-release/java-concurrency-and-scalability-platform-akka-celebrates-fifth-anniversary-1928674.htm)
    * [Shared-State Concurrency](https://common-lisp.net/project/ssc/darcs/spec/specification.pdf)
* [Event-Driven Architecture](https://martinfowler.com/articles/201701-event-driven.html)
    * [Messaging](https://www.ibm.com/support/knowledgecenter/en/SSAW57_8.5.5/com.ibm.websphere.nd.doc/ae/cjt1004_.html)
        * [Publish-Subscribe](https://aws.amazon.com/pub-sub-messaging/)
            * [Autoscaling Pub/Sub Consumers at Spotify](https://labs.spotify.com/2017/11/20/autoscaling-pub-sub-consumers/)
        * [Point-to-Point](https://content.pivotal.io/blog/understanding-when-to-use-rabbitmq-or-apache-kafka)
        * [Store-Forward](https://medium.com/netflix-techblog/announcing-suro-backbone-of-netflixs-data-pipeline-5c660ca917b6)
        * [Request-Reply](http://edwardost.github.io/talend/camel/2015/05/15/Scalable-JMS-Request-Reply/)
    * [Actors: Fire-forget and Fire-Receive-Eventually](https://doc.akka.io/docs/akka/2.5.5/scala/actors.html)
    * [Enterprise Service Bus](http://www.oracle.com/technetwork/articles/soa/ind-soa-esb-1967705.html)
    * [Domain Events](https://www.oreilly.com/ideas/the-evolution-of-scalable-microservices)
    * [Event Stream Processing](https://dl.acm.org/citation.cfm?id=2933288)
    * [Event Sourcing](https://medium.com/lcom-techblog/scalable-microservices-with-event-sourcing-and-redis-6aa245574db0)
    * [Command & Query Responsibility Segregation (CQRS)](https://docs.microsoft.com/en-us/azure/architecture/patterns/cqrs)
* [Load Balancing](https://blog.vivekpanyam.com/scaling-a-web-service-load-balancing/)
    * [Round-robin Allocation](https://www.citrix.com/blogs/2010/09/03/load-balancing-round-robin/)
    * [Random Allocation](http://www.streetdirectory.com/travel_guide/192172/world_wide_web/load_balancing_and_yahoo.html)
    * [Weighted Allocation](https://medium.com/netflix-techblog/netflix-shares-cloud-load-balancing-and-failover-tool-eureka-c10647ef95e5)
    * [Dynamic Load Balancing](https://engineeringblog.yelp.com/2017/05/taking-zero-downtime-load-balancing-even-further.html)
    * [Work Stealing](https://groups.google.com/forum/#!searchin/mechanical-sympathy/http/mechanical-sympathy/CWyAD-oF9Uw/ycO0vxGqMvsJ)
    * [Consistent Hashing](https://medium.com/vimeo-engineering-blog/improving-load-balancing-with-a-new-consistent-hashing-algorithm-9f1bd75709ed)
    * [UDP Load Balancing](https://developers.500px.com/udp-load-balancing-with-keepalived-167382d7ad08)
    * [Cloud Load Balancing](https://www.nginx.com/resources/glossary/cloud-load-balancing/)
        * [AWS ELB: Application Load Balancer, Network Load Balancer, Classic Load Balancer](https://aws.amazon.com/elasticloadbalancing/details/#details)
            * [AWS ELB issues at Asana, 2012](https://blog.asana.com/2012/06/issues-moving-to-amazon%E2%80%99s-elastic-load-balancer/)
        * [Google Cloud Load Balancing](https://cloud.google.com/load-balancing/)
* [Parallel Computing](https://blogs.msdn.microsoft.com/ddperf/2009/05/02/are-we-taking-advantage-of-parallelism/)
    * [SPMD (Single Program Multiple Data): The Genetic Pattern](https://www2.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-186.html)
    * [Master/Worker Pattern](https://docs.gigaspaces.com/sbp/master-worker-pattern.html)
    * [Loop Parallelism Pattern: Extracting parallel tasks from loops](https://www.cs.umd.edu/class/fall2001/cmsc411/projects/unroll/main.htm)
    * [Fork/Join Pattern: Good for recursive data processing](http://highscalability.com/learn-how-exploit-multiple-cores-better-performance-and-scalability)
    * [MapReduce Pattern: Born for Big Data](http://static.googleusercontent.com/media/research.google.com/en/us/archive/mapreduce-osdi04.pdf)
    * [Parallelize the rendering of web pages: Use case of Yelp.com](https://engineeringblog.yelp.com/2017/07/generating-web-pages-in-parallel-with-pagelets.html)
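
Consistent hashing shows up twice in the list above (Cassandra partitioning and load balancing). The core idea is that keys and nodes are hashed onto the same ring, so adding or removing a node only moves the keys that node owned. Below is a minimal Python sketch of that idea, not code from any of the linked articles; the `HashRing` class, the `vnodes` parameter, and the node names are hypothetical and chosen only for illustration.

```python
import bisect
import hashlib


class HashRing:
    """Illustrative consistent-hash ring with virtual nodes (hypothetical API)."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes   # virtual points per physical node
        self._keys = []        # sorted hash positions on the ring
        self._owner = {}       # hash position -> node name
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(value):
        # MD5 is used only to spread values evenly, not for security.
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def add(self, node):
        for i in range(self.vnodes):
            h = self._hash(f"{node}#{i}")
            if h not in self._owner:
                bisect.insort(self._keys, h)
            self._owner[h] = node

    def remove(self, node):
        for i in range(self.vnodes):
            h = self._hash(f"{node}#{i}")
            if self._owner.get(h) == node:
                del self._owner[h]
                self._keys.remove(h)

    def get(self, key):
        if not self._keys:
            raise LookupError("ring is empty")
        # First ring position clockwise from the key's hash, wrapping around.
        idx = bisect.bisect(self._keys, self._hash(key)) % len(self._keys)
        return self._owner[self._keys[idx]]


ring = HashRing(["cache-a", "cache-b", "cache-c"])
owner_before = ring.get("user:42")
ring.remove("cache-c")
owner_after = ring.get("user:42")  # unchanged unless cache-c owned the key
```

Virtual nodes are one common way to even out the key distribution; production systems typically add replication and weighting on top of this.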
## Availability
* [Fail-over](https://activemq.apache.org/artemis/docs/1.0.0/ha.html)
* [Replication](https://m.alphasights.com/a-primer-on-database-replication-381b319cd032)
    * [Master-Slave](https://engineering.bitnami.com/articles/enabling-additional-nodes-to-bitnami-mysql-with-replication.html)
    * [Tree Replication](https://link.springer.com/chapter/10.1007/3-540-44863-2_47)
    * [Master-Master](http://sabbour.me/highly-available-and-scalable-master-master-mysql-on-azure-virtual-machines/)
    * [Buddy Replication](https://developer.jboss.org/wiki/JBossCacheBuddyReplicationDesign)
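
In its simplest client-side form, fail-over means trying a primary first and falling back to replicas when it fails. The Python sketch below illustrates only that idea; `call_with_failover` and the endpoint callables are hypothetical names, and real fail-over (as in the linked articles) also involves health checks and state replication.

```python
def call_with_failover(endpoints, request):
    """Try each endpoint in order and return the first successful result.

    `endpoints` is an ordered list of callables (primary first); both the
    function name and this calling convention are assumptions for the sketch.
    """
    last_error = None
    for endpoint in endpoints:
        try:
            return endpoint(request)
        except Exception as exc:  # a real client would catch narrower errors
            last_error = exc
    raise RuntimeError("all endpoints failed") from last_error


def primary(request):
    raise ConnectionError("primary is down")


def replica(request):
    return f"replica handled {request}"


result = call_with_failover([primary, replica], "GET /health")
```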
## Stability
* [Circuit Breaker](https://doc.akka.io/docs/akka/current/common/circuitbreaker.html)
* [Always use timeouts (if possible)](https://www.javaworld.com/article/2824163/application-performance/stability-patterns-applied-in-a-restful-architecture.html)
* [Let it crash/Supervisors: Embrace failure as a natural state in the life-cycle of the application](http://erlang.org/doc/design_principles/sup_princ.html)
* [Crash early: An error now is better than a response tomorrow](http://odino.org/better-performance-the-case-for-timeouts/)
* [Bulkheads: Partition and tolerate failure in one part](https://skife.org/architecture/fault-tolerance/2009/12/31/bulkheads.html)
* [Steady state: Always put logs on separate disk](https://docs.microsoft.com/en-us/sql/relational-databases/policy-based-management/place-data-and-log-files-on-separate-drives)
* [Throttling: Maintain a steady pace](http://www.sosp.org/2001/papers/welsh.pdf)
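
A circuit breaker stops calling a failing dependency for a while so that callers fail fast instead of piling up. The Python sketch below shows the usual closed / open / half-open cycle under simple assumptions; it is not the Akka API linked above, and the `CircuitBreaker` class, `max_failures`, `reset_timeout`, and the injectable `clock` are hypothetical names for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Illustrative circuit breaker (hypothetical API, single-threaded use)."""

    def __init__(self, max_failures=3, reset_timeout=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, so let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.max_failures:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        # A success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Wrapping each remote call as `breaker.call(fetch, url)` (with `fetch` being whatever client function you already have) pairs naturally with the timeout advice in the same list: the timeout bounds a single call, while the breaker bounds repeated failures.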
## Others
* [Scalable Gaming Patterns on AWS (Sep 2017)](https://d0.awsstatic.com/whitepapers/aws-scalable-gaming-patterns.pdf)
* [Building a Modern Bank Backend](https://monzo.com/blog/2016/09/19/building-a-modern-bank-backend/)
* [Best Practices For Horizontal Application Scaling (by Shekhar Gulati - OpenShift)](https://blog.openshift.com/best-practices-for-horizontal-application-scaling/)
## Classic Books
* [The Art of Scalability](http://theartofscalability.com/)
* [Designing Data-Intensive Applications](https://dataintensive.net/)
## Special Thanks
* Jonas Bonér, CTO at Lightbend, for the [original inspiration](https://www.slideshare.net/jboner/scalability-availability-stability-patterns)