merge files from the blockchain infra repo (#59)

autistic-symposium-helper 2024-11-17 17:03:20 -08:00
346 changed files with 29097 additions and 132 deletions

## 📡 communication design patterns
<br>
### Request Response model
<br>
#### used in
- the web, HTTP, DNS, SSH
- RPC (remote procedure call)
- SQL and database protocols
- APIs (REST/SOAP/GraphQL)
<br>
#### the basic idea
1. client sends a request
   - the request structure is defined by both client and server and has a boundary
2. server parses the request
   - the parsing cost is not cheap (e.g., `json` vs. `xml` vs. protocol buffers)
   - for a large payload such as an image, the data can be sent in chunks, with a request per chunk
3. server processes the request
4. server sends a response
5. client parses the response and consumes it
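the cycle above can be sketched as a toy length-prefixed protocol over a socket pair; the names (`send_msg`, `recv_msg`) and the 4-byte prefix are illustrative choices, not a real wire format:

```python
import json
import socket

def send_msg(sock, payload):
    # serialize the message; the other side pays the parsing cost
    body = json.dumps(payload).encode()
    # the 4-byte length prefix gives the message a boundary (step 1)
    sock.sendall(len(body).to_bytes(4, "big") + body)

def recv_msg(sock):
    size = int.from_bytes(sock.recv(4), "big")  # read the boundary first
    return json.loads(sock.recv(size))          # then parse the body

client, server = socket.socketpair()
send_msg(client, {"method": "GET", "path": "/"})  # 1. client sends a request
request = recv_msg(server)                        # 2. server parses it
send_msg(server, {"status": 200, "body": "ok"})   # 3-4. server processes and responds
response = recv_msg(client)                       # 5. client parses and consumes
print(response["status"])                         # 200
```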
<br>
#### an example in your terminal
* see how it always gets the headers first:
```bash
curl --trace - souza.xyz
== Info: Trying 76.76.21.21:80...
== Info: Connected to souza.xyz (76.76.21.21) port 80 (#0)
=> Send header, 79 bytes (0x4f)
0000: 47 45 54 20 2f 20 48 54 54 50 2f 31 2e 31 0d 0a GET / HTTP/1.1..
0010: 48 6f 73 74 3a 20 6d 61 72 69 6e 61 73 6f 75 7a Host: souz
0020: 61 2e 78 79 7a 0d 0a 55 73 65 72 2d 41 67 65 6e a.xyz..User-Agen
0030: 74 3a 20 63 75 72 6c 2f 37 2e 38 38 2e 31 0d 0a t: curl/7.88.1..
0040: 41 63 63 65 70 74 3a 20 2a 2f 2a 0d 0a 0d 0a Accept: */*....
== Info: HTTP 1.0, assume close after body
<= Recv header, 33 bytes (0x21)
0000: 48 54 54 50 2f 31 2e 30 20 33 30 38 20 50 65 72 HTTP/1.0 308 Per
0010: 6d 61 6e 65 6e 74 20 52 65 64 69 72 65 63 74 0d manent Redirect.
0020: 0a .
<= Recv header, 26 bytes (0x1a)
0000: 43 6f 6e 74 65 6e 74 2d 54 79 70 65 3a 20 74 65 Content-Type: te
0010: 78 74 2f 70 6c 61 69 6e 0d 0a xt/plain..
<= Recv header, 36 bytes (0x24)
0000: 4c 6f 63 61 74 69 6f 6e 3a 20 68 74 74 70 73 3a Location: https:
0010: 2f 2f 6d 61 72 69 6e 61 73 6f 75 7a 61 2e 78 79 //souza.xy
0020: 7a 2f 0d 0a z/..
<= Recv header, 41 bytes (0x29)
0000: 52 65 66 72 65 73 68 3a 20 30 3b 75 72 6c 3d 68 Refresh: 0;url=h
0010: 74 74 70 73 3a 2f 2f 6d 61 72 69 6e 61 73 6f 75 ttps://sou
0020: 7a 61 2e 78 79 7a 2f 0d 0a za.xyz/..
<= Recv header, 16 bytes (0x10)
0000: 73 65 72 76 65 72 3a 20 56 65 72 63 65 6c 0d 0a server: Vercel..
<= Recv header, 2 bytes (0x2)
0000: 0d 0a ..
<= Recv data, 14 bytes (0xe)
0000: 52 65 64 69 72 65 63 74 69
```
<br>
----
### Synchronous vs. Asynchronous workloads
<br>
#### Synchronous I/O: the basic idea
1. caller sends a request and blocks
2. caller cannot execute any code meanwhile
3. receiver responds and caller unblocks
4. caller and receiver are in sync
<br>
##### example (note the waste!)
1. program asks the OS to read from disk
2. the program's main thread is taken off the CPU
3. the read completes and the program resumes execution (costly)
<br>
#### Asynchronous I/O: the basic idea
1. caller sends a request
2. caller can work until it gets a response
3. caller either:
- checks whether the response is ready (epoll)
- receiver calls back when it's done (io_uring)
- spins up a new thread that blocks
4. caller and receiver not in sync
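the contrast can be sketched with `asyncio` (a minimal sketch: `asyncio.sleep` is just a stand-in for a slow disk or network read):

```python
import asyncio

async def slow_read():
    await asyncio.sleep(0.1)  # stands in for a slow disk/network read
    return "data"

async def main():
    pending = asyncio.create_task(slow_read())  # 1. caller sends the request
    work = sum(range(1000))                     # 2. caller keeps working meanwhile
    result = await pending                      # 3. caller collects the response
    return f"{work}:{result}"

print(asyncio.run(main()))  # 499500:data
```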
<br>
#### Sync vs. Async in a Request Response
- synchronicity is a client property
- most modern client libraries are async
<br>
#### Async workload is everywhere
- async programming (promises, futures)
- async backend processing
- async commits in postgres
- async IO in Linux (epoll, io_uring)
- async replication
- async OS fsync (filesystem cache)
<br>
----
### Push
<br>
#### pros and cons
- real-time
- the client must be online (connected to the server)
- the client must be able to handle the load
- polling is preferred for light clients.
- used by RabbitMQ (clients consume the queues, and the messages are pushed to the clients)
<br>
#### the basic idea
1. client connects to a server
2. server sends data to the client
3. client doesn't have to request anything
4. protocol must be bidirectional
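the push model above can be sketched in a few lines, where plain callbacks stand in for open bidirectional connections (all names here are illustrative):

```python
# a toy push model: callbacks stand in for connected, online clients
class PushServer:
    def __init__(self):
        self.clients = []

    def connect(self, callback):
        self.clients.append(callback)  # 1. client connects (and stays online)

    def publish(self, message):
        for deliver in self.clients:   # 2-3. server pushes without any request
            deliver(message)

inbox = []
server = PushServer()
server.connect(inbox.append)
server.publish("new event")
print(inbox)  # ['new event']
```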
<br>
----
### Polling
<br>
* used when a request takes a long time to process (e.g., uploading a video); it's also very simple to build
* however, it can be too chatty, using too much network bandwidth and backend resources
<br>
#### the basic idea
1. client sends a request
2. server responds immediately with a handle
3. server continues to process the request
4. client uses that handle to check for status
5. multiple short request/response cycles act as polls
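the steps above can be sketched in Python (the in-memory `jobs` dict stands in for real backend state; all names are illustrative):

```python
import uuid

jobs = {}  # stand-in for real backend state

def submit(task):
    handle = str(uuid.uuid4())                               # 2. respond with a handle
    jobs[handle] = {"status": "processing", "result": None}  # 3. keep processing
    return handle

def finish(handle, result):
    jobs[handle] = {"status": "done", "result": result}      # backend completes later

def poll(handle):
    return jobs[handle]                                      # 4-5. short poll requests

h = submit("transcode video")
print(poll(h)["status"])  # processing
finish(h, "video.mp4")
print(poll(h)["status"])  # done
```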
<br>
---
### Long Polling
<br>
* a polling request where the server only responds when the job is ready (used when a request takes a long time to process and it's not real time)
* used by Kafka
<br>
#### the basic idea
1. client sends a request
2. server responds immediately with a handle
3. server continues to process the request
4. client uses that handle to check for status
5. server does not reply until it has the response (subject to timeouts)
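the difference from plain polling can be sketched with a `threading.Event`, so the poll call blocks until the job is ready (illustrative names, not a real API):

```python
import threading

class Job:
    def __init__(self):
        self.ready = threading.Event()
        self.result = None

    def complete(self, result):
        self.result = result
        self.ready.set()

    def long_poll(self, timeout=5.0):
        # the "server" holds the request open until the job is done or a timeout fires
        if self.ready.wait(timeout):
            return self.result
        return None  # timed out; the client simply polls again

job = Job()
threading.Timer(0.1, job.complete, args=["done"]).start()  # backend finishes later
print(job.long_poll())  # one blocked call instead of many short polls
```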
<br>
---
### Server Sent Events
<br>
* one request with a long response; the client must be online and able to handle the response
<br>
#### the basic idea
1. a response has start and end
2. client sends a request
3. server sends logical events as part of response
4. server never writes the end of the response
5. it's still a request but an unending response
6. client parses the streams data
7. works with HTTP
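the stream parsing in steps 3-6 can be sketched with a tiny parser for `data:` lines (a simplified take on the SSE format; real SSE also has `event:`, `id:`, and retry fields):

```python
def sse_events(lines):
    # parse a stream of "data: ..." lines; a blank line ends one logical event
    buffer = []
    for line in lines:
        if line == "":
            yield "\n".join(buffer)
            buffer = []
        elif line.startswith("data: "):
            buffer.append(line[len("data: "):])

stream = ["data: event 1", "", "data: event 2", ""]  # the response never really ends
print(list(sse_events(stream)))  # ['event 1', 'event 2']
```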
<br>
----
### Publish Subscribe (Pub/Sub)
<br>
* one publisher has many readers (and there can be many publishers)
* relevant when there are many servers (e.g., upload, compress, format, notification)
* great for microservices as it scales with multiple receivers
* loose coupling (clients are not connected to each other, and the system keeps working while some clients are down)
* however, you cannot know whether the consumer/subscriber got the message, or got it twice, etc.
* it can also result in network saturation and extra complexity
* used by RabbitMQ and Kafka
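the idea can be sketched with a toy in-process broker (plain lists stand in for queues; real brokers like RabbitMQ and Kafka add persistence, acknowledgments, and partitioning on top of this):

```python
from collections import defaultdict

class Broker:
    def __init__(self):
        self.topics = defaultdict(list)

    def subscribe(self, topic, queue):
        self.topics[topic].append(queue)

    def publish(self, topic, message):
        # the publisher never talks to consumers directly (loose coupling);
        # note there is no ack: the publisher cannot know who processed what
        for queue in self.topics[topic]:
            queue.append(message)

compress, notify = [], []
broker = Broker()
broker.subscribe("uploads", compress)
broker.subscribe("uploads", notify)
broker.publish("uploads", "video-42")
print(compress, notify)  # both subscribers received the message
```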
<br>
---
### Multiplexing vs. Demultiplexing
<br>
* used by HTTP/2, QUIC, connection pool, MPTCP
* connection pooling is a technique where you can spin several backend connections and keep them "hot"
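connection pooling can be sketched in a few lines (a toy pool; in practice `make_conn` would be a real TCP or database connection factory):

```python
import queue

class Pool:
    def __init__(self, make_conn, size):
        self.idle = queue.Queue()
        for _ in range(size):
            self.idle.put(make_conn())  # spin up connections and keep them "hot"

    def acquire(self):
        return self.idle.get()          # blocks if every connection is busy

    def release(self, conn):
        self.idle.put(conn)

pool = Pool(make_conn=object, size=2)   # object() stands in for a real connection
c1 = pool.acquire()
c2 = pool.acquire()
pool.release(c1)
c3 = pool.acquire()
print(c3 is c1)  # True: the same warm connection is reused
```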
<br>
---
### Stateful vs. Stateless
<br>
* a very contentious topic: is state stored in the backend? how much do you rely on the state of an application, system, or protocol?
* **stateful backend**: stores state about clients in its memory and depends on that information being there
* **stateless backend**: the client is responsible for "transferring the state" with every request (the backend may store it but can safely lose it)
<br>
#### Stateless backends
* stateless backends can still store data somewhere else
* the backend remains stateless but the system is stateful (test: can you restart the backend during idle time while the client workflow continues to work?)
<br>
#### Stateful backend
* the server generates a session, stores it locally, and returns it to the user
* on each request, the server checks whether the session is in its memory to authenticate and respond
* if the backend is restarted, the sessions are gone (they never relied on a database)
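the session flow above, sketched as a toy in-memory store (illustrative names; a real backend would use signed cookies and an HTTP framework):

```python
import secrets

sessions = {}  # in-memory only: no database behind it

def login(user):
    session_id = secrets.token_hex(8)  # server generates a session...
    sessions[session_id] = user        # ...stores it locally...
    return session_id                  # ...and returns it to the client

def authenticate(session_id):
    return sessions.get(session_id)    # only checks its own memory

sid = login("alice")
print(authenticate(sid))  # alice
sessions.clear()          # simulate a backend restart
print(authenticate(sid))  # None: every session is gone
```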
<br>
#### Stateless vs. Stateful protocols
* protocols can be designed to store state
* TCP is stateful: sequence numbers, connection file descriptors
* UDP is stateless: e.g., DNS sends a queryID in UDP to identify queries
* QUIC is stateful, but it sends a connectionID to identify each connection, carrying that state across the protocol
* you can build a stateless protocol on top of a stateful one and vice versa (e.g., HTTP on top of TCP, with cookies)
<br>
#### Complete stateless systems
* completely stateless systems are very rare
* state is carried with every request
* e.g., a backend service that relies entirely on the input it receives
* **JWT (JSON Web Token)**: everything is in the token, and you cannot mark a token as invalid
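the signed, self-contained token idea can be sketched with the stdlib `hmac` module (this is NOT the real JWT wire format; just the principle that all state travels in the token, so any server with the key can verify it without a session store):

```python
import base64
import hashlib
import hmac
import json

SECRET = b"demo-key"  # in practice, a well-protected server-side key

def issue(claims):
    body = base64.urlsafe_b64encode(json.dumps(claims).encode()).decode()
    sig = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    return f"{body}.{sig}"  # everything the server needs travels in the token

def verify(token):
    body, sig = token.rsplit(".", 1)
    expected = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return None  # tampered; note there is no way to revoke a valid token
    return json.loads(base64.urlsafe_b64decode(body))

token = issue({"user": "alice"})
print(verify(token))        # {'user': 'alice'}
print(verify(token + "0"))  # None (signature mismatch)
```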
<br>
---
### Sidecar Pattern
<br>
* every protocol requires a library, but changing the library is hard: the app is entrenched in it, and breaking changes hurt backward compatibility
* sidecar pattern is the idea of delegating communication through a proxy with a rich library (and the client has a thin library)
* in this case, every client has a sidecar proxy
* pros: it's language agnostic, provides extra security, service discovery, caching.
* cons: complexity, latency
<br>
#### Examples
* service mesh proxies (Linkerd, Istio, Envoy)
* sidecar proxy container (must be layer 7 proxy)
<br>

## data engineering
<br>
### articles
* [machine learning system design](https://medium.com/@ricomeinl/machine-learning-system-design-f2f4018f2f8)
* [how to code neat ml pipelines](https://www.neuraxio.com/en/blog/neuraxle/2019/10/26/neat-machine-learning-pipelines.html)
<br>
### enterprise solutions
* [netflix data pipeline](https://medium.com/netflix-techblog/evolution-of-the-netflix-data-pipeline-da246ca36905)
* [netflix data videos](https://www.youtube.com/channel/UC00QATOrSH4K2uOljTnnaKw)
* [yelp data pipeline](https://engineeringblog.yelp.com/2016/07/billions-of-messages-a-day-yelps-real-time-data-pipeline.html)
* [gusto data pipeline](https://engineering.gusto.com/building-a-data-informed-culture/)
* [500px data pipeline](https://medium.com/@samson_hu/building-analytics-at-500px-92e9a7005c83)
* [twitter data pipeline](https://blog.twitter.com/engineering/en_us/topics/insights/2018/ml-workflows.html)
* [coursera data pipeline](https://medium.com/@zhaojunzhang/building-data-infrastructure-in-coursera-15441ebe18c2)
* [cloudflare data pipeline](https://blog.cloudflare.com/how-cloudflare-analyzes-1m-dns-queries-per-second/)
* [pandora data pipeline](https://engineering.pandora.com/apache-airflow-at-pandora-1d7a844d68ee)
* [heroku data pipeline](https://medium.com/@damesavram/running-airflow-on-heroku-ed1d28f8013d)
* [zillow data pipeline](https://www.zillow.com/data-science/airflow-at-zillow/)
* [airbnb data pipeline](https://medium.com/airbnb-engineering/https-medium-com-jonathan-parks-scaling-erf-23fd17c91166)
* [walmart data pipeline](https://medium.com/walmartlabs/how-we-built-a-data-pipeline-with-lambda-architecture-using-spark-spark-streaming-9d3b4b4555d3)
* [robinhood data pipeline](https://robinhood.engineering/why-robinhood-uses-airflow-aed13a9a90c8)
* [lyft data pipeline](https://eng.lyft.com/running-apache-airflow-at-lyft-6e53bb8fccff)
* [slack data pipeline](https://speakerdeck.com/vananth22/operating-data-pipeline-with-airflow-at-slack)
* [remind data pipeline](https://medium.com/@RemindEng/beyond-a-redshift-centric-data-model-1e5c2b542442)
* [wish data pipeline](https://medium.com/wish-engineering/scaling-analytics-at-wish-619eacb97d16)
* [databricks data pipeline](https://databricks.com/blog/2017/03/31/delivering-personalized-shopping-experience-apache-spark-databricks.html)

## airflow and luigi
<br>
### airflow
<br>
* **[apache airflow](https://github.com/apache/airflow)** was a tool **[developed by airbnb in 2014 and later open-sourced](https://medium.com/airbnb-engineering/airflow-a-workflow-management-platform-46318b977fd8)**
* it is a platform to programmatically author, schedule, and monitor workflows. when workflows are defined as code, they become more maintainable, versionable, testable, and collaborative
* you can use airflow to author workflows as directed acyclic graphs (DAGs) of tasks: the airflow scheduler executes your tasks on an array of workers while following the specified dependencies.
* here is **[a very simple toy example of an airflow job](https://gist.github.com/robert8138/c6e492d00cd7b7e7626670ba2ed32e6a)** that simply prints the date in bash every day after waiting for one second to pass, after the execution date is reached:
<br>
```python
from datetime import datetime, timedelta

from airflow.models import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.sensors import TimeDeltaSensor

default_args = {
    'owner': 'you',
    'depends_on_past': False,
    'start_date': datetime(2018, 1, 8),
}

dag = DAG(
    dag_id='anatomy_of_a_dag',
    description="This describes my DAG",
    default_args=default_args,
    schedule_interval=timedelta(days=1))  # this is a daily DAG

# t0 and t1 are examples of tasks created by instantiating operators
t0 = TimeDeltaSensor(
    task_id='wait_a_second',
    delta=timedelta(seconds=1),
    dag=dag)

t1 = BashOperator(
    task_id='print_date_in_bash',
    bash_command='date',
    dag=dag)

t1.set_upstream(t0)
```
<br>
---
### luigi
<br>
- **[luigi data pipelining](https://github.com/spotify/luigi)** is spotify's python module that helps you build complex pipelines of batch jobs. it handles dependency resolution, workflow management, visualization, etc.
- the basic units of Luigi are task classes that model an atomic ETL operation in three parts: a requirements part with pointers to other tasks that must run before this task, the data transformation step, and the output. all tasks can feed into a final table (e.g., on Redshift) or a single file.
- here is **[an example of a simple workflow in luigi](https://towardsdatascience.com/data-pipelines-luigi-airflow-everything-you-need-to-know-18dc741449b7)**:
<br>
```python
import luigi


class WritePipelineTask(luigi.Task):

    def output(self):
        return luigi.LocalTarget("data/output_one.txt")

    def run(self):
        with self.output().open("w") as output_file:
            output_file.write("pipeline")


class AddMyTask(luigi.Task):

    def output(self):
        return luigi.LocalTarget("data/output_two.txt")

    def requires(self):
        return WritePipelineTask()

    def run(self):
        with self.input().open("r") as input_file:
            line = input_file.read()
        with self.output().open("w") as output_file:
            decorated_line = "My " + line
            output_file.write(decorated_line)
```
<br>
----
### airflow vs. luigi
<br>
| | airflow | luigi |
|---------------------------------------|-----------------------|------------------------|
| web dashboard | very nice | minimal |
| Built-in scheduler | yes | no |
| Separates output data and task state | yes | no |
| calendar scheduling | yes | no, use cron |
| parallelism | yes, workers | threads per worker |
| finds new deployed tasks | yes | no |
| persists state | yes, to db | sort of |
| sync tasks to workers | yes | no |
| scheduling | yes | no |
<br>
---
### cool resources
<br>
* **[incubator airflow data pipelining](https://github.com/apache/incubator-airflow)**
* **[awesome airflow Resources](https://github.com/jghoman/awesome-apache-airflow)**
* **[airflow in kubernetes](https://github.com/rolanddb/airflow-on-kubernetes)**
* **[astronomer: airflow as a service](https://github.com/astronomer/astronomer)**

## the arrow project
<br>
* the [arrow project](https://arrow.apache.org/) is an open-source, cross-language columnar in-memory data representation designed to accelerate big data processing; it is a top-level project of the Apache Software Foundation
* arrow provides a standard for representing data in a columnar format that can be used across different programming languages and different computing platforms. this enables more efficient data exchange between different systems, as well as faster processing of data using modern hardware such as CPUs, GPUs, and FPGAs.
* one of the key benefits of Arrow is its memory-efficient design. because data is stored in a columnar format, it can be compressed more effectively than with traditional row-based storage methods. this can result in significant reductions in memory usage and faster processing times.
* arrow is also designed to be extensible, with support for a wide range of data types and operations. it supports many programming languages, including C++, Java, Python, and Rust, among others. Arrow also integrates with popular big data frameworks such as Apache Spark, Apache Kafka, and Apache Flink.
* arrow is a powerful tool for accelerating big data processing across different systems and programming languages. its columnar data format and memory-efficient design make it an attractive option for data-intensive applications that require fast and efficient data processing.

## google's or-tools
<br>
* the goal of optimization is to find the best solution to a problem out of a large set of possible solutions (or any feasible solution)
* all optimization problems have the following elements:
* the **objective**: the quantity you want to optimize. an optimal solution is one for which the value of the objective function is the best, i.e. max or min
* the **constraints**: restrictions on the set of possible solutions, based on the specific requirements of the problem. a feasible solution is one that satisfies all the given constraints for the problem, without necessarily being optimal
* **[google's or-tools](https://developers.google.com/optimization/introduction)** is an open-source software for combinatorial optimization, which seeks to find the best solution to a problem out of a very large set of possible solutions

## databases
<br>
* **[database overview](database_overview.pdf)**
* **[caching best practices](caching.pdf)**
* **[distributed caching](distributed_caching.pdf)**
* **[database loadbalancing](database_loadbalancing.pdf)**
* **[database partitioning](database_partitioning.pdf)**
* **[failure detection: heartbeats, pings, gossip](failure_detection.pdf)**

## 🪡 Protocols
<br>
#### What's a protocol
* a protocol is a system that allows two parties to communicate
* they are designed with a set of properties, depending on their purpose
<br>
#### Protocol design properties
* **data format**
- text based (plain text, JSON, XML)
- binary (protobuf, RESP, h2, h3)
* **transfer mode**
- message based (UDP, HTTP)
- stream (TCP, WebRTC)
* **addressing system**
- DNS name, IP, MAC
* **directionality**
- bidirectional (TCP)
- unidirectional (HTTP)
- full/half duplex
* **state**
* stateful
* stateless
* **routing**
* proxies, gateways
<br>
#### Why do you need a communication model?
* you want to build agnostic applications
* without a standard model, upgrading network equipments become difficult
* innovations can be done in each layer separately without affecting the rest of the models
* the OSI model is 7 layers, each describing a specific networking component
<br>
#### What's the OSI model?
* **layer 7**, application: HTTP, FTP, gRPC
* **layer 6**, presentation: encoding, serialization
* **layer 5**, session: connection establishment, TLS
* **layer 4**, transport: UDP, TCP
* **layer 3**, network: IP
* **layer 2**, data link: frames, mac address ethernet
* **layer 1**, physical: electric signals, fiber or radio waves
<br>
##### An example sending a POST request
* **layer 7:** POST request with JSON data to HTTP server
* **layer 6:** serialize JSON to flat byte strings
* **layer 5:** request to establish TCP connection/TLS
* **layer 4:** send SYN request target port 443
* **layer 3:** the SYN is placed in IP packet(s), adding the source/dest IPs
* **layer 2:** each packet goes into a single frame and adds the source/dest MAC addresses
* **layer 1:** each frame becomes a string of bits, which is converted into a radio signal (wifi), an electric signal (ethernet), or light (fiber)
<br>
---
### HTTP/1.1, 2, 3
<br>
* client examples: browsers, apps that make HTTP requests
* server examples: IIS, Apache Tomcat, Python Tornado, NodeJS
<br>
#### What's an HTTP request
* a method (GET, POST, etc.)
* a path (the URL)
* a protocol (HTTP/1.1, 2, 3 etc.)
* headers (key-values)
* body
<br>
#### HTTP/2
* developed by Google (originally as SPDY)
* supports header compression (HPACK); bodies can be compressed via content encoding
* multiplexing
* server push
* secure by default
* protocol negotiation during TLS (NPN/ALPN)
<br>
#### HTTP/3
<br>
* HTTP over QUIC and multiplexed streams over UDP
* merges connection setup + TLS in one handshake
* has congestion control at stream level
<br>
----
### WebSockets (ws://, wss://)
<br>
* bidirectional communications on the web
* use cases: chatting, live feed, multiplayer gaming, showing client progress/logging
* apps: twitch, whatsapp
* **pros**: full-duplex (no polling), http compatible, firewall friendly
* **cons**: proxying is tricky, layer 7 load balancing is challenging (timeouts), stateful and difficult to horizontally scale
* long polling and server-sent events might be better solutions
<br>
----
### gRPC
<br>
* built on top of HTTP/2 (hidden as an implementation detail), adding several features
* any communication protocol needs a client library for the language of choice, but with gRPC there is one client library for all of them
* message format is protocol buffers
* the gRPC modes are: unary, server streaming, client streaming, and bidirectional streaming
* **pros**: fast and compact, one client library, progress feedback (upload), cancel request (H2), H2/protobuf
* **cons**: schema, thick client (libraries have bugs), proxies, no native error handling, no native browser support, timeouts (pub/sub)
<br>
---
### WebRTC (web real-time communication)
<br>
* find a p2p path to exchange video and audio in an efficient and low-latency manner
* standardized API
* enables rich communication between browsers, mobile, and IoT devices
* **pros**: p2p is great (low latency for high bandwidth content), standardized api
* **cons**: maintaining STUN and TURN servers, p2p falls apart in case of multiple participants (e.g., discord)
<br>
#### WebRTC overview
1. A wants to connect to B
2. A finds out all possible ways the public can connect to it
3. B finds out all possible ways the public can connect to it
4. A and B signal this session information via other means (whatsapp, QR, tweet, etc.)
5. A connects to B via the most optimal path
6. A and B also exchange their supported media and security
<br>
-----
### Proxies
<br>
#### What's a proxy
* a server that makes requests on your behalf (you as a client)
* this means your TCP connection is established not with the final server but with the proxy
* in other words, the proxy operates at layer 4, and layer 7 content is forwarded untouched (with exceptions, e.g., when the proxy adds a header such as `X-Forwarded-For`)
* **uses**: caching, anonymity, logging, block sites, microservices
<br>
#### What's a reverse proxy
* the client does not know the "final destination" server: the URL it requests could be served by a reverse proxy that forwards the request to the underlying server
* **uses**: load balancing, caching, CDN, api gateway/ingress, canary deployment, microservices
<br>
#### Layer 4 vs. Layer 7 Load Balancers
<br>
* a load balancer (the building block of fault-tolerant systems) is a reverse proxy talking to many backends
* a **layer 4 load balancer** starts with several TCP connections and keeps them "warm"
  - when a user starts a connection, that connection has state: the LB chooses one server, and all segments for that connection go to that server through ONE connection (layer 4 is stateful)
  - the LB almost acts like a router
  - **pros**: simpler load balancing, efficient, more secure, works with any protocol, one TCP connection (NAT)
  - **cons**: no smart load balancing, not suitable for microservices, sticky per connection, no caching, protocol unaware (can be risky, bypasses rules)
* a **layer 7 load balancer** also starts with several warm TCP connections, but when a client connects, the connection becomes protocol specific
  - any logical request is buffered, parsed, and then forwarded to a new backend server
  - this could be one or more segments
  - certificates and private keys all need to live in the load balancer
  - **pros**: smart LB, caching, great for microservices, API gateway logic, authentication
  - **cons**: expensive (it inspects the data), decrypts (terminates TLS), two TCP connections, needs to buffer, needs to understand the protocol
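the content-aware routing that distinguishes L7 from L4 can be sketched as follows (toy backend names and path prefixes, chosen only for illustration):

```python
import itertools

# content-aware routing: the LB parses the request path (layer 7) before
# choosing a backend; a layer 4 LB only ever sees the connection
backends = {
    "/api": itertools.cycle(["api-1", "api-2"]),
    "/static": itertools.cycle(["cdn-1"]),
}

def route(path):
    for prefix, pool in backends.items():
        if path.startswith(prefix):
            return next(pool)  # round-robin within the matching pool
    return "default"

print([route("/api/tx"), route("/api/tx"), route("/static/app.js")])
# ['api-1', 'api-2', 'cdn-1']
```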