mirror of https://github.com/autistic-symposium/backend-and-orchestration-toolkit.git (synced 2025-06-08 06:53:00 -04:00)
merge files from the blockchain infra repo (#59)
This commit is contained in:
parent 23f56ef195 · commit 2a6449bb85
346 changed files with 29097 additions and 132 deletions
35 resources/data_engineering/README.md Normal file
@@ -0,0 +1,35 @@
## data engineering

<br>

### articles

* [machine learning system design](https://medium.com/@ricomeinl/machine-learning-system-design-f2f4018f2f8)
* [how to code neat ml pipelines](https://www.neuraxio.com/en/blog/neuraxle/2019/10/26/neat-machine-learning-pipelines.html)

<br>

### enterprise solutions

* [netflix data pipeline](https://medium.com/netflix-techblog/evolution-of-the-netflix-data-pipeline-da246ca36905)
* [netflix data videos](https://www.youtube.com/channel/UC00QATOrSH4K2uOljTnnaKw)
* [yelp data pipeline](https://engineeringblog.yelp.com/2016/07/billions-of-messages-a-day-yelps-real-time-data-pipeline.html)
* [gusto data pipeline](https://engineering.gusto.com/building-a-data-informed-culture/)
* [500px data pipeline](https://medium.com/@samson_hu/building-analytics-at-500px-92e9a7005c83)
* [twitter data pipeline](https://blog.twitter.com/engineering/en_us/topics/insights/2018/ml-workflows.html)
* [coursera data pipeline](https://medium.com/@zhaojunzhang/building-data-infrastructure-in-coursera-15441ebe18c2)
* [cloudflare data pipeline](https://blog.cloudflare.com/how-cloudflare-analyzes-1m-dns-queries-per-second/)
* [pandora data pipeline](https://engineering.pandora.com/apache-airflow-at-pandora-1d7a844d68ee)
* [heroku data pipeline](https://medium.com/@damesavram/running-airflow-on-heroku-ed1d28f8013d)
* [zillow data pipeline](https://www.zillow.com/data-science/airflow-at-zillow/)
* [airbnb data pipeline](https://medium.com/airbnb-engineering/https-medium-com-jonathan-parks-scaling-erf-23fd17c91166)
* [walmart data pipeline](https://medium.com/walmartlabs/how-we-built-a-data-pipeline-with-lambda-architecture-using-spark-spark-streaming-9d3b4b4555d3)
* [robinhood data pipeline](https://robinhood.engineering/why-robinhood-uses-airflow-aed13a9a90c8)
* [lyft data pipeline](https://eng.lyft.com/running-apache-airflow-at-lyft-6e53bb8fccff)
* [slack data pipeline](https://speakerdeck.com/vananth22/operating-data-pipeline-with-airflow-at-slack)
* [remind data pipeline](https://medium.com/@RemindEng/beyond-a-redshift-centric-data-model-1e5c2b542442)
* [wish data pipeline](https://medium.com/wish-engineering/scaling-analytics-at-wish-619eacb97d16)
* [databricks data pipeline](https://databricks.com/blog/2017/03/31/delivering-personalized-shopping-experience-apache-spark-databricks.html)
130 resources/data_engineering/airflow_and_luigi.md Normal file
@@ -0,0 +1,130 @@
## airflow and luigi

<br>

### airflow

<br>

* **[apache airflow](https://github.com/apache/airflow)** is a workflow tool **[developed by airbnb in 2014 and later open-sourced](https://medium.com/airbnb-engineering/airflow-a-workflow-management-platform-46318b977fd8)**

* it is a platform to programmatically author, schedule, and monitor workflows. when workflows are defined as code, they become more maintainable, versionable, testable, and collaborative

* you can use airflow to author workflows as directed acyclic graphs (DAGs) of tasks: the airflow scheduler executes your tasks on an array of workers while following the specified dependencies

* here is **[a very simple toy example of an airflow job](https://gist.github.com/robert8138/c6e492d00cd7b7e7626670ba2ed32e6a)** that waits one second after the execution date is reached and then prints the date in bash, once a day:

<br>
```python
from datetime import datetime, timedelta

from airflow.models import DAG  # Import the DAG class
from airflow.operators.bash_operator import BashOperator
from airflow.operators.sensors import TimeDeltaSensor

default_args = {
    'owner': 'you',
    'depends_on_past': False,
    'start_date': datetime(2018, 1, 8),
}

dag = DAG(
    dag_id='anatomy_of_a_dag',
    description="This describes my DAG",
    default_args=default_args,
    schedule_interval=timedelta(days=1))  # This is a daily DAG.

# t0 and t1 are examples of tasks created by instantiating operators
t0 = TimeDeltaSensor(
    task_id='wait_a_second',
    delta=timedelta(seconds=1),
    dag=dag)

t1 = BashOperator(
    task_id='print_date_in_bash',
    bash_command='date',
    dag=dag)

# t1 runs only after t0 has succeeded
t1.set_upstream(t0)
```
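as a usage note: with the airflow 1.x CLI that these import paths come from, a single task can be exercised outside the scheduler for a given execution date, e.g. `airflow test anatomy_of_a_dag print_date_in_bash 2018-01-08`.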
<br>

---

### luigi

<br>

- **[luigi](https://github.com/spotify/luigi)** is spotify's python module for building complex pipelines of batch jobs: it handles dependency resolution, workflow management, visualization, etc.

- the basic unit of luigi is the task class, which models an atomic ETL operation in three parts: a requirements part that points to other tasks that must run before this one, the data transformation step, and the output. chains of tasks can then be fed into a final destination, e.g. a table on Redshift or a single file

- here is **[an example of a simple workflow in luigi](https://towardsdatascience.com/data-pipelines-luigi-airflow-everything-you-need-to-know-18dc741449b7)**:

<br>
```python
import luigi


class WritePipelineTask(luigi.Task):
    # first task: writes the string "pipeline" to a local file

    def output(self):
        return luigi.LocalTarget("data/output_one.txt")

    def run(self):
        with self.output().open("w") as output_file:
            output_file.write("pipeline")


class AddMyTask(luigi.Task):
    # second task: requires the first one and decorates its output

    def output(self):
        return luigi.LocalTarget("data/output_two.txt")

    def requires(self):
        return WritePipelineTask()

    def run(self):
        with self.input().open("r") as input_file:
            line = input_file.read()

        with self.output().open("w") as output_file:
            decorated_line = "My " + line
            output_file.write(decorated_line)
```
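as a usage note: assuming the snippet above is saved in a module named `my_tasks.py` (a hypothetical name), the whole chain can be run locally with `python -m luigi --module my_tasks AddMyTask --local-scheduler`; luigi resolves `requires()` and runs `WritePipelineTask` first.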
<br>

---

### airflow vs. luigi

<br>
|                                       | airflow               | luigi                  |
|---------------------------------------|-----------------------|------------------------|
| web dashboard                         | very nice             | minimal                |
| built-in scheduler                    | yes                   | no                     |
| separates output data and task state  | yes                   | no                     |
| calendar scheduling                   | yes                   | no, use cron           |
| parallelism                           | yes, workers          | threads per worker     |
| finds new deployed tasks              | yes                   | no                     |
| persists state                        | yes, to db            | sort of                |
| sync tasks to workers                 | yes                   | no                     |
| scheduling                            | yes                   | no                     |
<br>

---

### cool resources

<br>

* **[incubator airflow data pipelining](https://github.com/apache/incubator-airflow)**
* **[awesome airflow resources](https://github.com/jghoman/awesome-apache-airflow)**
* **[airflow on kubernetes](https://github.com/rolanddb/airflow-on-kubernetes)**
* **[astronomer: airflow as a service](https://github.com/astronomer/astronomer)**
13 resources/data_engineering/arrow_project.md Normal file
@@ -0,0 +1,13 @@
## the arrow project

<br>

* the [arrow project](https://arrow.apache.org/) is an open-source, cross-language columnar in-memory data representation designed to accelerate big data processing. it is developed under the Apache Software Foundation, where it is a top-level project

* arrow provides a standard for representing data in a columnar format that can be used across different programming languages and computing platforms. this enables more efficient data exchange between systems, as well as faster processing on modern hardware such as CPUs, GPUs, and FPGAs

* one of the key benefits of arrow is its memory-efficient design. because data is stored in a columnar format, it can be compressed more effectively than with traditional row-based storage, which can yield significant reductions in memory usage and faster processing times

* arrow is also designed to be extensible, with support for a wide range of data types and operations. it supports many programming languages, including C++, Java, Python, and Rust, and integrates with popular big data frameworks such as Apache Spark, Apache Kafka, and Apache Flink

* in short, arrow is a powerful tool for accelerating big data processing across different systems and languages: its columnar format and memory-efficient design make it attractive for data-intensive applications (see the sketch below)
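<br>

a minimal sketch using the `pyarrow` python bindings (the column names and values below are made up for illustration):

```python
import pyarrow as pa

# build an in-memory columnar table from plain python lists;
# each key becomes a typed column rather than a row field
table = pa.table({
    "user_id": [1, 2, 3],
    "country": ["br", "us", "de"],
})

print(table.schema)             # inferred column names and types
print(table.num_rows)           # 3
print(table.column("country"))  # one contiguous (chunked) column
```

because each column is stored contiguously in memory, scans and aggregations over a single column touch far less data than a row-oriented layout would.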
12 resources/data_engineering/or_tools.md Normal file
@@ -0,0 +1,12 @@
## google's or-tools

<br>

* the goal of optimization is to find the best solution to a problem out of a large set of possible solutions (or, failing that, any feasible solution)

* all optimization problems have the following elements:
    * the **objective**: the quantity you want to optimize. an optimal solution is one for which the value of the objective function is the best, i.e., a maximum or a minimum
    * the **constraints**: restrictions on the set of possible solutions, based on the specific requirements of the problem. a feasible solution is one that satisfies all the given constraints without necessarily being optimal

* **[google's or-tools](https://developers.google.com/optimization/introduction)** is open-source software for combinatorial optimization, which seeks to find the best solution to a problem out of a very large set of possible solutions (see the sketch below)
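<br>

a minimal sketch of these elements with or-tools' GLOP linear solver (the variable bounds, constraints, and objective are toy values made up for illustration):

```python
from ortools.linear_solver import pywraplp

# create a solver backed by GLOP, or-tools' linear programming engine
solver = pywraplp.Solver.CreateSolver("GLOP")

# variables: the solution space is 0 <= x, y <= 10
x = solver.NumVar(0, 10, "x")
y = solver.NumVar(0, 10, "y")

# constraints: restrictions every feasible solution must satisfy
solver.Add(x + 2 * y <= 14)
solver.Add(3 * x - y >= 0)

# objective: the quantity to optimize (here, maximize 3x + 4y)
solver.Maximize(3 * x + 4 * y)

status = solver.Solve()
if status == pywraplp.Solver.OPTIMAL:
    print("x =", x.solution_value(), "y =", y.solution_value())
    print("objective =", solver.Objective().Value())
```

here any (x, y) pair satisfying both constraints is a feasible solution; the optimal one is the feasible pair that maximizes the objective.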