PK's Notes


Data Architecture


#Computing/System-Design

def. Extract, Transform, Load (ETL) is the standard workflow for processing data:

E: pull data from source system
T: clean, standardize, merge, reshape, process data
L: store data into target system

Motivation. The biggest problems in designing data systems are scalability and reliability. Of the two, scalability can only be solved through architecture, whereas reliability can be managed at the HW/driver level. There are two main ways data is stored and processed:
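
As a quick aside before the two approaches, the E/T/L steps can be sketched in Python. This is a minimal sketch, not a production pipeline: the CSV sample, the `payments` table, and the function names are all made up for illustration, and an in-memory SQLite database stands in for the target system.

```python
import csv
import io
import sqlite3

# Hypothetical source data; a real pipeline would pull this from a source system.
RAW_CSV = """id,name,amount
1, alice ,10.5
2,BOB,
3,Carol,7
"""

def extract(text):
    """E: pull raw rows from the source (here, an in-memory CSV string)."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """T: clean and standardize; rows with a missing amount are dropped."""
    cleaned = []
    for row in rows:
        if not row["amount"].strip():
            continue
        cleaned.append(
            (int(row["id"]), row["name"].strip().title(), float(row["amount"]))
        )
    return cleaned

def load(rows, conn):
    """L: store the cleaned rows in the target system (a SQLite table)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS payments "
        "(id INTEGER PRIMARY KEY, name TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO payments VALUES (?, ?, ?)", rows)
    conn.commit()

def run_etl(text=RAW_CSV):
    """Run E, T, and L in order and return what ended up in the target table."""
    conn = sqlite3.connect(":memory:")
    load(transform(extract(text)), conn)
    return conn.execute(
        "SELECT id, name, amount FROM payments ORDER BY id"
    ).fetchall()
```

Calling `run_etl()` returns only the rows that survived the transform step; the row with the empty amount never reaches the target table.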

1. Database Management System (DBMS)

Traditionally SQL databases built on relational algebra; the category now also covers NoSQL and vector stores.

Types:
- Relational: PostgreSQL, MySQL
- Document/NoSQL: MongoDB
- Vector: Pinecone, Faiss (a similarity-search library rather than a full DBMS)
Scaling:
- Vertical: Add more CPU/RAM to machine
- Horizontal
  1. Sharding: store a subset of the rows on each machine, partitioned by a shard key
  2. Replication: One write master, many read replicas
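
The two horizontal strategies above can be sketched with plain dictionaries standing in for machines. This is a toy model under stated assumptions: `ShardedStore` and `ReplicatedStore` are hypothetical names, the shard is chosen by hashing the key, and replication is done synchronously, whereas real databases often replicate asynchronously.

```python
import hashlib

class ShardedStore:
    """Sharding: each 'machine' (a dict) holds only a subset of the rows."""

    def __init__(self, num_shards):
        self.shards = [{} for _ in range(num_shards)]

    def _shard_for(self, key):
        # A stable hash of the shard key decides which machine owns the row.
        digest = hashlib.sha256(str(key).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.shards)

    def put(self, key, row):
        self.shards[self._shard_for(key)][key] = row

    def get(self, key):
        return self.shards[self._shard_for(key)].get(key)


class ReplicatedStore:
    """Replication: one write primary, many read replicas with full copies."""

    def __init__(self, num_replicas):
        self.primary = {}
        self.replicas = [{} for _ in range(num_replicas)]
        self._next = 0

    def write(self, key, row):
        # All writes go to the primary, then get copied to every replica.
        self.primary[key] = row
        for replica in self.replicas:
            replica[key] = row

    def read(self, key):
        # Reads are spread across the replicas in round-robin order.
        replica = self.replicas[self._next % len(self.replicas)]
        self._next += 1
        return replica.get(key)
```

Sharding spreads both data and write load, while replication only spreads read load, since every replica still holds the full dataset.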

2. File + Distributed File System (DFS)

Create a DFS cluster, and store data in specialized row-oriented or columnar file formats. Processing is distributed across the nodes of the cluster.
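
To make the row vs. columnar distinction concrete, here is a small sketch in plain Python. The sample records and function names are made up, and every record is assumed to have the same fields. It only illustrates the layout idea, not how Avro or Parquet actually encode bytes.

```python
# Row-oriented layout: each record is stored together (the Avro-style idea).
ROWS = [
    {"user": "a", "country": "JP", "spend": 12.0},
    {"user": "b", "country": "US", "spend": 3.5},
    {"user": "c", "country": "JP", "spend": 8.25},
]

def to_columns(rows):
    """Pivot records into a columnar layout: one list per field."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

def total_spend_row_store(rows):
    # Has to walk every full record even though only one field is needed.
    return sum(row["spend"] for row in rows)

def total_spend_column_store(columns):
    # Reads a single column and never touches the other fields.
    return sum(columns["spend"])
```

Row layouts suit workloads that read or write whole records; columnar layouts suit analytical scans that aggregate a few fields over many records.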

Types:
- MapReduce: Hadoop with Avro (row store file format)
- DAG execution: Spark with Parquet (column store file format)
- DFS Tech: HDFS (Hadoop Distributed File System)
Scaling:
- Horizontal Scaling: Add more nodes to the cluster
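
The MapReduce model listed above can be sketched on a single machine. This is only an illustration of the map, shuffle, and reduce phases using word count; the function names are hypothetical and nothing here uses the Hadoop API. Each inner list plays the role of a file block living on a different node.

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)

def word_count(partitions):
    # Each partition would be mapped on its own node in a real cluster;
    # here the partitions are simply processed one after another.
    mapped = chain.from_iterable(
        map_phase(line) for part in partitions for line in part
    )
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

Because the map step looks at each line independently and the reduce step looks at each key independently, adding nodes lets more partitions and more keys be processed in parallel, which is what makes horizontal scaling work here.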



Created with Quartz v4.1.4, © 2025
