Motivation. The two biggest problems in designing data systems are scalability and reliability. Of the two, scalability can only be solved by architecture, whereas reliability can often be managed at the hardware/driver level. There are two main ways data is stored and processed:
1. Database Management System (DBMS)
Traditional databases; the classic case is a SQL database built on relational algebra.
- Types:
  - Relational: PostgreSQL, MySQL
  - Document/NoSQL: MongoDB
  - Vector: Pinecone, Faiss
- Scaling:
  - Vertical: add more CPU/RAM to a single machine
  - Horizontal:
    - Sharding: store a subset of the rows on each machine, partitioned by a shard key
    - Replication: one write primary (master), many read replicas
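The two horizontal scaling options above can be combined, and a minimal sketch may make the difference concrete. The class below is a hypothetical toy, not the API of any real database: `ShardedStore`, `shard_for`, and the in-memory dicts standing in for machines are all assumptions made for illustration. It routes each row to a shard by hashing its key, sends writes to that shard's primary, and serves reads from the shard's replicas.

```python
import hashlib


class ShardedStore:
    """Toy store: rows are split across shards, each shard has read replicas."""

    def __init__(self, num_shards, replicas_per_shard=2):
        # One dict per "machine". Each shard has one primary and N replicas.
        self.primaries = [{} for _ in range(num_shards)]
        self.replicas = [
            [{} for _ in range(replicas_per_shard)] for _ in range(num_shards)
        ]
        self._next_replica = [0] * num_shards

    def shard_for(self, key):
        # Stable hash, so the same shard key always maps to the same shard.
        digest = hashlib.sha256(str(key).encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self.primaries)

    def write(self, key, row):
        shard = self.shard_for(key)
        self.primaries[shard][key] = row
        # Assumption: replicas are updated synchronously to keep the sketch
        # simple. Real systems often replicate asynchronously.
        for replica in self.replicas[shard]:
            replica[key] = row

    def read(self, key):
        shard = self.shard_for(key)
        # Round-robin over the shard's replicas to spread the read load.
        i = self._next_replica[shard]
        self._next_replica[shard] = (i + 1) % len(self.replicas[shard])
        return self.replicas[shard][i].get(key)


store = ShardedStore(num_shards=3)
for user_id in range(10):
    store.write(user_id, {"id": user_id, "name": f"user-{user_id}"})
```

Each row lives on exactly one shard, so adding shards spreads both storage and write load, while adding replicas only increases read capacity. A modulo-based mapping like this one reshuffles most keys when the shard count changes, which is one reason production systems tend to use schemes such as consistent hashing or range partitioning instead.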
2. File + Distributed File System (DFS)
Create a DFS cluster and store the data in specialized row-oriented or columnar file formats. Processing is distributed across the nodes of the cluster.
- Types:
  - MapReduce: Hadoop with Avro (row-oriented file format)
  - DAG execution: Spark with Parquet (columnar file format)
- DFS Tech: Apache HDFS
- Scaling:
  - Horizontal Scaling: Add more nodes to the cluster
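To illustrate the ideas in this item, here is a small pure-Python sketch. It does not use Hadoop, Spark, Avro, or Parquet; the sample `rows` and the helper names (`to_columns`, `partition`, `map_phase`, `reduce_phase`, `total_by_country`) are hypothetical and chosen only for illustration. It shows the same records in a row layout and a column layout, then runs an aggregation loosely in the MapReduce style over simulated nodes.

```python
from collections import defaultdict
from functools import reduce

# Row layout: one complete record after another (the idea behind Avro).
rows = [
    {"user": "a", "country": "US", "amount": 10},
    {"user": "b", "country": "DE", "amount": 5},
    {"user": "c", "country": "US", "amount": 7},
    {"user": "d", "country": "DE", "amount": 1},
]


def to_columns(records):
    """Column layout: one list per field (the idea behind Parquet)."""
    columns = defaultdict(list)
    for record in records:
        for name, value in record.items():
            columns[name].append(value)
    return dict(columns)


def partition(records, num_nodes):
    """Split the records into one chunk per simulated node."""
    return [records[i::num_nodes] for i in range(num_nodes)]


def map_phase(chunk):
    """Runs independently per node: partial sum of amount per country."""
    partial = defaultdict(int)
    for record in chunk:
        partial[record["country"]] += record["amount"]
    return dict(partial)


def reduce_phase(left, right):
    """Merge two partial results into one."""
    merged = dict(left)
    for country, amount in right.items():
        merged[country] = merged.get(country, 0) + amount
    return merged


def total_by_country(records, num_nodes):
    partials = [map_phase(chunk) for chunk in partition(records, num_nodes)]
    return reduce(reduce_phase, partials, {})


columns = to_columns(rows)
print(sum(columns["amount"]))      # reads a single column only
print(total_by_country(rows, 2))   # same totals for any number of nodes
```

The column layout lets a query such as the sum of `amount` touch only one field, which is why columnar formats suit analytical scans, while the row layout keeps each record together for whole-record reads and writes. Because the map phase works on each chunk independently, adding nodes only changes how the data is split, not the result.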