Motivation. The two biggest problems in designing data systems are scalability and reliability. Of the two, scalability can only be solved through architecture, whereas reliability can often be managed at the hardware/driver level. There are two main ways data is stored and processed:

1. Database Management System (DBMS)

Traditionally, SQL databases built on relational algebra; the category now also covers non-relational stores.

  • Types:
    • Relational: PostgreSQL, MySQL
    • Document/NoSQL: MongoDB
    • Vector: Pinecone, Faiss (a similarity-search library rather than a full DBMS)
  • Scaling:
    • Vertical: Add more CPU/RAM to a single machine
    • Horizontal
      1. Sharding: Partition rows across machines by a shard key, so each machine stores a subset of the rows
      2. Replication: One primary (master) accepts writes, many replicas serve reads
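
The two horizontal strategies can be sketched in a few lines of Python. This is a minimal in-memory illustration, not how any particular DBMS implements them: `shard_for`, `ShardedStore`, and `ReplicatedStore` are hypothetical names, plain dicts stand in for machines, hash-based placement is one assumed sharding scheme among several (range-based is another), and replication is done synchronously for simplicity.

```python
import hashlib
import itertools


def shard_for(key: str, num_shards: int) -> int:
    """Map a shard key to a shard index using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedStore:
    """Each shard (a dict standing in for a machine) holds a subset of rows."""

    def __init__(self, num_shards: int):
        self.shards = [dict() for _ in range(num_shards)]

    def put(self, key: str, row: dict) -> None:
        self.shards[shard_for(key, len(self.shards))][key] = row

    def get(self, key: str):
        return self.shards[shard_for(key, len(self.shards))].get(key)


class ReplicatedStore:
    """One primary takes writes; reads rotate across the replicas."""

    def __init__(self, num_replicas: int):
        self.primary = {}
        self.replicas = [dict() for _ in range(num_replicas)]
        self._next_replica = itertools.cycle(range(num_replicas))

    def write(self, key: str, row) -> None:
        self.primary[key] = row
        # Assumption: synchronous copy; real systems often replicate asynchronously.
        for replica in self.replicas:
            replica[key] = row

    def read(self, key: str):
        return self.replicas[next(self._next_replica)].get(key)
```

Sharding spreads both data and write load, since each row lives on exactly one shard. Replication keeps a full copy on every replica, so it scales reads but not writes or storage.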

2. File + Distributed File System (DFS)

Create a DFS cluster and store data as files in specialized row-oriented or columnar formats. Processing is distributed across the nodes of the cluster.

  • Types:
    • MapReduce: Hadoop with Avro (row store file format)
    • DAG execution: Spark with Parquet (column store file format)
    • DFS Tech: HDFS (Hadoop Distributed File System, part of Apache Hadoop)
  • Scaling:
    • Horizontal Scaling: Add more nodes to the cluster
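
The MapReduce processing model mentioned above can be illustrated with a single-process word count. This is a sketch of the map, shuffle, and reduce phases only, not Hadoop's API: the function names are hypothetical, and each inner list stands in for a file partition that would live on a different node.

```python
from collections import defaultdict


def map_phase(partition):
    """Map: emit (word, 1) for every word in one partition of lines."""
    return [(word.lower(), 1) for line in partition for word in line.split()]


def shuffle(mapped_outputs):
    """Shuffle: group all emitted values by key across partitions."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values for each key into a single result."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(partitions):
    # In a real cluster each map_phase call would run on the node
    # that holds the partition; here they run sequentially.
    mapped = [map_phase(p) for p in partitions]
    return reduce_phase(shuffle(mapped))
```

Because each map call only sees its own partition, adding nodes lets more partitions be processed in parallel, which is where the horizontal scaling comes from.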