Open Source Data and Analytics Architecture

Technology
Planning
Author

Chris Kornaros

Published

December 12, 2024

Introduction

I will update this page when I begin the project. The goal is to explore and build a tech stack that supports modern data and analytics workloads using entirely open source software. Ideally, I’ll be able to scale it to terabytes of data and then share the template and guide as a public resource.

Currently, I’m considering the following tools as a non-exhaustive sketch of the stack:

OS/Environment: zsh/bash
Project and Package Management: uv
Collaboration and Source Control: GitHub
Documentation: Quarto
Data Modeling: dbt
Containerization: Docker
Container Orchestration: Kubernetes
OLTP Database: PostgreSQL
OLAP Database: DuckDB
Batch Ingestion: Python
ETL: dbt
Testing: pytest
Data Quality: Great Expectations
Metadata: Unity Catalog
ETL Orchestration: Airflow and/or Dagster
Streaming Ingestion: Kafka

General workflow I’m envisioning:

  1. Initialize project with uv, add basic dependencies for the environment
  2. Create the repo with the GitHub CLI
  3. Set the remote as the upstream and do the initial commit
  4. Initialize the Quarto and dbt projects as subdirectories of the main uv project directory
  5. Create the Postgres container with Docker and use it to initialize the Postgres database (Prod)
  6. In your uv environment, initialize the persistent DuckDB (Dev/Test) database
    • DuckDB is simpler to work with quickly; Postgres has more configuration/overhead but is better for long-term persistence
  7. Use Python and DuckDB to ingest the initial batch of raw data (see the sketch after this list)
  8. Use dbt to define the data model, pytest to define the basic tests, and Great Expectations to define data quality checks (a pytest sketch follows the list)
  9. Initialize the Unity Catalog instance and add the connection information (Dev/Test/Prod)
  10. Generate metadata and lineage
  11. Start scheduling and orchestrating jobs (an Airflow sketch follows the list)
  12. Potentially scale the system up to handle streaming data (a Kafka consumer sketch follows the list)
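
To make steps 6 and 7 concrete, here’s a minimal sketch of initializing a persistent DuckDB database and loading a batch of raw files. The database name, raw data directory, and table naming are placeholders I’m assuming for illustration; the actual layout will depend on the project.

```python
# Minimal sketch of steps 6-7: create a persistent DuckDB database (Dev/Test)
# and ingest a batch of raw CSV files. Paths and table names are placeholders.
from pathlib import Path

import duckdb

# Connecting to a file path creates the database if it doesn't exist yet.
con = duckdb.connect("dev.duckdb")

RAW_DIR = Path("data/raw")  # assumed location for raw batch files

# Create a schema for raw data and load every CSV in the batch.
con.execute("CREATE SCHEMA IF NOT EXISTS raw")
for csv_file in sorted(RAW_DIR.glob("*.csv")):
    table = f"raw.{csv_file.stem}"
    con.execute(
        f"CREATE OR REPLACE TABLE {table} AS "
        f"SELECT * FROM read_csv_auto('{csv_file.as_posix()}')"
    )
    rows = con.execute(f"SELECT count(*) FROM {table}").fetchone()[0]
    print(f"Loaded {rows} rows into {table}")

con.close()
```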
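
For the testing side of step 8, a basic pytest check against the Dev DuckDB file might look like the following. The table and column names (raw.orders, order_id) are hypothetical; real tests would target whatever dbt actually builds, and Great Expectations would layer richer expectations on top.

```python
# Sketch of step 8's pytest layer: basic data checks against the Dev DuckDB file.
# The table and column names here are hypothetical placeholders.
import duckdb
import pytest


@pytest.fixture(scope="module")
def con():
    connection = duckdb.connect("dev.duckdb", read_only=True)
    yield connection
    connection.close()


def test_orders_not_empty(con):
    # The raw batch load should have produced at least one row.
    count = con.execute("SELECT count(*) FROM raw.orders").fetchone()[0]
    assert count > 0


def test_order_id_has_no_nulls(con):
    # A primary-key style column should never be NULL.
    nulls = con.execute(
        "SELECT count(*) FROM raw.orders WHERE order_id IS NULL"
    ).fetchone()[0]
    assert nulls == 0
```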
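
For step 11, if Airflow ends up being the orchestrator, a daily pipeline could be sketched roughly like this. The script path, dbt project directory, and schedule are assumptions; Dagster would express the same idea as assets instead of tasks.

```python
# Rough Airflow sketch for step 11: run the batch ingestion script, then dbt.
# The paths, schedule, and dag_id are illustrative assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_batch_pipeline",
    start_date=datetime(2024, 12, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    ingest_raw = BashOperator(
        task_id="ingest_raw",
        bash_command="uv run python scripts/ingest_raw.py",  # hypothetical script
    )

    dbt_build = BashOperator(
        task_id="dbt_build",
        bash_command="dbt build --project-dir dbt_project",  # hypothetical project dir
    )

    ingest_raw >> dbt_build
```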
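
And for step 12, a first pass at streaming ingestion could be as simple as a Kafka consumer that appends events into DuckDB. This sketch assumes the kafka-python client, a local broker, and a hypothetical raw_events topic; at real scale the stream would more likely land in Postgres or object storage first.

```python
# Very rough sketch of step 12: consume JSON events from Kafka and append them
# to DuckDB. The topic name, broker address, and table are assumptions.
import json

import duckdb
from kafka import KafkaConsumer  # kafka-python client

consumer = KafkaConsumer(
    "raw_events",  # hypothetical topic
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

con = duckdb.connect("dev.duckdb")
con.execute("CREATE SCHEMA IF NOT EXISTS raw")
con.execute(
    "CREATE TABLE IF NOT EXISTS raw.events ("
    "received_at TIMESTAMP DEFAULT current_timestamp, payload VARCHAR)"
)

for message in consumer:
    # Store each event as its raw JSON payload; modeling happens later in dbt.
    con.execute(
        "INSERT INTO raw.events (payload) VALUES (?)",
        [json.dumps(message.value)],
    )
```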