/projects$ cat elt-data-platform.html
ELT Data Platform
SLTC · final-year research · 2025
the problem
Nine years of daily central-bank reports — locked inside unstructured PDFs. Rich economic data, completely unqueryable. The research question: can you build a platform that turns that archive into clean, structured, analyzable data, automatically and continuously?
what i built
An automated data platform orchestrated by Apache Airflow:
- Ingest — pull 9 years of daily PDF reports into a MinIO data lake.
- Extract — use the Gemini APIs to read each report and emit structured JSON.
- Load — land the structured records in PostgreSQL, ready to query.
The whole thing runs as an orchestrated DAG, so it's repeatable and extensible — point it at new reports and it keeps going.
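A minimal sketch of the ingest, extract, load flow described above, not the project's actual code. Everything here is a labeled stand-in so the example runs with only the Python standard library: a dict plays the MinIO bucket, a stub replaces the Gemini call, in-memory SQLite replaces PostgreSQL, and `graphlib.TopologicalSorter` stands in for Airflow's dependency ordering. All function, table, and key names are hypothetical.

```python
import json
import sqlite3
from graphlib import TopologicalSorter

# Stand-ins (assumptions): a dict for the MinIO bucket, SQLite for PostgreSQL.
lake = {}
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE IF NOT EXISTS reports ("
    "report_date TEXT PRIMARY KEY, payload TEXT NOT NULL)"
)


def ingest(report_date, pdf_bytes):
    """Store the raw PDF under a date-based key; a re-run overwrites the same key."""
    key = f"raw/{report_date}.pdf"
    lake[key] = pdf_bytes
    return key


def extract(key):
    """Stub for the model call: the real step would send the PDF to Gemini
    and parse the structured JSON it returns."""
    report_date = key.removeprefix("raw/").removesuffix(".pdf")
    return {"report_date": report_date, "size_bytes": len(lake[key])}


def load(record):
    """Upsert by report date so re-running a day does not duplicate rows."""
    db.execute(
        "INSERT INTO reports (report_date, payload) VALUES (?, ?) "
        "ON CONFLICT(report_date) DO UPDATE SET payload = excluded.payload",
        (record["report_date"], json.dumps(record)),
    )
    db.commit()


def run_pipeline(report_date, pdf_bytes):
    """Run the three stages in dependency order and return that order."""
    results = {}
    tasks = {
        "ingest": lambda: ingest(report_date, pdf_bytes),
        "extract": lambda: extract(results["ingest"]),
        "load": lambda: load(results["extract"]),
    }
    # Each task maps to the set of tasks it depends on.
    graph = {"ingest": set(), "extract": {"ingest"}, "load": {"extract"}}
    order = list(TopologicalSorter(graph).static_order())
    for name in order:
        results[name] = tasks[name]()
    return order


if __name__ == "__main__":
    run_pipeline("2025-01-02", b"%PDF-1.4 placeholder bytes")
    print(db.execute("SELECT report_date, payload FROM reports").fetchall())
```

Keying both the lake object and the database row on the report date is one way to get the repeatability mentioned above: running the same day twice overwrites rather than duplicates. In an Airflow deployment each function would presumably become its own task, with the scheduler handling retries and backfills.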