Invoice ETL Pipeline
An automated ETL pipeline that extracted and transformed invoice data from PDFs into structured inventory records, eliminating repetitive manual processing.
The problem
The client was manually processing invoice PDFs and transferring information between systems by hand. The process consumed hours every week, introduced avoidable human error, and scaled poorly as invoice volume increased.
Key decisions
Pandas for transformation and edge-case handling
The invoices contained inconsistent formatting and messy real-world data. Pandas provided enough flexibility to handle the cleaning, mapping, and transformation logic without introducing unnecessary tooling.
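As a rough illustration of the kind of cleaning and mapping step described here, the sketch below normalises a few messy rows with Pandas. It is not the client pipeline: the column names, the `COLUMN_MAP` mapping, the `clean_invoice_rows` helper, and the sample data are all hypothetical, and the rows are assumed to have already been extracted from the PDF as text.

```python
import pandas as pd

# Hypothetical raw rows, as they might look after PDF text extraction.
raw = pd.DataFrame({
    "Item Code": [" ab-100", "AB-200 ", "ab-100", None],
    "Qty": ["3", "1,200", "x", "5"],
    "Unit Price": ["$4.50", "12.00", "$7", "$1.25"],
})

# Assumed mapping from invoice column labels to inventory field names.
COLUMN_MAP = {"Item Code": "sku", "Qty": "quantity", "Unit Price": "unit_price"}


def clean_invoice_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Rename columns, normalise text, and coerce numeric fields."""
    out = df.rename(columns=COLUMN_MAP).copy()

    # Trim stray whitespace and normalise case on the identifier.
    out["sku"] = out["sku"].str.strip().str.upper()

    # Strip thousands separators and currency symbols, then parse.
    # errors="coerce" turns unparseable values into NaN instead of raising.
    out["quantity"] = pd.to_numeric(
        out["quantity"].str.replace(",", "", regex=False), errors="coerce"
    )
    out["unit_price"] = pd.to_numeric(
        out["unit_price"].str.replace(r"[$,]", "", regex=True), errors="coerce"
    )

    # Drop rows that are missing any required field after coercion.
    out = out.dropna(subset=["sku", "quantity", "unit_price"])
    out["quantity"] = out["quantity"].astype(int)
    return out.reset_index(drop=True)


records = clean_invoice_rows(raw)
print(records)
```

In this sketch the row with the unparseable quantity and the row with the missing item code are simply dropped. A production version would more likely set them aside for review than discard them silently.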
Simple architecture over unnecessary complexity
The problem did not require distributed infrastructure or enterprise-scale orchestration. A focused, maintainable Python pipeline with strong validation and logging was the correct engineering decision for the scale of the workload.
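To make "strong validation and logging" concrete, here is a minimal sketch using only the Python standard library. The `InventoryRecord` fields, the validation rules, the `split_valid` helper, and the logger name are assumptions chosen for illustration, not the checks the actual pipeline ran.

```python
import logging
from dataclasses import dataclass

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("invoice_etl")  # hypothetical logger name


@dataclass
class InventoryRecord:
    # Hypothetical fields for one transformed invoice line.
    sku: str
    quantity: int
    unit_price: float


def validate(record: InventoryRecord) -> list[str]:
    """Return a list of problems; an empty list means the record is valid."""
    problems = []
    if not record.sku.strip():
        problems.append("missing sku")
    if record.quantity <= 0:
        problems.append("quantity must be positive")
    if record.unit_price < 0:
        problems.append("unit_price must not be negative")
    return problems


def split_valid(records):
    """Separate valid records from rejected ones, logging each rejection."""
    accepted, rejected = [], []
    for record in records:
        problems = validate(record)
        if problems:
            log.warning("rejected %r: %s", record, "; ".join(problems))
            rejected.append(record)
        else:
            accepted.append(record)
    log.info("accepted=%d rejected=%d", len(accepted), len(rejected))
    return accepted, rejected


batch = [
    InventoryRecord("AB-100", 3, 4.5),
    InventoryRecord("", 2, 1.0),        # fails: missing sku
    InventoryRecord("AB-200", -1, 12.0),  # fails: non-positive quantity
]
accepted, rejected = split_valid(batch)
```

Keeping rejected records and logging the reason for each, rather than raising on the first failure, lets one bad invoice line be reviewed by hand without blocking the rest of the batch.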
Outcome
The automation cut a process that previously took roughly 10 hours per week to minutes, saving the client approximately $10,000 annually in labour costs while improving consistency and reliability.
What I learned
One of the most valuable engineering lessons is knowing when not to over-engineer. Reliable software that solves the actual operational problem is more valuable than technically impressive architecture that introduces unnecessary maintenance burden.