What I Learned From Building Two Data Pipelines

2025-05-09

Over the past year, I worked on two very different data pipelines — one as a Data Engineering Intern at Spark Amplify, and the other on a side project processing messy data from electricity grid providers. Each pipeline came with its own headaches, surprises, and lessons, but together they taught me a lot about working with data.

Contributing to Spark Amplify

At Spark Amplify, my work involved transforming literary texts for different reading levels. The process started with raw text, which was revised by large language models (LLMs) and then run through several validation steps built on NLP techniques such as BERT similarity and word count alignment. The pipeline was designed for production use and scalability, so entire documents could be processed quickly. Building a multi-stage pipeline that revised and validated over a hundred texts forced me to focus on resilience. To assess the quality of the generated content, I established measurable metrics, and at each stage of the pipeline those metrics had to stay above predefined quality thresholds. That requirement drove home the importance of automated validation: we used BERT embeddings to measure semantic similarity, ensuring the revised text retained its original meaning, and monitored word counts to catch significant changes in length. The result was a system capable of intelligent text revision that also flagged instances where the model strayed from the intended output.
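
To give a concrete (if simplified) picture, here's a minimal sketch of a BERT-style similarity check using the sentence-transformers library; the model name and the threshold are stand-ins for illustration, not the exact production setup:

    # Minimal sketch: compare original and revised text with BERT-style embeddings.
    # The model name and the 0.85 threshold are illustrative assumptions.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model, not the production one

    original_text = "Huck slipped out of the cabin before dawn."
    revised_text = "Huck quietly left the cabin before sunrise."

    embeddings = model.encode([original_text, revised_text], convert_to_tensor=True)
    similarity = util.cos_sim(embeddings[0], embeddings[1]).item()

    if similarity < 0.85:  # hypothetical quality threshold
        raise ValueError("Revision drifted too far from the original meaning")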

Building the Grid Info Tracker

The second data pipeline, the Grid Info Tracker, presented significantly more challenges because of the nature of the data. In collaboration with another engineer, I worked on tracking energy infrastructure data, which involved ingesting raw datasets from sources like ERCOT. These datasets were released monthly but lacked consistency in several key areas: file naming conventions varied, data schemas changed frequently, and occasionally the data we needed arrived as scanned legal agreements in PDF format. My work on this pipeline focused on two main projects: extracting dates from scanned legal agreements that lacked any uniform formatting, and merging approximately three years' worth of data from disparate monthly reports, a task that required careful attention to data quality and consistency across the different sources.

Lessons Learned

Working with literary works

The Spark Amplify project was especially engaging because the input data, mostly well-structured text, was already clean. This allowed us to focus on the quality of the output, which was critical since we were working with educational materials. One key takeaway was the importance of having checks and balances when using AI to revise content. It wasn't enough for the rewritten text to read well. It also had to preserve the original meaning and stay appropriate for the intended reading level. For example, when revising excerpts from Huckleberry Finn, we had to ensure that character traits, clothing, and appearances remained consistent. Huck couldn’t suddenly wear something he never would, or speak in a tone that didn’t fit the original voice. To manage this, we created automated evaluation methods. For example, we used BERT similarity scores to ensure the meaning stayed intact and tracked word counts to avoid major changes in length.
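
To make "checks and balances" concrete, here's a tiny sketch of the kind of acceptance gate that might sit at the end of a revision step; the thresholds here are illustrative assumptions, not the values we actually used:

    # Illustrative acceptance gate: a revision passes only if it stays close in
    # meaning and in length. The thresholds are hypothetical.
    def accept_revision(original: str, revised: str, similarity: float,
                        min_similarity: float = 0.85,
                        max_length_drift: float = 0.20) -> bool:
        original_words = len(original.split())
        revised_words = len(revised.split())
        length_drift = abs(revised_words - original_words) / max(original_words, 1)
        return similarity >= min_similarity and length_drift <= max_length_drift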

Overall, this project helped me understand how nuanced NLP evaluation can be. Language is not binary, and success isn't always clear-cut. We had to define what "good" meant in measurable terms. To guide this process, I read research papers like Malik et al. (2024) and Reimers and Gurevych (2019), which helped me design more reliable evaluation systems. The experience reinforced a core principle: AI won't always get it right on its own, so we need systems in place to catch the AI's mistakes and help it iterate on its work. We always need a human in the loop.

Working with Scanned Legal Documents

One part of the project focused on extracting legal enforcement dates from scanned legal agreements across various U.S. courts. Because these documents were scans rather than digital PDFs, basic text extraction wasn't an option—I had to rely on OCR. I tried several OCR tools, including Tesseract and a few cloud APIs. In the end, the most reliable setup involved pdfplumber for handling the initial PDF extraction and the Python Imaging Library (PIL) for pre-processing the scans—sharpening, deskewing, and converting to suitable formats. It wasn't plug-and-play. It took a lot of trial and error to get consistent results, especially given how poor the scan quality was in many cases. There's no universal OCR solution. I manually annotated about 20 documents to benchmark accuracy, then ran various OCR methods to see if they could extract the correct enforcement dates. Dates are fragile: if even one digit is wrong, the output becomes useless. That meant I needed an OCR setup that could deliver near-perfect precision, not just decent accuracy. The pdfplumber + PIL combination turned out to be the most robust.
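
For a rough idea of what that setup looked like, here's a simplified sketch of rendering a scanned page with pdfplumber and cleaning it up with PIL before OCR; the file name, the specific preprocessing steps, and the pytesseract call are illustrative stand-ins rather than the exact production configuration:

    # Simplified sketch: render a scanned PDF page with pdfplumber, clean it up
    # with PIL, then hand it to an OCR engine. The file path, resolution, and
    # the pytesseract call are illustrative assumptions.
    import pdfplumber
    import pytesseract
    from PIL import ImageFilter, ImageOps

    with pdfplumber.open("agreement.pdf") as pdf:  # hypothetical file
        page_image = pdf.pages[0].to_image(resolution=300).original  # PIL Image

    cleaned = ImageOps.grayscale(page_image)        # drop color noise
    cleaned = ImageOps.autocontrast(cleaned)        # boost faded scans
    cleaned = cleaned.filter(ImageFilter.SHARPEN)   # sharpen blurry glyphs

    text = pytesseract.image_to_string(cleaned)
    print(text[:500])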

Once OCR was sorted, I turned to OpenAI’s LLMs to identify the correct legal enforcement date from the text. These documents often stretched into the hundreds of pages, so I focused only on the first and last 10 pages—where I had noticed, through manual inspection, that the relevant date usually appeared.
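
A rough sketch of that step, with the model name, prompt wording, and page window as assumptions for illustration rather than the actual implementation, might look like this:

    # Illustrative sketch: send only the first and last 10 pages of OCR'd text
    # to an LLM and ask for the legally enforceable date. The model name and
    # prompt are assumptions, not the production values.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def extract_enforcement_date(pages: list[str], window: int = 10) -> str:
        excerpt = "\n".join(pages[:window] + pages[-window:])
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # assumed model
            messages=[
                {"role": "system",
                 "content": "You extract the legal enforcement date from contract text. "
                            "Ignore signing and amendment dates. Reply with YYYY-MM-DD only."},
                {"role": "user", "content": excerpt},
            ],
        )
        return response.choices[0].message.content.strip()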

Why use an LLM at all? Two reasons:

  1. The documents weren’t standardized. Each had its own structure and quirks, so rule-based extraction failed fast.
  2. There were often multiple dates mentioned: signing dates, amendment dates, effective dates. The model needed to understand context to correctly pick the legally enforceable one.

Using AI made sense here. But it wasn’t an automatic decision. I weighed the cost, reliability, and complexity tradeoffs. Eventually, we migrated to Google's OCR and text-processing stack. Because the codebase was modular, this switch was easy. Still, the core lessons remained: preprocessing matters, OCR quality is everything, and AI is most useful when the rules are too fuzzy for traditional logic.
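
To illustrate what that modularity looked like in spirit, here's a simplified sketch of a swappable OCR backend; the interface and both implementations are assumptions about the structure, not the actual code:

    # Rough sketch of a swappable OCR backend. The Protocol and both
    # implementations are simplified assumptions about the structure.
    from typing import Protocol
    from PIL import Image

    class OcrBackend(Protocol):
        def extract_text(self, image: Image.Image) -> str: ...

    class TesseractBackend:
        def extract_text(self, image: Image.Image) -> str:
            import pytesseract
            return pytesseract.image_to_string(image)

    class GoogleVisionBackend:
        def extract_text(self, image: Image.Image) -> str:
            import io
            from google.cloud import vision
            buf = io.BytesIO()
            image.save(buf, format="PNG")
            client = vision.ImageAnnotatorClient()
            response = client.document_text_detection(image=vision.Image(content=buf.getvalue()))
            return response.full_text_annotation.text

    def run_pipeline(image: Image.Image, ocr: OcrBackend) -> str:
        # The rest of the pipeline depends only on the OcrBackend interface,
        # so swapping Tesseract for Google's OCR is a one-line change.
        return ocr.extract_text(image)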

ERCOT data

Another major area I worked on involved integrating ERCOT (Electric Reliability Council of Texas) datasets, especially merging GIS data releases into a consistent, clean, and queryable format. On paper, this sounded like a mechanical task: grab the monthly CSVs, stack them up, and call it a day. In reality, it took time to get right, and there were plenty of places to go wrong.

The raw data wasn’t always well structured. Each month had its own quirks: column names changed slightly, extra rows appeared, battery-related fields would be null in one month and populated in the next, and authors would include comments in some rows while skipping them in the next file. Another complication was that we needed several scripts doing similar jobs. First, two scripts populated the BigQuery datasets for the GIS and Battery files on the initial run. Then more scripts scraped the new data each month and merged it with the existing datasets. Some of this had to be written in dbt, which I had never used before. It was a great opportunity to put my pandas skills to the test, push some SQL code, and learn dbt.
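
To give a flavor of what the merge scripts had to handle, here's a simplified pandas sketch; the column aliases and file paths are made-up examples, not the real schema:

    # Simplified sketch of merging monthly GIS releases whose schemas drift.
    # Column aliases and file paths are hypothetical examples.
    import glob
    import pandas as pd

    # Map the column-name variants seen across months onto one canonical schema.
    COLUMN_ALIASES = {
        "Project Name": "project_name",
        "PROJECT_NAME": "project_name",
        "Capacity (MW)": "capacity_mw",
        "Capacity MW": "capacity_mw",
    }

    frames = []
    for path in sorted(glob.glob("gis_reports/*.csv")):  # hypothetical directory
        df = pd.read_csv(path)
        df = df.rename(columns=COLUMN_ALIASES)
        df["source_file"] = path                         # keep provenance for debugging
        frames.append(df)

    merged = pd.concat(frames, ignore_index=True, sort=False)  # missing columns become NaN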

Google BigQuery became my main environment over time. I leaned heavily on its performance for exploratory joins, version comparisons, and running some tests and checks across many rows. It allowed me to iterate quickly, which was crucial because the edge cases never stopped coming.
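
Most of those checks were ad hoc queries; something like the following, with placeholder project and table names, captures the flavor of comparing two monthly releases:

    # Flavor of an exploratory check in BigQuery: which projects appear in last
    # month's GIS release but not this month's? Table names are placeholders.
    from google.cloud import bigquery

    client = bigquery.Client()
    query = """
        SELECT prev.project_name
        FROM `my_project.grid.gis_2024_03` AS prev
        LEFT JOIN `my_project.grid.gis_2024_04` AS curr
          USING (project_name)
        WHERE curr.project_name IS NULL
    """
    dropped = client.query(query).to_dataframe()
    print(f"{len(dropped)} projects disappeared between releases")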

The biggest lesson? You often don’t get to work with clean data unless you make it yourself. This is even more true because what "clean" means really depends on your use case and what you personally need from the data. Also, the only way to trust your datasets is to treat the pipeline like production code. You have to assume every new file is going to break something, and you have to build for that from the start. This mindset helped me a lot when designing the merger scripts.
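
In practice, building for breakage mostly meant failing loudly before anything got merged. A tiny sketch of the idea, with hypothetical required columns:

    # Tiny sketch of a fail-loudly check run before a new monthly file is merged.
    # The required columns are hypothetical.
    import pandas as pd

    REQUIRED_COLUMNS = {"project_name", "capacity_mw", "county"}

    def validate_monthly_file(df: pd.DataFrame, path: str) -> None:
        missing = REQUIRED_COLUMNS - set(df.columns)
        if missing:
            raise ValueError(f"{path} is missing expected columns: {sorted(missing)}")
        if df["project_name"].isna().any():
            raise ValueError(f"{path} contains rows without a project name")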

Final Reflections

These two projects couldn’t have been more different: one was polished and production-ready, the other raw and constantly evolving. But they both taught me the same thing. Data work is about control, not just pipelines:

  • Control for ambiguity with validation layers.
  • Control for messiness with preprocessing and tests.
  • Control for change by writing modular, swappable systems.

I relied on AI a lot throughout these projects, but in all honesty, AI alone wasn’t enough. You need the right guardrails, fallback systems, and human insight to build something reliable. And I think that is not going to change, no matter how advanced LLMs become.