Data Engineering

From 279 Gigabytes to "What Are These 49 Million Rows?"

[Figure: hydraulic-press graphic illustrating compression, raw CSV at 279 GB vs. Parquet at 35 GB]

The Hardware Question Nobody Talks About

Most guides on clinical machine learning start with a clean dataset and a model architecture. This one starts where the actual work begins: staring at 279 gigabytes of CSV files wondering if your computer can even open them.

Our team recently set up infrastructure to work with OMOP-formatted clinical data—1.4 billion measurement records, 160 million drug exposures, and 371,000 patients. The goal was straightforward: build a local analytics environment capable of supporting machine learning workflows without waiting for IT tickets or cloud compute approvals.

Clinical researchers often ask what software to use for ML. Fewer ask about hardware, assuming a standard workstation will suffice. It won't—not for real EHR data at scale. A single vitals table can exceed what fits in memory on a standard machine.

CSV is Not a Database

OMOP exports typically arrive as CSV files. Ours totaled 279GB across 17 tables. The measurement table alone—194GB—contained every vital sign, lab result, and clinical observation recorded for our patient population.

CSV has one advantage: universality. It has many disadvantages: no compression, no indexing, slow to query, and prone to parsing errors. The solution is columnar storage. We converted everything to Parquet format using DuckDB. The entire 279GB compressed to 35GB—an 8x reduction—and query times dropped from minutes to under a second.
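The conversion itself is a one-liner per table in DuckDB. A minimal sketch, with illustrative file names (ours followed the OMOP table names) and an assumed ZSTD codec:

```sql
-- Read a raw CSV export and write it back out as compressed Parquet.
-- read_csv_auto infers column types; COPY ... TO handles the conversion.
COPY (SELECT * FROM read_csv_auto('measurement.csv'))
  TO 'measurement.parquet' (FORMAT PARQUET, COMPRESSION ZSTD);
```

Repeating this per table is the entire migration; no schema definitions or load scripts are required.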

One example: counting 1.4 billion measurements grouped by year took 0.82 seconds after conversion.
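A query of that shape, run directly against the Parquet file, looks roughly like this (column names follow the OMOP CDM; the path is illustrative):

```sql
-- Count all measurements grouped by calendar year,
-- scanning the Parquet file directly with no load step.
SELECT year(measurement_date) AS yr, count(*) AS n
FROM read_parquet('measurement.parquet')
GROUP BY yr
ORDER BY yr;
```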

The 49 Million Rows with No Identity

After conversion, we ran basic quality checks. The measurement table showed an unexpected pattern: 49 million rows had a concept ID of zero, meaning they weren't mapped to any standard terminology.
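A check along these lines surfaces the pattern; `measurement_concept_id = 0` is the OMOP convention for an unmapped record, and the path is illustrative:

```sql
-- How many measurement rows carry no standard concept mapping?
SELECT count(*) AS unmapped_rows
FROM read_parquet('measurement.parquet')
WHERE measurement_concept_id = 0;
```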

These weren't junk data. They had units. By examining the unit codes, we could infer what these measurements likely represented:

  • 3.4 million with units in milliseconds—probably ECG intervals
  • 3.0 million as percentages—likely hematocrit or oxygen saturation
  • 1.3 million in femtoliters—consistent with mean corpuscular volume
  • 1.3 million in mg/dL—standard chemistry values

This affects 36% of patients in our dataset. Any analysis that filters to mapped concepts only—which is standard practice—will systematically exclude real clinical signal from over a third of the population.
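The unit breakdown above can be produced with a query sketched like the following, joining the unmapped rows to the OMOP concept table to resolve unit codes into readable names (file paths are illustrative):

```sql
-- Group unmapped measurements by their unit concept
-- to infer what they likely represent.
SELECT c.concept_name AS unit, count(*) AS n
FROM read_parquet('measurement.parquet') m
JOIN read_parquet('concept.parquet') c
  ON m.unit_concept_id = c.concept_id
WHERE m.measurement_concept_id = 0
GROUP BY unit
ORDER BY n DESC;
```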

What We'd Tell a Colleague Starting This Work

1. Budget for Hardware

A machine that can hold your largest table in memory will save time on every subsequent analysis. Disk space is cheap; waiting for queries is expensive.

2. Convert to Columnar Storage

Convert to columnar storage before doing anything else. The upfront investment of a few hours eliminates friction for every future project.

3. Run Quality Checks

The unmapped data issue we found would have silently biased any model trained on "complete" records.

4. Document Everything

Six months from now, someone will ask why 49 million rows are excluded from analysis. A clear record of the investigation turns a mystery into an answered question.
