I build and operate real-time data/ML platforms, and one recurring pain I see inside any data org is this: “Why does this attribute have this value?” When a company name, industry, or revenue looks wrong, investigations often stall without engineering help. I wanted a way for analysts, support, and product folks to self-serve the “why,” with history and evidence, without waking up an engineer.
This is the blueprint I shipped: a serverless, low-maintenance traceability service that queries terabytes in seconds and costs peanuts. The stack:

- Amazon API Gateway: a secure front door for the API
- AWS Lambda: stateless request handlers (no idle compute to pay for)
- Amazon S3 + Apache Hudi: cheap storage with time travel and upserts
- AWS Glue Data Catalog: schema and partition metadata
- Amazon Athena: SQL over S3/Hudi, pay per data scanned, zero infra
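As a sketch of the Lambda layer (the table name, column names, and 10-bucket count are assumptions for illustration, not the real service): a handler that turns an API request into the traceability SQL, filtering on partition columns first so Athena can prune before reading data.

```python
import json

TRACE_TABLE = "trace_db.attribute_history"  # hypothetical table name
BUCKETS = 10  # assumed bucket count for entity_id_mod

def build_trace_query(entity_id: int, attribute: str) -> str:
    """Build the traceability SQL, putting partition-column predicates
    (entity_id_mod, attribute_name) first so partition pruning applies."""
    bucket = entity_id % BUCKETS
    # Real code should validate `attribute` against the known enum values
    # before interpolating it, to avoid SQL injection.
    return (
        f"SELECT value, source, updated_at "
        f"FROM {TRACE_TABLE} "
        f"WHERE entity_id_mod = {bucket} "
        f"AND attribute_name = '{attribute}' "
        f"AND entity_id = {entity_id} "
        f"ORDER BY updated_at DESC"
    )

def handler(event, context):
    """API Gateway proxy handler: request params in, JSON out."""
    params = event.get("queryStringParameters") or {}
    sql = build_trace_query(int(params["entity_id"]), params["attribute"])
    # The real handler hands `sql` to Athena (e.g. via
    # boto3.client("athena").start_query_execution) and polls for results;
    # elided here to keep the sketch self-contained.
    return {"statusCode": 200, "body": json.dumps({"query": sql})}
```

Keeping the handler stateless is what lets Lambda scale it to zero between investigations.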
:::info Data layout: performance is a data problem (not a compute problem)
Athena is fast when it reads almost nothing, and slow when it plans or scans too much. The entire project hinged on getting partitions and projection right.
:::
Partitioning:

- Put only frequently filtered columns in the partition spec.
- Use integer bucketing (mod) for high-cardinality keys like entity_id.

Filtering on these partition columns limits the data scanned and, more importantly, narrows the partition metadata Athena must consider during planning.
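Integer bucketing is just a mod over the key; a minimal sketch, assuming a 10-bucket layout and a hypothetical S3 path structure:

```python
BUCKETS = 10  # assumed; pick to balance partition count against file size

def bucket_for(entity_id: int) -> int:
    """Map a high-cardinality key into a small, enumerable partition space."""
    return entity_id % BUCKETS

def partition_prefix(entity_id: int, attribute: str, created_date: str) -> str:
    """S3 prefix a query for this entity gets pruned down to
    (hypothetical layout matching the partition spec above)."""
    return (
        f"s3://trace-bucket/attribute_name={attribute}/"
        f"entity_id_mod={bucket_for(entity_id)}/"
        f"created_date={created_date}/"
    )
```

The payoff is that ten buckets stay enumerable for partition projection no matter how many entities exist.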
Partition Indexing (first win, partial):
We enabled partition indexing so Athena could prune partition metadata faster during planning.
This helped until the partition count grew large; planning was still the dominant cost.
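Enabling the index is a one-time Glue call; a sketch assuming hypothetical database, table, and index names (Glue's `create_partition_index` takes an ordered list of partition keys to index):

```python
def partition_index_spec(index_name: str, keys: list) -> dict:
    """Shape of the PartitionIndex argument Glue expects: an ordered
    list of partition keys plus a name for the index."""
    return {"IndexName": index_name, "Keys": keys}

def create_index(database: str, table: str) -> None:
    """Register a partition index so Glue can serve filtered
    partition lookups instead of full listings."""
    import boto3  # imported lazily so the module loads without it
    glue = boto3.client("glue")
    glue.create_partition_index(
        DatabaseName=database,
        TableName=table,
        PartitionIndex=partition_index_spec(
            "attr_entity_idx",  # hypothetical index name
            ["attribute_name", "entity_id_mod"],
        ),
    )
```

Key order matters: the index only helps queries whose predicates cover a prefix of the indexed keys.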
Partition Projection (the actual game‑changer):
Instead of asking Glue to store millions of partitions, we taught Athena how partitions are shaped.
Result: planning time close to zero; queries jumped from “slow-ish with growth” to consistently 1–2 seconds for typical workloads.
Athena TBLPROPERTIES (minimal example):

```sql
TBLPROPERTIES (
  'projection.enabled'='true',
  'projection.attribute_name.type'='enum',
  'projection.attribute_name.values'='employees,revenue,linkedin_url,founded_year,industry',
  'projection.entity_id_mod.type'='integer',
  'projection.entity_id_mod.interval'='1',
  'projection.entity_id_mod.range'='0,9',
  'projection.created_date.type'='date',
  'projection.created_date.format'='yyyy-MM-dd',
  'projection.created_date.interval'='1',
  'projection.created_date.interval.unit'='days',
  'projection.created_date.range'='2022-01-01,NOW'
)
```
- Model your top 3 queries; pick partitions that match those predicates.
- Use enum projection for low-cardinality fields, date projection for time, integer ranges for buckets.
- Store data in columnar formats (Parquet/ORC) via Hudi to keep scans small and enable time travel.
- Add a thin API (API Gateway + Lambda) to turn traceability SQL into JSON for your UI.
- Measure planning vs. scan time; optimize the former first.
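Athena reports planning and scanning separately in the Statistics block of GetQueryExecution, which makes the last point measurable; a sketch of a helper that reads them (the 50% threshold is an arbitrary assumption):

```python
def diagnose(stats: dict) -> str:
    """Given the Statistics dict from Athena's GetQueryExecution response,
    say whether planning or scanning dominates the query's runtime."""
    planning = stats.get("QueryPlanningTimeInMillis", 0)
    total = stats.get("TotalExecutionTimeInMillis", 0) or 1
    if planning / total > 0.5:  # arbitrary cutoff: planning dominates
        return "fix partition metadata (projection, fewer partitions)"
    return "fix data layout (columnar files, tighter partition predicates)"

def fetch_stats(query_execution_id: str) -> dict:
    """Pull the Statistics block for a finished Athena query."""
    import boto3  # imported lazily; only needed when actually calling AWS
    athena = boto3.client("athena")
    resp = athena.get_query_execution(QueryExecutionId=query_execution_id)
    return resp["QueryExecution"]["Statistics"]
```

In our case this split is what pointed at planning, not scanning, as the bottleneck, and led us to partition projection.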
Great performance is usually a data layout story. Before you scale compute, fix how you store and find bytes. In serverless analytics, the fastest query is the one that plans instantly and reads almost nothing, and partition projection is the lever that gets you there.


