
Designing zero-ETL intake on AWS

Contrast modern lakehouse patterns with legacy batch ETL to pick the right delivery model.

What this covers

Use this guide to evaluate when zero-ETL architectures unlock fresher insights, how to stage curated objects for federated engines, and the guardrails required to keep auditors comfortable.

Implementation trail

  • Decision matrix: zero-ETL vs. traditional ETL
  • Landing zone and catalog design
  • Serverless analytics surfaces
  • Automation and observability
  • Downstream integration patterns

Zero-ETL or traditional ETL? Start with a decision matrix

Balance latency, compliance, and operational complexity before ripping out existing ETL chains. Zero-ETL favors direct analytics on curated object storage, while traditional ETL still shines for heavy reshaping or mainframe extractions.

  • Latency & freshness: Zero-ETL (EventBridge → Step Functions → Athena/Redshift Spectrum) streams curated files within minutes; batch ETL often waits on nightly jobs but allows heavyweight reshaping offline.
  • Price & storage efficiency: Querying Iceberg/Parquet in place avoids warehouse hot storage charges and duplicate staging copies. Legacy ETL doubles storage footprints but can exploit reserved warehouse capacity for predictable spend.
  • Visibility & logging: Zero-ETL centralizes CloudWatch Logs, Lake Formation access audits, and Step Functions execution history. Traditional ETL may retain clearer job-by-job lineage in tools like Airflow but often lacks real-time traces.
  • Ease of integration: Zero-ETL exposes data via Athena, Redshift Serverless, and API Gateway without additional load windows. Traditional ETL excels when downstream systems only support relational imports or require flattened schemas.
  • Data extraction workflows: Modern zero-ETL stacks lean on DataSync, Database Migration Service, or managed connectors that land change streams straight into S3. Traditional ETL may rely on stored procedures and manual extracts but guarantees deterministic pulls from brittle legacy systems.
  • Visualization & consumption: QuickSight or Tableau can point to curated Iceberg tables immediately. Batch ETL pipelines may pre-aggregate marts that business teams are already trained on.
  • Operational overhead: Serverless zero-ETL minimizes patching; however, policy drift or schema-on-read surprises require strong data contracts. Traditional ETL demands infrastructure upkeep but offers deterministic transformations for regulated datasets.
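The matrix above can be sketched as a small scoring helper. The weights and thresholds here are assumptions for illustration; tune them to your organization's priorities rather than treating them as a definitive rubric.

```python
# Illustrative scoring of the decision-matrix criteria. Weights are
# assumptions; adjust per organization.

def recommend_delivery_model(
    freshness_sla_minutes: int,
    needs_heavy_reshaping: bool,
    downstream_relational_only: bool,
    regulated_dataset: bool,
) -> str:
    """Return 'zero-etl' or 'traditional-etl' based on matrix criteria."""
    score = 0
    score += 2 if freshness_sla_minutes <= 60 else -1   # latency & freshness
    score -= 2 if needs_heavy_reshaping else 0          # offline reshaping favors batch
    score -= 1 if downstream_relational_only else 0     # ease of integration
    score -= 1 if regulated_dataset else 0              # deterministic transforms
    return "zero-etl" if score > 0 else "traditional-etl"

print(recommend_delivery_model(15, False, False, False))   # fresh, light reshaping
print(recommend_delivery_model(1440, True, True, True))    # nightly, heavy reshaping
```

In practice, a dataset inventory spreadsheet scored this way gives teams a defensible starting list of zero-ETL candidates before any architecture work begins.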

Design the landing and curated zones

Partition S3 for immutable landing data and governed curated outputs so every consumer (Athena, Redshift Spectrum, EMR) reads the same truth without duplicate loads.

  • Separate s3://<env>-raw/ and s3://<env>-curated/ prefixes with lifecycle policies that purge transient staging while retaining curated Iceberg tables for governance.
  • Use Glue databases per domain (finance_curated, product_curated) and tag them with Lake Formation LF-Tags to control zero-ETL read access.
  • Automate Glue crawlers or Blueprints so raw objects register within minutes; treat schema drift as a deployment that requires versioned contracts.
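The lifecycle split in the first bullet might look like the following payload. Bucket prefixes and retention windows are assumptions; apply the dict with boto3's `s3.put_bucket_lifecycle_configuration` against your raw bucket.

```python
# Sketch of a lifecycle policy that purges transient staging objects while
# tiering the rest of the raw zone; curated Iceberg objects live in a separate
# bucket and are untouched. Prefixes and day counts are assumptions.

RAW_LIFECYCLE = {
    "Rules": [
        {
            "ID": "expire-transient-staging",
            "Filter": {"Prefix": "staging/"},
            "Status": "Enabled",
            "Expiration": {"Days": 7},          # purge staging after a week
        },
        {
            "ID": "tier-raw-to-infrequent-access",
            "Filter": {"Prefix": ""},
            "Status": "Enabled",
            "Transitions": [{"Days": 30, "StorageClass": "STANDARD_IA"}],
        },
    ]
}

# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="prod-raw", LifecycleConfiguration=RAW_LIFECYCLE
# )
```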

Pattern: Redshift Serverless + Iceberg

Leverage Redshift Serverless external schemas to query Iceberg tables without copy commands while still joining with existing Redshift models.

  • Create an AWS Glue catalog database and share it with Redshift via Lake Formation so analysts can join Iceberg tables and native Redshift materialized views.
  • Grant Redshift a service IAM role limited to curated prefixes and enable audit log exports to keep regulators comfortable with zero-ETL reads.
  • Use Data Shares or data APIs for partners that still require SQL endpoints, avoiding nightly unload/reload jobs.
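A minimal sketch of the external-schema wiring described above, written as SQL an analyst would run in Redshift Serverless. The schema, database, table names, and the IAM role ARN are placeholders.

```python
# DDL to expose the Glue-catalogued Iceberg database inside Redshift without
# a COPY; the role ARN must match the curated-prefix-scoped role from the
# bullet above (placeholder values throughout).

EXTERNAL_SCHEMA_DDL = """
CREATE EXTERNAL SCHEMA finance_lake
FROM DATA CATALOG
DATABASE 'finance_curated'
IAM_ROLE 'arn:aws:iam::111122223333:role/redshift-curated-read';
"""

# Once mounted, Iceberg tables join directly against native Redshift objects.
JOIN_QUERY = """
SELECT o.order_id, o.amount, f.fx_rate
FROM finance_lake.orders o
JOIN analytics.fx_rates_mv f USING (currency_code);
"""
```

The point of the pattern is visible in the second statement: the Iceberg table and the native materialized view sit in one query plan, with no unload/reload step between them.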

Pattern: Glue, Step Functions, and Lambda for continuous curation

Instead of monolithic Spark clusters, orchestrate serverless transformations that enrich but do not relocate data.

  • Trigger Step Functions from EventBridge notifications on object creation; first stage runs Glue jobs for light normalization, second stage refreshes materialized views or Iceberg snapshots.
  • Embed schema validation and Great Expectations checks inside Glue jobs to quarantine anomalies without blocking upstream producers.
  • Publish execution metrics and lineage events to the Data Catalog, so consumers understand freshness without scanning ETL logs.
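The two-stage flow above can be expressed in Amazon States Language roughly as follows. The Glue job name and Lambda function name are placeholder assumptions; the `glue:startJobRun.sync` and `lambda:invoke` resources are the managed service integrations Step Functions provides.

```python
import json

# Minimal ASL sketch: stage one runs a Glue normalization job synchronously,
# stage two refreshes Iceberg snapshots / materialized views via Lambda.
# Job and function names are placeholders.

STATE_MACHINE = {
    "StartAt": "NormalizeRawObject",
    "States": {
        "NormalizeRawObject": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {
                "JobName": "light-normalization",
                "Arguments": {"--source_key.$": "$.detail.object.key"},
            },
            "Next": "RefreshIcebergSnapshot",
        },
        "RefreshIcebergSnapshot": {
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke",
            "Parameters": {"FunctionName": "refresh-snapshots"},
            "End": True,
        },
    },
}

print(json.dumps(STATE_MACHINE, indent=2))
```

An EventBridge rule on `Object Created` events supplies the input, so `$.detail.object.key` resolves to the new raw object's key at runtime.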

Visualization and BI connectivity

Treat reporting as another zero-ETL consumer: tools attach directly to curated tables with row-level governance.

  • Provision Athena workgroups dedicated to BI tools with enforced result output locations and per-query cost controls.
  • Expose Redshift Serverless endpoints for teams requiring JDBC/ODBC connectivity while still avoiding duplicate batch loads.
  • Use QuickSight SPICE for frequently accessed dashboards when concurrency peaks would otherwise spike Athena costs.
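The workgroup guardrails in the first bullet translate into a configuration like this sketch (names and limits are assumptions); pass it to boto3's `athena.create_work_group` as the `Configuration` argument.

```python
# Sketch of a BI-dedicated Athena workgroup: enforced result location plus a
# per-query scan cap so a runaway dashboard query cannot blow the budget.
# Bucket name and the 10 TB cutoff are illustrative assumptions.

BI_WORKGROUP_CONFIG = {
    "ResultConfiguration": {
        "OutputLocation": "s3://prod-athena-results/bi/",
    },
    "EnforceWorkGroupConfiguration": True,        # clients cannot override settings
    "BytesScannedCutoffPerQuery": 10 * 1024**4,   # hard stop at 10 TB scanned
    "PublishCloudWatchMetricsEnabled": True,      # feed cost dashboards
}
```

With `EnforceWorkGroupConfiguration` set, a BI tool that tries to write results elsewhere or skip the cutoff is overridden by the workgroup, which keeps cost controls out of end users' hands.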

Operational guardrails and hybrid coexistence

Most enterprises run zero-ETL and traditional ETL side by side; governance determines which datasets qualify for the lighter-weight path.

  • Log every read through Lake Formation, CloudTrail Lake, and Step Functions execution history to maintain auditor-friendly visibility.
  • Keep a “landing zone to warehouse” fallback path (e.g., Glue job that loads Redshift tables) for workloads needing heavy reshaping or legacy downstream feeds.
  • Periodically review total cost of ownership: compare Athena/Redshift Serverless consumption versus the compute hours saved by avoiding nightly batch clusters.

Need a zero-ETL adoption roadmap?

We help teams graduate from brittle batch jobs to governed, event-driven analytics planes without breaking the reporting systems they rely on today.

Plan your zero-ETL rollout