#Snowflake #AWS #CDC #Data Engineering #Cost Optimization

Achieving Low Cost Snowflake Sync: Handling High Volume Updates from an External System

Learn how to build a cost-effective, reliable data synchronization pipeline for Snowflake using serverless AWS services and CDC, handling millions of daily changes.

MargallaAI Team
March 20, 2024
11 min read

A client approached our team with a tricky synchronization puzzle. Their external system churned out updates all day at unpredictable rates, but they needed Snowflake tables refreshed just once daily. While daily batches worked most days, occasional drift left the systems out of sync. They also demanded proof that every change was captured, all at minimal cost.

The scale was daunting: one table held 1 billion historical records, with busy days dumping 500,000 to 1 million changes. Our goal: a reliable, verifiable sync under 2 hours daily, slashing Snowflake bills.

The Challenge

The external system updated continuously during business hours. Snowflake acted as the analytics hub, but the client mandated a single daily refresh to control warehouse usage.

Issues piled up:

  • **Drifts:** Network hiccups, failures, or timing gaps caused mismatches.
  • **Verification:** No easy way to confirm 100% capture without manual audits.
  • **Scale:** 1B row tables meant even deltas strained resources.
  • **Costs:** Snowflake credits burned fast on large MERGEs or streaming tasks.

Traditional ETL (e.g., Fivetran/Airbyte) hit limits on volume and price. We needed clever architecture.

Our Solution Architecture

We flipped the script: Do heavy lifting in the source system's AWS account, minimizing Snowflake's role. Here's the flow:

  • **Capture Changes:** Used AWS services (e.g., DMS or database logs) to land daily CDC files from both source and Snowflake into S3 buckets. Partitioning by date/hour kept it manageable.
  • **Diff in Lambda:** Triggered functions compared files to identify net inserts, updates, and deletes. Serverless scaled to 1M rows effortlessly.
  • **Stage Results:** Output compact Parquet files to S3 (gzipped, columnar).
  • **Snowflake Load:** Simple COPY INTO staging, then MERGE (using natural keys).
  • **Orchestrate:** EventBridge for scheduling and CloudWatch for logs and metrics.

Snowflake only touched data at the very end, keeping things lightweight and cheap.
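The final load step can be sketched as plain SQL generation: one COPY INTO to pull the diff Parquet into a staging table, then one MERGE keyed on the natural key. This is a minimal sketch, assuming an `op` column marking deletes and a named stage called `diff_stage`; the stage, table, and column names are illustrative, not the client's real schema.

```python
def build_load_sql(staging, target, key, cols):
    """Build the two statements for the daily Snowflake load.

    staging/target are fully qualified table names, key is the natural
    key column, cols are the non-key columns. An 'op' column with value
    'D' marks tombstones (an assumption for this sketch).
    """
    set_clause = ", ".join(f"t.{c} = s.{c}" for c in cols)
    all_cols = ", ".join([key] + cols)
    all_vals = ", ".join(f"s.{c}" for c in [key] + cols)
    copy_sql = (f"COPY INTO {staging} FROM @diff_stage "
                "FILE_FORMAT = (TYPE = PARQUET) "
                "MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE")
    merge_sql = (f"MERGE INTO {target} t USING {staging} s "
                 f"ON t.{key} = s.{key} "
                 "WHEN MATCHED AND s.op = 'D' THEN DELETE "
                 f"WHEN MATCHED THEN UPDATE SET {set_clause} "
                 "WHEN NOT MATCHED AND s.op <> 'D' THEN "
                 f"INSERT ({all_cols}) VALUES ({all_vals})")
    return copy_sql, merge_sql
```

Because the staging table only ever holds the day's deltas, the MERGE scans millions of rows at most, not the billion-row target.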

```mermaid
flowchart LR
    A[External System DB] --> B[AWS DMS / CDC]
    B --> C[S3 Source CDC Files]
    D[Snowflake Tables] --> E[S3 Snowflake Extract]
    C --> F[Lambda Diff Engine]
    E --> F
    F --> G[S3 Diff Results Parquet]
    G --> H[Snowflake COPY INTO Staging]
    H --> I[Snowflake MERGE]
```

Implementation Details

We built the diff logic in Python Lambdas:

```python
import pandas as pd

# Simplified diff over the day's extracts from both sides
source_df = pd.read_parquet(source_s3_key)
target_df = pd.read_parquet(target_s3_key)

# Keys only in the source are net inserts; keys only in the target are deletes
# (the real Lambda also compares values on shared keys to catch in-place updates)
inserts = source_df[~source_df["key"].isin(target_df["key"])]
deletes = target_df[~target_df["key"].isin(source_df["key"])]

# Write compact Parquet results back to S3 for the Snowflake load
inserts.to_parquet(results_s3_key)
```
  • **Scale Handling:** S3 partitioning by date and hour, with Spark available if Lambdas maxed out (which rarely happened).
  • **Deletes and Verification:** We tracked tombstones with post-merge row count diffs and hash checks.
  • **Edge Cases:** Idempotent runs, retry logic, and schema evolution via dynamic DDL.
  • **Runtime:** Capture took 20 minutes, diff took 25 minutes, and load took 15 minutes, for a total of about 1 hour.
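The post-merge verification above can be sketched as an order-independent fingerprint: row count plus a hash over the sorted key column, computed on both sides and compared. This is a simplified sketch; the column name `key` and the function name are illustrative.

```python
import hashlib

import pandas as pd


def table_fingerprint(df: pd.DataFrame, key: str):
    """Return (row_count, sha256-of-sorted-keys) for a table extract.

    Sorting the keys first makes the hash independent of row order,
    so source and Snowflake extracts can be compared directly.
    """
    digest = hashlib.sha256()
    for k in sorted(df[key].astype(str)):
        digest.update(k.encode())
    return len(df), digest.hexdigest()
```

If the fingerprints disagree after a run, the pipeline flags drift and the day's diff files in S3 serve as the audit trail.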

CloudWatch dashboards gave us visibility into sync lag, error rates, and bytes processed. Proactive alerts fired if drifts exceeded zero.
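The drift alert boils down to publishing a single custom metric and alarming when it exceeds zero. A minimal sketch, assuming a hypothetical `SnowflakeSync` namespace and `RowDrift` metric name (not our production names):

```python
def drift_metric(expected_rows: int, loaded_rows: int) -> dict:
    """Build a CloudWatch put_metric_data payload for the drift check.

    A CloudWatch alarm on RowDrift > 0 pages the team; zero means the
    day's sync is fully reconciled.
    """
    return {
        "Namespace": "SnowflakeSync",
        "MetricData": [{
            "MetricName": "RowDrift",
            "Value": float(abs(expected_rows - loaded_rows)),
            "Unit": "Count",
        }],
    }

# Published at the end of each run, e.g.:
# boto3.client("cloudwatch").put_metric_data(**drift_metric(expected, loaded))
```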

Results and Wins

  • **Speed:** Consistent 1 hour runs, even on days with 1M deltas.
  • **Cost Slash:** Shifted 80 to 90% of compute to AWS, which cost pennies per run compared to Snowflake dollars.
  • **Reliability:** Zero drifts after launch with a full audit trail.
  • **Client Love:** We met the "all updates captured" requirement via logs and scaled seamlessly.

A massive win: budget intact, compliance happy.

Lessons Learned and Next Steps

Serverless CDC with S3 is gold for cross-cloud syncs. Our top tips: start with a proof of concept on sample data, and monitor Lambda memory closely.

Low cost sync doesn't mean low quality. We proved it. If you have similar technical challenges, let's connect.
