Implement near real-time eventing from Snowflake to RDS - part 1
Part 1: Understanding the Architecture and Key Concepts in Data Integration
This two-part series explores a data integration architecture that draws on a diverse set of skills to deliver a robust solution. In this first part, we focus on understanding the architecture and its unique aspects. The second part will provide a comprehensive breakdown of the resources, goals, and considerations involved in its implementation.
The Current Architecture: Snowflake as a Source
Typically, Snowflake is used as a destination in data pipelines because of its strengths as a data warehouse for analytics. In this architecture, however, Snowflake is not the sink but the source. This deviation from the norm demonstrates the flexibility of modern data integration pipelines.
Objective: Move data changes from Snowflake to an AWS RDS Aurora cluster via Kafka with the following components:
- Snowflake Streams: Captures data changes efficiently.
- Kafka: Processes events and sinks the data into the data store.
- AWS RDS Aurora: The final destination for data storage.
- Avro Format with Schema Registry: Ensures structured, schema-validated data transfer.
Below is a diagram illustrating this architecture:
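To make the first component concrete, here is a minimal sketch of creating and reading a Snowflake Stream with the snowflake-connector-python client. The table name ORDERS, the stream name ORDERS_STREAM, and the connection parameters are hypothetical placeholders for illustration, not part of the architecture described here.

```python
import snowflake.connector

# Hypothetical connection parameters; replace with your own account details.
conn = snowflake.connector.connect(
    account="my_account",
    user="etl_user",
    password="...",
    warehouse="ETL_WH",
    database="SALES_DB",
    schema="PUBLIC",
)
cur = conn.cursor()

# A stream records inserts, updates, and deletes made to the source table.
cur.execute("CREATE STREAM IF NOT EXISTS ORDERS_STREAM ON TABLE ORDERS")

# SELECT * on a stream returns the table columns plus METADATA$ACTION,
# METADATA$ISUPDATE, and METADATA$ROW_ID describing each change.
cur.execute("SELECT * FROM ORDERS_STREAM")
for row in cur:
    print(row)

# Note: the stream offset only advances when the stream is consumed
# inside a DML statement within a transaction (e.g. INSERT INTO ... SELECT).
```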
Key Architecture Goals
The primary objective is to move data from Snowflake to AWS RDS Aurora with the following considerations:
- Streamlined Data Transfer: Utilize Snowflake Streams to capture changes, removing the need for a source connector and bypassing traditional ETL pulls.
- Event-Driven Processing: Kafka ensures that each event (message) is processed in near real-time.
- Validation and Structuring: Use the Avro format with a Schema Registry to validate message structure before delivery (see the sketch after this list).
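As an illustration of the validation step, the sketch below serializes a change record with Avro against a Schema Registry before producing it to Kafka, using the confluent-kafka Python client. The topic name, schema, and endpoints are assumptions made for the example.

```python
from confluent_kafka import Producer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer
from confluent_kafka.serialization import SerializationContext, MessageField

# Hypothetical Avro schema for a change event coming out of the stream.
ORDER_SCHEMA = """
{
  "type": "record",
  "name": "OrderChange",
  "fields": [
    {"name": "order_id", "type": "long"},
    {"name": "status", "type": "string"},
    {"name": "action", "type": "string"}
  ]
}
"""

schema_registry = SchemaRegistryClient({"url": "http://schema-registry:8081"})
avro_serializer = AvroSerializer(schema_registry, ORDER_SCHEMA)
producer = Producer({"bootstrap.servers": "broker:9092"})

def publish_change(change: dict) -> None:
    # Serialization validates the record against the registered schema;
    # an invalid payload raises before anything reaches the topic.
    payload = avro_serializer(
        change, SerializationContext("orders.changes", MessageField.VALUE)
    )
    producer.produce(topic="orders.changes", key=str(change["order_id"]), value=payload)

publish_change({"order_id": 42, "status": "SHIPPED", "action": "INSERT"})
producer.flush()
```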
Typical Architecture vs. Current Architecture
In a typical data pipeline:
- Data sources feed into Kafka using native clients or connectors.
- Kafka acts as an intermediary, processing and pushing data to Snowflake as the destination.
- RDS or other operational databases often act as the upstream sources of transactions.
Key Differences in the Current Architecture:
- Snowflake acts as the source, not the sink.
- AWS RDS Aurora becomes the final sink (see the connector sketch after this list).
- The pushing model eliminates the need for traditional batch pulls, enabling near real-time processing.
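To show how Aurora plugs in as the final sink, here is a sketch that registers a Confluent JDBC Sink Connector through the Kafka Connect REST API. The connector name, topic, JDBC URL, and credentials are hypothetical and would need to match your environment; it also assumes the JDBC sink and Avro converter plugins are installed on the Connect workers.

```python
import json
import requests

# Hypothetical Kafka Connect REST endpoint.
CONNECT_URL = "http://kafka-connect:8083/connectors"

sink_config = {
    "name": "aurora-orders-sink",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
        "topics": "orders.changes",
        # Placeholder Aurora endpoint; in practice, pull credentials from a secrets manager.
        "connection.url": "jdbc:mysql://aurora-cluster.cluster-xyz.us-east-1.rds.amazonaws.com:3306/sales",
        "connection.user": "app_user",
        "connection.password": "...",
        "insert.mode": "upsert",
        "pk.mode": "record_key",
        "pk.fields": "order_id",
        "auto.create": "true",
        "value.converter": "io.confluent.connect.avro.AvroConverter",
        "value.converter.schema.registry.url": "http://schema-registry:8081",
    },
}

# POST /connectors creates the connector; Connect then streams the topic into Aurora.
resp = requests.post(CONNECT_URL, json=sink_config, timeout=30)
resp.raise_for_status()
print(json.dumps(resp.json(), indent=2))
```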
Pushing vs. Pulling Models
- Pulling Model:
- Connectors query the source at intervals to retrieve changes, relying on a unique identifier and/or a timestamp column to detect what changed.
- Pros:
- Simple setup for legacy systems.
- Controlled data flow based on pull frequency.
- Cons:
- Higher latency due to periodic batch pulls.
- Increased workload on the source system during pulls.
- Loss of change granularity (all changes between pulls are squashed into a single event).
- Risk of missing incremental changes if not carefully managed.
- Pushing Model:
- The source actively sends changes to downstream systems as events (messages) as soon as they occur.
- Pros:
- Near real-time updates.
- Reduced source system workload as changes are streamed, not queried.
- Ensures no loss of incremental changes.
- Cons:
- Requires robust error handling and retry mechanisms (a sketch follows this list).
- Potentially more complex initial setup.
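As a small illustration of that error-handling requirement, the following sketch configures an idempotent confluent-kafka producer with bounded retries and a delivery callback. The broker address, topic, and payload are placeholders for the example.

```python
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "broker:9092",
    # Built-in resiliency knobs: idempotent delivery and bounded retries.
    "enable.idempotence": True,
    "acks": "all",
    "retries": 5,
    "retry.backoff.ms": 500,
})

def on_delivery(err, msg):
    # Invoked from poll()/flush(); route failures to a dead-letter topic or alerting.
    if err is not None:
        print(f"Delivery failed for key={msg.key()}: {err}")
    else:
        print(f"Delivered to {msg.topic()}[{msg.partition()}] at offset {msg.offset()}")

producer.produce("orders.changes", key="42", value=b"...", on_delivery=on_delivery)
producer.poll(0)    # serve delivery callbacks
producer.flush(10)  # block until outstanding messages are delivered or time out
```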
Conclusion of Part 1
This architecture offers an innovative way to manage near real-time data integration by leveraging Snowflake Streams, Kafka, and AWS RDS Aurora. Choosing a pushing model over a pulling model enables near real-time processing and efficiency, but requires careful orchestration of multiple technologies.
In Part 2, we will:
- Deep dive into each pipeline component and its role.
- Look into the advantages and challenges of using Avro and Schema Registry.
- Walk through the implementation and its challenges.
Stay tuned for an under-the-hood explanation of how these components fit into the technology stack and work together to deliver a modern, scalable data pipeline.