Apache Samza

Apache Samza processes data as it happens.

No delays. No batches. Just real-time pipelines that work.

Originally built at LinkedIn, Samza helps developers handle massive data streams using Apache Kafka and YARN. It supports stateful processing, fault isolation, and horizontal scaling without overengineering the stack.

If you're already running Kafka and need a framework that can process streams reliably, Samza is still worth a look.

‍

What Is Apache Samza?

Apache Samza is a distributed stream processing framework built to process real-time data at scale.

It was created at LinkedIn to address limitations in traditional batch systems like Apache Hadoop. Samza became part of the Apache Software Foundation in 2013 and has since supported high-throughput, fault-tolerant, stateful applications in production.

Unlike batch frameworks that process data in chunks, Samza processes it as it arrives. This enables applications to react in milliseconds instead of minutes or hours.

What defines Samza:

Stream-first architecture. Samza treats all data as continuous streams. It integrates with Apache Kafka for messaging, so systems can process events as they happen.
Distributed execution. Samza jobs run across multiple containers on YARN, Kubernetes, or in standalone mode. This supports resource isolation and horizontal scale.
Stateful processing. Many stream processors can't handle state. Samza colocates state on the same machine as the processing job and writes updates to Kafka changelogs for recovery.
Built-in fault tolerance. If a node fails, Samza uses the latest Kafka offset and restores state from the changelog to resume work without losing data.
Unified batch and stream support. The same code can process both batch and streaming data.
Pluggable components. While optimized for Kafka and YARN, Samza supports custom integrations across transport, state, storage, and execution layers.

Samza is written in Java and Scala. It offers multiple APIs, from low-level callbacks to a Streams DSL, SQL, and Apache Beam. Whether you're building real-time analytics, operational monitoring, ETL pipelines, or production features, Samza provides the building blocks.

‍

+

Samza is built to handle large, continuous data streams with low latency and high throughput. It breaks complex workloads into smaller units and runs them in parallel across a distributed environment.

‍

Streams and Samza Jobs

Processing starts with a stream.

A stream is a partitioned sequence of immutable messages, usually from Kafka. Each message has an offset and is routed to a partition by key. This allows massive parallelism.

Samza jobs process streams. Each job is an application that consumes one or more streams. Jobs are split into tasks. Tasks are assigned to containers, which are JVM processes.

Tasks run independently on different stream partitions. This isolation improves scalability and failure recovery.

‍

Execution Environment

Samza runs on standard cluster managers. It doesn't require a custom scheduler.

You can deploy jobs using Apache YARN, Kubernetes, or standalone mode. YARN is common in Hadoop environments. Kubernetes is supported and fully functional for containerized deployments.

Each container can run multiple tasks, but they run one at a time. This model simplifies resource use and avoids task contention.

‍

Stateful Stream Processing

Some workloads need memory of past events.

Samza supports this with local state. Each task gets a local store, often RocksDB, colocated on disk. This avoids remote latency.

Every change is also written to a Kafka changelog. If the job crashes, Samza replays the changelog to rebuild the state.

This makes it possible to build joins, windowed aggregations, and other stateful logic without external storage.

‍

Fault Tolerance and Checkpoints

Samza recovers quickly from failures.

It checkpoints Kafka offsets every few seconds. If a job crashes, it resumes from the last checkpoint.

Kafka ensures message ordering and durability, so Samza doesn’t lose data or reprocess it incorrectly.

‍

Pluggability and Extensibility

Samza was built to adapt.

You can replace the default messaging system, execution engine, or state store. While most use Kafka and YARN, you can integrate S3, HDFS, other data stores, or different deployment tools.

LinkedIn used this flexibility to run Samza in hybrid setups. For open source users, this flexibility adds some configuration overhead but enables custom integration.

‍

Why Apache Samza Was Built

Samza was built to address a very specific gap.

LinkedIn had Kafka to move data, but no framework to process streams with reliability, performance, and state management.

The Batch Bottleneck

Before Samza, LinkedIn used batch systems like Apache Hadoop. These were fine for reporting or nightly ETL, but the delay between data ingestion and insight was too long.

They needed a faster response time for:

Real-time monitoring
User personalization
Fraud detection

Batch processing wasn't enough.

‍

Kafka Was Not the Full Solution

Kafka solved data transport, but not data processing.

Teams began writing services to consume Kafka topics. These services repeated logic like buffering, windowing, joining, and aggregating.

Each team built custom processors. Samza was created to centralize these patterns.

‍

Local State for Stream Processing

Early stream processors were stateless. For anything involving state—aggregates, joins, deduplication—they had to use external databases. That added latency and risk.

Samza introduced a local state store. State is colocated on the same node as the task and backed up to Kafka. When a task restarts, it replays the changelog and restores state without a database.

‍

Pluggable by Design

LinkedIn had diverse infrastructure. Different teams used different tooling. Some used YARN. Others needed standalone deployments. Kafka was common, but other sources needed support.

Samza made every layer pluggable. Messaging. Storage. Metrics. Orchestration.

This gave teams flexibility but also made onboarding harder for new users.

‍

Lessons from Production

Some design choices created friction over time.

YARN dependency limited adoption
Pluggability introduced too much complexity
Overlap with Kafka APIs led to unclear abstractions

These pain points led to simpler successors like Kafka Streams and Kafka Connect.

Still, Samza's architecture remained robust. It handled scale, failures, and state better than most frameworks at the time.

‍

Real-World Use Cases for Apache Samza

Samza runs in production at large companies.

It may not have buzz, but it powers high-throughput systems reliably. These are some of its use cases.

‍

Real-Time Features and Ranking

At LinkedIn, Samza powers user-facing features.

When a member opens their homepage, Samza computes dozens of signals. This includes ranking logic, recommendation models, session state, and user preferences.

These features are built in real time with:

Stream joins
User-based aggregations
Low-latency processing
Local state and model integration

This processing isn't optional. It's part of the user experience.

‍

Monitoring and Anomaly Detection

Samza supports real-time alerting pipelines.

For example:

Detecting anomalies in service metrics
Spotting fraud in ad systems
Validating operational data on the fly

Samza keeps a local cache of state and evaluates each event as it comes in. Because the state is local and backed by Kafka, detection continues even during container restarts.

‍

Request Tracing and Dependency Graphs

LinkedIn uses Samza for request tracing.

Each service logs events to Kafka using a shared request ID. Samza consumes these events, partitions by ID, and reconstructs the full request path in real time.

This shows what services were involved, where latency occurred, and what failed.

Similar workflows are used for distributed tracing, root-cause analysis, and service mapping.

‍

Stream ETL and Data Cleanup

Samza is often used for ETL, especially when the data needs to be cleaned before landing in storage.

Tasks include:

Deduplicating input records
Masking sensitive data
Reformatting schemas
Enriching events with metadata

These workflows write to HDFS, data lakes, or downstream queues. The same jobs can run in batch or stream mode, using the same logic.

‍

Metrics Aggregation and Summarization

Samza also processes internal metrics.

Instead of storing raw metrics and aggregating them later, Samza aggregates on the fly. This reduces load on downstream systems and enables faster alerts.

Jobs can produce histograms, counters, and time-based windows at scale.

‍

FAQ

‍

What is Apache Samza?

A distributed framework for real-time stream processing. It uses Apache Kafka and Apache YARN, supports stateful applications, and scales reliably in production.

‍

How is it different from Apache Hadoop?

Hadoop processes data in large batches. Samza processes it in real time. Samza also supports batch processing using the same code, but it’s built for responsiveness.

‍

What are the main components?

Streams, jobs, tasks, containers, local state stores, and Kafka changelogs.

‍

What kind of data can it process?

Any kind of event data. Logs, metrics, transactions, API events, sensor data.

‍

What languages does it support?

Primarily Java and Scala. It can also run Apache Beam pipelines.

‍

How does it handle state?

State is stored on disk, local to the processing job, and backed by a Kafka changelog for durability.

‍

Is it fault tolerant?

Yes. Jobs resume from Kafka offsets. State is restored from the changelog.

‍

Can it handle batch and streaming?

Yes. Samza supports the same codebase for both.

‍

How do you deploy it?

YARN, Kubernetes, or standalone. No special scheduler required.

‍

Is it fast?

Yes. Samza can handle millions of events per second with millisecond latency.

‍

Who uses it?

LinkedIn, Slack, Uber, Expedia, and others. Especially where Kafka is central.

‍

How does it compare to Kafka Streams?

Kafka Streams is a library for embedding stream processing into applications. Samza is a standalone framework that manages resources and state externally.

‍

Can I use it with a data lake?

Yes. Samza works well for ingesting data into HDFS, lakehouses, or warehouses.

‍

Is it open source?

Yes. Samza is an Apache project. Code and documentation are available on GitHub.

‍

What are its limitations?

Pluggability adds complexity. It was historically tied to YARN. Best suited for Kafka-based setups.

‍

Summary

Apache Samza is a production-grade framework for stateful stream processing. Built at LinkedIn and maintained by the Apache Software Foundation, it integrates with Kafka for messaging and YARN or Kubernetes for execution.

It handles fault tolerance, state recovery, and stream transformations at scale. While newer frameworks have gained attention, Samza remains reliable for teams that need correctness, speed, and control.

If you're already running Kafka and want something stable, flexible, and proven, Samza is still a solid choice.

Glossary