Glossary

Catalyst Optimizer

Spark SQL does more than run your query. It rewrites it first.

The Catalyst Optimizer sits in the middle of that process. It reads your query, checks what it needs, finds a better way to do it, and then hands off a plan that runs faster and uses fewer resources.

Catalyst is built in Scala. It uses functional programming to analyze, change, and improve queries before they hit your data. It handles column resolution and rule-based rewriting, and it even generates bytecode so Spark can run your logic efficiently across machines.

What makes Catalyst unique is not just the speed. It is the way it blends rule-based and cost-based methods, uses pattern matching to scan query trees, and exposes public extension points for custom logic. You can tweak how Spark runs your query without needing to fork the engine.

If you care about performance in Spark SQL, learning Catalyst is not optional.

What Is Catalyst Optimizer?

Catalyst Optimizer is Spark SQL’s query planning and optimization framework. It rewrites logical plans into optimized execution strategies using a mix of rules and code generation.

At the base, Catalyst provides a general library for turning queries into trees. These trees are not abstract ideas. Every operation becomes a node. Filters, joins, aggregates, and expressions are all represented as nodes with children.

Once the tree is built, Catalyst starts rewriting it. Rules written in Scala look for patterns in the tree and replace them with better versions.

There are two optimization modes:

  • Rule-based: prewritten rules that simplify logic or push filters closer to the data
  • Cost-based: generates multiple plans and chooses the one with the lowest estimated cost

Catalyst does this in four steps:

  1. Analysis: resolve column names and types
  2. Logical optimization: clean up and simplify the structure
  3. Physical planning: pick an execution plan using cost estimates
  4. Code generation: produce Java bytecode so the plan runs fast
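
You can watch these steps on a real query with the DataFrame API. A minimal sketch, assuming an active SparkSession named spark; the exact plan text varies by Spark version:

// Any small DataFrame works here.
val df = spark.range(100).selectExpr("id", "id + 1 AS next").filter("next > 10")

// Prints the parsed logical plan, the analyzed logical plan, the optimized
// logical plan, and the physical plan that will actually run.
df.explain(extended = true)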

Catalyst also supports extension. You can add support for custom types, storage formats, or optimizations without changing core code.

This modular setup, with clean rewrites and immutable trees, makes Catalyst easier to test, debug, and extend. It gives Spark SQL the flexibility to handle both standard queries and edge cases in complex data workflows.

How Catalyst Optimizer Works

Catalyst runs every query through four main stages. Each one focuses on a specific part of the job.

  1. Analysis

This step resolves missing details. When a query comes in, Catalyst may not yet know which column or table a reference points to. The plan at this stage is called an unresolved logical plan.

Catalyst uses a catalog to match column and table names. It fills in data types and assigns a unique ID to each column reference, so later optimizations can tell apart attributes that share a name and track the same expression across rewrites.

By the end of this phase, Catalyst has a fully checked logical plan that knows where each field comes from and what type it has.
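
You can see the difference before and after analysis through queryExecution. A small sketch, again assuming a SparkSession named spark; plan formatting differs between Spark versions:

val df = spark.range(10).toDF("id").filter("id > 5")

// The parsed plan still holds unresolved references (typically printed with a leading quote).
println(df.queryExecution.logical)

// The analyzed plan has resolved each column against the catalog and tagged it
// with a data type and a unique ID, for example id#0L.
println(df.queryExecution.analyzed)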

  2. Logical Optimization

Next, Catalyst applies rule-based rewrites. These rules do things like:

  • Combine constants: 1 + 2 becomes 3
  • Push filters closer to data: run filters before joins
  • Drop unused columns: remove extra fields early
  • Simplify expressions: remove unnecessary logic

Each rule works as a simple tree transformation. Rules are grouped into batches. Each batch runs until the plan stops changing. This avoids deep logic in any single rule but still allows complex behavior through repeated passes.
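
The effect shows up directly in the optimized plan. A hedged example, assuming a SparkSession named spark: the literal arithmetic below should be folded into a single constant, and the filter should end up next to the scan (exact output depends on the Spark version):

import org.apache.spark.sql.functions.{col, lit}

val df = spark.range(1000).toDF("id")
  .withColumn("bonus", lit(1) + lit(2)) // constant folding rewrites this as 3
  .filter(col("id") > 10)               // filter pushdown moves this toward the scan

// Compare the analyzed plan with the optimized plan to see the rules at work.
println(df.queryExecution.analyzed)
println(df.queryExecution.optimizedPlan)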

  3. Physical Planning

Now that the plan is clean, Catalyst translates it into one or more execution strategies. These are called physical plans.

It uses a cost model to estimate which plan is cheaper. It checks things like:

  • Will this join require a shuffle?
  • Can we broadcast this table?
  • Is there a better way to scan the source?

Catalyst also checks if the source supports filter or column pushdown. If it does, the plan is rewritten so the filtering happens during the scan, not after.
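
For the join questions above, you can influence the decision yourself with the public broadcast hint. A sketch, assuming a SparkSession named spark; whether Spark broadcasts automatically depends on table statistics and the spark.sql.autoBroadcastJoinThreshold setting:

import org.apache.spark.sql.functions.broadcast

val facts = spark.range(1000000).toDF("id")
val dims  = spark.range(100).toDF("id")

// The hint asks the planner to prefer a broadcast hash join over a shuffle-based join.
val joined = facts.join(broadcast(dims), "id")

// The physical plan should show a BroadcastHashJoin instead of a SortMergeJoin.
joined.explain()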

  4. Code Generation

In the final step, Catalyst compiles expressions into Java bytecode. This skips interpretation and removes function calls from the inner loop.

It uses Scala quasiquotes to write code templates. These templates are turned into Scala abstract syntax trees and then compiled at runtime into Java bytecode.

For example, an expression like (x + 1) is turned into row.get("x") + 1 and then compiled to bytecode.

This makes execution fast. No virtual calls, no per-row interpretation, just tight loops over the data.
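
The quasiquote idea looks roughly like this. A simplified sketch using plain Scala 2 reflection, not Catalyst's actual code generation internals; the row accessor is made up for illustration:

import scala.reflect.runtime.universe._

// Build a code template for "x + 1". The $ placeholder splices one Scala AST
// into another; the result can be compiled at runtime with a toolbox compiler.
val x    = q"row.getInt(0)"   // hypothetical accessor for column x
val expr = q"$x + 1"

println(showCode(expr))       // prints something like row.getInt(0).+(1)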

Trees and Rules

Catalyst models queries as trees. Each node is a specific action, like a filter, join, or column. Nodes can have children, which form the tree.

These trees are immutable. When Catalyst rewrites a plan, it builds a new tree using rules.

Rules are functions that match certain patterns. For example:

tree.transform {
  // Both sides are constants: fold them into a single literal
  case Add(Literal(a), Literal(b)) => Literal(a + b)
  // Adding zero changes nothing, so keep the other side
  case Add(left, Literal(0)) => left
  case Add(Literal(0), right) => right
}

This rule simplifies addition. If both sides are constants, it adds them. If one side is zero, it removes the operation.

Rules are grouped into batches and applied repeatedly until no changes happen. This fixed-point logic lets simple rules scale up to large rewrites.

Rules can be used to resolve columns, fold constants, apply cost-based planning, or push logic into external sources.

Catalyst also exposes rule hooks. Developers can write their own and register them. This is how you can change Catalyst without changing the engine.

Custom Extensions in Catalyst

Catalyst is not locked. You can extend it by plugging in your own logic. There are four ways to do this.

  1. External Source Pushdown

Catalyst can push filters or column selections into the data source. Instead of reading all data into Spark, the plan is rewritten to read only what is needed.

You can define how your source handles filters. Catalyst checks if the source can apply the filter, then rewrites the tree to push it down.

This avoids extra scanning, saves memory, and reduces shuffle.
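
With a columnar source such as Parquet, the pushdown is visible in the scan node of the physical plan. A sketch with a made-up path; the PushedFilters line and its exact format depend on the source and Spark version:

// Hypothetical dataset path with "name" and "age" columns.
val people = spark.read.parquet("/data/people")

val adults = people.select("name", "age").filter("age >= 18")

// The scan node should list the pushed filter, e.g. PushedFilters: [GreaterThanOrEqual(age,18)],
// and only the two selected columns, so the source skips everything else.
adults.explain()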

  2. Custom Optimization Rules

You can write your own optimization rules in Scala. These rules can clean up plans, simplify patterns, or apply business-specific logic.

Rules are grouped into named batches. You choose when to run them: during analysis, logical rewrite, or planning. Catalyst will repeat each batch until the tree no longer changes.

For example, you can write a rule to:

  • Replace repeated joins with cached reads
  • Drop unused projections
  • Rewrite slow filters into faster expressions
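
As a concrete sketch, here is a small rule that drops filters whose condition is literally true, registered through the public spark.experimental.extraOptimizations hook. The rule name and the pattern are illustrative, and the Catalyst classes it imports are developer APIs that can change between versions:

import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.types.BooleanType

// Removes Filter nodes whose predicate is the constant true, since they change nothing.
object RemoveTrivialFilters extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case Filter(Literal(true, BooleanType), child) => child
  }
}

// Register the rule so Catalyst runs it during logical optimization.
spark.experimental.extraOptimizations = Seq(RemoveTrivialFilters)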

  3. Custom Data Types

Catalyst supports user-defined types (UDTs). You can create new types and define how they are serialized and evaluated.

Once added, Catalyst treats them like built-in types. You can match them in rules, plan joins, or optimize filters based on them.

This is useful when your domain uses structured values that Spark doesn’t natively support.

  4. Bytecode Hooks

Catalyst uses quasiquotes to build compiled code. You can override or extend this by writing your own code generation logic.

If you want a new type or expression to generate faster code, you can write a quasiquote that compiles the expression directly.

Catalyst will merge your logic into the compiled output, avoiding extra steps or interpretation.

FAQ

What is Catalyst Optimizer?

Catalyst is Spark SQL’s query planner. It turns user queries into optimized plans using tree rewrites, rules, and cost models.

What does Catalyst optimize?

Catalyst rewrites how a query runs. It simplifies the logic, reorders operations, removes extra steps, and generates code that runs faster.

How are queries represented?

Each query is a tree. Nodes are actions like filters or joins. Catalyst rewrites the tree by applying rules to these nodes.

What types of optimization does Catalyst support?

  • Rule-based: constant folding, filter pushdown, projection pruning
  • Cost-based: choose the cheapest plan by estimating memory and CPU usage

How does Catalyst find places to optimize?

Catalyst uses pattern matching to scan the tree for known structures. Each match triggers a rule that rewrites the node.

Why run rules in batches?

Some rules depend on earlier changes. Running in batches until the tree stops changing allows small rules to build into big improvements.

Can developers extend Catalyst?

Yes. Catalyst allows plugins for rules, data types, code generation, and pushdown logic.

What is filter or projection pushdown?

Catalyst rewrites the plan to apply filters and selects during the scan, not after. This cuts down how much data Spark has to load and process.

What are quasiquotes?

Quasiquotes are a Scala feature that lets you write code as a template. Catalyst uses them to generate bytecode for expressions at runtime.

Is Catalyst used outside Spark SQL?

Not on its own. Catalyst is built for Spark SQL and the DataFrame API, but its design is general enough to reuse in other query engines.

Summary

Catalyst is Spark SQL’s query compiler. It rewrites queries into faster plans using a clean structure of trees, rules, and phases.

It works in four steps: resolve names, rewrite logic, pick a physical plan, and generate bytecode. Each step uses trees and pattern-based rules to stay clean and modular.

You can extend Catalyst with custom rules, types, pushdown logic, or bytecode generation. These changes do not require touching core internals.

Catalyst gives you control over how queries run without making Spark harder to use. It is predictable, testable, and built for real work at scale.
