Apache Hive
Apache Hive lets you work with massive datasets using a language most teams already know.
Built on Hadoop, it gives you a SQL-like interface (HiveQL) to query, write, and manage structured data stored in systems like HDFS or Amazon S3.
Hive isn’t built for real-time analytics.
But if you need to run batch queries over terabytes or petabytes of data, it still does the job well.
What Is Apache Hive?
Apache Hive is an open-source data warehouse built for querying large volumes of data across distributed storage. It lets you write SQL-style queries that get translated into scalable execution plans.
With Hive, you don’t need to write low-level MapReduce code. You write in HiveQL, and the system compiles your queries to run on engines like Apache Tez, Spark, or legacy MapReduce.
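For instance, a routine aggregation reads like ordinary SQL (the table and column names here are illustrative):

```sql
-- Illustrative query: orders, region, and order_date are hypothetical names.
-- Hive compiles this statement into a DAG of tasks on Tez, Spark, or MapReduce.
SELECT region, COUNT(*) AS order_count
FROM orders
WHERE order_date >= '2024-01-01'
GROUP BY region;
```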
Hive works best when:
- You store large datasets across HDFS, S3, or other compatible systems
- You need to run recurring transformations or reports
- You want to manage schema centrally without tightly coupling storage
It follows a schema-on-read model: data is stored as-is, and a schema is applied when the data is read rather than when it is written.
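As a minimal sketch of schema-on-read, the external table below (path and names are hypothetical) layers a schema over files that already exist in storage; the files themselves are never rewritten:

```sql
-- Hypothetical example: the schema is applied at read time.
-- The files under LOCATION are left exactly as they were ingested.
CREATE EXTERNAL TABLE web_logs (
  ip      STRING,
  request STRING,
  status  INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/raw/web_logs';
```

Dropping an external table removes only the metadata; the underlying files stay in place.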
Core Components of Apache Hive
Hive is made up of several services and layers that coordinate how a query is parsed, optimized, and executed.
- Hive Metastore
Stores metadata: table names, columns, data types, file locations, and partitioning details.
Used during query planning by Hive and external engines like Spark and Trino.
Backed by a relational database such as MySQL or PostgreSQL.
- Hive Driver
Manages query sessions. Receives HiveQL, tracks state, and collects output.
- Compiler
Takes HiveQL input and builds an execution plan. Parses it into an abstract syntax tree, validates metadata, then produces a DAG of tasks.
- Optimizer
Restructures the DAG for performance. Pushes filters early, reduces I/O, and eliminates redundant operations.
Powered by Apache Calcite.
- Execution Engine
Runs the physical tasks. Hive supports:
- Apache Tez (low-latency batch execution)
- Apache Spark (distributed in-memory)
- MapReduce (older, slower)
Tasks run across the cluster using YARN as the resource manager.
- Storage Layer
Hive reads and writes to external storage systems:
- HDFS
- Amazon S3
- Azure Data Lake
- Google Cloud Storage
It supports formats like ORC, Parquet, Avro, and text-based files.
Format selection affects compression, performance, and compatibility.
- HiveQL
HiveQL resembles standard SQL. You can SELECT, JOIN, GROUP BY, and filter.
It also supports (see the sketch after this list):
- Scalar and aggregate UDFs
- Table-generating functions
- Partitioned and bucketed tables
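The sketch below (all names hypothetical) combines a partitioned, bucketed table with explode(), one of Hive's built-in table-generating functions:

```sql
-- Hypothetical table partitioned by date and bucketed by user_id,
-- which helps with join performance and sampling.
CREATE TABLE events (
  user_id BIGINT,
  action  STRING,
  tags    ARRAY<STRING>
)
PARTITIONED BY (dt STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS
STORED AS ORC;

-- explode() emits one output row per array element.
SELECT user_id, tag
FROM events
LATERAL VIEW explode(tags) t AS tag
WHERE dt = '2024-01-01';
```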
How Hive Executes a Query
1. The query is submitted through the CLI, Beeline, JDBC, or an API
2. The compiler builds a logical plan using metadata from the Hive Metastore
3. The optimizer rewrites the plan to reduce cost
4. The execution engine runs the plan across distributed workers
5. Results are returned to the client or stored as new tables
If ACID tables are used, Hive applies transaction logic and compaction in the background.
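As a sketch, ACID operations require a table created as transactional, which in turn requires ORC storage and a cluster configured for Hive transactions:

```sql
-- Assumes Hive ACID is enabled on the cluster; full ACID tables must use ORC.
CREATE TABLE accounts (
  id      BIGINT,
  balance DECIMAL(18, 2)
)
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

-- Row-level changes land in delta files that background compaction later merges.
UPDATE accounts SET balance = balance - 100 WHERE id = 42;
DELETE FROM accounts WHERE balance = 0;
```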
Why Apache Hive Is Still in Use
Batch Workloads at Scale
Hive runs best on long-running queries and scheduled transformations. It is commonly used for:
- Daily or hourly ETL pipelines
- Analytical reporting on historical data
- Backfill jobs across archived partitions
HiveQL lets teams reuse SQL knowledge. That reduces training time and keeps analysts productive.
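A common shape for these pipelines is an idempotent partition overwrite (table and column names below are illustrative), which makes re-runs and backfills safe:

```sql
-- Illustrative daily ETL step: recomputing a single partition is idempotent,
-- so the same job can be re-run for a failed or backfilled day.
INSERT OVERWRITE TABLE daily_sales PARTITION (dt = '2024-01-01')
SELECT product_id, SUM(amount) AS total_amount
FROM raw_transactions
WHERE dt = '2024-01-01'
GROUP BY product_id;
```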
Cloud Object Store Support
Hive works with S3, ADLS, GCS, and other distributed file systems, which enables cost-effective separation of compute and storage (a sketch follows the examples below).
Examples:
- FINRA processes over 90 billion trade events daily using Hive on EMR
- Vanguard supports 150 analysts querying S3 using a shared Hive Metastore
- Guardian uses Hive to power batch analytics for insurance products
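That separation looks like this in practice; the bucket and table names below are hypothetical:

```sql
-- Hypothetical S3-backed table: data lives in the object store, so compute
-- clusters can be resized or torn down without touching the storage.
CREATE EXTERNAL TABLE trades (
  trade_id BIGINT,
  symbol   STRING,
  price    DECIMAL(18, 4)
)
PARTITIONED BY (trade_date STRING)
STORED AS PARQUET
LOCATION 's3a://example-bucket/warehouse/trades/';
```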
Interoperability with the Ecosystem
Hive sits at the center of many data stacks. It integrates with:
- Apache Ranger for access control
- Apache Atlas for lineage
- Apache Iceberg for modern table management
- BI tools via JDBC or ODBC
It also serves engines like Presto and Trino, which rely on the Hive Metastore for schema access.
FAQ
What is Apache Hive used for?
To run batch queries, manage structured metadata, and process large datasets stored in distributed file systems.
Is Hive a database?
No. Hive is a SQL engine that sits on top of distributed storage. It manages metadata and delegates execution to engines like Tez or Spark.
What is HiveQL?
A declarative language similar to SQL. It supports filtering, joins, grouping, and extensions like user-defined functions.
Does Hive support cloud storage?
Yes. Hive works with Amazon S3, Azure Data Lake, and Google Cloud Storage.
Which file formats are supported?
ORC, Parquet, Avro, RCFile, and plain text files.
What is the Hive Metastore?
A catalog service that stores metadata about tables, partitions, and file paths. Shared across Hive, Spark, and Presto.
What execution engines does Hive use?
Apache Tez, Apache Spark, and MapReduce. Tez is the most common today.
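The engine can be switched per session:

```sql
-- hive.execution.engine accepts tez, spark, or mr (classic MapReduce).
SET hive.execution.engine=tez;
```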
Does Hive support transactions?
Yes. Hive supports ACID operations including INSERT, UPDATE, and DELETE on compliant tables.
Can I use Hive with BI tools?
Yes. Hive supports JDBC and ODBC connections used by Tableau, Power BI, and similar tools.
Is Hive still relevant?
Yes. For structured batch processing over large datasets, Hive remains stable, extensible, and widely adopted.
Summary
Apache Hive gives you a practical way to query large datasets using SQL syntax.
It compiles HiveQL into distributed jobs that run on Hadoop-compatible engines. It supports schema-on-read, flexible storage formats, and centralized metadata. It connects to cloud services and works well with other components in the open-source data ecosystem.
It is not designed for real-time dashboards or low-latency apps. But if your workload involves batch querying, ETL, or structured reporting at scale, Hive remains a strong option.