Glossary
BigQuery ML
BigQuery ML removes the need to move data or switch tools.
You train and use models where the data already lives, in BigQuery. No Python scripts. No exports. No separate ML platforms. Just SQL.
You can use BigQuery ML to predict outcomes, group users, recommend products, or forecast trends. Everything happens in place, with queries you already know how to write.
What is BigQuery ML?
BigQuery ML lets you create, train, and use machine learning models directly in BigQuery using standard SQL.
You define models with CREATE MODEL
, evaluate them with ML.EVALUATE
, and use them for predictions with ML.PREDICT
.
No extra tools. No data movement. No switching environments.
BigQuery ML supports many model types:
- Logistic regression for classification
- Linear regression for numeric prediction
- K-means clustering for grouping records
- ARIMA_PLUS for time series forecasting
- Matrix factorization for recommendations
- PCA for feature reduction
You can also run predictions on models built in TensorFlow, XGBoost, or ONNX by importing them or connecting to Vertex AI.
This setup keeps everything in your warehouse. It uses the SQL skills your team already has.
Supported Model Types in BigQuery ML
Each model type is trained using SQL and stored in BigQuery like a table or view.
Classification and Regression
- Logistic regression: For binary or multi-class classification. Example: predict if a visitor will make a transaction.
- Linear regression: For predicting continuous values. Example: estimate revenue per user.
Both models use labeled training data. They are a good fit when you know the outcome you want to predict.
Clustering and Dimensionality Reduction
- K-means: Finds patterns and groups records. Useful for audience segmentation.
- PCA (Principal Component Analysis): Reduces wide datasets by keeping the structure and cutting down the number of features.
Time Series Forecasting
- ARIMA_PLUS: For predicting values over time. It handles missing data, seasonal effects, and anomalies.
Recommendations
- Matrix factorization: Creates recommendation models using past user-item interactions.
External and Imported Models
- Use REMOTE MODEL to call models hosted on Vertex AI.
- Import models trained in TensorFlow, XGBoost, or ONNX using Cloud Storage.
These models support advanced use cases while keeping prediction logic in SQL.
When to Use Each Model
- Use logistic regression for yes/no outcomes like transactions or clicks.
- Use linear regression for numeric estimates like price or sales volume.
- Use clustering to find structure in unlabeled data.
- Use ARIMA_PLUS for time-based data like demand forecasts.
All models run in BigQuery using SQL. No extra infrastructure or scripting is required.
How Model Training Works in BigQuery ML
Training a model starts with one SQL statement.
You set the model type, label, and features. BigQuery handles the split, training, and storage.
Example: Create the Model
CREATE OR REPLACE MODEL `bqml_tutorial.sample_model`
OPTIONS(model_type = 'logistic_reg') AS
SELECT
IF(totals.transactions IS NULL, 0, 1) AS label,
IFNULL(device.operatingSystem, '') AS os,
device.isMobile AS is_mobile,
IFNULL(geoNetwork.country, '') AS country,
IFNULL(totals.pageviews, 0) AS pageviews
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
_TABLE_SUFFIX BETWEEN '20160801' AND '20170630';
This model is trained to predict if a visitor will make a transaction. It uses operating system, mobile status, country, and pageviews as inputs.
Data Splitting
BigQuery ML splits your data automatically:
- Under 500 rows: all for training
- 500 to 50,000 rows: 80% training, 20% evaluation
- Over 50,000 rows: 10,000 rows for evaluation
You can set your own split using DATA_SPLIT_METHOD
.
Training Optimization
BigQuery trains the model by minimizing a loss function. For logistic regression, this is log loss.
You can monitor training with [ML.TRAINING](<http://ml.training/>)_INFO
, which shows how loss changes over each step.
Evaluating Model Accuracy
After training, use ML.EVALUATE
to see how the model performs.
SELECT *
FROM ML.EVALUATE(
MODEL `bqml_tutorial.sample_model`,
(
SELECT
IF(totals.transactions IS NULL, 0, 1) AS label,
IFNULL(device.operatingSystem, '') AS os,
device.isMobile AS is_mobile,
IFNULL(geoNetwork.country, '') AS country,
IFNULL(totals.pageviews, 0) AS pageviews
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
_TABLE_SUFFIX BETWEEN '20170701' AND '20170801'
)
);
This uses data from a different time range than training. It helps check how well the model works on new data.
Evaluation includes:
- Accuracy
- Precision
- Recall
- F1 score
- Log loss
- ROC AUC
Making Predictions with BigQuery ML
Use ML.PREDICT
to run the model on new data.
Example: Country-Level Predictions
SELECT
country,
SUM(predicted_label) AS total_predicted_purchases
FROM
ML.PREDICT(
MODEL `bqml_tutorial.sample_model`,
(
SELECT
IFNULL(device.operatingSystem, '') AS os,
device.isMobile AS is_mobile,
IFNULL(geoNetwork.country, '') AS country,
IFNULL(totals.pageviews, 0) AS pageviews
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
_TABLE_SUFFIX BETWEEN '20170701' AND '20170801'
)
)
GROUP BY country
ORDER BY total_predicted_purchases DESC
LIMIT 10;
Example: User-Level Predictions
SELECT
fullVisitorId,
SUM(predicted_label) AS total_predicted_purchases
FROM
ML.PREDICT(
MODEL `bqml_tutorial.sample_model`,
(
SELECT
IFNULL(device.operatingSystem, '') AS os,
device.isMobile AS is_mobile,
IFNULL(geoNetwork.country, '') AS country,
IFNULL(totals.pageviews, 0) AS pageviews,
fullVisitorId
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
_TABLE_SUFFIX BETWEEN '20170701' AND '20170801'
)
)
GROUP BY fullVisitorId
ORDER BY total_predicted_purchases DESC
LIMIT 10;
Make sure the structure of your prediction query matches your training query. Use the same column names, types, and logic.
Managing Models in BigQuery ML
Models are stored in your datasets like tables or views.
You can reference them like this:
`project_id.dataset_name.model_name`
Use IAM roles to control access. You can check model details with [ML.TRAINING](<http://ml.training/>)_INFO
.
Billing and Costs
You are billed for:
- Training: based on bytes processed
- Prediction: based on input rows
- Storage: based on model size
Remote and imported models only incur costs when used.
Model Deployment
You can:
- Keep the model in BigQuery for batch queries
- Export it to Vertex AI for live predictions
- Save it as TensorFlow, XGBoost, or ONNX
- Archive it in Cloud Storage
Versioning
Overwriting a model deletes the old version. If you want versioning, include dates or tags in the model name.
There’s no built-in registry, but you can track versions manually using metadata or external tracking.
When to Use BigQuery ML
Use BigQuery ML when:
- Your data is in BigQuery
- Your team works in SQL
- Your use case fits regression, classification, clustering, or forecasting
- You don’t need low-latency or GPU-based inference
It is not designed for deep learning or unstructured data, but works well for structured, tabular use cases.
FAQ
Can I use BigQuery ML without Python?
Yes. All modeling is done in SQL.
What models are supported?
Logistic regression, linear regression, k-means, matrix factorization, ARIMA_PLUS, PCA, and imported models.
How is training data split?
Automatically, based on size. You can also set your own split ratio.
What does ML.EVALUATE return?
Classification models return accuracy, precision, recall, F1, log loss, and ROC AUC. Regression models return error metrics and R-squared.
Can I use predictions in other queries?
Yes. Prediction output is a standard table you can join or use in reporting.
Does input schema need to match training?
Yes. Keep names, types, and preprocessing logic the same.
Can I forecast time series data?
Yes. Use the ARIMA_PLUS model.
Can I export or deploy models?
Yes. You can export to Cloud Storage or register with Vertex AI.
Is this production ready?
Yes, for batch workloads. For real-time inference, consider Vertex AI.
How is billing handled?
Based on data scanned during training, rows used in prediction, and model storage.
Can I version models?
Not by default. Use naming conventions and track metadata separately.
Can I join predictions with other data?
Yes. Use standard SQL joins to combine predictions with other tables.
When should I avoid BigQuery ML?
When you need GPU-based inference, custom neural networks, or low-latency responses.
Summary
BigQuery ML lets you build machine learning models with SQL, using data already in your warehouse.
It supports common model types like logistic regression, linear regression, clustering, forecasting, and recommendations. You can also run imported models trained outside BigQuery.
Models are defined with SQL, trained on in-place data, and used in queries. This simplifies workflows for teams working in BigQuery.
If you work with structured data, need fast iteration, and want to avoid switching tools, BigQuery ML is a practical solution for building models and using predictions where your data already lives.
A wide array of use-cases
Discover how we can help your data into your most valuable asset.
We help businesses boost revenue, save time, and make smarter decisions with Data and AI