BigQuery ML

BigQuery ML removes the need to move data or switch tools.

You train and use models where the data already lives, in BigQuery. No Python scripts. No exports. No separate ML platforms. Just SQL.

You can use BigQuery ML to predict outcomes, group users, recommend products, or forecast trends. Everything happens in place, with queries you already know how to write.

‍

What is BigQuery ML?

BigQuery ML lets you create, train, and use machine learning models directly in BigQuery using standard SQL.

You define models with CREATE MODEL, evaluate them with ML.EVALUATE, and use them for predictions with ML.PREDICT.

No extra tools. No data movement. No switching environments.

BigQuery ML supports many model types:

Logistic regression for classification
Linear regression for numeric prediction
K-means clustering for grouping records
ARIMA_PLUS for time series forecasting
Matrix factorization for recommendations
PCA for feature reduction

You can also run predictions on models built in TensorFlow, XGBoost, or ONNX by importing them or connecting to Vertex AI.

This setup keeps everything in your warehouse. It uses the SQL skills your team already has.

‍

Supported Model Types in BigQuery ML

Each model type is trained using SQL and stored in BigQuery like a table or view.

‍

Classification and Regression

Logistic regression: For binary or multi-class classification. Example: predict if a visitor will make a transaction.
Linear regression: For predicting continuous values. Example: estimate revenue per user.

Both models use labeled training data. They are a good fit when you know the outcome you want to predict.

Clustering and Dimensionality Reduction

K-means: Finds patterns and groups records. Useful for audience segmentation.
PCA (Principal Component Analysis): Reduces wide datasets by keeping the structure and cutting down the number of features.

Time Series Forecasting

ARIMA_PLUS: For predicting values over time. It handles missing data, seasonal effects, and anomalies.

Recommendations

Matrix factorization: Creates recommendation models using past user-item interactions.

External and Imported Models

Use REMOTE MODEL to call models hosted on Vertex AI.
Import models trained in TensorFlow, XGBoost, or ONNX using Cloud Storage.

These models support advanced use cases while keeping prediction logic in SQL.

When to Use Each Model

Use logistic regression for yes/no outcomes like transactions or clicks.
Use linear regression for numeric estimates like price or sales volume.
Use clustering to find structure in unlabeled data.
Use ARIMA_PLUS for time-based data like demand forecasts.

All models run in BigQuery using SQL. No extra infrastructure or scripting is required.

‍

How Model Training Works in BigQuery ML

Training a model starts with one SQL statement.

You set the model type, label, and features. BigQuery handles the split, training, and storage.

Example: Create the Model

‍

CREATE OR REPLACE MODEL `bqml_tutorial.sample_model`
OPTIONS(model_type = 'logistic_reg') AS
SELECT
  IF(totals.transactions IS NULL, 0, 1) AS label,
  IFNULL(device.operatingSystem, '') AS os,
  device.isMobile AS is_mobile,
  IFNULL(geoNetwork.country, '') AS country,
  IFNULL(totals.pageviews, 0) AS pageviews
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
  _TABLE_SUFFIX BETWEEN '20160801' AND '20170630';

‍This model is trained to predict if a visitor will make a transaction. It uses operating system, mobile status, country, and pageviews as inputs.

‍

Data Splitting

BigQuery ML splits your data automatically:

Under 500 rows: all for training
500 to 50,000 rows: 80% training, 20% evaluation
Over 50,000 rows: 10,000 rows for evaluation

You can set your own split using DATA_SPLIT_METHOD.

‍

Training Optimization

BigQuery trains the model by minimizing a loss function. For logistic regression, this is log loss.

You can monitor training with [ML.TRAINING](<http://ml.training/>)_INFO, which shows how loss changes over each step.

‍

Evaluating Model Accuracy

After training, use ML.EVALUATE to see how the model performs.

‍

SELECT *
FROM ML.EVALUATE(
  MODEL `bqml_tutorial.sample_model`,
  (
    SELECT
      IF(totals.transactions IS NULL, 0, 1) AS label,
      IFNULL(device.operatingSystem, '') AS os,
      device.isMobile AS is_mobile,
      IFNULL(geoNetwork.country, '') AS country,
      IFNULL(totals.pageviews, 0) AS pageviews
    FROM
      `bigquery-public-data.google_analytics_sample.ga_sessions_*`
    WHERE
      _TABLE_SUFFIX BETWEEN '20170701' AND '20170801'
  )
);

This uses data from a different time range than training. It helps check how well the model works on new data.

Evaluation includes:

Accuracy
Precision
Recall
F1 score
Log loss
ROC AUC

‍

Making Predictions with BigQuery ML

Use ML.PREDICT to run the model on new data.

‍

Example: Country-Level Predictions

SELECT
  country,
  SUM(predicted_label) AS total_predicted_purchases
FROM
  ML.PREDICT(
    MODEL `bqml_tutorial.sample_model`,
    (
      SELECT
        IFNULL(device.operatingSystem, '') AS os,
        device.isMobile AS is_mobile,
        IFNULL(geoNetwork.country, '') AS country,
        IFNULL(totals.pageviews, 0) AS pageviews
      FROM
        `bigquery-public-data.google_analytics_sample.ga_sessions_*`
      WHERE
        _TABLE_SUFFIX BETWEEN '20170701' AND '20170801'
    )
  )
GROUP BY country
ORDER BY total_predicted_purchases DESC
LIMIT 10;

‍

Example: User-Level Predictions

SELECT
  fullVisitorId,
  SUM(predicted_label) AS total_predicted_purchases
FROM
  ML.PREDICT(
    MODEL `bqml_tutorial.sample_model`,
    (
      SELECT
        IFNULL(device.operatingSystem, '') AS os,
        device.isMobile AS is_mobile,
        IFNULL(geoNetwork.country, '') AS country,
        IFNULL(totals.pageviews, 0) AS pageviews,
        fullVisitorId
      FROM
        `bigquery-public-data.google_analytics_sample.ga_sessions_*`
      WHERE
        _TABLE_SUFFIX BETWEEN '20170701' AND '20170801'
    )
  )
GROUP BY fullVisitorId
ORDER BY total_predicted_purchases DESC
LIMIT 10;

‍Make sure the structure of your prediction query matches your training query. Use the same column names, types, and logic.

‍

Managing Models in BigQuery ML

Models are stored in your datasets like tables or views.

You can reference them like this:

`project_id.dataset_name.model_name`

Use IAM roles to control access. You can check model details with [ML.TRAINING](<http://ml.training/>)_INFO.

‍

Billing and Costs

You are billed for:

Training: based on bytes processed
Prediction: based on input rows
Storage: based on model size

Remote and imported models only incur costs when used.

‍

Model Deployment

You can:

Keep the model in BigQuery for batch queries
Export it to Vertex AI for live predictions
Save it as TensorFlow, XGBoost, or ONNX
Archive it in Cloud Storage

‍

Versioning

Overwriting a model deletes the old version. If you want versioning, include dates or tags in the model name.

There’s no built-in registry, but you can track versions manually using metadata or external tracking.

‍

When to Use BigQuery ML

Use BigQuery ML when:

Your data is in BigQuery
Your team works in SQL
Your use case fits regression, classification, clustering, or forecasting
You don’t need low-latency or GPU-based inference

It is not designed for deep learning or unstructured data, but works well for structured, tabular use cases.

‍

FAQ

Can I use BigQuery ML without Python?

Yes. All modeling is done in SQL.

‍

What models are supported?

Logistic regression, linear regression, k-means, matrix factorization, ARIMA_PLUS, PCA, and imported models.

‍

How is training data split?

Automatically, based on size. You can also set your own split ratio.

‍

What does ML.EVALUATE return?

Classification models return accuracy, precision, recall, F1, log loss, and ROC AUC. Regression models return error metrics and R-squared.

‍

Can I use predictions in other queries?

Yes. Prediction output is a standard table you can join or use in reporting.

‍

Does input schema need to match training?

Yes. Keep names, types, and preprocessing logic the same.

‍

Can I forecast time series data?

Yes. Use the ARIMA_PLUS model.

‍

Can I export or deploy models?

Yes. You can export to Cloud Storage or register with Vertex AI.

‍

Is this production ready?

Yes, for batch workloads. For real-time inference, consider Vertex AI.

‍

How is billing handled?

Based on data scanned during training, rows used in prediction, and model storage.

‍

Can I version models?

Not by default. Use naming conventions and track metadata separately.

‍

Can I join predictions with other data?

Yes. Use standard SQL joins to combine predictions with other tables.

‍

When should I avoid BigQuery ML?

When you need GPU-based inference, custom neural networks, or low-latency responses.

‍

Summary

BigQuery ML lets you build machine learning models with SQL, using data already in your warehouse.

It supports common model types like logistic regression, linear regression, clustering, forecasting, and recommendations. You can also run imported models trained outside BigQuery.

Models are defined with SQL, trained on in-place data, and used in queries. This simplifies workflows for teams working in BigQuery.

If you work with structured data, need fast iteration, and want to avoid switching tools, BigQuery ML is a practical solution for building models and using predictions where your data already lives.

Glossary