Cluster columns in BigQuery

In BigQuery, a clustered table physically organizes data based on the values of one or more selected columns. Clustering helps BigQuery optimize how data is stored and scanned during query execution.

Physically organizes data based on the values of selected columns
Improves performance of WHERE, GROUP BY, and ORDER BY queries
Reduces the amount of data scanned, resulting in lower query cost

Important Points about Clustering

You can cluster a table on up to 4 columns
Clustering works best with low-to-medium cardinality columns
Clustering is different from partitioning

Note: When clustering a table using multiple columns, the column order matters. The order determines which columns take precedence when BigQuery sorts and groups the data into storage blocks.

Create an Empty Clustered Table (UI)

Follow the steps below to create an empty clustered table using the BigQuery Console:

Step 1: Go to the BigQuery page here in the Google Cloud Console.

Step 2: In the Explorer pane, expand the project and select a dataset.

Step 3: In the Dataset info section, click + Create table.

Step 4: In the Create table panel, specify the following details:

In the Source section, select Empty table.
In the Destination section:
- Select the target dataset
- Enter the table name
- Ensure the Table type is set to Native table

Step 5: Define the schema of the table.

Step 6: In the Clustering order field, enter between one and four comma-separated column names.

Click Create table. After creation, the clustered columns are visible in the table details.

Create a Clustered Table using SQL

We can also create a partitioned and clustered table using SQL.

Example: Create a sales table clustered by country and customer_id, and partitioned by order_date.

SQL

CREATE TABLE `ashishcoder.Coding_Dataset.clustered_table`
(
  order_id     INT64,
  customer_id  INT64,
  country      STRING,
  order_date   DATE,
  total_amount FLOAT64
)
PARTITION BY order_date
CLUSTER BY country, customer_id;

What’s Happening Here

PARTITION BY order_date – splits data into daily partitions.
CLUSTER BY country, customer_id – organizes data inside each partition.

Verify Clustering on a Table

We can verify clustering information using the INFORMATION_SCHEMA.

SQL

SELECT
  table_name,
  column_name,
  ordinal_position,
  clustering_ordinal_position
FROM `ashishcoder.Coding_Dataset.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = 'my_clustered_table'
ORDER BY clustering_ordinal_position;

The output of the above query is shown below: