Set the bigframes.pandas.options.bigquery.ordering_mode = "partial" option to prevent BigQuery DataFrames from creating a deterministic ordering and sequential index. Later in this post, I show an example where this yields a more than 4,000,000x reduction in bytes scanned.
Without this option set, BigQuery DataFrames creates a sequential index over all rows to mimic pandas behavior. On large tables this can get expensive, because it forces a full table scan unless row and column filters are provided as arguments to read_gbq.
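For example, you can limit what the default index is built over by passing column and row filters directly to read_gbq. A minimal sketch, assuming the bigframes read_gbq columns and filters parameters (filters takes a list of (column, operator, value) tuples):
import datetime

import bigframes.pandas as bpd

# A sketch: ask read_gbq to read only the requested columns and rows, so the
# default sequential index is built over a much smaller result.
cutoff = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(days=1)
recent = bpd.read_gbq(
    "bigquery-public-data.pypi.file_downloads",
    columns=["timestamp", "project", "file"],
    filters=[("timestamp", ">", cutoff), ("project", "=", "bigframes")],
)
recent.peek()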
Partial ordering mode allows BigQuery DataFrames to push down many more row
and column filters. On large clustered and partitioned tables, this can
greatly reduce the number of bytes scanned and computation slots used.
A brief primer on indexes in pandas and BigQuery DataFrames
In pandas_gbq, when you download a BigQuery table as a DataFrame, pandas assigns a default index to the rows based on the order in which they were downloaded.
import pandas_gbq

# my_project is the ID of your Google Cloud project.
df = pandas_gbq.read_gbq(
    "bigquery-public-data.ml_datasets.penguins",
    project_id=my_project,
)
df
index | species | island | ...length... | ...depth... | flipper... | body... | sex
--- | --- | --- | --- | --- | --- | --- | ---
0 | Adelie Penguin (Pygoscelis adeliae) | Dream | 36.6 | 18.4 | 184.0 | 3475.0 | FEMALE
1 | Adelie Penguin (Pygoscelis adeliae) | Dream | 39.8 | 19.1 | 184.0 | 4650.0 | MALE
2 | Adelie Penguin (Pygoscelis adeliae) | Dream | 40.9 | 18.9 | 184.0 | 3900.0 | MALE
3 | Chinstrap penguin (Pygoscelis antarctica) | Dream | 46.5 | 17.9 | 192.0 | 3500.0 | FEMALE
4 | Adelie Penguin (Pygoscelis adeliae) | Dream | 37.3 | 16.8 | 192.0 | 3000.0 | FEMALE
5 | Adelie Penguin (Pygoscelis adeliae) | Dream | 43.2 | 18.5 | 192.0 | 4100.0 | MALE
6 | Chinstrap penguin (Pygoscelis antarctica) | Dream | 46.9 | 16.6 | 192.0 | 2700.0 | FEMALE
7 | Chinstrap penguin (Pygoscelis antarctica) | Dream | 50.5 | 18.4 | 200.0 | 3400.0 | FEMALE
8 | Chinstrap penguin (Pygoscelis antarctica) | Dream | 49.5 | 19.0 | 200.0 | 3800.0 | MALE
9 | Adelie Penguin (Pygoscelis adeliae) | Dream | 40.2 | 20.1 | 200.0 | 3975.0 | MALE
10 | Adelie Penguin (Pygoscelis adeliae) | Dream | 40.8 | 18.9 | 208.0 | 4300.0 | MALE
… | | | | | | |
The index for a particular row doesn’t change, even if you apply a
row filter to that DataFrame.
df[df["species"].str.contains("Chinstrap")]
index | species | island | ...length... | ...depth... | flipper... | body... | sex
--- | --- | --- | --- | --- | --- | --- | ---
3 | Chinstrap penguin (Pygoscelis antarctica) | Dream | 46.5 | 17.9 | 192.0 | 3500.0 | FEMALE
6 | Chinstrap penguin (Pygoscelis antarctica) | Dream | 46.9 | 16.6 | 192.0 | 2700.0 | FEMALE
7 | Chinstrap penguin (Pygoscelis antarctica) | Dream | 50.5 | 18.4 | 200.0 | 3400.0 | FEMALE
8 | Chinstrap penguin (Pygoscelis antarctica) | Dream | 49.5 | 19.0 | 200.0 | 3800.0 | MALE
13 | Chinstrap penguin (Pygoscelis antarctica) | Dream | 47.0 | 17.3 | 185.0 | 3700.0 | FEMALE
… | | | | | | |
Why does pandas do this? A big reason is that in pandas, when you
combine two objects, you are actually joining on an index.
import pandas as pd

s1 = pd.Series([1, 2, 3], index=[0, 2, 4], dtype="Int64")
s2 = pd.Series([4, 5, 6], index=[0, 1, 2], dtype="Int64")

# Align the two Series by their respective indexes and then add.
s1 + s2
# 0 5
# 1 <NA>
# 2 8
# 4 <NA>
# dtype: Int64
By preserving the index after the filter, pandas can still join the
results back to the original DataFrame. In a SQL system like BigQuery,
to do the same one needs a unique key to join by. pandas provides such a
key implicitly for all objects. (Aside: it is possible to create a
pandas object where the index is not unique. The behavior when joining
on the index can get a bit unpredictable in this case.)
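As a small illustration with a toy DataFrame, the preserved index labels are what let a filtered result align back against the original object:
import pandas as pd

df = pd.DataFrame({"value": [10, 15, 20, 25]})
evens = df[df["value"] % 2 == 0]  # keeps the original index labels 0 and 2

# Assigning a Series derived from the filtered frame aligns on those labels:
# rows 0 and 2 get values, rows 1 and 3 get missing values.
df["even_doubled"] = evens["value"] * 2
print(df)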
As a pandas-like API, BigQuery DataFrames (bigframes) also creates a
sequential index based on the row order if no other index, such as a
primary key, is available. Just like pandas, BigQuery DataFrames
uses this index to implicitly join DataFrame and Series objects.
To get determinism, BigQuery DataFrames hashes all columns to provide
a total ordering of the rows. To get a sequential
index, BigQuery DataFrames uses the ROW_NUMBER()
analytic operation. Notably, in BigQuery DataFrames and pandas, the
index values remain stable, even after row filters are applied.
s = pd.Series([1, 2, 3, 4, 5, 6], dtype="Int64")
s[s % 2 == 0]
# 1 2
# 3 4
# 5 6
# dtype: Int64
This allows two objects with different filters to be implicitly
joined together, matching on the same rows as if the filter hadn’t been
applied.
s = pd.Series([1, 2, 3, 4, 5, 6], dtype="Int64")
s[(1 < s) & (s < 6)]
# 1 2
# 2 3
# 3 4
# 4 5
# dtype: Int64
s[s % 2 == 0] + s[(1 < s) & (s < 6)]
# 1 4
# 2 <NA>
# 3 8
# 4 <NA>
# 5 <NA>
# dtype: Int64
While useful in some contexts, creating a sequential index is an expensive operation, especially on clustered and partitioned tables. The hash for deterministic ordering requires reading every column, so column filters can't reduce the bytes scanned, and the ROW_NUMBER() analytic operation over the whole table requires reading every row, so row filters can't be pushed down.
Partial ordering mode and NULL indexes
To prevent unexpected full table scans, BigQuery DataFrames warns with a bigframes.exceptions.DefaultIndexWarning when reading from partitioned and clustered tables with no index_col or filters defined. For tables without a suitable set of columns to use as the index_col, as of BigQuery DataFrames (bigframes) version 1.7.0, you can set index_col=bigframes.enums.DefaultIndexKind.NULL in read_gbq().
With this setting, the ROW_NUMBER() analytic operation is removed, allowing row filters to be pushed down. This can save a lot of bytes scanned in workloads on clustered and partitioned tables. The tradeoff is that unrelated DataFrame and Series objects can no longer be implicitly joined. In some cases, BigQuery DataFrames can determine that two objects are derived from the same table expression, but this isn't always possible.
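For example, here is a minimal sketch of requesting a NULL index for a single read (using the small penguins table for illustration):
import bigframes.enums
import bigframes.pandas as bpd

# Read without creating an index; operations that rely on implicit index
# alignment may not be available on the result.
penguins = bpd.read_gbq(
    "bigquery-public-data.ml_datasets.penguins",
    index_col=bigframes.enums.DefaultIndexKind.NULL,
)
penguins.peek()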
As of BigQuery DataFrames (bigframes) version 1.12.0, you can also disable the default ordering by setting the bigframes.pandas.options.bigquery.ordering_mode = "partial" option. This turns off the default sequential index and also sets the default index kind to NULL, allowing both row filters and column filters to work as expected.
Trying it out
Query the most recent day of data from the bigquery-public-data.pypi.file_downloads table, which is partitioned by the timestamp column and clustered by the project column, with ordering_mode = "partial". Do not try this with the default sequential index, as the table holds more than 375 TB of total logical bytes and is growing.
import datetime
import warnings

import bigframes.exceptions
import bigframes.pandas as bpd

# IMPORTANT: use partial ordering mode to allow filters to work as expected.
bpd.options.bigquery.ordering_mode = "partial"

# Filter out the relevant warnings for preview features used.
warnings.simplefilter("ignore", category=bigframes.exceptions.NullIndexPreviewWarning)
warnings.simplefilter("ignore", category=bigframes.exceptions.OrderingModePartialPreviewWarning)

# Show a preview of the previous day's downloads.
pypi = bpd.read_gbq("bigquery-public-data.pypi.file_downloads")
last_1_days = (
    datetime.datetime.now(datetime.timezone.utc)
    - datetime.timedelta(days=1)
)
bigframes_downloads = pypi[
    (pypi["timestamp"] > last_1_days)
    & (pypi["project"] == "bigframes")
]
bigframes_downloads[["timestamp", "project", "file"]].peek()
index | timestamp | project | file
--- | --- | --- | ---
0 | 2024-09-04 01:56:51+00:00 | bigframes | {'filename': 'bigframes-0.22.0-py2.py3-none-any.whl.metadata', 'project': 'bigframes', 'version': '0.22.0', 'type': 'bdist_wheel'}
1 | 2024-09-04 01:49:51+00:00 | bigframes | {'filename': 'bigframes-1.15.0-py2.py3-none-any.whl', 'project': 'bigframes', 'version': '1.15.0', 'type': 'bdist_wheel'}
2 | 2024-09-04 01:42:03+00:00 | bigframes | {'filename': 'bigframes-1.15.0-py2.py3-none-any.whl', 'project': 'bigframes', 'version': '1.15.0', 'type': 'bdist_wheel'}
3 | 2024-09-04 01:57:42+00:00 | bigframes | {'filename': 'bigframes-1.15.0-py2.py3-none-any.whl', 'project': 'bigframes', 'version': '1.15.0', 'type': 'bdist_wheel'}
4 | 2024-09-04 01:57:24+00:00 | bigframes | {'filename': 'bigframes-1.15.0-py2.py3-none-any.whl.metadata', 'project': 'bigframes', 'version': '1.15.0', 'type': 'bdist_wheel'}
If you remove bpd.options.bigquery.ordering_mode = "partial" (please don't), the query that BigQuery DataFrames generates processes 376.27 TB. With this option set, the generated query processes only 84.8 MB. That is a more than 4,000,000x reduction in bytes processed, and therefore in cost for on-demand queries!
Check out the Analyzing package downloads from PyPI with BigQuery DataFrames notebook for a more detailed look at this data using ordering_mode = "partial". Also, learn more about BigQuery DataFrames and its features in the Use BigQuery DataFrames guide.
Share your feedback
Partial ordering mode and NULL indexes are currently in Preview.
Is this feature helpful to you? Is there some functionality you need
that is missing? The BigQuery DataFrames team would love to hear from
you. Share your feedback at [email protected] or bit.ly/bigframes-feedback, or file an issue at the open source BigQuery DataFrames repository.