Set the bigframes.pandas.options.bigquery.ordering_mode = "partial" option to prevent BigQuery DataFrames from creating a deterministic ordering and sequential index. Later in this post, I show an example where this yields a more than 4,000,000x reduction in bytes scanned.
Without this option set, BigQuery DataFrames creates a sequential index over all rows to mimic pandas behavior. On large tables this can get expensive, because it forces a full table scan unless row and column filters are provided as arguments to read_gbq.
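For example, you can limit what the default index is built over by passing column and row filters directly to read_gbq. A minimal sketch, assuming the bigframes read_gbq columns and filters parameters (filters takes a list of (column, operator, value) tuples):
import datetime

import bigframes.pandas as bpd

# A sketch: ask read_gbq to read only the requested columns and rows, so the
# default sequential index is built over a much smaller result.
cutoff = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(days=1)
recent = bpd.read_gbq(
    "bigquery-public-data.pypi.file_downloads",
    columns=["timestamp", "project", "file"],
    filters=[("timestamp", ">", cutoff), ("project", "=", "bigframes")],
)
recent.peek()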
Partial ordering mode allows BigQuery DataFrames to push down many more row
and column filters. On large clustered and partitioned tables, this can
greatly reduce the number of bytes scanned and computation slots used.
A brief primer on indexes in pandas and BigQuery DataFrames
In pandas_gbq, when you download a BigQuery table as a DataFrame, pandas assigns a default index to the rows based on the order in which they were downloaded.
import pandas_gbq

# my_project is the ID of your Google Cloud project.
df = pandas_gbq.read_gbq(
    "bigquery-public-data.ml_datasets.penguins",
    project_id=my_project,
)
df
index | species | island | ...length... | ...depth... | flipper... | body... | sex
--- | --- | --- | --- | --- | --- | --- | ---
0 | Adelie Penguin (Pygoscelis adeliae) | Dream | 36.6 | 18.4 | 184.0 | 3475.0 | FEMALE
1 | Adelie Penguin (Pygoscelis adeliae) | Dream | 39.8 | 19.1 | 184.0 | 4650.0 | MALE
2 | Adelie Penguin (Pygoscelis adeliae) | Dream | 40.9 | 18.9 | 184.0 | 3900.0 | MALE
3 | Chinstrap penguin (Pygoscelis antarctica) | Dream | 46.5 | 17.9 | 192.0 | 3500.0 | FEMALE
4 | Adelie Penguin (Pygoscelis adeliae) | Dream | 37.3 | 16.8 | 192.0 | 3000.0 | FEMALE
5 | Adelie Penguin (Pygoscelis adeliae) | Dream | 43.2 | 18.5 | 192.0 | 4100.0 | MALE
6 | Chinstrap penguin (Pygoscelis antarctica) | Dream | 46.9 | 16.6 | 192.0 | 2700.0 | FEMALE
7 | Chinstrap penguin (Pygoscelis antarctica) | Dream | 50.5 | 18.4 | 200.0 | 3400.0 | FEMALE
8 | Chinstrap penguin (Pygoscelis antarctica) | Dream | 49.5 | 19.0 | 200.0 | 3800.0 | MALE
9 | Adelie Penguin (Pygoscelis adeliae) | Dream | 40.2 | 20.1 | 200.0 | 3975.0 | MALE
10 | Adelie Penguin (Pygoscelis adeliae) | Dream | 40.8 | 18.9 | 208.0 | 4300.0 | MALE
… | | | | | | |
The index for a particular row doesn’t change, even if you apply a
row filter to that DataFrame.
df[df["species"].str.contains("Chinstrap")]
index | species | island | ...length... | ...depth... | flipper... | body... | sex
--- | --- | --- | --- | --- | --- | --- | ---
3 | Chinstrap penguin (Pygoscelis antarctica) | Dream | 46.5 | 17.9 | 192.0 | 3500.0 | FEMALE
6 | Chinstrap penguin (Pygoscelis antarctica) | Dream | 46.9 | 16.6 | 192.0 | 2700.0 | FEMALE
7 | Chinstrap penguin (Pygoscelis antarctica) | Dream | 50.5 | 18.4 | 200.0 | 3400.0 | FEMALE
8 | Chinstrap penguin (Pygoscelis antarctica) | Dream | 49.5 | 19.0 | 200.0 | 3800.0 | MALE
13 | Chinstrap penguin (Pygoscelis antarctica) | Dream | 47.0 | 17.3 | 185.0 | 3700.0 | FEMALE
… | | | | | | |
Why does pandas do this? A big reason is that in pandas, when you
combine two objects, you are actually joining on an index.
import pandas as pd

s1 = pd.Series([1, 2, 3], index=[0, 2, 4], dtype="Int64")
s2 = pd.Series([4, 5, 6], index=[0, 1, 2], dtype="Int64")

# Align the two Series by their respective indexes and then add.
s1 + s2
# 0 5
# 1 <NA>
# 2 8
# 4 <NA>
# dtype: Int64
By preserving the index after the filter, pandas can still join the
results back to the original DataFrame. In a SQL system like BigQuery,
to do the same one needs a unique key to join by. pandas provides such a
key implicitly for all objects. (Aside: it is possible to create a
pandas object where the index is not unique. The behavior when joining
on the index can get a bit unpredictable in this case.)
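As a small illustration with a toy DataFrame, the preserved index labels are what let a filtered result align back against the original object:
import pandas as pd

df = pd.DataFrame({"value": [10, 15, 20, 25]})
evens = df[df["value"] % 2 == 0]  # keeps the original index labels 0 and 2

# Assigning a Series derived from the filtered frame aligns on those labels:
# rows 0 and 2 get values, rows 1 and 3 get missing values.
df["even_doubled"] = evens["value"] * 2
print(df)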
As a pandas-like API, BigQuery DataFrames (bigframes) also creates a
sequential index based on the row order if no other index, such as a
primary key, is available. Just like pandas, BigQuery DataFrames
uses this index to implicitly join DataFrame and Series objects.
To get determinism, BigQuery DataFrames hashes all columns to provide
a total ordering of the rows. To get a sequential
index, BigQuery DataFrames uses the ROW_NUMBER()
analytic operation. Notably, in BigQuery DataFrames and pandas, the
index values remain stable, even after row filters are applied.
s = pd.Series([1, 2, 3, 4, 5, 6], dtype="Int64")
s[s % 2 == 0]
# 1 2
# 3 4
# 5 6
# dtype: Int64
This allows two objects with different filters to be implicitly
joined together, matching on the same rows as if the filter hadn’t been
applied.
s = pd.Series([1, 2, 3, 4, 5, 6], dtype="Int64")
s[(1 < s) & (s < 6)]
# 1 2
# 2 3
# 3 4
# 4 5
# dtype: Int64
s[s % 2 == 0] + s[(1 < s) & (s < 6)]
# 1 4
# 2 <NA>
# 3 8
# 4 <NA>
# 5 <NA>
# dtype: Int64
While useful in some contexts, creating a sequential index is an expensive operation, especially on clustered and partitioned tables. The hash for deterministic ordering requires reading every column, so column filters can't reduce the bytes scanned, and the ROW_NUMBER() analytic operation over the whole table requires reading every row, so row filters can't be pushed down.
Partial ordering mode and NULL indexes
To prevent unexpected full table scans, BigQuery DataFrames warns with a bigframes.exceptions.DefaultIndexWarning when reading from partitioned and clustered tables with no index_col or filters defined. For tables without a suitable set of columns to use as the index_col, as of BigQuery DataFrames (bigframes) version 1.7.0, you can set index_col=bigframes.enums.DefaultIndexKind.NULL in read_gbq().
With this setting, the ROW_NUMBER() analytic operation is removed, allowing row filters to be pushed down. This can save a lot of bytes scanned in workloads on clustered and partitioned tables. The tradeoff is that unrelated DataFrame and Series objects can no longer be implicitly joined. In some cases, BigQuery DataFrames can determine that two objects are derived from the same table expression, but this isn't always possible.
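For example, here is a minimal sketch of requesting a NULL index for a single read (using the small penguins table for illustration):
import bigframes.enums
import bigframes.pandas as bpd

# Read without creating an index; operations that rely on implicit index
# alignment may not be available on the result.
penguins = bpd.read_gbq(
    "bigquery-public-data.ml_datasets.penguins",
    index_col=bigframes.enums.DefaultIndexKind.NULL,
)
penguins.peek()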
As of BigQuery DataFrames (bigframes) version 1.12.0, you can also disable the default ordering by setting the bigframes.pandas.options.bigquery.ordering_mode = "partial" option. This turns off the default sequential index and also sets the default index kind to NULL, allowing both row filters and column filters to work as expected.
Trying it out
Query the most recent day of data from the bigquery-public-data.pypi.file_downloads table, which is partitioned by the timestamp column and clustered by the project column, with ordering_mode = "partial". Do not try this with the default sequential index, as the table holds more than 375 TB of total logical bytes and is growing.
import datetime
import warnings

import bigframes.exceptions
import bigframes.pandas as bpd

# IMPORTANT: use partial ordering mode to allow filters to work as expected.
bpd.options.bigquery.ordering_mode = "partial"

# Filter out the relevant warnings for preview features used.
warnings.simplefilter("ignore", category=bigframes.exceptions.NullIndexPreviewWarning)
warnings.simplefilter("ignore", category=bigframes.exceptions.OrderingModePartialPreviewWarning)

# Show a preview of the previous day's downloads.
pypi = bpd.read_gbq("bigquery-public-data.pypi.file_downloads")
last_1_days = (
    datetime.datetime.now(datetime.timezone.utc)
    - datetime.timedelta(days=1)
)
bigframes_downloads = pypi[
    (pypi["timestamp"] > last_1_days)
    & (pypi["project"] == "bigframes")
]
bigframes_downloads[["timestamp", "project", "file"]].peek()
index | timestamp | project | file
--- | --- | --- | ---
0 | 2024-09-04 01:56:51+00:00 | bigframes | {'filename': 'bigframes-0.22.0-py2.py3-none-any.whl.metadata', 'project': 'bigframes', 'version': '0.22.0', 'type': 'bdist_wheel'}
1 | 2024-09-04 01:49:51+00:00 | bigframes | {'filename': 'bigframes-1.15.0-py2.py3-none-any.whl', 'project': 'bigframes', 'version': '1.15.0', 'type': 'bdist_wheel'}
2 | 2024-09-04 01:42:03+00:00 | bigframes | {'filename': 'bigframes-1.15.0-py2.py3-none-any.whl', 'project': 'bigframes', 'version': '1.15.0', 'type': 'bdist_wheel'}
3 | 2024-09-04 01:57:42+00:00 | bigframes | {'filename': 'bigframes-1.15.0-py2.py3-none-any.whl', 'project': 'bigframes', 'version': '1.15.0', 'type': 'bdist_wheel'}
4 | 2024-09-04 01:57:24+00:00 | bigframes | {'filename': 'bigframes-1.15.0-py2.py3-none-any.whl.metadata', 'project': 'bigframes', 'version': '1.15.0', 'type': 'bdist_wheel'}
If you remove bpd.options.bigquery.ordering_mode = "partial" (please don't), the query that BigQuery DataFrames generates processes 376.27 TB. With this option set, the generated query processes only 84.8 MB. That is a more than 4,000,000x reduction in bytes processed, and therefore in cost for on-demand queries!
Check out the Analyzing package downloads from PyPI with BigQuery DataFrames notebook for a more detailed look at this data using ordering_mode = "partial". Also, learn more about BigQuery DataFrames and its features in the Use BigQuery DataFrames guide.
Share your feedback
Partial ordering mode and NULL indexes are currently in Preview.
Is this feature helpful to you? Is there some functionality you need
that is missing? The BigQuery DataFrames team would love to hear from
you. Share your feedback at [email protected] or bit.ly/bigframes-feedback, or file an issue at the open source BigQuery DataFrames repository.