Learning Parquet with DuckDB
I was interested in learning more about Parquet and picked DuckDB as a tool to explore Parquet datasets.
It wasn't easy to find good Parquet datasets, but I found two blog posts from MotherDuck (posts one and two) that helped. These posts link to a Stack Overflow dataset, which you can easily explore with DuckDB.
To do what's described in this blog post, you'll need to install the AWS CLI and DuckDB. To use the AWS CLI you'll need an AWS account. Once you're done with that, you can immediately start to explore the datasets:
$ aws s3 ls s3://us-prd-motherduck-open-datasets/stackoverflow/parquet/2023-05/ --human-readable
2023-07-15 07:01:48 0 Bytes
2023-07-15 11:16:11 517.2 MiB badges.parquet
2023-07-15 11:16:11 6.9 GiB comments.parquet
2023-07-15 11:16:11 163.3 MiB post_links.parquet
2023-08-16 07:55:13 4.1 GiB posts.parquet
2023-07-15 11:16:11 1.5 MiB tags.parquet
2023-07-15 11:17:01 733.7 MiB users.parquet
2023-07-15 11:23:34 2.2 GiB votes.parquet
I'm interested in the times of the queries I'll be trying, so before starting to run them, I turn on DuckDB's timer with:
.timer on
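One setup note: DuckDB handles s3:// URLs through its httpfs extension. Recent versions load it automatically, but on older ones you may need to load it yourself (and, depending on the version, configure AWS credentials for DuckDB as well):
-- load the extension that implements s3:// and https:// access
install httpfs;
load httpfs;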
Let's use the users.parquet dataset. You can see its schema like this:
describe (select * from 's3://us-prd-motherduck-open-datasets/stackoverflow/parquet/2023-05/users.parquet');
┌────────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │ column_type │ null │ key │ default │ extra │
│ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │
├────────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ Id │ BIGINT │ YES │ │ │ │
│ Reputation │ BIGINT │ YES │ │ │ │
│ CreationDate │ TIMESTAMP │ YES │ │ │ │
│ DisplayName │ VARCHAR │ YES │ │ │ │
│ LastAccessDate │ TIMESTAMP │ YES │ │ │ │
│ AboutMe │ VARCHAR │ YES │ │ │ │
│ Views │ BIGINT │ YES │ │ │ │
│ UpVotes │ BIGINT │ YES │ │ │ │
│ DownVotes │ BIGINT │ YES │ │ │ │
└────────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┘
Run Time (s): real 0.793 user 0.015590 sys 0.007591
Running queries over S3 is going to be slow, especially from my laptop, so I copied the dataset locally. users.parquet is 734MB.
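For reference, the copy is a single AWS CLI call, in the same style as the listing above:
$ aws s3 cp s3://us-prd-motherduck-open-datasets/stackoverflow/parquet/2023-05/users.parquet .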
Let's run a simple query:
D select SUM(Views) from 'users.parquet' where DisplayName like 'Petko%';
┌──────────────┐
│ sum("Views") │
│ int128 │
├──────────────┤
│ 573 │
└──────────────┘
Run Time (s): real 0.207 user 1.364285 sys 0.099568
To talk about this query in more depth, I need to mention that Parquet files are organized into row groups. Each row group holds a horizontal slice of the table's rows, and within a row group the data is stored as column chunks. Parquet keeps statistics for each column chunk, such as its min and max values.
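You can inspect this layout yourself with DuckDB's parquet_metadata function (which we'll use again below). For example, to see how many rows each row group of users.parquet holds:
-- parquet_metadata returns one row per column chunk, so filtering to a
-- single column gives one row per row group
select row_group_id, row_group_num_rows
from parquet_metadata('users.parquet')
where path_in_schema = 'Id';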
This query took 0.207s. Note that the user + sys time is higher than that: the query ran on multiple threads, which is possible because the data is stored in separate row groups that can be scanned in parallel.
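To convince yourself the speedup comes from parallelism, you can pin DuckDB to a single thread and re-run the query; the real time should then be close to user + sys:
-- force single-threaded execution for comparison
set threads to 1;
select sum(Views) from 'users.parquet' where DisplayName like 'Petko%';
-- restore the default thread count afterwards
reset threads;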
We can explain this query too:
D explain analyze select SUM(Views) from 'users.parquet'
where DisplayName like 'Petko%';
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││ Query Profiling Information ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
explain analyze select SUM(Views) from 'users.parquet' where DisplayName like 'Petko%';
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││ Total Time: 0.202s ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
┌───────────────────────────┐
│ EXPLAIN_ANALYZE │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 0 │
│ (0.00s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ UNGROUPED_AGGREGATE │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ sum(#0) │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 1 │
│ (0.00s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ PROJECTION │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ Views │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 119 │
│ (0.00s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ FILTER │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│prefix(DisplayName, 'Petko'│
│ ) │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ EC: 3988557 │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 119 │
│ (0.01s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ PARQUET_SCAN │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ DisplayName │
│ Views │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│Filters: DisplayName>=Petko│
│ AND DisplayName<Petkp AND│
│ DisplayName IS NOT NULL │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ EC: 3988557 │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 119 │
│ (1.57s) │
└───────────────────────────┘
Run Time (s): real 0.209 user 1.357049 sys 0.104349
Look at this section:
┌─────────────┴─────────────┐
│ PARQUET_SCAN │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ DisplayName │
│ Views │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│Filters: DisplayName>=Petko│
│ AND DisplayName<Petkp AND│
│ DisplayName IS NOT NULL │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ EC: 3988557 │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 119 │
│ (1.57s) │
└───────────────────────────┘
To understand what's going on here and how DuckDB queries Parquet, it'll be helpful to read Querying Parquet with Precision using DuckDB.
What we see here is projection and filter pushdown, as described in that post. DuckDB doesn't fetch the entire Parquet file: it reads only the two columns the query touches (projection pushdown), and it can use the min/max statistics to skip row groups whose value ranges can't match the predicate (filter pushdown). After these pushdowns, the remaining FILTER and PROJECTION operators run inside DuckDB itself.
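A quick way to see the filter pushdown in isolation is to defeat it: wrap the column in a function call. A predicate like the one below can't be turned into a min/max comparison, so EXPLAIN should show a PARQUET_SCAN node without the Filters section, with the FILTER operator above it doing all the work:
-- the function call hides the raw column, so no scan-level filter is derived
explain select sum(Views) from 'users.parquet'
where lower(DisplayName) like 'petko%';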
This particular query, however, doesn't benefit much from the filter pushdown: the data is not sorted by DisplayName, so matching names can appear anywhere and the min/max ranges of most row groups overlap the filter, meaning few of them can be skipped. The data in users.parquet is sorted by Id:
D select Id from read_parquet('users.parquet') limit 10 offset 2000;
┌───────┐
│ Id │
│ int64 │
├───────┤
│ 2808 │
│ 2811 │
│ 2812 │
│ 2813 │
│ 2815 │
│ 2818 │
│ 2819 │
│ 2820 │
│ 2821 │
│ 2822 │
└───────┘
Run Time (s): real 0.009 user 0.016330 sys 0.003842
You can see this in the Parquet metadata as well: DuckDB lets you query it with the parquet_metadata function.
D select path_in_schema, row_group_id, stats_min_value, stats_max_value
from parquet_metadata('users.parquet') where (path_in_schema='Id') limit 10;
┌────────────────┬──────────────┬─────────────────┬─────────────────┐
│ path_in_schema │ row_group_id │ stats_min_value │ stats_max_value │
│ varchar │ int64 │ varchar │ varchar │
├────────────────┼──────────────┼─────────────────┼─────────────────┤
│ Id │ 0 │ -1014 │ 248850 │
│ Id │ 1 │ 248853 │ 412339 │
│ Id │ 2 │ 412342 │ 573874 │
│ Id │ 3 │ 573875 │ 735344 │
│ Id │ 4 │ 735345 │ 889114 │
│ Id │ 5 │ 889115 │ 1049570 │
│ Id │ 6 │ 1049572 │ 1211500 │
│ Id │ 7 │ 1211501 │ 1370352 │
│ Id │ 8 │ 1370353 │ 1479833 │
│ Id │ 9 │ 1479835 │ 1588348 │
├────────────────┴──────────────┴─────────────────┴─────────────────┤
│ 10 rows 4 columns │
└───────────────────────────────────────────────────────────────────┘
Run Time (s): real 0.025 user 0.023375 sys 0.000999
Unsurprisingly, queries which filter by Id are going to be much faster. Here's a query that returns 119 rows and filters on DisplayName:
D select count(*), sum(UpVotes) from 'users.parquet' where DisplayName like 'Petko%';
┌──────────────┬──────────────┐
│ count_star() │ sum(UpVotes) │
│ int64 │ int128 │
├──────────────┼──────────────┤
│ 119 │ 330 │
└──────────────┴──────────────┘
Run Time (s): real 0.201 user 1.363518 sys 0.100677
And here's a query that returns 3110 rows and filters on Id:
D select count(*), sum(UpVotes) from 'users.parquet' where Id >= 1000 * 1000 and Id <= 1000 * 1000 + 5000;
┌──────────────┬──────────────┐
│ count_star() │ sum(UpVotes) │
│ int64 │ int128 │
├──────────────┼──────────────┤
│ 3110 │ 235823 │
└──────────────┴──────────────┘
Run Time (s): real 0.012 user 0.014406 sys 0.003434
The latter is much faster. We can look at its EXPLAIN ANALYZE output too:
┌─────────────┴─────────────┐
│ PARQUET_SCAN │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ UpVotes │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│Filters: Id>=1000000 AND Id│
│ <=1005000 AND Id IS NOT │
│ NULL │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ EC: 3988557 │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ 3110 │
│ (0.02s) │
└───────────────────────────┘
As you can see, the scan took only 0.02s. Because the file is sorted by Id, the Parquet reader likely skipped most row groups and read only one or two.
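We can cross-check that against the statistics from earlier: a row group only has to be read if its [min, max] range overlaps the predicate. Something like the following should list the surviving group(s); judging from the stats shown above, that's just row group 5:
-- find the row groups whose Id range overlaps [1000000, 1005000]
select row_group_id, stats_min_value, stats_max_value
from parquet_metadata('users.parquet')
where path_in_schema = 'Id'
  and cast(stats_min_value as bigint) <= 1005000
  and cast(stats_max_value as bigint) >= 1000000;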
Conclusion
There are a few takeaways here for me:
- Querying Parquet is fastest when you filter on a column the data is sorted by (or at least mostly sorted by), because whole row groups can be skipped.
- Parquet queries can be quite fast even when they run directly against the Parquet files, without loading the data into a database first.
TODO
There are a couple of things I'm curious about exploring:
- Can I achieve faster execution times for some queries by splitting the data into directories? For example, I could partition by date and sort on DisplayName within each partition; that way, queries which filter on a date and a DisplayName prefix could possibly be made faster (see the sketch below).
- If I copy the Parquet file into a DuckDB table and create indexes on it, what kinds of queries will become faster?
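For the first idea, here's a rough, untested sketch using DuckDB's hive-partitioned writes; the creation_year column and the users_by_year directory are placeholders I made up:
-- write one directory per year; sorting the input clusters DisplayName
-- values together within each partition's row groups
copy (
    select *, year(CreationDate) as creation_year
    from 'users.parquet'
    order by creation_year, DisplayName
) to 'users_by_year' (format parquet, partition_by (creation_year));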