Learning Parquet with DuckDB

I was interested in learning more about Parquet and picked DuckDB as a tool to explore Parquet datasets.

It wasn't easy to find good Parquet datasets, but I found two MotherDuck blog posts (post one and two) that helped with that. These posts link to a Stack Overflow dataset, which you can easily explore with DuckDB.

To do what's described in this blog post, you'll need to install the AWS CLI and DuckDB. To use the AWS CLI you'll need an AWS account. Once you're done with that, you can immediately start to explore the datasets:

$ aws s3 ls s3://us-prd-motherduck-open-datasets/stackoverflow/parquet/2023-05/ --human-readable
2023-07-15 07:01:48    0 Bytes 
2023-07-15 11:16:11  517.2 MiB badges.parquet
2023-07-15 11:16:11    6.9 GiB comments.parquet
2023-07-15 11:16:11  163.3 MiB post_links.parquet
2023-08-16 07:55:13    4.1 GiB posts.parquet
2023-07-15 11:16:11    1.5 MiB tags.parquet
2023-07-15 11:17:01  733.7 MiB users.parquet
2023-07-15 11:23:34    2.2 GiB votes.parquet

I'm interested in the times of the queries I'll be trying, so before starting to run them, I turn on DuckDB's timer with:

.timer on

Let's use the users.parquet dataset. You can see its schema like this:

describe (select * from 's3://us-prd-motherduck-open-datasets/stackoverflow/parquet/2023-05/users.parquet');
┌────────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│  column_name   │ column_type │  null   │   key   │ default │  extra  │
│    varchar     │   varchar   │ varchar │ varchar │ varchar │ varchar │
├────────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ Id             │ BIGINT      │ YES     │         │         │         │
│ Reputation     │ BIGINT      │ YES     │         │         │         │
│ CreationDate   │ TIMESTAMP   │ YES     │         │         │         │
│ DisplayName    │ VARCHAR     │ YES     │         │         │         │
│ LastAccessDate │ TIMESTAMP   │ YES     │         │         │         │
│ AboutMe        │ VARCHAR     │ YES     │         │         │         │
│ Views          │ BIGINT      │ YES     │         │         │         │
│ UpVotes        │ BIGINT      │ YES     │         │         │         │
│ DownVotes      │ BIGINT      │ YES     │         │         │         │
└────────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┘
Run Time (s): real 0.793 user 0.015590 sys 0.007591

Running queries over S3 is going to be slow, especially from my laptop, so I copied the dataset locally. users.parquet is 734MB.
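
If you want to do the same, the AWS CLI can copy it down (same bucket path as in the listing above):

$ aws s3 cp s3://us-prd-motherduck-open-datasets/stackoverflow/parquet/2023-05/users.parquet .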

Let's run a simple query:

D select SUM(Views) from 'users.parquet' where DisplayName like 'Petko%';
┌──────────────┐
│ sum("Views") │
│    int128    │
├──────────────┤
│          573 │
└──────────────┘
Run Time (s): real 0.207 user 1.364285 sys 0.099568

To talk about this query in more depth, I need to mention that Parquet files are organized into row groups. A row group holds a batch of rows; inside a row group the data is stored as one column chunk per column, and Parquet keeps statistics for each column chunk, like its min and max values.
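
You can count the row groups in users.parquet yourself with the parquet_metadata() function, which I'll come back to later in this post:

D select count(distinct row_group_id) as row_groups from parquet_metadata('users.parquet');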

This query took 0.207s. Note that the user + sys time is greater than the real time. That's because the query ran on multiple threads, which is possible because the data is split into separate row groups that can be scanned in parallel.
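
DuckDB exposes the number of threads as a setting, so you can check it, or pin it down to see how it affects the timings (the value 4 here is just an example):

D select current_setting('threads');
D set threads = 4;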

We can explain this query too:

D explain analyze select SUM(Views) from 'users.parquet'
    where DisplayName like 'Petko%';

┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││    Query Profiling Information    ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
explain analyze select SUM(Views) from 'users.parquet' where DisplayName like 'Petko%';
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││         Total Time: 0.202s        ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
┌───────────────────────────┐
│      EXPLAIN_ANALYZE      │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             0             │
│          (0.00s)          │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│    UNGROUPED_AGGREGATE    │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│          sum(#0)          │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             1             │
│          (0.00s)          │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│         PROJECTION        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│           Views           │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│            119            │
│          (0.00s)          │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│           FILTER          │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│prefix(DisplayName, 'Petko'│
│             )             │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│        EC: 3988557        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│            119            │
│          (0.01s)          │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│       PARQUET_SCAN        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│        DisplayName        │
│           Views           │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│Filters: DisplayName>=Petko│
│  AND DisplayName<Petkp AND│
│   DisplayName IS NOT NULL │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│        EC: 3988557        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│            119            │
│          (1.57s)          │
└───────────────────────────┘                             
Run Time (s): real 0.209 user 1.357049 sys 0.104349

Look at this section:

┌─────────────┴─────────────┐
│       PARQUET_SCAN        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│        DisplayName        │
│           Views           │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│Filters: DisplayName>=Petko│
│  AND DisplayName<Petkp AND│
│   DisplayName IS NOT NULL │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│        EC: 3988557        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│            119            │
│          (1.57s)          │
└───────────────────────────┘      

To understand what's going on here and how DuckDB queries Parquet, it'll be helpful to read Querying Parquet with Precision using DuckDB.

What we see here is projection and filter pushdown, as described here. DuckDB doesn't fetch the entire Parquet file; it fetches only the two columns the query needs (projection pushdown). It can also use the min/max statistics to skip entire row groups (filter pushdown). After these pushdowns, the remaining FILTER and PROJECTION operators run inside DuckDB itself. (The 1.57s shown for PARQUET_SCAN is larger than the total run time because operator timings are summed across threads.)
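
In fact, the plan shows that DuckDB rewrote the LIKE 'Petko%' prefix filter into a range predicate it can check against the min/max statistics; an equivalent way to write the query would be:

D select sum(Views) from 'users.parquet'
    where DisplayName >= 'Petko' and DisplayName < 'Petkp';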

This query, however, doesn't benefit much from the filter pushdown: the data is not sorted by DisplayName, so each row group's min/max for that column likely spans most of the alphabet and no row groups can be skipped. The data in users.parquet is sorted by Id instead:

D select Id from read_parquet('users.parquet') limit 10 offset 2000;
┌───────┐
│  Id   │
│ int64 │
├───────┤
│  2808 │
│  2811 │
│  2812 │
│  2813 │
│  2815 │
│  2818 │
│  2819 │
│  2820 │
│  2821 │
│  2822 │
└───────┘
Run Time (s): real 0.009 user 0.016330 sys 0.003842

You can see this in the Parquet metadata as well. DuckDB offers the parquet_metadata() function to query it (link).

D select path_in_schema, row_group_id, stats_min_value, stats_max_value
  from parquet_metadata('users.parquet') where (path_in_schema='Id') limit 10;
┌────────────────┬──────────────┬─────────────────┬─────────────────┐
│ path_in_schema │ row_group_id │ stats_min_value │ stats_max_value │
│    varchar     │    int64     │     varchar     │     varchar     │
├────────────────┼──────────────┼─────────────────┼─────────────────┤
│ Id             │            0 │ -1014           │ 248850          │
│ Id             │            1 │ 248853          │ 412339          │
│ Id             │            2 │ 412342          │ 573874          │
│ Id             │            3 │ 573875          │ 735344          │
│ Id             │            4 │ 735345          │ 889114          │
│ Id             │            5 │ 889115          │ 1049570         │
│ Id             │            6 │ 1049572         │ 1211500         │
│ Id             │            7 │ 1211501         │ 1370352         │
│ Id             │            8 │ 1370353         │ 1479833         │
│ Id             │            9 │ 1479835         │ 1588348         │
├────────────────┴──────────────┴─────────────────┴─────────────────┤
│ 10 rows                                                 4 columns │
└───────────────────────────────────────────────────────────────────┘
Run Time (s): real 0.025 user 0.023375 sys 0.000999
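
For contrast, you could run the same query for DisplayName. Since the file isn't sorted by that column, I'd expect the min and max of each row group to span much of the alphabet, which is why the earlier query couldn't skip any row groups:

D select path_in_schema, row_group_id, stats_min_value, stats_max_value
  from parquet_metadata('users.parquet') where (path_in_schema='DisplayName') limit 10;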

Unsurprisingly, queries that filter by Id are going to be much faster.

Here's a query that returns 119 rows and filters on DisplayName:

D select count(*), sum(UpVotes) from 'users.parquet' where DisplayName like 'Petko%';
┌──────────────┬──────────────┐
│ count_star() │ sum(UpVotes) │
│    int64     │    int128    │
├──────────────┼──────────────┤
│          119 │          330 │
└──────────────┴──────────────┘
Run Time (s): real 0.201 user 1.363518 sys 0.100677

And here's a query that returns 3110 rows and filters on Id:

D select count(*), sum(UpVotes) from 'users.parquet' where Id >= 1000 * 1000 and Id <= 1000 * 1000 + 5000;
┌──────────────┬──────────────┐
│ count_star() │ sum(UpVotes) │
│    int64     │    int128    │
├──────────────┼──────────────┤
│         3110 │       235823 │
└──────────────┴──────────────┘
Run Time (s): real 0.012 user 0.014406 sys 0.003434

The latter is much faster. We can look at the PARQUET_SCAN part of its EXPLAIN ANALYZE output too:

┌─────────────┴─────────────┐
│       PARQUET_SCAN        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│          UpVotes          │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│Filters: Id>=1000000 AND Id│
│  <=1005000 AND Id IS NOT  │
│            NULL           │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│        EC: 3988557        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│            3110           │
│          (0.02s)          │
└───────────────────────────┘    

As you can see, the scan took only 0.02s. The Parquet reader can skip most row groups here: judging from the metadata above, the Id range 1000000 to 1005000 falls entirely within row group 5.

Conclusion

A few takeaways for me:

  • Querying Parquet is fastest when you filter on a column the data is sorted by (or at least mostly sorted by), because the row group statistics then let the reader skip most of the file.
  • Queries directly over Parquet files can be quite fast, even without loading the data into a database first.

TODO

There are a couple of things I'm curious about exploring:

  • Can I achieve faster execution times for some queries by splitting the data into directories? For example, I could partition by date and sort each partition by DisplayName; queries that filter on a date and a DisplayName prefix could then become faster (see the sketch after this list).
  • If I copy the Parquet file into a DuckDB table and create indexes on it, what kinds of queries become faster?
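
Here's a rough sketch of both ideas. For the first, the directory layout (users/month=2023-05/part-0.parquet) and the month column are hypothetical; DuckDB's read_parquet() can pick partition values out of paths like these via its hive_partitioning flag:

D select count(*)
    from read_parquet('users/*/*.parquet', hive_partitioning = true)
    where month = '2023-05' and DisplayName like 'Petko%';

For the second, loading and indexing would look something like this (the table and index names are mine):

D create table users_tbl as select * from 'users.parquet';
D create index users_display_name_idx on users_tbl (DisplayName);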
