Add IMDB(JOB) Benchmark [2/N] (imdb queries) #12529

Open
wants to merge 17 commits into
base: main

austin362667
Contributor

@austin362667 austin362667 commented Sep 18, 2024

Which issue does this PR close?

Partially closes #12311.

  1. Generate the IMDB dataset (.csv, .parquet):
    ./benchmarks/bench.sh data imdb
    
  2. Benchmark the IMDB queries against the IMDB parquet files with the following script:
    ./benchmarks/bench.sh run imdb
    
    Or benchmark a single query. For example, query_id 5 corresponds to query 2a:
    cargo run --bin imdb benchmark datafusion --iterations 1 --path ./arrow-datafusion/benchmarks/data/imdb --prefer_hash_join true --format csv -o ./arrow-datafusion/benchmarks/results/heads_doupache_imdb-data/imdb.json --query 5 --debug
    
    This returns:
     === Logical plan ===
     Projection: min(t.title) AS movie_title
       Aggregate: groupBy=[[]], aggr=[[min(t.title)]]
         Filter: cn.country_code = Utf8("[de]") AND k.keyword = Utf8("character-name-in-title") AND cn.id = mc.company_id AND mc.movie_id = t.id AND t.id = mk.movie_id AND mk.keyword_id = k.id AND mc.movie_id = mk.movie_id
     
     (...)
    
     +-------------+
     | movie_title |
     +-------------+
     | 'Doc'       |
     +-------------+
     Query 5 iteration 0 took 3222.3 ms and returned 1 rows
     Query 5 avg time: 3222.30 ms
    
  3. Verify the SQL results with SQL Logic Tests against the IMDB csv files:
    INCLUDE_IMDB=true cargo test --test sqllogictests -- imdb
    

Rationale for this change

  1. We download the IMDB queries from https://db.in.tum.de/~leis/qo/job.tgz and benchmark them with the help of Add JOB benchmark dataset [1/N] (imdb dataset) #12497.
  2. We ensure correctness with imdb.slt, just as we did with tpch.slt.

Unlike TPC-H, the IMDB dataset is not generated and has a fixed size, so there is no scaling factor and we don't need another docker container to generate data and answers.
I have also cross-checked the answers against the csv files from https://github.com/duckdb/duckdb/tree/main/benchmark/imdb/answers .

What changes are included in this PR?

IMDB(JOB) queries don't have incremental query IDs, so I hard-coded a mapping from the benchmark runner's query_id (integers 1, 2, 3, 4, ... 113) to the actual IMDB query names (strings 1a, 1b, 1c, 1d, 2a, ... 33c; there is no pattern) using a long chain of if statements.
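
For illustration only, the same id-to-name mapping can be sketched as a lookup table rather than a chain of if statements. This is not the code in this PR: `QUERY_NAMES` and `query_name` are hypothetical names, and the table lists only the query names spelled out in this description (1a through 6f), not all 113 entries.

```rust
/// First entries of the JOB query-name table, in benchmark-runner order.
/// Assumption: only the names written out in this PR description appear
/// here; the full table would continue up to "33c".
const QUERY_NAMES: &[&str] = &[
    "1a", "1b", "1c", "1d",
    "2a", "2b", "2c", "2d",
    "3a", "3b", "3c",
    "4a", "4b", "4c",
    "5a", "5b", "5c",
    "6a", "6b", "6c", "6d", "6e", "6f",
];

/// Hypothetical helper: map a 1-based runner query_id to its JOB query name.
/// Returns None for 0 and for ids beyond the end of the table.
fn query_name(query_id: usize) -> Option<&'static str> {
    query_id
        .checked_sub(1) // ids are 1-based, slice indices are 0-based
        .and_then(|index| QUERY_NAMES.get(index).copied())
}
```

With this table, `query_name(5)` yields `Some("2a")`, matching the `--query 5` example above, while an id of 0 or one past the end of the table yields `None` instead of panicking.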

Currently, I've only added SLT for:

  • 1a, 1b, 1c, 1d
  • 2a, 2b, 2c, 2d
  • 3a, 3b, 3c
  • 4a, 4b, 4c
  • 5a, 5b, 5c
  • 6a, 6b, 6c, 6d, 6e, 6f
  • ...
  • 33a, 33b, 33c

Are these changes tested?

Yes, please check test_files/imdb for details.

Are there any user-facing changes?

No.

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Sep 18, 2024
@austin362667 austin362667 force-pushed the imdb-benchmark branch 2 times, most recently from c1ccd0b to c3b4b8c Compare September 19, 2024 00:11
@github-actions github-actions bot added the development-process Related to development process of DataFusion label Sep 19, 2024
doupache and others added 14 commits September 20, 2024 11:28
Signed-off-by: Austin Liu <[email protected]>
Signed-off-by: Austin Liu <[email protected]>
Signed-off-by: Austin Liu <[email protected]>
Signed-off-by: Austin Liu <[email protected]>

Fix `get_query_sql()` for CI roundtrip test

Signed-off-by: Austin Liu <[email protected]>

Fix `get_query_sql()` for CI roundtrip test

Signed-off-by: Austin Liu <[email protected]>
Signed-off-by: Austin Liu <[email protected]>
Signed-off-by: Austin Liu <[email protected]>
Signed-off-by: Austin Liu <[email protected]>

Prepare IMDB dataset

Signed-off-by: Austin Liu <[email protected]>
@austin362667 austin362667 marked this pull request as ready for review September 20, 2024 08:08
Labels
development-process Related to development process of DataFusion sqllogictest SQL Logic Tests (.slt)
Development

Successfully merging this pull request may close these issues.

Add IMDB queries (a.k.a. JOB - Join Order Benchmark) to DataFusion benchmark suite
2 participants