kyuubi/dev/kyuubi-tpcds
haorenhui e49af52f4c
[KYUUBI #5925] Kyuubi TPC-DS support running benchmark with skipping some queries
# 🔍 Description
## Issue References 🔗

When running Kyuubi's TPCDS, some SQL runs slowly, but there are no parameters to skip it.

## Describe Your Solution 🔧

Add the skip parameter, specifying a comma-separated list of SQL

## Types of changes 🔖

- [ ] Bugfix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [x] Breaking change (fix or feature that would cause existing functionality to change)

## Test Plan 🧪

#### Behavior Without This Pull Request ⚰️
no parameters to skip it.

#### Behavior With This Pull Request 🎉
```
$SPARK_HOME/bin/spark-submit \
  --class org.apache.kyuubi.tpcds.benchmark.RunBenchmark \
  kyuubi-tpcds_*.jar --db tpcds_sf10 --exclude q2,q4
```

> == QUERY LIST ==
> q1-v2.4
> q3-v2.4
> q5-v2.4
> q6-v2.4
> q7-v2.4
> q8-v2.4
> q9-v2.4
> .....

#### Related Unit Tests

---

# Checklists
## 📝 Author Self Checklist

- [x] My code follows the [style guidelines](https://kyuubi.readthedocs.io/en/master/contributing/code/style.html) of this project
- [x] I have performed a self-review
- [x] I have commented my code, particularly in hard-to-understand areas
- [x] I have made corresponding changes to the documentation
- [ ] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] New and existing unit tests pass locally with my changes
- [ ] This patch was not authored or co-authored using [Generative Tooling](https://www.apache.org/legal/generative-tooling.html)

## 📝 Committer Pre-Merge Checklist

- [ ] Pull request title is okay.
- [ ] No license issues.
- [ ] Milestone correctly set?
- [ ] Test coverage is ok
- [ ] Assignees are selected.
- [ ] Minimum number of approvals
- [ ] No changes are requested

**Be nice. Be informative.**

Closes #5925 from rhh777/tpcds-support-skip-queries.

Closes #5925

682f30ce8 [haorenhui] Update some descriptions
cd90fb597 [haorenhui] Use include(list) and exclude(list) to replace filter(string)/queries(list)/skip(list)
13744e57e [haorenhui] kyuubi tpcds RunBenchmark support skip some of the queries

Authored-by: haorenhui <haorenhui@kingsoft.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-12-28 17:01:46 +08:00
..
src/main [KYUUBI #5925] Kyuubi TPC-DS support running benchmark with skipping some queries 2023-12-28 17:01:46 +08:00
pom.xml Bump 1.9.0-SNAPSHOT 2023-09-04 14:23:12 +08:00
README.md [KYUUBI #5925] Kyuubi TPC-DS support running benchmark with skipping some queries 2023-12-28 17:01:46 +08:00

Introduction

This module includes TPC-DS data generator and benchmark tool.

How to use

package jar with following command: ./build/mvn clean package -Ptpcds -pl dev/kyuubi-tpcds -am

Data Generator

Support options:

key default description
db default the database to write data
scaleFactor 1 the scale factor of TPC-DS
format parquet the format of table to store data
parallel scaleFactor * 2 the parallelism of Spark job

Example: the following command to generate 10GB data with new database tpcds_sf10.

$SPARK_HOME/bin/spark-submit \
  --class org.apache.kyuubi.tpcds.DataGenerator \
  kyuubi-tpcds_*.jar \
  --db tpcds_sf10 --scaleFactor 10 --format parquet --parallel 20

Benchmark Tool

Support options:

key default description
db none(required) the TPC-DS database
benchmark tpcds-v2.4-benchmark the name of application
iterations 3 the number of iterations to run
breakdown false whether to record breakdown results of an execution
results-dir /spark/sql/performance dir to store benchmark results, e.g. hdfs://hdfs-nn:9870/pref
include none(optional) name of the queries to run, use comma to split multiple names, e.g. q1,q2
exclude none(optional) name of the queries to exclude, use comma to split multiple names, e.g. q2,q4

Example: the following command to benchmark TPC-DS sf10 with exists database tpcds_sf10.

$SPARK_HOME/bin/spark-submit \
  --class org.apache.kyuubi.tpcds.benchmark.RunBenchmark \
  kyuubi-tpcds_*.jar --db tpcds_sf10

We also support run specified SQL collections of the TPC-DS query:

$SPARK_HOME/bin/spark-submit \
  --class org.apache.kyuubi.tpcds.benchmark.RunBenchmark \
  kyuubi-tpcds_*.jar --db tpcds_sf10 --include q1,q2

The result of TPC-DS benchmark like:

name minTimeMs maxTimeMs avgTimeMs stdDev stdDevPercent
q1-v2.4 8329.884508 14159.307004 10537.235825 3161.74253777417 30.0054263782615
q2-v2.4 16600.979609 18932.613523 18137.6516166666 1331.06332796139 7.33867512781137

If you want to exclude some SQL, you can use exclude:

$SPARK_HOME/bin/spark-submit \
  --class org.apache.kyuubi.tpcds.benchmark.RunBenchmark \
  kyuubi-tpcds_*.jar --db tpcds_sf10 --exclude q2,q4

The result of TPC-DS benchmark like:

name minTimeMs maxTimeMs avgTimeMs stdDev stdDevPercent
q1-v2.4 8329.884508 14159.307004 10537.235825 3161.74253777417 30.0054263782615
q3-v2.4 3841.009061 4685.16345 4128.583224 482.102016761038 11.6771781166603
q5-v2.4 39405.654981 48845.359253 43530.6847113333 4830.98802198401 11.0978911864583
q6-v2.4 2998.962221 7793.096796 4658.37355366666 2716.310089792 58.3102677039276
... ... ... ... ... ...
q99-v2.4 11747.22389 11900.570288 11813.018609 78.9544389266673 0.668368022941351

When both include and exclude exist simultaneously, the final SQL collections executed is include minus exclude:

$SPARK_HOME/bin/spark-submit \
  --class org.apache.kyuubi.tpcds.benchmark.RunBenchmark \
  kyuubi-tpcds_*.jar --db tpcds_sf10 --include q1,q2,q3,q4,q5 --exclude q2,q4

The result of TPC-DS benchmark like:

name minTimeMs maxTimeMs avgTimeMs stdDev stdDevPercent
q1-v2.4 8329.884508 14159.307004 10537.235825 3161.74253777417 30.0054263782615
q3-v2.4 3841.009061 4685.16345 4128.583224 482.102016761038 11.6771781166603
q5-v2.4 39405.654981 48845.359253 43530.6847113333 4830.98802198401 11.0978911864583