kyuubi/dev/kyuubi-tpcds/README.md
Cheng Pan 01f0ea8609
[KYUUBI #1743] Fix parallelism of DataGenerator and other enhancements
### _Why are the changes needed?_

The parallelism of DataGenerator always is `spark.sparkContext.defaultParallelism`, it does not make sense for generating large scale data.

### _How was this patch tested?_
- [ ] Add some test cases that check the changes thoroughly including negative and positive cases if possible

- [ ] Add screenshots for manual tests if appropriate

- [ ] [Run test](https://kyuubi.readthedocs.io/en/latest/develop_tools/testing.html#running-tests) locally before make a pull request

Closes #1743 from pan3793/tpcds.

Closes #1743

62f7c866 [Cheng Pan] nit
fdcf8329 [Cheng Pan] nit
a52ff489 [Cheng Pan] Fix parallelism of DataGenerator and other enhancements

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2022-01-13 11:41:22 +08:00

3.0 KiB

Introduction

This module includes TPC-DS data generator and benchmark tool.

How to use

package jar with following command: ./build/mvn clean package -Ptpcds -pl dev/kyuubi-tpcds -am

Data Generator

Support options:

key default description
db default the database to write data
scaleFactor 1 the scale factor of TPC-DS
format parquet the format of table to store data
parallel scaleFactor * 2 the parallelism of Spark job

Example: the following command to generate 10GB data with new database tpcds_sf10.

$SPARK_HOME/bin/spark-submit \
  --class org.apache.kyuubi.tpcds.DataGenerator \
  kyuubi-tpcds_*.jar \
  --db tpcds_sf10 --scaleFactor 10 --format parquet --parallel 20

Benchmark Tool

Support options:

key default description
db none(required) the TPC-DS database
benchmark tpcds-v2.4-benchmark the name of application
iterations 3 the number of iterations to run
filter a filter on the name of the queries to run, e.g. q1-v2.4

Example: the following command to benchmark TPC-DS sf10 with exists database tpcds_sf10.

$SPARK_HOME/bin/spark-submit \
  --class org.apache.kyuubi.tpcds.benchmark.RunBenchmark \
  kyuubi-tpcds_*.jar --db tpcds_sf10

We also support run one of the TPC-DS query:

$SPARK_HOME/bin/spark-submit \
  --class org.apache.kyuubi.tpcds.benchmark.RunBenchmark \
  kyuubi-tpcds_*.jar --db tpcds_sf10 --filter q1-v2.4

The result of TPC-DS benchmark like:

name minTimeMs maxTimeMs avgTimeMs stdDev stdDevPercent
q1-v2.4 50.522384 868.010383 323.398267 471.6482 145.8413108576