### _Why are the changes needed?_

The parallelism of DataGenerator is always `spark.sparkContext.defaultParallelism`, which does not make sense for generating large-scale data.

### _How was this patch tested?_

- [ ] Add some test cases that check the changes thoroughly, including negative and positive cases if possible
- [ ] Add screenshots for manual tests if appropriate
- [ ] [Run test](https://kyuubi.readthedocs.io/en/latest/develop_tools/testing.html#running-tests) locally before making a pull request

Closes #1743 from pan3793/tpcds.

Closes #1743

62f7c866 [Cheng Pan] nit
fdcf8329 [Cheng Pan] nit
a52ff489 [Cheng Pan] Fix parallelism of DataGenerator and other enhancements

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>

<!--
 - Licensed to the Apache Software Foundation (ASF) under one or more
 - contributor license agreements. See the NOTICE file distributed with
 - this work for additional information regarding copyright ownership.
 - The ASF licenses this file to You under the Apache License, Version 2.0
 - (the "License"); you may not use this file except in compliance with
 - the License. You may obtain a copy of the License at
 -
 -    http://www.apache.org/licenses/LICENSE-2.0
 -
 - Unless required by applicable law or agreed to in writing, software
 - distributed under the License is distributed on an "AS IS" BASIS,
 - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 - See the License for the specific language governing permissions and
 - limitations under the License.
 -->

# Introduction

This module includes a TPC-DS data generator and a benchmark tool.

# How to use

Package the jar with the following command:

`./build/mvn clean package -Ptpcds -pl dev/kyuubi-tpcds -am`

## Data Generator

Supported options:

| key          | default         | description                            |
|--------------|-----------------|----------------------------------------|
| db           | default         | the database to write data to          |
| scaleFactor  | 1               | the scale factor of TPC-DS             |
| format       | parquet         | the format of the tables storing data  |
| parallel     | scaleFactor * 2 | the parallelism of the Spark job       |

Example: the following command generates 10GB of data into a new database `tpcds_sf10`.

```shell
$SPARK_HOME/bin/spark-submit \
  --class org.apache.kyuubi.tpcds.DataGenerator \
  kyuubi-tpcds_*.jar \
  --db tpcds_sf10 --scaleFactor 10 --format parquet --parallel 20
```

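As a minimal sketch of how the documented `parallel` default relates to `scaleFactor`, the snippet below derives the database name and the parallelism from a single value and prints the resulting command instead of submitting it. The variable names (`SCALE_FACTOR`, `PARALLEL`, `DB`, `GEN_CMD`) and the `tpcds_sf<N>` naming convention are illustrative assumptions; only the option names come from the table.

```shell
# Hypothetical helper variables; only the option names come from the options table.
SCALE_FACTOR=100
# Mirror the documented default: parallel = scaleFactor * 2.
PARALLEL=$((SCALE_FACTOR * 2))
# Naming the database after the scale factor is a convention assumed here.
DB="tpcds_sf${SCALE_FACTOR}"

# Build the command as text so it can be reviewed before it is run.
GEN_CMD="\$SPARK_HOME/bin/spark-submit --class org.apache.kyuubi.tpcds.DataGenerator kyuubi-tpcds_*.jar --db ${DB} --scaleFactor ${SCALE_FACTOR} --format parquet --parallel ${PARALLEL}"
echo "${GEN_CMD}"
```

Passing `--parallel` explicitly is only needed when the default of `scaleFactor * 2` does not suit the cluster.
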
## Benchmark Tool

Supported options:

| key        | default              | description                                            |
|------------|----------------------|--------------------------------------------------------|
| db         | none (required)      | the TPC-DS database                                    |
| benchmark  | tpcds-v2.4-benchmark | the name of application                                |
| iterations | 3                    | the number of iterations to run                        |
| filter     | a                    | filter on the name of the queries to run, e.g. q1-v2.4 |

Example: the following command benchmarks TPC-DS sf10 against the existing database `tpcds_sf10`.

```shell
$SPARK_HOME/bin/spark-submit \
  --class org.apache.kyuubi.tpcds.benchmark.RunBenchmark \
  kyuubi-tpcds_*.jar --db tpcds_sf10
```

Running a single TPC-DS query is also supported:

```shell
$SPARK_HOME/bin/spark-submit \
  --class org.apache.kyuubi.tpcds.benchmark.RunBenchmark \
  kyuubi-tpcds_*.jar --db tpcds_sf10 --filter q1-v2.4
```

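To benchmark a handful of queries one at a time, the `--filter` option can be combined with a loop. The sketch below only prints the commands it would run. It assumes that `--iterations` is passed the same way as `--db` and `--filter`, and that the query names `q2-v2.4` and `q3-v2.4` follow the same pattern as `q1-v2.4`; neither assumption is confirmed by this README.

```shell
# Hypothetical settings; adjust to the database produced by the data generator.
DB="tpcds_sf10"
ITERATIONS=5
COUNT=0

# q1-v2.4 appears in this README; the other names are assumed to follow the same pattern.
for QUERY in q1-v2.4 q2-v2.4 q3-v2.4; do
  # Print one spark-submit command per query instead of executing it.
  echo "\$SPARK_HOME/bin/spark-submit --class org.apache.kyuubi.tpcds.benchmark.RunBenchmark kyuubi-tpcds_*.jar --db ${DB} --iterations ${ITERATIONS} --filter ${QUERY}"
  COUNT=$((COUNT + 1))
done

echo "prepared ${COUNT} commands"
```
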
The result of the TPC-DS benchmark looks like:

| name    | minTimeMs | maxTimeMs   | avgTimeMs  | stdDev   | stdDevPercent  |
|---------|-----------|-------------|------------|----------|----------------|
| q1-v2.4 | 50.522384 | 868.010383  | 323.398267 | 471.6482 | 145.8413108576 |
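
The sample row suggests that `stdDevPercent` is the standard deviation expressed as a percentage of the average time. That relationship is inferred from the numbers shown here, not taken from the tool's source. A quick check with `awk`, using hypothetical variable names:

```shell
# Values copied from the sample result row.
AVG_MS=323.398267
STD_DEV=471.6482

# Assumed relationship: stdDevPercent = stdDev / avgTimeMs * 100.
PERCENT=$(LC_ALL=C awk -v s="${STD_DEV}" -v a="${AVG_MS}" 'BEGIN { printf "%.2f", s / a * 100 }')
echo "stdDevPercent is roughly ${PERCENT}"
```

A value above 100, as in the sample row, means the spread across iterations exceeds the average itself, so more iterations may be needed before comparing results.
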