2 Commits

Author SHA1 Message Date
xiongyinke
cb886e9a1d
[KYUUBI #1217] [DOC] Z-order by and order by performance test
### _Why are the changes needed?_
### _How was this patch tested?_
- [ ] Add some test cases that check the changes thoroughly including negative and positive cases if possible

- [ ] Add screenshots for manual tests if appropriate

- [ ] [Run test](https://kyuubi.readthedocs.io/en/latest/develop_tools/testing.html#running-tests) locally before making a pull request

Closes #1217 from hzxiongyinke/zorder-by_and_order-by_performance_test.

Closes #1217

c0232c68 [xiongyinke] format z-order-benchmark.md
a7d71111 [xiongyinke] update  zorder benchmark data
3bf5f81b [xiongyinke] update benchmark result secondary headlines and fix z-order test result;
f5c9dfb5 [hzxiongyinke] Merge pull request #3 from apache/master
6f1892be [hzxiongyinke] Merge pull request #1 from apache/master

Lead-authored-by: xiongyinke <1062376716@qq.com>
Co-authored-by: hzxiongyinke <75288351+hzxiongyinke@users.noreply.github.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2021-10-12 15:54:11 +08:00
hzxiongyinke
0ecf8fbc7e
[KYUUBI #939] z-order performance_test
### What is the purpose of the pull request

This PR is for KYUUBI #939: Add Z-Order extensions to optimize tables with Z-order. Z-order is a technique that maps multidimensional data to a single dimension. We ran a performance test of it.

For this test, we used the Aliyun Databricks Delta test case:
https://help.aliyun.com/document_detail/168137.html?spm=a2c4g.11186623.6.563.10d758ccclYtVb
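
To make the idea of mapping multidimensional data to a single dimension concrete, here is a minimal sketch of a two-dimensional Z-order (Morton) key built by bit interleaving. It is a generic illustration, not the extension's actual implementation; the names `ZOrderSketch`, `interleave`, and `zSorted` are hypothetical, and only the low 16 bits of each input are used (enough for a port number).

```scala
object ZOrderSketch {
  /** Interleaves the low 16 bits of x and y into one key:
    * bit i of x lands on bit 2*i, bit i of y lands on bit 2*i + 1.
    */
  def interleave(x: Int, y: Int): Long = {
    var z = 0L
    var i = 0
    while (i < 16) {
      z |= ((x >> i) & 1).toLong << (2 * i)
      z |= ((y >> i) & 1).toLong << (2 * i + 1)
      i += 1
    }
    z
  }

  /** Sorts (x, y) points by their interleaved key, so points that are
    * close in both dimensions tend to end up close in the output order.
    */
  def zSorted(points: Seq[(Int, Int)]): Seq[(Int, Int)] =
    points.sortBy { case (x, y) => interleave(x, y) }
}
```

Sorting rows by such a key before writing them tends to keep rows with similar values in both columns in the same files, which is what lets a query on either column skip most files.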

Prepare data for the three scenarios:

1. 10 billion rows in 200 Parquet files: large files (about 1 GB each)
2. 10 billion rows in 1,000 Parquet files: medium files (about 200 MB each)
3. 1 billion rows in 1,000 Parquet files: small files (about 200 KB each)

test env:
spark-3.1.2
hadoop-2.7.2
kyuubi-1.4.0

test steps:

Step1: create Hive tables

```scala
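// dbName is assumed to be defined by the caller (it is not given in this PR).
// The table names below match those used in the benchmark results.
val connRandomParquet = "conn_random_parquet"
val connZorderOnlyIp = "conn_zorder_only_ip"
val connZorder = "conn_zorder"

spark.sql(s"drop database if exists $dbName cascade")
spark.sql(s"create database if not exists $dbName")
spark.sql(s"use $dbName")
spark.sql(s"create table $connRandomParquet (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet")
spark.sql(s"create table $connZorderOnlyIp (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet")
spark.sql(s"create table $connZorder (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet")
spark.sql("show tables").show(false)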
```

Step2: prepare data for the Parquet tables in the three scenarios, using the following code:

```scala
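import scala.util.Random

// One connection record; the fields mirror the table columns created in Step1.
case class ConnRecord(src_ip: String, src_port: Int, dst_ip: String, dst_port: Int)

// Four random octets (0-255) joined with dots.
def randomIPv4(r: Random): String = Seq.fill(4)(r.nextInt(256)).mkString(".")
// A random port in the range 0-65535.
def randomPort(r: Random): Int = r.nextInt(65536)

def randomConnRecord(r: Random): ConnRecord = ConnRecord(
  src_ip = randomIPv4(r), src_port = randomPort(r),
  dst_ip = randomIPv4(r), dst_port = randomPort(r))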
```

Step3: optimize with Z-order on the IP columns only. Sort columns: src_ip, dst_ip. The number of shuffle partitions is set equal to the number of files.
	Execute 'OPTIMIZE conn_zorder_only_ip ZORDER BY src_ip, dst_ip;' through Kyuubi.

Step4: optimize with Z-order on all four columns. Sort columns: src_ip, src_port, dst_ip, dst_port. The number of shuffle partitions is set equal to the number of files.
	Execute 'OPTIMIZE conn_zorder ZORDER BY src_ip, src_port, dst_ip, dst_port;' through Kyuubi.
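
The two optimize steps can be sketched as a small helper that builds the statements to submit through Kyuubi. This is an illustration under assumptions: the helper name `OptimizeStatements.forTable` is hypothetical, and setting `spark.sql.shuffle.partitions` is assumed to be how the shuffle partition count was matched to the file count; the PR only states that the two numbers were equal.

```scala
object OptimizeStatements {
  /** Builds the statements for one table: first match the shuffle
    * partition count to the target file count (assumption: done via
    * spark.sql.shuffle.partitions), then run the Z-order optimize.
    */
  def forTable(table: String, zorderCols: Seq[String], fileCount: Int): Seq[String] = {
    require(zorderCols.nonEmpty, "at least one Z-order column is required")
    require(fileCount > 0, "file count must be positive")
    Seq(
      s"SET spark.sql.shuffle.partitions=$fileCount",
      s"OPTIMIZE $table ZORDER BY ${zorderCols.mkString(", ")}"
    )
  }
}
```

For example, `OptimizeStatements.forTable("conn_zorder_only_ip", Seq("src_ip", "dst_ip"), 200)` yields the `SET` statement followed by the Step3 `OPTIMIZE` statement for the 200-file scenario.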

---------------------
# Benchmark result
Querying the tables before and after optimization gives the following results:

**10 billion rows, 200 files; query resources: 200 cores, 600 GB memory**

| Table               | Average File Size | Scan row count | Average query time | Row count skipping ratio |
| ------------------- | ----------------- | -------------- | ------------------ | ------------------------ |
| conn_random_parquet | 1.2 G             | 10,000,000,000 | 27.554 s           | 0.0%                     |
| conn_zorder_only_ip | 890 M             | 43,170,600     | 2.459 s            | 99.568%                  |
| conn_zorder         | 890 M             | 54,841,302     | 3.185 s            | 99.451%                  |

**10 billion rows, 2000 files; query resources: 200 cores, 600 GB memory**

| Table               | Average File Size | Scan row count | Average query time | Row count skipping ratio |
| ------------------- | ----------------- | -------------- | ------------------ | ------------------------ |
| conn_random_parquet | 234.8 M           | 10,000,000,000 | 27.031 s           | 0.0%                     |
| conn_zorder_only_ip | 173.9 M           | 43,170,600     | 2.668 s            | 99.568%                  |
| conn_zorder         | 174.0 M           | 54,841,302     | 3.207 s            | 99.451%                  |

**1 billion rows, 10000 files; query resources: 10 cores, 40 GB memory**

| Table               | Average File Size | Scan row count | Average query time | Row count skipping ratio |
| ------------------- | ----------------- | -------------- | ------------------ | ------------------------ |
| conn_random_parquet | 2.7 M             | 1,000,000,000  | 76.772 s           | 0.0%                     |
| conn_zorder_only_ip | 2.1 M             | 406,572        | 3.963 s            | 99.959%                  |
| conn_zorder         | 2.2 M             | 387,942        | 3.621 s            | 99.961%                  |
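
The skipping ratio and the query speedup can be derived from the table columns. The sketch below assumes the ratio is defined as 1 minus scanned rows divided by total rows, which reproduces the reported percentages up to rounding; the names `BenchmarkMetrics`, `skippingRatio`, and `speedup` are hypothetical.

```scala
object BenchmarkMetrics {
  /** Fraction of rows that were not scanned,
    * assuming ratio = 1 - scannedRows / totalRows.
    */
  def skippingRatio(scannedRows: Long, totalRows: Long): Double = {
    require(totalRows > 0, "total row count must be positive")
    require(scannedRows >= 0 && scannedRows <= totalRows, "scanned rows out of range")
    1.0 - scannedRows.toDouble / totalRows.toDouble
  }

  /** How many times faster the optimized query is than the baseline. */
  def speedup(baselineSeconds: Double, optimizedSeconds: Double): Double = {
    require(optimizedSeconds > 0, "optimized time must be positive")
    baselineSeconds / optimizedSeconds
  }
}
```

With the 200-file figures, `skippingRatio(43170600L, 10000000000L)` is about 0.99568 and `speedup(27.554, 2.459)` is about 11.2, so in that scenario the table Z-ordered on the IP columns answers the query roughly 11 times faster than the unoptimized table.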

Closes #1178 from hzxiongyinke/zorder_performance_test.

Closes #939

369a9b41 [hzxiongyinke] remove set spark.sql.extensions=org.apache.kyuubi.sql.KyuubiSparkSQLExtension;
8c8ae458 [hzxiongyinke] add index z-order-benchmark
66bd20fd [hzxiongyinke] change tables to three scenarios
cc80f4e7 [hzxiongyinke] add License
70c29daa [hzxiongyinke] z-order performance_test
6f1892be [hzxiongyinke] Merge pull request #1 from apache/master

Lead-authored-by: hzxiongyinke <1062376716@qq.com>
Co-authored-by: hzxiongyinke <75288351+hzxiongyinke@users.noreply.github.com>
Signed-off-by: ulysses-you <ulyssesyou@apache.org>
2021-09-29 17:51:37 +08:00