[KYUUBI #1217] [DOC] Z-order by and order by performance test
<!-- Thanks for sending a pull request! Here are some tips for you: 1. If this is your first time, please read our contributor guidelines: https://kyuubi.readthedocs.io/en/latest/community/contributions.html 2. If the PR is related to an issue in https://github.com/apache/incubator-kyuubi/issues, add '[KYUUBI #XXXX]' in your PR title, e.g., '[KYUUBI #XXXX] Your PR title ...'. 3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][KYUUBI #XXXX] Your PR title ...'. --> ### _Why are the changes needed?_ <!-- Please clarify why the changes are needed. For instance, 1. If you add a feature, you can talk about the use case of it. 2. If you fix a bug, you can clarify why it is a bug. --> ### _How was this patch tested?_ - [ ] Add some test cases that check the changes thoroughly including negative and positive cases if possible - [ ] Add screenshots for manual tests if appropriate - [ ] [Run test](https://kyuubi.readthedocs.io/en/latest/develop_tools/testing.html#running-tests) locally before make a pull request Closes #1217 from hzxiongyinke/zorder-by_and_order-by_performance_test. Closes #1217 c0232c68 [xiongyinke] format z-order-benchmark.md a7d71111 [xiongyinke] update zorder benchmark data 3bf5f81b [xiongyinke] update benchmark result secondary headlines and fix z-order test result; f5c9dfb5 [hzxiongyinke] Merge pull request #3 from apache/master 6f1892be [hzxiongyinke] Merge pull request #1 from apache/master Lead-authored-by: xiongyinke <1062376716@qq.com> Co-authored-by: hzxiongyinke <75288351+hzxiongyinke@users.noreply.github.com> Signed-off-by: Cheng Pan <chengpan@apache.org>
This commit is contained in:
parent
b26ef2ba9e
commit
cb886e9a1d
@ -23,7 +23,7 @@
|
||||
|
||||
</div>
|
||||
|
||||
# z-order benchmark
|
||||
# Z-order Benchmark
|
||||
|
||||
Z-order is a technique that allows you to map multidimensional data to a single dimension. We did a performance test
|
||||
|
||||
@ -50,6 +50,8 @@ spark.sql(s"drop database if exists $dbName cascade")
|
||||
spark.sql(s"create database if not exists $dbName")
|
||||
spark.sql(s"use $dbName")
|
||||
spark.sql(s"create table $connRandomParquet (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet")
|
||||
spark.sql(s"create table $connOrderbyOnlyIp (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet")
|
||||
spark.sql(s"create table $connOrderby (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet")
|
||||
spark.sql(s"create table $connZorderOnlyIp (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet")
|
||||
spark.sql(s"create table $connZorder (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet")
|
||||
spark.sql(s"show tables").show(false)
|
||||
@ -67,15 +69,17 @@ def randomConnRecord(r: Random) = ConnRecord(
|
||||
dst_ip = randomIPv4(r), dst_port = randomPort(r))
|
||||
```
|
||||
|
||||
Step3: do optimize with z-order only ip, sort column: src_ip, dst_ip and shuffle partition just as file numbers .
|
||||
Step3: do optimize with z-order only ip and do optimize with order by only ip, sort column: src_ip, dst_ip and shuffle partition just as file numbers .
|
||||
|
||||
```
|
||||
INSERT overwrite table conn_order_only_ip select src_ip, src_port, dst_ip, dst_port from conn_random_parquet order by src_ip, dst_ip;
|
||||
OPTIMIZE conn_zorder_only_ip ZORDER BY src_ip, dst_ip;
|
||||
```
|
||||
|
||||
Step4: do optimize with z-order only ip, sort column: src_ip, dst_ip and shuffle partition just as file numbers .
|
||||
Step4: do optimize with z-order and do optimize with order by , sort column: src_ip, src_port, dst_ip, dst_port and shuffle partition just as file numbers .
|
||||
|
||||
```
|
||||
INSERT overwrite table conn_order select src_ip, src_port, dst_ip, dst_port from conn_random_parquet order by src_ip, src_port, dst_ip, dst_port;
|
||||
OPTIMIZE conn_zorder ZORDER BY src_ip, src_port, dst_ip, dst_port;
|
||||
```
|
||||
|
||||
@ -111,6 +115,8 @@ spark.sql(s"drop database if exists $dbName cascade")
|
||||
spark.sql(s"create database if not exists $dbName")
|
||||
spark.sql(s"use $dbName")
|
||||
spark.sql(s"create table $connRandomParquet (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet")
|
||||
spark.sql(s"create table $connOrderbyOnlyIp (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet")
|
||||
spark.sql(s"create table $connOrderby (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet")
|
||||
spark.sql(s"create table $connZorderOnlyIp (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet")
|
||||
spark.sql(s"create table $connZorder (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet")
|
||||
spark.sql(s"show tables").show(false)
|
||||
@ -150,7 +156,7 @@ spark.stop()
|
||||
|
||||
```
|
||||
|
||||
Optimize Sql:
|
||||
Z-order Optimize statement:
|
||||
|
||||
```sql
|
||||
|
||||
@ -161,9 +167,16 @@ OPTIMIZE conn_zorder_only_ip ZORDER BY src_ip, dst_ip;
|
||||
OPTIMIZE zorder_test.conn_zorder ZORDER BY src_ip, src_port, dst_ip, dst_port;
|
||||
```
|
||||
|
||||
ORDER BY statement:
|
||||
|
||||
```
|
||||
INSERT overwrite table conn_order_only_ip select src_ip, src_port, dst_ip, dst_port from conn_random_parquet order by src_ip, dst_ip;
|
||||
|
||||
Query Sql :
|
||||
INSERT overwrite table conn_order select src_ip, src_port, dst_ip, dst_port from conn_random_parquet order by src_ip, src_port, dst_ip, dst_port;
|
||||
|
||||
```
|
||||
|
||||
Query statement:
|
||||
|
||||
```sql
|
||||
|
||||
@ -177,7 +190,31 @@ select count(*) from conn_zorder where src_ip like '157%' and dst_ip like '216.%
|
||||
```
|
||||
|
||||
|
||||
# benchmark result
|
||||
## Benchmark result
|
||||
|
||||
We have done two performance tests: one is to compare the efficiency of Z-order Optimize and Order by Sort,
|
||||
and the other is to query based on the optimized Z-order by data and Random data
|
||||
|
||||
### Efficiency of Z-order Optimize and Order-by Sort
|
||||
|
||||
**10 billion data and 1000 files and Query resource:200 core 600G memory**
|
||||
|
||||
z-order by or order by only ip
|
||||
|
||||
| Table | row count | optimize time |
|
||||
| ------------------- | -------------- | ------------------ |
|
||||
| conn_order_only_ip | 10,000,000,000 | 1591.99 s |
|
||||
| conn_zorder_only_ip | 10,000,000,000 | 8371.405 s |
|
||||
|
||||
z-order by or order by all columns
|
||||
|
||||
| Table | row count | optimize time |
|
||||
| ------------------- | -------------- | ------------------ |
|
||||
| conn_order | 10,000,000,000 | 1515.298 s |
|
||||
| conn_zorder | 10,000,000,000 | 11057.194 s |
|
||||
|
||||
### Z-order by benchmark result
|
||||
|
||||
by querying the tables before and after optimization, we find that
|
||||
|
||||
**10 billion data and 200 files and Query resource:200 core 600G memory**
|
||||
@ -190,13 +227,13 @@ by querying the tables before and after optimization, we find that
|
||||
|
||||
|
||||
|
||||
**10 billion data and 2000 files and Query resource:200 core 600G memory**
|
||||
**10 billion data and 1000 files and Query resource:200 core 600G memory**
|
||||
|
||||
| Table | Average File Size | Scan row count | Average query time | row count Skipping ratio |
|
||||
| ------------------- | ----------------- | -------------- | ------------------ | ------------------------ |
|
||||
| conn_random_parquet | 234.8 M | 10,000,000,000 | 27.031 s | 0.0% |
|
||||
| conn_zorder_only_ip | 173.9 M | 43,170,600 | 2.668 s | 99.568% |
|
||||
| conn_zorder | 174.0 M | 54,841,302 | 3.207 s | 99.451% |
|
||||
| conn_zorder_only_ip | 173.9 M | 53,499,068 | 3.120 s | 99.465% |
|
||||
| conn_zorder | 174.0 M | 35,910,500 | 3.103 s | 99.640% |
|
||||
|
||||
|
||||
|
||||
|
||||
Loading…
Reference in New Issue
Block a user