[KYUUBI #1217] [DOC] Z-order by and order by performance test

### _Why are the changes needed?_  ### _How was this patch tested?_ - [ ] Add some test cases that check the changes thoroughly including negative and positive cases if possible - [ ] Add screenshots for manual tests if appropriate - [ ] [Run test](https://kyuubi.readthedocs.io/en/latest/develop_tools/testing.html#running-tests) locally before make a pull request Closes #1217 from hzxiongyinke/zorder-by_and_order-by_performance_test. Closes #1217 c0232c68 [xiongyinke] format z-order-benchmark.md a7d71111 [xiongyinke] update zorder benchmark data 3bf5f81b [xiongyinke] update benchmark result secondary headlines and fix z-order test result; f5c9dfb5 [hzxiongyinke] Merge pull request #3 from apache/master 6f1892be [hzxiongyinke] Merge pull request #1 from apache/master Lead-authored-by: xiongyinke <1062376716@qq.com> Co-authored-by: hzxiongyinke <75288351+hzxiongyinke@users.noreply.github.com> Signed-off-by: Cheng Pan <chengpan@apache.org>
2021-10-12 15:54:11 +08:00 · 2021-10-12 15:54:11 +08:00 · cb886e9a1d
commit cb886e9a1d
parent b26ef2ba9e
1 changed files with 46 additions and 9 deletions
--- a/docs/sql/z-order-benchmark.md
+++ b/docs/sql/z-order-benchmark.md
@ -23,7 +23,7 @@

 </div>

-# z-order benchmark
+# Z-order Benchmark

 Z-order is a technique that allows you to map multidimensional data to a single dimension. We did a performance test

@ -50,6 +50,8 @@ spark.sql(s"drop database if exists $dbName cascade")
 spark.sql(s"create database if not exists $dbName")
 spark.sql(s"use $dbName")
 spark.sql(s"create table $connRandomParquet (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet")
+spark.sql(s"create table $connOrderbyOnlyIp (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet")
+spark.sql(s"create table $connOrderby (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet")
 spark.sql(s"create table $connZorderOnlyIp (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet")
 spark.sql(s"create table $connZorder (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet")
 spark.sql(s"show tables").show(false)
@ -67,15 +69,17 @@ def randomConnRecord(r: Random) = ConnRecord(
  dst_ip = randomIPv4(r), dst_port = randomPort(r))
 ```

-Step3： do optimize with z-order only ip， sort column： src_ip, dst_ip and shuffle partition just as file numbers .
+Step3： do optimize with z-order only ip and do optimize with order by only ip， sort column： src_ip, dst_ip and shuffle partition just as file numbers .

 ```
+INSERT overwrite table conn_order_only_ip select src_ip, src_port, dst_ip, dst_port from conn_random_parquet order by src_ip, dst_ip;
 OPTIMIZE conn_zorder_only_ip ZORDER BY src_ip, dst_ip;
 ```

-Step4： do optimize with z-order only ip， sort column： src_ip, dst_ip and shuffle partition just as file numbers .
+Step4： do optimize with z-order and do optimize with order by ， sort column： src_ip, src_port, dst_ip, dst_port and shuffle partition just as file numbers .

 ```
+INSERT overwrite table conn_order select src_ip, src_port, dst_ip, dst_port from conn_random_parquet order by src_ip, src_port, dst_ip, dst_port;
 OPTIMIZE conn_zorder ZORDER BY src_ip, src_port, dst_ip, dst_port;
 ```

@ -111,6 +115,8 @@ spark.sql(s"drop database if exists $dbName cascade")
 spark.sql(s"create database if not exists $dbName")
 spark.sql(s"use $dbName")
 spark.sql(s"create table $connRandomParquet (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet")
+spark.sql(s"create table $connOrderbyOnlyIp (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet")
+spark.sql(s"create table $connOrderby (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet")
 spark.sql(s"create table $connZorderOnlyIp (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet")
 spark.sql(s"create table $connZorder (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet")
 spark.sql(s"show tables").show(false)
@ -150,7 +156,7 @@ spark.stop()

 ```

-Optimize Sql:
+Z-order Optimize statement:

 ```sql

@ -161,9 +167,16 @@ OPTIMIZE conn_zorder_only_ip ZORDER BY src_ip, dst_ip;
 OPTIMIZE zorder_test.conn_zorder ZORDER BY src_ip, src_port, dst_ip, dst_port;
 ```

+ORDER BY statement:

+```
+INSERT overwrite table conn_order_only_ip select src_ip, src_port, dst_ip, dst_port from conn_random_parquet order by src_ip, dst_ip;

-Query Sql : 
+INSERT overwrite table conn_order select src_ip, src_port, dst_ip, dst_port from conn_random_parquet order by src_ip, src_port, dst_ip, dst_port;
+
+```
+
+Query statement:

 ```sql

@ -177,7 +190,31 @@ select count(*) from conn_zorder where src_ip like '157%' and dst_ip like '216.%
 ```


-# benchmark result
+## Benchmark result
+
+We have done two performance tests: one is to compare the efficiency of  Z-order Optimize and Order by Sort, 
+and the other is to query based on the optimized Z-order by data and Random data
+
+### Efficiency of Z-order Optimize and Order-by Sort
+
+**10 billion data and 1000 files and Query resource:200 core 600G memory**
+
+z-order by or order by only ip
+
+| Table               | row count      | optimize  time     |
+| ------------------- | -------------- | ------------------ |
+| conn_order_only_ip  | 10,000,000,000 | 1591.99 s          |
+| conn_zorder_only_ip | 10,000,000,000 | 8371.405 s         |
+
+z-order by or order by all columns
+
+| Table               | row count      | optimize  time     |
+| ------------------- | -------------- | ------------------ |
+| conn_order          | 10,000,000,000 | 1515.298 s         |
+| conn_zorder         | 10,000,000,000 | 11057.194 s        |
+
+### Z-order by benchmark result
+
 by querying the tables before and after optimization, we find that

 **10 billion data and 200 files and Query resource:200 core 600G memory**
@ -190,13 +227,13 @@ by querying the tables before and after optimization, we find that



-**10 billion data and 2000 files and Query resource:200 core 600G memory**
+**10 billion data and 1000 files and Query resource:200 core 600G memory**

 | Table               | Average File Size | Scan row count | Average query time | row count Skipping ratio |
 | ------------------- | ----------------- | -------------- | ------------------ | ------------------------ |
 | conn_random_parquet | 234.8 M           | 10,000,000,000 | 27.031 s           | 0.0%                     |
-| conn_zorder_only_ip | 173.9 M           | 43,170,600     | 2.668 s            | 99.568%                  |
-| conn_zorder         | 174.0 M           | 54,841,302     | 3.207 s            | 99.451%                  |
+| conn_zorder_only_ip | 173.9 M           | 53,499,068     | 3.120 s            | 99.465%                  |
+| conn_zorder         | 174.0 M           | 35,910,500     | 3.103 s            | 99.640%                  |