diff --git a/docs/sql/z-order-benchmark.md b/docs/sql/z-order-benchmark.md index 968b51811..d60cb7c39 100644 --- a/docs/sql/z-order-benchmark.md +++ b/docs/sql/z-order-benchmark.md @@ -23,7 +23,7 @@ -# z-order benchmark +# Z-order Benchmark Z-order is a technique that allows you to map multidimensional data to a single dimension. We did a performance test @@ -50,6 +50,8 @@ spark.sql(s"drop database if exists $dbName cascade") spark.sql(s"create database if not exists $dbName") spark.sql(s"use $dbName") spark.sql(s"create table $connRandomParquet (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet") +spark.sql(s"create table $connOrderbyOnlyIp (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet") +spark.sql(s"create table $connOrderby (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet") spark.sql(s"create table $connZorderOnlyIp (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet") spark.sql(s"create table $connZorder (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet") spark.sql(s"show tables").show(false) @@ -67,15 +69,17 @@ def randomConnRecord(r: Random) = ConnRecord( dst_ip = randomIPv4(r), dst_port = randomPort(r)) ``` -Step3: do optimize with z-order only ip, sort column: src_ip, dst_ip and shuffle partition just as file numbers . +Step3: do optimize with z-order only ip and do optimize with order by only ip, sort column: src_ip, dst_ip and shuffle partition just as file numbers . ``` +INSERT overwrite table conn_order_only_ip select src_ip, src_port, dst_ip, dst_port from conn_random_parquet order by src_ip, dst_ip; OPTIMIZE conn_zorder_only_ip ZORDER BY src_ip, dst_ip; ``` -Step4: do optimize with z-order only ip, sort column: src_ip, dst_ip and shuffle partition just as file numbers . +Step4: do optimize with z-order and do optimize with order by , sort column: src_ip, src_port, dst_ip, dst_port and shuffle partition just as file numbers . ``` +INSERT overwrite table conn_order select src_ip, src_port, dst_ip, dst_port from conn_random_parquet order by src_ip, src_port, dst_ip, dst_port; OPTIMIZE conn_zorder ZORDER BY src_ip, src_port, dst_ip, dst_port; ``` @@ -111,6 +115,8 @@ spark.sql(s"drop database if exists $dbName cascade") spark.sql(s"create database if not exists $dbName") spark.sql(s"use $dbName") spark.sql(s"create table $connRandomParquet (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet") +spark.sql(s"create table $connOrderbyOnlyIp (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet") +spark.sql(s"create table $connOrderby (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet") spark.sql(s"create table $connZorderOnlyIp (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet") spark.sql(s"create table $connZorder (src_ip string, src_port int, dst_ip string, dst_port int) stored as parquet") spark.sql(s"show tables").show(false) @@ -150,7 +156,7 @@ spark.stop() ``` -Optimize Sql: +Z-order Optimize statement: ```sql @@ -161,9 +167,16 @@ OPTIMIZE conn_zorder_only_ip ZORDER BY src_ip, dst_ip; OPTIMIZE zorder_test.conn_zorder ZORDER BY src_ip, src_port, dst_ip, dst_port; ``` +ORDER BY statement: +``` +INSERT overwrite table conn_order_only_ip select src_ip, src_port, dst_ip, dst_port from conn_random_parquet order by src_ip, dst_ip; -Query Sql : +INSERT overwrite table conn_order select src_ip, src_port, dst_ip, dst_port from conn_random_parquet order by src_ip, src_port, dst_ip, dst_port; + +``` + +Query statement: ```sql @@ -177,7 +190,31 @@ select count(*) from conn_zorder where src_ip like '157%' and dst_ip like '216.% ``` -# benchmark result +## Benchmark result + +We have done two performance tests: one is to compare the efficiency of Z-order Optimize and Order by Sort, +and the other is to query based on the optimized Z-order by data and Random data + +### Efficiency of Z-order Optimize and Order-by Sort + +**10 billion data and 1000 files and Query resource:200 core 600G memory** + +z-order by or order by only ip + +| Table | row count | optimize time | +| ------------------- | -------------- | ------------------ | +| conn_order_only_ip | 10,000,000,000 | 1591.99 s | +| conn_zorder_only_ip | 10,000,000,000 | 8371.405 s | + +z-order by or order by all columns + +| Table | row count | optimize time | +| ------------------- | -------------- | ------------------ | +| conn_order | 10,000,000,000 | 1515.298 s | +| conn_zorder | 10,000,000,000 | 11057.194 s | + +### Z-order by benchmark result + by querying the tables before and after optimization, we find that **10 billion data and 200 files and Query resource:200 core 600G memory** @@ -190,13 +227,13 @@ by querying the tables before and after optimization, we find that -**10 billion data and 2000 files and Query resource:200 core 600G memory** +**10 billion data and 1000 files and Query resource:200 core 600G memory** | Table | Average File Size | Scan row count | Average query time | row count Skipping ratio | | ------------------- | ----------------- | -------------- | ------------------ | ------------------------ | | conn_random_parquet | 234.8 M | 10,000,000,000 | 27.031 s | 0.0% | -| conn_zorder_only_ip | 173.9 M | 43,170,600 | 2.668 s | 99.568% | -| conn_zorder | 174.0 M | 54,841,302 | 3.207 s | 99.451% | +| conn_zorder_only_ip | 173.9 M | 53,499,068 | 3.120 s | 99.465% | +| conn_zorder | 174.0 M | 35,910,500 | 3.103 s | 99.640% |