[KYUUBI #699][DOCS] Add document for kyuubi-extension-spark_3.1 module


### _Why are the changes needed?_
Make Kyuubi SQL extension readable.

### _How was this patch tested?_
The screenshot is:

![image](https://user-images.githubusercontent.com/12025282/122504925-e0e2d580-d02d-11eb-9ce0-087d814aad98.png)

Closes #702 from ulysses-you/docs.

Closes #699

d9d63604 [ulysses-you] nit
21d9cf75 [ulysses-you] docs
b034421d [ulysses-you] docs

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: ulysses-you <ulyssesyou18@gmail.com>
This commit is contained in:
ulysses-you 2021-06-18 18:10:01 +08:00
parent cd32e4aeb6
commit 2a05326c1b
2 changed files with 54 additions and 1 deletions


@@ -4,11 +4,12 @@
 SQL References
 ==============
-This part describes the use of the SQL References in Kyuubi, including lists of the available data types and functions for use in SQL commands.
+This part describes the use of the SQL References in Kyuubi, including lists of the available extensions, data types and functions for use in SQL commands.
 .. toctree::
     :maxdepth: 2
     :numbered: 3
+    rules
     functions

docs/sql/rules.md Normal file

@@ -0,0 +1,52 @@
<!-- DO NOT MODIFY THIS FILE DIRECTLY, IT IS AUTO GENERATED BY [org.apache.kyuubi.engine.spark.udf.KyuubiUDFRegistrySuite] -->
<div align=center>
![](../imgs/kyuubi_logo.png)
</div>
# Auxiliary SQL extension for Spark SQL
Kyuubi provides a SQL extension out of the box. Due to version compatibility with Apache Spark, currently we only support Apache Spark branch-3.1 (i.e., 3.1.1 and 3.1.2).
And don't worry, Kyuubi will support newer Apache Spark versions in the future. Thanks to the adaptive query execution framework (AQE), Kyuubi can perform these optimizations.
## What features does the Kyuubi SQL extension provide
- merging small files automatically
Small files are a long-standing issue with Apache Spark. Kyuubi can merge small files by adding an extra shuffle.
Currently, Kyuubi supports handling small files for both datasource and Hive tables, and it also optimizes dynamic partition insertion.
For example, given a common write query `INSERT INTO TABLE $table1 SELECT * FROM $table2`, Kyuubi will introduce an extra shuffle before the write, and then the small files go away.
- inserting a shuffle node before join to make AQE `OptimizeSkewedJoin` work
In the current implementation, Apache Spark can only optimize a skewed join if it is a standard join, which means the join must have two sort and shuffle nodes.
However, in complex scenarios this assumption is easily broken. Kyuubi can guarantee the join is standard by adding an extra shuffle node before the join,
so that `OptimizeSkewedJoin` can work better.
- stage-level config isolation in AQE
As we know, `spark.sql.adaptive.advisoryPartitionSizeInBytes` is a key config in Apache Spark AQE.
It controls how much data each task should handle during shuffle, so we usually use 64MB or a smaller value to ensure enough parallelism.
However, in general we expect an output file to be fairly big, e.g. 256MB or 512MB. Kyuubi can isolate this config to resolve the conflict, so that
intermediate stages can use a small partition data size while the final stage uses a big one.
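Putting the three features above together, a sketch of enabling them in `spark-defaults.conf` could look like the fragment below (values shown are defaults or illustrative; see the config table in this document for details):

```properties
# Merge small files by inserting a repartition before write (enabled by default).
spark.sql.optimizer.insertRepartitionBeforeWrite.enabled=true
# Force a shuffle before joins so AQE's OptimizeSkewedJoin can apply.
spark.sql.optimizer.forceShuffleBeforeJoin.enabled=true
# Let the final stage use different config values from previous stages.
spark.sql.optimizer.finalStageConfigIsolation.enabled=true
```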
## How to use Kyuubi SQL extension
1. Choose Apache Spark branch-3.1 or a later version together with the Kyuubi binary tgz.
2. If you compile Kyuubi yourself, add the Maven option `-Pkyuubi-extension-spark_3.1`.
3. Move the jar (`kyuubi-extension-spark_*.jar`) from `$KYUUBI_HOME/extension` into `$SPARK_HOME/jars`.
4. Add a config to `spark-defaults.conf`: `spark.sql.extensions=org.apache.kyuubi.sql.KyuubiSparkSQLExtension`.
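Steps 3 and 4 above can be sketched as shell commands. This is only a sketch: it uses a scratch directory layout and a placeholder jar name so it is safe to run anywhere; substitute your real `$KYUUBI_HOME` and `$SPARK_HOME`.

```shell
# Scratch layout with a placeholder jar (hypothetical paths for illustration;
# point KYUUBI_HOME and SPARK_HOME at your real installations instead).
KYUUBI_HOME=$(mktemp -d)
SPARK_HOME=$(mktemp -d)
mkdir -p "$KYUUBI_HOME/extension" "$SPARK_HOME/jars" "$SPARK_HOME/conf"
touch "$KYUUBI_HOME/extension/kyuubi-extension-spark_3.1-1.2.0.jar"  # placeholder

# Step 3: put the extension jar on Spark's classpath.
cp "$KYUUBI_HOME"/extension/kyuubi-extension-spark_*.jar "$SPARK_HOME/jars/"

# Step 4: register the extension with Spark.
echo "spark.sql.extensions=org.apache.kyuubi.sql.KyuubiSparkSQLExtension" \
  >> "$SPARK_HOME/conf/spark-defaults.conf"
```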
Now you can enjoy the Kyuubi SQL extension. Kyuubi also provides some configs to make these features easy to use.
Name | Default Value | Description | Since
--- | --- | --- | ---
spark.sql.optimizer.insertRepartitionBeforeWrite.enabled | true | Add repartition node at the top of query plan. An approach of merging small files. | 1.2.0
spark.sql.optimizer.insertRepartitionNum | none | The partition number if `spark.sql.optimizer.insertRepartitionBeforeWrite.enabled` is enabled. If AQE is disabled, the default value is `spark.sql.shuffle.partitions`. If AQE is enabled, the default value is none that means depend on AQE. | 1.2.0
spark.sql.optimizer.dynamicPartitionInsertionRepartitionNum | 100 | The partition number of each dynamic partition if `spark.sql.optimizer.insertRepartitionBeforeWrite.enabled` is enabled. We will repartition by dynamic partition columns to reduce the small file but that can cause data skew. This config is to extend the partition of dynamic partition column to avoid skew but may generate some small files. | 1.2.0
spark.sql.optimizer.forceShuffleBeforeJoin.enabled | false | Ensure a shuffle node exists before a shuffled join (SHJ and SMJ) so that AQE `OptimizeSkewedJoin` works (complex-scenario joins, multi-table joins). | 1.2.0
spark.sql.optimizer.finalStageConfigIsolation.enabled | false | If true, the final stage can use different config values from previous stages. The prefix of a final stage config key should be `spark.sql.finalStage.`. For example, for the raw Spark config `spark.sql.adaptive.advisoryPartitionSizeInBytes`, the final stage config would be `spark.sql.finalStage.adaptive.advisoryPartitionSizeInBytes`. | 1.2.0
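As a sketch of the stage config isolation described above, the fragment below keeps intermediate shuffle partitions small for parallelism while making the final write stage produce big files (sizes are illustrative):

```properties
spark.sql.optimizer.finalStageConfigIsolation.enabled=true
# Intermediate shuffle stages: small partitions for high parallelism.
spark.sql.adaptive.advisoryPartitionSizeInBytes=64m
# Final (write) stage: large partitions so output files are big.
spark.sql.finalStage.adaptive.advisoryPartitionSizeInBytes=512m
```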