### _Why are the changes needed?_

- to consolidate styles in markdown files, whether manually written or auto-generated
- apply markdown formatting rules with flexmark from [spotless-maven-plugin](https://github.com/diffplug/spotless/tree/main/plugin-maven#markdown) to *.md files in `/docs`
- use `flexmark` to format the markdown generation in `TestUtils` of the common module, used by `AllKyuubiConfiguration` and `KyuubiDefinedFunctionSuite`, in the same way as `FlexmarkFormatterFunc` of `spotless-maven-plugin`, with `COMMONMARK` as the `FORMATTER_EMULATION_PROFILE` (https://github.com/diffplug/spotless/blob/maven/2.30.0/lib/src/flexmark/java/com/diffplug/spotless/glue/markdown/FlexmarkFormatterFunc.java)
- use `flexmark` `0.62.2`, the last version that requires only Java 8+ (checked from the pom file and bytecode version)

```xml
<markdown>
  <includes>
    <include>docs/**/*.md</include>
  </includes>
  <flexmark></flexmark>
</markdown>
```

- changes applied to markdown doc files:
  - no style changes or breakages in the docs built by `make html`
  - removed the leading blank line in licenses and comments to conform to markdown style rules
  - tables regenerated by flexmark, following [GitHub Flavored Markdown](https://help.github.com/articles/organizing-information-with-tables/) (https://github.com/vsch/flexmark-java/wiki/Extensions#tables)

### _How was this patch tested?_

- [x] regenerated the docs using `make html` successfully and checked that all markdown pages are available
- [x] regenerated `settings.md` and `functions.md` via `AllKyuubiConfiguration` and `KyuubiDefinedFunctionSuite`, passing both their own checks and the spotless check via `dev/reformat`
- [x] [Run test](https://kyuubi.readthedocs.io/en/master/develop_tools/testing.html#running-tests) locally before making a pull request

Closes #4200 from bowenliang123/markdown-formatting.
Closes #4200

1eeafce4 [liangbowen] revert minor changes in AllKyuubiConfiguration
4f892857 [liangbowen] use flexmark in markdown doc generation
8c978abd [liangbowen] changes on markdown files
a9190556 [liangbowen] apply markdown formatting rules with `spotless-maven-plugin` to markdown files within `/docs`

Authored-by: liangbowen <liangbowen@gf.com.cn>
Signed-off-by: liangbowen <liangbowen@gf.com.cn>
# Solution for Big Result Sets
Typically, when a user submits a SELECT query to the Spark SQL engine, the Driver calls `collect` to trigger computation and gather the entire data set from all tasks (a.k.a. partitions of an RDD). After all partition data has arrived, the client pulls the result set from the Driver through the Kyuubi Server in small batches.
Therefore, for a query with a big result set, the bottleneck is the Spark Driver. To avoid OOM, Spark provides the configuration `spark.driver.maxResultSize`, which defaults to 1g; you should enlarge it, as well as `spark.driver.memory`, if your query produces a result set of several GB. But what if the result set is dozens or even hundreds of GB? It would be best if you had an incremental collection mode.
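For example, both settings could be enlarged in `$SPARK_HOME/conf/spark-defaults.conf`. The values below are illustrative only; size them to your workload:

```properties
spark.driver.maxResultSize  8g
spark.driver.memory         12g
```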
## Incremental collection
Since v1.4.0-incubating, Kyuubi supports an incremental collection mode as a solution for big result sets. This feature is disabled by default; you can turn it on by setting the configuration `kyuubi.operation.incremental.collect` to `true`.
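For example, to enable it for all sessions, the configuration could be set in `$KYUUBI_HOME/conf/kyuubi-defaults.conf` (an illustrative snippet; enabling it globally is usually not recommended, as discussed below):

```properties
kyuubi.operation.incremental.collect=true
```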
Incremental collection changes the gather method from `collect` to `toLocalIterator`. `toLocalIterator` is a Spark action that sequentially submits jobs to retrieve partitions. As each partition is retrieved, the client pulls the result set from the Driver through the Kyuubi Server in a streaming manner. This reduces the Driver memory footprint significantly, from the size of the complete result set to the size of the largest partition.
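The difference in the Driver-side memory profile can be sketched in plain Python (no Spark required; the function names only mirror the Spark actions described above, they are not Spark APIs). `collect` materializes every partition at once, while `to_local_iterator` keeps only one partition resident at a time:

```python
def collect(partitions):
    # Like Spark's collect(): the entire result set lives in the
    # driver at once, so peak memory ~ total row count.
    return [row for part in partitions for row in part]

def to_local_iterator(partitions):
    # Like Spark's toLocalIterator(): rows are yielded partition by
    # partition, so at any moment only one partition must be held and
    # peak memory ~ the largest partition.
    for part in partitions:
        yield from part

partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

full = collect(partitions)                          # 9 rows resident at once
peak_incremental = max(len(p) for p in partitions)  # at most 4 rows at a time
streamed = list(to_local_iterator(partitions))      # same rows, streamed
```

The rows delivered are identical either way; only the peak number of rows held by the "driver" differs.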
Incremental collection is not a silver bullet; you should turn it on carefully, because it can significantly hurt performance. And even in incremental collection mode, when multiple queries execute concurrently, each query still holds one partition of data in Driver memory. Therefore, it is still important to control the number of concurrent queries to avoid OOM.
## Use in single connections
As explained above, incremental collection mode is not suitable for common query scenarios. Instead, you can enable it for specific queries:
```bash
beeline -u 'jdbc:hive2://kyuubi:10009/?spark.driver.maxResultSize=8g;spark.driver.memory=12g#kyuubi.engine.share.level=CONNECTION;kyuubi.operation.incremental.collect=true' \
    --incremental=true \
    -f big_result_query.sql
```
`--incremental=true` is required for the beeline client; otherwise, the entire result set is fetched and buffered before being displayed, which may cause a client-side OOM.
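For programmatic JDBC clients, the same configurations can be embedded in the connection URL. The small Python helper below is a hypothetical illustration (the function is our own, not a Kyuubi API) of how such a URL is assembled: Spark configurations go in the session part after `?`, and Kyuubi configurations in the fragment after `#`, as in the beeline command above:

```python
def kyuubi_jdbc_url(host, port, spark_confs=None, kyuubi_confs=None):
    """Hypothetical helper: build a Kyuubi JDBC URL.

    Spark configs are placed after '?' and Kyuubi configs after '#',
    each joined with ';'.
    """
    url = f"jdbc:hive2://{host}:{port}/"
    if spark_confs:
        url += "?" + ";".join(f"{k}={v}" for k, v in spark_confs.items())
    if kyuubi_confs:
        url += "#" + ";".join(f"{k}={v}" for k, v in kyuubi_confs.items())
    return url

url = kyuubi_jdbc_url(
    "kyuubi", 10009,
    {"spark.driver.maxResultSize": "8g", "spark.driver.memory": "12g"},
    {"kyuubi.engine.share.level": "CONNECTION",
     "kyuubi.operation.incremental.collect": "true"},
)
```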
## Change incremental collection mode in session
The configuration `kyuubi.operation.incremental.collect` can also be changed within a session using `SET`:
```
~ beeline -u 'jdbc:hive2://localhost:10009'
Connected to: Apache Kyuubi (Incubating) (version 1.5.0-SNAPSHOT)

0: jdbc:hive2://localhost:10009/> set kyuubi.operation.incremental.collect=true;
+---------------------------------------+--------+
|                  key                  | value  |
+---------------------------------------+--------+
| kyuubi.operation.incremental.collect  | true   |
+---------------------------------------+--------+
1 row selected (0.039 seconds)

0: jdbc:hive2://localhost:10009/> select /*+ REPARTITION(5) */ * from range(1, 10);
+-----+
| id  |
+-----+
| 2   |
| 6   |
| 7   |
| 0   |
| 5   |
| 3   |
| 4   |
| 1   |
| 8   |
| 9   |
+-----+
10 rows selected (1.929 seconds)

0: jdbc:hive2://localhost:10009/> set kyuubi.operation.incremental.collect=false;
+---------------------------------------+--------+
|                  key                  | value  |
+---------------------------------------+--------+
| kyuubi.operation.incremental.collect  | false  |
+---------------------------------------+--------+
1 row selected (0.027 seconds)

0: jdbc:hive2://localhost:10009/> select /*+ REPARTITION(5) */ * from range(1, 10);
+-----+
| id  |
+-----+
| 2   |
| 6   |
| 7   |
| 0   |
| 5   |
| 3   |
| 4   |
| 1   |
| 8   |
| 9   |
+-----+
10 rows selected (0.128 seconds)
```
From the Spark UI, we can see that in incremental collection mode the query produces 5 jobs, one per partition, while in normal mode it produces only 1 job.
