[KYUUBI #6512] Improve docs for KSHC

# 🔍 Description

Canonicalize wording and enrich the description for KSHC (the Kyuubi Spark Hive Connector).

## Types of changes 🔖

- [ ] Bugfix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)

## Test Plan 🧪

Review.

---

# Checklist 📝

- [x] This patch was not authored or co-authored using [Generative Tooling](https://www.apache.org/legal/generative-tooling.html)

**Be nice. Be informative.**

Closes #6512 from pan3793/kshc-doc.

Closes #6512

201c11341 [Cheng Pan] nit
1becc1ebb [Cheng Pan] nit
8d48c7c93 [Cheng Pan] fix
aea1e0386 [Cheng Pan] fix
5ba5094ab [Cheng Pan] fix
0c40de43d [Cheng Pan] fix
63dd21d11 [Cheng Pan] nit
1be266163 [Cheng Pan] Improve docs for KSHC

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
15 changed files with 106 additions and 81 deletions


@@ -42,9 +42,9 @@ Dependencies

 The **classpath** of kyuubi flink sql engine with Hudi supported consists of

-1. kyuubi-flink-sql-engine-\ |release|\ _2.12.jar, the engine jar deployed with Kyuubi distributions
+1. kyuubi-flink-sql-engine-\ |release|\ _2.12.jar, the engine jar deployed with a Kyuubi distribution
 2. a copy of flink distribution
-3. hudi-flink<flink.version>-bundle_<scala.version>-<hudi.version>.jar (example: hudi-flink1.14-bundle_2.12-0.11.1.jar), which can be found in the `Maven Central`_
+3. hudi-flink<flink.version>-bundle-<hudi.version>.jar (example: hudi-flink1.18-bundle-0.15.0.jar), which can be found in the `Maven Central`_

 In order to make the Hudi packages visible for the runtime classpath of engines, we can use one of these methods:


@@ -43,9 +43,9 @@ Dependencies

 The **classpath** of kyuubi flink sql engine with Iceberg supported consists of

-1. kyuubi-flink-sql-engine-\ |release|\ _2.12.jar, the engine jar deployed with Kyuubi distributions
+1. kyuubi-flink-sql-engine-\ |release|\ _2.12.jar, the engine jar deployed with a Kyuubi distribution
 2. a copy of flink distribution
-3. iceberg-flink-runtime-<flink.version>-<iceberg.version>.jar (example: iceberg-flink-runtime-1.14-0.14.0.jar), which can be found in the `Maven Central`_
+3. iceberg-flink-runtime-<flink.version>-<iceberg.version>.jar (example: iceberg-flink-runtime-1.18-1.5.2.jar), which can be found in the `Maven Central`_

 In order to make the Iceberg packages visible for the runtime classpath of engines, we can use one of these methods:


@@ -40,9 +40,9 @@ Dependencies

 The **classpath** of kyuubi flink sql engine with Apache Paimon (Incubating) supported consists of

-1. kyuubi-flink-sql-engine-\ |release|\ _2.12.jar, the engine jar deployed with Kyuubi distributions
+1. kyuubi-flink-sql-engine-\ |release|\ _2.12.jar, the engine jar deployed with a Kyuubi distribution
 2. a copy of flink distribution
-3. paimon-flink-<version>.jar (example: paimon-flink-1.16-0.4-SNAPSHOT.jar), which can be found in the `Apache Paimon (Incubating) Supported Engines Flink`_
+3. paimon-flink-<version>.jar (example: paimon-flink-1.18-0.8.1.jar), which can be found in the `Apache Paimon (Incubating) Supported Engines Flink`_
 4. flink-shaded-hadoop-2-uber-<version>.jar, which code can be found in the `Pre-bundled Hadoop Jar`_

 In order to make the Apache Paimon (Incubating) packages visible for the runtime classpath of engines, you need to:


@@ -44,7 +44,7 @@ Dependencies

 The **classpath** of kyuubi hive sql engine with Iceberg supported consists of

-1. kyuubi-hive-sql-engine-\ |release|\ _2.12.jar, the engine jar deployed with Kyuubi distributions
+1. kyuubi-hive-sql-engine-\ |release|\ _2.12.jar, the engine jar deployed with a Kyuubi distribution
 2. a copy of hive distribution
 3. iceberg-hive-runtime-<hive.version>_<scala.version>-<iceberg.version>.jar (example: iceberg-hive-runtime-3.2_2.12-0.14.0.jar), which can be found in the `Maven Central`_


@@ -42,7 +42,7 @@ Dependencies

 The **classpath** of kyuubi hive sql engine with Iceberg supported consists of

-1. kyuubi-hive-sql-engine-\ |release|\ _2.12.jar, the engine jar deployed with Kyuubi distributions
+1. kyuubi-hive-sql-engine-\ |release|\ _2.12.jar, the engine jar deployed with a Kyuubi distribution
 2. a copy of hive distribution
 3. paimon-hive-connector-<hive.binary.version>-<paimon.version>.jar (example: paimon-hive-connector-3.1-0.4-SNAPSHOT.jar), which can be found in the `Apache Paimon (Incubating) Supported Engines Hive`_


@@ -16,38 +16,38 @@
 `Delta Lake`_
 =============

-Delta lake is an open-source project that enables building a Lakehouse
+Delta Lake is an open-source project that enables building a Lakehouse
 Architecture on top of existing storage systems such as S3, ADLS, GCS,
 and HDFS.

 .. tip::
    This article assumes that you have mastered the basic knowledge and
    operation of `Delta Lake`_.
-   For the knowledge about delta lake not mentioned in this article,
+   For the knowledge about Delta Lake not mentioned in this article,
    you can obtain it from its `Official Documentation`_.

-By using kyuubi, we can run SQL queries towards delta lake which is more
+By using kyuubi, we can run SQL queries towards Delta Lake which is more
 convenient, easy to understand, and easy to expand than directly using
-spark to manipulate delta lake.
+spark to manipulate Delta Lake.

 Delta Lake Integration
 ----------------------

-To enable the integration of kyuubi spark sql engine and delta lake through
-Apache Spark Datasource V2 and Catalog APIs, you need to:
+To enable the integration of Kyuubi Spark SQL engine and Delta Lake through
+Spark DataSource V2 API, you need to:

-- Referencing the delta lake :ref:`dependencies<spark-delta-lake-deps>`
-- Setting the spark extension and catalog :ref:`configurations<spark-delta-lake-conf>`
+- Referencing the Delta Lake :ref:`dependencies<spark-delta-lake-deps>`
+- Setting the Spark extension and catalog :ref:`configurations<spark-delta-lake-conf>`

 .. _spark-delta-lake-deps:

 Dependencies
 ************

-The **classpath** of kyuubi spark sql engine with delta lake supported consists of
+The **classpath** of Kyuubi Spark SQL engine with Delta Lake supported consists of

-1. kyuubi-spark-sql-engine-\ |release|\ _2.12.jar, the engine jar deployed with kyuubi distributions
-2. a copy of spark distribution
+1. kyuubi-spark-sql-engine-\ |release|\ _2.12.jar, the engine jar deployed with a Kyuubi distribution
+2. a copy of Spark distribution
 3. delta-core & delta-storage, which can be found in the `Maven Central`_

 In order to make the delta packages visible for the runtime classpath of engines, we can use one of these methods:

@@ -63,7 +63,7 @@ In order to make the delta packages visible for the runtime classpath of engines
 Configurations
 **************

-To activate functionality of delta lake, we can set the following configurations:
+To activate functionality of Delta Lake, we can set the following configurations:

 .. code-block:: properties
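The hunk stops at the opening of the ``properties`` block, so the values themselves are not visible here. For orientation, activating Delta Lake ordinarily comes down to two settings — a sketch based on the upstream Delta Lake documentation, not on this diff:

```properties
# Delta Lake's session extension and catalog implementation, per Delta Lake docs
spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
```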


@@ -16,53 +16,52 @@
 `Hive`_
 ==========

-The Kyuubi Hive Connector is a datasource for both reading and writing Hive table,
-It is implemented based on Spark DataSource V2, and supports concatenating multiple Hive metastore at the same time.
+You may know that Apache Spark has built-in support for accessing Hive tables; it works well in most cases,
+but it is limited to one Hive Metastore. The Kyuubi Spark Hive Connector (KSHC) implements a Hive connector
+based on the Spark DataSource V2 API and supports accessing multiple Hive Metastores in a single Spark application.
+This connector can be used to federate queries over multiple Hive warehouses in a single Spark cluster.

-Hive Integration
-----------------
+Hive Connector Integration
+--------------------------

-To enable the integration of kyuubi spark sql engine and Hive connector through
-Apache Spark Datasource V2 and Catalog APIs, you need to:
+To enable the integration of Kyuubi Spark SQL engine and Hive connector through
+Spark DataSource V2 API, you need to:

 - Referencing the Hive connector :ref:`dependencies<kyuubi-hive-deps>`
-- Setting the spark extension and catalog :ref:`configurations<kyuubi-hive-conf>`
+- Setting the Spark catalog :ref:`configurations<kyuubi-hive-conf>`

 .. _kyuubi-hive-deps:

 Dependencies
 ************

-The **classpath** of kyuubi spark sql engine with Hive connector supported consists of
+The **classpath** of Kyuubi Spark SQL engine with Hive connector supported consists of

-1. kyuubi-spark-connector-hive_2.12-\ |release|\ , the hive connector jar deployed with Kyuubi distributions
-2. a copy of spark distribution
+1. kyuubi-spark-sql-engine-\ |release|\ _2.12.jar, the engine jar deployed with a Kyuubi distribution
+2. a copy of Spark distribution
+3. kyuubi-spark-connector-hive_2.12-\ |release|\ , which can be found in the `Maven Central`_

 In order to make the Hive connector packages visible for the runtime classpath of engines, we can use one of these methods:

 1. Put the Kyuubi Hive connector packages into ``$SPARK_HOME/jars`` directly
 2. Set ``spark.jars=/path/to/kyuubi-hive-connector``

+.. note::
+   Starting from v1.9.2 and v1.10.0, KSHC jars available in the `Maven Central`_ guarantee binary compatibility
+   across Spark versions, namely, Spark 3.3 onwards.

 .. _kyuubi-hive-conf:

 Configurations
 **************

-To activate functionality of Kyuubi Hive connector, we can set the following configurations:
+To activate the functionality of the Kyuubi Spark Hive connector, we can set the following configurations:

 .. code-block:: properties

-   spark.sql.catalog.hive_catalog org.apache.kyuubi.spark.connector.hive.HiveTableCatalog
-   spark.sql.catalog.hive_catalog.spark.sql.hive.metastore.version hive-metastore-version
-   spark.sql.catalog.hive_catalog.hive.metastore.uris thrift://metastore-host:port
-   spark.sql.catalog.hive_catalog.hive.metastore.port port
-   spark.sql.catalog.hive_catalog.spark.sql.hive.metastore.jars path
-   spark.sql.catalog.hive_catalog.spark.sql.hive.metastore.jars.path file:///opt/hive1/lib/*.jar
-
-.. tip::
-   For details about the multi-version Hive configuration, see the related multi-version Hive configurations supported by Apache Spark.
+   spark.sql.catalog.hive_catalog org.apache.kyuubi.spark.connector.hive.HiveTableCatalog
+   spark.sql.catalog.hive_catalog.hive.metastore.uris thrift://metastore-host:port
+   spark.sql.catalog.hive_catalog.<other.hive.conf> <value>
+   spark.sql.catalog.hive_catalog.<other.hadoop.conf> <value>

 Hive Connector Operations
 -------------------------

@@ -106,4 +105,29 @@ Taking ``DROP NAMESPACE`` as a example,
    DROP NAMESPACE hive_catalog.ns;

-.. _Apache Spark: https://spark.apache.org/
+Advanced Usages
+***************
+
+Though KSHC is a pure Spark DataSource V2 connector that is not coupled with the Kyuubi deployment, due to the
+implementation inside ``spark-sql``, you should not expect KSHC to work properly with ``spark-sql``, and any
+issue caused by such a combination will not be considered at this time. Instead, it is recommended to use
+BeeLine with Kyuubi as a drop-in replacement for ``spark-sql``, or to switch to ``spark-shell``.
+
+KSHC supports accessing a Kerberized Hive Metastore and HDFS by using a keytab, a TGT cache, or a Delegation Token.
+It is not expected to work properly with multiple KDC instances; this limitation comes from the JDK ``Krb5LoginModule``.
+For such cases, consider setting up cross-realm Kerberos trusts so that you only need to talk to one KDC.
+
+For the HMS Thrift API used by Spark, it is known that the Hive 2.3.9 client is compatible with HMS from 2.1 to 4.0,
+and the Hive 2.3.10 client is compatible with HMS from 1.1 to 4.0; such version combinations should cover most cases.
+For other corner cases, KSHC also supports ``spark.sql.catalog.<catalog_name>.spark.sql.hive.metastore.jars`` and
+``spark.sql.catalog.<catalog_name>.spark.sql.hive.metastore.version``, just as the Spark built-in Hive datasource
+does; refer to the Spark documentation for details.
+
+Currently, KSHC has not implemented the Parquet/ORC Hive table read/write optimization; in other words, it always
+uses Hive SerDe to access Hive tables, so there may be a performance gap compared to the Spark built-in Hive
+datasource, especially due to the lack of support for vectorized reading. You may also hit bugs caused by Hive SerDe,
+e.g. ``ParquetHiveSerDe`` cannot read Parquet files in which decimals are written in the int-based format produced by
+the Spark Parquet datasource writer with ``spark.sql.parquet.writeLegacyFormat=false``.
+
+.. _Apache Spark: https://spark.apache.org/
+.. _Maven Central: https://mvnrepository.com/artifact/org.apache.kyuubi/kyuubi-spark-connector-hive
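A concrete picture of what these configurations enable may help. The sketch below registers two independent Hive Metastores as separate catalogs and federates a query across them; the catalog names, hosts, databases, and tables are placeholders, while the ``HiveTableCatalog`` class is the one shown in the hunk above.

```properties
# two independent Hive Metastores, each mounted as its own Spark catalog
spark.sql.catalog.hive_prod org.apache.kyuubi.spark.connector.hive.HiveTableCatalog
spark.sql.catalog.hive_prod.hive.metastore.uris thrift://hms-prod:9083
spark.sql.catalog.hive_test org.apache.kyuubi.spark.connector.hive.HiveTableCatalog
spark.sql.catalog.hive_test.hive.metastore.uris thrift://hms-test:9083
```

```sql
-- a federated query spanning both warehouses in one Spark application
SELECT p.order_id
FROM hive_prod.sales.orders p
JOIN hive_test.sales.orders t ON p.order_id = t.order_id;
```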


@@ -30,8 +30,8 @@ and easy to expand than directly using Spark to manipulate Hudi.
 Hudi Integration
 ----------------

-To enable the integration of kyuubi spark sql engine and Hudi through
-Catalog APIs, you need to:
+To enable the integration of Kyuubi Spark SQL engine and Hudi through
+Spark DataSource V2 API, you need to:

 - Referencing the Hudi :ref:`dependencies<spark-hudi-deps>`
 - Setting the Spark extension and catalog :ref:`configurations<spark-hudi-conf>`

@@ -41,10 +41,10 @@ Catalog APIs, you need to:
 Dependencies
 ************

-The **classpath** of kyuubi spark sql engine with Hudi supported consists of
+The **classpath** of Kyuubi Spark SQL engine with Hudi supported consists of

-1. kyuubi-spark-sql-engine-\ |release|\ _2.12.jar, the engine jar deployed with Kyuubi distributions
-2. a copy of spark distribution
+1. kyuubi-spark-sql-engine-\ |release|\ _2.12.jar, the engine jar deployed with a Kyuubi distribution
+2. a copy of Spark distribution
 3. hudi-spark<spark.version>-bundle_<scala.version>-<hudi.version>.jar (example: hudi-spark3.2-bundle_2.12-0.11.1.jar), which can be found in the `Maven Central`_

 In order to make the Hudi packages visible for the runtime classpath of engines, we can use one of these methods:
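The Spark extension and catalog configurations referenced above sit outside this hunk; as a sketch based on the Hudi quickstart guide (not on this diff), they usually look like this:

```properties
# Hudi needs Kryo serialization plus its session extension and catalog, per Hudi quickstart
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension
spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog
```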


@@ -32,21 +32,21 @@ spark to manipulate Iceberg.
 Iceberg Integration
 -------------------

-To enable the integration of kyuubi spark sql engine and Iceberg through
-Apache Spark Datasource V2 and Catalog APIs, you need to:
+To enable the integration of Kyuubi Spark SQL engine and Iceberg through
+Spark DataSource V2 API, you need to:

 - Referencing the Iceberg :ref:`dependencies<spark-iceberg-deps>`
-- Setting the spark extension and catalog :ref:`configurations<spark-iceberg-conf>`
+- Setting the Spark extension and catalog :ref:`configurations<spark-iceberg-conf>`

 .. _spark-iceberg-deps:

 Dependencies
 ************

-The **classpath** of kyuubi spark sql engine with Iceberg supported consists of
+The **classpath** of Kyuubi Spark SQL engine with Iceberg supported consists of

-1. kyuubi-spark-sql-engine-\ |release|\ _2.12.jar, the engine jar deployed with Kyuubi distributions
-2. a copy of spark distribution
+1. kyuubi-spark-sql-engine-\ |release|\ _2.12.jar, the engine jar deployed with a Kyuubi distribution
+2. a copy of Spark distribution
 3. iceberg-spark-runtime-<spark.version>_<scala.version>-<iceberg.version>.jar (example: iceberg-spark-runtime-3.2_2.12-0.14.0.jar), which can be found in the `Maven Central`_

 In order to make the Iceberg packages visible for the runtime classpath of engines, we can use one of these methods:
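Likewise, the Iceberg extension and catalog configurations referenced above are outside this hunk; a sketch based on the Iceberg docs, with ``type=hive`` assuming a Hive-backed catalog:

```properties
# Iceberg's session extension plus a session catalog backed by Hive, per Iceberg docs
spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog
spark.sql.catalog.spark_catalog.type=hive
```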


@@ -30,21 +30,22 @@ spark to manipulate Apache Paimon (Incubating).
 Apache Paimon (Incubating) Integration
 --------------------------------------

-To enable the integration of kyuubi spark sql engine and Apache Paimon (Incubating), you need to set the following configurations:
-- Setting the spark extension and catalog :ref:`configurations<spark-paimon-conf>`
+To enable the integration of Kyuubi Spark SQL engine and Apache Paimon (Incubating) through
+Spark DataSource V2 API, you need to:
+
+- Referencing the Apache Paimon (Incubating) :ref:`dependencies<spark-paimon-deps>`
+- Setting the Spark extension and catalog :ref:`configurations<spark-paimon-conf>`

 .. _spark-paimon-deps:

 Dependencies
 ************

-The **classpath** of kyuubi spark sql engine with Apache Paimon (Incubating) consists of
+The **classpath** of Kyuubi Spark SQL engine with Apache Paimon (Incubating) consists of

-1. kyuubi-spark-sql-engine-\ |release|\ _2.12.jar, the engine jar deployed with Kyuubi distributions
-2. a copy of spark distribution
-3. paimon-spark-<version>.jar (example: paimon-spark-3.3-0.4-20230323.002035-5.jar), which can be found in the `Apache Paimon (Incubating) Supported Engines Spark3`_
+1. kyuubi-spark-sql-engine-\ |release|\ _2.12.jar, the engine jar deployed with a Kyuubi distribution
+2. a copy of Spark distribution
+3. paimon-spark-<version>.jar (example: paimon-spark-3.5-0.8.1.jar), which can be found in the `Apache Paimon (Incubating) Supported Engines Spark3`_

 In order to make the Apache Paimon (Incubating) packages visible for the runtime classpath of engines, we can use one of these methods:
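The catalog configuration referenced above is also outside this hunk; a sketch based on the Paimon docs, with the catalog name and warehouse path as placeholders:

```properties
# Paimon is mounted as a Spark catalog pointing at a warehouse location, per Paimon docs
spark.sql.catalog.paimon=org.apache.paimon.spark.SparkCatalog
spark.sql.catalog.paimon.warehouse=hdfs:///path/to/paimon/warehouse
```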


@@ -35,20 +35,20 @@ spark to manipulate TiDB/TiKV.
 TiDB Integration
 -------------------

-To enable the integration of kyuubi spark sql engine and TiDB through
-Apache Spark Datasource V2 and Catalog APIs, you need to:
+To enable the integration of Kyuubi Spark SQL engine and TiDB through
+Spark DataSource V2 API, you need to:

 - Referencing the TiSpark :ref:`dependencies<spark-tidb-deps>`
-- Setting the spark extension and catalog :ref:`configurations<spark-tidb-conf>`
+- Setting the Spark extension and catalog :ref:`configurations<spark-tidb-conf>`

 .. _spark-tidb-deps:

 Dependencies
 ************

-The classpath of kyuubi spark sql engine with TiDB supported consists of
+The classpath of Kyuubi Spark SQL engine with TiDB supported consists of

-1. kyuubi-spark-sql-engine-\ |release|\ _2.12.jar, the engine jar deployed with Kyuubi distributions
-2. a copy of spark distribution
+1. kyuubi-spark-sql-engine-\ |release|\ _2.12.jar, the engine jar deployed with a Kyuubi distribution
+2. a copy of Spark distribution
 3. tispark-assembly-<spark.version>_<scala.version>-<tispark.version>.jar (example: tispark-assembly-3.2_2.12-3.0.1.jar), which can be found in the `Maven Central`_

 In order to make the TiSpark packages visible for the runtime classpath of engines, we can use one of these methods:
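The TiSpark configurations referenced above are likewise not shown in the hunk; a sketch based on the TiSpark docs, with the placement driver addresses as placeholders:

```properties
# TiSpark hooks in through its session extension and the TiKV placement driver endpoints
spark.sql.extensions=org.apache.tispark.extensions.TiExtensions
spark.tispark.pd.addresses=pd0:2379,pd1:2379
```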


@@ -32,21 +32,21 @@ Goto `Try Kyuubi`_ to explore TPC-DS data instantly!
 TPC-DS Integration
 ------------------

-To enable the integration of kyuubi spark sql engine and TPC-DS through
-Apache Spark Datasource V2 and Catalog APIs, you need to:
+To enable the integration of Kyuubi Spark SQL engine and TPC-DS through
+Spark DataSource V2 API, you need to:

 - Referencing the TPC-DS connector :ref:`dependencies<spark-tpcds-deps>`
-- Setting the spark catalog :ref:`configurations<spark-tpcds-conf>`
+- Setting the Spark catalog :ref:`configurations<spark-tpcds-conf>`

 .. _spark-tpcds-deps:

 Dependencies
 ************

-The **classpath** of kyuubi spark sql engine with TPC-DS supported consists of
+The **classpath** of Kyuubi Spark SQL engine with TPC-DS supported consists of

-1. kyuubi-spark-sql-engine-\ |release|\ _2.12.jar, the engine jar deployed with Kyuubi distributions
-2. a copy of spark distribution
+1. kyuubi-spark-sql-engine-\ |release|\ _2.12.jar, the engine jar deployed with a Kyuubi distribution
+2. a copy of Spark distribution
 3. kyuubi-spark-connector-tpcds-\ |release|\ _2.12.jar, which can be found in the `Maven Central`_

 In order to make the TPC-DS connector package visible for the runtime classpath of engines, we can use one of these methods:
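For context, mounting the TPC-DS connector is a single catalog setting; a sketch based on the Kyuubi TPC-DS connector docs, where the catalog name ``tpcds`` and the ``sf1`` schema are conventional choices, not mandated:

```properties
spark.sql.catalog.tpcds=org.apache.kyuubi.spark.connector.tpcds.TPCDSCatalog
```

```sql
-- scale factors appear as schemas under the catalog
SELECT COUNT(*) FROM tpcds.sf1.store_sales;
```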


@@ -32,21 +32,21 @@ Goto `Try Kyuubi`_ to explore TPC-H data instantly!
 TPC-H Integration
 ------------------

-To enable the integration of kyuubi spark sql engine and TPC-H through
-Apache Spark Datasource V2 and Catalog APIs, you need to:
+To enable the integration of Kyuubi Spark SQL engine and TPC-H through
+Spark DataSource V2 API, you need to:

 - Referencing the TPC-H connector :ref:`dependencies<spark-tpch-deps>`
-- Setting the spark catalog :ref:`configurations<spark-tpch-conf>`
+- Setting the Spark catalog :ref:`configurations<spark-tpch-conf>`

 .. _spark-tpch-deps:

 Dependencies
 ************

-The **classpath** of kyuubi spark sql engine with TPC-H supported consists of
+The **classpath** of Kyuubi Spark SQL engine with TPC-H supported consists of

-1. kyuubi-spark-sql-engine-\ |release|\ _2.12.jar, the engine jar deployed with Kyuubi distributions
-2. a copy of spark distribution
+1. kyuubi-spark-sql-engine-\ |release|\ _2.12.jar, the engine jar deployed with a Kyuubi distribution
+2. a copy of Spark distribution
 3. kyuubi-spark-connector-tpch-\ |release|\ _2.12.jar, which can be found in the `Maven Central`_

 In order to make the TPC-H connector package visible for the runtime classpath of engines, we can use one of these methods:
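The TPC-H connector follows the same shape; a sketch under the same assumptions as the TPC-DS example above (catalog and schema names are conventional placeholders):

```properties
spark.sql.catalog.tpch=org.apache.kyuubi.spark.connector.tpch.TPCHCatalog
```

```sql
SELECT COUNT(*) FROM tpch.sf1.lineitem;
```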


@@ -42,7 +42,7 @@ Dependencies

 The **classpath** of kyuubi trino sql engine with Apache Paimon (Incubating) supported consists of

-1. kyuubi-trino-sql-engine-\ |release|\ _2.12.jar, the engine jar deployed with Kyuubi distributions
+1. kyuubi-trino-sql-engine-\ |release|\ _2.12.jar, the engine jar deployed with a Kyuubi distribution
 2. a copy of trino distribution
 3. paimon-trino-<version>.jar (example: paimon-trino-0.2.jar), which code can be found in the `Source Code`_
 4. flink-shaded-hadoop-2-uber-<version>.jar, which code can be found in the `Pre-bundled Hadoop`_


@@ -21,12 +21,12 @@
 - [Apache Ranger](https://ranger.apache.org/)

-  This plugin works as a ranger rest client with Apache Ranger admin server to do privilege check.
+  This plugin works as a ranger rest client with Apache Ranger Admin server to do privilege check.
   Thus, a ranger server need to be installed ahead and available to use.

 - Building(optional)

-  If your ranger admin or spark distribution is not compatible with the official pre-built [artifact](https://mvnrepository.com/artifact/org.apache.kyuubi/kyuubi-spark-authz) in maven central.
+  If your Ranger Admin or Spark distribution is not compatible with the official pre-built [artifact](https://mvnrepository.com/artifact/org.apache.kyuubi/kyuubi-spark-authz) in maven central.
   You need to [build](build.md) the plugin targeting the spark/ranger you are using by yourself.

 ## Install
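For readers following the prerequisites above: once the plugin jar is on the Spark classpath, enabling it is a one-line setting — a sketch based on the Kyuubi AuthZ docs, not part of this hunk:

```properties
# register the Ranger-backed authorization extension shipped by the AuthZ plugin
spark.sql.extensions=org.apache.kyuubi.plugin.spark.authz.ranger.RangerSparkExtension
```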