diff --git a/docs/connector/flink/hudi.rst b/docs/connector/flink/hudi.rst index 0000bde5b..39abee234 100644 --- a/docs/connector/flink/hudi.rst +++ b/docs/connector/flink/hudi.rst @@ -42,9 +42,9 @@ Dependencies The **classpath** of kyuubi flink sql engine with Hudi supported consists of -1. kyuubi-flink-sql-engine-\ |release|\ _2.12.jar, the engine jar deployed with Kyuubi distributions +1. kyuubi-flink-sql-engine-\ |release|\ _2.12.jar, the engine jar deployed with a Kyuubi distribution 2. a copy of flink distribution -3. hudi-flink-bundle_-.jar (example: hudi-flink1.14-bundle_2.12-0.11.1.jar), which can be found in the `Maven Central`_ +3. hudi-flink-bundle-.jar (example: hudi-flink1.18-bundle-0.15.0.jar), which can be found in the `Maven Central`_ In order to make the Hudi packages visible for the runtime classpath of engines, we can use one of these methods: diff --git a/docs/connector/flink/iceberg.rst b/docs/connector/flink/iceberg.rst index ab4a701f4..9efbe77d8 100644 --- a/docs/connector/flink/iceberg.rst +++ b/docs/connector/flink/iceberg.rst @@ -43,9 +43,9 @@ Dependencies The **classpath** of kyuubi flink sql engine with Iceberg supported consists of -1. kyuubi-flink-sql-engine-\ |release|\ _2.12.jar, the engine jar deployed with Kyuubi distributions +1. kyuubi-flink-sql-engine-\ |release|\ _2.12.jar, the engine jar deployed with a Kyuubi distribution 2. a copy of flink distribution -3. iceberg-flink-runtime--.jar (example: iceberg-flink-runtime-1.14-0.14.0.jar), which can be found in the `Maven Central`_ +3. 
iceberg-flink-runtime--.jar (example: iceberg-flink-runtime-1.18-1.5.2.jar), which can be found in the `Maven Central`_ In order to make the Iceberg packages visible for the runtime classpath of engines, we can use one of these methods: diff --git a/docs/connector/flink/paimon.rst b/docs/connector/flink/paimon.rst index b67101488..15404876a 100644 --- a/docs/connector/flink/paimon.rst +++ b/docs/connector/flink/paimon.rst @@ -40,9 +40,9 @@ Dependencies The **classpath** of kyuubi flink sql engine with Apache Paimon (Incubating) supported consists of -1. kyuubi-flink-sql-engine-\ |release|\ _2.12.jar, the engine jar deployed with Kyuubi distributions +1. kyuubi-flink-sql-engine-\ |release|\ _2.12.jar, the engine jar deployed with a Kyuubi distribution 2. a copy of flink distribution -3. paimon-flink-.jar (example: paimon-flink-1.16-0.4-SNAPSHOT.jar), which can be found in the `Apache Paimon (Incubating) Supported Engines Flink`_ +3. paimon-flink-.jar (example: paimon-flink-1.18-0.8.1.jar), which can be found in the `Apache Paimon (Incubating) Supported Engines Flink`_ 4. flink-shaded-hadoop-2-uber-.jar, which code can be found in the `Pre-bundled Hadoop Jar`_ In order to make the Apache Paimon (Incubating) packages visible for the runtime classpath of engines, you need to: diff --git a/docs/connector/hive/iceberg.rst b/docs/connector/hive/iceberg.rst index baefe92dc..64ee88875 100644 --- a/docs/connector/hive/iceberg.rst +++ b/docs/connector/hive/iceberg.rst @@ -44,7 +44,7 @@ Dependencies The **classpath** of kyuubi hive sql engine with Iceberg supported consists of -1. kyuubi-hive-sql-engine-\ |release|\ _2.12.jar, the engine jar deployed with Kyuubi distributions +1. kyuubi-hive-sql-engine-\ |release|\ _2.12.jar, the engine jar deployed with a Kyuubi distribution 2. a copy of hive distribution 3. 
iceberg-hive-runtime-_-.jar (example: iceberg-hive-runtime-3.2_2.12-0.14.0.jar), which can be found in the `Maven Central`_ diff --git a/docs/connector/hive/paimon.rst b/docs/connector/hive/paimon.rst index 000d2d7e8..9058d83f3 100644 --- a/docs/connector/hive/paimon.rst +++ b/docs/connector/hive/paimon.rst @@ -42,7 +42,7 @@ Dependencies The **classpath** of kyuubi hive sql engine with Iceberg supported consists of -1. kyuubi-hive-sql-engine-\ |release|\ _2.12.jar, the engine jar deployed with Kyuubi distributions +1. kyuubi-hive-sql-engine-\ |release|\ _2.12.jar, the engine jar deployed with a Kyuubi distribution 2. a copy of hive distribution 3. paimon-hive-connector--.jar (example: paimon-hive-connector-3.1-0.4-SNAPSHOT.jar), which can be found in the `Apache Paimon (Incubating) Supported Engines Hive`_ diff --git a/docs/connector/spark/delta_lake.rst b/docs/connector/spark/delta_lake.rst index 164036ce1..b30635801 100644 --- a/docs/connector/spark/delta_lake.rst +++ b/docs/connector/spark/delta_lake.rst @@ -16,38 +16,38 @@ `Delta Lake`_ ============= -Delta lake is an open-source project that enables building a Lakehouse +Delta Lake is an open-source project that enables building a Lakehouse Architecture on top of existing storage systems such as S3, ADLS, GCS, and HDFS. .. tip:: This article assumes that you have mastered the basic knowledge and operation of `Delta Lake`_. - For the knowledge about delta lake not mentioned in this article, + For the knowledge about Delta Lake not mentioned in this article, you can obtain it from its `Official Documentation`_. -By using kyuubi, we can run SQL queries towards delta lake which is more +By using Kyuubi, we can run SQL queries towards Delta Lake, which is more convenient, easy to understand, and easy to expand than directly using -spark to manipulate delta lake. +Spark to manipulate Delta Lake. 
Delta Lake Integration ---------------------- -To enable the integration of kyuubi spark sql engine and delta lake through -Apache Spark Datasource V2 and Catalog APIs, you need to: +To enable the integration of Kyuubi Spark SQL engine and Delta Lake through +Spark DataSource V2 API, you need to: -- Referencing the delta lake :ref:`dependencies` -- Setting the spark extension and catalog :ref:`configurations` +- Referencing the Delta Lake :ref:`dependencies` +- Setting the Spark extension and catalog :ref:`configurations` .. _spark-delta-lake-deps: Dependencies ************ -The **classpath** of kyuubi spark sql engine with delta lake supported consists of +The **classpath** of Kyuubi Spark SQL engine with Delta Lake supported consists of -1. kyuubi-spark-sql-engine-\ |release|\ _2.12.jar, the engine jar deployed with kyuubi distributions -2. a copy of spark distribution +1. kyuubi-spark-sql-engine-\ |release|\ _2.12.jar, the engine jar deployed with a Kyuubi distribution +2. a copy of Spark distribution 3. delta-core & delta-storage, which can be found in the `Maven Central`_ In order to make the delta packages visible for the runtime classpath of engines, we can use one of these methods: @@ -63,7 +63,7 @@ In order to make the delta packages visible for the runtime classpath of engines Configurations ************** -To activate functionality of delta lake, we can set the following configurations: +To activate functionality of Delta Lake, we can set the following configurations: .. code-block:: properties diff --git a/docs/connector/spark/hive.rst b/docs/connector/spark/hive.rst index cd682aa9d..077b08e7e 100644 --- a/docs/connector/spark/hive.rst +++ b/docs/connector/spark/hive.rst @@ -16,53 +16,52 @@ `Hive`_ ========== -The Kyuubi Hive Connector is a datasource for both reading and writing Hive table, -It is implemented based on Spark DataSource V2, and supports concatenating multiple Hive metastore at the same time. 
+You may know that Apache Spark has built-in support for accessing Hive tables; it works well in most cases, +but is limited to one Hive Metastore. The Kyuubi Spark Hive connector (KSHC) implements a Hive connector based +on the Spark DataSource V2 API and supports accessing multiple Hive Metastores in a single Spark application. -This connector can be used to federate queries of multiple hives warehouse in a single Spark cluster. +Hive Integration +---------------- -Hive Connector Integration ------------------- - -To enable the integration of kyuubi spark sql engine and Hive connector through -Apache Spark Datasource V2 and Catalog APIs, you need to: +To enable the integration of Kyuubi Spark SQL engine and Hive connector through +Spark DataSource V2 API, you need to: - Referencing the Hive connector :ref:`dependencies` -- Setting the spark extension and catalog :ref:`configurations` +- Setting the Spark catalog :ref:`configurations` .. _kyuubi-hive-deps: Dependencies ************ -The **classpath** of kyuubi spark sql engine with Hive connector supported consists of +The **classpath** of Kyuubi Spark SQL engine with Hive connector supported consists of -1. kyuubi-spark-connector-hive_2.12-\ |release|\ , the hive connector jar deployed with Kyuubi distributions -2. a copy of spark distribution +1. kyuubi-spark-sql-engine-\ |release|\ _2.12.jar, the engine jar deployed with a Kyuubi distribution +2. a copy of Spark distribution +3. kyuubi-spark-connector-hive_2.12-\ |release|\ , which can be found in the `Maven Central`_ In order to make the Hive connector packages visible for the runtime classpath of engines, we can use one of these methods: 1. Put the Kyuubi Hive connector packages into ``$SPARK_HOME/jars`` directly 2. Set ``spark.jars=/path/to/kyuubi-hive-connector`` +.. note:: + Starting from v1.9.2 and v1.10.0, KSHC jars available in the `Maven Central`_ guarantee binary compatibility across + Spark versions, namely Spark 3.3 onwards. + .. 
_kyuubi-hive-conf: Configurations ************** -To activate functionality of Kyuubi Hive connector, we can set the following configurations: +To activate the functionality of the Kyuubi Spark Hive connector, we can set the following configurations: .. code-block:: properties - spark.sql.catalog.hive_catalog org.apache.kyuubi.spark.connector.hive.HiveTableCatalog - spark.sql.catalog.hive_catalog.spark.sql.hive.metastore.version hive-metastore-version - spark.sql.catalog.hive_catalog.hive.metastore.uris thrift://metastore-host:port - spark.sql.catalog.hive_catalog.hive.metastore.port port - spark.sql.catalog.hive_catalog.spark.sql.hive.metastore.jars path - spark.sql.catalog.hive_catalog.spark.sql.hive.metastore.jars.path file:///opt/hive1/lib/*.jar - -.. tip:: - For details about the multi-version Hive configuration, see the related multi-version Hive configurations supported by Apache Spark. + spark.sql.catalog.hive_catalog org.apache.kyuubi.spark.connector.hive.HiveTableCatalog + spark.sql.catalog.hive_catalog.hive.metastore.uris thrift://metastore-host:port + spark.sql.catalog.hive_catalog. + spark.sql.catalog.hive_catalog. Hive Connector Operations ------------------ @@ -106,4 +105,29 @@ Taking ``DROP NAMESPACE`` as a example, DROP NAMESPACE hive_catalog.ns; -.. _Apache Spark: https://spark.apache.org/ \ No newline at end of file +Advanced Usages +*************** + +Though KSHC is a pure Spark DataSource V2 connector which isn't coupled with the Kyuubi deployment, due to the +implementation inside ``spark-sql``, you should not expect KSHC to work properly with ``spark-sql``, and +any issues caused by such a combination won't be considered at this time. Instead, it's recommended +to use BeeLine with Kyuubi as a drop-in replacement for ``spark-sql``, or to switch to ``spark-shell``. + +KSHC supports accessing Kerberized Hive Metastore and HDFS by using a keytab, a TGT cache, or a Delegation Token. 
+It's not expected to work properly with multiple KDC instances; this limitation comes from the JDK ``Krb5LoginModule``. +For such cases, consider setting up cross-realm Kerberos trusts, so that you only need to talk to one KDC. + +For the HMS Thrift API used by Spark, it's known that the Hive 2.3.9 client is compatible with HMS from 2.1 to 4.0, and +the Hive 2.3.10 client is compatible with HMS from 1.1 to 4.0; such version combinations should cover most cases. +For other corner cases, KSHC also supports ``spark.sql.catalog..spark.sql.hive.metastore.jars`` and +``spark.sql.catalog..spark.sql.hive.metastore.version`` as well as the Spark built-in Hive datasource +does; you can refer to the Spark documentation for details. + +Currently, KSHC has not implemented read/write optimizations for Parquet/ORC Hive tables; in other words, it always +uses Hive SerDe to access Hive tables, so there might be a performance gap compared to the Spark built-in Hive +datasource, especially due to the lack of support for vectorized reading. You may also hit bugs caused by Hive SerDe, +e.g. ``ParquetHiveSerDe`` cannot read Parquet files in which decimals are written in the int-based format produced by +the Spark Parquet datasource writer with ``spark.sql.parquet.writeLegacyFormat=false``. + +.. _Apache Spark: https://spark.apache.org/ +.. _Maven Central: https://mvnrepository.com/artifact/org.apache.kyuubi/kyuubi-spark-connector-hive diff --git a/docs/connector/spark/hudi.rst b/docs/connector/spark/hudi.rst index 3ccd1f93b..b46c30e4c 100644 --- a/docs/connector/spark/hudi.rst +++ b/docs/connector/spark/hudi.rst @@ -30,8 +30,8 @@ and easy to expand than directly using Spark to manipulate Hudi. 
Hudi Integration ---------------- -To enable the integration of kyuubi spark sql engine and Hudi through -Catalog APIs, you need to: +To enable the integration of Kyuubi Spark SQL engine and Hudi through +Spark DataSource V2 API, you need to: - Referencing the Hudi :ref:`dependencies` - Setting the Spark extension and catalog :ref:`configurations` @@ -41,10 +41,10 @@ Catalog APIs, you need to: Dependencies ************ -The **classpath** of kyuubi spark sql engine with Hudi supported consists of +The **classpath** of Kyuubi Spark SQL engine with Hudi supported consists of -1. kyuubi-spark-sql-engine-\ |release|\ _2.12.jar, the engine jar deployed with Kyuubi distributions -2. a copy of spark distribution +1. kyuubi-spark-sql-engine-\ |release|\ _2.12.jar, the engine jar deployed with a Kyuubi distribution +2. a copy of Spark distribution 3. hudi-spark-bundle_-.jar (example: hudi-spark3.2-bundle_2.12-0.11.1.jar), which can be found in the `Maven Central`_ In order to make the Hudi packages visible for the runtime classpath of engines, we can use one of these methods: diff --git a/docs/connector/spark/iceberg.rst b/docs/connector/spark/iceberg.rst index 2ce58aa04..dab3802c8 100644 --- a/docs/connector/spark/iceberg.rst +++ b/docs/connector/spark/iceberg.rst @@ -32,21 +32,21 @@ spark to manipulate Iceberg. Iceberg Integration ------------------- -To enable the integration of kyuubi spark sql engine and Iceberg through -Apache Spark Datasource V2 and Catalog APIs, you need to: +To enable the integration of Kyuubi Spark SQL engine and Iceberg through +Spark DataSource V2 API, you need to: - Referencing the Iceberg :ref:`dependencies` -- Setting the spark extension and catalog :ref:`configurations` +- Setting the Spark extension and catalog :ref:`configurations` .. 
_spark-iceberg-deps: Dependencies ************ -The **classpath** of kyuubi spark sql engine with Iceberg supported consists of +The **classpath** of Kyuubi Spark SQL engine with Iceberg supported consists of -1. kyuubi-spark-sql-engine-\ |release|\ _2.12.jar, the engine jar deployed with Kyuubi distributions -2. a copy of spark distribution +1. kyuubi-spark-sql-engine-\ |release|\ _2.12.jar, the engine jar deployed with a Kyuubi distribution +2. a copy of Spark distribution 3. iceberg-spark-runtime-_-.jar (example: iceberg-spark-runtime-3.2_2.12-0.14.0.jar), which can be found in the `Maven Central`_ In order to make the Iceberg packages visible for the runtime classpath of engines, we can use one of these methods: diff --git a/docs/connector/spark/paimon.rst b/docs/connector/spark/paimon.rst index 14e741955..de7efd39e 100644 --- a/docs/connector/spark/paimon.rst +++ b/docs/connector/spark/paimon.rst @@ -30,21 +30,22 @@ spark to manipulate Apache Paimon (Incubating). Apache Paimon (Incubating) Integration ------------------- -To enable the integration of kyuubi spark sql engine and Apache Paimon (Incubating), you need to set the following configurations: +To enable the integration of Kyuubi Spark SQL engine and Apache Paimon (Incubating) through +Spark DataSource V2 API, you need to: - Referencing the Apache Paimon (Incubating) :ref:`dependencies` -- Setting the spark extension and catalog :ref:`configurations` +- Setting the Spark extension and catalog :ref:`configurations` .. _spark-paimon-deps: Dependencies ************ -The **classpath** of kyuubi spark sql engine with Apache Paimon (Incubating) consists of +The **classpath** of Kyuubi Spark SQL engine with Apache Paimon (Incubating) consists of -1. kyuubi-spark-sql-engine-\ |release|\ _2.12.jar, the engine jar deployed with Kyuubi distributions -2. a copy of spark distribution -3. 
paimon-spark-.jar (example: paimon-spark-3.3-0.4-20230323.002035-5.jar), which can be found in the `Apache Paimon (Incubating) Supported Engines Spark3`_ +1. kyuubi-spark-sql-engine-\ |release|\ _2.12.jar, the engine jar deployed with a Kyuubi distribution +2. a copy of Spark distribution +3. paimon-spark-.jar (example: paimon-spark-3.5-0.8.1.jar), which can be found in the `Apache Paimon (Incubating) Supported Engines Spark3`_ In order to make the Apache Paimon (Incubating) packages visible for the runtime classpath of engines, we can use one of these methods: diff --git a/docs/connector/spark/tidb.rst b/docs/connector/spark/tidb.rst index 366f3b2ad..fa73134b5 100644 --- a/docs/connector/spark/tidb.rst +++ b/docs/connector/spark/tidb.rst @@ -35,20 +35,20 @@ spark to manipulate TiDB/TiKV. TiDB Integration ------------------- -To enable the integration of kyuubi spark sql engine and TiDB through -Apache Spark Datasource V2 and Catalog APIs, you need to: +To enable the integration of Kyuubi Spark SQL engine and TiDB through +Spark DataSource V2 API, you need to: - Referencing the TiSpark :ref:`dependencies` -- Setting the spark extension and catalog :ref:`configurations` +- Setting the Spark extension and catalog :ref:`configurations` .. _spark-tidb-deps: Dependencies ************ -The classpath of kyuubi spark sql engine with TiDB supported consists of +The classpath of Kyuubi Spark SQL engine with TiDB supported consists of -1. kyuubi-spark-sql-engine-\ |release|\ _2.12.jar, the engine jar deployed with Kyuubi distributions -2. a copy of spark distribution +1. kyuubi-spark-sql-engine-\ |release|\ _2.12.jar, the engine jar deployed with a Kyuubi distribution +2. a copy of Spark distribution 3. 
tispark-assembly-_-.jar (example: tispark-assembly-3.2_2.12-3.0.1.jar), which can be found in the `Maven Central`_ In order to make the TiSpark packages visible for the runtime classpath of engines, we can use one of these methods: diff --git a/docs/connector/spark/tpcds.rst b/docs/connector/spark/tpcds.rst index 1e02ab4f3..7fef8ee7a 100644 --- a/docs/connector/spark/tpcds.rst +++ b/docs/connector/spark/tpcds.rst @@ -32,21 +32,21 @@ Goto `Try Kyuubi`_ to explore TPC-DS data instantly! TPC-DS Integration ------------------ -To enable the integration of kyuubi spark sql engine and TPC-DS through -Apache Spark Datasource V2 and Catalog APIs, you need to: +To enable the integration of Kyuubi Spark SQL engine and TPC-DS through +Spark DataSource V2 API, you need to: - Referencing the TPC-DS connector :ref:`dependencies` -- Setting the spark catalog :ref:`configurations` +- Setting the Spark catalog :ref:`configurations` .. _spark-tpcds-deps: Dependencies ************ -The **classpath** of kyuubi spark sql engine with TPC-DS supported consists of +The **classpath** of Kyuubi Spark SQL engine with TPC-DS supported consists of -1. kyuubi-spark-sql-engine-\ |release|\ _2.12.jar, the engine jar deployed with Kyuubi distributions -2. a copy of spark distribution +1. kyuubi-spark-sql-engine-\ |release|\ _2.12.jar, the engine jar deployed with a Kyuubi distribution +2. a copy of Spark distribution 3. kyuubi-spark-connector-tpcds-\ |release|\ _2.12.jar, which can be found in the `Maven Central`_ In order to make the TPC-DS connector package visible for the runtime classpath of engines, we can use one of these methods: diff --git a/docs/connector/spark/tpch.rst b/docs/connector/spark/tpch.rst index 72ad8e9b6..100a221f0 100644 --- a/docs/connector/spark/tpch.rst +++ b/docs/connector/spark/tpch.rst @@ -32,21 +32,21 @@ Goto `Try Kyuubi`_ to explore TPC-H data instantly! 
TPC-H Integration ------------------ -To enable the integration of kyuubi spark sql engine and TPC-H through -Apache Spark Datasource V2 and Catalog APIs, you need to: +To enable the integration of Kyuubi Spark SQL engine and TPC-H through +Spark DataSource V2 API, you need to: - Referencing the TPC-H connector :ref:`dependencies` -- Setting the spark catalog :ref:`configurations` +- Setting the Spark catalog :ref:`configurations` .. _spark-tpch-deps: Dependencies ************ -The **classpath** of kyuubi spark sql engine with TPC-H supported consists of +The **classpath** of Kyuubi Spark SQL engine with TPC-H supported consists of -1. kyuubi-spark-sql-engine-\ |release|\ _2.12.jar, the engine jar deployed with Kyuubi distributions -2. a copy of spark distribution +1. kyuubi-spark-sql-engine-\ |release|\ _2.12.jar, the engine jar deployed with a Kyuubi distribution +2. a copy of Spark distribution 3. kyuubi-spark-connector-tpch-\ |release|\ _2.12.jar, which can be found in the `Maven Central`_ In order to make the TPC-H connector package visible for the runtime classpath of engines, we can use one of these methods: diff --git a/docs/connector/trino/paimon.rst b/docs/connector/trino/paimon.rst index 5ac892234..84f736f9e 100644 --- a/docs/connector/trino/paimon.rst +++ b/docs/connector/trino/paimon.rst @@ -42,7 +42,7 @@ Dependencies The **classpath** of kyuubi trino sql engine with Apache Paimon (Incubating) supported consists of -1. kyuubi-trino-sql-engine-\ |release|\ _2.12.jar, the engine jar deployed with Kyuubi distributions +1. kyuubi-trino-sql-engine-\ |release|\ _2.12.jar, the engine jar deployed with a Kyuubi distribution 2. a copy of trino distribution 3. paimon-trino-.jar (example: paimon-trino-0.2.jar), which code can be found in the `Source Code`_ 4. 
flink-shaded-hadoop-2-uber-.jar, which code can be found in the `Pre-bundled Hadoop`_ diff --git a/docs/security/authorization/spark/install.md b/docs/security/authorization/spark/install.md index ff4131c6f..94419ff91 100644 --- a/docs/security/authorization/spark/install.md +++ b/docs/security/authorization/spark/install.md @@ -21,12 +21,12 @@ - [Apache Ranger](https://ranger.apache.org/) - This plugin works as a ranger rest client with Apache Ranger admin server to do privilege check. + This plugin works as a Ranger REST client with the Apache Ranger Admin server to do privilege checks. Thus, a ranger server need to be installed ahead and available to use. - Building(optional) - If your ranger admin or spark distribution is not compatible with the official pre-built [artifact](https://mvnrepository.com/artifact/org.apache.kyuubi/kyuubi-spark-authz) in maven central. + If your Ranger Admin or Spark distribution is not compatible with the official pre-built [artifact](https://mvnrepository.com/artifact/org.apache.kyuubi/kyuubi-spark-authz) in Maven Central. You need to [build](build.md) the plugin targeting the spark/ranger you are using by yourself. ## Install
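
After the AuthZ plugin jar is installed, it is wired into Spark through configuration. As a minimal sketch following the upstream Kyuubi AuthZ docs (verify the extension class name against the plugin version you actually build or download):

```properties
# Register the Kyuubi Spark AuthZ extension so Ranger policies are enforced
# (class name per the Kyuubi AuthZ docs; verify against your plugin version)
spark.sql.extensions=org.apache.kyuubi.plugin.spark.authz.ranger.RangerSparkExtension
```

With this set, the plugin acts as the Ranger REST client described above, fetching policies from the Ranger Admin server to perform privilege checks on Spark SQL plans.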