diff --git a/docs/connector/spark/iceberg.rst b/docs/connector/spark/iceberg.rst index c23cd55ee..02959c15d 100644 --- a/docs/connector/spark/iceberg.rst +++ b/docs/connector/spark/iceberg.rst @@ -16,22 +16,109 @@ `Iceberg`_ ========== +Apache Iceberg is an open table format for huge analytic datasets. +Iceberg adds tables to compute engines including Spark, Trino, PrestoDB, Flink, Hive and Impala +using a high-performance table format that works just like a SQL table. + +.. tip:: + This article assumes that you have mastered the basic knowledge and operation of `Iceberg`_. + For the knowledge about Iceberg not mentioned in this article, + you can obtain it from its `Official Documentation`_. + +By using kyuubi, we can run SQL queries towards Iceberg which is more +convenient, easy to understand, and easy to expand than directly using +spark to manipulate Iceberg. Iceberg Integration ------------------- +To enable the integration of kyuubi spark sql engine and Iceberg through +Apache Spark Datasource V2 and Catalog APIs, you need to: + +- Referencing the Iceberg :ref:`dependencies` +- Setting the spark extension and catalog :ref:`configurations` + .. _dependencies: Dependencies ************ +The **classpath** of kyuubi spark sql engine with Iceberg supported consists of + +1. kyuubi-spark-sql-engine-|release|.jar, the engine jar deployed with Kyuubi distributions +2. a copy of spark distribution +3. iceberg-spark-runtime-_-.jar (example: iceberg-spark-runtime-3.2_2.12-0.14.0.jar), which can be found in the `Maven Central`_ + +In order to make the Iceberg packages visible for the runtime classpath of engines, we can use one of these methods: + +1. Put the Iceberg packages into ``$SPARK_HOME/jars`` directly +2. Set ``spark.jars=/path/to/iceberg-spark-runtime`` + +.. warning:: + Please mind the compatibility of different Iceberg and Spark versions, which can be confirmed on the page of `Iceberg multi engine support`_. + .. _configurations: Configurations ************** +To activate functionality of Iceberg, we can set the following configurations: + +.. code-block:: properties + + spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkCatalog + spark.sql.catalog.spark_catalog.type=hive + spark.sql.catalog.spark_catalog.uri=thrift://metastore-host:port + spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions Iceberg Operations ------------------ -.. _Iceberg: https://iceberg.apache.org/ \ No newline at end of file +Taking ``CREATE TABLE`` as a example, + +.. code-block:: sql + + CREATE TABLE foo ( + id bigint COMMENT 'unique id', + data string) + USING iceberg; + +Taking ``SELECT`` as a example, + +.. code-block:: sql + + SELECT * FROM foo; + +Taking ``INSERT`` as a example, + +.. code-block:: sql + + INSERT INTO foo VALUES (1, 'a'), (2, 'b'), (3, 'c'); + +Taking ``UPDATE`` as a example, Spark 3.1 added support for UPDATE queries that update matching rows in tables. + +.. code-block:: sql + + UPDATE foo SET data = 'd', id = 4 WHERE id >= 3 and id < 4; + +Taking ``DELETE FROM`` as a example, Spark 3 added support for DELETE FROM queries to remove data from tables. + +.. code-block:: sql + + DELETE FROM foo WHERE id >= 1 and id < 2; + +Taking ``MERGE INTO`` as a example, + +.. code-block:: sql + + MERGE INTO target_table t + USING source_table s + ON t.id = s.id + WHEN MATCHED AND s.opType = 'delete' THEN DELETE + WHEN MATCHED AND s.opType = 'update' THEN UPDATE SET id = s.id, data = s.data + WHEN NOT MATCHED AND s.opType = 'insert' THEN INSERT (id, data) VALUES (s.id, s.data); + +.. _Iceberg: https://iceberg.apache.org/ +.. _Official Documentation: https://iceberg.apache.org/docs/latest/ +.. _Maven Central: https://mvnrepository.com/artifact/org.apache.iceberg +.. _Iceberg multi engine support: https://iceberg.apache.org/multi-engine-support/