kyuubi/docs/client/python/pyspark.md
liangbowen 69d625a1be [KYUUBI #4200] [Improvement] [Docs] Introduce Markdown formatting with spotless-maven-plugin and flexmark for docs
### _Why are the changes needed?_

- to consolidate styles in markdown files, whether manually written or auto-generated
- apply markdown formatting rules with flexmark from [spotless-maven-plugin](https://github.com/diffplug/spotless/tree/main/plugin-maven#markdown) to *.md files in `/docs`
- use `flexmark` to format the markdown generated in `TestUtils` of the common module (used by `AllKyuubiConfiguration` and `KyuubiDefinedFunctionSuite`), in the same way as `FlexmarkFormatterFunc` of `spotless-maven-plugin`, with `COMMONMARK` as the `FORMATTER_EMULATION_PROFILE` (https://github.com/diffplug/spotless/blob/maven/2.30.0/lib/src/flexmark/java/com/diffplug/spotless/glue/markdown/FlexmarkFormatterFunc.java)
- use `flexmark` 0.62.2, the last version requiring Java 8+ (checked from the pom file and bytecode version)

```xml
<markdown>
    <includes>
        <include>docs/**/*.md</include>
    </includes>
    <flexmark></flexmark>
</markdown>
```

- Changes applied to markdown doc files:
  - no style changes or breakage in docs built by `make html`
  - remove the leading blank line in licenses and comments to conform to markdown style rules
  - tables regenerated by flexmark following [GitHub Flavored Markdown](https://help.github.com/articles/organizing-information-with-tables/) (https://github.com/vsch/flexmark-java/wiki/Extensions#tables)

### _How was this patch tested?_
- [x] regenerate docs using `make html` successfully and check all the markdown pages available
- [x] regenerate `settings.md` and `functions.md` by `AllKyuubiConfiguration` and `KyuubiDefinedFunctionSuite`, and pass the checks by both themselves and spotless check via `dev/reformat`
- [x] [Run test](https://kyuubi.readthedocs.io/en/master/develop_tools/testing.html#running-tests) locally before making a pull request

Closes #4200 from bowenliang123/markdown-formatting.

Closes #4200

1eeafce4 [liangbowen] revert minor changes in AllKyuubiConfiguration
4f892857 [liangbowen] use flexmark in markdown doc generation
8c978abd [liangbowen] changes on markdown files
a9190556 [liangbowen] apply markdown formatting rules with `spotless-maven-plugin` to markdown files with in `/docs`

Authored-by: liangbowen <liangbowen@gf.com.cn>
Signed-off-by: liangbowen <liangbowen@gf.com.cn>
2023-01-30 11:14:41 +08:00


<!--
- Licensed to the Apache Software Foundation (ASF) under one or more
- contributor license agreements. See the NOTICE file distributed with
- this work for additional information regarding copyright ownership.
- The ASF licenses this file to You under the Apache License, Version 2.0
- (the "License"); you may not use this file except in compliance with
- the License. You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing, software
- distributed under the License is distributed on an "AS IS" BASIS,
- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- See the License for the specific language governing permissions and
- limitations under the License.
-->
# PySpark
[PySpark](https://spark.apache.org/docs/latest/api/python/index.html) is an interface for Apache Spark in Python. Kyuubi can be used as a JDBC source in PySpark.
## Requirements
PySpark works with Python 3.7 and above.
Install PySpark with Spark SQL and optional pandas support on Spark using PyPI as follows:
```shell
pip install pyspark 'pyspark[sql]' 'pyspark[pandas_on_spark]'
```
For installation using Conda or manually downloading, please refer to [PySpark installation](https://spark.apache.org/docs/latest/api/python/getting_started/install.html).
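Since PySpark requires Python 3.7 or above, a quick sanity check of the interpreter version can save a failed install (a minimal sketch; adjust `python3` to whatever interpreter you target):

```shell
# Fail fast if the current interpreter is older than Python 3.7
python3 -c "import sys; assert sys.version_info >= (3, 7), sys.version"
```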
## Preparation
### Prepare JDBC driver
Prepare the JDBC driver jar file. Supported Hive-compatible JDBC drivers are listed below:
| Driver                                               | Driver Class Name                       | Remarks                                                                                                                                                                |
|------------------------------------------------------|-----------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Kyuubi Hive Driver ([doc](../jdbc/kyuubi_jdbc.html)) | org.apache.kyuubi.jdbc.KyuubiHiveDriver | Compile the driver from the master branch, as [KYUUBI #3484](https://github.com/apache/kyuubi/pull/3485), required by the Spark JDBC source, is not yet in a released version. |
| Hive Driver ([doc](../jdbc/hive_jdbc.html))          | org.apache.hive.jdbc.HiveDriver         |                                                                                                                                                                          |
Refer to docs of the driver and prepare the JDBC driver jar file.
### Prepare JDBC Hive Dialect extension
Hive Dialect support is required by Spark for wrapping SQL correctly and sending it to the JDBC driver. Kyuubi provides a JDBC dialect extension with auto-registered Hive Dialect support for Spark. Follow the instructions in [Hive Dialect Support](../../extensions/engines/spark/jdbc-dialect.html) to prepare the plugin jar file `kyuubi-extension-spark-jdbc-dialect_-*.jar`.
### Including jars of JDBC driver and Hive Dialect extension
Choose one of the following ways to include jar files in Spark.
- Put the jar files of the JDBC driver and Hive Dialect extension into the `$SPARK_HOME/jars` directory to make them visible on PySpark's classpath, and add `spark.sql.extensions = org.apache.spark.sql.dialect.KyuubiSparkJdbcDialectExtension` to `$SPARK_HOME/conf/spark-defaults.conf`.
- With Spark's startup scripts, include the JDBC driver when submitting the application with `--packages`, and the Hive Dialect plugin with `--jars`:
```shell
$SPARK_HOME/bin/pyspark --py-files PY_FILES \
  --packages org.apache.hive:hive-jdbc:x.y.z \
  --jars /path/kyuubi-extension-spark-jdbc-dialect_-*.jar
```
- Set jars and config with the SparkSession builder:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.jars", "/path/hive-jdbc-x.y.z.jar,/path/kyuubi-extension-spark-jdbc-dialect_-*.jar") \
    .config("spark.sql.extensions", "org.apache.spark.sql.dialect.KyuubiSparkJdbcDialectExtension") \
    .getOrCreate()
```
## Usage
For further information about PySpark JDBC usage and options, please refer to Spark's [JDBC To Other Databases](https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html).
### Using as JDBC Datasource programmatically
```python
# Loading data from Kyuubi via HiveDriver as a JDBC datasource
jdbcDF = spark.read \
    .format("jdbc") \
    .options(driver="org.apache.hive.jdbc.HiveDriver",
             url="jdbc:hive2://kyuubi_server_ip:port",
             user="user",
             password="password",
             query="select * from testdb.src_table") \
    .load()
```
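For larger tables, Spark's JDBC source can also parallelize the scan by range-partitioning on a numeric column. The sketch below illustrates this with hypothetical connection details and a hypothetical numeric `id` column; the `load()` call is commented out because it needs a live Kyuubi server:

```python
# Hypothetical connection details; replace with your Kyuubi server address.
jdbc_options = {
    "driver": "org.apache.hive.jdbc.HiveDriver",
    "url": "jdbc:hive2://kyuubi_server_ip:port",
    "user": "user",
    "password": "password",
    "dbtable": "testdb.src_table",
    # Split the scan into numPartitions range queries over the id column
    # (partitionColumn requires dbtable, lowerBound, and upperBound).
    "partitionColumn": "id",
    "lowerBound": "0",
    "upperBound": "100000",
    "numPartitions": "4",
}
# jdbcDF = spark.read.format("jdbc").options(**jdbc_options).load()
```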
### Using as JDBC Datasource table with SQL
From Spark 3.2.0, [`CREATE DATASOURCE TABLE`](https://spark.apache.org/docs/latest/sql-ref-syntax-ddl-create-table-datasource.html) is supported for creating a JDBC source with SQL.
```python
# create a JDBC datasource table with DDL
spark.sql("""CREATE TABLE kyuubi_table USING JDBC
OPTIONS (
    driver='org.apache.hive.jdbc.HiveDriver',
    url='jdbc:hive2://kyuubi_server_ip:port',
    user='user',
    password='password',
    dbtable='testdb.some_table'
)""")

# read data into a dataframe
jdbcDF = spark.sql("SELECT * FROM kyuubi_table")

# write data from a dataframe in overwrite mode
df.writeTo("kyuubi_table").overwritePartitions()

# write data from a query
spark.sql("INSERT INTO kyuubi_table SELECT * FROM some_table")
```
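Writing a dataframe back through the JDBC sink follows the same options pattern as reading. This is a sketch with hypothetical table and connection details; the `save()` call is commented out because it needs a live server:

```python
# Hypothetical connection details for illustration only.
write_options = {
    "driver": "org.apache.hive.jdbc.HiveDriver",
    "url": "jdbc:hive2://kyuubi_server_ip:port",
    "user": "user",
    "password": "password",
    "dbtable": "testdb.target_table",  # hypothetical target table
}
# Append the dataframe's rows via the JDBC writer:
# df.write.format("jdbc").options(**write_options).mode("append").save()
```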
### Use PySpark with Pandas
From PySpark 3.2.0, PySpark supports the pandas API on Spark, which allows you to scale out your pandas workload.
Pandas-on-Spark DataFrame and Spark DataFrame are virtually interchangeable. For more instructions, see [From/to pandas and PySpark DataFrames](https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/pandas_pyspark.html#pyspark).
```python
import pyspark.pandas as ps
psdf = ps.range(10)
sdf = psdf.to_spark().filter("id > 5")
sdf.show()
```