
Kyuubi

Kyuubi is an enhanced edition of Apache Spark's original Thrift JDBC/ODBC Server. It is mainly designed for running SQL directly against a cluster in which all components, including HDFS, YARN, Hive MetaStore, and Kyuubi itself, are secured.

Basically, the Thrift JDBC/ODBC Server is to Spark SQL what HiveServer2 is to Apache Hive: an ad-hoc SQL query service that acts as a distributed query engine through its JDBC/ODBC or command-line interface. In this mode, end users or applications can interact with Spark SQL directly to run SQL queries, without writing any code. We can build business reports over massive data using BI tools that support JDBC/ODBC connections, such as Tableau and NetEase YouData. Benefiting from Apache Spark's capabilities, we can achieve much better performance than Apache Hive as a SQL-on-Hadoop service.

Unfortunately, Spark's own architecture limits its use as an enterprise-class product. Compared with HiveServer2, the Thrift JDBC/ODBC Server has a number of problems in areas such as multi-tenant isolation, authentication/authorization, high concurrency, and high availability. In addition, the Apache Spark community's support for this module has long been stagnant.

Kyuubi has enhanced the Thrift JDBC/ODBC Server in some ways for these existing problems, as shown in the following table.

| Feature | Thrift Server | Kyuubi | Comments |
| Multi SparkContext Instances | | | Apache Spark has several issues with running multiple SparkContext instances in a single JVM. The option spark.driver.allowMultipleContexts=true only allows SparkContext to be instantiated multiple times; these instances can only share and use the scheduler and execution environments of the last initialized one, somewhat like a shallow copy of a Java object. The Kyuubi patches provide a way to isolate these components by user to avoid overlapping. |
| Dynamic SparkContext Initialization | | | Kyuubi delays each SparkContext initialization until a particular user's first session is created, while the Thrift JDBC/ODBC Server creates only one, when it starts. |
| Dynamic SparkContext Recycling | | | In the Thrift JDBC/ODBC Server, the SparkContext is a resident variable. Kyuubi caches SparkContext instances for a while after their sessions close, and then the server terminates them. |
| Dynamic Yarn Queue | | | spark.yarn.queue specifies the queue that Spark on Yarn applications run in. Once the Thrift JDBC/ODBC Server has started, the queue cannot be changed, while HiveServer2 can switch queues with set mapred.job.queue.name=thequeue. Kyuubi adopts a compromise: it identifies and uses spark.yarn.queue in the connection string. |
| Dynamic Configuring | only spark.sql.* | | Kyuubi supports setting all Spark/Hive/Hadoop configurations, such as spark.executor.cores/memory, in the connection string, which is then used to initialize the SparkContext. |
| Authorization | | | Spark Authorizer will be added to Kyuubi soon. |
| Impersonation | --proxy-user singleuser | | Kyuubi fully supports hive.server2.proxy.user and hive.server2.doAs. |
| Multi Tenancy | | | Based on the above features, Kyuubi is able to run as a multi-tenant server on an LCE-supported Yarn cluster. |
| SQL Operation Log | | | Kyuubi redirects the SQL operation log to a local file, which the client can fetch through an interface. |
| High Availability | | | Based on ZooKeeper. |
| Cluster Deploy Mode | | | Yarn cluster mode will be supported soon. |
| Type Mapping | | | Kyuubi supports converting Spark results/schemas directly to Thrift results/schemas, bypassing Hive-format results. |
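
The table mentions passing spark.yarn.queue, other Spark settings, and hive.server2.proxy.user in the connection string. The sketch below shows one way such a string might be assembled. The layout is an assumption based on the HiveServer2 JDBC URL convention (session variables after `;`, a further variable list after `#`); the exact section Kyuubi reads Spark settings from should be checked against the Configuration Guide. The helper name `kyuubi_jdbc_url`, the host, the user, and the queue are all hypothetical.

```shell
#!/bin/sh
# Build a HiveServer2-style JDBC URL that carries per-connection settings.
# ASSUMPTION: Spark settings go after '#', separated by ';'. Verify this
# layout against the Kyuubi Configuration Guide before relying on it.
kyuubi_jdbc_url() {
  host="$1"; port="$2"; proxy_user="$3"
  shift 3
  url="jdbc:hive2://${host}:${port}/;hive.server2.proxy.user=${proxy_user}"
  sep="#"                      # first setting starts the variable list
  for conf in "$@"; do
    url="${url}${sep}${conf}"
    sep=";"                    # later settings are ';'-separated
  done
  printf '%s\n' "$url"
}

# Hypothetical host, user, and queue names.
kyuubi_jdbc_url kyuubi.example.com 10009 alice \
  spark.yarn.queue=thequeue spark.executor.cores=2
```

With the arguments shown, the function prints `jdbc:hive2://kyuubi.example.com:10009/;hive.server2.proxy.user=alice#spark.yarn.queue=thequeue;spark.executor.cores=2`.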

Getting Started

Packaging

Please refer to the Building Kyuubi in the online documentation for an overview on how to build Kyuubi.

Start Kyuubi

1. As a normal Spark application

For testing, you can run Kyuubi Server as a normal Spark application.

$ $SPARK_HOME/bin/spark-submit \
    --class yaooqinn.kyuubi.server.KyuubiServer \
    --master yarn \
    --deploy-mode client \
    --driver-memory 10g \
    --conf spark.kyuubi.frontend.bind.port=10009 \
    $KYUUBI_HOME/target/kyuubi-1.0.0-SNAPSHOT.jar

2. As a long-running service

Use nohup and & to run Kyuubi as a long-running service.

$ nohup $SPARK_HOME/bin/spark-submit \
    --class yaooqinn.kyuubi.server.KyuubiServer \
    --master yarn \
    --deploy-mode client \
    --driver-memory 10g \
    --conf spark.kyuubi.frontend.bind.port=10009 \
    $KYUUBI_HOME/target/kyuubi-1.0.0-SNAPSHOT.jar &

3. With built-in startup script

The recommended way is to use the built-in startup script bin/start-kyuubi.sh. First, export SPARK_HOME in $KYUUBI_HOME/bin/kyuubi-env.sh:

export SPARK_HOME=/the/path/to/a/runnable/spark/binary/dir

Then start Kyuubi with bin/start-kyuubi.sh:

$ bin/start-kyuubi.sh \
    --master yarn \
    --deploy-mode client \
    --driver-memory 10g \
    --conf spark.kyuubi.frontend.bind.port=10009
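
Since the startup script depends on SPARK_HOME, it can help to confirm that the variable points at a usable Spark distribution before starting Kyuubi. The following is a minimal sketch, not part of the Kyuubi scripts; the function name `check_spark_home` is hypothetical, and the only check it makes is that bin/spark-submit exists and is executable.

```shell
#!/bin/sh
# Return 0 if the given directory looks like a Spark distribution,
# i.e. it contains an executable bin/spark-submit; otherwise return 1.
# (Hypothetical helper; Kyuubi's own scripts may validate differently.)
check_spark_home() {
  dir="$1"
  if [ -z "$dir" ]; then
    echo "SPARK_HOME is not set" >&2
    return 1
  fi
  if [ ! -x "$dir/bin/spark-submit" ]; then
    echo "no executable bin/spark-submit under $dir" >&2
    return 1
  fi
  return 0
}

# Check the current environment and report the result.
if check_spark_home "${SPARK_HOME:-}"; then
  echo "SPARK_HOME looks usable: $SPARK_HOME"
fi
```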

Run Spark SQL on Kyuubi

Now you can use beeline, Tableau or Thrift API based programs to connect to Kyuubi server.
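
As an illustration, a connection with beeline might look like the sketch below. It assumes the server listens on port 10009, as set by spark.kyuubi.frontend.bind.port in the start commands above; the default host is a placeholder, and the environment variable names KYUUBI_HOST and KYUUBI_PORT are introduced here only for the example. On a secured cluster the URL would likely need additional settings, as described in the Authentication/Security Guide.

```shell
#!/bin/sh
# Placeholder host; port 10009 matches the start commands in this README.
KYUUBI_HOST="${KYUUBI_HOST:-localhost}"
KYUUBI_PORT="${KYUUBI_PORT:-10009}"
KYUUBI_URL="jdbc:hive2://${KYUUBI_HOST}:${KYUUBI_PORT}/"

if command -v beeline >/dev/null 2>&1; then
  # -u: JDBC URL, -n: user name, -e: run one statement and exit
  beeline -u "$KYUUBI_URL" -n "$(id -un)" -e 'SHOW DATABASES;'
else
  # Dry run when beeline is not installed on this machine.
  echo "beeline not found on PATH; would connect to ${KYUUBI_URL}"
fi
```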

Stop Kyuubi

bin/stop-kyuubi.sh

Note: Without the patches we supply, Kyuubi is mostly the same as the Thrift JDBC/ODBC Server, that is, a non-multi-tenant server.

Multi Tenancy Support

Prerequisites

Kyuubi may work with different deployments such as non-secured Yarn, Standalone, Mesos, or even local mode, but it is mainly designed for a secured HDFS/Yarn cluster, where its multi-tenancy and security features work best.

Suppose that you already have a secured HDFS cluster for deploying Spark, Hive or other applications.

Configure Yarn

Spark on Yarn

Configure Hive

  • Configuration of Hive is done by placing your hive-site.xml, core-site.xml and hdfs-site.xml files in $SPARK_HOME/conf.
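
The step above can be scripted roughly as follows. This is a sketch under the assumption that the three files have already been collected into one source directory; the function name `install_hive_conf` and the example path are hypothetical.

```shell
#!/bin/sh
# Copy hive-site.xml, core-site.xml and hdfs-site.xml from a source
# directory into a Spark conf directory. Stops at the first missing file
# so that an incomplete setup is noticed early.
install_hive_conf() {
  src_dir="$1"; spark_conf_dir="$2"
  for f in hive-site.xml core-site.xml hdfs-site.xml; do
    if [ ! -f "$src_dir/$f" ]; then
      echo "missing $src_dir/$f" >&2
      return 1
    fi
    cp "$src_dir/$f" "$spark_conf_dir/$f" || return 1
  done
}

# Example call (hypothetical source directory):
# install_hive_conf /path/to/cluster-conf "$SPARK_HOME/conf"
```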

Patch Spark

Configuration

Please refer to the Configuration Guide in the online documentation for an overview on how to configure Kyuubi.

Authentication

Please refer to the Authentication/Security Guide in the online documentation for an overview on how to enable security for Kyuubi.

Additional Documentation

Kyuubi Architecture