readme en / configuration doc kyuubi part

This commit is contained in:
Kent Yao 2018-03-06 23:52:29 +08:00
parent d5bf707015
commit 3c74836463
4 changed files with 114 additions and 47 deletions

README.md

@ -1,43 +1,48 @@
# Kyuubi [![Build Status](https://travis-ci.org/yaooqinn/kyuubi.svg?branch=master)](https://travis-ci.org/yaooqinn/kyuubi)
**Kyuubi** is an enhanced edition of [Apache Spark](http://spark.apache.org)'s primordial
[Thrift JDBC/ODBC Server](http://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbcodbc-server).
The **Thrift JDBC/ODBC Server** is Spark SQL's counterpart of [Apache Hive](https://hive.apache.org)'s [HiveServer2](https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Overview), acting as a distributed query engine with a JDBC/ODBC or command-line interface. In this mode, end users or applications can interact with Spark SQL directly to run SQL queries, without writing any code. They can build business reports over massive data with BI tools that support JDBC/ODBC connections, such as [Tableau](https://www.tableau.com), [NetEase YouData](https://youdata.163.com) and so on. Benefiting from Apache Spark's capability, they can achieve far better performance than Apache Hive as a SQL on Hadoop service.
Unfortunately, due to the limitations of Spark's own architecture, using it as an enterprise-class product raises a number of problems compared with HiveServer2, such as multi-tenant isolation, authentication/authorization, high concurrency, and high availability. The Apache Spark community's support for this module has also been in a state of prolonged stagnation.
**Kyuubi** enhances the Thrift JDBC/ODBC Server in several ways to address these problems, as shown in the following table:
|---|**Thrift JDBC/ODBC Server**|**Kyuubi**|Comments|
|:---:|:---:|:---:|---|
|Multi SparkContext Instances|✘|✔|Apache Spark has long had issues with multiple SparkContext instances in a single JVM; see [here](https://www.jianshu.com/p/e1cfcaece8f1). Setting `spark.driver.allowMultipleContexts=true` only allows SparkContext to be instantiated several times, but the instances share the scheduler and execution environment of the last initialized one, somewhat like a shallow copy of a Java object. The patches shipped with Kyuubi provide a way of isolating the scheduler and execution environment per user.|
|Dynamic SparkContext Initialization|✘|✔|Kyuubi delays SparkContext initialization to the phase of user session creation, while the Thrift JDBC/ODBC Server creates one only at startup.|
|Dynamic SparkContext Recycling|✘|✔|In the Thrift JDBC/ODBC Server, SparkContext is a resident variable; only the SparkSession is recycled when a user disconnects. Since Kyuubi creates SparkContext instances dynamically, it also recycles them: an instance is kept in the cache for a while after its sessions terminate and is then released.|
|Dynamic Yarn Queue|✘|✔|`spark.yarn.queue` specifies the queue that a Spark on Yarn application runs in. Once the Thrift JDBC/ODBC Server has started, the queue cannot be changed, while HiveServer2 can switch queues with `set mapred.job.queue.name=thequeue`. Kyuubi adopts a compromise: it recognizes `spark.yarn.queue` in the connection string.|
|Dynamic Configuration|only `spark.sql.*`|✔|Kyuubi supports all Spark/Hive/Hadoop configurations, such as `spark.executor.cores/memory`, being set in the connection string and used to initialize the SparkContext.|
|Authorization|✘|✘|[Spark Authorizer](https://github.com/yaooqinn/spark-authorizer) will be added to Kyuubi soon.|
|Impersonation|only a single user via `--proxy-user`|✔|Kyuubi fully supports `hive.server2.proxy.user` and `hive.server2.doAs`.|
|Multi Tenancy|✘|✔|Based on the features above, Kyuubi can run as a multi-tenant server on a Yarn cluster with the LinuxContainerExecutor (LCE) enabled.|
|SQL Operation Log|✘|✔|Kyuubi redirects SQL operation logs to local files and provides an interface for clients to fetch them.|
|High Availability|✘|✔|Based on ZooKeeper.|
|Cluster Deploy Mode|✘|✘|Yarn cluster mode will be supported soon.|
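The first three rows boil down to a per-user cache of lazily created, idle-recycled SparkContext instances. The toy Python sketch below shows the general shape only; Kyuubi itself is written in Scala, and `ContextCache` and every name here are invented for illustration:

```python
import time

class ContextCache:
    """Toy per-user cache: lazily create one 'SparkContext' per user and
    recycle instances that have been idle longer than `ttl` seconds."""

    def __init__(self, ttl):
        self.ttl = ttl
        self._cache = {}  # user -> (context, last_used_timestamp)

    def get(self, user):
        ctx, _ = self._cache.get(user, (None, None))
        if ctx is None:
            ctx = f"SparkContext[{user}]"  # stand-in for real instantiation
        self._cache[user] = (ctx, time.time())
        return ctx

    def evict_idle(self, now=None):
        now = time.time() if now is None else now
        for user, (_, last) in list(self._cache.items()):
            if now - last > self.ttl:
                del self._cache[user]  # real code would also stop the context

cache = ContextCache(ttl=20 * 60)   # cf. the 20min session clean interval
a = cache.get("alice")
b = cache.get("alice")              # cached, not re-created
c = cache.get("bob")                # isolated per-user instance
```

Each user gets an isolated entry, creation happens on first use, and eviction models the dynamic recycling row above.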
## Getting Started
#### Packaging
The **Kyuubi** server is built with Maven:
```bash
build/mvn clean package
```
Running the command above in the Kyuubi project directory is all you need to build a runnable Kyuubi server.
#### Start Kyuubi
###### 1. As a normal Spark application
For testing purposes, you can run the Kyuubi server as a normal Spark application:
```bash
$ $SPARK_HOME/bin/spark-submit \
--class yaooqinn.kyuubi.server.KyuubiServer \
@ -48,14 +53,30 @@ $ $SPARK_HOME/bin/spark-submit \
$KYUUBI_HOME/target/kyuubi-1.0.0-SNAPSHOT.jar
```
###### 2. As a long running service
Using `nohup` and `&` lets you run Kyuubi as a long-running service:
```bash
$ nohup $SPARK_HOME/bin/spark-submit \
--class yaooqinn.kyuubi.server.KyuubiServer \
--master yarn \
--deploy-mode client \
--driver-memory 10g \
--conf spark.hadoop.hive.server2.thrift.port=10009 \
$KYUUBI_HOME/target/kyuubi-1.0.0-SNAPSHOT.jar &
```
###### 3. With built-in startup script
The recommended way, however, is to use the built-in startup script `bin/start-kyuubi.sh`.
First of all, export `SPARK_HOME` in `$KYUUBI_HOME/bin/kyuubi-env.sh`:
```bash
export SPARK_HOME=/the/path/to/a/runnable/spark/binary/dir
```
Then start Kyuubi with `bin/start-kyuubi.sh`:
```bash
$ bin/start-kyuubi.sh \
--master yarn \
@ -64,21 +85,28 @@ $ bin/start-kyuubi.sh \
--conf spark.hadoop.hive.server2.thrift.port=10009 \
```
#### Run Spark SQL on Kyuubi
Now you can use [beeline](https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients), [Tableau](https://www.tableau.com/zh-cn) or other Thrift API based programs to connect to the Kyuubi server.
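As the comparison table notes, Spark settings such as the Yarn queue and executor memory can ride along in the connection string. The Python sketch below only illustrates how such a HiveServer2-style URL might be assembled; `kyuubi-host`, the port, and the exact URL grammar are assumptions that may differ across versions:

```python
def kyuubi_jdbc_url(host, port, db, spark_confs):
    # HiveServer2-style URL with a conf list appended after '?';
    # spark.* entries would be picked up when the user's SparkContext is built.
    conf_part = ";".join(f"{k}={v}" for k, v in sorted(spark_confs.items()))
    return f"jdbc:hive2://{host}:{port}/{db}?{conf_part}"

url = kyuubi_jdbc_url("kyuubi-host", 10009, "default", {
    "spark.yarn.queue": "thequeue",    # dynamic Yarn queue
    "spark.executor.memory": "4g",     # dynamic resource setting
})
```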
#### Stop Kyuubi
```bash
bin/stop-kyuubi.sh
```
**Note:** Without the patches we supply, Kyuubi is essentially the same as the Thrift JDBC/ODBC Server and does not provide multi-tenancy.
## Multi Tenancy Support
#### Prerequisites
- [Spark On Yarn](http://spark.apache.org/docs/latest/running-on-yarn.html)
+ Setup Spark On Yarn
+ [LinuxContainerExecutor](https://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yarn-site/SecureContainer.html)
+ Yarn queues for different users (optional)
- [Thrift JDBC/ODBC Server](http://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbcodbc-server) configurations
+ Configuration of Hive is done by placing your `hive-site.xml`, `core-site.xml` and `hdfs-site.xml` files in `$SPARK_HOME/conf/`.
- Apply the Kyuubi patches to Spark


@ -14,7 +14,7 @@
|:---:|:---:|:---:|---|
|Multi SparkContext Instances|✘|✔|Apache Spark has long had issues with multiple SparkContext instances in a single JVM; see [here](https://www.jianshu.com/p/e1cfcaece8f1). Setting `spark.driver.allowMultipleContexts=true` only allows SparkContext to be instantiated several times, but the instances share the scheduler and execution environment of the last initialized one, somewhat like a shallow copy of a Java object. The patches shipped with Kyuubi provide a way of isolating the scheduler and execution environment per user.|
|Dynamic SparkContext Initialization|✘|✔|Kyuubi delays SparkContext initialization to the phase of user session creation, while the Thrift JDBC/ODBC Server creates one only at startup.|
|Dynamic SparkContext Recycling|✘|✔|In the Thrift JDBC/ODBC Server, SparkContext is a resident variable; only the SparkSession is recycled when a user disconnects. Since Kyuubi creates SparkContext instances dynamically, it also recycles them accordingly.|
|Dynamic Yarn Queue|✘|✔|`spark.yarn.queue` specifies the queue that a Spark on Yarn application runs in. Once the Thrift JDBC/ODBC Server has started, the queue cannot be changed, while HiveServer2 can switch queues with `set mapred.job.queue.name=thequeue`. Kyuubi adopts a compromise: it recognizes `spark.yarn.queue` in the connection string.|
|Dynamic Configuration|only `spark.sql.*`|✔|Kyuubi supports configurations such as `spark.executor.cores/memory` being set in the connection string to size the resources the corresponding SparkContext can schedule.|
|Authorization|✘|✘|[Spark Authorizer](https://github.com/yaooqinn/spark-authorizer) will be added to Kyuubi soon.|


@ -1,10 +1,45 @@
# Configurations
## Kyuubi Configurations
#### High Availability
Name|Default|Description
---|---|---
spark.kyuubi.session.clean.interval | 20min | The checking interval for the Kyuubi server to clean up idle SparkSession instances.
spark.kyuubi.ha.enabled|false|Whether KyuubiServer supports dynamic service discovery for its clients. To support this, each instance of KyuubiServer currently uses ZooKeeper to register itself, when it is brought up. JDBC/ODBC clients should use the ZooKeeper ensemble: spark.kyuubi.ha.zk.quorum in their connection string.
spark.kyuubi.ha.zk.quorum|none|Comma separated list of ZooKeeper servers to talk to, when KyuubiServer supports service discovery via Zookeeper.
spark.kyuubi.ha.zk.namespace|kyuubiserver|The parent node in ZooKeeper used by KyuubiServer when supporting dynamic service discovery.
spark.kyuubi.ha.zk.client.port|2181|The port of ZooKeeper servers to talk to. If the list of Zookeeper servers specified in spark.kyuubi.zookeeper.quorum does not contain port numbers, this value is used.
spark.kyuubi.ha.zk.session.timeout|1,200,000|ZooKeeper client's session timeout (in milliseconds). The client is disconnected, and as a result, all locks released, if a heartbeat is not sent in the timeout.
spark.kyuubi.ha.zk.connection.basesleeptime|1,000|Initial amount of time (in milliseconds) to wait between retries when connecting to the ZooKeeper server when using ExponentialBackoffRetry policy.
spark.kyuubi.ha.zk.connection.max.retries|3|Max number of retries for connecting to the ZooKeeper server.
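The last two settings map onto an exponential backoff retry policy (Curator's ExponentialBackoffRetry). The sketch below is deterministic; the real policy randomizes each sleep, so treat this only as an illustration of how the base sleep time and retry count combine:

```python
def backoff_schedule(base_sleep_ms, max_retries):
    # Deterministic sketch: the sleep doubles per retry, one entry per attempt.
    return [base_sleep_ms * (2 ** retry) for retry in range(max_retries)]

# with the defaults above (1,000 ms base, 3 retries): [1000, 2000, 4000]
schedule = backoff_schedule(base_sleep_ms=1000, max_retries=3)
```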
#### Operation Log
Name|Default|Description
---|---|---
spark.kyuubi.logging.operation.enabled|true|When true, KyuubiServer will save operation logs and make them available for clients
spark.kyuubi.logging.operation.log.dir|`SPARK_LOG_DIR` -> `SPARK_HOME`/operation_logs -> `java.io.tmpdir`/operation_logs|Top level directory where operation logs are stored if logging functionality is enabled
#### Background Execution Thread Pool
Name|Default|Description
---|---|---
spark.kyuubi.async.exec.threads|100|Number of threads in the async thread pool for KyuubiServer.
spark.kyuubi.async.exec.wait.queue.size|100|Size of the wait queue for async thread pool in KyuubiServer. After hitting this limit, the async thread pool will reject new requests.
spark.kyuubi.async.exec.keep.alive.time|10,000|Time (in milliseconds) that an idle KyuubiServer async thread (from the thread pool) will wait for a new task to arrive before terminating.
spark.kyuubi.async.exec.shutdown.timeout|10,000|Time (in milliseconds) that KyuubiServer shutdown will wait for async threads to terminate.
#### Session Idle Check
Name|Default|Description
---|---|---
spark.kyuubi.frontend.session.check.interval|6h|The check interval for frontend session/operation timeout; it can be disabled by setting it to zero or a negative value.
spark.kyuubi.frontend.session.timeout|8h|The timeout for frontend sessions/operations; it can be disabled by setting it to zero or a negative value.
spark.kyuubi.frontend.session.check.operation|true|A session is considered idle only if there is no activity and no pending operation. This setting takes effect only if the session idle timeout (`spark.kyuubi.frontend.session.timeout`) and the check interval (`spark.kyuubi.frontend.session.check.interval`) are enabled.
spark.kyuubi.backend.session.check.interval|20min|The check interval for backend session a.k.a SparkSession timeout.
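Read together, the frontend settings above say: at every check interval, close a session that has been idle longer than the timeout, unless a pending operation keeps it alive. A minimal Python sketch of that predicate (all names invented; the real check lives in the Scala server):

```python
def session_expired(now, last_access, timeout, pending_ops, check_operation):
    """Illustrative frontend idle check, mirroring the table above."""
    if timeout <= 0:                     # zero/negative timeout disables it
        return False
    if check_operation and pending_ops > 0:
        return False                     # a pending operation keeps it alive
    return now - last_access > timeout
```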
#### On Spark Session Init
Name|Default|Description
---|---|---
spark.kyuubi.backend.session.wait.other.times|60|How many times to check whether another session of the same user is still initializing a SparkContext. The total wait time is this value multiplied by `spark.kyuubi.backend.session.wait.other.interval`.
spark.kyuubi.backend.session.wait.other.interval|1s|The interval for checking whether another thread of the same user has completed SparkContext instantiation.
spark.kyuubi.backend.session.init.timeout|60s|How long the server waits before giving up on instantiating a SparkContext.
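These three settings describe a polling loop: check up to `wait.other.times` times, sleeping `wait.other.interval` between checks, while another session of the same user is still instantiating a SparkContext. A simplified Python sketch, with the predicate injected rather than coordinating real threads:

```python
import time

def wait_for_other(init_done, max_times, interval_s, sleep=time.sleep):
    """Poll init_done() up to max_times times, sleeping interval_s seconds
    between checks; True means the other session finished in time."""
    for _ in range(max_times):
        if init_done():
            return True
        sleep(interval_s)
    return init_done()  # one last look before giving up

# with the defaults: up to 60 checks, 1s apart, i.e. roughly 60s in total
```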
---
@ -14,5 +49,8 @@ Name|Default|Description
---|---|---
hive.server2.logging.operation.enabled | true | When true, Kyuubi Server will save operation logs and make them available for clients
## Spark Configurations
## Hadoop Configurations


@ -73,7 +73,7 @@ object KyuubiConf {
val HA_ZOOKEEPER_CLIENT_PORT: ConfigEntry[String] =
KyuubiConfigBuilder("spark.kyuubi.ha.zk.client.port")
.doc("The port of ZooKeeper servers to talk to. If the list of Zookeeper servers specified" +
" in spark.kyuubi.zookeeper.quorum does not contain port numbers, this value is used")
" in spark.kyuubi.zookeeper.quorum does not contain port numbers, this value is used.")
.stringConf
.createWithDefault("2181")
@ -93,7 +93,7 @@ object KyuubiConf {
val HA_ZOOKEEPER_CONNECTION_MAX_RETRIES: ConfigEntry[Int] =
KyuubiConfigBuilder("spark.kyuubi.ha.zk.connection.max.retries")
.doc("max retry time connecting to the zk server")
.doc("Max retry times for connecting to the zk server")
.intConf
.createWithDefault(3)
@ -136,7 +136,7 @@ object KyuubiConf {
val EXEC_KEEPALIVE_TIME: ConfigEntry[Long] =
KyuubiConfigBuilder("spark.kyuubi.async.exec.keep.alive.time")
.doc("Time that an idle KyuubiServer async thread (from the thread pool) will wait for" +
.doc("Time (in milliseconds) that an idle KyuubiServer async thread (from the thread pool) will wait for" +
" a new task to arrive before terminating")
.timeConf(TimeUnit.MILLISECONDS)
.createWithDefault(TimeUnit.SECONDS.toMillis(10L))
@ -170,8 +170,8 @@ object KyuubiConf {
KyuubiConfigBuilder("spark.kyuubi.frontend.session.check.operation")
.doc("Session will be considered to be idle only if there is no activity, and there is no" +
" pending operation. This setting takes effect only if session idle timeout" +
" (spark.kyuubi.idle.session.timeout) and checking (spark.kyuubi.session.check.interval)" +
" are enabled.")
" (spark.kyuubi.frontend.session.timeout) and checking" +
" (spark.kyuubi.frontend.session.check.interval) are enabled.")
.booleanConf
.createWithDefault(true)
@ -195,13 +195,14 @@ object KyuubiConf {
val BACKEND_SESSION_WAIT_OTHER_INTERVAL: ConfigEntry[Long] =
KyuubiConfigBuilder("spark.kyuubi.backend.session.wait.other.interval")
.doc("")
.doc("The interval for checking whether other thread with the same user has completed" +
" SparkContext instantiation.")
.timeConf(TimeUnit.MILLISECONDS)
.createWithDefault(TimeUnit.SECONDS.toMillis(1L))
val BACKEND_SESSTION_INIT_TIMEOUT =
KyuubiConfigBuilder("spark.kyuubi.backend.session.init.timeout")
.doc("")
.doc("How long we suggest the server to give up instantiating SparkContext")
.timeConf(TimeUnit.SECONDS)
.createWithDefault(TimeUnit.SECONDS.toSeconds(60L))