Kyuubi Architecture documentation

This commit is contained in:
Kent Yao 2020-12-29 16:13:07 +08:00 committed by Kent Yao
parent 3eb47c7546
commit ad9c0562ad
7 changed files with 203 additions and 126 deletions


@ -14,6 +14,27 @@ The character is named `Kyuubi Kitsune/Kurama`, which is a nine-tailed fox in my
`Kyuubi` spread the power and spirit of fire, which is used here to represent the powerful [Apache Spark](http://spark.apache.org).
Its nine tails stand for the end-to-end multi-tenancy support of this project.
Kyuubi is a high-performance universal JDBC and SQL execution engine. The goal of Kyuubi is to facilitate users in handling big data like ordinary data.
It provides a standardized JDBC interface with easy-to-use data access in big data scenarios.
End-users can focus on developing their own business systems and mining data value without having to be aware of the underlying big data platform (compute engines, storage services, metadata management, etc.).
Kyuubi relies on Apache Spark to provide high-performance data query capabilities,
and every improvement in the engine's capabilities can help Kyuubi's performance make a qualitative leap.
In addition, Kyuubi improves ad-hoc responsiveness through engine caching,
and enhances concurrency through horizontal scaling and load balancing.
It provides complete authentication and authorization services to ensure data and metadata security.
It provides robust high availability and load balancing to help you guarantee the SLA commitment.
It provides a two-level elastic resource management architecture to effectively improve resource utilization while covering the performance and response requirements of all scenarios,
from interactive and batch processing to point queries and full table scans.
It embraces Spark and builds an ecosystem on top of it,
which allows Kyuubi to quickly expand its existing ecosystem and introduce new features,
such as cloud-native support and `Data Lake/Lake House` support.
Kyuubi's vision is to build on top of Apache Spark and Data Lake technologies to unify the portal and become an ideal data lake management platform.
It can support data processing, e.g. ETL, and analytics, e.g. BI, in a pure SQL way.
All workloads can be done on one platform, using one copy of data, with one SQL interface.
Ready? [Getting Started](https://kyuubi.readthedocs.io/en/latest/quick_start/quick_start.html) with Kyuubi.
## Contributing


@ -1,125 +0,0 @@
# Kyuubi Architecture Introduction
- [Unified Interface](#1.1)
- [Runtime Resource Resiliency](#1.2)
- [Kyuubi Dynamic Resource Requesting](#1.2.1)
- [Kyuubi Dynamic SparkContext Cache](#1.2.2)
- [Spark Dynamic Resource Allocation](#1.2.3)
- [Security](#1.3)
- [Authentication](#1.3.1)
- [Authorization](#1.3.2)
- [High Availability](#1.4)
- [HA Configurations](#1.4.1)
**Kyuubi** is an enhanced edition of the [Apache Spark](http://spark.apache.org)'s primordial
[Thrift JDBC/ODBC Server](http://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbcodbc-server).
It is mainly designed for directly running SQL towards a cluster with all components including HDFS, YARN, Hive MetaStore,
and itself secured. The main purpose of Kyuubi is to realize an architecture that can not only speed up SQL queries using
Spark SQL Engine, but also be compatible with HiveServer2's behavior as much as possible. Thus, Kyuubi uses the same protocol
as HiveServer2, which can be found at [HiveServer2 Thrift API](https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Thrift+API),
as the client-server communication mechanism, and a user session level `SparkContext` instantiating / registering / caching / recycling
mechanism to implement multi-tenant functionality.
![](../imgs/kyuubi_architecture.png)
<h2 id="1.1">Unified Interface</h2>
Because Kyuubi uses the same protocol as HiveServer2, it supports all kinds of JDBC/ODBC clients and user applications written
against this Thrift API, as shown in the picture above. A user, Tom, can use various types of clients to create connections to the Kyuubi Server,
and each connection is bound to a `SparkSession` instance, which also contains an independent `HiveMetaStoreClient` to interact with the Hive MetaStore
Server. Tom can set session level configurations for each connection without affecting each other.
<h2 id="1.2">Runtime Resource Resiliency</h2>
Kyuubi does not occupy any resources from the Cluster Manager (Yarn) during startup, and will give all resources back to Yarn if there
is not any active session interacting with a `SparkContext`. With the ability of Spark's [Dynamic Resource Allocation](http://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation),
it also allows us to dynamically allocate resources within a `SparkContext`, a.k.a. a Yarn application.
<h4 id="1.2.1">Kyuubi Dynamic Resource Requesting</h4>
- Session Level Resource Configurations
Kyuubi supports all Spark/Hive/Hadoop configurations, such as `spark.executor.cores/memory`, to be set in the connection
string; they will be used to initialize the `SparkContext`.
- Example
```
jdbc:hive2://<host>:<port>/;hive.server2.proxy.user=tom#spark.yarn.queue=theque;spark.executor.instances=3;spark.executor.cores=3;spark.executor.memory=10g
```
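As an illustration (not part of Kyuubi itself), a small Java helper might assemble such a connection string from session-level Spark configurations; the class and method names here are hypothetical:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper: builds a Kyuubi JDBC URL in the shape shown above,
// appending session-level Spark configurations after the '#' separator.
public class KyuubiUrlBuilder {
    public static String build(String host, int port, String proxyUser,
                               Map<String, String> sparkConfs) {
        // configurations after '#' are ';'-separated key=value pairs
        String confs = sparkConfs.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining(";"));
        return "jdbc:hive2://" + host + ":" + port
                + "/;hive.server2.proxy.user=" + proxyUser + "#" + confs;
    }

    public static void main(String[] args) {
        Map<String, String> confs = new LinkedHashMap<>();
        confs.put("spark.yarn.queue", "theque");
        confs.put("spark.executor.instances", "3");
        confs.put("spark.executor.cores", "3");
        confs.put("spark.executor.memory", "10g");
        System.out.println(build("kyuubi.org", 10009, "tom", confs));
    }
}
```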
<h4 id="1.2.2">Kyuubi Dynamic SparkContext Cache</h4>
Kyuubi implements a `SparkSessionCacheManager` to control `SparkSession`/`SparkContext` for instantiating, registering,
caching, reusing, and recycling. Each user has one and only one `SparkContext` instance in the Kyuubi Server after connecting
to the server for the first time; it is cached in `SparkSessionCacheManager` for the whole connection lifetime and for
a while after all connections are closed.
<div style="text-align: center">
<img style="zoom: 0.66" src="./imgs/impersonation.png" />
</div>
All connections belonging to the same user share this `SparkContext` to generate their own `SparkSession`s.
<h4 id="1.2.3">Spark Dynamic Resource Allocation</h4>
Spark provides a mechanism to dynamically adjust the resources your application occupies based on the workload. It means
that your application may give resources back to the cluster if they are no longer used and request them again later when
there is demand. This feature is particularly useful if multiple applications share resources in your Spark cluster.
Please refer to [Dynamic Resource Allocation](http://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation) to see more.
Please refer to [Dynamic Allocation Configuration](http://spark.apache.org/docs/latest/configuration.html#dynamic-allocation) to learn how to configure.
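As a sketch, the relevant settings in `spark-defaults.conf` might look like the following (the values are illustrative, not recommendations):

```
spark.dynamicAllocation.enabled              true
spark.dynamicAllocation.minExecutors         0
spark.dynamicAllocation.maxExecutors         50
spark.dynamicAllocation.executorIdleTimeout  60s
spark.shuffle.service.enabled                true
```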
With these features, Kyuubi allows us to use computing resources more efficiently.
<h2 id="1.3">Security</h2>
<h4 id="1.3.1">Authentication</h4>
Please refer to the [Authentication/Security Guide](https://yaooqinn.github.io/kyuubi/docs/authentication.html) in the online documentation for an overview on how to enable security for Kyuubi.
<h4 id="1.3.2">Authorization</h4>
Kyuubi can be integrated with [Spark Authorizer](https://yaooqinn.github.io/spark-authorizer/) to offer row/column level access control.
Kyuubi does not explicitly support the spark-authorizer plugin yet; here is an example branch you may refer to:
[Spark Branch Authorized](https://github.com/yaooqinn/spark/tree/v2.1.2-based)
<div style="text-align: center">
<img style="zoom: 1.00" src="./imgs/authorization.png" />
</div>
<h2 id="1.4">High Availability</h2>
<div style="text-align: center">
<img style="zoom: 1.00" src="./imgs/ha.png" />
</div>
Multiple Kyuubi Server instances can register themselves with ZooKeeper when `spark.kyuubi.ha.enabled=true`, and then
clients can find a Kyuubi Server through ZooKeeper. When a client requests a server instance, ZooKeeper returns a randomly
selected registered one. This feature offers:
- High Availability
- Load Balancing
- Rolling Upgrade
<h4 id="1.4.1">HA Configurations</h4>
Name|Default|Description
---|---|---
spark.kyuubi.<br />ha.enabled|false|Whether KyuubiServer supports dynamic service discovery for its clients. To support this, each instance of KyuubiServer currently uses ZooKeeper to register itself, when it is brought up. JDBC/ODBC clients should use the ZooKeeper ensemble: spark.kyuubi.ha.zk.quorum in their connection string.
spark.kyuubi.<br />ha.zk.quorum|none|Comma separated list of ZooKeeper servers to talk to, when KyuubiServer supports service discovery via Zookeeper.
spark.kyuubi.<br />ha.zk.namespace|kyuubiserver|The parent node in ZooKeeper used by KyuubiServer when supporting dynamic service discovery.
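For clients, connecting through the ZooKeeper ensemble typically follows the HiveServer2-style dynamic service discovery URL. Assuming that convention (the hosts and namespace below are placeholders), a connection string might look like:

```
jdbc:hive2://zk1:2181,zk2:2181,zk3:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=kyuubiserver
```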
# Kyuubi Internal
Kyuubi's internals are simple to understand, as shown in the picture below. We may talk about them in more detail later.
![](../imgs/kyuubi_internal.png)
## Additional Documentations
[Building Kyuubi](https://yaooqinn.github.io/kyuubi/docs/building.html)
[Kyuubi Deployment Guide](https://yaooqinn.github.io/kyuubi/docs/deploy.html)
[Kyuubi Containerization Guide](https://yaooqinn.github.io/kyuubi/docs/containerization.html)
[High Availability Guide](https://yaooqinn.github.io/kyuubi/docs/high_availability_guide.html)
[Configuration Guide](https://yaooqinn.github.io/kyuubi/docs/configurations.html)
[Authentication/Security Guide](https://yaooqinn.github.io/kyuubi/docs/authentication.html)
[Kyuubi ACL Management Guide](https://yaooqinn.github.io/kyuubi/docs/authorization.html)
[Home Page](https://yaooqinn.github.io/kyuubi/)


@ -12,7 +12,6 @@ Deploying Kyuubi
settings
on_yarn
hive_metastore
architecture
deploy
high_availability_guide
metrics

Binary file not shown (before: 68 KiB, after: 220 KiB).


@ -0,0 +1,169 @@
<div align=center>
![](../imgs/kyuubi_logo_simple.png)
</div>
# Kyuubi Architecture
## Introduction
Kyuubi is a high-performance universal JDBC and SQL execution engine. The goal of Kyuubi is to facilitate users in handling big data like ordinary data.
It provides a standardized JDBC interface with easy-to-use data access in big data scenarios.
End-users can focus on developing their own business systems and mining data value without having to be aware of the underlying big data platform (compute engines, storage services, metadata management, etc.).
Kyuubi relies on Apache Spark to provide high-performance data query capabilities,
and every improvement in the engine's capabilities can help Kyuubi's performance make a qualitative leap.
In addition, Kyuubi improves ad-hoc responsiveness through engine caching,
and enhances concurrency through horizontal scaling and load balancing.
It provides complete authentication and authorization services to ensure data and metadata security.
It provides robust high availability and load balancing to help you guarantee the SLA commitment.
It provides a two-level elastic resource management architecture to effectively improve resource utilization while covering the performance and response requirements of all scenarios,
from interactive and batch processing to point queries and full table scans.
It embraces Spark and builds an ecosystem on top of it,
which allows Kyuubi to quickly expand its existing ecosystem and introduce new features,
such as cloud-native support and `Data Lake/Lake House` support.
Kyuubi's vision is to build on top of Apache Spark and Data Lake technologies to unify the portal and become an ideal data lake management platform.
It can support data processing, e.g. ETL, and analytics, e.g. BI, in a pure SQL way.
All workloads can be done on one platform, using one copy of data, with one SQL interface.
## Architecture Overview
The basic technical architecture of the Kyuubi system is shown in the following diagram.
![](../imgs/kyuubi_architecture_new.png)
The middle part of the diagram shows the main part of the Kyuubi server,
which handles the connection and execution requests from the clients shown in the left part of the image. Within Kyuubi,
these connection requests are maintained as `Kyuubi Session`s,
and execution requests are maintained as `Kyuubi Operation`s which are bound to the corresponding sessions.
The creation of a `Kyuubi Session` can be divided into two cases: lightweight and heavyweight.
Most Session creation is lightweight and user-unaware.
The only heavyweight case is when there is no `SparkContext` instantiated or cached in the user's shared domain,
which usually happens when the user is connecting for the first time or has not connected for a long time.
This one-time cost session maintenance model can meet most of the ad-hoc fast response requirements.
Kyuubi maintains connections to `SparkContext`s in a loosely coupled fashion. These `SparkContext`s can be Spark programs created locally in client deploy mode by this service instance,
or in Yarn or Kubernetes clusters in cluster deploy mode.
In highly available mode, these `SparkContext`s can also be created by other Kyuubi instances on other machines and then shared by this instance.
These `SparkContext` instances are essentially remote query execution engine programs hosted by Kyuubi services.
These programs are implemented on Spark SQL and compile, optimize, and execute SQL statements end-to-end,
handling the necessary interactions with metadata (e.g. Hive Metastore) and storage (e.g. HDFS) services along the way,
maximizing the power of Spark SQL.
They can manage their own lifecycle,
cache and recycle themselves,
and are not affected by failover on the Kyuubi server.
Next, let's share some of the key design concepts of Kyuubi.
## Unified Interface
Kyuubi implements the [Hive Service RPC](https://mvnrepository.com/artifact/org.apache.hive/hive-service-rpc/2.3.7) module,
which provides the same way of accessing data as HiveServer2 and Spark Thrift Server.
On the client side, you can build fantastic business reports, BI applications, or even ETL jobs only via the [Hive JDBC](https://mvnrepository.com/artifact/org.apache.hive/hive-jdbc/2.3.7) module.
You only need to be familiar with Structured Query Language (SQL) and Java Database Connectivity (JDBC) to handle massive data.
It helps you focus on the design and implementation of your business system.
- SQL is the standard language for accessing relational databases, and very popular in the big data ecosystem too.
It turns out that everybody knows SQL.
- JDBC provides a standard API for tool/database developers and makes it possible to write database applications using a pure Java API.
- There are plenty of free or commercial JDBC tools out there.
## Runtime Resource Resiliency
The biggest difference between Kyuubi and Spark Thrift Server (STS) is that STS is a single Spark application.
For example, if it runs on an Apache Hadoop Yarn cluster,
this application is also a single Yarn application that can only exist in a specific fixed queue of the Yarn cluster after it is created,
while Kyuubi supports the submission of multiple Spark applications.
For resource management, with a single application Yarn loses its role as a resource manager and cannot provide resource isolation and sharing.
When users from the client side have different resource queue permissions,
STS will not be able to handle it.
For data access, a single Spark application has only one user globally,
a.k.a. `sparkUser`, and we have to grant it a superuser-like role in order to allow it to perform data access to different client users,
which is an extremely insecure practice in production environments.
Kyuubi creates different Spark applications based on the connection requests from the client,
and these applications can be placed in different shared domains for other connection requests to share.
Kyuubi does not occupy any resources from the Cluster Manager (e.g. Yarn) during startup, and will give all resources back if there
is not any active session interacting with a `SparkContext`.
Spark also provides [Dynamic Resource Allocation](http://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation) to dynamically adjust the resources your application occupies based on the workload. It means
that your application may give resources back to the cluster if they are no longer used and request them again later when
there is demand. This feature is particularly useful if multiple applications share resources in your Spark cluster.
With these features, Kyuubi provides a two-level elastic resource management architecture to effectively improve resource utilization.
For example,
```shell
./beeline -u 'jdbc:hive2://kyuubi.org:10009/; \
  hive.server2.proxy.user=tom# \
  spark.yarn.queue=thequeue; \
  spark.dynamicAllocation.enabled=true; \
  spark.dynamicAllocation.maxExecutors=500; \
  spark.shuffle.service.enabled=true; \
  spark.executor.cores=3; \
  spark.executor.memory=10g'
```
If the user named `tom` opens a connection like above, Kyuubi will try to create a Spark SQL engine application with [3, 500] executors (3 cores, 10g mem each) in the queue named `thequeue` in Yarn cluster.
On one hand, because tom enables Spark's dynamic resource request feature,
Spark will efficiently request and recycle executors within the program based on the scale of the SQL operations and the available resources in the queue.
On the other hand, when Kyuubi finds that the application has been idle for too long, it will also recycle the application itself.
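Combining the two levels, a sketch of the settings involved might be as follows; the Kyuubi-side property name and format here are assumptions and may differ by version:

```
# Spark level: elastic executors inside one engine
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.maxExecutors=500
# Kyuubi level (name/format assumed): recycle the whole engine after it stays idle
kyuubi.session.engine.idle.timeout=PT30M
```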
## High Availability & Load Balance
For an enterprise service, the Service Level Agreement (SLA) commitment must be at a very high level,
and the concurrency needs to be sufficiently robust to support the entire enterprise's requests.
Spark Thrift Server, as a single Spark application without High Availability implemented, can hardly meet the SLA and concurrency requirements.
When there are large query requests, there are potential bottlenecks in metadata service access, scheduling and memory pressure on the Spark driver, or the overall computational resource constraints of the application.
Kyuubi provides both high availability and load balancing solutions based on Zookeeper, as shown in the following diagram.
![](../imgs/ha.png)
Let's break it down from top to bottom based on the above diagram.
1. At the top of the diagram is the client layer,
where a client can find multiple registered Kyuubi instances (k.i.) from the namespace in the service discovery layer and then choose one to connect to.
Kyuubi instances registered to the same namespace provide the ability to load balance each other.
2. The selected Kyuubi instance will pick an available engine instance (e.i.) from the engine-namespace in the service discovery layer to establish a connection.
If no available instance is found, it will create a new one, wait for the engine to finish registering, and then proceed to connect.
3. If a new connection is requested by the same person, the connection will be set up to the same or another Kyuubi instance, but the engine instance will be reused.
4. For connections from different users, steps 2 and 3 will be repeated.
This is because in the service discovery layer,
the namespaces used to store the addresses of the engine instances are isolated based on the user (by default),
and different users cannot access others' instances across the namespace.
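To illustrate this isolation, the service discovery tree might look like the following; this is a hypothetical layout, and actual znode names differ by version:

```
/kyuubi                   # server namespace: all Kyuubi instances (k.i.)
├── host1:10009
└── host2:10009
/kyuubi_USER/tom          # engine namespace for user tom (e.i.)
└── host3:40001
/kyuubi_USER/alice        # other users cannot reach tom's engines
```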
## Authentication & Authorization
In a secure cluster, services should be able to identify and authenticate callers,
as the identity a user claims is not necessarily true.
The authentication process in Kyuubi is used to verify the identity that a client uses to talk to the Kyuubi server.
Once done, a trusted connection will be set up between the client and server if successful; otherwise, the connection is rejected.
The authenticated client user will also be the user that creates the associated engine instance, and then authorizations for database objects or storage can be applied.
There is also [Submarine: Spark Security](https://mvnrepository.com/artifact/org.apache.submarine/submarine-spark-security), an external plugin to achieve fine-grained, SQL-standard-based authorization.
## Conclusions
Kyuubi is a unified multi-tenant JDBC interface for large-scale data processing and analytics, built on top of [Apache Spark™](http://spark.apache.org/).
It extends the scenarios of Spark Thrift Server in enterprise applications, the most important of which is multi-tenancy support.


@ -9,4 +9,5 @@ Overview
:maxdepth: 2
summary
Architecture <architecture>
kyuubi_vs_hive


@ -0,0 +1,12 @@
**Kyuubi** is an enhanced edition of the [Apache Spark](http://spark.apache.org)'s primordial
[Thrift JDBC/ODBC Server](http://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbcodbc-server).
In enterprise computing,
it is mainly designed for directly running SQL towards a cluster with all components, including HDFS, YARN, Hive MetaStore,
and itself, secured. The main purpose of Kyuubi is to realize an architecture that can not only speed up SQL queries using the
Spark SQL Engine, but also be compatible with HiveServer2's behavior as much as possible. Thus, Kyuubi uses the same protocol
of HiveServer2, which can be found at [HiveServer2 Thrift API](https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Thrift+API)
as the client-server communication mechanism, and a user session level `SparkContext` instantiating / registering / caching / recycling
mechanism to implement multi-tenant functionality.