### Why are the changes needed?
Support reassigning batches to an alternative Kyuubi instance in case a Kyuubi instance is lost.
https://github.com/apache/kyuubi/issues/6884
### How was this patch tested?
Unit Test
### Was this patch authored or co-authored using generative AI tooling?
No
Closes #7037 from George314159/6884.
Closes #6884
8565d4aaa [Wang, Fei] KYUUBI_SESSION_CONNECTION_URL_KEY
22d4539e2 [Wang, Fei] admin
075654cb3 [Wang, Fei] check admin
5654a99f4 [Wang, Fei] log and lock
a19e2edf5 [Wang, Fei] minor comments
a60f23ba3 [George314159] refine
760e10f89 [George314159] Update Based On Comments
75f1ee2a9 [Fei Wang] ping (#1)
f42bcaf9a [George314159] Update Based on Comments
1bea70ed6 [George314159] [KYUUBI-6884] Support to reassign the batches to alternative kyuubi instance in case kyuubi instance lost
Lead-authored-by: Wang, Fei <fwang12@ebay.com>
Co-authored-by: George314159 <hua16732@gmail.com>
Co-authored-by: Fei Wang <cn.feiwang@gmail.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
### Why are the changes needed?
To prevent terminated app pods from leaking when events are missed during a Kyuubi server restart.
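The recovery step described above can be sketched roughly as follows (plain Python with hypothetical names; the real implementation lists driver pods via the Kubernetes client and matches the `kyuubi-unique-tag` label):

```python
# Terminal application states after which a pod can be safely cleaned up.
TERMINAL_STATES = {"FINISHED", "FAILED", "KILLED"}

def reconcile_existing_pods(pods):
    """pods: list of dicts with 'name' and 'app_state' keys.

    On server startup, scan pre-existing driver pods and mark those whose
    application already terminated, so a DELETE event missed during the
    restart cannot leak the pod forever.
    """
    terminated = []
    for pod in pods:
        if pod["app_state"] in TERMINAL_STATES:
            terminated.append(pod["name"])
    return terminated

pods = [
    {"name": "driver-1", "app_state": "FINISHED"},
    {"name": "driver-2", "app_state": "RUNNING"},
]
print(reconcile_existing_pods(pods))  # ['driver-1']
```

Only `driver-1` is marked; the still-running pod keeps being tracked by the informer as usual.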
### How was this patch tested?
Manual test.
```
2025-06-17 17:50:37.275 INFO [main] org.apache.kyuubi.engine.KubernetesApplicationOperation: [KubernetesInfo(Some(28),Some(dls-prod))] Found existing pod kyuubi-xb406fc5-7b0b-4fdf-8531-929ed2ae250d-8998-5b406fc5-7b0b-4fdf-8531-929ed2ae250d-8998-90c0b328-930f-11ed-a1eb-0242ac120002-0-20250423211008-grectg-stm-17da59fe-caf4-41e4-a12f-6c1ed9a293f9-driver with label: kyuubi-unique-tag=17da59fe-caf4-41e4-a12f-6c1ed9a293f9 in app state FINISHED, marking it as terminated
2025-06-17 17:50:37.278 INFO [main] org.apache.kyuubi.engine.KubernetesApplicationOperation: [KubernetesInfo(Some(28),Some(dls-prod))] Found existing pod kyuubi-xb406fc5-7b0b-4fdf-8531-929ed2ae250d-8998-5b406fc5-7b0b-4fdf-8531-929ed2ae250d-8998-90c0b328-930f-11ed-a1eb-0242ac120002-0-20250423212011-gpdtsi-stm-6a23000f-10be-4a42-ae62-4fa2da8fac07-driver with label: kyuubi-unique-tag=6a23000f-10be-4a42-ae62-4fa2da8fac07 in app state FINISHED, marking it as terminated
```
The pods are cleaned up eventually.
<img width="664" alt="image" src="https://github.com/user-attachments/assets/8cf58f61-065f-4fb0-9718-2e3c00e8d2e0" />
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #7101 from turboFei/pod_cleanup.
Closes #7101
7f76cf57c [Wang, Fei] async
11c9db25d [Wang, Fei] cleanup
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
### Why are the changes needed?
Respect the terminated app state when building batch info from metadata.
It is a followup for https://github.com/apache/kyuubi/pull/2911,
9e40e39c39/kyuubi-server/src/main/scala/org/apache/kyuubi/server/api/v1/BatchesResource.scala (L128-L142)
1. The Kyuubi instance is unreachable during a maintenance window.
2. The batch app state has been terminated, and the app state was backfilled by another Kyuubi instance peer (see #2911).
3. The batch state in the metadata table is still PENDING/RUNNING.
4. For such a case, return the terminated batch state instead of `PENDING` or `RUNNING`.
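The decision in the list above can be sketched like this (a minimal sketch with assumed state names; the exact mapping from app state to batch state in Kyuubi may differ):

```python
TERMINAL_APP_STATES = {"FINISHED", "FAILED", "KILLED"}

def effective_batch_state(metadata_state, app_state):
    """Prefer a terminal app state over a stale PENDING/RUNNING
    metadata state backfilled by a peer instance."""
    if metadata_state in ("PENDING", "RUNNING") and app_state in TERMINAL_APP_STATES:
        # Assumed mapping: a successful app maps to FINISHED,
        # anything else to ERROR.
        return "FINISHED" if app_state == "FINISHED" else "ERROR"
    return metadata_state
```

For example, a batch whose metadata still says `RUNNING` but whose application was backfilled as `FAILED` would be reported as `ERROR` rather than `RUNNING`.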
### How was this patch tested?
GA and IT.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #7095 from turboFei/always_respect_appstate.
Closes #7095
ec72666c9 [Wang, Fei] rename
bc74a9c56 [Wang, Fei] if op not terminated
e786c8d9b [Wang, Fei] respect terminated app state when building batch info from metadata
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
### Why are the changes needed?
To show how many metadata records were cleaned up.
### How was this patch tested?
```
(base) ➜ kyuubi git:(delete_metadata) grep 'Cleaned up' target/unit-tests.log
01:58:17.109 ScalaTest-run-running-JDBCMetadataStoreSuite INFO JDBCMetadataStore: Cleaned up 0 records older than 1000 ms from metadata.
01:58:17.124 ScalaTest-run-running-JDBCMetadataStoreSuite INFO JDBCMetadataStore: Cleaned up 0 records older than 1000 ms from metadata.
...
01:58:18.108 ScalaTest-run-running-JDBCMetadataStoreSuite INFO JDBCMetadataStore: Cleaned up 1 records older than 1000 ms from metadata.
01:58:18.162 ScalaTest-run INFO JDBCMetadataStore: Cleaned up 0 records older than 0 ms from k8s_engine_info.
```
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #7093 from turboFei/delete_metadata.
Closes #7093
e0cf300f8 [Wang, Fei] update
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
### Why are the changes needed?
The metrics `kyuubi_operation_state_LaunchEngine_*` cannot reflect the state of the startup `Semaphore` after configuring the maximum engine startup limit through `kyuubi.server.limit.engine.startup`. This change adds metrics to expose the relevant permit state.
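A minimal sketch of what such permit metrics look like, using a plain semaphore and hypothetical metric names (the actual Kyuubi metric keys may differ):

```python
import threading

class StartupPermits:
    """Wraps a startup-limit semaphore and exposes its permit state
    as gauge-style metrics (hypothetical names)."""

    def __init__(self, max_startups):
        self.max_startups = max_startups
        self._sem = threading.Semaphore(max_startups)
        self._in_use = 0
        self._lock = threading.Lock()

    def acquire(self):
        self._sem.acquire()
        with self._lock:
            self._in_use += 1

    def release(self):
        with self._lock:
            self._in_use -= 1
        self._sem.release()

    def metrics(self):
        # Snapshot of the permit state, suitable for gauge registration.
        with self._lock:
            return {
                "engine_startup_permits_total": self.max_startups,
                "engine_startup_permits_in_use": self._in_use,
                "engine_startup_permits_available": self.max_startups - self._in_use,
            }
```

With two total permits and one engine starting up, `metrics()` reports one available permit.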
### How was this patch tested?
### Was this patch authored or co-authored using generative AI tooling?
Closes #7072 from LennonChin/engine_startup_metrics.
Closes #7072
d6bf3696a [Lennon Chin] Expose metrics of engine startup permit status
Authored-by: Lennon Chin <i@coderap.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### Why are the changes needed?
As clarified in https://github.com/apache/kyuubi/issues/6926, there are scenarios where users want to launch an engine on each Kyuubi server. The SERVER_LOCAL engine share level implements this by using the local host address as the subdomain, in which case each Kyuubi server's engine is unique.
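A rough sketch of the idea, with an invented namespace layout (the real engine space format in Kyuubi is different, and the host would be resolved from the server itself):

```python
def engine_space(share_level, user, local_host, subdomain=None):
    """SERVER_LOCAL behaves like SERVER but pins the subdomain to the
    local host address, so each server launches its own engine."""
    if share_level == "SERVER_LOCAL":
        # Host address becomes the subdomain, unique per Kyuubi server.
        subdomain = local_host.replace(".", "-")
        share_level = "SERVER"
    return f"/kyuubi_{share_level}/{user}/{subdomain or 'default'}"
```

Two servers at `10.0.0.5` and `10.0.0.6` would thus compute distinct spaces and never share an engine.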
### How was this patch tested?
### Was this patch authored or co-authored using generative AI tooling?
No
Closes #7013 from taylor12805/share_level_server_local.
Closes #6926
ba201bb72 [taylor.fan] [KYUUBI #6926] update format
42f0a4f7d [taylor.fan] [KYUUBI #6926] move host address to subdomain
e06de79ad [taylor.fan] [KYUUBI #6926] Add SERVER_LOCAL engine share level
Authored-by: taylor.fan <taylor.fan@vipshop.com>
Signed-off-by: Kent Yao <yao@apache.org>
### Why are the changes needed?
1. Persist the Kubernetes application terminate info into the metastore to prevent event loss.
2. If the application info cannot be found in the informer application info store, fall back to the metastore instead of returning NOT_FOUND directly.
3. This is critical because returning a false application state might cause data quality issues.
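The fallback order can be sketched as follows (dicts standing in for the informer cache and the metastore; names are illustrative):

```python
def get_application_info(tag, informer_store, metastore):
    """Prefer the live informer cache; fall back to the persisted
    terminate info before ever reporting NOT_FOUND."""
    info = informer_store.get(tag)
    if info is not None:
        return info
    # Informer may have missed events (e.g. server restart): consult
    # the terminate info persisted in the metastore.
    info = metastore.get(tag)
    if info is not None:
        return info
    return {"state": "NOT_FOUND"}
```

Only when both sources are empty does the lookup report `NOT_FOUND`, which avoids falsely failing a batch whose pod events were lost.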
### How was this patch tested?
UT and IT.
<img width="1917" alt="image" src="https://github.com/user-attachments/assets/306f417c-5037-4869-904d-dcf657ff8f60" />
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #7029 from turboFei/kubernetes_state.
Closes #7028
9f2badef3 [Wang, Fei] generic dialect
186cc690d [Wang, Fei] nit
82ea62669 [Wang, Fei] Add pod name
4c59bebb5 [Wang, Fei] Refine
327a0d594 [Wang, Fei] Remove create_time from k8s engine info
12c24b1d0 [Wang, Fei] do not use MYSQL deprecated VALUES(col)
becf9d1a7 [Wang, Fei] insert or replace
d167623c1 [Wang, Fei] migration
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
### Why are the changes needed?
Add an option to construct the batch info from metadata directly instead of redirecting the request, to reduce RPC latency.
### How was this patch tested?
Minor change and Existing GA.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #7043 from turboFei/support_no_redirect.
Closes #7043
7f7a2fb80 [Wang, Fei] comments
bb0e324a1 [Wang, Fei] save
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
### Why are the changes needed?
Followup for #7034 to fix SparkOnKubernetesTestsSuite.
Sorry, I forgot that the appInfo name and pod name were previously deeply coupled: the appInfo name was used as the pod name and was used to delete the pod.
In this PR, we add `podName` to applicationInfo to separate the app name from the pod name.
### How was this patch tested?
GA should pass.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #7039 from turboFei/fix_test.
Closes #7034
0ff7018d6 [Wang, Fei] revert
18e48c079 [Wang, Fei] comments
19f34bc83 [Wang, Fei] do not get pod name from appName
c1d308437 [Wang, Fei] reduce interval for test stability
50fad6bc5 [Wang, Fei] fix ut
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
### Why are the changes needed?
To fix an NPE.
Previously, we used the method below to get `metadataManager`:
```
private def metadataManager = KyuubiServer.kyuubiServer.backendService
.sessionManager.asInstanceOf[KyuubiSessionManager].metadataManager
```
But before the Kyuubi server has fully restarted, `KyuubiServer.kyuubiServer` is null and might throw an NPE during the batch recovery phase.
For example:
```
2025-04-23 14:06:24.040 ERROR [KyuubiSessionManager-exec-pool: Thread-231] org.apache.kyuubi.engine.KubernetesApplicationOperation: Failed to get application by label: kyuubi-unique-tag=95116703-4240-4cc1-9886-ccae3a2ac879, due to Cannot invoke "org.apache.kyuubi.server.KyuubiServer.backendService()" because the return value of "org.apache.kyuubi.server.KyuubiServer$.kyuubiServer()" is null
```
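The null-safe shape of the fix can be sketched like this (nested dicts standing in for the Scala object graph; purely illustrative):

```python
def metadata_manager(kyuubi_server):
    """Return None instead of raising when the server singleton has not
    been initialized yet (e.g. during batch recovery at restart)."""
    if kyuubi_server is None:
        # Server not fully started: caller must handle the absence
        # rather than hit an NPE on kyuubiServer.backendService().
        return None
    return kyuubi_server["backend"]["session_manager"]["metadata_manager"]
```

Callers then treat a `None` result as "metadata manager not available yet" instead of crashing the recovery thread.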
### How was this patch tested?
Existing GA.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #7041 from turboFei/fix_NPE.
Closes #7041
064d88707 [Wang, Fei] Fix NPE
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
### Why are the changes needed?
After https://github.com/apache/spark/pull/34460 (since Spark 3.3.0), the `spark-app-name` label is available.
We should use it as the application name if it exists.
### How was this patch tested?
Minor change.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #7034 from turboFei/k8s_app_name.
Closes #7034
bfa88a436 [Wang, Fei] Get pod app name
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
### Why are the changes needed?
I found that, for a Kyuubi batch on Kubernetes:
1. It had reached the `FINISHED` state.
2. Then I deleted the pod manually; checking the k8s-audit.log, the appState became `FAILED`.
```
2025-04-15 11:16:30.453 INFO [-675216314-pool-44-thread-839] org.apache.kyuubi.engine.KubernetesApplicationAuditLogger: label=61e7d8c1-e5a9-46cd-83e7-c611003f0224 context=97 namespace=dls-prod pod=kyuubi-spark-61e7d8c1-e5a9-46cd-83e7-c611003f0224-driver podState=Running containers=[microvault->ContainerState(running=ContainerStateRunning(startedAt=2025-04-15T18:13:48Z, additionalProperties={}), terminated=null, waiting=null, additionalProperties={}),spark-kubernetes-driver->ContainerState(running=null, terminated=ContainerStateTerminated(containerID=containerd://72704f8e7ccb5e877c8f6b10bf6ad810d0c019e07e0cb5975be733e79762c1ec, exitCode=0, finishedAt=2025-04-15T18:14:22Z, message=null, reason=Completed, signal=null, startedAt=2025-04-15T18:13:49Z, additionalProperties={}), waiting=null, additionalProperties={})] appId=spark-228c62e0dc37402bacac189d01b871e4 appState=FINISHED appError=''
:2025-04-15 11:16:30.854 INFO [-675216314-pool-44-thread-840] org.apache.kyuubi.engine.KubernetesApplicationAuditLogger: label=61e7d8c1-e5a9-46cd-83e7-c611003f0224 context=97 namespace=dls-prod pod=kyuubi-spark-61e7d8c1-e5a9-46cd-83e7-c611003f0224-driver podState=Failed containers=[microvault->ContainerState(running=null, terminated=ContainerStateTerminated(containerID=containerd://91654e3ee74e2c31218e14be201b50a4a604c2ad15d3afd84dc6f620e59894b7, exitCode=2, finishedAt=2025-04-15T18:16:30Z, message=null, reason=Error, signal=null, startedAt=2025-04-15T18:13:48Z, additionalProperties={}), waiting=null, additionalProperties={}),spark-kubernetes-driver->ContainerState(running=null, terminated=ContainerStateTerminated(containerID=containerd://72704f8e7ccb5e877c8f6b10bf6ad810d0c019e07e0cb5975be733e79762c1ec, exitCode=0, finishedAt=2025-04-15T18:14:22Z, message=null, reason=Completed, signal=null, startedAt=2025-04-15T18:13:49Z, additionalProperties={}), waiting=null, additionalProperties={})] appId=spark-228c62e0dc37402bacac189d01b871e4 appState=FAILED appError='{
```
This PR is a followup for #6690, which ignores the container state if the pod is terminated.
It is more reasonable to respect the terminated container state than the terminated pod state.
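The precedence rule can be sketched as follows (simplified container/pod states as plain values; the real code inspects the Fabric8 pod model):

```python
def resolve_app_state(pod_state, driver_container_state):
    """Prefer the terminated driver container's result over the pod
    phase: a driver that exited 0 inside a Failed pod (e.g. a sidecar
    errored, or the pod was deleted later) is still FINISHED."""
    if driver_container_state is not None and driver_container_state["terminated"]:
        return "FINISHED" if driver_container_state["exit_code"] == 0 else "FAILED"
    # No terminated driver container: fall back to the pod phase.
    return {"Succeeded": "FINISHED", "Failed": "FAILED"}.get(pod_state, "RUNNING")
```

This matches the scenario in the description: the pod flips to `Failed` after a manual delete, but the driver container had already terminated with exit code 0, so the app stays `FINISHED`.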
### How was this patch tested?
Integration testing.
```
:2025-04-15 13:53:24.551 INFO [-1077768163-pool-36-thread-3] org.apache.kyuubi.engine.KubernetesApplicationAuditLogger: eventType=DELETE label=e0eb4580-3cfa-43bf-bdcc-efeabcabc93c context=97 namespace=dls-prod pod=kyuubi-spark-e0eb4580-3cfa-43bf-bdcc-efeabcabc93c-driver podState=Failed containers=[microvault->ContainerState(running=null, terminated=ContainerStateTerminated(containerID=containerd://66c42206730950bd422774e3c1b0f426d7879731788cea609bbfe0daab24a763, exitCode=2, finishedAt=2025-04-15T20:53:22Z, message=null, reason=Error, signal=null, startedAt=2025-04-15T20:52:00Z, additionalProperties={}), waiting=null, additionalProperties={}),spark-kubernetes-driver->ContainerState(running=null, terminated=ContainerStateTerminated(containerID=containerd://9179a73d9d9e148dcd9c13ee6cc29dc3e257f95a33609065e061866bb611cb3b, exitCode=0, finishedAt=2025-04-15T20:52:28Z, message=null, reason=Completed, signal=null, startedAt=2025-04-15T20:52:01Z, additionalProperties={}), waiting=null, additionalProperties={})] appId=spark-578df0facbfd4958a07f8d1ae79107dc appState=FINISHED appError=''
```
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #7025 from turboFei/container_terminated.
Closes #7025
Closes #6686
a3b2a5a56 [Wang, Fei] comments
4356d1bc9 [Wang, Fei] fix the app state logical
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
### Why are the changes needed?
1. Audit the Kubernetes resource event type.
2. Fix the processing logic for the DELETE event.
Before this PR:
I tried to delete the pod manually, and then saw that Kyuubi thought `appState=PENDING`.
```
2025-04-15 13:58:20.320 INFO [-1077768163-pool-36-thread-7] org.apache.kyuubi.engine.KubernetesApplicationAuditLogger: eventType=DELETE label=3c58e9fd-cf8c-4cc3-a9aa-82ae40e200d8 context=97 namespace=dls-prod pod=kyuubi-spark-3c58e9fd-cf8c-4cc3-a9aa-82ae40e200d8-driver podState=Pending containers=[] appId=spark-cd125bbd9fc84ffcae6d6b5d41d4d8ad appState=PENDING appError=''
```
It seems that the pod status in the event is a snapshot taken before the pod was deleted.
Then we would not receive any further event for this pod, and finally the batch FINISHED with application `NOT_FOUND`.
<img width="1389" alt="image" src="https://github.com/user-attachments/assets/5df03db6-0924-4a58-9538-b196fbf87f32" />
It seems we need to process the DELETE event specially:
1. Get the app state from the pod/container states.
2. If the application state obtained is terminated, return it directly.
3. Otherwise, the application state should be FAILED, since the pod has been deleted.
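The three steps above can be sketched as a small function (state names assumed from the surrounding description):

```python
TERMINAL_APP_STATES = {"FINISHED", "FAILED", "KILLED"}

def on_delete_event(app_state_from_pod):
    """DELETE handling: the event carries a pre-deletion snapshot, so a
    non-terminal state (e.g. PENDING) cannot be trusted - the pod is
    gone and no further events will arrive."""
    if app_state_from_pod in TERMINAL_APP_STATES:
        return app_state_from_pod  # keep the real terminal outcome
    return "FAILED"                # deleted before terminating: treat as FAILED
```

So the `appState=PENDING` snapshot from the audit log above would now resolve to `FAILED` instead of leaving the batch waiting for events that never come.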
### How was this patch tested?
<img width="1614" alt="image" src="https://github.com/user-attachments/assets/11e64c6f-ad53-4485-b8d2-a351bb23e8ca" />
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #7026 from turboFei/k8s_audit.
Closes #7026
4e5695d34 [Wang, Fei] for delete
c16757218 [Wang, Fei] audit the pod event type
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
### Why are the changes needed?
This ensures the Kyuubi server is promptly informed of any Kubernetes resource changes after startup. It is highly recommended to set it when running multiple Kyuubi instances.
### How was this patch tested?
Existing GA and Integration testing.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #7027 from turboFei/k8s_client_init.
Closes #7027
393b9960a [Wang, Fei] server only
a640278c4 [Wang, Fei] refresh
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
### Why are the changes needed?
Currently, if the session between the client and the Kyuubi server is disconnected without being closed properly, it is difficult to debug, and we have to check the Kyuubi server log.
It would be better to record such information in the Kyuubi session event.
### How was this patch tested?
IT.
<img width="1264" alt="image" src="https://github.com/user-attachments/assets/d2c5b6d0-6298-46ec-9b73-ce648551120c" />
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #7015 from turboFei/disconnect.
Closes #7015
c95709284 [Wang, Fei] do not post
e46521410 [Wang, Fei] nit
bca7f9b7e [Wang, Fei] post
1cf6f8f49 [Wang, Fei] disconnect
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
### Why are the changes needed?
To fix the issue that the batch Kyuubi instance port is negative.
<img width="697" alt="image" src="https://github.com/user-attachments/assets/ef992390-8d20-44b3-8640-35496caff85d" />
It happened after I stopped the Kyuubi service.
We should use a variable instead of a function for the Jetty server's serverUri.
After the server connector is stopped, the localPort becomes `-2`.

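The fix amounts to snapshotting the connection URL while the connector is live, rather than recomputing it on every access. A minimal sketch (class and property names are illustrative, not the actual Kyuubi API):

```python
class FrontendService:
    """Capture the server URI once at start; do not derive it from a
    connector whose localPort becomes -2 after stop."""

    def __init__(self):
        self._server_uri = None

    def start(self, host, local_port):
        # Snapshot while the connector is running and the port is valid.
        self._server_uri = f"{host}:{local_port}"

    @property
    def server_uri(self):
        # Returns the captured value even after the connector stops.
        return self._server_uri
```

A batch record written during shutdown then still sees the real port instead of a negative sentinel value.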
### How was this patch tested?
Existing UT.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #7017 from turboFei/server_port_negative.
Closes #7017
3d34c4031 [Wang, Fei] warn
e58298646 [Wang, Fei] mutable server uri
2cbaf772a [Wang, Fei] Revert "hard code the server uri"
b64d91b32 [Wang, Fei] hard code the server uri
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
### Why are the changes needed?
Since https://github.com/apache/kyuubi/pull/3618,
the Kyuubi server can retry opening the engine when encountering a specific error.
1937dd93f9/kyuubi-server/src/main/scala/org/apache/kyuubi/session/KyuubiSessionImpl.scala (L177-L212)
The `_client` might be reset and closed.
So we should set `_client` only after the engine session has been opened successfully, since the `client` method is public.
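The publish-only-on-success pattern can be sketched like this (a simplified retry loop; the real Scala code retries on a specific error class, not on every exception):

```python
class EngineSession:
    """Only assign _client once the engine session is open, so concurrent
    readers of the public `client` accessor never see a half-initialized
    or already-closed client during retries."""

    def __init__(self):
        self._client = None

    def open_engine_session(self, connect, max_attempts=3):
        last_err = None
        for _ in range(max_attempts):
            try:
                candidate = connect()        # may fail and be retried
                candidate.open_session()
                self._client = candidate     # publish only after success
                return
            except Exception as e:
                last_err = e
        raise last_err

    @property
    def client(self):
        if self._client is None:
            raise RuntimeError("engine session not opened yet")
        return self._client
```

If the first attempt fails, `_client` stays `None` until a retry succeeds, rather than pointing at a closed client.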
### How was this patch tested?
Existing UT.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #7011 from turboFei/client_ready.
Closes #7011
3ad57ee91 [Wang, Fei] fix npe
b956394fa [Wang, Fei] close internal engine client
523b48a4d [Wang, Fei] internal client
5baeedec1 [Wang, Fei] Revert "method"
84c808cfb [Wang, Fei] method
8efaa52f6 [Wang, Fei] check engine launched
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
### Why are the changes needed?
Fix the missing `assert` in `SparkProcessBuilderSuite - spark process builder`.
Fix the flaky test `SparkProcessBuilderSuite - capture error from spark process builder` by increasing `kyuubi.session.engine.startup.maxLogLines` from 10 to 4096. The test fails easily, especially with Spark 4.0, due to the larger error stack trace; for example, https://github.com/apache/kyuubi/actions/runs/13974413470/job/39290129824
```
SparkProcessBuilderSuite:
- spark process builder
- capture error from spark process builder *** FAILED ***
The code passed to eventually never returned normally. Attempted 167 times over 1.5007926256666668 minutes. Last failure message: "org.apache.kyuubi.KyuubiSQLException: Suppressed: org.apache.spark.util.Utils$OriginalTryStackTraceException: Full stacktrace of original doTryWithCallerStacktrace caller
See more: /home/runner/work/kyuubi/kyuubi/kyuubi-server/target/work/kentyao/kyuubi-spark-sql-engine.log.2
at org.apache.kyuubi.KyuubiSQLException$.apply(KyuubiSQLException.scala:69)
at org.apache.kyuubi.engine.ProcBuilder.$anonfun$start$1(ProcBuilder.scala:239)
at java.base/java.lang.Thread.run(Thread.java:1583)
.
FYI: The last 10 line(s) of log are:
25/03/24 12:53:39 INFO MemoryStore: MemoryStore started with capacity 434.4 MiB
25/03/24 12:53:39 INFO MemoryStore: MemoryStore cleared
25/03/24 12:53:39 INFO BlockManager: BlockManager stopped
25/03/24 12:53:39 INFO BlockManagerMaster: BlockManagerMaster stopped
25/03/24 12:53:39 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
25/03/24 12:53:39 INFO SparkContext: Successfully stopped SparkContext
25/03/24 12:53:39 INFO ShutdownHookManager: Shutdown hook called
25/03/24 12:53:39 INFO ShutdownHookManager: Deleting directory /tmp/spark-18455622-344e-48ac-92eb-4b368c35e697
25/03/24 12:53:39 INFO ShutdownHookManager: Deleting directory /home/runner/work/kyuubi/kyuubi/kyuubi-server/target/work/kentyao/artifacts/spark-7479249b-44a2-4fe5-aa0f-544074f9c356
25/03/24 12:53:39 INFO ShutdownHookManager: Deleting directory /tmp/spark-5ba8250f-1ff2-4e0d-a365-27d7518308e1" did not contain "org.apache.hadoop.hive.ql.metadata.HiveException:". (SparkProcessBuilderSuite.scala:77)
```
### How was this patch tested?
Pass GHA, and verified locally with Spark 4.0.0 RC3 by running tests 10 times with constant success.
### Was this patch authored or co-authored using generative AI tooling?
No
Closes #6998 from pan3793/spark-pb-ut.
Closes #6998
a4290b413 [Cheng Pan] harness SparkProcessBuilderSuite
Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### Why are the changes needed?
We met the following issue with Spark on YARN:
```
spark.yarn.submit.waitAppCompletion=false
kyuubi.engine.yarn.submit.timeout=PT10M
```
Due to a network issue, the application submission was very slow.
It was submitted after 15 minutes.
<img width="1430" alt="image" src="https://github.com/user-attachments/assets/a326c3d1-4d39-42da-b6aa-cad5f8e7fc4b" />
<img width="1350" alt="image" src="https://github.com/user-attachments/assets/8e20056a-bd71-4515-a5e3-f881509a34b2" />
Then the batch failed, moving from the PENDING state to the ERROR state directly, because the application state was NOT_FOUND (it exceeded `kyuubi.engine.yarn.submit.timeout`).
a54ee39ab3/kyuubi-server/src/main/scala/org/apache/kyuubi/engine/ApplicationOperation.scala (L99-L106)
<img width="1727" alt="image" src="https://github.com/user-attachments/assets/20a2987c-675c-4136-a107-001f30b1b217" />
Here is the operation event:
<img width="1727" alt="image" src="https://github.com/user-attachments/assets/e2bab9c3-a959-4e2b-a207-813ae6489b30" />
But from the batch log, the current application status should be `PENDING`.
```
2025-03-21 17:36:19.350 INFO [KyuubiSessionManager-exec-pool: Thread-176922] org.apache.kyuubi.operation.BatchJobSubmission: Batch report for bbba09c8-3704-4a87-8394-9bcbbd39cc34, Some(ApplicationInfo(application_1741747369441_2258235,6042072c-e8fa-425d-a6a3-3d5bbb4ec1e3-275732_6042072c-e8fa-425d-a6a3-3d5bbb4ec1e3-275732.e3a34b86-7fc7-43ea-b4a5-1b6f27df54b5.0_20250322002147.stm,PENDING,Some(https://apollo-rno-rm-2.vip.hadoop.ebay.com:50030/proxy/application_1741747369441_2258235/),Some()))
```
So we should retrieve the batch application info after the submission process terminates, before checking whether the application failed, so that we get the current application information and prevent this corner case:
1. The application submission time exceeds `kyuubi.engine.yarn.submit.timeout` and the app state is NOT_FOUND.
2. The application report cannot be obtained before the submission process terminates.
3. The batch state then goes from PENDING to ERROR directly.
Conclusion:
The application state transition was:
UNKNOWN (before submit timeout) -> NOT_FOUND (reached submit timeout) -> processExit -> batchOpError -> PENDING (updateApplicationInfoMetadataIfNeeded) -> UNKNOWN (batchError but app not terminated)
After this PR, it should be:
UNKNOWN (before submit timeout) -> NOT_FOUND (reached submit timeout) -> processExit -> PENDING (after process terminated) -> ....
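The re-check described above can be sketched as follows (a simplified decision function; the real batch submission flow in Kyuubi involves more states and metadata updates):

```python
TERMINAL_APP_STATES = {"FINISHED", "FAILED", "KILLED"}

def final_batch_state(process_exit_code, fetch_app_info):
    """After the submitter process exits, refresh the application info
    before concluding the batch failed, so a slow submission that is now
    PENDING/RUNNING keeps being monitored."""
    info = fetch_app_info()  # re-check AFTER the process terminated
    if info["state"] in TERMINAL_APP_STATES:
        return info["state"]
    if info["state"] == "NOT_FOUND" and process_exit_code != 0:
        return "ERROR"       # genuinely never submitted
    return info["state"]     # e.g. PENDING/RUNNING: keep monitoring
```

A submission that was slow but eventually landed (app now `PENDING`) no longer jumps straight to `ERROR` just because the earlier timed-out probe said `NOT_FOUND`.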
### How was this patch tested?
Existing GA.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #6997 from turboFei/app_not_found_v2.
Closes #6997
370cf49e9 [Wang, Fei] v2
912ec28ca [Wang, Fei] nit
3c376f922 [Wang, Fei] log the op ex
d9cbdb87d [Wang, Fei] fix app not found
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
# 🔍 Description
## Issue References 🔗
As title.
Fix an NPE: `cleanupTerminatedAppInfoTrigger` can be set to `null`.
d3520ddbce/kyuubi-server/src/main/scala/org/apache/kyuubi/engine/KubernetesApplicationOperation.scala (L269)
Also shut down the ExecutorService when KubernetesApplicationOperation is stopped.
## Describe Your Solution 🔧
Shut down the thread executor service and add a null check.
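The shape of the fix can be sketched like this (a toy stand-in for the trigger and pool; names mirror the description, not the actual Scala fields):

```python
import concurrent.futures

class KubernetesAppOperation:
    """Guard the cleanup trigger against None and release the thread
    pool when the operation is stopped."""

    def __init__(self):
        self._pool = concurrent.futures.ThreadPoolExecutor(max_workers=2)
        self._cleanup_trigger = object()  # stand-in for the real trigger

    def stop(self):
        trigger = self._cleanup_trigger
        if trigger is not None:           # null check before any use
            self._cleanup_trigger = None  # mark as torn down
        self._pool.shutdown(wait=False)   # stop the executor service too
```

After `stop()`, later callers observe `None` and skip the cleanup path instead of dereferencing a torn-down trigger.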
## Types of changes 🔖
- [x] Bugfix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
## Test Plan 🧪
#### Behavior Without This Pull Request ⚰️
#### Behavior With This Pull Request 🎉
#### Related Unit Tests
---
# Checklist 📝
- [x] This patch was not authored or co-authored using [Generative Tooling](https://www.apache.org/legal/generative-tooling.html)
**Be nice. Be informative.**
Closes #6785 from turboFei/npe_k8s.
Closes #6785
6afd052e6 [Wang, Fei] comments
f0c3e3134 [Wang, Fei] prevent npe
9dffe0125 [Wang, Fei] shutdown
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
[[KYUUBI #6984] Fix ValueError when rendering MapType data](https://github.com/apache/kyuubi/issues/6984)
### Why are the changes needed?
The issue was caused by an incorrect iteration of MapType data in the `%table` magic command. When iterating over a `MapType` column, the code used `for k, v in m` directly, which leads to a `ValueError` because raw `Map` entries are not properly unpacked.
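The failure mode can be reproduced in plain Python, since the rendered map arrives as a dict: iterating a dict yields only its keys, so `k, v` tries to unpack each key string instead of each entry:

```python
m = {"a": "1", "b": "2"}

# Buggy pattern: `for k, v in m` iterates keys, so Python attempts
# `k, v = "a"`, which raises ValueError for keys not of length 2
# (and silently splits two-character keys into characters).
try:
    rows = [(k, v) for k, v in m]
except ValueError:
    rows = None  # "not enough values to unpack (expected 2, got 1)"

# Fixed pattern: iterate over the entries explicitly.
fixed = [(k, v) for k, v in m.items()]
```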
### How was this patch tested?
- [x] Manual testing:
Executed a query with a `MapType` column and confirmed that the `%table` command now renders it without errors.
```python
from pyspark.sql import SparkSession
from pyspark.sql.types import MapType, StringType, IntegerType
spark = SparkSession.builder \
.appName("MapFieldExample") \
.getOrCreate()
data = [
(1, {"a": "1", "b": "2"}),
(2, {"x": "10"}),
(3, {"key": "value"})
]
schema = "id INT, map_col MAP<STRING, STRING>"
df = spark.createDataFrame(data, schema=schema)
df.printSchema()
df2 = df.collect()
```
using `%table` render table
```python
%table df2
```
result
```python
{'application/vnd.livy.table.v1+json': {'headers': [{'name': 'id', 'type': 'INT_TYPE'}, {'name': 'map_col', 'type': 'MAP_TYPE'}], 'data': [[1, {'a': '1', 'b': '2'}], [2, {'x': '10'}], [3, {'key': 'value'}]]}}
```
### Was this patch authored or co-authored using generative AI tooling?
No
**Notice:** this PR was co-authored by DeepSeek-R1.
Closes#6985 from JustFeng/patch-1.
Closes#6984
e0911ba94 [Reese Feng] Update PySparkTests for magic cmd
bc3ce1a49 [Reese Feng] Update PySparkTests for magic cmd
200d7ad9b [Reese Feng] Fix syntax error in dict iteration in magic_table_convert_map
Authored-by: Reese Feng <10377945+JustFeng@users.noreply.github.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
Bumps [cross-spawn](https://github.com/moxystudio/node-cross-spawn) from 7.0.3 to 7.0.6.
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a href="https://github.com/moxystudio/node-cross-spawn/blob/master/CHANGELOG.md">cross-spawn's changelog</a>.</em></p>
<blockquote>
<h3><a href="https://github.com/moxystudio/node-cross-spawn/compare/v7.0.5...v7.0.6">7.0.6</a> (2024-11-18)</h3>
<h3>Bug Fixes</h3>
<ul>
<li>update cross-spawn version to 7.0.5 in package-lock.json (<a href="f700743918">f700743</a>)</li>
</ul>
<h3><a href="https://github.com/moxystudio/node-cross-spawn/compare/v7.0.4...v7.0.5">7.0.5</a> (2024-11-07)</h3>
<h3>Bug Fixes</h3>
<ul>
<li>fix escaping bug introduced by backtracking (<a href="640d391fde">640d391</a>)</li>
</ul>
<h3><a href="https://github.com/moxystudio/node-cross-spawn/compare/v7.0.3...v7.0.4">7.0.4</a> (2024-11-07)</h3>
<h3>Bug Fixes</h3>
<ul>
<li>disable regexp backtracking (<a href="https://redirect.github.com/moxystudio/node-cross-spawn/issues/160">#160</a>) (<a href="5ff3a07d9a">5ff3a07</a>)</li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a href="77cd97f3ca"><code>77cd97f</code></a> chore(release): 7.0.6</li>
<li><a href="6717de49ff"><code>6717de4</code></a> chore: upgrade standard-version</li>
<li><a href="f700743918"><code>f700743</code></a> fix: update cross-spawn version to 7.0.5 in package-lock.json</li>
<li><a href="9a7e3b2165"><code>9a7e3b2</code></a> chore: fix build status badge</li>
<li><a href="085268352d"><code>0852683</code></a> chore(release): 7.0.5</li>
<li><a href="640d391fde"><code>640d391</code></a> fix: fix escaping bug introduced by backtracking</li>
<li><a href="bff0c87c8b"><code>bff0c87</code></a> chore: remove codecov</li>
<li><a href="a7c6abc6fe"><code>a7c6abc</code></a> chore: replace travis with github workflows</li>
<li><a href="9b9246e096"><code>9b9246e</code></a> chore(release): 7.0.4</li>
<li><a href="5ff3a07d9a"><code>5ff3a07</code></a> fix: disable regexp backtracking (<a href="https://redirect.github.com/moxystudio/node-cross-spawn/issues/160">#160</a>)</li>
<li>Additional commits viewable in <a href="https://github.com/moxystudio/node-cross-spawn/compare/v7.0.3...v7.0.6">compare view</a></li>
</ul>
</details>
<br />
[](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `dependabot rebase`.
[//]: # (dependabot-automerge-start)
Dependabot will merge this PR once CI passes on it, as requested by yaooqinn.
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `dependabot rebase` will rebase this PR
- `dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `dependabot merge` will merge this PR after your CI passes on it
- `dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `dependabot cancel merge` will cancel a previously requested merge and block automerging
- `dependabot reopen` will reopen this PR if it is closed
- `dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `dependabot show <dependency name> ignore conditions` will show all of the ignore conditions of the specified dependency
- `dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/apache/kyuubi/network/alerts).
</details>
Closes#6814 from dependabot[bot]/dependabot/npm_and_yarn/kyuubi-server/web-ui/cross-spawn-7.0.6.
Closes#6814
10dafbc6e [dependabot[bot]] ⬆️ Bump cross-spawn from 7.0.3 to 7.0.6 in /kyuubi-server/web-ui
Authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### Why are the changes needed?
Vanilla Spark supports neither a rolling nor an expiration mechanism for `spark.kubernetes.file.upload.path`; if you use a file system that does not support TTL, e.g. HDFS, additional cleanup mechanisms are needed to prevent the files in this directory from growing indefinitely.
This PR proposes to let `spark.kubernetes.file.upload.path` support placeholders `{{YEAR}}`, `{{MONTH}}` and `{{DAY}}` and introduce a switch `kyuubi.kubernetes.spark.autoCreateFileUploadPath.enabled` to let Kyuubi server create the directory with 777 permission automatically before submitting Spark application.
For example, the user can set the configurations below in `kyuubi-defaults.conf` to enable monthly rolling for `spark.kubernetes.file.upload.path`:
```
kyuubi.kubernetes.spark.autoCreateFileUploadPath.enabled=true
spark.kubernetes.file.upload.path=hdfs://hadoop-cluster/spark-upload-{{YEAR}}{{MONTH}}
```
Note that Spark creates a sub-directory `s"spark-upload-${UUID.randomUUID()}"` under `spark.kubernetes.file.upload.path` for each upload, so the administrator still needs to clean up the staging directories periodically.
For example:
```
hdfs://hadoop-cluster/spark-upload-202412/spark-upload-f2b71340-dc1d-4940-89e2-c5fc31614eb4
hdfs://hadoop-cluster/spark-upload-202412/spark-upload-173a8653-4d3e-48c0-b8ab-b7f92ae582d6
hdfs://hadoop-cluster/spark-upload-202501/spark-upload-3b22710f-a4a0-40bb-a3a8-16e481038a63
```
The administrator can safely delete `hdfs://hadoop-cluster/spark-upload-202412` after 2025-01-01.
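The placeholder substitution can be sketched as below (the real implementation is Scala inside the Kyuubi server; the function name and the use of UTC here are illustrative assumptions):

```python
from datetime import datetime, timezone

def render_upload_path(template, now=None):
    """Hypothetical sketch of {{YEAR}}/{{MONTH}}/{{DAY}} substitution
    for spark.kubernetes.file.upload.path; zero-padded like a date."""
    now = now or datetime.now(timezone.utc)
    return (template
            .replace("{{YEAR}}", "%04d" % now.year)
            .replace("{{MONTH}}", "%02d" % now.month)
            .replace("{{DAY}}", "%02d" % now.day))

path = render_upload_path(
    "hdfs://hadoop-cluster/spark-upload-{{YEAR}}{{MONTH}}",
    now=datetime(2024, 12, 15))
# December 2024 renders the monthly-rolling directory name
```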
### How was this patch tested?
New UTs are added.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes#6876 from pan3793/rolling-upload.
Closes#6876
6614bf29c [Cheng Pan] comment
5d5cb3eb3 [Cheng Pan] docs
343adaefb [Cheng Pan] review
3eade8bc4 [Cheng Pan] fix
706989778 [Cheng Pan] docs
38953dc3f [Cheng Pan] Support rolling spark.kubernetes.file.upload.path
Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### Why are the changes needed?
Address comments: https://github.com/apache/kyuubi/discussions/6877#discussioncomment-11743818
> I guess this is a Kyuubi implementation issue, we just read the content from the kyuubi.kubernetes.authenticate.oauthTokenFile and call ConfigBuilder.withOauthToken, I guess this approach does not support token refresh...
### How was this patch tested?
Existing GA.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes#6883 from turboFei/k8s_token_provider.
Closes#6883
69dd28d27 [Wang, Fei] comments
a01040f94 [Wang, Fei] withOauthTokenProvider
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### Why are the changes needed?
see https://github.com/apache/kyuubi/issues/6843
If the session manager's ThreadPoolExecutor refuses to execute asyncOperation, then we need to shut down the query-timeout-thread in the catch block.
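The shape of the fix can be sketched in Python (the real code is Scala; `run_async` and the timer callback are illustrative stand-ins, not Kyuubi's actual API):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def run_async(pool, operation, timeout_s, results):
    """Start the per-query timeout timer, then submit the operation;
    if the pool rejects the task, cancel the timer so it cannot leak."""
    timer = threading.Timer(timeout_s, lambda: results.append("timed out"))
    timer.daemon = True
    timer.start()
    try:
        pool.submit(operation)
    except RuntimeError:
        # ThreadPoolExecutor raises RuntimeError after shutdown; without
        # this cancel, every rejected query would leak one timer thread.
        timer.cancel()
        raise
```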
### How was this patch tested?
1. Use `jstack` to view threads on the long-lived engine side.

2. Wait for all SQL statements in the engine to finish executing, then use `jstack` to check the number of query-timeout-thread threads, which should be zero.

### Was this patch authored or co-authored using generative AI tooling?
NO
Closes#6844 from ASiegeLion/master.
Closes#6843
9107a300e [liupeiyue] [KYUUBI #6843] FIX 'query-timeout-thread' thread leak
4b3417f21 [liupeiyue] [KYUUBI #6843] FIX 'query-timeout-thread' thread leak
ef1f66bb5 [liupeiyue] [KYUUBI #6843] FIX 'query-timeout-thread' thread leak
9e1a015f6 [liupeiyue] [KYUUBI #6843] FIX 'query-timeout-thread' thread leak
78a9fde09 [liupeiyue] [KYUUBI #6843] FIX 'query-timeout-thread' thread leak
Authored-by: liupeiyue <liupeiyue@yy.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### Why are the changes needed?
Followup for https://github.com/apache/kyuubi/pull/6866
It would throw an exception if both Thrift binary SSL and Thrift HTTP SSL are enabled.
### How was this patch tested?
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes#6872 from turboFei/duplicate_gauge.
Closes#6866
ea356766e [Wang, Fei] prevent conflicts
982f175fd [Wang, Fei] conflicts
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
### Why are the changes needed?
Add metrics for SSL keystore expiration, so that we can set up an alert if the keystore will expire within one month.
### How was this patch tested?
Integration testing.
<img width="1721" alt="image" src="https://github.com/user-attachments/assets/f4ef6af6-923b-403c-a80d-06dbb80dbe1c" />
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes#6866 from turboFei/keystore_expire.
Closes#6866
77c6db0a7 [Wang, Fei] Add metrics for SSL keystore expiration time #6866
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
### Why are the changes needed?
1. Add the metric `kyuubi.operartion.batch_pending_max_elapse` for the max batch pending elapsed time, which is helpful for batch health monitoring; we can send an alert if a batch's pending time grows too long.
2. For the `GET /api/v1/batches` API, limit the max time window for listing batches. This helps when we want to retain more metadata on the Kyuubi server side (for example, 90 days) but only allow users to search the last 7 days. It is optional. Also, if `create_time` is specified, order by `create_time` instead of `key_id`.
68a6f48da5/kyuubi-server/src/main/resources/sql/mysql/metadata-store-schema-1.8.0.mysql.sql (L32)
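The gauge semantics can be sketched as the age of the oldest still-pending batch (an assumption based on the description; the real metric is computed in the Kyuubi server's Scala code):

```python
import time

def batch_pending_max_elapse(pending_create_times_ms, now_ms=None):
    """Illustrative gauge: milliseconds since the oldest pending batch
    was created; 0 when nothing is pending."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    if not pending_create_times_ms:
        return 0
    # The oldest create time yields the maximum pending elapse.
    return now_ms - min(pending_create_times_ms)

# Two pending batches created 30s and 5s before "now".
elapse = batch_pending_max_elapse([75_000, 100_000], now_ms=105_000)
```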
### How was this patch tested?
GA.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes#6829 from turboFei/batch_pending_time.
Closes#6829
ee4f93125 [Wang, Fei] docs
bf8169ad4 [Wang, Fei] comments
f493a2af8 [Wang, Fei] new config
ab7b6db65 [Wang, Fei] ut
168017587 [Wang, Fei] in memory session
510a30b6a [Wang, Fei] batchSearchWindow opt
1e93dd276 [Wang, Fei] save
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
# 🔍 Description
## Issue References 🔗
This issue was noticed a few times when the batch `state` was set to `ERROR`, but the `appState` kept a non-terminal state forever (e.g. `RUNNING`), even though the application had finished (in this case, a YARN application).
```json
{
"id": "********",
"user": "****",
"batchType": "SPARK",
"name": "*********",
"appStartTime": 0,
"appId": "********",
"appUrl": "********",
"appState": "RUNNING",
"appDiagnostic": "",
"kyuubiInstance": "*********",
"state": "ERROR",
"createTime": 1725343207318,
"endTime": 1725343300986,
"batchInfo": {}
}
```
It seems that this happens when there is some intermittent failure during the monitoring step and the batch ends with `ERROR`, leaving the application metadata without an update. This can lead to the misinterpretation that the application is still running. We need to set the state to `UNKNOWN` to avoid such errors.
## Describe Your Solution 🔧
This is a simple fix: during the batch metadata update, if the batch state is `ERROR` and the `appState` is not a terminal state, change the `appState` to `UNKNOWN`.
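The check reduces to a small state-normalization step; a Python sketch follows (the state names are illustrative, not the exact Kyuubi `ApplicationState` enum):

```python
# Assumed terminal application states for illustration.
TERMINAL_APP_STATES = {"FINISHED", "FAILED", "KILLED", "NOT_FOUND"}

def normalize_app_state(batch_state, app_state):
    """When a batch errors out while the tracked application state is
    still non-terminal, report UNKNOWN instead of a stale RUNNING."""
    if batch_state == "ERROR" and app_state not in TERMINAL_APP_STATES:
        return "UNKNOWN"
    return app_state
```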
## Types of changes 🔖
- [x] Bugfix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
## Test Plan 🧪
#### Behavior Without This Pull Request ⚰️
If there is some error between Kyuubi and the application request (e.g. the YARN client), the batch finishes with the `ERROR` state and the application keeps its last known state (e.g. `RUNNING`).
#### Behavior With This Pull Request 🎉
If there is some error between Kyuubi and the application request (e.g. the YARN client), the batch finishes with the `ERROR` state; if the application is in a non-terminal state, it is forced to the `UNKNOWN` state.
#### Related Unit Tests
I tried to implement a unit test to replicate this behavior but couldn't manage it. We would need to force an exception in the engine request (e.g. `YarnClient.getApplication`), but only after the application reaches the `RUNNING` state, or perhaps block the connection between Kyuubi and the engine.
---
# Checklist 📝
- [ ] This patch was not authored or co-authored using [Generative Tooling](https://www.apache.org/legal/generative-tooling.html)
**Be nice. Be informative.**
Closes#6722 from joaopamaral/fix/app-state-on-batch-error.
Closes#6722
8409eacac [Wang, Fei] fix
da8c356a7 [Joao Amaral] format fix
73b77b3f7 [Joao Amaral] use isTerminated
64f96a256 [Joao Amaral] Remove test
1eb80ef73 [Joao Amaral] Remove test
13498fa6b [Joao Amaral] Remove test
60ce55ef3 [Joao Amaral] add todo
3a3ba162b [Joao Amaral] Fix
215ac665f [Joao Amaral] Fix AppState when Engine connection is terminated
Lead-authored-by: Joao Amaral <7281460+joaopamaral@users.noreply.github.com>
Co-authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
# 🔍 Description
## Issue References 🔗
This pull request fixes#2112
## Describe Your Solution 🔧
Similar to #2113, the query-timeout-thread should verify the Thrift protocol version. For protocol versions <= HIVE_CLI_SERVICE_PROTOCOL_V8, it should convert TIMEDOUT_STATE to CANCELED.
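The version check can be sketched as below (Python stand-in for the Scala checker; the numeric encoding of the protocol version is an assumption for illustration):

```python
# Illustrative protocol version constant; older HiveServer2 clients
# (<= V8) predate the TIMEDOUT operation state.
HIVE_CLI_SERVICE_PROTOCOL_V8 = 8

def timeout_state_for(protocol_version):
    """Downgrade TIMEDOUT to CANCELED for clients that cannot
    understand TIMEDOUT_STATE, as the query-timeout checker should."""
    if protocol_version <= HIVE_CLI_SERVICE_PROTOCOL_V8:
        return "CANCELED"
    return "TIMEDOUT"
```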
## Types of changes 🔖
- [ ] Bugfix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
## Test Plan 🧪
#### Behavior Without This Pull Request ⚰️
#### Behavior With This Pull Request 🎉
#### Related Unit Tests
---
# Checklist 📝
- [ ] This patch was not authored or co-authored using [Generative Tooling](https://www.apache.org/legal/generative-tooling.html)
**Be nice. Be informative.**
Closes#6787 from lsm1/branch-timer-checker-set-cancel.
Closes#6787
9fbe1ac97 [senmiaoliu] add isHive21OrLower method
0c77c6f6f [senmiaoliu] time checker set cancel state
Authored-by: senmiaoliu <senmiaoliu@trip.com>
Signed-off-by: senmiaoliu <senmiaoliu@trip.com>
# 🔍 Description
## Issue References 🔗
This pull request fixes #
## Describe Your Solution 🔧
Preparing v1.11.0-SNAPSHOT after branch-1.10 cut
```shell
build/mvn versions:set -DgenerateBackupPoms=false -DnewVersion="1.11.0-SNAPSHOT"
(cd kyuubi-server/web-ui && npm version "1.11.0-SNAPSHOT")
```
## Types of changes 🔖
- [ ] Bugfix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
## Test Plan 🧪
#### Behavior Without This Pull Request ⚰️
#### Behavior With This Pull Request 🎉
#### Related Unit Tests
---
# Checklist 📝
- [ ] This patch was not authored or co-authored using [Generative Tooling](https://www.apache.org/legal/generative-tooling.html)
**Be nice. Be informative.**
Closes#6769 from bowenliang123/bump-1.11.
Closes#6769
6db219d28 [Bowen Liang] get latest_branch by sorting version in branch name
465276204 [Bowen Liang] update package.json
81f2865e5 [Bowen Liang] bump
Authored-by: Bowen Liang <liangbowen@gf.com.cn>
Signed-off-by: Bowen Liang <liangbowen@gf.com.cn>
# 🔍 Description
## Issue References 🔗
## Describe Your Solution 🔧
This PR addresses an issue in the ProcessBuilder class where Java options passed as a single string (e.g. `-Dxxx -Dxxx`) do not take effect. The command list must contain these options as individual elements so they are recognized correctly by the Java runtime.
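The core of the problem is tokenization: a command list must carry each option as its own argv element. A Python sketch using `shlex.split` (which also respects quoting, unlike a naive `str.split`):

```python
import shlex

# A single string of JVM options must become separate argv elements,
# otherwise the runtime sees one bogus argument. shlex.split keeps
# quoted substrings (here a path with a space) together.
java_opts = '-Dfoo=bar -Dlog.dir="/tmp/my logs" -Xmx2g'
argv_tail = shlex.split(java_opts)
```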
## Types of changes 🔖
- [ ] Bugfix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
## Test Plan 🧪
#### Behavior Without This Pull Request ⚰️
#### Behavior With This Pull Request 🎉
#### Related Unit Tests
---
# Checklist 📝
- [x] This patch was not authored or co-authored using [Generative Tooling](https://www.apache.org/legal/generative-tooling.html)
**Be nice. Be informative.**
Closes#6772 from lsm1/branch-fix-processBuilder.
Closes#6772
fb6d53234 [senmiaoliu] fix process builder java opts
Authored-by: senmiaoliu <senmiaoliu@trip.com>
Signed-off-by: Bowen Liang <liangbowen@gf.com.cn>
# 🔍 Description
## Issue References 🔗
This pull request fixes #
## Describe Your Solution 🔧
Check the uploaded resource files when creating batch via REST API
- add config `kyuubi.batch.resource.file.max.size` for resource file's max size in bytes
- add config `kyuubi.batch.extra.resource.file.max.size` for each extra resource file's max size in bytes
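A sketch of the validation (the config names come from this PR; the function shape and the default limits in bytes are illustrative assumptions):

```python
def check_batch_resources(resource_size, extra_sizes,
                          max_resource=104_857_600,   # illustrative default
                          max_extra=10_485_760):      # illustrative default
    """Reject oversized uploads when creating a batch via the REST API:
    the main resource is checked against kyuubi.batch.resource.file.max.size,
    each extra file against kyuubi.batch.extra.resource.file.max.size."""
    if resource_size > max_resource:
        raise ValueError(
            "resource file exceeds kyuubi.batch.resource.file.max.size")
    for size in extra_sizes:
        if size > max_extra:
            raise ValueError(
                "extra resource file exceeds kyuubi.batch.extra.resource.file.max.size")
```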
## Types of changes 🔖
- [ ] Bugfix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
## Test Plan 🧪
#### Behavior Without This Pull Request ⚰️
#### Behavior With This Pull Request 🎉
#### Related Unit Tests
---
# Checklist 📝
- [ ] This patch was not authored or co-authored using [Generative Tooling](https://www.apache.org/legal/generative-tooling.html)
**Be nice. Be informative.**
Closes#6756 from bowenliang123/resource-maxsize.
Closes#6756
5c409c425 [Bowen Liang] nit
4b16bcfc4 [Bowen Liang] nit
743920d25 [Bowen Liang] check resource file size max size
Authored-by: Bowen Liang <liangbowen@gf.com.cn>
Signed-off-by: Bowen Liang <liangbowen@gf.com.cn>
# 🔍 Description
## Issue References 🔗
Allow delegation tokens to be used and renewed by the YARN ResourceManager (used in the proxy-user mode of the Flink engine; addresses https://github.com/apache/kyuubi/pull/6383#discussion_r1635768060).
## Describe Your Solution 🔧
Set hadoop fs delegation token renewer to empty.
## Types of changes 🔖
- [X] Bugfix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
## Test Plan 🧪
#### Behavior Without This Pull Request ⚰️
#### Behavior With This Pull Request 🎉
#### Related Unit Tests
---
# Checklist 📝
- [X] This patch was not authored or co-authored using [Generative Tooling](https://www.apache.org/legal/generative-tooling.html)
**Be nice. Be informative.**
Closes#6753 from wForget/renewer.
Closes#6753
f2e1f0aa1 [wforget] Set hadoop fs delegation token renewer to empty
Authored-by: wforget <643348094@qq.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
# 🔍 Description
## Issue References 🔗
It seems NotAllowedException is intended for HTTP 405 Method Not Allowed, and we currently use the wrong constructor, so the error message we expect is not returned to the client.
It only returns:
```
{"message":"HTTP 405 Method Not Allowed"}
```
This is because the message we used to build the NotAllowedException was treated as the `allowed` methods argument, not as the `message`.

## Describe Your Solution 🔧
We should use ForbiddenException instead, so the error message we expect is visible on the client side.
85dd5a52ef/kyuubi-server/src/main/scala/org/apache/kyuubi/server/api/api.scala (L47-L51)
## Types of changes 🔖
- [ ] Bugfix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
## Test Plan 🧪
#### Behavior Without This Pull Request ⚰️
#### Behavior With This Pull Request 🎉
#### Related Unit Tests
<img width="913" alt="image" src="https://github.com/user-attachments/assets/6c4e836d-a47a-485d-85a3-fd3a35a9e425">
---
# Checklist 📝
- [x] This patch was not authored or co-authored using [Generative Tooling](https://www.apache.org/legal/generative-tooling.html)
**Be nice. Be informative.**
Closes#6750 from turboFei/not_allowed_exception.
Closes#6750
4dd6fc18c [Wang, Fei] Using ForbiddenException instead of NotAllowedException
Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Bowen Liang <liangbowen@gf.com.cn>
# 🔍 Description
## Issue References 🔗
This pull request fixes #
## Describe Your Solution 🔧
- check that all required extra resource files are uploaded in the POST multipart request as expected, when creating a batch with the REST Batch API
## Types of changes 🔖
- [ ] Bugfix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
## Test Plan 🧪
#### Behavior Without This Pull Request ⚰️
#### Behavior With This Pull Request 🎉
#### Related Unit Tests
---
# Checklist 📝
- [ ] This patch was not authored or co-authored using [Generative Tooling](https://www.apache.org/legal/generative-tooling.html)
**Be nice. Be informative.**
Closes#6731 from bowenliang123/extra-resource-check.
Closes#6731
116a47ea5 [Bowen Liang] update
cd4433a8c [Bowen Liang] update
4852b1569 [Bowen Liang] update
5bb2955e8 [Bowen Liang] update
1696e7328 [Bowen Liang] update
911a9c195 [Bowen Liang] update
042e42d23 [Bowen Liang] update
56dc7fb8a [Bowen Liang] update
Authored-by: Bowen Liang <liangbowen@gf.com.cn>
Signed-off-by: Bowen Liang <liangbowen@gf.com.cn>
# 🔍 Description
## Issue References 🔗
This pull request fixes #
## Describe Your Solution 🔧
- to fix CVE-2024-45812 and CVE-2024-45811, reported by Dependabot security alerts
## Types of changes 🔖
- [ ] Bugfix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
## Test Plan 🧪
#### Behavior Without This Pull Request ⚰️
#### Behavior With This Pull Request 🎉
#### Related Unit Tests
---
# Checklist 📝
- [ ] This patch was not authored or co-authored using [Generative Tooling](https://www.apache.org/legal/generative-tooling.html)
**Be nice. Be informative.**
Closes#6744 from bowenliang123/vite-4.5.4.
Closes#6744
271db1f5c [Bowen Liang] update
Authored-by: Bowen Liang <liangbowen@gf.com.cn>
Signed-off-by: Bowen Liang <liangbowen@gf.com.cn>
# 🔍 Description
## Issue References 🔗
This pull request fixes https://github.com/apache/kyuubi/issues/6704
## Describe Your Solution 🔧
If periodic GC is set to 0, there is no need to perform an explicit GC.
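The behavior can be sketched as follows (the real code is Scala with a lazy thread pool; the function and scheduling mechanism here are illustrative):

```python
import gc
import threading

def start_periodic_gc(interval_s):
    """When the interval is 0, return None and never create a GC thread;
    otherwise schedule a repeating explicit gc.collect()."""
    if interval_s <= 0:
        return None  # periodic GC disabled: no thread is created at all

    def tick():
        gc.collect()
        schedule()  # re-arm for the next interval

    def schedule():
        t = threading.Timer(interval_s, tick)
        t.daemon = True
        t.start()
        return t

    return schedule()
```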
## Types of changes 🔖
- [ ] Bugfix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [x] Breaking change (fix or feature that would cause existing functionality to change)
## Test Plan 🧪
#### Behavior Without This Pull Request ⚰️
#### Behavior With This Pull Request 🎉
#### Related Unit Tests
---
# Checklist 📝
- [ ] This patch was not authored or co-authored using [Generative Tooling](https://www.apache.org/legal/generative-tooling.html)
**Be nice. Be informative.**
Closes#6725 from taylor12805/master.
Closes#6704
a52ddda62 [Bowen Liang] update doc
b84a32f35 [Bowen Liang] make periodic gc thead pool lazy
2d4bd7c05 [Bowen Liang] update doc in spark style
3e04604b0 [taylor.fan] [KYUUBI #6704] disable periodic gc if set interval to 0
bf20b134b [taylor.fan] [KYUUBI #6704] disable periodic gc if set interval to 0
c2b7c3078 [taylor.fan] [KYUUBI #6704] disable periodic gc if set interval to 0
6182075fc [taylor.fan] [KYUUBI #6704] disable periodic gc if set interval to 0
52b1c078b [taylor.fan] [KYUUBI #6704] disable periodic gc if set interval to 0
ccf19cf24 [taylor.fan] [KYUUBI #6704] disable periodic gc if set interval to 0
affd67c88 [taylor.fan] [KYUUBI #6704] disable periodic gc if set interval to 0
d4ee164d1 [taylor.fan] disable periodic gc if set interval to 0
Lead-authored-by: taylor.fan <taylor.fan@vipshop.com>
Co-authored-by: Bowen Liang <liangbowen@gf.com.cn>
Signed-off-by: Cheng Pan <chengpan@apache.org>
# 🔍 Description
## Issue References 🔗
This pull request fixes#6720
## Describe Your Solution 🔧
If the pod goes into the OOMKilled state, the application should be marked as KILLED, which is eventually identified as failed via `isFailed`.
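The mapping can be sketched like this (the state names and the exact pod fields inspected are illustrative, not the precise Kyuubi `ApplicationState` logic):

```python
def app_state_from_pod(pod_phase, terminated_reason=None):
    """Map a pod's phase (and its container's termination reason) to an
    application state; OOMKilled must surface as KILLED."""
    if terminated_reason == "OOMKilled":
        return "KILLED"
    if pod_phase == "Succeeded":
        return "FINISHED"
    if pod_phase == "Failed":
        return "FAILED"
    return "RUNNING"

# KILLED belongs to the failed terminal states, so isFailed picks it up.
FAILED_STATES = {"FAILED", "KILLED"}

def is_failed(state):
    return state in FAILED_STATES
```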
## Types of changes 🔖
- [ ] Bugfix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
## Test Plan 🧪
Tested locally, was able to launch new session
<img width="922" alt="kyuubi_new_session" src="https://github.com/user-attachments/assets/b003c86f-484d-40c5-b173-847374a45b1d">
---
**Be nice. Be informative.**
Closes#6721 from Madhukar525722/OOM.
Closes#6720
cd0bdf633 [madlnu] [KYUUBI #6720] K8s pod OOM Killed should be identified as Application failed state
Authored-by: madlnu <madlnu@visa.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>