<!--
Thanks for sending a pull request!
Here are some tips for you:
1. If this is your first time, please read our contributor guidelines: https://kyuubi.readthedocs.io/en/latest/community/contributions.html
2. If the PR is related to an issue in https://github.com/apache/incubator-kyuubi/issues, add '[KYUUBI #XXXX]' in your PR title, e.g., '[KYUUBI #XXXX] Your PR title ...'.
3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][KYUUBI #XXXX] Your PR title ...'.
-->
### _Why are the changes needed?_
<!--
Please clarify why the changes are needed. For instance,
1. If you add a feature, you can talk about the use case of it.
2. If you fix a bug, you can clarify why it is a bug.
-->
For critical errors at engine side, it is not handled properly. For example,when engine oom
- server may not be able to get operation statuses because of no response of engine side. In this case, client only get a ambiguous `read timeout` as a final cause.
- the oom hook of engine side might directly crash the engine, when the server is still trying to get operation status
In this PR,
- a config for retry to make the operation status updating process more robust
- make the engine oom hook only de register itself to make it able to recover for some transient errors
### _How was this patch tested?_
- [ ] Add some test cases that check the changes thoroughly including negative and positive cases if possible
- [ ] Add screenshots for manual tests if appropriate
- [ ] [Run test](https://kyuubi.readthedocs.io/en/latest/develop_tools/testing.html#running-tests) locally before make a pull request
Closes#1312 from yaooqinn/1300.
Closes#1300
a715ecca [Kent Yao] add comments
a816b0f0 [Kent Yao] refine
3557c927 [Kent Yao] add comments
2ed8bfb4 [Kent Yao] refine
aefd2a7f [Kent Yao] update doc
f2ea7e4c [Kent Yao] restore SparkOperation
386e4eac [Kent Yao] [KYUUBI #1300] Detecting critical errors
Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>