[CELEBORN-1341][FOLLOWUP] Improve Celeborn document

### What changes were proposed in this pull request?

Improve the Celeborn documentation by fixing typos, formatting, invalid links, and default values that are out of sync with the code. Also bring the public interfaces described in `shuffleclient.md` in line with `ShuffleClient`.

### Why are the changes needed?

The Celeborn documentation currently contains typos, formatting problems, invalid links, and out-of-sync default values. In addition, the public interfaces described in `shuffleclient.md` are inconsistent with `ShuffleClient`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No.

Closes #2410 from SteNicholas/CELEBORN-1341.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-03-22 16:34:25 +08:00
commit 8fbcbead48
parent d62f75fdc7
8 changed files with 20 additions and 13 deletions
--- a/docker/DEPLOY_ON_K8S.md
+++ b/docker/DEPLOY_ON_K8S.md
@@ -46,7 +46,7 @@ For more information of Celeborn configurations, see [CONFIGURATIONS](../CONFIGU

 #### Install Celeborn
 ```
-helm install celeborn ${CELEBORN_HOME}/charts/celebron -n ${celeborn namespace}
+helm install celeborn ${CELEBORN_HOME}/charts/celeborn -n ${celeborn namespace}
 ```

 #### Connect to Celeborn in K8s pod
--- a/docs/README.md
+++ b/docs/README.md
@@ -20,7 +20,7 @@ license: |
 ---
 Quick Start
 ===
-This documentation gives a quick start guide for running Apache Spark/Flink/MapReduce with Apache Celeborn™(Incubating).
+This documentation gives a quick start guide for running Spark/Flink/MapReduce with Apache Celeborn™(Incubating).

 ### Download Celeborn
 Download the latest Celeborn binary from the [Downloading Page](https://celeborn.apache.org/download/).
@@ -126,11 +126,13 @@ cp $CELEBORN_HOME/flink/<Celeborn Client Jar> $FLINK_HOME/lib/
 ```
 #### Add Celeborn configuration to Flink's conf
 Set `shuffle-service-factory.class` to Celeborn's ShuffleServiceFactory in Flink configuration file:
+
 - Flink 1.14.x, Flink 1.15.x, Flink 1.17.x, Flink 1.18.x
 ```shell
 cd $FLINK_HOME
 vi conf/flink-conf.yaml
 ```
+
 - Flink 1.19.x
 ```shell
 cd $FLINK_HOME
--- a/docs/configuration/index.md
+++ b/docs/configuration/index.md
@@ -189,7 +189,7 @@ Example can refer to [Hadoop Rack Awareness](https://hadoop.apache.org/docs/stab

 `ShuffleClient` records the shuffle partition location's host, service port, and filename,
 to support workers recovering reading existing shuffle data after worker restart,
-during worker shutdown, workers should store the meta about reading shuffle partition files in LevelDB,
+during worker shutdown, workers should store the meta about reading shuffle partition files in RocksDB or LevelDB(deprecated),
 and restore the meta after restarting workers, also workers should keep a stable service port to support
 `ShuffleClient` retry reading data. Users should set `celeborn.worker.graceful.shutdown.enabled` to `true` and
 set below service port with stable port to support worker recover status.
--- a/docs/deploy_on_k8s.md
+++ b/docs/deploy_on_k8s.md
@@ -33,7 +33,7 @@ Celeborn currently supports rapid deployment by using helm.

 ### 1. Get Celeborn Binary Package

-You can find released version of Celeborn on https://celeborn.apache.org/download/.
+You can find released version of Celeborn on [Downloading Page](https://celeborn.apache.org/download/).

 Of course, you can build binary package from master branch or your own branch by using `./build/make-distribution.sh` in
 source code.
@@ -139,7 +139,7 @@ network infrastructure, this may cause pressure on DNS service or other network

 ### 6. Build Celeborn Client

-Here, without going into detail on how to configure spark/flink to find celeborn master/worker, mention the key
+Here, without going into detail on how to configure Spark/Flink/MapReduce to find celeborn master/worker, mention the key
 configuration:

 ```
@@ -149,5 +149,5 @@ spark.celeborn.master.endpoints: celeborn-master-0.celeborn-master-svc.<namespac
 You can find why config endpoints such way
 in [Kubernetes DNS for Service And Pods](https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/)

-> Notice: You should ensure that Spark/Flink can find the Celeborn Master/Worker via IP or the Kubernetes DNS mentioned
+> Notice: You should ensure that Spark/Flink/MapReduce can find the Celeborn Master/Worker via IP or the Kubernetes DNS mentioned
 > above
--- a/docs/developers/integrate.md
+++ b/docs/developers/integrate.md
@@ -20,8 +20,7 @@ license: |
 ## Overview
 The core components of Celeborn, i.e. `Master`, `Worker`, and `Client` are all engine irrelevant. Developers can
 integrate Celeborn with various engines or applications by using or extending Celeborn's `Client`, as the officially
-supported plugins for Apache Spark and Apache Flink, see [Spark Plugin](../../developers/spark) and 
-[Flink Plugin](../../developers/flink).
+supported plugins for Spark/Flink/MapReduce.

 This article briefly describes an example of integrating Celeborn into a simple distributed application using
 Celeborn `Client`.
--- a/docs/developers/overview.md
+++ b/docs/developers/overview.md
@@ -104,7 +104,7 @@ When graceful shutdown is turned on, upon shutdown, Celeborn will do the followi
 2. Worker will inform Clients to split.
 3. Client will send `CommitFiles` to the Worker.

-Then the Worker waits until all `PartitionLocation` flushes data to persistent storage, stores states in local leveldb/rocksdb,
+Then the Worker waits until all `PartitionLocation` flushes data to persistent storage, stores states in local RocksDB or LevelDB(deprecated),
 then stops itself. The process is typically within one minute.

 For more details, please refer to [Rolling upgrade](../../upgrading/#rolling-upgrade)
--- a/docs/developers/shuffleclient.md
+++ b/docs/developers/shuffleclient.md
@@ -124,19 +124,25 @@ to guarantee no data is lost.
 ```java
  public abstract CelebornInputStream readPartition(
      int shuffleId,
+      int appShuffleId,
      int partitionId,
      int attemptNumber,
      int startMapIndex,
-      int endMapIndex)
+      int endMapIndex,
+      ExceptionMaker exceptionMaker,
+      MetricsCallback metricsCallback)
 ```

-- `shuffleId` is the unique shuffle id of the application
+- `shuffleId` is the unique shuffle id of Celeborn
+- `appShuffleId` is the unique shuffle id of the application
 - `partitionId` is the partition id to read from
 - `attemptNumber` is the attempt id of reduce task, can be safely set to any value
 - `startMapIndex` is the index of start map index of interested map range, set to 0 if you want to read all
  partition data
 - `endMapIndex` is the index of end map index of interested map range, set to `Integer.MAX_VALUE` if you want
  to read all partition data
+- `exceptionMaker` is the marker of exception including fetch failure exception.
+- `metricsCallback` is the callback of monitoring metrics to increase read bytes and time etc.

 The returned input stream is guaranteed to be `Exactly Once`, meaning no data lost and no duplicated reading, or else
 an exception will be thrown, see [Here](../../developers/faulttolerant#exactly-once).
--- a/docs/upgrading.md
+++ b/docs/upgrading.md
@@ -50,7 +50,7 @@ Users can increase the configuration value appropriately according to the situat
 Shuffle client records the shuffle partition location's host, service port, and filename,
 to support workers recovering reading existing shuffle data after worker restart,
 during worker shutdown, workers should store the meta about reading shuffle partition files
-in LevelDB, and restore the meta after restarting workers.
+in RocksDB or LevelDB(deprecated), and restore the meta after restarting workers.
 Users should set `celeborn.worker.graceful.shutdown.enabled` to `true` to enable graceful shutdown.
 During this process, worker will wait all allocated partition's in this worker to be committed
 within a timeout of `celeborn.worker.graceful.shutdown.checkSlotsFinished.timeout`, which default value is `480s`.
@@ -70,7 +70,7 @@ In order to speed up the restart process, worker let all push data requests retu
 during worker shutdown, and shuffle client will re-apply for a new partition location for these allocated partitions.
 Then client side can record all HARD_SPLIT partition information and pre-commit these partition,
 then the worker side allocated partitions can be committed in a very short time. User should enable
-`celeborn.client.shuffle.batchHandleCommitPartition.enabled`, the default value is false.
+`celeborn.client.shuffle.batchHandleCommitPartition.enabled`, the default value is true.

 ### Example setting