Commit Graph

1423 Commits

Author SHA1 Message Date
jiaoqingbo
f1713dacaf [MINOR] Fix incorrect default resume ratio in trafficcontrol doc
### What changes were proposed in this pull request?

As Title

### Why are the changes needed?

Since 0.3.1, Celeborn changed the default value of `celeborn.worker.directMemoryRatioToResume` from `0.5` to `0.7`.

The doc should be updated accordingly.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

PASS GA

Closes #1931 from jiaoqingbo/ratiofix.

Authored-by: jiaoqingbo <1178404354@qq.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-21 11:18:48 +08:00
sychen
8eba1b470e
[CELEBORN-1000] MR module style check
### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1929 from cxzl25/CELEBORN-1000.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-09-20 16:54:42 +08:00
sychen
bb50618780
[CELEBORN-997][DOC] Fix Rolling upgrade broken link
### What changes were proposed in this pull request?
https://celeborn.apache.org/docs/latest/developers/overview/

> For more details, please refer to Rolling upgrade

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1927 from cxzl25/CELEBORN-997.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-09-20 16:44:42 +08:00
Fu Chen
6b0addb934 [CELEBORN-989] Add support for making distribution package via SBT
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

As title

### Does this PR introduce _any_ user-facing change?

Users have the capability to generate the binary distribution package using SBT by executing the following command:

```shell
./build/make-distribution.sh --sbt-enabled
```

### How was this patch tested?

Pass GA && locally tested.

Closes #1921 from cfmcgrady/sbt-make-dist-3.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
2023-09-20 10:03:01 +08:00
sychen
90ff0dba2f [CELEBORN-951][FOLLOWUP] IssueNavigationLink adapts to early Github Issues
### What changes were proposed in this pull request?

https://github.com/apache/incubator-celeborn/pull/1883

![image](https://github.com/apache/incubator-celeborn/assets/3898450/b39d58a7-1466-46e2-b157-fc765960edd4)

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1926 from cxzl25/CELEBORN-951_followup.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-20 09:06:00 +08:00
sychen
b2b7c4d359 [CELEBORN-991][DOC] Remove incorrect spark.metrics.conf
### What changes were proposed in this pull request?
1. Replace `spark.metrics.conf` with `celeborn.metrics.conf`.
2. Fix broken links.
https://celeborn.apache.org/docs/latest/monitoring/#metrics

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1925 from cxzl25/CELEBORN-991.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-20 09:03:27 +08:00
ming.li
428e2660bc [CELEBORN-990] Add exception handler when calling CelebornHadoopUtils.getHadoopFS
Add an exception handler when calling `CelebornHadoopUtils.getHadoopFS(conf)` on the Master and Worker, to avoid concealing HDFS initialization exceptions.

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1923 from leemingzixxoo/main.

Authored-by: ming.li <ming.li@dmall.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-19 19:44:59 +08:00
zwangsheng
b21069866d [CELEBORN-971][HELM] Should update Charts appVersion when we update project version
### What changes were proposed in this pull request?
Update the chart's `appVersion` whenever the project version is updated. Future release work should also change this `appVersion`.

### Why are the changes needed?
appVersion means:
>This is the version number of the application being deployed. This version number should be incremented each time you make changes to the application. Versions are not expected to follow Semantic Versioning. They should reflect the version the application is using.

### Does this PR introduce _any_ user-facing change?
Yes, users will see a changed `appVersion` when using the Celeborn chart.

### How was this patch tested?
Local

Closes #1904 from zwangsheng/CELEBORN-971.

Authored-by: zwangsheng <binjieyang@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-19 11:19:39 +08:00
Fu Chen
1e49ff76f3 [CELEBORN-988] Add config option to control original unsorted file deletion in PartitionFilesSorter
### What changes were proposed in this pull request?

This PR adds a new configuration option, `celeborn.worker.sortPartition.lazyRemovalOfOriginalFiles.enabled`, allowing users to control whether the `PartitionFilesSorter` deletes the original unsorted file.

### Why are the changes needed?

https://github.com/apache/incubator-celeborn/pull/1907#issuecomment-1723420513

### Does this PR introduce _any_ user-facing change?

Users have the option to prevent the `PartitionFilesSorter` from deleting the original unsorted file by configuring `celeborn.worker.sortPartition.lazyRemovalOfOriginalFiles.enabled = false`.

### How was this patch tested?

Pass GA

Closes #1922 from cfmcgrady/make-delete-configurable.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-19 11:14:51 +08:00
sychen
beed2a85b0
[CELEBORN-977] Support RocksDB as recover DB backend
### What changes were proposed in this pull request?

### Why are the changes needed?

LevelDB does not support macOS on arm64.

```java
java.lang.UnsatisfiedLinkError: Could not load library. Reasons: [no leveldbjni64-1.8 in java.library.path, no leveldbjni-1.8 in java.library.path, no leveldbjni in java.library.path, /private/var/folders/tc/r2n_8g6j4731h7clfqwntg880000gn/T/libleveldbjni-64-1-4616234670453989010.8: dlopen(/private/var/folders/tc/r2n_8g6j4731h7clfqwntg880000gn/T/libleveldbjni-64-1-4616234670453989010.8, 0x0001): tried: '/private/var/folders/tc/r2n_8g6j4731h7clfqwntg880000gn/T/libleveldbjni-64-1-4616234670453989010.8' (fat file, but missing compatible architecture (have 'x86_64,i386', need 'arm64')), '/System/Volumes/Preboot/Cryptexes/OS/private/var/folders/tc/r2n_8g6j4731h7clfqwntg880000gn/T/libleveldbjni-64-1-4616234670453989010.8' (no such file), '/private/var/folders/tc/r2n_8g6j4731h7clfqwntg880000gn/T/libleveldbjni-64-1-4616234670453989010.8' (fat file, but missing compatible architecture (have 'x86_64,i386', need 'arm64'))]
  	at org.fusesource.hawtjni.runtime.Library.doLoad(Library.java:182)
  	at org.fusesource.hawtjni.runtime.Library.load(Library.java:140)
  	at org.fusesource.leveldbjni.JniDBFactory.<clinit>(JniDBFactory.java:48)
  	at org.apache.celeborn.service.deploy.worker.shuffledb.LevelDBProvider.initLevelDB(LevelDBProvider.java:49)
  	at org.apache.celeborn.service.deploy.worker.shuffledb.DBProvider.initDB(DBProvider.java:30)
  	at org.apache.celeborn.service.deploy.worker.storage.StorageManager.<init>(StorageManager.scala:197)
  	at org.apache.celeborn.service.deploy.worker.Worker.<init>(Worker.scala:109)
  	at org.apache.celeborn.service.deploy.worker.Worker$.main(Worker.scala:734)
  	at org.apache.celeborn.service.deploy.worker.Worker.main(Worker.scala)
```

The released `leveldbjni-all` for `org.fusesource.leveldbjni` does not support AArch64 Linux; there we need to use `org.openlabtesting.leveldbjni`.

See https://issues.apache.org/jira/browse/HADOOP-16614

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
local test

Closes #1913 from cxzl25/CELEBORN-977.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-09-19 09:20:33 +08:00
sychen
4d35e501a3 [CELEBORN-984][DOC] shutdownWorkers API documentation
### What changes were proposed in this pull request?
https://celeborn.apache.org/docs/latest/monitoring/#master_1

07c1dc2568/service/src/main/scala/org/apache/celeborn/server/common/http/HttpRequestHandler.scala (L74-L75)

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1920 from cxzl25/CELEBORN-984.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-18 19:58:11 +08:00
Shuang
615479c442 [CELEBORN-468] Timeout useless lostWorkers/shutdownWorkers meta
### What changes were proposed in this pull request?
As title

### Why are the changes needed?
If a Worker is lost, or is lost after a graceful shutdown, the Master retains its lostWorkers/shutdownWorkers metadata permanently.
This metadata causes noisy messages in the LifecycleManager, so it is better to delete it after a while.
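
A minimal sketch of the idea in Java, assuming the Master records the time at which each worker was marked lost and periodically drops entries older than a configured timeout. The class `LostWorkerMetaSketch` and its methods are hypothetical names for illustration, not Celeborn's actual implementation:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical illustration only; not the Master's real metadata manager.
final class LostWorkerMetaSketch {
    // worker id -> time (ms) at which the worker was recorded as lost
    private final Map<String, Long> lostWorkers = new ConcurrentHashMap<>();
    private final long timeoutMs;

    LostWorkerMetaSketch(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    void markLost(String workerId, long nowMs) {
        lostWorkers.put(workerId, nowMs);
    }

    /** Drops entries older than the timeout and returns how many were removed. */
    int expire(long nowMs) {
        int before = lostWorkers.size();
        lostWorkers.values().removeIf(lostAt -> nowMs - lostAt > timeoutMs);
        return before - lostWorkers.size();
    }

    boolean isTracked(String workerId) {
        return lostWorkers.containsKey(workerId);
    }
}
```

Passing the clock value in explicitly keeps the expiry rule easy to exercise without waiting for real time to pass.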

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
UT & E2E test

Closes #1916 from RexXiong/CELEBORN-468.

Authored-by: Shuang <lvshuang.tb@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-18 18:39:43 +08:00
sychen
07c1dc2568 [CELEBORN-975] Refactor the check logic to stop the celeborn master and worker
### What changes were proposed in this pull request?

The stop command in `stop-master.sh` and `stop-worker.sh` now waits up to 600s after sending `kill -15`.

The pid file is deleted only when the stop succeeds, so that a retried stop command can still find the pid file.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1911 from cxzl25/CELEBORN-975.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-18 16:23:32 +08:00
sychen
fbeb5a62ec [CELEBORN-982] Improve RPC bind port tips
### What changes were proposed in this pull request?
Current:
```
23/09/18 11:35:07,506 WARN [main] Utils: Service 'MasterSys' could not bind on port 9097. Attempting port 9098.
23/09/18 11:35:07,506 INFO [main] NettyRpcEnvFactory: Starting RPC Server [MasterSys] on clb-master:9098 with advisor endpoint clb-master:9098
Exception in thread "main" java.net.BindException: Address already in use: Service 'MasterSys' failed after 1 retries (starting from 9097)! Consider explicitly setting the appropriate port for the service 'MasterSys' (for example spark.ui.port for SparkUI) to an available port or increasing spark.port.maxRetries.
	at sun.nio.ch.Net.bind0(Native Method)
```
With this PR:
```
23/09/18 11:43:03,157 WARN [main] Utils: Service 'MasterSys' could not bind on port 9097. Attempting port 9098.
23/09/18 11:43:03,157 INFO [main] NettyRpcEnvFactory: Starting RPC Server [MasterSys] on clb-master:9098 with advisor endpoint clb-master:9098
Exception in thread "main" java.net.BindException: Address already in use: Service 'MasterSys' failed after 1 retries (starting from 9097)! Consider explicitly setting the appropriate port for the service 'MasterSys' to an available port or increasing celeborn.port.maxRetries.
	at sun.nio.ch.Net.bind0(Native Method)
```

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1918 from cxzl25/CELEBORN-982.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-18 16:00:22 +08:00
jiaoqingbo
107f3df8ba [CELEBORN-979] Reduce default disk Check Interval
### What changes were proposed in this pull request?

Reduce default disk Check Interval

### Why are the changes needed?

Since https://github.com/apache/incubator-celeborn/pull/1909 added a DiskInfo status check to the `PushDataHandler#checkDiskFull` method, the default disk check interval should be reduced.

### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?

PASS GA

Closes #1915 from jiaoqingbo/979.

Authored-by: jiaoqingbo <1178404354@qq.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-18 14:54:22 +08:00
sychen
045c682c89 [CELEBORN-978] Improve dependency.sh replacement mode
### What changes were proposed in this pull request?

### Why are the changes needed?
When the update script is executed locally, it may produce a log line like the following, which causes awk to exit with an error.
```
Downloading from nexus: httpxxxx
```

```bash
./dev/dependencies.sh --replace
```

```
awk: trying to access out of range field -1
 input record number 1, file
 source line number 2
```

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1914 from cxzl25/CELEBORN-978.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-16 09:35:13 +08:00
sychen
375e855d42 [CELEBORN-976] Introduce script to check master and worker status
### What changes were proposed in this pull request?
Use `status-master.sh` and `status-worker.sh` to check the pid status corresponding to the master and worker.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1912 from cxzl25/CELEBORN-976.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-16 09:27:52 +08:00
sychen
7b4d38ea14 [CELEBORN-981] Improve enable graceful shutdown tips
### What changes were proposed in this pull request?

```
celeborn.worker.graceful.shutdown.enabled=true
```

```
  23/09/15 17:17:29,887 ERROR [main] Worker: Initialize worker failed.
  java.lang.AssertionError: assertion failed: If enable graceful shutdown, the worker should use stable server port.
  	at scala.Predef$.assert(Predef.scala:223)
  	at org.apache.celeborn.service.deploy.worker.Worker.<init>(Worker.scala:87)
  	at org.apache.celeborn.service.deploy.worker.Worker$.main(Worker.scala:734)
  	at org.apache.celeborn.service.deploy.worker.Worker.main(Worker.scala)
```

```
23/09/15 17:51:25,937 ERROR [main] Worker: Initialize worker failed.
  java.lang.AssertionError: assertion failed: If enable graceful shutdown, the worker should use non-zero port. celeborn.worker.rpc.port=0, celeborn.worker.fetch.port=9193, celeborn.worker.push.port=9192, celeborn.worker.replicate.port=9194
  	at scala.Predef$.assert(Predef.scala:223)
  	at org.apache.celeborn.service.deploy.worker.Worker.<init>(Worker.scala:91)
  	at org.apache.celeborn.service.deploy.worker.Worker$.main(Worker.scala:738)
  	at org.apache.celeborn.service.deploy.worker.Worker.main(Worker.scala)
```
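
The two log excerpts above differ in how much the assertion tells the operator: the second one, presumably the improved message, names the requirement (non-zero ports) and prints each port setting. A minimal Java sketch of such a check follows. The class and method names are hypothetical, and the real check lives in the Scala `Worker` constructor, so this is only an illustration of the technique:

```java
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper for illustration; not Celeborn's actual Worker code.
final class GracefulShutdownPortCheckSketch {
    private GracefulShutdownPortCheckSketch() {}

    /**
     * Throws if graceful shutdown is enabled and any port is 0 (meaning "pick a random port").
     * The error message lists every port setting so the offending one is easy to spot.
     */
    static void check(boolean gracefulShutdownEnabled, Map<String, Integer> ports) {
        if (!gracefulShutdownEnabled) {
            return;
        }
        boolean allFixed = ports.values().stream().allMatch(p -> p != 0);
        if (!allFixed) {
            String settings = ports.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining(", "));
            throw new IllegalStateException(
                "If enable graceful shutdown, the worker should use non-zero port. " + settings);
        }
    }
}
```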

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1917 from cxzl25/CELEBORN-981.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-15 22:33:14 +08:00
jiaoqingbo
03fc00e6a6 [CELEBORN-962] Add check DiskInfo#Status in PushDataHandler#checkDiskFull
### What changes were proposed in this pull request?

As Title

### Why are the changes needed?

As Title

### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?

PASS GA

Closes #1909 from jiaoqingbo/CELEBORN-962.

Authored-by: jiaoqingbo <1178404354@qq.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-14 19:53:28 +08:00
mingji
cb9adfc511
[CELEBORN-974] Add quick start guide about using MapReduce with Celeborn
### What changes were proposed in this pull request?
Add quick start guide about using MapReduce with Celeborn.

### Why are the changes needed?
Celeborn recently added support for a MapReduce client.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
No need to test.

Closes #1908 from FMX/CELEBORN-974.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2023-09-14 19:31:01 +08:00
mingji
e0c00ecd38 [CELEBORN-839][MR] Support Hadoop MapReduce
### What changes were proposed in this pull request?
1. Map side merge and push.
2. Support hadoop2 & 3.
3. Reduce in-memory merge.
4. Integrate LifecycleManager to RmApplicationMaster.

### Why are the changes needed?
Ditto.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
Cluster.

I tested this PR on a 4x (16 CPU, 64G Mem, 4 ESSD) cluster.
Hadoop 2.8.5

1TB Terasort, 8400 mappers, 1000 reducers
Celeborn 81min vs MR shuffle 89min
![mr1](https://github.com/apache/incubator-celeborn/assets/4150993/a3cf6493-b6ff-4c03-9936-4558cf22761d)
![mr2](https://github.com/apache/incubator-celeborn/assets/4150993/9119ffb4-6996-4b77-bcdf-cbd6db5c096f)

1GB wordcount, 8 mappers, 8 reducers
Celeborn 35s VS MR shuffle 38s
![mr3](https://github.com/apache/incubator-celeborn/assets/4150993/907dce24-16b7-4788-ab5d-5b784fd07d47)
![mr4](https://github.com/apache/incubator-celeborn/assets/4150993/8e8065b9-6c46-4c8d-9e71-45eed8e63877)

Closes #1830 from FMX/CELEBORN-839.

Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Co-authored-by: Ethan Feng <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-14 14:12:53 +08:00
zhongqiang.czq
8b4fe73d4f [CELEBORN-972][HELM] Enhance workingdirDiskCapacity unit parsing and fix ConfigMap not taking effect for worker StatefulSet
### What changes were proposed in this pull request?
1. Fix the ConfigMap not being mounted for the worker.
2. Fix compatibility with different byte unit types for the workingdir capacity, e.g. Gi, Ti.

### Why are the changes needed?

1. In a previous PR the ConfigMap was removed from value.yaml, but worker-statefulset.yaml still uses this config, so the worker pod cannot mount the ConfigMap volume at /opt/celeborn/conf.

2. In a previous PR the capacity was appended to the workingdir, but the unit type Gi is not supported by byteStringTransformer:
``` java
  java.lang.NumberFormatException: Size must be specified as bytes (b), kibibytes (k), mebibytes (m), gibibytes (g), tebibytes (t), or pebibytes(p). E.g. 50b, 100k, or 250m.Invalid suffix: "gi"

celeborn.worker.storage.dirs=/mnt/disk1:disktype=SSD:capacity=100Gi,/mnt/disk2:disktype=SSD:capacity=100Gi,/mnt/disk3:disktype=SSD:capacity=100Gi,/mnt/disk4:disktype=SSD:capacity=100Gi
```
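
The error above comes from a parser that accepts single-letter suffixes such as `g`, while the chart values use Kubernetes-style suffixes such as `Gi`. The fix in this PR was made in the Helm chart. The Java snippet below is only a sketch of the same normalization, with a hypothetical class name and handling integer quantities only:

```java
import java.util.Locale;

// Hypothetical illustration of the unit normalization; not code from the chart or from Celeborn.
final class CapacityUnitSketch {
    private CapacityUnitSketch() {}

    /**
     * Converts a Kubernetes-style binary quantity ("100Gi") into the single-letter
     * form named in the error message ("100g"). Other inputs are only lower-cased.
     */
    static String toByteString(String quantity) {
        String q = quantity.trim().toLowerCase(Locale.ROOT);
        // "100gi" -> "100g": drop the trailing 'i' of a binary suffix.
        if (q.matches("\\d+[kmgtp]i")) {
            return q.substring(0, q.length() - 1);
        }
        return q;
    }
}
```

Because the error message defines `g` as gibibytes, dropping the trailing `i` keeps the quantity unchanged.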

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?
manual on k8s
1. the issue of configMap not being mounted
- before this PR
```shell
kubectl get pod celeborn-worker-0 -o yaml |grep conf
  - configMap:
      name: celeborn-conf
      - configMap:
```
- after this PR
``` shell
kubectl get pod celeborn-worker-0 -o yaml |grep conf
    - mountPath: /opt/celeborn/conf
  - configMap:
      name: celeborn-conf
      - configMap:
```
2. compatibility
The NumberFormatException is no longer thrown after this PR.

Closes #1906 from zhongqiangczq/helm-fix.

Authored-by: zhongqiang.czq <zhongqiang.czq@alibaba-inc.com>
Signed-off-by: zhongqiang.czq <zhongqiang.czq@alibaba-inc.com>
2023-09-14 11:58:26 +08:00
camper42
3c8d89ca99 [CELEBORN-969][HELM] Allow user set priorityClass used by celeborn pods
### What changes were proposed in this pull request?

Allow users to set the priorityClass used by Celeborn pods.

### Why are the changes needed?

Allow users to set a proper priorityClass to avoid unwanted evictions.

### Does this PR introduce _any_ user-facing change?

No, default values change nothing.

### How was this patch tested?

Ran the test locally before making the pull request:

`helm template test charts/celeborn > new_rendered.yaml` && `diff old_rendered.yaml new_rendered.yaml`

Closes #1902 from camper42/priority-class.

Authored-by: camper42 <camper.xlii@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-14 11:39:04 +08:00
Melody
0bf42013f0
[CELEBORN-968] Make volume name dynamic in StatefulSet in Helm chart
### What changes were proposed in this pull request?
Prefix the volumeMounts name with a dynamic `{{ $.Release.Name }}` in `master-stateful.yaml` and `worker-stateful.yaml`.

### Why are the changes needed?

When running multiple Celeborn clusters with different release names and Celeborn versions, `helm install` failed until the initContainers' volume name was made dynamic. See the error from the master's StatefulSet:

>> kubectl describe statefulset.apps/clbv3-master  -n celeborn

Events:
  Type     Reason        Age                From                    Message
  ----     ------        ----               ----                    -------
  Warning  FailedCreate  5s (x14 over 46s)  statefulset-controller  create Pod clbv3-master-0 in StatefulSet clbv3-master failed error: Pod "clbv3-master-0" is invalid: spec.initContainers[0].volumeMounts[0].name: Not found: "celeborn-master-vol-0"

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
Installed the Celeborn cluster with Helm twice in the same namespace under different release names. Neither install should fail.
For example:
```
helm install celeborn charts/celeborn-shuffle-service  -n celeborn
helm install clbv3 charts/celeborn-shuffle-service  -n celeborn
```

Closes #1901 from a140262/main.

Authored-by: Melody <meloyang@amazon.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-09-13 13:45:25 +08:00
zwangsheng
03a39819b5 [CELEBORN-882][WORKER][METRICS] Add Pause Push Data Time Count Metrics & Dashboard Panel
### What changes were proposed in this pull request?
Add the `PausePushDataTime` metric.

### Why are the changes needed?
Track the pause time of each Celeborn worker.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Cluster Test

Closes #1800 from zwangsheng/CELEBORN-882.

Lead-authored-by: zwangsheng <2213335496@qq.com>
Co-authored-by: zwangsheng <binjieyang@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-12 17:45:26 +08:00
onebox-li
0e53a3d552 [CELEBORN-932] Fix worker register after gracefully restart
### What changes were proposed in this pull request?
In HA mode, a worker's first registration fails after a graceful restart; the worker is only actually registered after one heartbeat.
<img width="889" alt="image" src="https://github.com/apache/incubator-celeborn/assets/19429353/371aa0e0-b2e9-4c1f-9e40-276dc1460219">
This is because the master uses the same `requestId` to submit both requests, so the second request is not processed correctly due to the Ratis `RetryCache`.
The master logs look like this:
(worker gracefully stop)
Master: Receive ReportNodeFailure
(worker start)
Master: Received RegisterWorker request
Master: Received heartbeat from unknown worker
Master: Registered worker

This PR therefore improves `AbstractMetaManager#updateRegisterWorkerMeta` to cover the `WorkerRemove` logic. For backward compatibility, and to handle possible inconsistencies during a rolling upgrade, it temporarily fixes the duplicate requestId and keeps the remove function. We can try to remove the `WorkerRemove` logic in a future version.
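
To see why a reused `requestId` hides the second request, here is a deliberately simplified Java model of a retry cache. It is an assumption-laden sketch with a hypothetical class name, not Ratis's `RetryCache` API: the first reply for a given id is cached, and any later submission with the same id gets that cached reply without being executed.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Simplified, hypothetical model of retry-cache behaviour; not the Ratis implementation.
final class RetryCacheSketch {
    private final Map<String, String> repliesByRequestId = new HashMap<>();

    /** Runs the operation only if this request id has not been seen before. */
    String submit(String requestId, Supplier<String> operation) {
        return repliesByRequestId.computeIfAbsent(requestId, id -> operation.get());
    }
}
```

In this model, submitting a worker removal and then a registration under the same id returns the removal's reply twice, which matches the symptom described above. Giving each request a distinct id lets both run.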

### Why are the changes needed?
Ditto

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Cluster test

Closes #1863 from onebox-li/fix-restart-register.

Authored-by: onebox-li <lyh-36@163.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-11 21:23:28 +08:00
sychen
d7e900fa9a [CELEBORN-959] Use Java API to obtain disk capacity information instead of df command
### What changes were proposed in this pull request?
Use Java API to obtain disk capacity information.

bf605c8acc/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/DF.java (L84-L104)

599bb77c45/jdk/src/solaris/native/java/io/UnixFileSystem_md.c (L439-L467)

### Why are the changes needed?

Some operating systems do not support the `df -B1` command, in which case the worker throws an `ArrayIndexOutOfBoundsException`.

We can replace the df command with the Java API, which is more general.

```java
23/09/08 22:03:25,522 ERROR [worker-disk-checker] LocalDeviceMonitor: Device check failed.
java.util.concurrent.ExecutionException: java.lang.ArrayIndexOutOfBoundsException: -4
	at java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.util.concurrent.FutureTask.get(FutureTask.java:206)
	at org.apache.celeborn.common.util.Utils$.tryWithTimeoutAndCallback(Utils.scala:858)
	at org.apache.celeborn.service.deploy.worker.storage.DeviceMonitor$.highDiskUsage(DeviceMonitor.scala:258)
	at org.apache.celeborn.service.deploy.worker.storage.LocalDeviceMonitor$$anon$1.$anonfun$run$9(DeviceMonitor.scala:136)
	at org.apache.celeborn.service.deploy.worker.storage.LocalDeviceMonitor$$anon$1.$anonfun$run$9$adapted(DeviceMonitor.scala:135)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
	at scala.collection.IterableLike.foreach(IterableLike.scala:74)
	at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
	at org.apache.celeborn.service.deploy.worker.storage.LocalDeviceMonitor$$anon$1.$anonfun$run$2(DeviceMonitor.scala:135)
	at org.apache.celeborn.service.deploy.worker.storage.LocalDeviceMonitor$$anon$1.$anonfun$run$2$adapted(DeviceMonitor.scala:110)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
	at scala.collection.IterableLike.foreach(IterableLike.scala:74)
	at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
	at org.apache.celeborn.service.deploy.worker.storage.LocalDeviceMonitor$$anon$1.run(DeviceMonitor.scala:110)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.ArrayIndexOutOfBoundsException: -4
	at org.apache.celeborn.service.deploy.worker.storage.DeviceMonitor$.$anonfun$highDiskUsage$1(DeviceMonitor.scala:240)
	at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
	at org.apache.celeborn.common.util.Utils$$anon$3.call(Utils.scala:851)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	... 3 more
```
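
A minimal sketch of reading disk capacity through the Java API instead of parsing `df` output. It uses the standard `java.io.File#getTotalSpace` and `java.io.File#getUsableSpace` methods; the class name, method names, and threshold parameter are hypothetical and do not mirror Celeborn's `DeviceMonitor`:

```java
import java.io.File;

// Hypothetical helper for illustration; not Celeborn's actual DeviceMonitor code.
final class DiskUsageSketch {
    private DiskUsageSketch() {}

    /** Returns the used fraction of the file system that contains dir. */
    static double usedRatio(File dir) {
        long total = dir.getTotalSpace();   // 0 if the path does not name a partition
        long usable = dir.getUsableSpace(); // bytes available to this JVM
        if (total <= 0L) {
            throw new IllegalArgumentException("Cannot read capacity of " + dir);
        }
        return (double) (total - usable) / total;
    }

    /** True when the used fraction reaches the given threshold, e.g. 0.95. */
    static boolean highDiskUsage(File dir, double threshold) {
        return usedRatio(dir) >= threshold;
    }
}
```

Since no external process is started and no command output is parsed, this approach does not depend on which `df` flags the operating system supports.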

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1892 from cxzl25/CELEBORN-959.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-11 17:42:29 +08:00
sychen
0aaffe6f97
[CELEBORN-964] Simplify read process output to prevent leak
### What changes were proposed in this pull request?
Close the InputStream.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1897 from cxzl25/CELEBORN-964.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-09-11 16:46:13 +08:00
zwangsheng
50c70265b5
[CELEBORN-963] Add WORKDIR in celeborn Dockerfile
### What changes were proposed in this pull request?
Introduce the `WORKDIR` instruction into Celeborn's `docker/Dockerfile`.

### Why are the changes needed?
We should add `WORKDIR` to the Dockerfile so that we land in `/opt/celeborn` when entering Celeborn containers.

According to https://docs.docker.com/engine/reference/builder/
> The WORKDIR instruction sets the working directory for any RUN, CMD, ENTRYPOINT, COPY and ADD instructions that follow it in the Dockerfile. If the WORKDIR doesn't exist, it will be created even if it's not used in any subsequent Dockerfile instruction.

And also we can find same `WORKDIR` in spark project
3d119a5280/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile (L57)
### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Local test

```log
hadoopXXXX:~/yangbinjie/XXXXe$ docker run cd3d2a0ccab5e88c202ad56c98d4db6ca5d36b2f7d44b5aa2a9166f075d5f950 ls -l
total 269
drwxrwxr-x 2 celeborn celeborn   4 Sep 11 05:37 bin
drwxrwxr-x 2 celeborn celeborn   9 Sep 11 05:37 conf
drwxrwxr-x 2 celeborn celeborn  78 Sep 11 05:37 jars
drwxrwxr-x 2 celeborn celeborn  79 Sep 11 05:37 master-jars
-rw-rw-r-- 1 celeborn celeborn 138 Sep 11 03:33 RELEASE
drwxrwxr-x 2 celeborn celeborn  11 Sep 11 05:37 sbin
drwxrwxr-x 2 celeborn celeborn  66 Sep 11 05:37 worker-jars
```

Closes #1896 from zwangsheng/CELEBORN-963.

Authored-by: zwangsheng <binjieyang@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-09-11 15:10:26 +08:00
sychen
8b7989ad0c [CELEBORN-900][FOLLOWUP] Disable jemalloc in non-docker environment
### What changes were proposed in this pull request?
1. Provide `CELEBORN_PREFER_JEMALLOC` configuration to determine whether to enable jemalloc
2. Provide `CELEBORN_JEMALLOC_PATH` to configure the jemalloc path; for example, on CentOS it is `/usr/lib64/libjemalloc.so`
3. Enable jemalloc by default in the docker environment

### Why are the changes needed?
Prevent unnecessary WARNING.

https://github.com/apache/incubator-celeborn/pull/1824#discussion_r1319909938

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
local test

Closes #1895 from cxzl25/CELEBORN-900_diable.

Lead-authored-by: sychen <sychen@ctrip.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-11 14:55:10 +08:00
sychen
cb8ace406b [CELEBORN-960] Exclude workers without healthy disks
### What changes were proposed in this pull request?
The master checks the number of healthy disks in the worker and decides whether to exclude it.

### Why are the changes needed?

When the disks of all workers are unhealthy, HDFS is not enabled, and the master does not exclude the workers, the Spark client's `checkWorkersAvailable` call still reports workers as available, and the shuffle write ultimately fails without fallback.

```java
23/09/08 23:20:44 ERROR LifecycleManager: Aggregated error of reserveSlots for shuffleId 9 failure:
 [reserveSlots] Failed to reserve buffers for shuffleId 9 from worker Host:1.2.3.4:RpcPort:55803:PushPort:55805:FetchPort:55807:ReplicatePort:55806. Reason: Local storage has no available dirs!
23/09/08 23:20:44 ERROR LifecycleManager: Retry reserve slots for 9 failed caused by not enough slots.
23/09/08 23:20:44 WARN LifecycleManager: Reserve buffers for 9 still fail after retrying, clear buffers.
23/09/08 23:20:44 ERROR LifecycleManager: reserve buffer for 9 failed, reply to all.
23/09/08 23:20:44 ERROR ShuffleClientImpl: LifecycleManager request slots return RESERVE_SLOTS_FAILED, retry again, remain retry times 0.
23/09/08 23:20:47 WARN TaskSetManager: Lost task 8.0 in stage 27.0 (TID 89) (1.2.3.4 executor driver): TaskKilled (Stage cancelled)
23/09/08 23:20:59 ERROR MasterClient: Send rpc with failure, has tried 15, max try 15!
org.apache.celeborn.common.exception.CelebornException: Exception thrown in awaitResult:
	at org.apache.celeborn.common.util.ThreadUtils$.awaitResult(ThreadUtils.scala:229)
	at org.apache.celeborn.common.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:74)
	at org.apache.celeborn.common.client.MasterClient.sendMessageInner(MasterClient.java:150)
	at org.apache.celeborn.common.client.MasterClient.askSync(MasterClient.java:118)
	at org.apache.celeborn.client.LifecycleManager.requestMasterRequestSlots(LifecycleManager.scala:1033)
	at org.apache.celeborn.client.LifecycleManager.requestMasterRequestSlotsWithRetry(LifecycleManager.scala:1022)
	at org.apache.celeborn.client.LifecycleManager.org$apache$celeborn$client$LifecycleManager$$offerAndReserveSlots(LifecycleManager.scala:402)
	at org.apache.celeborn.client.LifecycleManager$$anonfun$receiveAndReply$1.applyOrElse(LifecycleManager.scala:210)
```

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
local test
```
23/09/08 23:23:27 WARN CelebornShuffleFallbackPolicyRunner: No workers available for current user `default`.`default`.
23/09/08 23:23:27 WARN SparkShuffleManager: Fallback to vanilla Spark SortShuffleManager for shuffle: 10
23/09/08 23:23:28 WARN CelebornShuffleFallbackPolicyRunner: No workers available for current user `default`.`default`.
23/09/08 23:23:28 WARN SparkShuffleManager: Fallback to vanilla Spark SortShuffleManager for shuffle: 11
100000
Time taken: 0.192 seconds, Fetched 1 row(s)
```

Closes #1893 from cxzl25/CELEBORN-960.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-09 18:52:25 +08:00
jiang13021
163dc8d5b8 [CELEBORN-961] Catch exception when constructing Worker
### What changes were proposed in this pull request?
Move the constructor of Worker into a try-catch block.

### Why are the changes needed?
There are some exceptions thrown from Worker's constructor instead of Worker.initialize(), so it is necessary to catch these exceptions.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Start a worker with the following conf:
```
celeborn.worker.directMemoryRatioToPauseReceive=0.7
celeborn.worker.directMemoryRatioToResume=0.7
```
An IllegalArgumentException is thrown and caught.
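
The idea of moving construction into the try block can be sketched as follows. This is an illustration only: the `Worker` class here is a hypothetical stand-in, and the validation rule (the resume ratio must be lower than the pause ratio) is an assumption inferred from the test configuration above.

```java
class WorkerStartupSketch {
  // Hypothetical stand-in for the real Worker; it validates its config in the constructor.
  static class Worker {
    Worker(double pauseReceiveRatio, double resumeRatio) {
      if (resumeRatio >= pauseReceiveRatio) {
        throw new IllegalArgumentException(
            "resume ratio " + resumeRatio + " must be lower than pause ratio " + pauseReceiveRatio);
      }
    }

    void initialize() {
      // Startup work would go here.
    }
  }

  // Returns 0 on success and 1 on failure, like a process exit code.
  static int start(double pauseReceiveRatio, double resumeRatio) {
    try {
      // The constructor call sits inside the try block, so failures raised
      // during construction are handled the same way as failures in initialize().
      Worker worker = new Worker(pauseReceiveRatio, resumeRatio);
      worker.initialize();
      return 0;
    } catch (Throwable t) {
      System.err.println("Failed to start worker: " + t.getMessage());
      return 1;
    }
  }

  public static void main(String[] args) {
    System.out.println("exit code: " + start(0.7, 0.7));
  }
}
```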

Closes #1894 from jiang13021/celenorn-961.

Authored-by: jiang13021 <jiangyanze.jyz@antgroup.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-09 17:40:13 +08:00
zhouyifan279
7ab674393b [CELEBORN-913][FOLLOWUP] Recover SBT CI jobs skipped due to last commit
### What changes were proposed in this pull request?
As title

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
Verified SBT CI job list.

<img width="843" alt="image" src="https://github.com/apache/incubator-celeborn/assets/88070094/2bbaf661-8f4d-4f3a-a7e4-242484fbd9a2">

Closes #1890 from zhouyifan279/sbt-ci.

Authored-by: zhouyifan279 <zhouyifan279@gmail.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
2023-09-09 02:12:09 +08:00
hongzhaoyang
a77a8eb8fd [CELEBORN-881][BUG] StorageManager clean up thread may delete new app directories
### What changes were proposed in this pull request?

The worker throws a FileNotFoundException while fetching a chunk:
```
java.io.FileNotFoundException: /xxx/celeborn-worker/shuffle_data/subdir_0/application_1693206141914_540726_1/1/871-0-0 (No such file or directory
```
Before the shuffle files are committed, they are deleted by the storage-scheduler thread:
```
2023-09-07 19:38:16,506 [INFO] [dispatcher-event-loop-44] - org.apache.celeborn.service.deploy.worker.storage.StorageManager -Logging.scala(51) -Create file /xxx/celeborn-worker/shuffle_data/subdir_0/application_1693206141914_540726_1/1/986-0-0 success
2023-09-07 19:38:16,506 [INFO] [dispatcher-event-loop-44] - org.apache.celeborn.service.deploy.worker.Controller -Logging.scala(51) -Reserved 29 primary location and 0 replica location for application_1693206141914_540726_1-1
2023-09-07 19:38:16,537 [INFO] [storage-scheduler] - org.apache.celeborn.service.deploy.worker.storage.StorageManager -Logging.scala(51) -Delete expired app dir /xxx/celeborn-worker/shuffle_data/subdir_0/application_1693206141914_540726_1.
2023-09-07 19:38:16,580 [INFO] [storage-scheduler] - org.apache.celeborn.service.deploy.worker.storage.StorageManager -Logging.scala(51) -Delete expired app dir /xxx/celeborn-worker/shuffle_data/subdir_0/application_1693206141914_540726_1.
2023-09-07 19:38:16,629 [INFO] [storage-scheduler] - org.apache.celeborn.service.deploy.worker.storage.StorageManager -Logging.scala(51) -Delete expired app dir /xxx/celeborn-worker/shuffle_data/subdir_0/application_1693206141914_540726_1.
2023-09-07 19:38:16,661 [INFO] [storage-scheduler] - org.apache.celeborn.service.deploy.worker.storage.StorageManager -Logging.scala(51) -Delete expired app dir /xxx/celeborn-worker/shuffle_data/subdir_0/application_1693206141914_540726_1.
2023-09-07 19:38:16,681 [INFO] [storage-scheduler] - org.apache.celeborn.service.deploy.worker.storage.StorageManager -Logging.scala(51) -Delete expired app dir /xxx/celeborn-worker/shuffle_data/subdir_0/application_1693206141914_540726_1.
2023-09-07 19:38:17,355 [INFO] [dispatcher-event-loop-12] - org.apache.celeborn.service.deploy.worker.Controller -Logging.scala(51) -Start commitFiles for application_1693206141914_540726_1-1
2023-09-07 19:38:17,362 [INFO] [async-reply] - org.apache.celeborn.service.deploy.worker.Controller -Logging.scala(51) -CommitFiles for application_1693206141914_540726_1-1 success with 29 committed primary partitions, 0 empty primary partitions, 0 failed primary partitions, 0 committed replica partitions, 0 empty replica partitions, 0 failed replica partitions.
java.io.FileNotFoundException: /xxx/celeborn-worker/shuffle_data/subdir_0/application_1693206141914_540726_1/1/976-0-0 (No such file or directory)
java.io.FileNotFoundException: /xxx/celeborn-worker/shuffle_data/subdir_0/application_1693206141914_540726_1/1/482-0-0 (No such file or directory)
java.io.FileNotFoundException: /xxx/celeborn-worker/shuffle_data/subdir_0/application_1693206141914_540726_1/1/658-0-0 (No such file or directory)
```
This method may have a concurrency problem:
``` scala
private def cleanupExpiredAppDirs(): Unit = {
  val appIds = shuffleKeySet().asScala.map(key => Utils.splitShuffleKey(key)._1)
  disksSnapshot().filter(_.status != DiskStatus.IO_HANG).foreach { diskInfo =>
    diskInfo.dirs.foreach {
      case workingDir if workingDir.exists() =>
        workingDir.listFiles().foreach { appDir =>
          // Don't delete shuffleKey's data that exist correct shuffle file info.
          if (!appIds.contains(appDir.getName)) {
            val threadPool = diskOperators.get(diskInfo.mountPoint)
            deleteDirectory(appDir, threadPool)
            logInfo(s"Delete expired app dir $appDir.")
          }
        }
      // workingDir not exist when initializing worker on new disk
      case _ => // do nothing
    }
  }
}
```
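We should find all app directories first, then get the active shuffle keys. Otherwise, a directory created after the shuffle keys were read can be mistaken for an expired one and deleted.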
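
The ordering suggested above can be sketched like this. It is a simplified illustration, not the actual `StorageManager` code: `dirLister` and `activeAppIds` are hypothetical stand-ins for listing the working directory and reading the shuffle key set, and the sketch assumes an application's shuffle key is registered by the time the keys are read.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Supplier;

class ExpiredAppDirSketch {
  // Returns the app directories that are not owned by any active application.
  static List<String> findExpired(
      Supplier<List<String>> dirLister, Supplier<Set<String>> activeAppIds) {
    // Step 1: snapshot the directories that exist right now.
    List<String> dirs = new ArrayList<>(dirLister.get());
    // Step 2: only afterwards read the active app ids, so an app that
    // registered in between is never part of the directory snapshot.
    Set<String> active = activeAppIds.get();
    List<String> expired = new ArrayList<>();
    for (String dir : dirs) {
      if (!active.contains(dir)) {
        expired.add(dir);
      }
    }
    return expired;
  }

  public static void main(String[] args) {
    List<String> dirs = Arrays.asList("app_old", "app_expired");
    Set<String> active = new HashSet<>(Arrays.asList("app_old"));
    System.out.println("expired: " + findExpired(() -> dirs, () -> active));
  }
}
```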

https://issues.apache.org/jira/browse/CELEBORN-881

### Why are the changes needed?
Bugfix.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Passes GA and manual test.

Closes #1889 from zy-jordan/CELEBORN-881.

Lead-authored-by: hongzhaoyang <15316036153@163.com>
Co-authored-by: Keyong Zhou <waitinfuture@gmail.com>
Co-authored-by: Keyong Zhou <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-08 20:41:04 +08:00
sychen
fe2ce00176 [CELEBORN-958] Log DNS resolution result
### What changes were proposed in this pull request?

In some scenarios, DNS resolution may fail. We can record the DNS resolution results like Spark.

fd424caf6c/common/network-common/src/main/java/org/apache/spark/network/client/TransportClientFactory.java (L185-L192)

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1891 from cxzl25/CELEBORN-958.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-08 20:12:04 +08:00
Jun He
ada12a2c0e
[CELEBORN-900] Prefer to use jemalloc for memory allocation
### What changes were proposed in this pull request?

Only the Dockerfile needs to change in this PR.

### Why are the changes needed?

When deploying Celeborn for Flink on Kubernetes, introducing jemalloc can improve pod memory usage.

### Does this PR introduce _any_ user-facing change?
None

### How was this patch tested?
A production job may be needed to verify the memory usage improvement.

Closes #1824 from mddxhj/feature/introduce_jemalloc.

Authored-by: Jun He <xuehaijuxian@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-09-08 19:49:24 +08:00
sychen
38a68163e0 [CELEBORN-957] Simplify nano time duration calculation
### What changes were proposed in this pull request?
Use `TimeUnit.NANOSECONDS.toMillis` instead of dividing by `1000_000`.
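
As a small illustration of the replacement described above (the class and method names are made up for this sketch), both forms compute the same elapsed milliseconds, but the `TimeUnit` call states the unit conversion explicitly:

```java
import java.util.concurrent.TimeUnit;

class NanoDurationExample {
  // Manual conversion: the reader has to count zeros to confirm the unit.
  static long elapsedMillisManual(long startNanos, long endNanos) {
    return (endNanos - startNanos) / 1000_000;
  }

  // Conversion through TimeUnit: the source and target units are named.
  static long elapsedMillis(long startNanos, long endNanos) {
    return TimeUnit.NANOSECONDS.toMillis(endNanos - startNanos);
  }

  public static void main(String[] args) throws InterruptedException {
    long start = System.nanoTime();
    Thread.sleep(20);
    long end = System.nanoTime();
    System.out.println("elapsed ms (manual):   " + elapsedMillisManual(start, end));
    System.out.println("elapsed ms (TimeUnit): " + elapsedMillis(start, end));
  }
}
```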

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1888 from cxzl25/CELEBORN-957.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-08 19:03:37 +08:00
jiaoqingbo
dd817b267e [CELEBORN-956] Modify parameter passing in AbstractRemoteShuffleInputGateFactory
### What changes were proposed in this pull request?

As Title

### Why are the changes needed?

As Title

### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?

PASS GA

Closes #1887 from jiaoqingbo/956.

Authored-by: jiaoqingbo <1178404354@qq.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-08 18:09:48 +08:00
zwangsheng
bf0deae752 [CELEBORN-953] Remove unused-imports in Utils.scala
### What changes were proposed in this pull request?
As title

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
GA

Closes #1886 from zwangsheng/CELEBORN-953.

Lead-authored-by: zwangsheng <2213335496@qq.com>
Co-authored-by: zwangsheng <binjieyang@apache.org>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-07 22:13:29 +08:00
zhouyifan279
9e01aac501
[CELEBORN-913] Implement method ShuffleDriverComponents#supportsReliableStorage
### What changes were proposed in this pull request?
As title

### Why are the changes needed?
See https://issues.apache.org/jira/browse/SPARK-42689

### Does this PR introduce _any_ user-facing change?
Yes. Users need to set `spark.shuffle.sort.io.plugin.class` to `org.apache.spark.shuffle.celeborn.CelebornShuffleDataIO` to enable this feature.

### How was this patch tested?
Add a new matrix dimension, shuffle-plugin-class, in GitHub CI to run Spark tests over `LocalDiskShuffleDataIO` and `CelebornShuffleDataIO` respectively.

Closes #1884 from zhouyifan279/spark-driver-component.

Authored-by: zhouyifan279 <zhouyifan279@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-09-07 16:25:09 +08:00
Fu Chen
142d12caa5 [CELEBORN-929][INFRA] Add dependencies check CI
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

As title

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass GA

Closes #1852 from cfmcgrady/audit-deps-ci.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Fu Chen <cfmcgrady@gmail.com>
2023-09-07 14:02:07 +08:00
zhongqiang.czq
b1e3d661e6 [CELEBORN-627][FLINK][FOLLOWUP] Support split partitions
### What changes were proposed in this pull request?
Fix duplicate sending of commitFiles for MapPartition, and fix BufferStreamEnd not being sent when MapPartition split is enabled.

### Why are the changes needed?
After enabling partition split for MapPartition, there are two errors.
- ERROR1: The worker doesn't send the stream end to the client because of a thread synchronization problem. After the idle timeout, the client closes the channel and throws the exception **"xx is lost, notify related stream xx"**
```java
2023-09-06T04:40:47.7549935Z 23/09/06 04:40:47,753 WARN [Keyed Aggregation -> Map -> Sink: Unnamed (5/8)#0] Task: Keyed Aggregation -> Map -> Sink: Unnamed (5/8)#0 (c1cade728ddb3a32e0bf72acb1d87588_c27dcf7b54ef6bfd6cff02ca8870b681_4_0) switched from RUNNING to FAILED with failure cause:
2023-09-06T04:40:47.7550644Z java.io.IOException: Client localhost/127.0.0.1:38485 is lost, notify related stream 256654410004
2023-09-06T04:40:47.7551219Z 	at org.apache.celeborn.plugin.flink.RemoteBufferStreamReader.errorReceived(RemoteBufferStreamReader.java:142)
2023-09-06T04:40:47.7551886Z 	at org.apache.celeborn.plugin.flink.RemoteBufferStreamReader.lambda$new$0(RemoteBufferStreamReader.java:77)
2023-09-06T04:40:47.7552576Z 	at org.apache.celeborn.plugin.flink.network.ReadClientHandler.processMessageInternal(ReadClientHandler.java:57)
2023-09-06T04:40:47.7553250Z 	at org.apache.celeborn.plugin.flink.network.ReadClientHandler.lambda$channelInactive$0(ReadClientHandler.java:119)
2023-09-06T04:40:47.7553806Z 	at java.util.concurrent.ConcurrentHashMap.forEach(ConcurrentHashMap.java:1597)
2023-09-06T04:40:47.7554564Z 	at org.apache.celeborn.plugin.flink.network.ReadClientHandler.channelInactive(ReadClientHandler.java:110)
2023-09-06T04:40:47.7555270Z 	at org.apache.celeborn.common.network.server.TransportRequestHandler.channelInactive(TransportRequestHandler.java:71)
2023-09-06T04:40:47.7556005Z 	at org.apache.celeborn.common.network.server.TransportChannelHandler.channelInactive(TransportChannelHandler.java:136)
2023-09-06T04:40:47.7556710Z 	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:305)
2023-09-06T04:40:47.7557370Z 	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:281)
2023-09-06T04:40:47.7558172Z 	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:274)
2023-09-06T04:40:47.7558803Z 	at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81)
2023-09-06T04:40:47.7559368Z 	at io.netty.handler.timeout.IdleStateHandler.channelInactive(IdleStateHandler.java:277)
2023-09-06T04:40:47.7559954Z 	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:303)
2023-09-06T04:40:47.7560589Z 	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:281)
2023-09-06T04:40:47.7561222Z 	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:274)
2023-09-06T04:40:47.7561829Z 	at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81)
2023-09-06T04:40:47.7562620Z 	at org.apache.celeborn.plugin.flink.network.TransportFrameDecoderWithBufferSupplier.channelInactive(TransportFrameDecoderWithBufferSupplier.java:206)
2023-09-06T04:40:47.7563506Z 	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:305)
2023-09-06T04:40:47.7564207Z 	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:281)
2023-09-06T04:40:47.7564829Z 	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:274)
2023-09-06T04:40:47.7565417Z 	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1405)
2023-09-06T04:40:47.7566014Z 	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:301)
2023-09-06T04:40:47.7566654Z 	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:281)
2023-09-06T04:40:47.7567317Z 	at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:901)
2023-09-06T04:40:47.7567813Z 	at io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:813)
2023-09-06T04:40:47.7568297Z 	at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)
2023-09-06T04:40:47.7568830Z 	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167)
2023-09-06T04:40:47.7569402Z 	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
2023-09-06T04:40:47.7569894Z 	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:569)
2023-09-06T04:40:47.7570356Z 	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
2023-09-06T04:40:47.7570841Z 	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
2023-09-06T04:40:47.7571319Z 	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
2023-09-06T04:40:47.7571721Z 	at java.lang.Thread.run(Thread.java:750)
```
- ERROR2: The client sends duplicate commitFiles requests to the worker. Because of inconsistent unhandled partitions, both batchCommit and finalCommit send commitFiles
``` java
2023-09-06T04:36:48.3146773Z 23/09/06 04:36:48,314 WARN [Worker-CommitFiles-1] Controller: Get Partition Location for 1693975002919-61094c8156f918062a5fae12d551bc90-0 0-1 but didn't exist.
```

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ci

Closes #1881 from zhongqiangczq/fix-split-test.

Authored-by: zhongqiang.czq <zhongqiang.czq@alibaba-inc.com>
Signed-off-by: zhongqiang.czq <zhongqiang.czq@alibaba-inc.com>
2023-09-06 22:33:56 +08:00
zhouyifan279
10c63e0a0f [CELEBORN-919][FOLLOWUP] Add SBT project sparkColumnarShuffle to sparkGroup
### What changes were proposed in this pull request?
Add sbt project `sparkColumnarShuffle` to `sparkGroup`

### Why are the changes needed?
Add the project `sparkColumnarShuffle` to the spark tests group `sparkGroup` to enable the columnar-related tests for SBT.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Run tests locally.

Closes #1854 from zhouyifan279/columnar-shuffle-sbt.

Authored-by: zhouyifan279 <zhouyifan279@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-06 21:26:18 +08:00
jiaoqingbo
b2e03d27bd [CELEBORN-950] Change CelebornShuffleReader log level and information
### What changes were proposed in this pull request?

As Title

### Why are the changes needed?

As Title

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

PASS GA

Closes #1882 from jiaoqingbo/950.

Authored-by: jiaoqingbo <1178404354@qq.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-06 21:07:19 +08:00
sychen
c373006618 [CELEBORN-951] Add IssueNavigationLink and icon for IDEA
### What changes were proposed in this pull request?

<img width="598" alt="image" src="https://github.com/apache/incubator-celeborn/assets/3898450/1488b7ad-b323-411a-98d9-285439190752">

<img width="681" alt="image" src="https://github.com/apache/incubator-celeborn/assets/3898450/26b8c8bc-2d88-4817-aeb7-d21fc7a3d55f">

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1883 from cxzl25/idea_icon_and_link.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-06 20:39:36 +08:00
xiyu.zk
d53b6e53c7 [CELEBORN-946][GLUTEN] Record read metric should be compatible with Gluten shuffle dependency
### What changes were proposed in this pull request?
Currently, detecting a Gluten shuffle through the serde only works for the Velox backend. To also support the ClickHouse backend, it is more generic to check for ColumnarShuffleDependency instead.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #1878 from kerwin-zk/gluten.

Authored-by: xiyu.zk <xiyu.zk@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-05 18:34:12 +08:00
mingji
17cfbd7dc7 [CELEBORN-948][DOC] fix quick start doc about failed to submit flink wordcount
### What changes were proposed in this pull request?
Update the script to start word count demo.

### Why are the changes needed?
A user reported that they could not run the demo while following the quick start docs.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
Cluster.

Closes #1880 from FMX/CELEBORN-948.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-05 17:44:16 +08:00
mingji
63164628dc [CELEBORN-944][DOC] Add link about cluster planning
### What changes were proposed in this pull request?
Add a link to expose cluster planning doc.

### Why are the changes needed?
Ditto.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
Not necessary.

Closes #1879 from FMX/CELEBORN-944.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-05 14:24:17 +08:00
zky.zhoukeyong
8d005b8d39 [CELEBORN-945] Change ShutdownHook's timeout for decommission
### What changes were proposed in this pull request?
When shutdown type is decommission, we should change the `ShutdownHookManager#HookEntry`'s
timeout to `celeborn.worker.decommission.forceExitTimeout`.

### Why are the changes needed?
ditto

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manual test

Closes #1877 from waitinfuture/945.

Lead-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Co-authored-by: Keyong Zhou <waitinfuture@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-05 10:24:08 +08:00
zky.zhoukeyong
a42ec85a6e [CELEBORN-943][PERF] Pre-create CelebornInputStreams in CelebornShuffleReader
### What changes were proposed in this pull request?
This PR fixes a performance degradation caused by RPC latency when Spark's coalescePartitions
takes effect.

### Why are the changes needed?
I encountered a performance degradation when testing TPC-DS 10T q10:
|Shuffle service|Time|
|---|---|
|ESS|14s|
|Celeborn|24s|

After digging into it I found out that q10 triggers partition coalescence:
![image](https://github.com/apache/incubator-celeborn/assets/948245/0b4745da-8d57-4661-a35d-683d97f56e1d)

As I configured `spark.sql.adaptive.coalescePartitions.initialPartitionNum` to 1000, `CelebornShuffleReader`
will call `shuffleClient.readPartition` sequentially 1000 times, causing the delay.

This PR optimizes this by calling `shuffleClient.readPartition` in parallel. After this PR, the q10 time drops to 14s.
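
The general technique, issuing the per-partition calls from a fixed-size thread pool instead of a sequential loop, can be sketched as below. This is not the code from the PR: `openAll` and the `opener` function are hypothetical stand-ins for the reader logic and for `shuffleClient.readPartition`, and the pool is created per call here only to keep the sketch self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.IntFunction;

class ParallelStreamCreationSketch {
  // Opens one stream per partition in [startPartition, endPartition) using a pool,
  // and returns the streams in partition order.
  static <T> List<T> openAll(
      int startPartition, int endPartition, int threads, IntFunction<T> opener) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<T>> futures = new ArrayList<>();
      for (int p = startPartition; p < endPartition; p++) {
        final int partitionId = p;
        Callable<T> task = () -> opener.apply(partitionId);
        futures.add(pool.submit(task));
      }
      // Collect in submission order so the result order matches the partition order.
      List<T> streams = new ArrayList<>(futures.size());
      for (Future<T> future : futures) {
        streams.add(future.get());
      }
      return streams;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("Interrupted while opening streams", e);
    } catch (ExecutionException e) {
      throw new IllegalStateException("Failed to open a stream", e.getCause());
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    // The opener here only builds a label; a real one would perform the RPC.
    List<String> streams = openAll(0, 8, 4, p -> "stream-" + p);
    System.out.println(streams);
  }
}
```

With a sequential loop the total latency is roughly the sum of the per-call round trips; with a pool of N threads it is closer to that sum divided by N, which is why a pool size such as the 32 mentioned below matters when the partition count is large.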

### Does this PR introduce _any_ user-facing change?
No, but it introduces a new client-side configuration, `celeborn.client.streamCreatorPool.threads`,
which defaults to 32.

### How was this patch tested?
TPCDS 1T and passes GA.

Closes #1876 from waitinfuture/943.

Lead-authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Co-authored-by: Keyong Zhou <waitinfuture@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-09-04 21:46:11 +08:00