celeborn/docs/configuration
Jinqian Fan f7be341948 [CELEBORN-1902] Read client throws PartitionConnectionException
### What changes were proposed in this pull request?
`RemoteBufferStreamReader` now throws `org.apache.flink.runtime.io.network.partition.consumer.PartitionConnectionException` when the exception it encounters indicates a connection failure.

### Why are the changes needed?

When `org.apache.flink.runtime.io.network.partition.consumer.PartitionConnectionException` is thrown on a connection failure, Flink can detect that Celeborn server-side nodes have been lost and recompute the affected data. Otherwise, endless retries could cause the Flink job to fail.

This PR handles exceptions such as:
```
java.io.IOException: org.apache.celeborn.common.exception.CelebornIOException: Failed to connect to ltx1-app10154.prod.linkedin.com/10.88.105.20:23924
```
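
The translation described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from this PR: the class `ConnectionFailureSketch`, the stand-in exception `PartitionConnectionFailure` (used here in place of Flink's `PartitionConnectionException` so the example needs no Flink dependency), and the heuristic of matching the `Failed to connect to` message are all assumptions made for the example.

```java
import java.io.IOException;
import java.net.ConnectException;

public class ConnectionFailureSketch {

    /** Hypothetical stand-in for Flink's PartitionConnectionException. */
    public static class PartitionConnectionFailure extends IOException {
        public PartitionConnectionFailure(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /**
     * Assumed heuristic: walk the cause chain and treat the error as a
     * connection failure if any link is a ConnectException or carries a
     * "Failed to connect to" message like the one shown above.
     */
    public static boolean isConnectionFailure(Throwable t) {
        int depth = 0;
        // Bound the walk so a cyclic cause chain cannot loop forever.
        for (Throwable cur = t; cur != null && depth < 32; cur = cur.getCause(), depth++) {
            if (cur instanceof ConnectException) {
                return true;
            }
            String msg = cur.getMessage();
            if (msg != null && msg.contains("Failed to connect to")) {
                return true;
            }
        }
        return false;
    }

    /** Map connection failures to the dedicated exception type; pass other errors through. */
    public static IOException translate(Throwable t) {
        if (isConnectionFailure(t)) {
            return new PartitionConnectionFailure("Connection to shuffle worker lost", t);
        }
        return (t instanceof IOException) ? (IOException) t : new IOException(t);
    }

    public static void main(String[] args) {
        // Mimic the nesting from the log: an IOException wrapping the connect error.
        Throwable inner = new IOException("Failed to connect to host/10.0.0.1:1234");
        Throwable outer = new IOException(inner);
        System.out.println(translate(outer).getClass().getSimpleName());
        System.out.println(translate(new IOException("disk full")).getClass().getSimpleName());
    }
}
```

Surfacing a distinct exception type, rather than a generic `IOException`, is what lets the caller tell "the remote node is gone" apart from other read errors and react by recomputing data instead of retrying.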

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

Tested in a Flink batch job with Celeborn.

Closes #3147 from Austinfjq/throw-Partition-Connection-Exception.

Lead-authored-by: Jinqian Fan <jinqianfan@icloud.com>
Co-authored-by: Austin Fan <jinqianfan@icloud.com>
Co-authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
2025-05-21 16:58:30 -07:00
client.md [CELEBORN-1902] Read client throws PartitionConnectionException 2025-05-21 16:58:30 -07:00
columnar-shuffle.md [CELEBORN-1051] Add isDynamic property for CelebornConf 2024-02-20 14:20:44 +08:00
ha.md [CELEBORN-1400] Bump Ratis version from 2.5.1 to 3.0.1 2024-05-30 17:22:22 +08:00
index.md [MINOR] Add documentation for CELEBORN_NO_DAEMONIZE 2024-12-23 10:31:37 +08:00
master.md [CELEBORN-1916][FOLLOWUP] Improve Aliyun OSS support 2025-05-21 11:44:50 +08:00
metrics.md [CELEBORN-1974] ApplicationId as metrics label should be behind a config flag 2025-05-12 21:05:45 -07:00
network-module.md [CELEBORN-1353] Document Celeborn security - authentication and SSL support 2024-04-30 14:37:56 +08:00
network.md [MINOR] Change some config version 2025-05-21 16:39:02 -07:00
quota.md [CELEBORN-1577][PHASE2] QuotaManager should support interrupt shuffle 2025-03-24 22:05:45 +08:00
worker.md [CELEBORN-1965] Rely on all default hadoop providers for S3 auth 2025-05-09 14:16:47 +08:00