celeborn/common
Jinqian Fan f7be341948 [CELEBORN-1902] Read client throws PartitionConnectionException
### What changes were proposed in this pull request?
`RemoteBufferStreamReader` now throws `org.apache.flink.runtime.io.network.partition.consumer.PartitionConnectionException` when it detects that the current exception was caused by a connection failure.

### Why are the changes needed?

If `org.apache.flink.runtime.io.network.partition.consumer.PartitionConnectionException` is thrown correctly to signal a connection failure, Flink can detect the lost Celeborn server-side nodes and recompute the affected data. Otherwise, endless retries could cause the Flink job to fail.

This PR handles exceptions such as:
```
java.io.IOException: org.apache.celeborn.common.exception.CelebornIOException: Failed to connect to ltx1-app10154.prod.linkedin.com/10.88.105.20:23924
```
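
The translation described above can be sketched roughly as follows. This is a minimal, self-contained illustration, not the code from this PR: the two exception classes are hypothetical stand-ins for `CelebornIOException` and Flink's `PartitionConnectionException` (whose real constructor differs), and matching on the `Failed to connect to` message text is an assumption taken from the sample exception above.

```java
import java.io.IOException;

public class ConnectionFailureSketch {
    // Hypothetical stand-in for org.apache.celeborn.common.exception.CelebornIOException.
    public static class CelebornIOException extends IOException {
        public CelebornIOException(String message) {
            super(message);
        }
    }

    // Hypothetical stand-in for Flink's PartitionConnectionException.
    // The real class takes different constructor arguments.
    public static class PartitionConnectionException extends IOException {
        public PartitionConnectionException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    // Assumption: a connection failure is recognised by this message fragment.
    static final String CONNECT_FAILURE_MARKER = "Failed to connect to";

    // Walks the cause chain (bounded, to guard against cycles) looking for the marker.
    public static boolean isConnectionFailure(Throwable error) {
        int depth = 0;
        for (Throwable t = error; t != null && depth < 32; t = t.getCause(), depth++) {
            String message = t.getMessage();
            if (message != null && message.contains(CONNECT_FAILURE_MARKER)) {
                return true;
            }
        }
        return false;
    }

    // Maps connection failures to the partition-level exception; other errors pass through.
    public static IOException translate(Throwable error) {
        if (isConnectionFailure(error)) {
            return new PartitionConnectionException("Connection to shuffle worker failed", error);
        }
        return error instanceof IOException ? (IOException) error : new IOException(error);
    }

    public static void main(String[] args) {
        // The host and port below are placeholders.
        Throwable wrapped =
            new IOException(new CelebornIOException("Failed to connect to host.example/10.0.0.1:1234"));
        System.out.println(translate(wrapped).getClass().getSimpleName());
        System.out.println(translate(new IOException("checksum mismatch")).getClass().getSimpleName());
    }
}
```

Checking the whole cause chain rather than only the top-level exception matters here because, as in the sample above, the `CelebornIOException` arrives wrapped in a plain `java.io.IOException`.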

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

Tested in a Flink batch job with Celeborn.

Closes #3147 from Austinfjq/throw-Partition-Connection-Exception.

Lead-authored-by: Jinqian Fan <jinqianfan@icloud.com>
Co-authored-by: Austin Fan <jinqianfan@icloud.com>
Co-authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
2025-05-21 16:58:30 -07:00
benchmarks
src [CELEBORN-1902] Read client throws PartitionConnectionException 2025-05-21 16:58:30 -07:00
pom.xml [CELEBORN-1530] support MPU for S3 2024-11-22 15:03:53 +08:00