### What changes were proposed in this pull request?

`RemoteBufferStreamReader` now throws `org.apache.flink.runtime.io.network.partition.consumer.PartitionConnectionException` when the exception it encounters indicates a connection failure.

### Why are the changes needed?

When `org.apache.flink.runtime.io.network.partition.consumer.PartitionConnectionException` is thrown for a connection failure, Flink becomes aware of the lost Celeborn server-side nodes and can recompute the affected data. Otherwise, endless retries could cause the Flink job to fail.

This PR handles exceptions such as:

```
java.io.IOException: org.apache.celeborn.common.exception.CelebornIOException: Failed to connect to ltx1-app10154.prod.linkedin.com/10.88.105.20:23924
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Tested in a Flink batch job with Celeborn.

Closes #3147 from Austinfjq/throw-Partition-Connection-Exception.

Lead-authored-by: Jinqian Fan <jinqianfan@icloud.com>
Co-authored-by: Austin Fan <jinqianfan@icloud.com>
Co-authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>

| | | |
|---|---|---|
| benchmarks | ||
| src | ||
| pom.xml | ||
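
The exception translation described in the commit message can be illustrated with a minimal sketch. It is not the code from the PR: `PartitionConnectionSketchException` is a hypothetical stand-in for Flink's `PartitionConnectionException` (so the example runs without Flink on the classpath), and `isConnectionFailure`, `translate`, the partition id string, and the example host are all assumptions made for illustration. In particular, recognizing a connection failure by the `Failed to connect to` message text is a guess based on the sample exception above, not a confirmed detail of the change.

```java
import java.io.IOException;

public class ConnectionFailureSketch {

    // Hypothetical stand-in for Flink's PartitionConnectionException.
    static class PartitionConnectionSketchException extends IOException {
        PartitionConnectionSketchException(String partitionId, Throwable cause) {
            super("Connection for partition " + partitionId + " is not reachable.", cause);
        }
    }

    // Assumption: a connection failure is recognized by its message text
    // anywhere in the cause chain. The walk is bounded to guard against cycles.
    static boolean isConnectionFailure(Throwable t) {
        Throwable current = t;
        for (int depth = 0; current != null && depth < 32; depth++) {
            String message = current.getMessage();
            if (message != null && message.contains("Failed to connect to")) {
                return true;
            }
            current = current.getCause();
        }
        return false;
    }

    // Converts a connection failure into the dedicated exception type so the
    // caller can tell "producer unreachable" apart from other I/O errors.
    static IOException translate(String partitionId, Throwable t) {
        if (isConnectionFailure(t)) {
            return new PartitionConnectionSketchException(partitionId, t);
        }
        return (t instanceof IOException) ? (IOException) t : new IOException(t);
    }

    public static void main(String[] args) {
        // Mirrors the shape of the sample exception: an IOException wrapping a connect error.
        Throwable connectError =
                new IOException(new RuntimeException("Failed to connect to example-host/192.0.2.1:9999"));
        Throwable otherError = new IOException("Checksum mismatch");

        System.out.println(translate("partition-0", connectError).getClass().getSimpleName());
        System.out.println(translate("partition-0", otherError).getClass().getSimpleName());
    }
}
```

Keeping the check in one helper means the reader's error path only decides which exception type to surface; errors that are not connection failures pass through unchanged.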