### What changes were proposed in this pull request?

Increment `COMMIT_FILES_FAIL_COUNT` when `TierWriter::close` times out.

### Why are the changes needed?

1. `COMMIT_FILES_FAIL_COUNT` stays at 0 even when we hit `SHUFFLE_DATA_LOST` caused by a commit files failure.

Spark executor log:
```
25/07/30 10:10:39 WARN CelebornShuffleReader: Handle fetch exceptions for 0-0
org.apache.celeborn.common.exception.CelebornIOException: Failed to load file group of shuffle 0 partition 441! Request GetReducerFileGroup(0,false,V1) return SHUFFLE_DATA_LOST for 0.
```

Spark driver log:
```
25/07/30 10:10:38 ERROR ReducePartitionCommitHandler: Failed to handle stageEnd for 0, lost file!
25/07/30 10:10:38 ERROR ReducePartitionCommitHandler: For shuffle application_1750652300305_10219240_1-0 partition data lost:
Lost partition 307-0 in worker [Host:hdc42-mcc10-01-0910-2704-064-tess0028.stratus.rno.ebay.com:RpcPort:9200:PushPort:9202:FetchPort:9201:ReplicatePort:9203]
Lost partition 1289-0 in worker [Host:hdc42-mcc10-01-0910-2704-064-tess0028.stratus.rno.ebay.com:RpcPort:9200:PushPort:9202:FetchPort:9201:ReplicatePort:9203]
```

Worker log:
```
java.io.IOException: Wait pending actions timeout.
	at org.apache.celeborn.service.deploy.worker.storage.TierWriterBase.waitOnNoPending(TierWriter.scala:158)
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Closes #3403 from turboFei/commit_failed.

Authored-by: Wang, Fei <fwang12@ebay.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
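The idea behind the change can be sketched in a few lines of Java. This is only an illustration, not the code from the patch: the `Writer` interface, the `commitFilesFailCount` field, and the `commitFiles` helper are hypothetical stand-ins, and counting one failure per writer is an assumed granularity. It shows a close timeout, surfaced as an `IOException`, being recorded in a failure counter instead of leaving the counter at 0.

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

public class CommitFailCountSketch {
  /** Hypothetical stand-in for a partition writer whose close() may time out. */
  interface Writer {
    void close() throws IOException;
  }

  /** Hypothetical stand-in for the worker's COMMIT_FILES_FAIL_COUNT metric. */
  static final AtomicLong commitFilesFailCount = new AtomicLong();

  /** Closes every writer, counts each failed close, and returns the number of failures. */
  static int commitFiles(List<Writer> writers) {
    int failed = 0;
    for (Writer writer : writers) {
      try {
        writer.close();
      } catch (IOException e) {
        // A close timeout is a commit failure, so it must show up in the metric.
        commitFilesFailCount.incrementAndGet();
        failed++;
      }
    }
    return failed;
  }

  public static void main(String[] args) {
    Writer ok = () -> {};
    Writer timedOut = () -> {
      throw new IOException("Wait pending actions timeout.");
    };
    int failed = commitFiles(List.of(ok, timedOut, ok));
    System.out.println("failed=" + failed + " counter=" + commitFilesFailCount.get());
  }
}
```

With a counter updated on this path, a `SHUFFLE_DATA_LOST` caused by a close timeout would no longer coincide with a fail count of 0.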