celeborn/worker
Fu Chen 9b1805e2ef [CELEBORN-1082] Fixing partition sorter task failures due to duplicate sorting
### What changes were proposed in this pull request?

As title

### Why are the changes needed?

While testing on the main branch, we found that a partition sorter task can fail with a `NoSuchFileException`, which causes the entire job to fail. The root cause is that the same sorting task can be added to the sorter queue more than once.

```
23/10/22 01:02:15,334 DEBUG [worker-file-sorter-execute-9530] PartitionFilesSorter: sort complete for application_1653035898918_4284043-9975 /data1/celeborn/worker/celeborn-worker/shuffle_data/application_1653035898918_4284043/9975/0-0-0
...
23/10/22 01:02:15,335 ERROR [worker-file-sorter-execute-9532] PartitionFilesSorter: Sorting shuffle file for application_1653035898918_4284043-9975-0-0-0 /data1/celeborn/worker/celeborn-worker/shuffle_data/application_1653035898918_4284043/9975/0-0-0 failed, detail:
java.nio.file.NoSuchFileException: /data1/celeborn/worker/celeborn-worker/shuffle_data/application_1653035898918_4284043/9975/0-0-0
        at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
        at sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:177)
        at java.nio.channels.FileChannel.open(FileChannel.java:287)
        at java.nio.channels.FileChannel.open(FileChannel.java:335)
        at org.apache.celeborn.common.util.FileChannelUtils.openReadableFileChannel(FileChannelUtils.java:33)
        at org.apache.celeborn.service.deploy.worker.storage.PartitionFilesSorter$FileSorter.initializeFiles(PartitionFilesSorter.java:641)
        at org.apache.celeborn.service.deploy.worker.storage.PartitionFilesSorter$FileSorter.sort(PartitionFilesSorter.java:559)
        at org.apache.celeborn.service.deploy.worker.storage.PartitionFilesSorter.lambda$null$0(PartitionFilesSorter.java:146)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
...
```

Before this PR, a sort request for a `fileId` could arrive just after that `fileId` had been removed from `sorting` (that is, after its sort had already completed), and the request could mistakenly be added to the sorter queue a second time. To address this, we moved the code block that checks the `fileId`'s status in `sorted` inside the `synchronized (sorting)` block. A task can then no longer be queued twice: once a `fileId`'s sorter task has completed and been removed from `sorting`, the `fileId` is guaranteed to be present in `sorted`. This is consistent with the behavior prior to version 0.3.0.
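The idea can be illustrated with a minimal sketch. This is not the actual `PartitionFilesSorter` code: the class `SorterQueueSketch`, the methods `requestSort`, `markSorted`, and `enqueuedCount`, and the counter standing in for the real sorter queue are all hypothetical names chosen for illustration. It assumes that completion records the `fileId` in `sorted` before removing it from `sorting`, as the description above implies.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical, simplified model of the duplicate-submission guard.
// It is NOT Celeborn's implementation; all names here are assumptions.
final class SorterQueueSketch {
    // Both sets are only touched while holding the lock on `sorting`.
    private final Set<String> sorting = new HashSet<>();
    private final Set<String> sorted = new HashSet<>();
    // Stand-in for the real sorter queue: counts how many tasks were enqueued.
    private int enqueued = 0;

    /** Returns true only if a new sort task was enqueued for fileId. */
    boolean requestSort(String fileId) {
        synchronized (sorting) {
            // Checking `sorted` under the same lock means a request cannot
            // slip in between "removed from sorting" and "seen in sorted".
            if (sorted.contains(fileId)) {
                return false; // already sorted, nothing to enqueue
            }
            if (!sorting.add(fileId)) {
                return false; // a sort for this file is already in flight
            }
            enqueued++;
            return true;
        }
    }

    /** Called when a sort finishes: record in sorted, then leave sorting. */
    void markSorted(String fileId) {
        synchronized (sorting) {
            sorted.add(fileId);
            sorting.remove(fileId);
        }
    }

    int enqueuedCount() {
        synchronized (sorting) {
            return enqueued;
        }
    }

    public static void main(String[] args) {
        SorterQueueSketch sketch = new SorterQueueSketch();
        System.out.println(sketch.requestSort("0-0-0")); // first request is enqueued
        sketch.markSorted("0-0-0");
        System.out.println(sketch.requestSort("0-0-0")); // late request is rejected
    }
}
```

In this sketch a request that arrives after the sort has finished sees the `fileId` in `sorted` and returns without enqueueing. If the `sorted` check were done outside the lock, the request could read a stale "not sorted" result, then find the `fileId` already gone from `sorting`, and enqueue the task again.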

### Does this PR introduce _any_ user-facing change?

No, only bug fix

### How was this patch tested?

Pass GA

Closes #2031 from cfmcgrady/fix-no-such-file-exception.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
2023-10-24 15:55:41 +08:00
src [CELEBORN-1082] Fixing partition sorter task failures due to duplicate sorting 2023-10-24 15:55:41 +08:00
pom.xml [CELEBORN-977] Support RocksDB as recover DB backend 2023-09-19 09:20:33 +08:00