celeborn/client-spark
SteNicholas 9cd6d96167 [CELEBORN-1700] Flink supports fallback to vanilla Flink built-in shuffle implementation
### What changes were proposed in this pull request?

Flink supports falling back to the vanilla Flink built-in shuffle implementation.

### Why are the changes needed?

When quota is insufficient or workers are unavailable, `RemoteShuffleMaster` cannot currently fall back to `NettyShuffleMaster`, and `RemoteShuffleEnvironment` cannot fall back to `NettyShuffleEnvironment`. Flink should support falling back to the vanilla Flink built-in shuffle implementation when quota is insufficient or workers are unavailable.

![Flink Shuffle Fallback](https://github.com/user-attachments/assets/538374b4-f14c-40f4-abfc-76e25b7af3ff)

### Does this PR introduce _any_ user-facing change?

- Introduce the `ShuffleFallbackPolicy` interface to determine whether to fall back to the vanilla Flink built-in shuffle implementation.

```
/**
 * The shuffle fallback policy determines whether fallback to vanilla Flink built-in shuffle
 * implementation.
 */
public interface ShuffleFallbackPolicy {

  /**
   * Returns whether fallback to vanilla flink built-in shuffle implementation.
   *
   * @param shuffleContext The job shuffle context of Flink.
   * @param celebornConf The configuration of Celeborn.
   * @param lifecycleManager The {@link LifecycleManager} of Celeborn.
   * @return Whether fallback to vanilla flink built-in shuffle implementation.
   */
  boolean needFallback(
      JobShuffleContext shuffleContext,
      CelebornConf celebornConf,
      LifecycleManager lifecycleManager);
}
```

- Introduce the `celeborn.client.flink.shuffle.fallback.policy` config to select the shuffle fallback policy.
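
To illustrate how a `ShuffleFallbackPolicy` might be implemented, here is a minimal sketch of a policy that falls back when quota is insufficient or workers are unavailable. It is not the implementation in this PR. `JobShuffleContext`, `CelebornConf` and `LifecycleManager` are replaced by empty stand-ins so the example compiles on its own, and `hasQuota()`, `hasAvailableWorkers()`, `QuotaOrWorkerFallbackPolicy` and `FallbackPolicySketch` are hypothetical names introduced only for this example.

```java
// Stand-ins for the Flink and Celeborn types so the sketch compiles on its own.
// The real JobShuffleContext, CelebornConf and LifecycleManager have different APIs.
final class JobShuffleContext {}

final class CelebornConf {}

final class LifecycleManager {
  private final boolean quotaAvailable;
  private final boolean workersAvailable;

  LifecycleManager(boolean quotaAvailable, boolean workersAvailable) {
    this.quotaAvailable = quotaAvailable;
    this.workersAvailable = workersAvailable;
  }

  // Hypothetical accessor, assumed for illustration only.
  boolean hasQuota() {
    return quotaAvailable;
  }

  // Hypothetical accessor, assumed for illustration only.
  boolean hasAvailableWorkers() {
    return workersAvailable;
  }
}

// Interface shape as given in the PR description.
interface ShuffleFallbackPolicy {
  boolean needFallback(
      JobShuffleContext shuffleContext,
      CelebornConf celebornConf,
      LifecycleManager lifecycleManager);
}

// Hypothetical policy: fall back when quota is insufficient or no worker is available.
final class QuotaOrWorkerFallbackPolicy implements ShuffleFallbackPolicy {
  @Override
  public boolean needFallback(
      JobShuffleContext shuffleContext,
      CelebornConf celebornConf,
      LifecycleManager lifecycleManager) {
    return !lifecycleManager.hasQuota() || !lifecycleManager.hasAvailableWorkers();
  }
}

class FallbackPolicySketch {
  public static void main(String[] args) {
    ShuffleFallbackPolicy policy = new QuotaOrWorkerFallbackPolicy();
    JobShuffleContext context = new JobShuffleContext();
    CelebornConf conf = new CelebornConf();

    // Quota and workers are both fine: keep using Celeborn shuffle.
    System.out.println(policy.needFallback(context, conf, new LifecycleManager(true, true)));
    // Quota is insufficient: fall back to the built-in shuffle.
    System.out.println(policy.needFallback(context, conf, new LifecycleManager(false, true)));
    // No workers are available: fall back to the built-in shuffle.
    System.out.println(policy.needFallback(context, conf, new LifecycleManager(true, false)));
  }
}
```

Running `FallbackPolicySketch` prints `false`, `true`, `true`. Keeping the decision behind a single boolean method means the caller only needs to choose between the Celeborn and the built-in shuffle path, without knowing why a fallback was requested.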

### How was this patch tested?

- `RemoteShuffleMasterSuiteJ#testRegisterJobWithForceFallbackPolicy`
- `WordCountTestBase#celeborn flink integration test with fallback - word count`

Closes #2932 from SteNicholas/CELEBORN-1700.

Authored-by: SteNicholas <programgeek@163.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2024-11-27 21:44:07 +08:00
| Name | Last commit | Date |
| --- | --- | --- |
| common | [CELEBORN-1700] Flink supports fallback to vanilla Flink built-in shuffle implementation | 2024-11-27 21:44:07 +08:00 |
| spark-2 | [CELEBORN-1700] Flink supports fallback to vanilla Flink built-in shuffle implementation | 2024-11-27 21:44:07 +08:00 |
| spark-2-shaded | [INFRA] Remove incubator/incubating for graduation | 2024-03-27 13:54:47 +08:00 |
| spark-3 | [CELEBORN-1700] Flink supports fallback to vanilla Flink built-in shuffle implementation | 2024-11-27 21:44:07 +08:00 |
| spark-3-columnar-common | [CELEBORN-912][FOLLOWUP] Support columnar shuffle for Spark 3.5 | 2024-09-05 14:26:54 +08:00 |
| spark-3-columnar-shuffle | [CELEBORN-912][FOLLOWUP] Support columnar shuffle for Spark 3.5 | 2024-09-05 14:26:54 +08:00 |
| spark-3-shaded | [CELEBORN-1616] Shade com.google.thirdparty to prevent dependency conflicts | 2024-09-27 17:50:23 +08:00 |
| spark-3.5-columnar-shuffle | [CELEBORN-912][FOLLOWUP] Support columnar shuffle for Spark 3.5 | 2024-09-05 16:52:47 +08:00 |