celeborn/client-spark
Wang, Fei f1bda46de4 [CELEBORN-1680] Introduce ShuffleFallbackCount metrics
### What changes were proposed in this pull request?

As the title says, introduce the `metrics_ShuffleFallbackCount_Value` metric.

### Why are the changes needed?
To provide insight into how many shuffles fall back to Spark's built-in shuffle service. This helps us deprecate the ESS progressively.

Currently, we plan to set `celeborn.client.spark.shuffle.fallback.numPartitionsThreshold` so that shuffles with too many partitions (for example, 50k) fall back.
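
The threshold check and the counter described above can be sketched as follows. This is a minimal illustration, not Celeborn's actual implementation: the class `FallbackSketch` and its methods are hypothetical, and the boundary behavior (a strict `>` comparison against the threshold) is an assumption.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: none of these names come from the Celeborn code base.
final class FallbackSketch {
    // Stands in for celeborn.client.spark.shuffle.fallback.numPartitionsThreshold.
    private final int numPartitionsThreshold;
    // Stands in for the value reported as ShuffleFallbackCount.
    private final AtomicLong shuffleFallbackCount = new AtomicLong();

    FallbackSketch(int numPartitionsThreshold) {
        this.numPartitionsThreshold = numPartitionsThreshold;
    }

    // Returns true when the shuffle should fall back to Spark's built-in
    // shuffle, and counts each fallback. The strict ">" is an assumption.
    boolean registerShuffle(int numPartitions) {
        boolean fallback = numPartitions > numPartitionsThreshold;
        if (fallback) {
            shuffleFallbackCount.incrementAndGet();
        }
        return fallback;
    }

    long fallbackCount() {
        return shuffleFallbackCount.get();
    }

    public static void main(String[] args) {
        FallbackSketch sketch = new FallbackSketch(50_000);
        int[] partitionCounts = {200, 50_000, 80_000};
        for (int n : partitionCounts) {
            System.out.println(n + " partitions -> fallback=" + sketch.registerShuffle(n));
        }
        System.out.println("ShuffleFallbackCount=" + sketch.fallbackCount());
    }
}
```

Tracking the count next to the decision keeps the metric in step with what actually happened to each shuffle, which is what makes it useful for judging how far ESS usage has dropped.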

In the future, we plan to limit the maximum acceptable number of shuffle partitions so that bad jobs are rejected and do not impact Celeborn master health.

### Does this PR introduce _any_ user-facing change?
Yes, new metrics.

### How was this patch tested?
UT.
<img width="1188" alt="image" src="https://github.com/user-attachments/assets/8193c12c-5dc9-4783-b64b-6a8449a1bea4">

Closes #2866 from turboFei/record_fallback.

Lead-authored-by: Wang, Fei <fwang12@ebay.com>
Co-authored-by: Fei Wang <cn.feiwang@gmail.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2024-11-07 11:42:17 +08:00
| Name | Last commit | Date |
| --- | --- | --- |
| common | [CELEBORN-1645] Introduce ShuffleFallbackPolicy to support custom implementation of shuffle fallback policy for CelebornShuffleFallbackPolicyRunner | 2024-10-15 21:57:04 +08:00 |
| spark-2 | [CELEBORN-1577][PHASE1] Storage quota should support interrupt shuffle | 2024-10-30 16:28:09 +08:00 |
| spark-2-shaded | [INFRA] Remove incubator/incubating for graduation | 2024-03-27 13:54:47 +08:00 |
| spark-3 | [CELEBORN-1680] Introduce ShuffleFallbackCount metrics | 2024-11-07 11:42:17 +08:00 |
| spark-3-columnar-common | [CELEBORN-912][FOLLOWUP] Support columnar shuffle for Spark 3.5 | 2024-09-05 14:26:54 +08:00 |
| spark-3-columnar-shuffle | [CELEBORN-912][FOLLOWUP] Support columnar shuffle for Spark 3.5 | 2024-09-05 14:26:54 +08:00 |
| spark-3-shaded | [CELEBORN-1616] Shade com.google.thirdparty to prevent dependency conflicts | 2024-09-27 17:50:23 +08:00 |
| spark-3.5-columnar-shuffle | [CELEBORN-912][FOLLOWUP] Support columnar shuffle for Spark 3.5 | 2024-09-05 16:52:47 +08:00 |