celeborn/client-mr
Gaurav Mittal cde33d953b [CELEBORN-894] End to End Integrity Checks
### What changes were proposed in this pull request?
Design doc - https://docs.google.com/document/d/1YqK0kua-5rMufJw57kEIrHHGbLnAF9iXM5GdDweMzzg/edit?tab=t.0#heading=h.n5ldma432qnd

- End to End integrity checks provide additional confidence that Celeborn is producing complete as well as correct data
- The checks are hidden behind a client side config that is false by default. Provides users optionality to enable these if required on a per app basis
- Only compatible with Spark at the moment
- No support for Flink (can be considered in future)
- No support for Columnar Shuffle (can be considered in future)

Writer
- Whenever a mapper completes, it reports crc32 and bytes written on a per partition basis to the driver

Driver
- Driver aggregates the mapper reports - and computes aggregated CRC32 and bytes written on per partitionID basis

Reader
- Each CelebornInputStream will report (int shuffleId, int partitionId, int startMapIndex, int endMapIndex, int crc32, long bytes) to driver when it finished reading all data on the stream
- On every report
  - Driver will aggregate the CRC32 and bytesRead for the partitionID
  - Driver will aggregate mapRange to determine when all sub-paritions have been read for partitionID have been read
  - It will then compare the aggregated CRC32 and bytes read with the expected CRC32 and bytes written for the partition
  - There is special handling for skewhandlingwithoutMapRangeSplit scenario as well
  - In this case, we report the number of sub-partitions and index of the sub-partition instead of startMapIndex and endMapIndex

There is separate handling for skew handling with and without map range split

As a follow up, I will do another PR that will harden up the checks and perform additional checks to add book keeping that every CelebornInputStream makes the required checks

### Why are the changes needed?
https://issues.apache.org/jira/browse/CELEBORN-894

Note: I am putting up this PR even though some tests are failing, since I want to get some early feedback on the code changes.

### Does this PR introduce _any_ user-facing change?
Not sure how to answer this. A new client side config is available to enable the checks if required

### How was this patch tested?
Unit tests + Integration tests

Closes #3261 from gauravkm/gaurav/e2e_checks_v3.

Lead-authored-by: Gaurav Mittal <gaurav@stripe.com>
Co-authored-by: Gaurav Mittal <gauravkm@gmail.com>
Co-authored-by: Fei Wang <cn.feiwang@gmail.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2025-06-28 09:19:57 +08:00
..
mr [CELEBORN-894] End to End Integrity Checks 2025-06-28 09:19:57 +08:00
mr-shaded [INFRA] Remove incubator/incubating for graduation 2024-03-27 13:54:47 +08:00