### What changes were proposed in this pull request?
Design doc - https://docs.google.com/document/d/1YqK0kua-5rMufJw57kEIrHHGbLnAF9iXM5GdDweMzzg/edit?tab=t.0#heading=h.n5ldma432qnd
- End to End integrity checks provide additional confidence that Celeborn is producing complete as well as correct data
- The checks are hidden behind a client side config that is false by default. Provides users optionality to enable these if required on a per app basis
- Only compatible with Spark at the moment
- No support for Flink (can be considered in future)
- No support for Columnar Shuffle (can be considered in future)
Writer
- Whenever a mapper completes, it reports crc32 and bytes written on a per partition basis to the driver
Driver
- Driver aggregates the mapper reports - and computes aggregated CRC32 and bytes written on per partitionID basis
Reader
- Each CelebornInputStream will report (int shuffleId, int partitionId, int startMapIndex, int endMapIndex, int crc32, long bytes) to driver when it finished reading all data on the stream
- On every report
- Driver will aggregate the CRC32 and bytesRead for the partitionID
- Driver will aggregate mapRange to determine when all sub-paritions have been read for partitionID have been read
- It will then compare the aggregated CRC32 and bytes read with the expected CRC32 and bytes written for the partition
- There is special handling for skewhandlingwithoutMapRangeSplit scenario as well
- In this case, we report the number of sub-partitions and index of the sub-partition instead of startMapIndex and endMapIndex
There is separate handling for skew handling with and without map range split
As a follow up, I will do another PR that will harden up the checks and perform additional checks to add book keeping that every CelebornInputStream makes the required checks
### Why are the changes needed?
https://issues.apache.org/jira/browse/CELEBORN-894
Note: I am putting up this PR even though some tests are failing, since I want to get some early feedback on the code changes.
### Does this PR introduce _any_ user-facing change?
Not sure how to answer this. A new client side config is available to enable the checks if required
### How was this patch tested?
Unit tests + Integration tests
Closes#3261 from gauravkm/gaurav/e2e_checks_v3.
Lead-authored-by: Gaurav Mittal <gaurav@stripe.com>
Co-authored-by: Gaurav Mittal <gauravkm@gmail.com>
Co-authored-by: Fei Wang <cn.feiwang@gmail.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>