celeborn/client-flink
zhongqiang.czq b66eaff880 [CELEBORN-627][FLINK] Support split partitions
### What changes were proposed in this pull request?
In MapPartiitoin, datas are split into regions.

1. Unlike ReducePartition whose partition split can occur on pushing data
to keep MapPartition data ordering,  PartitionSplit only be done on the time of sending PushDataHandShake or RegionStart messages (As shown in the following image). That's to say that the partition split only appear at the beginnig of a region but not inner a region.
> Notice: if the client side think that it's failed to push HandShake or RegionStart messages. but the worker side can still receive normal HandShake/RegionStart message. After client revive succss, it don't push any messages to old partition, so the worker having the old partition will create a empty file. After committing files, the worker will return empty commitids. That's to say that empty file will be filterd after committing files and ReduceTask will not read any empty files.

![image](https://github.com/apache/incubator-celeborn/assets/96606293/468fd660-afbc-42c1-b111-6643f5c1e944)

2. PushData/RegioinFinish don't care the following cases:
 - Diskfull
 - ExceedPartitionSplitThreshold
 - Worker ShuttingDown
so if one of the above three conditions appears, PushData and RegionFinish cant still do as normal. Workers should consider the ShuttingDown case and  try best to wait all the regions finished before shutting down.

if PushData or RegionFinish failed like network timeout and so on, then MapTask will failed and start another attempte maptask.

![image](https://github.com/apache/incubator-celeborn/assets/96606293/db9f9166-2085-4be1-b09e-cf73b469c55b)

3. how shuffle read supports partition split?
ReduceTask should get split paritions by order and open the stream by partition epoc orderly

### Why are the changes needed?
PartiitonSplit is not supported by MapPartition from now.
There still a risk that  a partition file'size is too large to store the file on worker disk.
To avoid this risk, this pr introduces partition split in shuffle read and shuffle write.

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
UT and manual TPCDS test

Closes #1550 from FMX/CELEBORN-627.

Lead-authored-by: zhongqiang.czq <zhongqiang.czq@alibaba-inc.com>
Co-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Co-authored-by: Ethan Feng <ethanfeng@apache.org>
Signed-off-by: zhongqiang.czq <zhongqiang.czq@alibaba-inc.com>
2023-09-01 19:25:51 +08:00
..
common [CELEBORN-627][FLINK] Support split partitions 2023-09-01 19:25:51 +08:00
flink-1.14 [CELEBORN-717][FLINK][FOLLOWUP] Fix ResultPartition lost numBytesOut/numBuffersOut metrics 2023-06-27 21:47:41 +08:00
flink-1.14-shaded [CELEBORN-716][BUILD] Correct the to name when renaming the Netty native library 2023-06-26 21:57:06 +08:00
flink-1.15 [CELEBORN-717][FLINK][FOLLOWUP] Fix ResultPartition lost numBytesOut/numBuffersOut metrics 2023-06-27 21:47:41 +08:00
flink-1.15-shaded [CELEBORN-716][BUILD] Correct the to name when renaming the Netty native library 2023-06-26 21:57:06 +08:00
flink-1.17 [CELEBORN-717][FLINK][FOLLOWUP] Fix ResultPartition lost numBytesOut/numBuffersOut metrics 2023-06-27 21:47:41 +08:00
flink-1.17-shaded [CELEBORN-716][BUILD] Correct the to name when renaming the Netty native library 2023-06-26 21:57:06 +08:00