celeborn/multipart-uploader
nicolas.fraison@datadoghq.com 061cdc3820 [CELEBORN-2003] Add retry mechanism when completing S3 multipart upload
### What changes were proposed in this pull request?

Add a retry mechanism when completing S3 multipart upload to ensure that completeMultipartUpload is retry when facing retryable exception like SlowDown one

### Why are the changes needed?

While running a “simple” spark jobs creating 10TiB of shuffle data (repartition from 100k partition to 20) the job was constantly failing when all files should be committed. relying on SOFT `celeborn.client.shuffle.partitionSplit.mode`

Despite an increase of `celeborn.storage.s3.mpu.maxRetries` up to `200`. Job was still failing due to SlowDown exception
Adding some debug logs on the retry policy from AWS S3 SDK I've seen that the policy is never called when doing completeMultipartUpload action while it is well called on other actions. See https://issues.apache.org/jira/browse/CELEBORN-2003

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Created a cluster on a kubernetes server relying on S3 storage.
Launch a 10TiB shuffle from 100000 partitions to 200 partitions with SOFT `celeborn.client.shuffle.partitionSplit.mode`
The job succeed and well display some warn logs indicating that the `completeMultipartUpload` is retried due to SlowDown:
```
bucket ******* key poc/spark-2c86663c948243d19c127e90f704a3d5/0/35-39-0 uploadId Pbaq.pp1qyLvtGbfZrMwA8RgLJ4QYanAMhmv0DvKUk0m6.GlCKdC3ICGngn7Q7iIa0Dw1h3wEn78EoogMlYgFD6.tDqiatOTbFprsNkk0qzLu9KY8YCC48pqaINcvgi8c1gQKKhsf1zZ.5Et5j40wQ-- upload failed to complete, will retry (1/10)
com.amazonaws.services.s3.model.AmazonS3Exception: Please reduce your request rate. (Service: null; Status Code: 0; Error Code: SlowDown; Request ID: RAV5MXX3B9Z3ZHTG; S3 Extended Request ID: 9Qqm3vfJVLFNY1Y3yKAobJHv7JkHQP2+v8hGSW2HYIOputAtiPdkqkY5MfD66lEzAl45m71aiPVB0f1TxTUD+upUo0NxXp6S; Proxy: null), S3 Extended Request ID: 9Qqm3vfJVLFNY1Y3yKAobJHv7JkHQP2+v8hGSW2HYIOputAtiPdkqkY5MfD66lEzAl45m71aiPVB0f1TxTUD+upUo0NxXp6S 	at com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$CompleteMultipartUploadHandler.doEndElement(XmlResponsesSaxParser.java:1906)
```

Closes #3293 from ashangit/nfraison/CELEBORN-2003.

Authored-by: nicolas.fraison@datadoghq.com <nicolas.fraison@datadoghq.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2025-06-06 10:15:26 +08:00
..
multipart-uploader-oss [CELEBORN-1916] Support Aliyun OSS Based on MPU Extension Interface 2025-04-08 15:10:33 +08:00
multipart-uploader-s3 [CELEBORN-2003] Add retry mechanism when completing S3 multipart upload 2025-06-06 10:15:26 +08:00