celeborn/docs/configuration
nicolas.fraison@datadoghq.com 061cdc3820 [CELEBORN-2003] Add retry mechanism when completing S3 multipart upload
### What changes were proposed in this pull request?

Add a retry mechanism when completing an S3 multipart upload, so that `completeMultipartUpload` is retried when it hits a retryable exception such as SlowDown.
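
The idea can be sketched as a small retry loop around the completion call. This is an illustrative sketch only, not the code from the patch: the class `CompleteWithRetry`, the helper `callWithRetry`, the stand-in `RetryableException`, and the exponential backoff are all assumptions made for the example. In the real change the wrapped action would be the S3 client's `completeMultipartUpload` call and the retryable condition would be derived from the S3 error (for example SlowDown).

```java
import java.util.concurrent.Callable;

public class CompleteWithRetry {

    /** Hypothetical stand-in for a retryable service error such as S3 SlowDown. */
    public static class RetryableException extends RuntimeException {
        public RetryableException(String message) {
            super(message);
        }
    }

    /**
     * Runs the action and retries it up to maxRetries times when it throws a
     * RetryableException. Any other exception is propagated immediately.
     * The exponential backoff is an assumption of this sketch.
     */
    public static <T> T callWithRetry(Callable<T> action, int maxRetries, long baseSleepMs)
            throws Exception {
        int attempt = 0;
        while (true) {
            try {
                return action.call();
            } catch (RetryableException e) {
                attempt++;
                if (attempt > maxRetries) {
                    throw e; // retries exhausted, surface the last failure
                }
                System.err.printf(
                    "upload failed to complete, will retry (%d/%d)%n", attempt, maxRetries);
                // Back off before the next attempt; cap the shift to avoid overflow.
                Thread.sleep(baseSleepMs * (1L << Math.min(attempt - 1, 10)));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated completion call that fails twice with a retryable error.
        String result = callWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new RetryableException("SlowDown");
            }
            return "completed";
        }, 10, 1L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Keeping the retry outside the SDK call means it does not depend on the SDK's own retry policy being invoked for this action.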

### Why are the changes needed?

While running a “simple” Spark job that creates 10 TiB of shuffle data (a repartition from 100k partitions to 20) with `celeborn.client.shuffle.partitionSplit.mode` set to SOFT, the job consistently failed at the point where all files should be committed.

Even after increasing `celeborn.storage.s3.mpu.maxRetries` to `200`, the job still failed with SlowDown exceptions.
After adding debug logs to the retry policy of the AWS S3 SDK, I saw that the policy is never invoked for the completeMultipartUpload action, although it is invoked for other actions. See https://issues.apache.org/jira/browse/CELEBORN-2003

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Created a cluster on a Kubernetes server relying on S3 storage.
Launched a 10 TiB shuffle from 100000 partitions to 200 partitions with `celeborn.client.shuffle.partitionSplit.mode` set to SOFT.
The job succeeded and displayed warn logs indicating that `completeMultipartUpload` was retried due to SlowDown:
```
bucket ******* key poc/spark-2c86663c948243d19c127e90f704a3d5/0/35-39-0 uploadId Pbaq.pp1qyLvtGbfZrMwA8RgLJ4QYanAMhmv0DvKUk0m6.GlCKdC3ICGngn7Q7iIa0Dw1h3wEn78EoogMlYgFD6.tDqiatOTbFprsNkk0qzLu9KY8YCC48pqaINcvgi8c1gQKKhsf1zZ.5Et5j40wQ-- upload failed to complete, will retry (1/10)
com.amazonaws.services.s3.model.AmazonS3Exception: Please reduce your request rate. (Service: null; Status Code: 0; Error Code: SlowDown; Request ID: RAV5MXX3B9Z3ZHTG; S3 Extended Request ID: 9Qqm3vfJVLFNY1Y3yKAobJHv7JkHQP2+v8hGSW2HYIOputAtiPdkqkY5MfD66lEzAl45m71aiPVB0f1TxTUD+upUo0NxXp6S; Proxy: null), S3 Extended Request ID: 9Qqm3vfJVLFNY1Y3yKAobJHv7JkHQP2+v8hGSW2HYIOputAtiPdkqkY5MfD66lEzAl45m71aiPVB0f1TxTUD+upUo0NxXp6S
	at com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$CompleteMultipartUploadHandler.doEndElement(XmlResponsesSaxParser.java:1906)
```

Closes #3293 from ashangit/nfraison/CELEBORN-2003.

Authored-by: nicolas.fraison@datadoghq.com <nicolas.fraison@datadoghq.com>
Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
2025-06-06 10:15:26 +08:00
client.md [CELEBORN-2005][FOLLOWUP] Introduce ShuffleMetricGroup for numBytesIn, numBytesOut, numRecordsOut, numBytesInPerSecond, numBytesOutPerSecond, numRecordsOutPerSecond metrics 2025-05-30 14:54:28 +08:00
columnar-shuffle.md [CELEBORN-1051] Add isDynamic property for CelebornConf 2024-02-20 14:20:44 +08:00
ha.md [CELEBORN-1400] Bump Ratis version from 2.5.1 to 3.0.1 2024-05-30 17:22:22 +08:00
index.md [MINOR] Add documentation for CELEBORN_NO_DAEMONIZE 2024-12-23 10:31:37 +08:00
master.md [CELEBORN-2018] Support min number of workers selected for shuffle 2025-06-01 08:23:53 -07:00
metrics.md [CELEBORN-1974] ApplicationId as metrics label should be behind a config flag 2025-05-12 21:05:45 -07:00
network-module.md [CELEBORN-1353] Document Celeborn security - authentication and SSL support 2024-04-30 14:37:56 +08:00
network.md [MINOR] Change some config version 2025-05-21 16:39:02 -07:00
quota.md [CELEBORN-1577][PHASE2] QuotaManager should support interrupt shuffle 2025-03-24 22:05:45 +08:00
worker.md [CELEBORN-2003] Add retry mechanism when completing S3 multipart upload 2025-06-06 10:15:26 +08:00