celeborn/docs/configuration
mingji 7a0eee332a [CELEBORN-2045] Add logger sinks to allow persist metrics data and avoid possible worker OOM
### What changes were proposed in this pull request?
1. Add a new sink and allow the user to store metrics to files.
2. Celeborn will scrape its metrics periodically so that the accumulated metric data cannot grow large enough to cause an OOM.
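
The two points above can be sketched roughly as follows. This is an illustrative Python sketch, not Celeborn's implementation: `TimerMetricBuffer`, `LoggerSink`, and `scrape_once` are hypothetical names, and the summary format is an assumption. It only shows why draining on every scrape keeps the in-memory queue bounded by what arrives within one scrape interval.

```python
import logging
import threading
from collections import deque


class TimerMetricBuffer:
    """Hypothetical stand-in for a per-metric queue of timing samples."""

    def __init__(self, name):
        self.name = name
        self._samples = deque()
        self._lock = threading.Lock()

    def record(self, millis):
        # Called on the hot path; samples accumulate until the next scrape.
        with self._lock:
            self._samples.append(millis)

    def drain(self):
        # Swap the queue out under the lock so recorders wait only briefly.
        with self._lock:
            drained, self._samples = self._samples, deque()
        return list(drained)

    def __len__(self):
        with self._lock:
            return len(self._samples)


class LoggerSink:
    """Hypothetical sink: each scrape drains the buffers and logs a summary."""

    def __init__(self, buffers, logger):
        self._buffers = buffers
        self._logger = logger

    def scrape_once(self):
        # Returns how many samples were drained in this pass.
        total = 0
        for buf in self._buffers:
            samples = buf.drain()
            total += len(samples)
            if samples:
                self._logger.info(
                    "%s count=%d max=%d", buf.name, len(samples), max(samples)
                )
        return total
```

In a real deployment a timer would call `scrape_once` at a fixed interval, and attaching a `logging.FileHandler` to the logger would persist the summaries to a file.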

### Why are the changes needed?
A long-running worker ran out of memory, and the heap dump showed that the metrics data was huge.
As shown below, the biggest object is the time metric queue, which held 1.6 million records.
<img width="1516" alt="Screenshot 2025-06-24 at 09 59 30" src="https://github.com/user-attachments/assets/691c7bc2-b974-4cc0-8d5a-bf626ab903c0" />
<img width="1239" alt="Screenshot 2025-06-24 at 14 45 10" src="https://github.com/user-attachments/assets/ebdf5a4d-c941-4f1e-911f-647aa156b37a" />

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
Tested on a cluster.

Closes #3346 from FMX/b2045.

Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Co-authored-by: Ethan Feng <ethanfeng@apache.org>
Co-authored-by: Fei Wang <fwang12@ebay.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
2025-06-26 18:42:20 -07:00

| File | Last commit | Date |
|---|---|---|
| client.md | [CELEBORN-2005][FOLLOWUP] Introduce ShuffleMetricGroup for numBytesIn, numBytesOut, numRecordsOut, numBytesInPerSecond, numBytesOutPerSecond, numRecordsOutPerSecond metrics | 2025-05-30 14:54:28 +08:00 |
| columnar-shuffle.md | [CELEBORN-1051] Add isDynamic property for CelebornConf | 2024-02-20 14:20:44 +08:00 |
| ha.md | [CELEBORN-1400] Bump Ratis version from 2.5.1 to 3.0.1 | 2024-05-30 17:22:22 +08:00 |
| index.md | [MINOR] Add documentation for CELEBORN_NO_DAEMONIZE | 2024-12-23 10:31:37 +08:00 |
| master.md | [CELEBORN-2018] Support min number of workers selected for shuffle | 2025-06-01 08:23:53 -07:00 |
| metrics.md | [CELEBORN-2045] Add logger sinks to allow persist metrics data and avoid possible worker OOM | 2025-06-26 18:42:20 -07:00 |
| network-module.md | [CELEBORN-1353] Document Celeborn security - authentication and SSL support | 2024-04-30 14:37:56 +08:00 |
| network.md | [MINOR] Change some config version | 2025-05-21 16:39:02 -07:00 |
| quota.md | [CELEBORN-1577][PHASE2] QuotaManager should support interrupt shuffle | 2025-03-24 22:05:45 +08:00 |
| worker.md | [CELEBORN-2003] Add retry mechanism when completing S3 multipart upload | 2025-06-06 10:15:26 +08:00 |