celeborn/conf
mingji 7a0eee332a [CELEBORN-2045] Add logger sinks to allow persist metrics data and avoid possible worker OOM
### What changes were proposed in this pull request?
1. Add a new logger sink that lets users persist metrics to files.
2. Celeborn now scrapes its own metrics periodically so that accumulated metric data cannot grow large enough to cause an OOM.
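
The periodic scrape in item 2 can be pictured with a minimal sketch. This is not Celeborn's actual implementation: `TimerMetricBuffer`, `scrape_once`, and the metric name `ExampleTime` are hypothetical names used only for illustration. The idea is that each scrape drains the buffered time-metric records and hands them to a sink, so the in-memory queue is emptied at every cycle instead of growing without bound.

```python
from collections import deque
import threading


class TimerMetricBuffer:
    """Hypothetical stand-in for a time-metric queue (not a Celeborn class)."""

    def __init__(self):
        self._records = deque()
        self._lock = threading.Lock()

    def record(self, name, duration_ms):
        # Called on the hot path whenever a timed operation finishes.
        with self._lock:
            self._records.append((name, duration_ms))

    def drain(self):
        # Return everything buffered so far and empty the queue.
        with self._lock:
            drained = list(self._records)
            self._records.clear()
        return drained

    def __len__(self):
        with self._lock:
            return len(self._records)


def scrape_once(buffer, sink):
    """Run one scrape cycle: drain the buffer and pass each record to the sink.

    `sink` is any callable taking a string, e.g. a logger method or list.append.
    Returns the number of records written.
    """
    records = buffer.drain()
    for name, duration_ms in records:
        sink(f"{name} {duration_ms}")
    return len(records)
```

Calling `scrape_once` from a timer at a fixed interval, with a file-backed logger as the sink, would keep the buffer size proportional to the traffic within one interval rather than to the worker's total uptime.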

### Why are the changes needed?
A long-running worker ran out of memory, and the heap dump showed that the metrics data had grown very large.
As the screenshots below show, the largest object is the time metric queue, which held 1.6 million records.
<img width="1516" alt="Screenshot 2025-06-24 at 09 59 30" src="https://github.com/user-attachments/assets/691c7bc2-b974-4cc0-8d5a-bf626ab903c0" />
<img width="1239" alt="Screenshot 2025-06-24 at 14 45 10" src="https://github.com/user-attachments/assets/ebdf5a4d-c941-4f1e-911f-647aa156b37a" />

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Tested on a cluster.

Closes #3346 from FMX/b2045.

Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Co-authored-by: Ethan Feng <ethanfeng@apache.org>
Co-authored-by: Fei Wang <fwang12@ebay.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
2025-06-26 18:42:20 -07:00
..
celeborn-defaults.conf.template [CELEBORN-1455] Remove improper configs from config template 2024-06-11 15:35:53 +08:00
celeborn-env.sh.template [MINOR] Add documentation for CELEBORN_NO_DAEMONIZE 2024-12-23 10:31:37 +08:00
dynamicConfig.yaml.template [CELEBORN-1594] Refine dynamicConfig template and prevent NPE 2024-09-15 22:11:23 +08:00
hosts.template
log4j2.xml.template [CELEBORN-2045] Add logger sinks to allow persist metrics data and avoid possible worker OOM 2025-06-26 18:42:20 -07:00
metrics.properties.template [CELEBORN-2045] Add logger sinks to allow persist metrics data and avoid possible worker OOM 2025-06-26 18:42:20 -07:00
ratis-log4j.properties.template