celeborn/docs/configuration/metrics.md
mingji 7a0eee332a [CELEBORN-2045] Add logger sinks to allow persist metrics data and avoid possible worker OOM
### What changes were proposed in this pull request?
1. Add a new sink that lets users persist metrics to files.
2. Celeborn scrapes its own metrics periodically so that the accumulated metric data cannot grow large enough to cause an OOM.

### Why are the changes needed?
A long-running worker ran out of memory, and the heap dump showed that the metrics data was huge.
As the screenshots below show, the biggest object is the time metric queue, which held 1.6 million records.
<img width="1516" alt="Screenshot 2025-06-24 at 09 59 30" src="https://github.com/user-attachments/assets/691c7bc2-b974-4cc0-8d5a-bf626ab903c0" />
<img width="1239" alt="Screenshot 2025-06-24 at 14 45 10" src="https://github.com/user-attachments/assets/ebdf5a4d-c941-4f1e-911f-647aa156b37a" />

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
Cluster.

Closes #3346 from FMX/b2045.

Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Co-authored-by: Ethan Feng <ethanfeng@apache.org>
Co-authored-by: Fei Wang <fwang12@ebay.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>
2025-06-26 18:42:20 -07:00


license
Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to You under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at https://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
| Key | Default | isDynamic | Description | Since | Deprecated |
| --- | ------- | --------- | ----------- | ----- | ---------- |
| celeborn.metrics.capacity | 4096 | false | The maximum number of metrics which a source can use to generate output strings. | 0.2.0 |  |
| celeborn.metrics.collectPerfCritical.enabled | false | false | Controls whether to collect metrics which may affect performance. When enabled, Celeborn collects them. | 0.2.0 |  |
| celeborn.metrics.conf | &lt;undefined&gt; | false | Custom metrics configuration file path. By default, metrics.properties on the classpath is used. | 0.3.0 |  |
| celeborn.metrics.enabled | true | false | When true, enable the metrics system. | 0.2.0 |  |
| celeborn.metrics.extraLabels |  | false | If the default metric labels are not enough, extra metric labels can be customized. Labels' pattern is: &lt;label1_key&gt;=&lt;label1_value&gt;[,&lt;label2_key&gt;=&lt;label2_value&gt;]*; e.g. env=prod,version=1 | 0.3.0 |  |
| celeborn.metrics.json.path | /metrics/json | false | URI context path of the JSON metrics HTTP server. | 0.4.0 |  |
| celeborn.metrics.json.pretty.enabled | true | false | When true, metrics are served as pretty-printed JSON. | 0.4.0 |  |
| celeborn.metrics.loggerSink.output.enabled | false | false | Whether to output scraped metrics to the logger. This config takes effect only if the logger sink is enabled. If you do not scrape metrics periodically, add org.apache.celeborn.common.metrics.sink.LoggerSink to metrics.properties. | 0.6.0 |  |
| celeborn.metrics.loggerSink.scrape.interval | 30min | false | The interval at which the logger sink scrapes its own metrics. This config takes effect only if the logger sink is enabled. If you do not scrape metrics periodically, add org.apache.celeborn.common.metrics.sink.LoggerSink to metrics.properties. | 0.6.0 |  |
| celeborn.metrics.prometheus.path | /metrics/prometheus | false | URI context path of the Prometheus metrics HTTP server. | 0.4.0 |  |
| celeborn.metrics.sample.rate | 1.0 | false | Controls whether Celeborn collects timer metrics for some operations. Its value should be in [0.0, 1.0]. | 0.2.0 |  |
| celeborn.metrics.timer.slidingWindow.size | 4096 | false | The sliding window size of the timer metric. | 0.2.0 |  |
| celeborn.metrics.worker.app.topResourceConsumption.bytesWrittenThreshold | 0b | false | Bytes-written threshold for the worker's top resource consumption applications list. An application whose bytes written, counting both diskBytesWritten and hdfsBytesWritten, are below this threshold is not included in the list. | 0.6.0 |  |
| celeborn.metrics.worker.app.topResourceConsumption.count | 0 | false | Number of entries in the worker's top resource consumption applications list. Top resource consumption is determined by the sum of diskBytesWritten and hdfsBytesWritten. This count keeps the total number of metrics from exceeding the metrics capacity. Note: this adds applicationId as a label, which is considered a high cardinality label; be careful enabling it on metrics systems that are not optimized for high cardinality columns. | 0.6.0 |  |
| celeborn.metrics.worker.appLevel.enabled | true | false | When true, enable worker application level metrics. Note: applicationId is considered a high cardinality label; be careful enabling it on metrics systems that are not optimized for high cardinality columns. | 0.6.0 |  |
| celeborn.metrics.worker.pauseSpentTime.forceAppend.threshold | 10 | false | Force appending the worker pause spent time even if the worker is still in the pause serving state. This helps users see the pause spent time increasing when the worker stays in the paused state. |  |  |
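
The logger sink options above could be combined as in the following minimal sketch. It assumes the settings live in `conf/celeborn-defaults.conf` in the usual space-separated `key value` form, and that sinks are registered in `metrics.properties` with a Spark-style `*.sink.<name>.class` entry. The sink name `loggerSink` and the 10min interval are illustrative choices, not values taken from this page, so check them against the `metrics.properties.template` shipped with your Celeborn release.

```properties
# conf/celeborn-defaults.conf (assumed location)
celeborn.metrics.enabled true
# Write the scraped metrics to the logger as well (default: false)
celeborn.metrics.loggerSink.output.enabled true
# Scrape more often than the 30min default (illustrative value)
celeborn.metrics.loggerSink.scrape.interval 10min
```

```properties
# metrics.properties (sink name "loggerSink" is an assumption)
*.sink.loggerSink.class=org.apache.celeborn.common.metrics.sink.LoggerSink
```

Per the descriptions in the table, both `loggerSink` settings take effect only once the sink class is registered, which is why the two files are shown together.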