celeborn/metrics.md at main

mingji 7a0eee332a [CELEBORN-2045] Add logger sinks to allow persist metrics data and avoid possible worker OOM

### What changes were proposed in this pull request?
1. Add a new sink and allow the user to store metrics to files.
2. Celeborn will scrape its metrics periodically to make sure that the metric data won't be too large to cause OOM.

### Why are the changes needed?
A long-running worker ran out of memory and found out that the metrics are huge in the heap dump.
As you can see below, the biggest object is the time metric queue, and I got 1.6 million records.
<img width="1516" alt="Screenshot 2025-06-24 at 09 59 30" src="https://github.com/user-attachments/assets/691c7bc2-b974-4cc0-8d5a-bf626ab903c0" />
<img width="1239" alt="Screenshot 2025-06-24 at 14 45 10" src="https://github.com/user-attachments/assets/ebdf5a4d-c941-4f1e-911f-647aa156b37a" />

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
Cluster.

Closes #3346 from FMX/b2045.

Lead-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Co-authored-by: Ethan Feng <ethanfeng@apache.org>
Co-authored-by: Fei Wang <fwang12@ebay.com>
Signed-off-by: Wang, Fei <fwang12@ebay.com>

2025-06-26 18:42:20 -07:00

4.4 KiB

Raw Permalink Blame History

license
Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to You under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at https://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

license

Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to You under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at https://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Key	Default	isDynamic	Description	Since
celeborn.metrics.capacity	4096	false	The maximum number of metrics which a source can use to generate output strings.	0.2.0
celeborn.metrics.collectPerfCritical.enabled	false	false	It controls whether to collect metrics which may affect performance. When enable, Celeborn collects them.	0.2.0
celeborn.metrics.conf	<undefined>	false	Custom metrics configuration file path. Default use `metrics.properties` in classpath.	0.3.0
celeborn.metrics.enabled	true	false	When true, enable metrics system.	0.2.0
celeborn.metrics.extraLabels		false	If default metric labels are not enough, extra metric labels can be customized. Labels' pattern is: `<label1_key>=<label1_value>[,<label2_key>=<label2_value>]*`; e.g. `env=prod,version=1`	0.3.0
celeborn.metrics.json.path	/metrics/json	false	URI context path of json metrics HTTP server.	0.4.0
celeborn.metrics.json.pretty.enabled	true	false	When true, view metrics in json pretty format	0.4.0
celeborn.metrics.loggerSink.output.enabled	false	false	Whether to output scraped metrics to the logger. This config will have effect if you enabled logger sink.If you will not scrape metrics periodically, do add `org.apache.celeborn.common.metrics.sink.LoggerSink` to metrics.properties.	0.6.0
celeborn.metrics.loggerSink.scrape.interval	30min	false	The interval of logger sink to scrape its own metrics. This config will have effect if you enabled logger sink. If you will not scrape metrics periodically, do add `org.apache.celeborn.common.metrics.sink.LoggerSink` to metrics.properties.	0.6.0
celeborn.metrics.prometheus.path	/metrics/prometheus	false	URI context path of prometheus metrics HTTP server.	0.4.0
celeborn.metrics.sample.rate	1.0	false	It controls if Celeborn collect timer metrics for some operations. Its value should be in [0.0, 1.0].	0.2.0
celeborn.metrics.timer.slidingWindow.size	4096	false	The sliding window size of timer metric.	0.2.0
celeborn.metrics.worker.app.topResourceConsumption.bytesWrittenThreshold	0b	false	Threshold of bytes written for top resource consumption applications list of worker. The application which has bytes written less than this threshold will not be included in the top resource consumption list, including diskBytesWritten and hdfsBytesWritten.	0.6.0
celeborn.metrics.worker.app.topResourceConsumption.count	0	false	Size for top items about top resource consumption applications list of worker. The top resource consumption is determined by sum of diskBytesWritten and hdfsBytesWritten. The top resource consumption count prevents the total number of metrics from exceeding the metrics capacity. Note: This will add applicationId as label which is considered as a high cardinality label, be careful enabling it on metrics systems that are not optimized for high cardinality columns.	0.6.0
celeborn.metrics.worker.appLevel.enabled	true	false	When true, enable worker application level metrics. Note: applicationId is considered as a high cardinality label, be careful enabling it on metrics systems that are not optimized for high cardinality columns.	0.6.0
celeborn.metrics.worker.pauseSpentTime.forceAppend.threshold	10	false	Force append worker pause spent time even if worker still in pause serving state. Help user can find worker pause spent time increase, when worker always been pause state.

4.4 KiB Raw Permalink Blame History

4.4 KiB

Raw Permalink Blame History