### What changes were proposed in this pull request? Introduce `IsHighWorkload` metric to monitor worker overload status. ### Why are the changes needed? There is no any metric to monitor worker overload status at present. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? [Grafana test](https://xy2953396112.grafana.net/public-dashboards/22ab1750ef874a1bb39b5879b81a24cf). Closes #3435 from xy2953396112/CELEBORN-2118. Authored-by: xxx <953396112@qq.com> Signed-off-by: SteNicholas <programgeek@163.com>
472 lines
45 KiB
Markdown
472 lines
45 KiB
Markdown
---
|
|
license: |
|
|
Licensed to the Apache Software Foundation (ASF) under one or more
|
|
contributor license agreements. See the NOTICE file distributed with
|
|
this work for additional information regarding copyright ownership.
|
|
The ASF licenses this file to You under the Apache License, Version 2.0
|
|
(the "License"); you may not use this file except in compliance with
|
|
the License. You may obtain a copy of the License at
|
|
|
|
https://www.apache.org/licenses/LICENSE-2.0
|
|
|
|
Unless required by applicable law or agreed to in writing, software
|
|
distributed under the License is distributed on an "AS IS" BASIS,
|
|
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
|
See the License for the specific language governing permissions and
|
|
limitations under the License.
|
|
---
|
|
|
|
Monitoring
|
|
===
|
|
|
|
There are two ways to monitor Celeborn cluster: Prometheus metrics and REST API.
|
|
|
|
## Metrics
|
|
|
|
Celeborn has a configurable metrics system based on the
|
|
[Dropwizard Metrics Library](https://metrics.dropwizard.io/4.2.0).
|
|
This allows users to report Celeborn metrics to a variety of sinks including HTTP, JMX, CSV
|
|
files and prometheus servlet. The metrics are generated by sources embedded in the Celeborn code base.
|
|
They provide instrumentation for specific activities and Celeborn components.
|
|
The metrics system is configured via a configuration file that Celeborn expects to be present
|
|
at `$CELEBORN_HOME/conf/metrics.properties`. A custom file location can be specified via the
|
|
`celeborn.metrics.conf` [configuration property](https://celeborn.apache.org/docs/latest/configuration/#metrics).
|
|
Instead of using the configuration file, a set of configuration parameters with prefix
|
|
`celeborn.metrics.conf.` can be used.
|
|
|
|
Celeborn's metrics are divided into two
|
|
_instances_ corresponding to Celeborn components. The following instances are currently supported:
|
|
|
|
* `master`: The Celeborn cluster master process.
|
|
* `worker`: The Celeborn cluster worker process.
|
|
|
|
Each instance can report to zero or more _sinks_. Sinks are contained in the
|
|
`org.apache.celeborn.common.metrics.sink` package:
|
|
|
|
* `CsvSink`: Exports metrics data to CSV files at regular intervals.
|
|
* `PrometheusServlet`: Adds a servlet within the existing Celeborn REST API to serve metrics data in Prometheus format.
|
|
* `JsonServlet`: Adds a servlet within the existing Celeborn REST API to serve metrics data in JSON format.
|
|
* `GraphiteSink`: Sends metrics to a Graphite node.
|
|
* `LoggerSink`: Scrape metrics periodically and output them to the logger files if you have enabled
|
|
`celeborn.metrics.loggerSink.output.enabled`. This is used as safety valve to make sure the
|
|
metrics data won't exist in the memory for a long time. If you don't have a metrics collector to
|
|
collect metrics from celeborn periodically, it's important to enable this sink.
|
|
|
|
The syntax of the metrics configuration file and the parameters available for each sink are defined
|
|
in an example configuration file,
|
|
`$CELEBORN_HOME/conf/metrics.properties.template`.
|
|
|
|
When using Celeborn configuration parameters instead of the metrics configuration file, the relevant
|
|
parameter names are composed by the prefix `celeborn.metrics.conf.` followed by the configuration
|
|
details, i.e. the parameters take the following form:
|
|
`celeborn.metrics.conf.[instance|*].sink.[sink_name].[parameter_name]`.
|
|
This example shows a list of Celeborn configuration parameters for a CSV sink:
|
|
```
|
|
"celeborn.metrics.conf.*.sink.csv.class"="org.apache.celeborn.common.metrics.sink.CsvSink"
|
|
"celeborn.metrics.conf.*.sink.csv.period"="1"
|
|
"celeborn.metrics.conf.*.sink.csv.unit"=minutes
|
|
"celeborn.metrics.conf.*.sink.csv.directory"=/tmp/
|
|
```
|
|
|
|
Default values of the Celeborn metrics configuration are as follows:
|
|
```
|
|
*.sink.prometheusServlet.class=org.apache.celeborn.common.metrics.sink.PrometheusServlet
|
|
*.sink.jsonServlet.class=org.apache.celeborn.common.metrics.sink.JsonServlet
|
|
*.sink.loggerSink.class=org.apache.celeborn.common.metrics.sink.LoggerSink
|
|
```
|
|
|
|
Additional sources can be configured using the metrics configuration file or the configuration
|
|
parameter `celeborn.metrics.conf.[component_name].source.jvm.class=[source_name]`. At present the
|
|
no source is the available optional source. For example the following configuration parameter
|
|
activates the Example source:
|
|
`"celeborn.metrics.conf.*.source.jvm.class"="org.apache.celeborn.common.metrics.source.ExampleSource"`
|
|
|
|
### Available metrics providers
|
|
|
|
Metrics used by Celeborn are of multiple types: gauge, counter, histogram, meter and timer,
|
|
see [Dropwizard library documentation for details](https://metrics.dropwizard.io/4.2.0/getting-started.html).
|
|
The following list of components and metrics reports the name and some details about the available metrics,
|
|
grouped per component instance and source namespace.
|
|
The most common time of metrics used in Celeborn instrumentation are gauges and counters.
|
|
Counters can be recognized as they have the `.count` suffix. Timers, meters and histograms are annotated
|
|
in the list, the rest of the list elements are metrics of type gauge.
|
|
The large majority of metrics are active as soon as their parent component instance is configured,
|
|
some metrics require also to be enabled via an additional configuration parameter, the details are
|
|
reported in the list.
|
|
|
|
#### Master
|
|
These metrics are exposed by Celeborn master.
|
|
|
|
- namespace=master
|
|
|
|
| Metric Name | Description |
|
|
|------------------------------|-------------------------------------------------------------------------------------------|
|
|
| RegisteredShuffleCount | The count of registered shuffle. |
|
|
| DeviceCelebornFreeBytes | The actual usable space of Celeborn available workers for device. |
|
|
| DeviceCelebornTotalBytes | The total space of Celeborn for device. |
|
|
| RunningApplicationCount | The count of running applications. |
|
|
| ActiveShuffleSize | The active shuffle size of workers. |
|
|
| ActiveShuffleFileCount | The active shuffle file count of workers. |
|
|
| ShuffleTotalCount | The total count of shuffle including celeborn shuffle and engine built-in shuffle. |
|
|
| ShuffleFallbackCount | The count of shuffle fallbacks. |
|
|
| ApplicationTotalCount | The total count of application running with celeborn shuffle and engine built-in shuffle. |
|
|
| ApplicationFallbackCount | The count of application fallbacks. |
|
|
| WorkerCount | The count of active workers. |
|
|
| LostWorkerCount | The count of workers in lost list. |
|
|
| ExcludedWorkerCount | The count of workers in excluded list. |
|
|
| AvailableWorkerCount | The count of workers in available list. |
|
|
| ShutdownWorkerCount | The count of workers in shutdown list. |
|
|
| DecommissionWorkerCount | The count of workers in decommission list. |
|
|
| IsActiveMaster | Whether the current master is active. |
|
|
| RatisApplyCompletedIndex | The ApplyCompletedIndex of the current master node in HA mode. |
|
|
| RatisApplyCompletedIndexDiff | The difference value of ApplyCompletedIndex of the master nodes in HA mode. |
|
|
| PartitionSize | The size of estimated shuffle partition. |
|
|
| OfferSlotsTime | The time for masters to handle `RequestSlots` request when registering shuffle. |
|
|
|
|
- namespace=CPU
|
|
|
|
| Metric Name | Description |
|
|
|--------------|-------------------------|
|
|
| JVMCPUTime | The JVM costs cpu time. |
|
|
|
|
- namespace=system
|
|
|
|
| Metric Name | Description |
|
|
|-----------------------|-----------------------------------------------|
|
|
| LastMinuteSystemLoad | The average system load for the last minute. |
|
|
| AvailableProcessors | The amount of system available processors. |
|
|
|
|
- namespace=JVM
|
|
- This source provides information on JVM metrics using the
|
|
[Dropwizard/Codahale Metric Sets for JVM instrumentation](https://metrics.dropwizard.io/4.2.0/manual/jvm.html)
|
|
and in particular the metric sets BufferPoolMetricSet, GarbageCollectorMetricSet and MemoryUsageGaugeSet.
|
|
|
|
- namespace=ResourceConsumption
|
|
- **notes:**
|
|
- This metrics data is generated for each user and they are identified using a metric tag.
|
|
- This metrics also include subResourceConsumptions generated for each application of user and they are identified using `applicationId` tag.
|
|
|
|
| Metric Name | Description |
|
|
|-------------------|-----------------------------------------------------|
|
|
| diskFileCount | The count of disk files consumption by each user. |
|
|
| diskBytesWritten | The amount of disk files consumption by each user. |
|
|
| hdfsFileCount | The count of hdfs files consumption by each user. |
|
|
| hdfsBytesWritten | The amount of hdfs files consumption by each user. |
|
|
|
|
- namespace=ThreadPool
|
|
- **notes:**
|
|
- This metrics data is generated for each thread pool and they are identified using a metric tag by thread pool name.
|
|
|
|
| Metric Name | Description |
|
|
|------------------------------|-----------------------------------------------------------------------------------------------------------------------------|
|
|
| active_thread_count | The approximate number of threads that are actively executing tasks. |
|
|
| pending_task_count | The pending task not executed in block queue. |
|
|
| pool_size | The current number of threads in the pool. |
|
|
| core_pool_size | The core number of threads. |
|
|
| maximum_pool_size | The maximum allowed number of threads. |
|
|
| largest_pool_size | The largest number of threads that have ever simultaneously been in the pool. |
|
|
| is_terminating | Whether this executor is in the process of terminating after shutdown() or shutdownNow() but has not completely terminated. |
|
|
| is_terminated | Whether this executor is in the process of terminated after shutdown() or shutdownNow() and has completely terminated. |
|
|
| is_shutdown | Whether this executor is shutdown. |
|
|
| thread_count | The thread count of current thread group. |
|
|
| thread_is_terminated_count | The terminated thread count of current thread group. |
|
|
| thread_is_shutdown_count | The shutdown thread count of current thread group. |
|
|
|
|
#### Worker
|
|
These metrics are exposed by Celeborn worker.
|
|
|
|
- namespace=worker
|
|
|
|
| Metric Name | Description |
|
|
|----------------------------------------|-----------------------------------------------------------------------------------------------------------------|
|
|
| RegisteredShuffleCount | The count of registered shuffle. |
|
|
| RunningApplicationCount | The count of running applications. |
|
|
| ActiveShuffleSize | The active shuffle size of a worker including master replica and slave replica. |
|
|
| ActiveShuffleFileCount | The active shuffle file count of a worker including master replica and slave replica. |
|
|
| OpenStreamTime | The time for a worker to process openStream RPC and return StreamHandle. |
|
|
| FetchChunkTime | The time for a worker to fetch a chunk which is 8MB by default from a reduced partition. |
|
|
| FetchChunkTransferTime | The time for a worker to transfer for fetching a chunk from a reduced partition. |
|
|
| ActiveChunkStreamCount | Active stream count for reduce partition reading streams. |
|
|
| OpenStreamSuccessCount | The count of opening stream succeed in current worker. |
|
|
| OpenStreamFailCount | The count of opening stream failed in current worker. |
|
|
| FetchChunkSuccessCount | The count of fetching chunk succeed in current worker. |
|
|
| FetchChunkFailCount | The count of fetching chunk failed in current worker. |
|
|
| FetchChunkTransferSize | The size of transfer for fetching chunk in current worker. |
|
|
| PrimaryPushDataTime | The time for a worker to handle a pushData RPC sent from a celeborn client. |
|
|
| ReplicaPushDataTime | The time for a worker to handle a pushData RPC sent from a celeborn worker by replicating. |
|
|
| PrimarySegmentStartTime | The time for a worker to handle a segmentStart RPC sent from a celeborn client. |
|
|
| ReplicaSegmentStartTime | The time for a worker to handle a segmentStart RPC sent from a celeborn worker by replicating. |
|
|
| WriteDataHardSplitCount | The count of writing PushData or PushMergedData to HARD_SPLIT partition in current worker. |
|
|
| WriteDataSuccessCount | The count of writing PushData or PushMergedData succeed in current worker. |
|
|
| WriteDataFailCount | The count of writing PushData or PushMergedData failed in current worker. |
|
|
| ReplicateDataFailCount | The count of replicating PushData or PushMergedData failed in current worker. |
|
|
| ReplicateDataWriteFailCount | The count of replicating PushData or PushMergedData failed caused by write failure in peer worker. |
|
|
| ReplicateDataCreateConnectionFailCount | The count of replicating PushData or PushMergedData failed caused by creating connection failed in peer worker. |
|
|
| ReplicateDataConnectionExceptionCount | The count of replicating PushData or PushMergedData failed caused by connection exception in peer worker. |
|
|
| ReplicateDataFailNonCriticalCauseCount | The count of replicating PushData or PushMergedData failed caused by non-critical exception in peer worker. |
|
|
| ReplicateDataTimeoutCount | The count of replicating PushData or PushMergedData failed caused by push timeout in peer worker. |
|
|
| PushDataHandshakeFailCount | The count of PushDataHandshake failed in current worker. |
|
|
| RegionStartFailCount | The count of RegionStart failed in current worker. |
|
|
| RegionFinishFailCount | The count of RegionFinish failed in current worker. |
|
|
| SegmentStartFailCount | The count of SegmentStart failed in current worker. |
|
|
| PrimaryPushDataHandshakeTime | PrimaryPushDataHandshake means handle PushData of primary partition location. |
|
|
| ReplicaPushDataHandshakeTime | ReplicaPushDataHandshake means handle PushData of replica partition location. |
|
|
| PrimaryRegionStartTime | PrimaryRegionStart means handle RegionStart of primary partition location. |
|
|
| ReplicaRegionStartTime | ReplicaRegionStart means handle RegionStart of replica partition location. |
|
|
| PrimaryRegionFinishTime | PrimaryRegionFinish means handle RegionFinish of primary partition location. |
|
|
| ReplicaRegionFinishTime | ReplicaRegionFinish means handle RegionFinish of replica partition location. |
|
|
| PausePushDataStatus | The status for a worker to stop receiving pushData from clients because of back pressure. |
|
|
| PausePushDataTime | The time for a worker to stop receiving pushData from clients because of back pressure. |
|
|
| PausePushDataAndReplicateTime | The time for a worker to stop receiving pushData from clients and other workers because of back pressure. |
|
|
| PausePushDataAndReplicateStatus | The status for a worker to stop receiving pushData from clients because of back pressure. |
|
|
| PausePushData | The count for a worker to stop receiving pushData from clients because of back pressure. |
|
|
| PausePushDataAndReplicate | The count for a worker to stop receiving pushData from clients and other workers because of back pressure. |
|
|
| PartitionFileSizeBytes | The size of partition files committed in current worker. |
|
|
| TakeBufferTime | The time for a worker to take out a buffer from a disk flusher. |
|
|
| FlushDataTime | The time for a worker to write a buffer which is 256KB by default to storage. |
|
|
| CommitFilesTime | The time for a worker to flush buffers and close files related to specified shuffle. |
|
|
| CommitFilesFailCount | The count of commit files request failed in current worker. |
|
|
| SlotsAllocated | Slots allocated in last hour. |
|
|
| ActiveSlotsCount | The number of slots currently being used in a worker. |
|
|
| ReserveSlotsTime | ReserveSlots means acquire a disk buffer and record partition location. |
|
|
| ActiveConnectionCount | The count of active network connection. |
|
|
| NettyMemory | The total amount of off-heap memory used by celeborn worker. |
|
|
| SortTime | The time for a worker to sort a shuffle file. |
|
|
| SortMemory | The memory used by sorting shuffle files. |
|
|
| SortingFiles | The count of sorting shuffle files. |
|
|
| PendingSortTasks | The count of sort tasks waiting to be submitted to FileSorterExecutors. |
|
|
| SortedFiles | The count of sorted shuffle files. |
|
|
| SortedFileSize | The count of sorted shuffle files 's total size. |
|
|
| DiskBuffer | The memory occupied by pushData and pushMergedData which should be written to disk. |
|
|
| BufferStreamReadBuffer | The memory used by credit stream read buffer. |
|
|
| ReadBufferDispatcherRequestsLength | The queue size of read buffer allocation requests. |
|
|
| ReadBufferAllocatedCount | Allocated read buffer count. |
|
|
| ActiveCreditStreamCount | Active stream count for map partition reading streams. |
|
|
| ActiveMapPartitionCount | The count of active map partition reading streams. |
|
|
| SorterCacheHitRate | The cache hit rate for worker partition sorter index. |
|
|
| CleanTaskQueueSize | The count of task for cleaning up expired shuffle keys. |
|
|
| CleanExpiredShuffleKeysTime | The time for a worker to clean up shuffle data of expired shuffle keys. |
|
|
| DeviceOSFreeBytes | The actual usable space of OS for device monitor. |
|
|
| DeviceOSTotalBytes | The total usable space of OS for device monitor. |
|
|
| DeviceCelebornFreeBytes | The actual usable space of Celeborn for device. |
|
|
| DeviceCelebornTotalBytes | The total space of Celeborn for device. |
|
|
| PotentialConsumeSpeed | The speed of potential consumption for congestion control. |
|
|
| UserProduceSpeed | The speed of user production for congestion control. |
|
|
| WorkerConsumeSpeed | The speed of worker consumption for congestion control. |
|
|
| IsDecommissioningWorker | 1 means worker decommissioning, 0 means not decommissioning. |
|
|
| IsHighWorkload | 1 means worker high workload, 0 means not high workload. |
|
|
| UnreleasedShuffleCount | Unreleased shuffle count when worker is decommissioning. |
|
|
| UnreleasedPartitionLocationCount | Unreleased partition location count when worker is shutting down. |
|
|
| MemoryStorageFileCount | The count of files in Memory Storage of a worker. |
|
|
| MemoryFileStorageSize | The total amount of memory used by Memory Storage. |
|
|
| EvictedFileCount | The count of files evicted from Memory Storage to Disk |
|
|
| DirectMemoryUsageRatio | Ratio of direct memory used and max direct memory. |
|
|
| RegisterWithMasterFailCount | The count of failures in register with master request. |
|
|
| FlushWorkingQueueSize | The size of flush working queue for mount point. |
|
|
| LocalFlushCount | The amount of data flushed to local. |
|
|
| LocalFlushSize | The size of data flushed to local. |
|
|
| HdfsFlushCount | The amount of data flushed to HDFS. |
|
|
| HdfsFlushSize | The size of data flushed to HDFS. |
|
|
| OssFlushCount | The amount of data flushed to OSS. |
|
|
| OssFlushSize | The size of data flushed to OSS. |
|
|
| S3FlushCount | The amount of data flushed to S3. |
|
|
| S3FlushSize | The size of data flushed to S3. |
|
|
| push_usedHeapMemory | |
|
|
| push_usedDirectMemory | |
|
|
| push_numHeapArenas | |
|
|
| push_numDirectArenas | |
|
|
| push_tinyCacheSize | |
|
|
| push_smallCacheSize | |
|
|
| push_normalCacheSize | |
|
|
| push_numThreadLocalCaches | |
|
|
| push_chunkSize | |
|
|
| push_numAllocations | |
|
|
| push_numTinyAllocations | |
|
|
| push_numSmallAllocations | |
|
|
| push_numNormalAllocations | |
|
|
| push_numHugeAllocations | |
|
|
| push_numDeallocations | |
|
|
| push_numTinyDeallocations | |
|
|
| push_numSmallDeallocations | |
|
|
| push_numNormalDeallocations | |
|
|
| push_numHugeDeallocations | |
|
|
| push_numActiveAllocations | |
|
|
| push_numActiveTinyAllocations | |
|
|
| push_numActiveSmallAllocations | |
|
|
| push_numActiveNormalAllocations | |
|
|
| push_numActiveHugeAllocations | |
|
|
| push_numActiveBytes | |
|
|
| replicate_usedHeapMemory | |
|
|
| replicate_usedDirectMemory | |
|
|
| replicate_numHeapArenas | |
|
|
| replicate_numDirectArenas | |
|
|
| replicate_tinyCacheSize | |
|
|
| replicate_smallCacheSize | |
|
|
| replicate_normalCacheSize | |
|
|
| replicate_numThreadLocalCaches | |
|
|
| replicate_chunkSize | |
|
|
| replicate_numAllocations | |
|
|
| replicate_numTinyAllocations | |
|
|
| replicate_numSmallAllocations | |
|
|
| replicate_numNormalAllocations | |
|
|
| replicate_numHugeAllocations | |
|
|
| replicate_numDeallocations | |
|
|
| replicate_numTinyDeallocations | |
|
|
| replicate_numSmallDeallocations | |
|
|
| replicate_numNormalDeallocations | |
|
|
| replicate_numHugeDeallocations | |
|
|
| replicate_numActiveAllocations | |
|
|
| replicate_numActiveTinyAllocations | |
|
|
| replicate_numActiveSmallAllocations | |
|
|
| replicate_numActiveNormalAllocations | |
|
|
| replicate_numActiveHugeAllocations | |
|
|
| replicate_numActiveBytes | |
|
|
| fetch_usedHeapMemory | |
|
|
| fetch_usedDirectMemory | |
|
|
| fetch_numHeapArenas | |
|
|
| fetch_numDirectArenas | |
|
|
| fetch_tinyCacheSize | |
|
|
| fetch_smallCacheSize | |
|
|
| fetch_normalCacheSize | |
|
|
| fetch_numThreadLocalCaches | |
|
|
| fetch_chunkSize | |
|
|
| fetch_numAllocations | |
|
|
| fetch_numTinyAllocations | |
|
|
| fetch_numSmallAllocations | |
|
|
| fetch_numNormalAllocations | |
|
|
| fetch_numHugeAllocations | |
|
|
| fetch_numDeallocations | |
|
|
| fetch_numTinyDeallocations | |
|
|
| fetch_numSmallDeallocations | |
|
|
| fetch_numNormalDeallocations | |
|
|
| fetch_numHugeDeallocations | |
|
|
| fetch_numActiveAllocations | |
|
|
| fetch_numActiveTinyAllocations | |
|
|
| fetch_numActiveSmallAllocations | |
|
|
| fetch_numActiveNormalAllocations | |
|
|
| fetch_numActiveHugeAllocations | |
|
|
| fetch_numActiveBytes | |
|
|
|
|
- namespace=CPU
|
|
|
|
| Metric Name | Description |
|
|
|--------------|---------------------------|
|
|
| JVMCPUTime | The JVM costs cpu time. |
|
|
|
|
- namespace=system
|
|
|
|
| Metric Name | Description |
|
|
|------------------------|-------------------------------------------------------|
|
|
| LastMinuteSystemLoad | Returns the system load average for the last minute. |
|
|
| AvailableProcessors | The amount of system available processors. |
|
|
|
|
- namespace=JVM
|
|
- This source provides information on JVM metrics using the
|
|
[Dropwizard/Codahale Metric Sets for JVM instrumentation](https://metrics.dropwizard.io/4.2.0/manual/jvm.html)
|
|
and in particular the metric sets BufferPoolMetricSet, GarbageCollectorMetricSet and MemoryUsageGaugeSet.
|
|
|
|
- namespace=ResourceConsumption
|
|
- **notes:**
|
|
- This metrics data is generated for each user and they are identified using a metric tag.
|
|
- This metrics also include subResourceConsumptions generated for each application of user and they are identified using `applicationId` tag.
|
|
|
|
| Metric Name | Description |
|
|
|-------------------|-----------------------------------------------------|
|
|
| diskFileCount | The count of disk files consumption by each user. |
|
|
| diskBytesWritten | The amount of disk files consumption by each user. |
|
|
| hdfsFileCount | The count of hdfs files consumption by each user. |
|
|
| hdfsBytesWritten | The amount of hdfs files consumption by each user. |
|
|
|
|
- namespace=ThreadPool
|
|
- **notes:**
|
|
- This metrics data is generated for each thread pool and they are identified using a metric tag by thread pool name.
|
|
|
|
| Metric Name | Description |
|
|
|------------------------|-----------------------------------------------------------------------------------------------------------------------------|
|
|
| active_thread_count | The approximate number of threads that are actively executing tasks. |
|
|
| pending_task_count | The pending task not executed in block queue. |
|
|
| pool_size | The current number of threads in the pool. |
|
|
| core_pool_size | The core number of threads. |
|
|
| maximum_pool_size | The maximum allowed number of threads. |
|
|
| largest_pool_size | The largest number of threads that have ever simultaneously been in the pool. |
|
|
| is_terminating | Whether this executor is in the process of terminating after shutdown() or shutdownNow() but has not completely terminated. |
|
|
| is_terminated | Whether this executor is in the process of terminated after shutdown() or shutdownNow() and has completely terminated. |
|
|
| is_shutdown | Whether this executor is shutdown. |
|
|
|
|
**Note:**
|
|
|
|
The Netty DirectArenaMetrics named like `push/fetch/replicate_numXX` are not exposed by default, nor in Grafana dashboard.
|
|
If there is a need, you can enable `celeborn.network.memory.allocator.verbose.metric` to expose these metrics.
|
|
|
|
### Setup Prometheus dashboard
|
|
|
|
1. Install Prometheus (https://prometheus.io/). We provide an example for Prometheus config file:
|
|
|
|
```yaml
|
|
# Prometheus example config
|
|
global:
|
|
scrape_interval: 15s
|
|
evaluation_interval: 15s
|
|
|
|
scrape_configs:
|
|
- job_name: "Celeborn"
|
|
metrics_path: /metrics/prometheus
|
|
scrape_interval: 15s
|
|
static_configs:
|
|
- targets: [ "master-ip:9098","worker1-ip:9096","worker2-ip:9096","worker3-ip:9096","worker4-ip:9096" ]
|
|
```
|
|
|
|
2. Install Grafana server (https://grafana.com/grafana/download).
|
|
|
|
3. Import Celeborn dashboard into Grafana.
|
|
|
|
You can find the Celeborn dashboard templates under the `assets/grafana` directory.
|
|
`celeborn-dashboard.json` displays Celeborn internal metrics and `celeborn-jvm-dashboard.json` displays Celeborn JVM related metrics.
|
|
|
|
Here is an example of Grafana dashboard importing.
|
|
|
|

|
|
|
|

|
|
|
|

|
|
|
|
<img alt="g4" height="90%" src="assets/img/g4.png" width="90%"/>
|
|
|
|
<img alt="g6" height="90%" src="assets/img/g6.png" width="90%"/>
|
|
|
|
<img alt="g5" height="90%" src="assets/img/g5.png" width="90%"/>
|
|
|
|
Here are some snapshots:
|
|
|
|

|
|
|
|

|
|
|
|
#### Optional
|
|
We recommend you to install node exporter (https://github.com/prometheus/node_exporter)
|
|
on every host, and configure Prometheus to scrape information about the host.
|
|
Grafana will need a dashboard (dashboard id:8919) to display host details.
|
|
|
|
```yaml
|
|
global:
|
|
scrape_interval: 15s
|
|
evaluation_interval: 15s
|
|
|
|
scrape_configs:
|
|
- job_name: "Celeborn"
|
|
metrics_path: /metrics/prometheus
|
|
scrape_interval: 15s
|
|
static_configs:
|
|
- targets: [ "master-ip:9098","worker1-ip:9096","worker2-ip:9096","worker3-ip:9096","worker4-ip:9096" ]
|
|
- job_name: "node"
|
|
static_configs:
|
|
- targets: [ "master-ip:9100","worker1-ip:9100","worker2-ip:9100","worker3-ip:9100","worker4-ip:9100" ]
|
|
```
|
|
|
|
## REST API
|
|
|
|
In addition to viewing the metrics, Celeborn also supports [REST API](restapi.md).
|
|
This gives developers an easy way to create new visualizations and monitoring tools for Celeborn and
|
|
also easy for users to get the running status of the service.
|