### What changes were proposed in this pull request? Introduce `ShutdownWorkerCount` metric to record the count of workers in shutdown list. <img width="1432" alt="image" src="https://github.com/apache/incubator-celeborn/assets/10048174/bc84b281-30ca-40a1-92e4-fb9cf10b5aeb"> ### Why are the changes needed? `/shutdownWorkers` lists all shutdown workers of the master at present. Therefore it's recommended to introduce ShutdownWorkerCount metric to record the count of workers in shutdown list. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - [Celeborn Dashboard](https://stenicholas.grafana.net/public-dashboards/c44822917403401690edb15617ec9f08) Closes #2379 from SteNicholas/CELEBORN-1323. Authored-by: SteNicholas <programgeek@163.com> Signed-off-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
26 KiB
Metrics
We provide various metrics about memory, disk, and important procedures. These metrics could help identify performance issue or monitor Celeborn cluster.
Prerequisites
-
Enable Celeborn metrics. Set configuration
celeborn.metrics.enabledto true (true by default). -
Configure Celeborn metrics properties.
cd $CELEBORN_HOME/conf
cp metrics.properties.template metrics.properties
The default values of the Celeborn metrics configuration are as follows:
*.sink.prometheusServlet.class=org.apache.celeborn.common.metrics.sink.PrometheusServlet
- Install Prometheus (https://prometheus.io/). We provide an example for Prometheus config file:
# Prometheus example config
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
- job_name: "Celeborn"
metrics_path: /metrics/prometheus
scrape_interval: 15s
static_configs:
- targets: [ "master-ip:9098","worker1-ip:9096","worker2-ip:9096","worker3-ip:9096","worker4-ip:9096" ]
-
Install Grafana server (https://grafana.com/grafana/download).
-
Import Celeborn dashboard into Grafana.
You can find the Celeborn dashboard templates under the assets/grafana directory.
celeborn-dashboard.json displays Celeborn internal metrics and celeborn-jvm-dashboard.json displays Celeborn JVM related metrics.
Optional
We recommend you to install node exporter (https://github.com/prometheus/node_exporter) on every host, and configure Prometheus to scrape information about the host. Grafana will need a dashboard (dashboard id:8919) to display host details.
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
- job_name: "Celeborn"
metrics_path: /metrics/prometheus
scrape_interval: 15s
static_configs:
- targets: [ "master-ip:9098","worker1-ip:9096","worker2-ip:9096","worker3-ip:9096","worker4-ip:9096" ]
- job_name: "node"
static_configs:
- targets: [ "master-ip:9100","worker1-ip:9100","worker2-ip:9100","worker3-ip:9100","worker4-ip:9100" ]
Import Dashboard Steps
Here is an example of Grafana dashboard importing.
Details
| MetricName | Scope | Description |
|---|---|---|
| RegisteredShuffleCount | master and worker | The value means count of registered shuffle. |
| RunningApplicationCount | master and worker | The value means count of running applications. |
| ActiveShuffleSize | master and worker | The value means the active shuffle size for workers or a worker including master replica and slave replica. |
| ActiveShuffleFileCount | master and worker | The value means the active shuffle file count for workers or a worker including master replica and slave replica. |
| diskFileCount | master and worker | The count of disk files consumption by each user. |
| diskBytesWritten | master and worker | The amount of disk files consumption by each user. |
| hdfsFileCount | master and worker | The count of hdfs files consumption by each user. |
| hdfsBytesWritten | master and worker | The amount of hdfs files consumption by each user. |
| WorkerCount | master | The count of active workers. |
| LostWorkerCount | master | The count of workers in lost list. |
| ExcludedWorkerCount | master | The count of workers in excluded list. |
| ShutdownWorkerCount | master | The count of workers in shutdown list. |
| IsActiveMaster | master | Whether the current master is active. |
| PartitionSize | master | The estimated partition size of last 20 flush window whose length is 15 seconds by defaults. |
| OfferSlotsTime | master | The time of offer slots. |
| OpenStreamTime | worker | OpenStream means read a shuffle file and send client about chunks size and stream index. |
| FetchChunkTime | worker | FetchChunk means read a chunk from a shuffle file and send to client. |
| ActiveChunkStreamCount | worker | The active stream count for reduce partition reading streams in current worker. |
| OpenStreamSuccessCount | worker | The count of opening stream succeed in current worker. |
| OpenStreamFailCount | worker | The count of opening stream failed in current worker. |
| FetchChunkSuccessCount | worker | The count of fetching chunk succeed in current worker. |
| FetchChunkFailCount | worker | The count of fetching chunk failed in current worker. |
| PrimaryPushDataTime | worker | PrimaryPushData means handle PushData of primary partition location. |
| ReplicaPushDataTime | worker | ReplicaPushData means handle PushData of replica partition location. |
| WriteDataHardSplitCount | worker | The count of writing PushData or PushMergedData to HARD_SPLIT partition in current worker. |
| WriteDataSuccessCount | worker | The count of writing PushData or PushMergedData succeed in current worker. |
| WriteDataFailCount | worker | The count of writing PushData or PushMergedData failed in current worker. |
| ReplicateDataFailCount | worker | The count of replicating PushData or PushMergedData failed in current worker. |
| ReplicateDataWriteFailCount | worker | The count of replicating PushData or PushMergedData failed caused by write failure in peer worker. |
| ReplicateDataCreateConnectionFailCount | worker | The count of replicating PushData or PushMergedData failed caused by creating connection failed in peer worker. |
| ReplicateDataConnectionExceptionCount | worker | The count of replicating PushData or PushMergedData failed caused by connection exception in peer worker. |
| ReplicateDataFailNonCriticalCauseCount | worker | The count of replicating PushData or PushMergedData failed caused by non-critical exception in peer worker. |
| ReplicateDataTimeoutCount | worker | The count of replicating PushData or PushMergedData failed caused by push timeout in peer worker. |
| PushDataHandshakeFailCount | worker | The count of PushDataHandshake failed in current worker. |
| RegionStartFailCount | worker | The count of RegionStart failed in current worker. |
| RegionFinishFailCount | worker | The count of RegionFinish failed in current worker. |
| PrimaryPushDataHandshakeTime | worker | PrimaryPushDataHandshake means handle PushData of primary partition location. |
| ReplicaPushDataHandshakeTime | worker | ReplicaPushDataHandshake means handle PushData of replica partition location. |
| PrimaryRegionStartTime | worker | PrimaryRegionStart means handle RegionStart of primary partition location. |
| ReplicaRegionStartTime | worker | ReplicaRegionStart means handle RegionStart of replica partition location. |
| PrimaryRegionFinishTime | worker | PrimaryRegionFinish means handle RegionFinish of primary partition location. |
| ReplicaRegionFinishTime | worker | ReplicaRegionFinish means handle RegionFinish of replica partition location. |
| PausePushDataTime | worker | PausePushData means stop receiving data from client. |
| PausePushDataAndReplicateTime | worker | PausePushDataAndReplicate means stop receiving data from client and other workers. |
| PausePushData | worker | The count of stopping receiving data from client in current worker. |
| PausePushDataAndReplicate | worker | The count of stopping receiving data from client and other workers in current worker. |
| TakeBufferTime | worker | TakeBuffer means get a disk buffer from disk flusher. |
| FlushDataTime | worker | FlushData means flush a disk buffer to disk. |
| CommitFilesTime | worker | CommitFiles means flush and close a shuffle partition file. |
| SlotsAllocated | worker | Slots allocated in last hour |
| ReserveSlotsTime | worker | ReserveSlots means acquire a disk buffer and record partition location. |
| ActiveConnectionCount | worker | The value means count of active network connection. |
| NettyMemory | worker | The value measures all kinds of transport memory used by netty. |
| SortTime | worker | SortTime measures the time used by sorting a shuffle file. |
| SortMemory | worker | SortMemory means total reserved memory for sorting shuffle files . |
| SortingFiles | worker | This value means the count of sorting shuffle files. |
| SortedFiles | worker | This value means the count of sorted shuffle files. |
| SortedFileSize | worker | This value means the count of sorted shuffle files 's total size. |
| DiskBuffer | worker | Disk buffers are part of netty used memory, means data need to write to disk but haven't been written to disk. |
| BufferStreamReadBuffer | worker | This value means memory used by credit stream read buffer. |
| ReadBufferDispatcherRequestsLength | worker | This value means the queue size of read buffer allocation requests. |
| ReadBufferAllocatedCount | worker | This value means count of allocated read buffer. |
| ActiveCreditStreamCount | worker | This value means active count of stream for map partition reading streams. |
| ActiveMapPartitionCount | worker | This value means count of active map partition reading streams. |
| DeviceOSFreeBytes | worker | This value means actual usable space of OS for device monitor. |
| DeviceOSTotalBytes | worker | This value means total usable space of OS for device monitor. |
| DeviceCelebornFreeBytes | worker | This value means actual usable space of Celeborn for device monitor. |
| DeviceCelebornTotalBytes | worker | This value means configured usable space of Celeborn for device monitor. |
| PotentialConsumeSpeed | worker | This value means speed of potential consumption for congestion control. |
| UserProduceSpeed | worker | This value means speed of user production for congestion control. |
| WorkerConsumeSpeed | worker | This value means speed of worker consumption for congestion control. |
| jvm_gc_count | JVM | The GC count of each garbage collector. |
| jvm_gc_time | JVM | The GC cost time of each garbage collector. |
| jvm_memory_heap_init | JVM | The amount of heap init memory. |
| jvm_memory_heap_max | JVM | The amount of heap max memory. |
| jvm_memory_heap_used | JVM | The amount of heap used memory. |
| jvm_memory_heap_committed | JVM | The amount of heap committed memory. |
| jvm_memory_heap_usage | JVM | The percentage of heap memory usage. |
| jvm_memory_non_heap_init | JVM | The amount of non-heap init memory. |
| jvm_memory_non_heap_max | JVM | The amount of non-heap max memory. |
| jvm_memory_non_heap_used | JVM | The amount of non-heap uesd memory. |
| jvm_memory_non_heap_committed | JVM | The amount of non-heap committed memory. |
| jvm_memory_non_heap_usage | JVM | The percentage of non-heap memory usage. |
| jvm_memory_pools_init | JVM | The amount of each memory pool's init memory. |
| jvm_memory_pools_max | JVM | The amount of each memory pool's max memory. |
| jvm_memory_pools_used | JVM | The amount of each memory pool's used memory. |
| jvm_memory_pools_committed | JVM | The amount of each memory pool's committed memory. |
| jvm_memory_pools_used_after_gc | JVM | The amount of each memory pool's used memory after GC. |
| jvm_memory_pools_usage | JVM | The percentage of each memory pool's memory usage. |
| jvm_memory_total_init | JVM | The amount of total init memory. |
| jvm_memory_total_max | JVM | The amount of total max memory. |
| jvm_memory_total_used | JVM | The amount of total used memory. |
| jvm_memory_total_committed | JVM | The amount of each memory pool's committed memory. |
| jvm_direct_capacity | JVM | An estimate of the total capacity of the buffers in this pool |
| jvm_direct_count | JVM | An estimate of the number of buffers in the pool |
| jvm_direct_used | JVM | An estimate of the memory that JVM is using for this buffer pool |
| jvm_mapped_capacity | JVM | An estimate of the total capacity of the buffers in this pool |
| jvm_mapped_count | JVM | An estimate of the number of buffers in the pool |
| jvm_mapped_used | JVM | An estimate of the memory that JVM is using for this buffer pool |
| jvm_thread_count | JVM | The current number of threads. |
| jvm_thread_daemon_count | JVM | The current number of daemon threads. |
| jvm_thread_blocked_count | JVM | The current number of threads having blocked state. |
| jvm_thread_deadlock_count | JVM | The current number of threads having deadlock state. |
| jvm_thread_new_count | JVM | The current number of threads having new state. |
| jvm_thread_runnable_count | JVM | The current number of threads having runnable state. |
| jvm_thread_terminated_count | JVM | The current number of threads having terminated state. |
| jvm_thread_timed_waiting_count | JVM | The current number of threads having timed_waiting state. |
| jvm_thread_waiting_count | JVM | The current number of threads having waiting state. |
| jvm_classloader_loaded | JVM | The total number of classes loaded since the start of the JVM. |
| jvm_classloader_unloaded | JVM | The total number of classes unloaded since the start of the JVM. |
| JVMCPUTime | system | The JVM costs cpu time. |
| AvailableProcessors | system | The amount of system available processors. |
| LastMinuteSystemLoad | system | The last minute load of system. |
| active_thread_count | JVM | The approximate number of threads that are actively executing tasks. |
| pending_task_count | JVM | The pending task not executed in block queue. |
| pool_size | JVM | The current number of threads in the pool. |
| core_pool_size | JVM | The core number of threads. |
| maximum_pool_size | JVM | The maximum allowed number of threads. |
| largest_pool_size | JVM | The largest number of threads that have ever simultaneously been in the pool. |
| is_terminating | JVM | If this executor is in the process of terminating after shutdown() or shutdownNow() but has not completely terminated. |
| is_terminated | JVM | If this executor is in the process of terminated after shutdown() or shutdownNow() and has completely terminated. |
| is_shutdown | JVM | If this executor is shutdown. |
| thread_count | JVM | The thread count of current thread group. |
| thread_is_terminated_count | JVM | The terminated thread count of current thread group. |
| thread_is_shutdown_count | JVM | The shutdown thread count of current thread group. |
Implementation
Celeborn master metrics : org/apache/celeborn/service/deploy/master/MasterSource.scala.
Celeborn worker metrics : org/apache/celeborn/service/deploy/worker/WorkerSource.scala.
Other common metrics are implemented in org.apache.celeborn.common.metrics.source package.
Dashboard Snapshots
The dashboard Celeborn-dashboard was generated by Grafana of version 10.0.3.
Here are some snapshots:







