celeborn/docs/monitoring.md

---
license: |
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements.  See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at

      https://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
---

Monitoring
===

There are two ways to monitor Celeborn cluster: Prometheus metrics and REST API.

## Metrics

Celeborn has a configurable metrics system based on the
[Dropwizard Metrics Library](https://metrics.dropwizard.io/4.2.0).
This allows users to report Celeborn metrics to a variety of sinks including HTTP, JMX, CSV
files and prometheus servlet. The metrics are generated by sources embedded in the Celeborn code base.
They provide instrumentation for specific activities and Celeborn components.
The metrics system is configured via a configuration file that Celeborn expects to be present
at `$CELEBORN_HOME/conf/metrics.properties`. A custom file location can be specified via the
`celeborn.metrics.conf` [configuration property](https://celeborn.apache.org/docs/latest/configuration/#metrics).
Instead of using the configuration file, a set of configuration parameters with prefix
`celeborn.metrics.conf.` can be used.

Celeborn's metrics are divided into two
_instances_ corresponding to Celeborn components.  The following instances are currently supported:

* `master`: The Celeborn cluster master process.
* `worker`: The Celeborn cluster worker process.

Each instance can report to zero or more _sinks_. Sinks are contained in the
`org.apache.celeborn.common.metrics.sink` package:

* `CsvSink`: Exports metrics data to CSV files at regular intervals.
* `PrometheusServlet`: Adds a servlet within the existing Celeborn REST API to serve metrics data in Prometheus format.
* `JsonServlet`: Adds a servlet within the existing Celeborn REST API to serve metrics data in JSON format.
* `GraphiteSink`: Sends metrics to a Graphite node.
* `LoggerSink`: Scrape metrics periodically and output them to the logger files if you have enabled
  `celeborn.metrics.loggerSink.output.enabled`. This is used as safety valve to make sure the
  metrics data won't exist in the memory for a long time. If you don't have a metrics collector to
  collect metrics from celeborn periodically, it's important to enable this sink.

The syntax of the metrics configuration file and the parameters available for each sink are defined
in an example configuration file,
`$CELEBORN_HOME/conf/metrics.properties.template`.

When using Celeborn configuration parameters instead of the metrics configuration file, the relevant
parameter names are composed by the prefix `celeborn.metrics.conf.` followed by the configuration
details, i.e. the parameters take the following form:
`celeborn.metrics.conf.[instance|*].sink.[sink_name].[parameter_name]`.
This example shows a list of Celeborn configuration parameters for a CSV sink:
```
"celeborn.metrics.conf.*.sink.csv.class"="org.apache.celeborn.common.metrics.sink.CsvSink"
"celeborn.metrics.conf.*.sink.csv.period"="1"
"celeborn.metrics.conf.*.sink.csv.unit"=minutes
"celeborn.metrics.conf.*.sink.csv.directory"=/tmp/
```

Default values of the Celeborn metrics configuration are as follows:
```
*.sink.prometheusServlet.class=org.apache.celeborn.common.metrics.sink.PrometheusServlet
*.sink.jsonServlet.class=org.apache.celeborn.common.metrics.sink.JsonServlet
*.sink.loggerSink.class=org.apache.celeborn.common.metrics.sink.LoggerSink
```

Additional sources can be configured using the metrics configuration file or the configuration
parameter `celeborn.metrics.conf.[component_name].source.jvm.class=[source_name]`. At present the
no source is the available optional source. For example the following configuration parameter
activates the Example source:
`"celeborn.metrics.conf.*.source.jvm.class"="org.apache.celeborn.common.metrics.source.ExampleSource"`

### Available metrics providers

Metrics used by Celeborn are of multiple types: gauge, counter, histogram, meter and timer,
see [Dropwizard library documentation for details](https://metrics.dropwizard.io/4.2.0/getting-started.html).
The following list of components and metrics reports the name and some details about the available metrics,
grouped per component instance and source namespace.
The most common time of metrics used in Celeborn instrumentation are gauges and counters.
Counters can be recognized as they have the `.count` suffix. Timers, meters and histograms are annotated
in the list, the rest of the list elements are metrics of type gauge.
The large majority of metrics are active as soon as their parent component instance is configured,
some metrics require also to be enabled via an additional configuration parameter, the details are
reported in the list.

#### Master
These metrics are exposed by Celeborn master.

  - namespace=master

    | Metric Name                  | Description                                                                               |
    |------------------------------|-------------------------------------------------------------------------------------------|
    | RegisteredShuffleCount       | The count of registered shuffle.                                                          |
    | DeviceCelebornFreeBytes      | The actual usable space of Celeborn available workers for device.                         |
    | DeviceCelebornTotalBytes     | The total space of Celeborn for device.                                                   |
    | RunningApplicationCount      | The count of running applications.                                                        |
    | ActiveShuffleSize            | The active shuffle size of workers.                                                       |
    | ActiveShuffleFileCount       | The active shuffle file count of workers.                                                 |
    | ShuffleTotalCount            | The total count of shuffle including celeborn shuffle and engine built-in shuffle.        |
    | ShuffleFallbackCount         | The count of shuffle fallbacks.                                                           |
    | ApplicationTotalCount        | The total count of application running with celeborn shuffle and engine built-in shuffle. |
    | ApplicationFallbackCount     | The count of application fallbacks.                                                       |
    | WorkerCount                  | The count of active workers.                                                              |
    | LostWorkerCount              | The count of workers in lost list.                                                        |
    | ExcludedWorkerCount          | The count of workers in excluded list.                                                    |
    | AvailableWorkerCount         | The count of workers in available list.                                                   |
    | ShutdownWorkerCount          | The count of workers in shutdown list.                                                    |
    | DecommissionWorkerCount      | The count of workers in decommission list.                                                |
    | IsActiveMaster               | Whether the current master is active.                                                     |
    | RatisApplyCompletedIndex     | The ApplyCompletedIndex of the current master node in HA mode.                            |
    | RatisApplyCompletedIndexDiff | The difference value of ApplyCompletedIndex of the master nodes in HA mode.               |
    | PartitionSize                | The size of estimated shuffle partition.                                                  |
    | OfferSlotsTime               | The time for masters to handle `RequestSlots` request when registering shuffle.           |

  - namespace=CPU

    | Metric Name  | Description             |
    |--------------|-------------------------|
    | JVMCPUTime   | The JVM costs cpu time. |

  - namespace=system

    | Metric Name           | Description                                   |
    |-----------------------|-----------------------------------------------|
    | LastMinuteSystemLoad  | The average system load for the last minute.  |
    | AvailableProcessors   | The amount of system available processors.    |

  - namespace=JVM
    - This source provides information on JVM metrics using the
    [Dropwizard/Codahale Metric Sets for JVM instrumentation](https://metrics.dropwizard.io/4.2.0/manual/jvm.html)
    and in particular the metric sets BufferPoolMetricSet, GarbageCollectorMetricSet and MemoryUsageGaugeSet.

  - namespace=ResourceConsumption
    - **notes:**
        - This metrics data is generated for each user and they are identified using a metric tag.
        - This metrics also include subResourceConsumptions generated for each application of user and they are identified using `applicationId` tag.

    | Metric Name       | Description                                         |
    |-------------------|-----------------------------------------------------|
    | diskFileCount     | The count of disk files consumption by each user.   |
    | diskBytesWritten  | The amount of disk files consumption by each user.  |
    | hdfsFileCount     | The count of hdfs files consumption by each user.   |
    | hdfsBytesWritten  | The amount of hdfs files consumption by each user.  |

  - namespace=ThreadPool
    - **notes:**
        - This metrics data is generated for each thread pool and they are identified using a metric tag by thread pool name.

    | Metric Name                  | Description                                                                                                                 |
    |------------------------------|-----------------------------------------------------------------------------------------------------------------------------|
    | active_thread_count          | The approximate number of threads that are actively executing tasks.                                                        |
    | pending_task_count           | The pending task not executed in block queue.                                                                               |
    | pool_size                    | The current number of threads in the pool.                                                                                  |
    | core_pool_size               | The core number of threads.                                                                                                 |
    | maximum_pool_size            | The maximum allowed number of threads.                                                                                      |
    | largest_pool_size            | The largest number of threads that have ever simultaneously been in the pool.                                               |
    | is_terminating               | Whether this executor is in the process of terminating after shutdown() or shutdownNow() but has not completely terminated. |
    | is_terminated                | Whether this executor is in the process of terminated after shutdown() or shutdownNow() and has completely terminated.      |
    | is_shutdown                  | Whether this executor is shutdown.                                                                                          |
    | thread_count                 | The thread count of current thread group.                                                                                   |
    | thread_is_terminated_count   | The terminated thread count of current thread group.                                                                        |
    | thread_is_shutdown_count     | The shutdown thread count of current thread group.                                                                          |

#### Worker
These metrics are exposed by Celeborn worker.

  - namespace=worker

    | Metric Name                            | Description                                                                                                     |
    |----------------------------------------|-----------------------------------------------------------------------------------------------------------------|
    | RegisteredShuffleCount                 | The count of registered shuffle.                                                                                |
    | RunningApplicationCount                | The count of running applications.                                                                              |
    | ActiveShuffleSize                      | The active shuffle size of a worker including master replica and slave replica.                                 |
    | ActiveShuffleFileCount                 | The active shuffle file count of a worker including master replica and slave replica.                           |
    | OpenStreamTime                         | The time for a worker to process openStream RPC and return StreamHandle.                                        |
    | FetchChunkTime                         | The time for a worker to fetch a chunk which is 8MB by default from a reduced partition.                        |
    | FetchChunkTransferTime                 | The time for a worker to transfer for fetching a chunk from a reduced partition.                                |
    | ActiveChunkStreamCount                 | Active stream count for reduce partition reading streams.                                                       |
    | OpenStreamSuccessCount                 | The count of opening stream succeed in current worker.                                                          |
    | OpenStreamFailCount                    | The count of opening stream failed in current worker.                                                           |
    | FetchChunkSuccessCount                 | The count of fetching chunk succeed in current worker.                                                          |
    | FetchChunkFailCount                    | The count of fetching chunk failed in current worker.                                                           |
    | FetchChunkTransferSize                 | The size of transfer for fetching chunk in current worker.                                                      |
    | PrimaryPushDataTime                    | The time for a worker to handle a pushData RPC sent from a celeborn client.                                     |
    | ReplicaPushDataTime                    | The time for a worker to handle a pushData RPC sent from a celeborn worker by replicating.                      |
    | PrimarySegmentStartTime                | The time for a worker to handle a segmentStart RPC sent from a celeborn client.                                 |
    | ReplicaSegmentStartTime                | The time for a worker to handle a segmentStart RPC sent from a celeborn worker by replicating.                  |
    | WriteDataHardSplitCount                | The count of writing PushData or PushMergedData to HARD_SPLIT partition in current worker.                      |
    | WriteDataSuccessCount                  | The count of writing PushData or PushMergedData succeed in current worker.                                      |
    | WriteDataFailCount                     | The count of writing PushData or PushMergedData failed in current worker.                                       |
    | ReplicateDataFailCount                 | The count of replicating PushData or PushMergedData failed in current worker.                                   |
    | ReplicateDataWriteFailCount            | The count of replicating PushData or PushMergedData failed caused by write failure in peer worker.              |
    | ReplicateDataCreateConnectionFailCount | The count of replicating PushData or PushMergedData failed caused by creating connection failed in peer worker. |
    | ReplicateDataConnectionExceptionCount  | The count of replicating PushData or PushMergedData failed caused by connection exception in peer worker.       |
    | ReplicateDataFailNonCriticalCauseCount | The count of replicating PushData or PushMergedData failed caused by non-critical exception in peer worker.     |
    | ReplicateDataTimeoutCount              | The count of replicating PushData or PushMergedData failed caused by push timeout in peer worker.               |
    | PushDataHandshakeFailCount             | The count of PushDataHandshake failed in current worker.                                                        |
    | RegionStartFailCount                   | The count of RegionStart failed in current worker.                                                              |
    | RegionFinishFailCount                  | The count of RegionFinish failed in current worker.                                                             |
    | SegmentStartFailCount                  | The count of SegmentStart failed in current worker.                                                             |
    | PrimaryPushDataHandshakeTime           | PrimaryPushDataHandshake means handle PushData of primary partition location.                                   |
    | ReplicaPushDataHandshakeTime           | ReplicaPushDataHandshake means handle PushData of replica partition location.                                   |
    | PrimaryRegionStartTime                 | PrimaryRegionStart means handle RegionStart of primary partition location.                                      |
    | ReplicaRegionStartTime                 | ReplicaRegionStart means handle RegionStart of replica partition location.                                      |
    | PrimaryRegionFinishTime                | PrimaryRegionFinish means handle RegionFinish of primary partition location.                                    |
    | ReplicaRegionFinishTime                | ReplicaRegionFinish means handle RegionFinish of replica partition location.                                    |
    | PausePushDataStatus                    | The status for a worker to stop receiving pushData from clients because of back pressure.                       |
    | PausePushDataTime                      | The time for a worker to stop receiving pushData from clients because of back pressure.                         |
    | PausePushDataAndReplicateTime          | The time for a worker to stop receiving pushData from clients and other workers because of back pressure.       |
    | PausePushDataAndReplicateStatus        | The status for a worker to stop receiving pushData from clients because of back pressure.                       |
    | PausePushData                          | The count for a worker to stop receiving pushData from clients because of back pressure.                        |
    | PausePushDataAndReplicate              | The count for a worker to stop receiving pushData from clients and other workers because of back pressure.      |
    | PartitionFileSizeBytes                 | The size of partition files committed in current worker.                                                        |
    | TakeBufferTime                         | The time for a worker to take out a buffer from a disk flusher.                                                 |
    | FlushDataTime                          | The time for a worker to write a buffer which is 256KB by default to storage.                                   |
    | CommitFilesTime                        | The time for a worker to flush buffers and close files related to specified shuffle.                            |
    | CommitFilesFailCount                   | The count of commit files request failed in current worker.                                                     |
    | SlotsAllocated                         | Slots allocated in last hour.                                                                                   |
    | ActiveSlotsCount                       | The number of slots currently being used in a worker.                                                           |
    | ReserveSlotsTime                       | ReserveSlots means acquire a disk buffer and record partition location.                                         |
    | ActiveConnectionCount                  | The count of active network connection.                                                                         |
    | NettyMemory                            | The total amount of off-heap memory used by celeborn worker.                                                    |
    | SortTime                               | The time for a worker to sort a shuffle file.                                                                   |
    | SortMemory                             | The memory used by sorting shuffle files.                                                                       |
    | SortingFiles                           | The count of sorting shuffle files.                                                                             |
    | PendingSortTasks                       | The count of sort tasks waiting to be submitted to FileSorterExecutors.                                         |
    | SortedFiles                            | The count of sorted shuffle files.                                                                              |
    | SortedFileSize                         | The count of sorted shuffle files 's total size.                                                                |
    | DiskBuffer                             | The memory occupied by pushData and pushMergedData which should be written to disk.                             |
    | BufferStreamReadBuffer                 | The memory used by credit stream read buffer.                                                                   |
    | ReadBufferDispatcherRequestsLength     | The queue size of read buffer allocation requests.                                                              |
    | ReadBufferAllocatedCount               | Allocated read buffer count.                                                                                    |
    | ActiveCreditStreamCount                | Active stream count for map partition reading streams.                                                          |
    | ActiveMapPartitionCount                | The count of active map partition reading streams.                                                              |
    | SorterCacheHitRate                     | The cache hit rate for worker partition sorter index.                                                           |
    | CleanTaskQueueSize                     | The count of task for cleaning up expired shuffle keys.                                                         |
    | CleanExpiredShuffleKeysTime            | The time for a worker to clean up shuffle data of expired shuffle keys.                                         |
    | DeviceOSFreeBytes                      | The actual usable space of OS for device monitor.                                                               |
    | DeviceOSTotalBytes                     | The total usable space of OS for device monitor.                                                                |
    | DeviceCelebornFreeBytes                | The actual usable space of Celeborn for device.                                                                 |
    | DeviceCelebornTotalBytes               | The total space of Celeborn for device.                                                                         |
    | PotentialConsumeSpeed                  | The speed of potential consumption for congestion control.                                                      |
    | UserProduceSpeed                       | The speed of user production for congestion control.                                                            |
    | WorkerConsumeSpeed                     | The speed of worker consumption for congestion control.                                                         |
    | IsDecommissioningWorker                | 1 means worker decommissioning, 0 means not decommissioning.                                                    |
    | IsHighWorkload                         | 1 means worker high workload, 0 means not high workload.                                                    |
    | UnreleasedShuffleCount                 | Unreleased shuffle count when worker is decommissioning.                                                        |
    | UnreleasedPartitionLocationCount       | Unreleased partition location count when worker is shutting down.                                               |
    | MemoryStorageFileCount                 | The count of files in Memory Storage of a worker.                                                               |
    | MemoryFileStorageSize                  | The total amount of memory used by Memory Storage.                                                              |
    | EvictedFileCount                       | The count of files evicted from Memory Storage to Disk                                                          |
    | DirectMemoryUsageRatio                 | Ratio of direct memory used and max direct memory.                                                              |
    | RegisterWithMasterFailCount            | The count of failures in register with master request.                                                          |
    | FlushWorkingQueueSize                  | The size of flush working queue for mount point.                                                                |
    | LocalFlushCount                        | The amount of data flushed to local.                                                                            |
    | LocalFlushSize                         | The size of data flushed to local.                                                                              |
    | HdfsFlushCount                         | The amount of data flushed to HDFS.                                                                             |
    | HdfsFlushSize                          | The size of data flushed to HDFS.                                                                               |
    | OssFlushCount                          | The amount of data flushed to OSS.                                                                              |
    | OssFlushSize                           | The size of data flushed to OSS.                                                                                |
    | S3FlushCount                           | The amount of data flushed to S3.                                                                               |
    | S3FlushSize                            | The size of data flushed to S3.                                                                                 |
    | push_usedHeapMemory                    |                                                                                                                 |
    | push_usedDirectMemory                  |                                                                                                                 |
    | push_numHeapArenas                     |                                                                                                                 |
    | push_numDirectArenas                   |                                                                                                                 |
    | push_tinyCacheSize                     |                                                                                                                 |
    | push_smallCacheSize                    |                                                                                                                 |
    | push_normalCacheSize                   |                                                                                                                 |
    | push_numThreadLocalCaches              |                                                                                                                 |
    | push_chunkSize                         |                                                                                                                 |
    | push_numAllocations                    |                                                                                                                 |
    | push_numTinyAllocations                |                                                                                                                 |
    | push_numSmallAllocations               |                                                                                                                 |
    | push_numNormalAllocations              |                                                                                                                 |
    | push_numHugeAllocations                |                                                                                                                 |
    | push_numDeallocations                  |                                                                                                                 |
    | push_numTinyDeallocations              |                                                                                                                 |
    | push_numSmallDeallocations             |                                                                                                                 |
    | push_numNormalDeallocations            |                                                                                                                 |
    | push_numHugeDeallocations              |                                                                                                                 |
    | push_numActiveAllocations              |                                                                                                                 |
    | push_numActiveTinyAllocations          |                                                                                                                 |
    | push_numActiveSmallAllocations         |                                                                                                                 |
    | push_numActiveNormalAllocations        |                                                                                                                 |
    | push_numActiveHugeAllocations          |                                                                                                                 |
    | push_numActiveBytes                    |                                                                                                                 |
    | replicate_usedHeapMemory               |                                                                                                                 |
    | replicate_usedDirectMemory             |                                                                                                                 |
    | replicate_numHeapArenas                |                                                                                                                 |
    | replicate_numDirectArenas              |                                                                                                                 |
    | replicate_tinyCacheSize                |                                                                                                                 |
    | replicate_smallCacheSize               |                                                                                                                 |
    | replicate_normalCacheSize              |                                                                                                                 |
    | replicate_numThreadLocalCaches         |                                                                                                                 |
    | replicate_chunkSize                    |                                                                                                                 |
    | replicate_numAllocations               |                                                                                                                 |
    | replicate_numTinyAllocations           |                                                                                                                 |
    | replicate_numSmallAllocations          |                                                                                                                 |
    | replicate_numNormalAllocations         |                                                                                                                 |
    | replicate_numHugeAllocations           |                                                                                                                 |
    | replicate_numDeallocations             |                                                                                                                 |
    | replicate_numTinyDeallocations         |                                                                                                                 |
    | replicate_numSmallDeallocations        |                                                                                                                 |
    | replicate_numNormalDeallocations       |                                                                                                                 |
    | replicate_numHugeDeallocations         |                                                                                                                 |
    | replicate_numActiveAllocations         |                                                                                                                 |
    | replicate_numActiveTinyAllocations     |                                                                                                                 |
    | replicate_numActiveSmallAllocations    |                                                                                                                 |
    | replicate_numActiveNormalAllocations   |                                                                                                                 |
    | replicate_numActiveHugeAllocations     |                                                                                                                 |
    | replicate_numActiveBytes               |                                                                                                                 |
    | fetch_usedHeapMemory                   |                                                                                                                 |
    | fetch_usedDirectMemory                 |                                                                                                                 |
    | fetch_numHeapArenas                    |                                                                                                                 |
    | fetch_numDirectArenas                  |                                                                                                                 |
    | fetch_tinyCacheSize                    |                                                                                                                 |
    | fetch_smallCacheSize                   |                                                                                                                 |
    | fetch_normalCacheSize                  |                                                                                                                 |
    | fetch_numThreadLocalCaches             |                                                                                                                 |
    | fetch_chunkSize                        |                                                                                                                 |
    | fetch_numAllocations                   |                                                                                                                 |
    | fetch_numTinyAllocations               |                                                                                                                 |
    | fetch_numSmallAllocations              |                                                                                                                 |
    | fetch_numNormalAllocations             |                                                                                                                 |
    | fetch_numHugeAllocations               |                                                                                                                 |
    | fetch_numDeallocations                 |                                                                                                                 |
    | fetch_numTinyDeallocations             |                                                                                                                 |
    | fetch_numSmallDeallocations            |                                                                                                                 |
    | fetch_numNormalDeallocations           |                                                                                                                 |
    | fetch_numHugeDeallocations             |                                                                                                                 |
    | fetch_numActiveAllocations             |                                                                                                                 |
    | fetch_numActiveTinyAllocations         |                                                                                                                 |
    | fetch_numActiveSmallAllocations        |                                                                                                                 |
    | fetch_numActiveNormalAllocations       |                                                                                                                 |
    | fetch_numActiveHugeAllocations         |                                                                                                                 |
    | fetch_numActiveBytes                   |                                                                                                                 |

  - namespace=CPU

    | Metric Name  | Description               |
    |--------------|---------------------------|
    | JVMCPUTime   | The JVM costs cpu time.   |

  - namespace=system

    | Metric Name            | Description                                           |
    |------------------------|-------------------------------------------------------|
    | LastMinuteSystemLoad   | Returns the system load average for the last minute.  |
    | AvailableProcessors    | The amount of system available processors.            |

  - namespace=JVM
    - This source provides information on JVM metrics using the
      [Dropwizard/Codahale Metric Sets for JVM instrumentation](https://metrics.dropwizard.io/4.2.0/manual/jvm.html)
      and in particular the metric sets BufferPoolMetricSet, GarbageCollectorMetricSet and MemoryUsageGaugeSet.

  - namespace=ResourceConsumption
    - **notes:**
        - This metrics data is generated for each user and they are identified using a metric tag.
        - This metrics also include subResourceConsumptions generated for each application of user and they are identified using `applicationId` tag.

    | Metric Name       | Description                                         |
    |-------------------|-----------------------------------------------------|
    | diskFileCount     | The count of disk files consumption by each user.   |
    | diskBytesWritten  | The amount of disk files consumption by each user.  |
    | hdfsFileCount     | The count of hdfs files consumption by each user.   |
    | hdfsBytesWritten  | The amount of hdfs files consumption by each user.  |

  - namespace=ThreadPool
    - **notes:**
        - This metrics data is generated for each thread pool and they are identified using a metric tag by thread pool name.

    | Metric Name            | Description                                                                                                                 |
    |------------------------|-----------------------------------------------------------------------------------------------------------------------------|
    | active_thread_count    | The approximate number of threads that are actively executing tasks.                                                        |
    | pending_task_count     | The pending task not executed in block queue.                                                                               |
    | pool_size              | The current number of threads in the pool.                                                                                  |
    | core_pool_size         | The core number of threads.                                                                                                 |
    | maximum_pool_size      | The maximum allowed number of threads.                                                                                      |
    | largest_pool_size      | The largest number of threads that have ever simultaneously been in the pool.                                               |
    | is_terminating         | Whether this executor is in the process of terminating after shutdown() or shutdownNow() but has not completely terminated. |
    | is_terminated          | Whether this executor is in the process of terminated after shutdown() or shutdownNow() and has completely terminated.      |
    | is_shutdown            | Whether this executor is shutdown.                                                                                          |

**Note:**

The Netty DirectArenaMetrics named like `push/fetch/replicate_numXX` are not exposed by default, nor in Grafana dashboard.
If there is a need, you can enable `celeborn.network.memory.allocator.verbose.metric` to expose these metrics.

### Setup Prometheus dashboard

1. Install Prometheus (https://prometheus.io/). We provide an example for Prometheus config file:

```yaml
# Prometheus example config
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: "Celeborn"
    metrics_path: /metrics/prometheus
    scrape_interval: 15s
    static_configs:
      - targets: [ "master-ip:9098","worker1-ip:9096","worker2-ip:9096","worker3-ip:9096","worker4-ip:9096" ]
```

2. Install Grafana server (https://grafana.com/grafana/download).

3. Import Celeborn dashboard into Grafana.

You can find the Celeborn dashboard templates under the `assets/grafana` directory.
`celeborn-dashboard.json` displays Celeborn internal metrics and `celeborn-jvm-dashboard.json` displays Celeborn JVM related metrics.

Here is an example of Grafana dashboard importing.

![g1](assets/img/g1.png)

![g2](assets/img/g2.png)

![g3](assets/img/g3.png)

<img alt="g4" height="90%" src="assets/img/g4.png" width="90%"/>

<img alt="g6" height="90%" src="assets/img/g6.png" width="90%"/>

<img alt="g5" height="90%" src="assets/img/g5.png" width="90%"/>

Here are some snapshots:

![d1](assets/img/dashboard1.png)

![d2](assets/img/dashboard_full.webp)

#### Optional
We recommend you to install node exporter (https://github.com/prometheus/node_exporter)
on every host, and configure Prometheus to scrape information about the host.
Grafana will need a dashboard (dashboard id:8919) to display host details.

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: "Celeborn"
    metrics_path: /metrics/prometheus
    scrape_interval: 15s
    static_configs:
      - targets: [ "master-ip:9098","worker1-ip:9096","worker2-ip:9096","worker3-ip:9096","worker4-ip:9096" ]
  - job_name: "node"
    static_configs:
      - targets: [ "master-ip:9100","worker1-ip:9100","worker2-ip:9100","worker3-ip:9100","worker4-ip:9100" ]
```

## REST API

In addition to viewing the metrics, Celeborn also supports [REST API](restapi.md).
This gives developers an easy way to create new visualizations and monitoring tools for Celeborn and
also easy for users to get the running status of the service.