### What changes were proposed in this pull request? Correct `celeborn.metrics.conf.*.sink.csv.class` configuration example for a CSV sink. ### Why are the changes needed? `celeborn.metrics.conf.*.sink.csv.class` configuration example for a CSV sink is wrong, which value should be `org.apache.celeborn.common.metrics.sink.CsvSink`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? None. Closes #1865 from SteNicholas/CELEBORN-927. Authored-by: SteNicholas <programgeek@163.com> Signed-off-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
318 lines
16 KiB
Markdown
318 lines
16 KiB
Markdown
---
|
||
license: |
|
||
Licensed to the Apache Software Foundation (ASF) under one or more
|
||
contributor license agreements. See the NOTICE file distributed with
|
||
this work for additional information regarding copyright ownership.
|
||
The ASF licenses this file to You under the Apache License, Version 2.0
|
||
(the "License"); you may not use this file except in compliance with
|
||
the License. You may obtain a copy of the License at
|
||
|
||
https://www.apache.org/licenses/LICENSE-2.0
|
||
|
||
Unless required by applicable law or agreed to in writing, software
|
||
distributed under the License is distributed on an "AS IS" BASIS,
|
||
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||
See the License for the specific language governing permissions and
|
||
limitations under the License.
|
||
---
|
||
|
||
Monitoring
|
||
===
|
||
|
||
There are two ways to monitor Celeborn cluster: Prometheus metrics and REST API.
|
||
|
||
## Metrics
|
||
|
||
Celeborn has a configurable metrics system based on the
|
||
[Dropwizard Metrics Library](http://metrics.dropwizard.io/4.2.0).
|
||
This allows users to report Celeborn metrics to a variety of sinks including HTTP, JMX, CSV
|
||
files and prometheus servlet. The metrics are generated by sources embedded in the Celeborn code base.
|
||
They provide instrumentation for specific activities and Celeborn components.
|
||
The metrics system is configured via a configuration file that Celeborn expects to be present
|
||
at `$CELEBORN_HOME/conf/metrics.properties`. A custom file location can be specified via the
|
||
`spark.metrics.conf` [configuration property](https://celeborn.apache.org/configuration/#metrics).
|
||
Instead of using the configuration file, a set of configuration parameters with prefix
|
||
`celeborn.metrics.conf.` can be used.
|
||
|
||
Celeborn's metrics are divided into two
|
||
_instances_ corresponding to Celeborn components. The following instances are currently supported:
|
||
|
||
* `master`: The Celeborn cluster master process.
|
||
* `worker`: The Celeborn cluster worker process.
|
||
|
||
Each instance can report to zero or more _sinks_. Sinks are contained in the
|
||
`org.apache.celeborn.common.metrics.sink` package:
|
||
|
||
* `CSVSink`: Exports metrics data to CSV files at regular intervals.
|
||
* `PrometheusServlet`: Adds a servlet within the existing Celeborn REST API to serve metrics data in Prometheus format.
|
||
* `GraphiteSink`: Sends metrics to a Graphite node.
|
||
|
||
The syntax of the metrics configuration file and the parameters available for each sink are defined
|
||
in an example configuration file,
|
||
`$CELEBORN_HOME/conf/metrics.properties.template`.
|
||
|
||
When using Celeborn configuration parameters instead of the metrics configuration file, the relevant
|
||
parameter names are composed by the prefix `celeborn.metrics.conf.` followed by the configuration
|
||
details, i.e. the parameters take the following form:
|
||
`celeborn.metrics.conf.[instance|*].sink.[sink_name].[parameter_name]`.
|
||
This example shows a list of Celeborn configuration parameters for a CSV sink:
|
||
```
|
||
"celeborn.metrics.conf.*.sink.csv.class"="org.apache.celeborn.common.metrics.sink.CsvSink"
|
||
"celeborn.metrics.conf.*.sink.csv.period"="1"
|
||
"celeborn.metrics.conf.*.sink.csv.unit"=minutes
|
||
"celeborn.metrics.conf.*.sink.csv.directory"=/tmp/
|
||
```
|
||
|
||
Default values of the Celeborn metrics configuration are as follows:
|
||
```
|
||
*.sink.prometheusServlet.class=org.apache.celeborn.common.metrics.sink.PrometheusServlet
|
||
```
|
||
|
||
Additional sources can be configured using the metrics configuration file or the configuration
|
||
parameter `spark.metrics.conf.[component_name].source.jvm.class=[source_name]`. At present the
|
||
no source is the available optional source. For example the following configuration parameter
|
||
activates the Example source:
|
||
`"celeborn.metrics.conf.*.source.jvm.class"="org.apache.celeborn.common.metrics.source.ExampleSource"`
|
||
|
||
### Available metrics providers
|
||
|
||
Metrics used by Celeborn are of multiple types: gauge, counter, histogram, meter and timer,
|
||
see [Dropwizard library documentation for details](https://metrics.dropwizard.io/4.2.0/getting-started.html).
|
||
The following list of components and metrics reports the name and some details about the available metrics,
|
||
grouped per component instance and source namespace.
|
||
The most common time of metrics used in Celeborn instrumentation are gauges and counters.
|
||
Counters can be recognized as they have the `.count` suffix. Timers, meters and histograms are annotated
|
||
in the list, the rest of the list elements are metrics of type gauge.
|
||
The large majority of metrics are active as soon as their parent component instance is configured,
|
||
some metrics require also to be enabled via an additional configuration parameter, the details are
|
||
reported in the list.
|
||
|
||
#### Master
|
||
These metrics are exposed by Celeborn master.
|
||
|
||
- namespace=master
|
||
- WorkerCount
|
||
- LostWorkers
|
||
- ExcludedWorkerCount
|
||
- RegisteredShuffleCount
|
||
- IsActiveMaster
|
||
- PartitionSize
|
||
- The size of estimated shuffle partition.
|
||
- OfferSlotsTime
|
||
- The time for masters to handle `RequestSlots` request when registering shuffle.
|
||
|
||
- namespace=CPU
|
||
- JVMCPUTime
|
||
|
||
- namespace=system
|
||
- LastMinuteSystemLoad
|
||
- The average system load for the last minute.
|
||
- AvailableProcessors
|
||
|
||
- namespace=JVM
|
||
- This source provides information on JVM metrics using the
|
||
[Dropwizard/Codahale Metric Sets for JVM instrumentation](https://metrics.dropwizard.io/4.2.0/manual/jvm.html)
|
||
and in particular the metric sets BufferPoolMetricSet, GarbageCollectorMetricSet and MemoryUsageGaugeSet.
|
||
|
||
- namespace=ResourceConsumption
|
||
- **notes:**
|
||
- This metrics data is generated for each user and they are identified using a metric tag.
|
||
- diskFileCount
|
||
- diskBytesWritten
|
||
- hdfsFileCount
|
||
- hdfsBytesWritten
|
||
|
||
#### Worker
|
||
These metrics are exposed by Celeborn worker.
|
||
|
||
- namespace=worker
|
||
- CommitFilesTime
|
||
- The time for a worker to flush buffers and close files related to specified shuffle.
|
||
- ReserveSlotsTime
|
||
- FlushDataTime
|
||
- The time for a worker to write a buffer which is 256KB by default to storage.
|
||
- OpenStreamTime
|
||
- The time for a worker to process openStream RPC and return StreamHandle.
|
||
- FetchChunkTime
|
||
- The time for a worker to fetch a chunk which is 8MB by default from a reduced partition.
|
||
- PrimaryPushDataTime
|
||
- The time for a worker to handle a pushData RPC sent from a celeborn client.
|
||
- ReplicaPushDataTime
|
||
- The time for a worker to handle a pushData RPC sent from a celeborn worker by replicating.
|
||
- WriteDataFailCount
|
||
- ReplicateDataFailCount
|
||
- ReplicateDataWriteFailCount
|
||
- ReplicateDataCreateConnectionFailCount
|
||
- ReplicateDataConnectionExceptionCount
|
||
- ReplicateDataTimeoutCount
|
||
- PushDataHandshakeFailCount
|
||
- RegionStartFailCount
|
||
- RegionFinishFailCount
|
||
- PrimaryPushDataHandshakeTime
|
||
- ReplicaPushDataHandshakeTime
|
||
- PrimaryRegionStartTime
|
||
- ReplicaRegionStartTime
|
||
- PrimaryRegionFinishTime
|
||
- ReplicaRegionFinishTime
|
||
- TakeBufferTime
|
||
- The time for a worker to take out a buffer from a disk flusher.
|
||
- RegisteredShuffleCount
|
||
- SlotsAllocated
|
||
- NettyMemory
|
||
- The total amount of off-heap memory used by celeborn worker.
|
||
- SortTime
|
||
- The time for a worker to sort a shuffle file.
|
||
- SortMemory
|
||
- The memory used by sorting shuffle files.
|
||
- SortingFiles
|
||
- SortedFiles
|
||
- SortedFileSize
|
||
- DiskBuffer
|
||
- The memory occupied by pushData and pushMergedData which should be written to disk.
|
||
- PausePushData
|
||
- The count for a worker to stop receiving pushData from clients because of back pressure.
|
||
- PausePushDataAndReplicate
|
||
- The count for a worker to stop receiving pushData from clients and other workers because of back pressure.
|
||
- BufferStreamReadBuffer
|
||
- The memory used by credit stream read buffer.
|
||
- ReadBufferDispatcherRequestsLength
|
||
- The queue size of read buffer allocation requests.
|
||
- ReadBufferAllocatedCount
|
||
- Allocated read buffer count.
|
||
- CreditStreamCount
|
||
- Stream count for map partition reading streams.
|
||
- ActiveMapPartitionCount
|
||
- DeviceOSFreeBytes
|
||
- DeviceOSTotalBytes
|
||
- DeviceCelebornFreeBytes
|
||
- DeviceCelebornTotalBytes
|
||
- PotentialConsumeSpeed
|
||
- UserProduceSpeed
|
||
- WorkerConsumeSpeed
|
||
- push_server_usedHeapMemory
|
||
- push_server_usedDirectMemory
|
||
- push_server_numAllocations
|
||
- push_server_numTinyAllocations
|
||
- push_server_numSmallAllocations
|
||
- push_server_numNormalAllocations
|
||
- push_server_numHugeAllocations
|
||
- push_server_numDeallocations
|
||
- push_server_numTinyDeallocations
|
||
- push_server_numSmallDeallocations
|
||
- push_server_numNormalDeallocations
|
||
- push_server_numHugeDeallocations
|
||
- push_server_numActiveAllocations
|
||
- push_server_numActiveTinyAllocations
|
||
- push_server_numActiveSmallAllocations
|
||
- push_server_numActiveNormalAllocations
|
||
- push_server_numActiveHugeAllocations
|
||
- push_server_numActiveBytes
|
||
- replicate_server_usedHeapMemory
|
||
- replicate_server_usedDirectMemory
|
||
- replicate_server_numAllocations
|
||
- replicate_server_numTinyAllocations
|
||
- replicate_server_numSmallAllocations
|
||
- replicate_server_numNormalAllocations
|
||
- replicate_server_numHugeAllocations
|
||
- replicate_server_numDeallocations
|
||
- replicate_server_numTinyDeallocations
|
||
- replicate_server_numSmallDeallocations
|
||
- replicate_server_numNormalDeallocations
|
||
- replicate_server_numHugeDeallocations
|
||
- replicate_server_numActiveAllocations
|
||
- replicate_server_numActiveTinyAllocations
|
||
- replicate_server_numActiveSmallAllocations
|
||
- replicate_server_numActiveNormalAllocations
|
||
- replicate_server_numActiveHugeAllocations
|
||
- replicate_server_numActiveBytes
|
||
- fetch_server_usedHeapMemory
|
||
- fetch_server_usedDirectMemory
|
||
- fetch_server_numAllocations
|
||
- fetch_server_numTinyAllocations
|
||
- fetch_server_numSmallAllocations
|
||
- fetch_server_numNormalAllocations
|
||
- fetch_server_numHugeAllocations
|
||
- fetch_server_numDeallocations
|
||
- fetch_server_numTinyDeallocations
|
||
- fetch_server_numSmallDeallocations
|
||
- fetch_server_numNormalDeallocations
|
||
- fetch_server_numHugeDeallocations
|
||
- fetch_server_numActiveAllocations
|
||
- fetch_server_numActiveTinyAllocations
|
||
- fetch_server_numActiveSmallAllocations
|
||
- fetch_server_numActiveNormalAllocations
|
||
- fetch_server_numActiveHugeAllocations
|
||
- fetch_server_numActiveBytes
|
||
|
||
- namespace=CPU
|
||
- JVMCPUTime
|
||
|
||
- namespace=system
|
||
- LastMinuteSystemLoad
|
||
- Returns the system load average for the last minute.
|
||
- AvailableProcessors
|
||
|
||
- namespace=JVM
|
||
- This source provides information on JVM metrics using the
|
||
[Dropwizard/Codahale Metric Sets for JVM instrumentation](https://metrics.dropwizard.io/4.2.0/manual/jvm.html)
|
||
and in particular the metric sets BufferPoolMetricSet, GarbageCollectorMetricSet and MemoryUsageGaugeSet.
|
||
|
||
**Note:**
|
||
|
||
The Netty DirectArenaMetrics named like `push/fetch/replicate_server_numXX` are not exposed by default, nor in Grafana dashboard.
|
||
If there is a need, you can enable `celeborn.network.memory.allocator.verbose.metric` to expose these metrics.
|
||
|
||
## REST API
|
||
|
||
In addition to viewing the metrics, Celeborn also support REST API. This gives developers
|
||
an easy way to create new visualizations and monitoring tools for Celeborn and
|
||
also easy for users to get the running status of the service. The REST API is available for
|
||
both master and worker. The endpoints are mounted at `host:port`. For example,
|
||
for the master, they would typically be accessible
|
||
at `http://<master-prometheus-host>:<master-prometheus-port><path>`, and
|
||
for the worker, at `http://<worker-prometheus-host>:<worker-prometheus-port><path>`.
|
||
|
||
The configuration of `<master-prometheus-host>`, `<master-prometheus-port>`, `<worker-prometheus-host>`, `<worker-prometheus-port>` as below:
|
||
|
||
| Key | Default | Description | Since |
|
||
|-----------------------------------------|---------|----------------------------|-------|
|
||
| celeborn.metrics.master.prometheus.host | 0.0.0.0 | Master's Prometheus host. | 0.2.0 |
|
||
| celeborn.metrics.master.prometheus.port | 9098 | Master's Prometheus port. | 0.2.0 |
|
||
| celeborn.metrics.worker.prometheus.host | 0.0.0.0 | Worker's Prometheus host. | 0.2.0 |
|
||
| celeborn.metrics.worker.prometheus.port | 9096 | Worker's Prometheus port. | 0.2.0 |
|
||
|
||
### Available API providers
|
||
|
||
API path listed as below:
|
||
|
||
#### Master
|
||
|
||
| Path | Meaning |
|
||
|-----------------------|-------------------------------------------------------------------------------------------------------------|
|
||
| /metrics/prometheus | List the metrics data in prometheus format of the master. |
|
||
| /conf | List the conf setting of the master. |
|
||
| /workerInfo | List worker information of the service. It will list all registered workers 's information. |
|
||
| /lostWorkers | List all lost workers of the master. |
|
||
| /excludedWorkers | List all excluded workers of the master. |
|
||
| /threadDump | List the current thread dump of the master. |
|
||
| /hostnames | List all running application's LifecycleManager's hostnames of the cluster. |
|
||
| /applications | List all running application's ids of the cluster. |
|
||
| /shuffles | List all running shuffle keys of the service. It will return all running shuffle's key of the cluster. |
|
||
| /listTopDiskUsedApps | List the top disk usage application ids. It will return the top disk usage application ids for the cluster. |
|
||
|
||
#### Worker
|
||
|
||
| Path | Meaning |
|
||
|----------------------------|----------------------------------------------------------------------------------------------------------|
|
||
| /metrics/prometheus | List the metrics data in prometheus format of the worker. |
|
||
| /conf | List the conf setting of the worker. |
|
||
| /workerInfo | List the worker information of the worker. |
|
||
| /threadDump | List the current thread dump of the worker. |
|
||
| /shuffles | List all the running shuffle keys of the worker. It only return keys of shuffles running in that worker. |
|
||
| /listTopDiskUsedApps | List the top disk usage application ids. It only return application ids running in that worker. |
|
||
| /listPartitionLocationInfo | List all the living PartitionLocation information in that worker. |
|
||
| /unavailablePeers | List the unavailable peers of the worker, this always means the worker connect to the peer failed. |
|
||
| /isShutdown | Show if the worker is during the process of shutdown. |
|
||
| /isRegistered | Show if the worker is registered to the master success. |
|
||
| /exit?type=${TYPE} | Trigger this worker to exit. Legal `type`s are 'DECOMMISSION‘, 'GRACEFUL' and 'IMMEDIATELY' |
|