[CELEBORN-726] Update data replication terminology from master/slave to primary/replica for configurations and metrics

### What changes were proposed in this pull request?

This pull PR is an integral component of #1639 . It primarily focuses on updating configuration settings and metrics terminology, while ensuring compatibility with older client versions by refraining from introducing changes related to RPC.

### Why are the changes needed?

In order to distinguish it from the existing master/worker, refactor data replication terminology to 'primary/replica' for improved clarity and inclusivity in the codebase

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

existing tests.

Closes #1650 from cfmcgrady/primary-replica-metrics.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>

2023-06-29 09:47:02 +08:00

16 KiB

Raw Blame History

license
Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to You under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

license

Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to You under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Monitoring

There are two ways to monitor Celeborn cluster: Prometheus metrics and REST API.

Metrics

Celeborn has a configurable metrics system based on the Dropwizard Metrics Library. This allows users to report Celeborn metrics to a variety of sinks including HTTP, JMX, CSV files and prometheus servlet. The metrics are generated by sources embedded in the Celeborn code base. They provide instrumentation for specific activities and Celeborn components. The metrics system is configured via a configuration file that Celeborn expects to be present at $CELEBORN_HOME/conf/metrics.properties. A custom file location can be specified via the spark.metrics.conf configuration property. Instead of using the configuration file, a set of configuration parameters with prefix celeborn.metrics.conf. can be used.

Celeborn's metrics are divided into two instances corresponding to Celeborn components. The following instances are currently supported:

master: The Celeborn cluster master process.
worker: The Celeborn cluster worker process.

Each instance can report to zero or more sinks. Sinks are contained in the org.apache.celeborn.common.metrics.sink package:

CSVSink: Exports metrics data to CSV files at regular intervals.
PrometheusServlet: Adds a servlet within the existing Celeborn REST API to serve metrics data in Prometheus format.
GraphiteSink: Sends metrics to a Graphite node.

The syntax of the metrics configuration file and the parameters available for each sink are defined in an example configuration file, $CELEBORN_HOME/conf/metrics.properties.template.

When using Celeborn configuration parameters instead of the metrics configuration file, the relevant parameter names are composed by the prefix celeborn.metrics.conf. followed by the configuration details, i.e. the parameters take the following form: celeborn.metrics.conf.[instance|*].sink.[sink_name].[parameter_name]. This example shows a list of Celeborn configuration parameters for a CSV sink:

"celeborn.metrics.conf.*.sink.csv.class"="org.apache.celeborn.common.metrics.sink.GraphiteSink"
"celeborn.metrics.conf.*.sink.csv.period"="1"
"celeborn.metrics.conf.*.sink.csv.unit"=minutes
"celeborn.metrics.conf.*.sink.csv.directory"=/tmp/

Default values of the Celeborn metrics configuration are as follows:

*.sink.prometheusServlet.class=org.apache.celeborn.common.metrics.sink.PrometheusServlet

Additional sources can be configured using the metrics configuration file or the configuration parameter spark.metrics.conf.[component_name].source.jvm.class=[source_name]. At present the no source is the available optional source. For example the following configuration parameter activates the Example source: "celeborn.metrics.conf.*.source.jvm.class"="org.apache.celeborn.common.metrics.source.ExampleSource"

List of available metrics providers

Metrics used by Celeborn are of multiple types: gauge, counter, histogram, meter and timer, see Dropwizard library documentation for details. The following list of components and metrics reports the name and some details about the available metrics, grouped per component instance and source namespace. The most common time of metrics used in Celeborn instrumentation are gauges and counters. Counters can be recognized as they have the .count suffix. Timers, meters and histograms are annotated in the list, the rest of the list elements are metrics of type gauge. The large majority of metrics are active as soon as their parent component instance is configured, some metrics require also to be enabled via an additional configuration parameter, the details are reported in the list.

Component instance = Master

These metrics are exposed by Celeborn master.

namespace=master
- WorkerCount
- LostWorkers
- ExcludedWorkerCount
- RegisteredShuffleCount
- IsActiveMaster
- PartitionSize
  - The size of estimated shuffle partition.
- OfferSlotsTime
  - The time for masters to handle RequestSlots request when registering shuffle.
namespace=CPU
- JVMCPUTime
namespace=system
- LastMinuteSystemLoad
  - The average system load for the last minute.
- AvailableProcessors
namespace=JVM
- This source provides information on JVM metrics using the Dropwizard/Codahale Metric Sets for JVM instrumentation and in particular the metric sets BufferPoolMetricSet, GarbageCollectorMetricSet and MemoryUsageGaugeSet.
namespace=ResourceConsumption
- notes:
  - This merics data is generated for each user and they are identified using a metric tag.
- diskFileCount
- diskBytesWritten
- hdfsFileCount
- hdfsBytesWritten

Component instance = Worker

These metrics are exposed by Celeborn worker.

namespace=worker
- CommitFilesTime
  - The time for a worker to flush buffers and close files related to specified shuffle.
- ReserveSlotsTime
- FlushDataTime
  - The time for a worker to write a buffer which is 256KB by default to storage.
- OpenStreamTime
  - The time for a worker to process openStream RPC and return StreamHandle.
- FetchChunkTime
  - The time for a worker to fetch a chunk which is 8MB by default from a reduced partition.
- PrimaryPushDataTime
  - The time for a worker to handle a pushData RPC sent from a celeborn client.
- ReplicaPushDataTime
  - The time for a worker to handle a pushData RPC sent from a celeborn worker by replicating.
- WriteDataFailCount
- ReplicateDataFailCount
- ReplicateDataWriteFailCount
- ReplicateDataCreateConnectionFailCount
- ReplicateDataConnectionExceptionCount
- ReplicateDataTimeoutCount
- PushDataHandshakeFailCount
- RegionStartFailCount
- RegionFinishFailCount
- PrimaryPushDataHandshakeTime
- ReplicaPushDataHandshakeTime
- PrimaryRegionStartTime
- ReplicaRegionStartTime
- PrimaryRegionFinishTime
- ReplicaRegionFinishTime
- TakeBufferTime
  - The time for a worker to take out a buffer from a disk flusher.
- RegisteredShuffleCount
- SlotsAllocated
- NettyMemory
  - The total amount of off-heap memory used by celeborn worker.
- SortTime
  - The time for a worker to sort a shuffle file.
- SortMemory
  - The memory used by sorting shuffle files.
- SortingFiles
- SortedFiles
- SortedFileSize
- DiskBuffer
  - The memory occupied by pushData and pushMergedData which should be written to disk.
- PausePushData
  - The count for a worker to stop receiving pushData from clients because of back pressure.
- PausePushDataAndReplicate
  - The count for a worker to stop receiving pushData from clients and other workers because of back pressure.
- BufferStreamReadBuffer
  - The memory used by credit stream read buffer.
- ReadBufferDispatcherRequestsLength
  - The queue size of read buffer allocation requests.
- ReadBufferAllocatedCount
  - Allocated read buffer count.
- CreditStreamCount
  - Stream count for map partition reading streams.
- ActiveMapPartitionCount
- DeviceOSFreeBytes
- DeviceOSTotalBytes
- DeviceCelebornFreeBytes
- DeviceCelebornTotalBytes
- PotentialConsumeSpeed
- UserProduceSpeed
- WorkerConsumeSpeed
- push_server_usedHeapMemory
- push_server_usedDirectMemory
- push_server_numAllocations
- push_server_numTinyAllocations
- push_server_numSmallAllocations
- push_server_numNormalAllocations
- push_server_numHugeAllocations
- push_server_numDeallocations
- push_server_numTinyDeallocations
- push_server_numSmallDeallocations
- push_server_numNormalDeallocations
- push_server_numHugeDeallocations
- push_server_numActiveAllocations
- push_server_numActiveTinyAllocations
- push_server_numActiveSmallAllocations
- push_server_numActiveNormalAllocations
- push_server_numActiveHugeAllocations
- push_server_numActiveBytes
- replicate_server_usedHeapMemory
- replicate_server_usedDirectMemory
- replicate_server_numAllocations
- replicate_server_numTinyAllocations
- replicate_server_numSmallAllocations
- replicate_server_numNormalAllocations
- replicate_server_numHugeAllocations
- replicate_server_numDeallocations
- replicate_server_numTinyDeallocations
- replicate_server_numSmallDeallocations
- replicate_server_numNormalDeallocations
- replicate_server_numHugeDeallocations
- replicate_server_numActiveAllocations
- replicate_server_numActiveTinyAllocations
- replicate_server_numActiveSmallAllocations
- replicate_server_numActiveNormalAllocations
- replicate_server_numActiveHugeAllocations
- replicate_server_numActiveBytes
- fetch_server_usedHeapMemory
- fetch_server_usedDirectMemory
- fetch_server_numAllocations
- fetch_server_numTinyAllocations
- fetch_server_numSmallAllocations
- fetch_server_numNormalAllocations
- fetch_server_numHugeAllocations
- fetch_server_numDeallocations
- fetch_server_numTinyDeallocations
- fetch_server_numSmallDeallocations
- fetch_server_numNormalDeallocations
- fetch_server_numHugeDeallocations
- fetch_server_numActiveAllocations
- fetch_server_numActiveTinyAllocations
- fetch_server_numActiveSmallAllocations
- fetch_server_numActiveNormalAllocations
- fetch_server_numActiveHugeAllocations
- fetch_server_numActiveBytes

Note:

The Netty DirectArenaMetrics named like push/fetch/replicate_server_numXX are not exposed by default, nor in Grafana dashboard. If there is a need, you can enable celeborn.network.memory.allocator.verbose.metric to expose these metrics.

namespace=CPU
- JVMCPUTime
namespace=system
- LastMinuteSystemLoad
  - Returns the system load average for the last minute.
- AvailableProcessors
namespace=JVM
- This source provides information on JVM metrics using the Dropwizard/Codahale Metric Sets for JVM instrumentation and in particular the metric sets BufferPoolMetricSet, GarbageCollectorMetricSet and MemoryUsageGaugeSet.

REST API

In addition to viewing the metrics, Celeborn also support REST API. This gives developers an easy way to create new visualizations and monitoring tools for Celeborn and also easy for users to get the running status of the service. The REST API is available for both master and worker. The endpoints are mounted at host:port. For example, for the master, they would typically be accessible at http://<master-prometheus-host>:<master-prometheus-port><path>, and for the worker, at http://<worker-prometheus-host>:<worker-prometheus-port><path>.

The configuration of <master-prometheus-host>, <master-prometheus-port>, <worker-prometheus-host>, <worker-prometheus-port> as below:

Key	Default	Description	Since
celeborn.metrics.master.prometheus.host	0.0.0.0	Master's Prometheus host.	0.2.0
celeborn.metrics.master.prometheus.port	9098	Master's Prometheus port.	0.2.0
celeborn.metrics.worker.prometheus.host	0.0.0.0	Worker's Prometheus host.	0.2.0
celeborn.metrics.worker.prometheus.port	9096	Worker's Prometheus port.	0.2.0

API path listed as below:

Path	Service	Meaning
/metrics/prometheus	master, worker	List service metrics data in prometheus format.
/conf	master, worker	List the conf setting of the service.
/workerInfo	master, worker	List worker information of the service. For the master, it will list all registered workers 's information.
/lostWorkers	master	List all lost workers of the master.
/excludedWorkers	master	List all excluded workers of the master.
/threadDump	master, worker	List the current thread dump of the service.
/hostnames	master	List all running application's LifecycleManager's hostnames of the cluster.
/applications	master	List all running application's ids of the cluster.
/shuffles	master, worker	List all running shuffle keys of the service. For master, will return all running shuffle's key of the cluster, for worker, only return keys of shuffles running in that worker.
/listTopDiskUsedApps	master, worker	List the top disk usage application ids. For master, will return the top disk usage application ids for the cluster, for worker, only return application ids running in that worker.
/listPartitionLocationInfo	worker	List all living PartitionLocation information in that worker.
/unavailablePeers	worker	List the unavailable peers of the worker, this always means the worker connect to the peer failed.
/isShutdown	worker	Show if the worker is during the process of shutdown.
/isRegistered	worker	Show if the worker is registered to the master success.

16 KiB Raw Blame History