liangyongyuan 4ddc91afda [CELEBRON-1282] Optimize push data replica error message

### What changes were proposed in this pull request?
Optimize the handling of exceptions during the push of replica data, now only throwing PUSH_DATA_CONNECTION_EXCEPTION_REPLICA in specific scenarios.

### Why are the changes needed?
When handling exceptions related to pushing replica data in the worker, unmatched exceptions, such as 'file already closed,' are uniformly transformed into REPLICATE_DATA_CONNECTION_EXCEPTION_COUNT and returned to the client. The client then excludes the peer node based on this count, which may not be appropriate in certain scenarios. For instance, in the case of an exception like 'file already closed,' it typically occurs during multiple splits and commitFile operations. Excluding a large number of nodes under such circumstances is clearly not in line with expectations.
![image](https://github.com/apache/incubator-celeborn/assets/46274164/816d21ad-1f79-45f0-bbe7-e93e15389edd)

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
through exist uts

Closes #2323 from lyy-pineapple/CELEBORN-1282.

Authored-by: liangyongyuan <liangyongyuan@xiaomi.com>
Signed-off-by: waitinfuture <zky.zhoukeyong@alibaba-inc.com>

2024-02-26 12:55:26 +08:00

26 KiB

Raw Blame History

Metrics

We provide various metrics about memory, disk, and important procedures. These metrics could help identify performance issue or monitor Celeborn cluster.

Prerequisites

Enable Celeborn metrics. Set configuration celeborn.metrics.enabled to true (true by default).
Configure Celeborn metrics properties.

cd $CELEBORN_HOME/conf
cp metrics.properties.template metrics.properties

The default values of the Celeborn metrics configuration are as follows:

*.sink.prometheusServlet.class=org.apache.celeborn.common.metrics.sink.PrometheusServlet

Install Prometheus (https://prometheus.io/). We provide an example for Prometheus config file:

# Prometheus example config
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: "Celeborn"
    metrics_path: /metrics/prometheus
    scrape_interval: 15s
    static_configs:
      - targets: [ "master-ip:9098","worker1-ip:9096","worker2-ip:9096","worker3-ip:9096","worker4-ip:9096" ]

Install Grafana server (https://grafana.com/grafana/download).
Import Celeborn dashboard into Grafana.

You can find the Celeborn dashboard templates under the assets/grafana directory. celeborn-dashboard.json displays Celeborn internal metrics and celeborn-jvm-dashboard.json displays Celeborn JVM related metrics.

Optional

We recommend you to install node exporter (https://github.com/prometheus/node_exporter) on every host, and configure Prometheus to scrape information about the host. Grafana will need a dashboard (dashboard id:8919) to display host details.

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: "Celeborn"
    metrics_path: /metrics/prometheus
    scrape_interval: 15s
    static_configs:
      - targets: [ "master-ip:9098","worker1-ip:9096","worker2-ip:9096","worker3-ip:9096","worker4-ip:9096" ]
  - job_name: "node"
    static_configs:
      - targets: [ "master-ip:9100","worker1-ip:9100","worker2-ip:9100","worker3-ip:9100","worker4-ip:9100" ]

Import Dashboard Steps

Here is an example of Grafana dashboard importing.

Details

MetricName	Scope	Description
RegisteredShuffleCount	master and worker	The value means count of registered shuffle.
RunningApplicationCount	master and worker	The value means count of running applications.
ActiveShuffleSize	master and worker	The value means the active shuffle size for workers or a worker including master replica and slave replica.
ActiveShuffleFileCount	master and worker	The value means the active shuffle file count for workers or a worker including master replica and slave replica.
diskFileCount	master and worker	The count of disk files consumption by each user.
diskBytesWritten	master and worker	The amount of disk files consumption by each user.
hdfsFileCount	master and worker	The count of hdfs files consumption by each user.
hdfsBytesWritten	master and worker	The amount of hdfs files consumption by each user.
WorkerCount	master	The count of active workers.
LostWorkers	master	The count of workers in lost list.
ExcludedWorkerCount	master	The count of workers in excluded list.
IsActiveMaster	master	Whether the current master is active.
PartitionSize	master	The estimated partition size of last 20 flush window whose length is 15 seconds by defaults.
OfferSlotsTime	master	The time of offer slots.
OpenStreamTime	worker	OpenStream means read a shuffle file and send client about chunks size and stream index.
FetchChunkTime	worker	FetchChunk means read a chunk from a shuffle file and send to client.
ActiveChunkStreamCount	worker	The active stream count for reduce partition reading streams in current worker.
OpenStreamSuccessCount	worker	The count of opening stream succeed in current worker.
OpenStreamFailCount	worker	The count of opening stream failed in current worker.
FetchChunkSuccessCount	worker	The count of fetching chunk succeed in current worker.
FetchChunkFailCount	worker	The count of fetching chunk failed in current worker.
PrimaryPushDataTime	worker	PrimaryPushData means handle PushData of primary partition location.
ReplicaPushDataTime	worker	ReplicaPushData means handle PushData of replica partition location.
WriteDataHardSplitCount	worker	The count of writing PushData or PushMergedData to HARD_SPLIT partition in current worker.
WriteDataSuccessCount	worker	The count of writing PushData or PushMergedData succeed in current worker.
WriteDataFailCount	worker	The count of writing PushData or PushMergedData failed in current worker.
ReplicateDataFailCount	worker	The count of replicating PushData or PushMergedData failed in current worker.
ReplicateDataWriteFailCount	worker	The count of replicating PushData or PushMergedData failed caused by write failure in peer worker.
ReplicateDataCreateConnectionFailCount	worker	The count of replicating PushData or PushMergedData failed caused by creating connection failed in peer worker.
ReplicateDataConnectionExceptionCount	worker	The count of replicating PushData or PushMergedData failed caused by connection exception in peer worker.
ReplicateDataFailNonCriticalCauseCount	worker	The count of replicating PushData or PushMergedData failed caused by non-critical exception in peer worker.
ReplicateDataTimeoutCount	worker	The count of replicating PushData or PushMergedData failed caused by push timeout in peer worker.
PushDataHandshakeFailCount	worker	The count of PushDataHandshake failed in current worker.
RegionStartFailCount	worker	The count of RegionStart failed in current worker.
RegionFinishFailCount	worker	The count of RegionFinish failed in current worker.
PrimaryPushDataHandshakeTime	worker	PrimaryPushDataHandshake means handle PushData of primary partition location.
ReplicaPushDataHandshakeTime	worker	ReplicaPushDataHandshake means handle PushData of replica partition location.
PrimaryRegionStartTime	worker	PrimaryRegionStart means handle RegionStart of primary partition location.
ReplicaRegionStartTime	worker	ReplicaRegionStart means handle RegionStart of replica partition location.
PrimaryRegionFinishTime	worker	PrimaryRegionFinish means handle RegionFinish of primary partition location.
ReplicaRegionFinishTime	worker	ReplicaRegionFinish means handle RegionFinish of replica partition location.
PausePushDataTime	worker	PausePushData means stop receiving data from client.
PausePushDataAndReplicateTime	worker	PausePushDataAndReplicate means stop receiving data from client and other workers.
PausePushData	worker	The count of stopping receiving data from client in current worker.
PausePushDataAndReplicate	worker	The count of stopping receiving data from client and other workers in current worker.
TakeBufferTime	worker	TakeBuffer means get a disk buffer from disk flusher.
FlushDataTime	worker	FlushData means flush a disk buffer to disk.
CommitFilesTime	worker	CommitFiles means flush and close a shuffle partition file.
SlotsAllocated	worker	Slots allocated in last hour
ReserveSlotsTime	worker	ReserveSlots means acquire a disk buffer and record partition location.
ActiveConnectionCount	worker	The value means count of active network connection.
NettyMemory	worker	The value measures all kinds of transport memory used by netty.
SortTime	worker	SortTime measures the time used by sorting a shuffle file.
SortMemory	worker	SortMemory means total reserved memory for sorting shuffle files .
SortingFiles	worker	This value means the count of sorting shuffle files.
SortedFiles	worker	This value means the count of sorted shuffle files.
SortedFileSize	worker	This value means the count of sorted shuffle files 's total size.
DiskBuffer	worker	Disk buffers are part of netty used memory, means data need to write to disk but haven't been written to disk.
BufferStreamReadBuffer	worker	This value means memory used by credit stream read buffer.
ReadBufferDispatcherRequestsLength	worker	This value means the queue size of read buffer allocation requests.
ReadBufferAllocatedCount	worker	This value means count of allocated read buffer.
ActiveCreditStreamCount	worker	This value means active count of stream for map partition reading streams.
ActiveMapPartitionCount	worker	This value means count of active map partition reading streams.
DeviceOSFreeBytes	worker	This value means actual usable space of OS for device monitor.
DeviceOSTotalBytes	worker	This value means total usable space of OS for device monitor.
DeviceCelebornFreeBytes	worker	This value means actual usable space of Celeborn for device monitor.
DeviceCelebornTotalBytes	worker	This value means configured usable space of Celeborn for device monitor.
PotentialConsumeSpeed	worker	This value means speed of potential consumption for congestion control.
UserProduceSpeed	worker	This value means speed of user production for congestion control.
WorkerConsumeSpeed	worker	This value means speed of worker consumption for congestion control.
jvm_gc_count	JVM	The GC count of each garbage collector.
jvm_gc_time	JVM	The GC cost time of each garbage collector.
jvm_memory_heap_init	JVM	The amount of heap init memory.
jvm_memory_heap_max	JVM	The amount of heap max memory.
jvm_memory_heap_used	JVM	The amount of heap used memory.
jvm_memory_heap_committed	JVM	The amount of heap committed memory.
jvm_memory_heap_usage	JVM	The percentage of heap memory usage.
jvm_memory_non_heap_init	JVM	The amount of non-heap init memory.
jvm_memory_non_heap_max	JVM	The amount of non-heap max memory.
jvm_memory_non_heap_used	JVM	The amount of non-heap uesd memory.
jvm_memory_non_heap_committed	JVM	The amount of non-heap committed memory.
jvm_memory_non_heap_usage	JVM	The percentage of non-heap memory usage.
jvm_memory_pools_init	JVM	The amount of each memory pool's init memory.
jvm_memory_pools_max	JVM	The amount of each memory pool's max memory.
jvm_memory_pools_used	JVM	The amount of each memory pool's used memory.
jvm_memory_pools_committed	JVM	The amount of each memory pool's committed memory.
jvm_memory_pools_used_after_gc	JVM	The amount of each memory pool's used memory after GC.
jvm_memory_pools_usage	JVM	The percentage of each memory pool's memory usage.
jvm_memory_total_init	JVM	The amount of total init memory.
jvm_memory_total_max	JVM	The amount of total max memory.
jvm_memory_total_used	JVM	The amount of total used memory.
jvm_memory_total_committed	JVM	The amount of each memory pool's committed memory.
jvm_direct_capacity	JVM	An estimate of the total capacity of the buffers in this pool
jvm_direct_count	JVM	An estimate of the number of buffers in the pool
jvm_direct_used	JVM	An estimate of the memory that JVM is using for this buffer pool
jvm_mapped_capacity	JVM	An estimate of the total capacity of the buffers in this pool
jvm_mapped_count	JVM	An estimate of the number of buffers in the pool
jvm_mapped_used	JVM	An estimate of the memory that JVM is using for this buffer pool
jvm_thread_count	JVM	The current number of threads.
jvm_thread_daemon_count	JVM	The current number of daemon threads.
jvm_thread_blocked_count	JVM	The current number of threads having blocked state.
jvm_thread_deadlock_count	JVM	The current number of threads having deadlock state.
jvm_thread_new_count	JVM	The current number of threads having new state.
jvm_thread_runnable_count	JVM	The current number of threads having runnable state.
jvm_thread_terminated_count	JVM	The current number of threads having terminated state.
jvm_thread_timed_waiting_count	JVM	The current number of threads having timed_waiting state.
jvm_thread_waiting_count	JVM	The current number of threads having waiting state.
jvm_classloader_loaded	JVM	The total number of classes loaded since the start of the JVM.
jvm_classloader_unloaded	JVM	The total number of classes unloaded since the start of the JVM.
JVMCPUTime	system	The JVM costs cpu time.
AvailableProcessors	system	The amount of system available processors.
LastMinuteSystemLoad	system	The last minute load of system.
active_thread_count	JVM	The approximate number of threads that are actively executing tasks.
pending_task_count	JVM	The pending task not executed in block queue.
pool_size	JVM	The current number of threads in the pool.
core_pool_size	JVM	The core number of threads.
maximum_pool_size	JVM	The maximum allowed number of threads.
largest_pool_size	JVM	The largest number of threads that have ever simultaneously been in the pool.
is_terminating	JVM	If this executor is in the process of terminating after shutdown() or shutdownNow() but has not completely terminated.
is_terminated	JVM	If this executor is in the process of terminated after shutdown() or shutdownNow() and has completely terminated.
is_shutdown	JVM	If this executor is shutdown.
thread_count	JVM	The thread count of current thread group.
thread_is_terminated_count	JVM	The terminated thread count of current thread group.
thread_is_shutdown_count	JVM	The shutdown thread count of current thread group.

Implementation

Celeborn master metrics : org/apache/celeborn/service/deploy/master/MasterSource.scala.

Celeborn worker metrics : org/apache/celeborn/service/deploy/worker/WorkerSource.scala.

Other common metrics are implemented in org.apache.celeborn.common.metrics.source package.

Dashboard Snapshots

The dashboard Celeborn-dashboard was generated by Grafana of version 10.0.3.

Here are some snapshots:

26 KiB Raw Blame History