celeborn/docs/configuration/metrics.md
Cheng Pan 1753556565
[CELEBORN-713] Local network binding support IP or FQDN
### What changes were proposed in this pull request?

This PR aims to make network local address binding support both IP and FQDN strategy.

Additional, it refactors the `ShuffleClientImpl#genAddressPair`, from `${hostAndPort}-${hostAndPort}` to `Pair<String, String>`, which works properly when using IP but may not on FQDN because FQDN may contain `-`

### Why are the changes needed?

Currently, when the bind hostname is not set explicitly, Celeborn will find the first non-loopback address and always uses the IP to bind, this is not suitable for K8s cases, as the STS has a stable FQDN but Pod IP will be changed once Pod restarting.

For `ShuffleClientImpl#genAddressPair`, it must be changed otherwise may cause

```
java.lang.RuntimeException: org.apache.spark.SparkException: Job aborted due to stage failure: Task 11657 in stage 0.0 failed 4 times, most recent failure: Lost task 11657.3 in stage 0.0 (TID 12747) (10.153.253.198 executor 157): java.lang.ArrayIndexOutOfBoundsException: 1
	at org.apache.celeborn.client.ShuffleClientImpl.doPushMergedData(ShuffleClientImpl.java:874)
	at org.apache.celeborn.client.ShuffleClientImpl.pushOrMergeData(ShuffleClientImpl.java:735)
	at org.apache.celeborn.client.ShuffleClientImpl.mergeData(ShuffleClientImpl.java:827)
	at org.apache.spark.shuffle.celeborn.SortBasedPusher.pushData(SortBasedPusher.java:140)
	at org.apache.spark.shuffle.celeborn.SortBasedPusher.insertRecord(SortBasedPusher.java:192)
	at org.apache.spark.shuffle.celeborn.SortBasedShuffleWriter.fastWrite0(SortBasedShuffleWriter.java:192)
	at org.apache.spark.shuffle.celeborn.SortBasedShuffleWriter.write(SortBasedShuffleWriter.java:145)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
	at org.apache.spark.scheduler.Task.run(Task.scala:136)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1508)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
```

### Does this PR introduce _any_ user-facing change?

Yes, a new configuration `celeborn.network.bind.preferIpAddress` is introduced, and the default value is `true` to preserve the existing behavior.

### How was this patch tested?

Manually testing with `celeborn.network.bind.preferIpAddress=false`
```
Server:		10.178.96.64
Address:	10.178.96.64#53

Name:	celeborn-master-0.celeborn-master-svc.spark.svc.cluster.local
Address: 10.153.143.252

Server:		10.178.96.64
Address:	10.178.96.64#53

Name:	celeborn-master-1.celeborn-master-svc.spark.svc.cluster.local
Address: 10.153.173.94

Server:		10.178.96.64
Address:	10.178.96.64#53

Name:	celeborn-master-2.celeborn-master-svc.spark.svc.cluster.local
Address: 10.153.149.42

starting org.apache.celeborn.service.deploy.worker.Worker, logging to /opt/celeborn/logs/celeborn--org.apache.celeborn.service.deploy.worker.Worker-1-celeborn-worker-4.out
2023-06-25 23:49:52 [INFO] [main] org.apache.celeborn.common.rpc.netty.Dispatcher#51 - Dispatcher numThreads: 4
2023-06-25 23:49:52 [INFO] [main] org.apache.celeborn.common.network.client.TransportClientFactory#91 - mode NIO threads 64
2023-06-25 23:49:52 [INFO] [main] org.apache.celeborn.common.rpc.netty.NettyRpcEnvFactory#51 - Starting RPC Server [WorkerSys] on celeborn-worker-4.celeborn-worker-svc.spark.svc.cluster.local:0 with advisor endpoint celeborn-worker-4.celeborn-worker-svc.spark.svc.cluster.local:0
2023-06-25 23:49:52 [INFO] [main] org.apache.celeborn.common.util.Utils#51 - Successfully started service 'WorkerSys' on port 38303.
```

Closes #1622 from pan3793/CELEBORN-713.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-27 09:42:11 +08:00

2.6 KiB

license
Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to You under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at https://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Key Default Description Since
celeborn.metrics.app.topDiskUsage.count 50 Size for top items about top disk usage applications list. 0.2.0
celeborn.metrics.app.topDiskUsage.interval 10min Time length for a window about top disk usage application list. 0.2.0
celeborn.metrics.app.topDiskUsage.windowSize 24 Window size about top disk usage application list. 0.2.0
celeborn.metrics.capacity 4096 The maximum number of metrics which a source can use to generate output strings. 0.2.0
celeborn.metrics.collectPerfCritical.enabled false It controls whether to collect metrics which may affect performance. When enable, Celeborn collects them. 0.2.0
celeborn.metrics.conf <undefined> Custom metrics configuration file path. Default use metrics.properties in classpath. 0.3.0
celeborn.metrics.enabled true When true, enable metrics system. 0.2.0
celeborn.metrics.extraLabels If default metric labels are not enough, extra metric labels can be customized. Labels' pattern is: <label1_key>=<label1_value>[,<label2_key>=<label2_value>]*; e.g. env=prod,version=1 0.3.0
celeborn.metrics.master.prometheus.host <localhost> Master's Prometheus host. 0.3.0
celeborn.metrics.master.prometheus.port 9098 Master's Prometheus port. 0.3.0
celeborn.metrics.sample.rate 1.0 It controls if Celeborn collect timer metrics for some operations. Its value should be in [0.0, 1.0]. 0.2.0
celeborn.metrics.timer.slidingWindow.size 4096 The sliding window size of timer metric. 0.2.0
celeborn.metrics.worker.prometheus.host <localhost> Worker's Prometheus host. 0.3.0
celeborn.metrics.worker.prometheus.port 9096 Worker's Prometheus port. 0.3.0