celeborn/docs/deploy_on_k8s.md
Cheng Pan 1753556565
[CELEBORN-713] Local network binding support IP or FQDN
### What changes were proposed in this pull request?

This PR aims to make network local address binding support both IP and FQDN strategy.

Additional, it refactors the `ShuffleClientImpl#genAddressPair`, from `${hostAndPort}-${hostAndPort}` to `Pair<String, String>`, which works properly when using IP but may not on FQDN because FQDN may contain `-`

### Why are the changes needed?

Currently, when the bind hostname is not set explicitly, Celeborn will find the first non-loopback address and always uses the IP to bind, this is not suitable for K8s cases, as the STS has a stable FQDN but Pod IP will be changed once Pod restarting.

For `ShuffleClientImpl#genAddressPair`, it must be changed otherwise may cause

```
java.lang.RuntimeException: org.apache.spark.SparkException: Job aborted due to stage failure: Task 11657 in stage 0.0 failed 4 times, most recent failure: Lost task 11657.3 in stage 0.0 (TID 12747) (10.153.253.198 executor 157): java.lang.ArrayIndexOutOfBoundsException: 1
	at org.apache.celeborn.client.ShuffleClientImpl.doPushMergedData(ShuffleClientImpl.java:874)
	at org.apache.celeborn.client.ShuffleClientImpl.pushOrMergeData(ShuffleClientImpl.java:735)
	at org.apache.celeborn.client.ShuffleClientImpl.mergeData(ShuffleClientImpl.java:827)
	at org.apache.spark.shuffle.celeborn.SortBasedPusher.pushData(SortBasedPusher.java:140)
	at org.apache.spark.shuffle.celeborn.SortBasedPusher.insertRecord(SortBasedPusher.java:192)
	at org.apache.spark.shuffle.celeborn.SortBasedShuffleWriter.fastWrite0(SortBasedShuffleWriter.java:192)
	at org.apache.spark.shuffle.celeborn.SortBasedShuffleWriter.write(SortBasedShuffleWriter.java:145)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
	at org.apache.spark.scheduler.Task.run(Task.scala:136)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1508)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
```

### Does this PR introduce _any_ user-facing change?

Yes, a new configuration `celeborn.network.bind.preferIpAddress` is introduced, and the default value is `true` to preserve the existing behavior.

### How was this patch tested?

Manually testing with `celeborn.network.bind.preferIpAddress=false`
```
Server:		10.178.96.64
Address:	10.178.96.64#53

Name:	celeborn-master-0.celeborn-master-svc.spark.svc.cluster.local
Address: 10.153.143.252

Server:		10.178.96.64
Address:	10.178.96.64#53

Name:	celeborn-master-1.celeborn-master-svc.spark.svc.cluster.local
Address: 10.153.173.94

Server:		10.178.96.64
Address:	10.178.96.64#53

Name:	celeborn-master-2.celeborn-master-svc.spark.svc.cluster.local
Address: 10.153.149.42

starting org.apache.celeborn.service.deploy.worker.Worker, logging to /opt/celeborn/logs/celeborn--org.apache.celeborn.service.deploy.worker.Worker-1-celeborn-worker-4.out
2023-06-25 23:49:52 [INFO] [main] org.apache.celeborn.common.rpc.netty.Dispatcher#51 - Dispatcher numThreads: 4
2023-06-25 23:49:52 [INFO] [main] org.apache.celeborn.common.network.client.TransportClientFactory#91 - mode NIO threads 64
2023-06-25 23:49:52 [INFO] [main] org.apache.celeborn.common.rpc.netty.NettyRpcEnvFactory#51 - Starting RPC Server [WorkerSys] on celeborn-worker-4.celeborn-worker-svc.spark.svc.cluster.local:0 with advisor endpoint celeborn-worker-4.celeborn-worker-svc.spark.svc.cluster.local:0
2023-06-25 23:49:52 [INFO] [main] org.apache.celeborn.common.util.Utils#51 - Successfully started service 'WorkerSys' on port 38303.
```

Closes #1622 from pan3793/CELEBORN-713.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2023-06-27 09:42:11 +08:00

6.1 KiB

license
Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to You under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Deploy Celeborn on Kubernetes

Celeborn currently supports rapid deployment by using helm.

Before Deploy

  1. You should have a Running Kubernetes Cluster.
  2. You should understand simple Kubernetes deploy related, e.g. Kubernetes Resources.
  3. You have enough permissions to create resources.
  4. Installed Helm.

Deploy

1. Get Celeborn Binary Package

You can find released version of Celeborn on https://celeborn.apache.org/download/.

Of course, you can build binary package from master branch or your own branch by using ./build/make-distribution.sh in source code.

Anyway, you should unzip and into binary package.

2. Modify Celeborn Configurations

Notice: Celeborn Charts Template Files is in the experimental instability stage, the subsequent optimization will be adjusted.

The configuration in ./charts/celeborn/values.yaml you should focus on modifying is:

  • image repository - Get images from which repository
  • image tag - Which version of image to use
  • masterReplicas - Number of celeborn master replicas
  • workerReplicas - Number of celeborn worker replicas
  • volumes - How and where to mount volumes (For more information, Volumes)

[Optional] Build Celeborn Docker Image

Maybe you want to make your own celeborn docker image, you can use docker build . -f docker/Dockerfile in Celeborn Binary.

3. Helm Install Celeborn Charts

More details in Helm Install

cd ./charts/celeborn

helm install celeborn -n <namespace> .

4. Check Celeborn

After the above operation, you should be able to find the corresponding Celeborn Master/Worker by kubectl get pods -n <namespace>

Etc.

NAME                READY     STATUS             RESTARTS   AGE
celeborn-master-0   1/1       Running            0          1m
...
celeborn-worker-0   1/1       Running            0          1m
...

Given that Celeborn Master/Worker Pod takes time to start, you can see the following phenomenon:

** server can't find celeborn-master-0.celeborn-master-svc.default.svc.cluster.local: NXDOMAIN

waiting for master
Server:         172.17.0.10
Address:        172.17.0.10#53

...

Name:   celeborn-master-0.celeborn-master-svc.default.svc.cluster.local
Address: 10.225.139.80

Server:         172.17.0.10
Address:        172.17.0.10#53

starting org.apache.celeborn.service.deploy.master.Master, logging to /opt/celeborn/logs/celeborn--org.apache.celeborn.service.deploy.master.Master-1-celeborn-master-0.out

...

23/03/23 14:10:56,081 INFO [main] RaftServer: 0: start RPC server
23/03/23 14:10:56,132 INFO [nioEventLoopGroup-2-1] LoggingHandler: [id: 0x83032bf1] REGISTERED
23/03/23 14:10:56,132 INFO [nioEventLoopGroup-2-1] LoggingHandler: [id: 0x83032bf1] BIND: 0.0.0.0/0.0.0.0:9872
23/03/23 14:10:56,134 INFO [nioEventLoopGroup-2-1] LoggingHandler: [id: 0x83032bf1, L:/0:0:0:0:0:0:0:0:9872] ACTIVE
23/03/23 14:10:56,135 INFO [JvmPauseMonitor0] JvmPauseMonitor: JvmPauseMonitor-0: Started
23/03/23 14:10:56,208 INFO [main] Master: Metrics system enabled.
23/03/23 14:10:56,216 INFO [main] HttpServer: master: HttpServer started on port 9098.
23/03/23 14:10:56,216 INFO [main] Master: Master started.

5. Access Celeborn Service

The Celeborn Master/Worker nodes deployed via official Helm charts run as StatefulSet, it can be accessed through Pod IP or Stable Network ID (DNS name), in above case, the Master/Worker nodes can be accessed through:

celeborn-master-0.celeborn-master-svc.default.svc.cluster.local`
...
celeborn-worker-0.celeborn-worker-svc.default.svc.cluster.local`
...

After a restart, the StatefulSet Pod IP changes but the DNS name remains, this is important for rolling upgrade.

When bind address is not set explicitly, Celeborn worker is going to find the first non-loopback address to bind. By default, it use IP address both for address binding and registering, that causes the Master and Client use the IP address to access the Worker, it's problematic after Worker restart as explained above, especially when Graceful Shutdown is enabled.

You may want to set celeborn.network.bind.preferIpAddress=false to address such issue. Note that, depends on your Kubernetes network infrastructure, this may cause pressure on DNS service or other network issues compared with using IP address directly.

6. Build Celeborn Client

Here, without going into detail on how to configure spark/flink to find celeborn master/worker, mention the key configuration:

spark.celeborn.master.endpoints: celeborn-master-0.celeborn-master-svc.<namespace>:9097,celeborn-master-1.celeborn-master-svc.<namespace>:9097,celeborn-master-2.celeborn-master-svc.<namespace>:9097

You can find why config endpoints such way in Kubernetes DNS for Service And Pods

Notice: You should ensure that Spark/Flink can find the Celeborn Master/Worker via IP or the Kubernetes DNS mentioned above