### What changes were proposed in this pull request?

In MapPartition, data is split into regions.

1. Unlike ReducePartition, whose partition split can occur while data is being pushed, MapPartition must keep its data ordered, so a partition split can only be done when sending PushDataHandShake or RegionStart messages (as shown in the following image). That is, a partition split can only appear at the beginning of a region, never inside one.

> Notice: the client may consider a HandShake or RegionStart push to have failed even though the worker received the message normally. After the client revives successfully, it no longer pushes any messages to the old partition, so the worker holding the old partition creates an empty file. After committing files, the worker returns empty commit IDs. That is, empty files are filtered out after committing files, and ReduceTask will not read any empty files.

2. PushData/RegionFinish ignore the following cases:
   - DiskFull
   - ExceedPartitionSplitThreshold
   - Worker ShuttingDown

   If one of these three conditions appears, PushData and RegionFinish can still proceed as normal. Workers should handle the ShuttingDown case and try their best to wait for all regions to finish before shutting down. If PushData or RegionFinish fails, for example because of a network timeout, the MapTask fails and another MapTask attempt is started.

3. How does shuffle read support partition split? ReduceTask should get the split partitions in order and open the streams in partition epoch order.

### Why are the changes needed?

PartitionSplit is currently not supported by MapPartition. There is still a risk that a partition file is too large to be stored on the worker disk. To avoid this risk, this PR introduces partition split in shuffle read and shuffle write.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

UT and manual TPCDS test.

Closes #1550 from FMX/CELEBORN-627.

Lead-authored-by: zhongqiang.czq <zhongqiang.czq@alibaba-inc.com>
Co-authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Co-authored-by: Ethan Feng <ethanfeng@apache.org>
Signed-off-by: zhongqiang.czq <zhongqiang.czq@alibaba-inc.com>
Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to You under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Quick Start
This document is a quick start guide for running Apache Spark or Apache Flink with Apache Celeborn (Incubating).
Download Celeborn
Download the latest Celeborn binary from the Downloading Page.
Decompress the binary and set $CELEBORN_HOME
tar -C <DST_DIR> -zxvf apache-celeborn-<VERSION>-bin.tgz
export CELEBORN_HOME=<Decompressed path>
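Every later step relies on `$CELEBORN_HOME` pointing at the unpacked distribution, so a quick sanity check can save some confusion. The following is a minimal sketch, not part of Celeborn: `check_celeborn_home` is a hypothetical helper that only looks for the `sbin/` and `conf/` directories this guide uses.

```shell
# Hypothetical helper (not shipped with Celeborn): check that a directory
# contains the sbin/ and conf/ folders referenced later in this guide.
check_celeborn_home() {
  dir="$1"
  if [ -z "$dir" ]; then
    echo "CELEBORN_HOME is not set"
    return 1
  fi
  if [ ! -d "$dir/sbin" ] || [ ! -d "$dir/conf" ]; then
    echo "$dir does not contain sbin/ and conf/"
    return 1
  fi
  echo "CELEBORN_HOME looks valid: $dir"
}
```

After exporting the variable, it could be called as `check_celeborn_home "$CELEBORN_HOME"`.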
Configure Logging and Storage
Configure Logging
cd $CELEBORN_HOME/conf
cp log4j2.xml.template log4j2.xml
Configure Storage
Configure the directory used to store shuffle data, for example $CELEBORN_HOME/shuffle:
cd $CELEBORN_HOME/conf
echo "celeborn.worker.storage.dirs=$CELEBORN_HOME/shuffle" > celeborn-defaults.conf
Start Celeborn Service
Start Master
cd $CELEBORN_HOME
./sbin/start-master.sh
You should see Master's ip:port in the log:
INFO [main] NettyRpcEnvFactory: Starting RPC Server [MasterSys] on 192.168.2.109:9097 with advisor endpoint 192.168.2.109:9097
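The Master address from this log line is what the Worker needs in the next step. As a rough sketch, the address could be pulled out with `sed`. The helper name `extract_master_endpoint` is hypothetical, and it reads log lines from standard input because the location of the Master log file is not stated in this guide.

```shell
# Hypothetical helper: print the ip:port that follows
# "Starting RPC Server [MasterSys] on" in log lines read from stdin.
extract_master_endpoint() {
  sed -n 's/.*Starting RPC Server \[MasterSys] on \([0-9.]*:[0-9]*\).*/\1/p' | head -n 1
}

# Demo with the sample log line from this guide.
sample='INFO [main] NettyRpcEnvFactory: Starting RPC Server [MasterSys] on 192.168.2.109:9097 with advisor endpoint 192.168.2.109:9097'
endpoint=$(printf '%s\n' "$sample" | extract_master_endpoint)
echo "celeborn://$endpoint"
```

With a real log, the function would be fed the Master log file instead of the sample string, and the printed URL has the form expected by `start-worker.sh`.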
Start Worker
Use the Master's IP and Port to start Worker:
cd $CELEBORN_HOME
./sbin/start-worker.sh celeborn://<Master IP>:<Master Port>
You should see the following message in Worker's log:
INFO [main] MasterClient: connect to master 192.168.2.109:9097.
INFO [main] Worker: Register worker successfully.
INFO [main] Worker: Worker started.
And also the following message in Master's log:
INFO [dispatcher-event-loop-9] Master: Registered worker
Host: 192.168.2.109
RpcPort: 57806
PushPort: 57807
FetchPort: 57809
ReplicatePort: 57808
SlotsUsed: 0
LastHeartbeat: 0
HeartbeatElapsedSeconds: xxx
Disks:
DiskInfo0: xxx
UserResourceConsumption: empty
WorkerRef: null
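The Worker-side messages above can also be checked from a script rather than by eye. This is a small sketch under the assumption that you know where the Worker log is written (the path is not given in this guide); `worker_registered` is a hypothetical helper, and the demo uses a temporary file filled with the sample lines.

```shell
# Hypothetical helper: succeed only if the given log file contains both
# Worker messages quoted in this section.
worker_registered() {
  log_file="$1"
  grep -q 'Worker: Register worker successfully' "$log_file" &&
    grep -q 'Worker: Worker started' "$log_file"
}

# Self-contained demo using the sample lines from this guide.
demo_log=$(mktemp)
cat > "$demo_log" <<'EOF'
INFO [main] MasterClient: connect to master 192.168.2.109:9097.
INFO [main] Worker: Register worker successfully.
INFO [main] Worker: Worker started.
EOF

if worker_registered "$demo_log"; then
  echo "worker is registered"
else
  echo "worker is not registered yet"
fi
rm -f "$demo_log"
```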
Start Spark with Celeborn
Copy Celeborn Client to Spark's jars
The Celeborn release binary contains clients for Spark 2.x and Spark 3.x. Copy the corresponding client jar into Spark's
jars/ directory:
cp $CELEBORN_HOME/spark/<Celeborn Client Jar> $SPARK_HOME/jars/
Start spark-shell
Set spark.shuffle.manager to Celeborn's ShuffleManager, and turn off spark.shuffle.service.enabled:
cd $SPARK_HOME
./bin/spark-shell \
--conf spark.shuffle.manager=org.apache.spark.shuffle.celeborn.SparkShuffleManager \
--conf spark.shuffle.service.enabled=false
Then run the following test case:
spark.sparkContext
.parallelize(1 to 10, 10)
.flatMap(_ => (1 to 100).iterator.map(num => num))
.repartition(10)
.count
During the Spark Job, you should see the following message in Celeborn Master's log:
Master: Offer slots successfully for 10 reducers of local-1690000152711-0 on 1 workers.
And the following message in Celeborn Worker's log:
INFO [dispatcher-event-loop-9] Controller: Reserved 10 primary location and 0 replica location for local-1690000152711-0
INFO [dispatcher-event-loop-8] Controller: Start commitFiles for local-1690000152711-0
INFO [async-reply] Controller: CommitFiles for local-1690000152711-0 success with 10 committed primary partitions, 0 empty primary partitions, 0 failed primary partitions, 0 committed replica partitions, 0 empty replica partitions, 0 failed replica partitions.
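To confirm that the commit finished without failures, the summary line can be parsed instead of read manually. The sketch below assumes the log line keeps the wording shown above; `failed_primary_partitions` is a hypothetical helper that reads Worker log lines from standard input.

```shell
# Hypothetical helper: print the number of failed primary partitions reported
# by the last "CommitFiles ... success" summary line read from stdin.
failed_primary_partitions() {
  sed -n 's/.*CommitFiles for .* success with .* \([0-9][0-9]*\) failed primary partitions.*/\1/p' | tail -n 1
}

# Demo with the sample summary line from this guide.
summary='INFO [async-reply] Controller: CommitFiles for local-1690000152711-0 success with 10 committed primary partitions, 0 empty primary partitions, 0 failed primary partitions, 0 committed replica partitions, 0 empty replica partitions, 0 failed replica partitions.'
failed=$(printf '%s\n' "$summary" | failed_primary_partitions)
echo "failed primary partitions: $failed"
```

A non-zero value, or no output at all, would be a reason to look at the Worker log more closely.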
Start Flink with Celeborn
Copy Celeborn Client to Flink's lib
The Celeborn release binary contains clients for Flink 1.14.x, Flink 1.15.x and Flink 1.17.x. Copy the corresponding client jar into Flink's
lib/ directory:
cp $CELEBORN_HOME/flink/<Celeborn Client Jar> $FLINK_HOME/lib/
Add Celeborn configuration to Flink's conf
Set shuffle-service-factory.class to Celeborn's ShuffleServiceFactory in the Flink configuration file:
cd $FLINK_HOME
vi conf/flink-conf.yaml
shuffle-service-factory.class: org.apache.celeborn.plugin.flink.RemoteShuffleServiceFactory
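If you prefer not to edit the file interactively, the same setting can be appended from a script. This is only a sketch: `set_flink_shuffle_factory` is a hypothetical helper, and it assumes the key is not already set to a different value elsewhere in the file, since it appends rather than rewrites.

```shell
# Hypothetical helper: append the shuffle-service-factory.class entry to a
# Flink configuration file unless a line with that key already exists.
set_flink_shuffle_factory() {
  conf_file="$1"
  key='shuffle-service-factory.class'
  value='org.apache.celeborn.plugin.flink.RemoteShuffleServiceFactory'
  if grep -q "^${key}:" "$conf_file" 2>/dev/null; then
    echo "$key already set in $conf_file"
  else
    printf '%s: %s\n' "$key" "$value" >> "$conf_file"
    echo "added $key to $conf_file"
  fi
}
```

For the setup in this guide it could be called as `set_flink_shuffle_factory "$FLINK_HOME/conf/flink-conf.yaml"`. Running it twice leaves a single entry in the file.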
Then deploy the example word count job to the running cluster:
cd $FLINK_HOME
./bin/flink run -Dexecution.runtime-mode=BATCH examples/streaming/WordCount.jar
During the Flink Job, you should see the following message in Celeborn Master's log:
Master: Offer slots successfully for 1 reducers of local-1690000152711-0 on 1 workers.
And the following message in Celeborn Worker's log:
INFO [dispatcher-event-loop-4] Controller: Reserved 1 primary location and 0 replica location for local-1690000152711-0
INFO [dispatcher-event-loop-3] Controller: Start commitFiles for local-1690000152711-0
INFO [async-reply] Controller: CommitFiles for local-1690000152711-0 success with 1 committed primary partitions, 0 empty primary partitions, 0 failed primary partitions, 0 committed replica partitions, 0 empty replica partitions, 0 failed replica partitions.