Ethan Feng 59474c2f11 [INFRA]Update scripts and templates for new name. (#724 )		2022-10-09 14:56:06 +08:00
.github	[DOCS] Build website (#579 )	2022-09-10 00:45:13 +08:00
assets	[DOC] Replace RSS with Celeborn in docs (#715 )	2022-10-06 10:37:46 +08:00
bin	[INFRA]Update scripts and templates for new name. (#724 )	2022-10-09 14:56:06 +08:00
build	Build: Enhance build system (#349 )	2022-08-15 14:59:01 +08:00
client	[ISSUE-739][REFACTOR] Use object wrap pb message method (#740 )	2022-10-09 11:53:48 +08:00
client-spark	[ISSUE-588][FOLLOWUP] Keep same code as spark2 (#732 )	2022-10-08 21:05:19 +08:00
common	[ISSUE-739][REFACTOR] Use object wrap pb message method (#740 )	2022-10-09 11:53:48 +08:00
conf	[INFRA]Update scripts and templates for new name. (#724 )	2022-10-09 14:56:06 +08:00
dev	[INFRA] Rename modules w/ celeborn prefix (#723 )	2022-10-08 08:05:57 +08:00
docker	Rename helm template and values	2022-10-08 20:14:37 +08:00
docs	[DOC] Replace RSS with Celeborn in docs (#715 )	2022-10-06 10:37:46 +08:00
master	[ISSUE-739][REFACTOR] Use object wrap pb message method (#740 )	2022-10-09 11:53:48 +08:00
sbin	[INFRA]Update scripts and templates for new name. (#724 )	2022-10-09 14:56:06 +08:00
service	[INFRA] Rename modules w/ celeborn prefix (#723 )	2022-10-08 08:05:57 +08:00
tests/spark-it	[INFRA] Rename modules w/ celeborn prefix (#723 )	2022-10-08 08:05:57 +08:00
worker	[ISSUE-739][REFACTOR] Use object wrap pb message method (#740 )	2022-10-09 11:53:48 +08:00
.gitignore	[INFRA] Rename modules w/ celeborn prefix (#723 )	2022-10-08 08:05:57 +08:00
.rat-excludes	Enable Apache Rat and fix license header (#492 )	2022-08-31 23:53:33 +08:00
.scalafmt.conf	[REFACTOR] Change package name to org.apache.celeborn (#710 )	2022-10-02 18:10:29 +08:00
CONTRIBUTING.md	[REFACTOR] Change package name to org.apache.celeborn (#710 )	2022-10-02 18:10:29 +08:00
LICENSE	Initial commit	2021-12-10 16:57:16 +08:00
METRICS.md	[DOC] Replace RSS with Celeborn in docs (#715 )	2022-10-06 10:37:46 +08:00
mkdocs.yml	[DOCS] Build website (#579 )	2022-09-10 00:45:13 +08:00
pom.xml	[INFRA] Rename modules w/ celeborn prefix (#723 )	2022-10-08 08:05:57 +08:00
README.md	[INFRA]Update scripts and templates for new name. (#724 )	2022-10-09 14:56:06 +08:00
requirements.txt	[DOCS] Build website (#579 )	2022-09-10 00:45:13 +08:00

README.md

Apache Celeborn

Celeborn is dedicated to improving the efficiency and elasticity of different map-reduce engines. It provides an elastic and highly efficient management service for shuffle data.

Internals

Architecture

Celeborn has three primary components: Master, Worker, and Client. Masters manage all resources and sync shared state with each other based on Raft. Workers process read-write requests and merge data for each reducer. On the client side, LifecycleManager maintains the metadata of each shuffle and runs within the Spark driver.

Features

Disaggregated compute and storage.
Push-based shuffle write and merged shuffle read.
High availability and high fault tolerance.

Shuffle Process

Mappers lazily ask LifecycleManager to registerShuffle.
LifecycleManager requests slots from Master.
Workers reserve slots and create corresponding files.
Mappers get worker locations from LifecycleManager.
Mappers push data to specified workers.
Workers merge data and replicate it to their peers.
Workers flush to disk periodically.
When a mapper task finishes, it triggers a MapperEnd event.
When all mapper tasks are complete, workers commit files.
Reducers ask for file locations.
Reducers read shuffle data.

Load Balance

We introduce slots to achieve load balance. Celeborn distributes partitions evenly across workers by tracking slot usage. A slot is a logical concept that represents how many partitions can be allocated on a Celeborn worker. A worker's slot count is determined by (total usable disk size) / (average shuffle file size). The count decreases when a partition is allocated and increases when a partition is freed.
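As a worked example of the formula above, the sketch below computes a slot count with integer division. The helper name `slot_count` is hypothetical, and the 64 MiB average shuffle file size is an assumption: it is inferred from the sample Master log in this README, where /mnt/disk1 reports usableSpace 448284381184 and maxSlots 6679.

```shell
#!/usr/bin/env bash
# Hypothetical helper: slots = usable disk bytes / average shuffle file bytes,
# rounded down (shell arithmetic is integer division).
slot_count() {
  local usable_bytes=$1
  local avg_file_bytes=$2
  echo $(( usable_bytes / avg_file_bytes ))
}

# ASSUMPTION: average shuffle file size of 64 MiB (inferred, not documented here).
AVG_FILE_BYTES=$(( 64 * 1024 * 1024 ))

# usableSpace value taken from the sample Master log for /mnt/disk1.
slot_count 448284381184 "$AVG_FILE_BYTES"   # prints 6679
```

The same division reproduces the maxSlots values logged for the other three disks, which is why 64 MiB is a plausible guess for the average file size in that run.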

Build

Celeborn supports Spark 2.4/3.0/3.1/3.2/3.3 and has only been tested under Java 8.

Build for Spark

./dev/make-distribution.sh -Pspark-2.4/-Pspark-3.0/-Pspark-3.1/-Pspark-3.2/-Pspark-3.3

The package celeborn-${project.version}-bin.tgz will be generated.

Package Details

The build procedure creates a compressed package with the following layout.

    ├── RELEASE                         
    ├── bin                             
    ├── conf                            
    ├── master-jars                     
    ├── worker-jars                     
    ├── sbin                            
    └── spark          // Spark client jars

Compatibility

Celeborn server is compatible with all supported Spark versions. You can run different Spark versions against the same Celeborn server, regardless of which profile (-Pspark-2.4/3.0/3.1/3.2/3.3) the server was compiled with. However, the Celeborn client must match the Spark version. For example, if you are running Spark 2.4, you must compile the Celeborn client with -Pspark-2.4; if you are running Spark 3.2, you must compile the Celeborn client with -Pspark-3.2.
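To illustrate the client rule above, here is a minimal sketch that maps a full Spark version to the matching build profile and prints the build command. The helper name `spark_profile` is hypothetical; the profiles and the make-distribution.sh path are the ones listed in this README.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a full Spark version (e.g. 3.2.1) to the
# matching client build profile.
spark_profile() {
  local version=$1
  local major_minor
  # Keep only the first two dot-separated fields: 3.2.1 -> 3.2
  major_minor=$(echo "$version" | cut -d. -f1,2)
  case "$major_minor" in
    2.4|3.0|3.1|3.2|3.3) echo "-Pspark-$major_minor" ;;
    *) echo "unsupported Spark version: $version" >&2; return 1 ;;
  esac
}

# Print (not run) the build command for a cluster running Spark 3.2.1.
echo "./dev/make-distribution.sh $(spark_profile 3.2.1)"
```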

Usage

Celeborn supports HA mode deployment.

Deploy Celeborn

Unzip the package to $CELEBORN_HOME
Modify environment variables in $CELEBORN_HOME/conf/celeborn-env.sh

EXAMPLE:

#!/usr/bin/env bash
CELEBORN_MASTER_MEMORY=4g
CELEBORN_WORKER_MEMORY=2g
CELEBORN_WORKER_OFFHEAP_MEMORY=4g

Modify configurations in $CELEBORN_HOME/conf/rss-defaults.conf

EXAMPLE: single master cluster

rss.master.address master-host:port
rss.metrics.system.enabled true
rss.worker.flush.buffer.size 256k
rss.worker.flush.queue.capacity 4096
rss.worker.base.dirs /mnt/disk1/,/mnt/disk2
# If your hosts have disk raid or use lvm, set rss.device.monitor.enabled to false
rss.device.monitor.enabled false

EXAMPLE: HA cluster

rss.metrics.system.enabled true
rss.worker.flush.buffer.size 256k
rss.worker.flush.queue.capacity 4096
rss.worker.base.dirs /mnt/disk1/,/mnt/disk2
rss.master.port 9097
# If your hosts have disk raid or use lvm, set rss.device.monitor.enabled to false
rss.device.monitor.enabled false

rss.ha.enabled true
rss.ha.service.id dev-cluster
rss.ha.nodes.dev-cluster node1,node2,node3
rss.ha.address.dev-cluster.node1 host1
rss.ha.address.dev-cluster.node2 host2
rss.ha.address.dev-cluster.node3 host3
rss.ha.storage.dir /mnt/disk1/rss_ratis/
rss.ha.master.hosts host1,host2,host3
# If you want to customize HA port
rss.ha.port.dev-cluster.node1 9872
rss.ha.port.dev-cluster.node2 9872
rss.ha.port.dev-cluster.node3 9872
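The per-node HA entries follow a regular pattern, so they can be generated rather than typed by hand. The sketch below is one way to do that; `ha_config` is a hypothetical helper, the node names node1..nodeN simply follow the example above, and only keys that appear in that example are emitted.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print HA entries for a service id and a list of
# master hosts, naming the nodes node1..nodeN as in the example.
ha_config() {
  local service_id=$1
  shift
  local nodes=""
  local hosts=""
  local n=0
  local host
  # Build the comma-separated node and host lists.
  for host in "$@"; do
    n=$(( n + 1 ))
    nodes="${nodes:+$nodes,}node$n"
    hosts="${hosts:+$hosts,}$host"
  done
  echo "rss.ha.enabled true"
  echo "rss.ha.service.id $service_id"
  echo "rss.ha.nodes.$service_id $nodes"
  # Emit one address entry per node.
  n=0
  for host in "$@"; do
    n=$(( n + 1 ))
    echo "rss.ha.address.$service_id.node$n $host"
  done
  echo "rss.ha.master.hosts $hosts"
}

ha_config dev-cluster host1 host2 host3
```

Settings such as rss.ha.storage.dir and the optional rss.ha.port entries are left out of the sketch and still need to be added by hand.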

Copy Celeborn and configurations to all nodes
Start Celeborn master $CELEBORN_HOME/sbin/start-master.sh
Start Celeborn worker
For single master cluster: $CELEBORN_HOME/sbin/start-worker.sh rss://masterhost:port
For HA cluster: $CELEBORN_HOME/sbin/start-worker.sh
If Celeborn starts successfully, the Master's log should look like this:

22/10/08 19:29:11,805 INFO [main] Dispatcher: Dispatcher numThreads: 64
22/10/08 19:29:11,875 INFO [main] TransportClientFactory: mode NIO threads 64
22/10/08 19:29:12,057 INFO [main] Utils: Successfully started service 'MasterSys' on port 9097.
22/10/08 19:29:12,113 INFO [main] Master: Metrics system enabled.
22/10/08 19:29:12,125 INFO [main] HttpServer: master: HttpServer started on port 9098.
22/10/08 19:29:12,126 INFO [main] Master: Master started.
22/10/08 19:29:57,842 INFO [dispatcher-event-loop-19] Master: Registered worker
Host: 192.168.15.140
RpcPort: 37359
PushPort: 38303
FetchPort: 37569
ReplicatePort: 37093
SlotsUsed: 0()
LastHeartbeat: 0
Disks: {/mnt/disk1=DiskInfo(maxSlots: 6679, committed shuffles 0 shuffleAllocations: Map(), mountPoint: /mnt/disk1, usableSpace: 448284381184, avgFlushTime: 0, activeSlots: 0) status: HEALTHY dirs , /mnt/disk3=DiskInfo(maxSlots: 6716, committed shuffles 0 shuffleAllocations: Map(), mountPoint: /mnt/disk3, usableSpace: 450755608576, avgFlushTime: 0, activeSlots: 0) status: HEALTHY dirs , /mnt/disk2=DiskInfo(maxSlots: 6713, committed shuffles 0 shuffleAllocations: Map(), mountPoint: /mnt/disk2, usableSpace: 450532900864, avgFlushTime: 0, activeSlots: 0) status: HEALTHY dirs , /mnt/disk4=DiskInfo(maxSlots: 6712, committed shuffles 0 shuffleAllocations: Map(), mountPoint: /mnt/disk4, usableSpace: 450456805376, avgFlushTime: 0, activeSlots: 0) status: HEALTHY dirs }
WorkerRef: null
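A quick way to check for the lines shown above is to grep the Master log. This is a minimal sketch: `check_master_log` is a hypothetical helper, and the log file location is not stated in this README, so it is passed in as an argument. The demo runs against a temporary file holding two lines copied from the sample output.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the log contains the Master startup line,
# and report how many "Registered worker" entries it holds.
check_master_log() {
  local log_file=$1
  if ! grep -q 'Master: Master started\.' "$log_file"; then
    echo "master not started" >&2
    return 1
  fi
  local workers
  # grep -c exits non-zero on a count of 0, so ignore its status.
  workers=$(grep -c 'Master: Registered worker' "$log_file" || true)
  echo "master started, registered workers: $workers"
}

# Demo against two lines copied from the sample output.
sample=$(mktemp)
cat > "$sample" <<'EOF'
22/10/08 19:29:12,126 INFO [main] Master: Master started.
22/10/08 19:29:57,842 INFO [dispatcher-event-loop-19] Master: Registered worker
EOF
check_master_log "$sample"   # prints: master started, registered workers: 1
rm -f "$sample"
```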

Deploy Spark client

Copy $CELEBORN_HOME/spark/*.jar to $SPARK_HOME/jars/

Spark Configuration

To use Celeborn, add the following Spark configurations.

spark.shuffle.manager org.apache.spark.shuffle.celeborn.RssShuffleManager
# must use the Kryo serializer because the Java serializer does not support relocation
spark.serializer org.apache.spark.serializer.KryoSerializer

# if you are running an HA cluster, set spark.rss.master.address to any Celeborn master
spark.rss.master.address rss-master-host:rss-master-port
spark.shuffle.service.enabled false

# options: hash, sort
# The hash shuffle writer uses (partition count) * (rss.push.data.buffer.size) * (spark.executor.cores) memory.
# The sort shuffle writer uses less memory than the hash shuffle writer. If your shuffle partition count is large, try the sort shuffle writer.
spark.rss.shuffle.writer.mode hash

# we recommend setting spark.rss.push.data.replicate to true to enable server-side data replication
spark.rss.push.data.replicate true

# Support for Spark AQE has only been tested under Spark 3
# we recommend setting localShuffleReader to false to get better performance from Celeborn
spark.sql.adaptive.localShuffleReader.enabled false

# we recommend enabling AQE support to gain better performance
spark.sql.adaptive.enabled true
spark.sql.adaptive.skewJoin.enabled true
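The settings above are written in properties style (key, whitespace, value). If you would rather pass them per job, spark-submit accepts `--conf key=value` flags. The sketch below converts properties-style lines into such flags; `conf_flags` is a hypothetical helper, and it assumes values contain no spaces, which holds for every setting listed here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read "key value" lines on stdin, skip blank lines and
# comments, and print one "--conf key=value" flag per setting.
conf_flags() {
  awk 'NF >= 2 && $1 !~ /^#/ { printf "--conf %s=%s\n", $1, $2 }'
}

# Demo with three of the settings from this section.
conf_flags <<'EOF'
spark.shuffle.manager org.apache.spark.shuffle.celeborn.RssShuffleManager
# must use the Kryo serializer
spark.serializer org.apache.spark.serializer.KryoSerializer

spark.shuffle.service.enabled false
EOF
```

The demo prints three `--conf` lines, one per non-comment setting, which can then be appended to a spark-submit command line.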

Best Practice

If you want to set up a production-ready Celeborn cluster, it should have at least 3 masters and at least 4 workers. A master and a worker can be deployed on the same node, but you should not deploy multiple masters or multiple workers on the same node. See CONFIGURATIONS for more detail.

Support Spark Dynamic Allocation

We provide a patch that lets users run Spark with both dynamic allocation and Remote Shuffle Service. For Spark 2.x, check Spark2 Patch.
For Spark 3.x, check Spark3 Patch.

Metrics

Celeborn has various metrics. See METRICS for details.

Contribution

This is an active open-source project. We are always open to developers who want to use the system or contribute to it. See more detail in Contributing.

NOTICE

If you need to fully restart a Celeborn cluster in HA mode, you must first clean the Ratis meta storage, because it still holds expired state from the previously running cluster.

Here are some instructions:

Stop all workers.
Stop all masters.
Clean the Ratis meta storage directory (rss.ha.storage.dir) on every master.
Start all masters.
Start all workers.
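The order of these steps matters, so it can help to print the plan before touching the cluster. The sketch below only echoes the commands and runs nothing. Several names are assumptions: stop-worker.sh and stop-master.sh are hypothetical script names not mentioned in this README, /opt/celeborn is a placeholder install path, and RATIS_DIR must be set to whatever rss.ha.storage.dir points to in your configuration.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints the restart steps in order, executes nothing.
# ASSUMPTIONS: stop-worker.sh and stop-master.sh are hypothetical names;
# /opt/celeborn is a placeholder; RATIS_DIR must match rss.ha.storage.dir.
CELEBORN_HOME=${CELEBORN_HOME:-/opt/celeborn}
RATIS_DIR=${RATIS_DIR:-/mnt/disk1/rss_ratis}

restart_plan() {
  echo "on every worker: $CELEBORN_HOME/sbin/stop-worker.sh"
  echo "on every master: $CELEBORN_HOME/sbin/stop-master.sh"
  echo "on every master: rm -rf $RATIS_DIR/*"
  echo "on every master: $CELEBORN_HOME/sbin/start-master.sh"
  echo "on every worker: $CELEBORN_HOME/sbin/start-worker.sh"
}

restart_plan
```

Each step should finish on all nodes before the next one begins; in particular, no master should be started until every master's Ratis directory has been cleaned.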