Push Data
This article describes the detailed design of the data push process.
API specification
The push data API is as follows:
public abstract int pushData(
int shuffleId,
int mapId,
int attemptId,
int partitionId,
byte[] data,
int offset,
int length,
int numMappers,
int numPartitions)
throws IOException;
- `shuffleId` is the unique shuffle id of the application
- `mapId` is the map id of the shuffle
- `attemptId` is the attempt id of the map task, i.e. speculative task or task rerun for Apache Spark
- `partitionId` is the partition id the data belongs to
- `data`, `offset`, `length` specify the bytes to be pushed
- `numMappers` is the number of map tasks in the shuffle
- `numPartitions` is the number of partitions in the shuffle
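To make the parameter list concrete, here is a minimal sketch of how a map task might call this API. The `PushApi` interface and `writeRecord` helper are hypothetical stand-ins for illustration, as is the simple hash partitioner; the real client's partitioning is decided by the compute engine.

```java
import java.io.IOException;

// Hypothetical sketch of calling the push API above.
// PushApi and writeRecord are illustrative, not Celeborn's actual classes.
public class PushDataExample {
    // Stand-in for the abstract pushData API described above.
    interface PushApi {
        int pushData(int shuffleId, int mapId, int attemptId, int partitionId,
                     byte[] data, int offset, int length,
                     int numMappers, int numPartitions) throws IOException;
    }

    static int writeRecord(PushApi client, int shuffleId, int mapId, int attemptId,
                           byte[] serialized, int numMappers, int numPartitions)
            throws IOException {
        // A simple hash partitioner decides which partition the record belongs to.
        int partitionId = Math.floorMod(java.util.Arrays.hashCode(serialized), numPartitions);
        // Push the whole serialized record; assume the return value is the bytes pushed.
        return client.pushData(shuffleId, mapId, attemptId, partitionId,
                               serialized, 0, serialized.length, numMappers, numPartitions);
    }
}
```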
Lazy Shuffle Register
The first time pushData is called, the client checks whether the shuffle id has been registered. If not,
it sends RegisterShuffle to LifecycleManager, and LifecycleManager then sends RequestSlots to Master.
RequestSlots specifies how many PartitionLocations this shuffle requires; each PartitionLocation logically
corresponds to the data of some partition id.
Upon receiving RequestSlots, Master allocates slots for the shuffle among Workers. If replication is turned on,
Master allocates a pair of Workers for each PartitionLocation to store two replicas for each PartitionLocation.
The detailed allocation strategy can be found in Slots Allocation. Master then
responds to LifecycleManager with the allocated PartitionLocations.
LifecycleManager caches the PartitionLocations for the shuffle and responds to each RegisterShuffle RPC from
ShuffleClients.
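The lazy-registration logic above can be sketched as a cache that is filled on the first push for a shuffle. The class and method names below, and the string-valued locations, are illustrative stand-ins, not Celeborn's real implementation.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch of lazy shuffle registration: the pretend RPC to
// LifecycleManager runs only on the first push for a given shuffle id.
public class LazyRegisterSketch {
    // Cache of shuffleId -> allocated locations, filled on first push.
    private final Map<Integer, String[]> registered = new ConcurrentHashMap<>();
    private int registerCalls = 0;

    // Stand-in for RegisterShuffle -> RequestSlots; returns one location per partition.
    private String[] registerShuffle(int shuffleId, int numPartitions) {
        registerCalls++;
        String[] locations = new String[numPartitions];
        for (int p = 0; p < numPartitions; p++) {
            locations[p] = "worker-for-shuffle" + shuffleId + "-partition" + p;
        }
        return locations;
    }

    // Called on every push; registers only on the first call per shuffle.
    public String locationFor(int shuffleId, int partitionId, int numPartitions) {
        String[] locations = registered.computeIfAbsent(
            shuffleId, id -> registerShuffle(id, numPartitions));
        return locations[partitionId];
    }

    public int registerCallCount() { return registerCalls; }
}
```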
Normal Push
In normal cases, the process of pushing data is as follows:
- `ShuffleClient` compresses the data; currently `zstd` and `lz4` are supported
- `ShuffleClient` adds a header to the data: `mapId`, `attemptId`, `batchId` and `size`. The `batchId` is a unique id for the data batch within the (`mapId`, `attemptId`), for the purpose of de-duplication
- `ShuffleClient` sends `PushData` to the `Worker` on which the current `PartitionLocation` is allocated, and holds push state for this push
- `Worker` receives the data, does replication if needed, then responds with a success ACK to `ShuffleClient`. For more details about how data is replicated and stored in `Worker`s, please refer to Worker
- Upon receiving the success ACK from `Worker`, `ShuffleClient` considers this push successful and updates the push state
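The per-batch header (mapId, attemptId, batchId, size) can be illustrated with a small encoder/decoder. The fixed 4 × 4-byte layout here is an assumption for the sketch, not Celeborn's exact wire format.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration of the per-batch header described above:
// mapId, attemptId, batchId and size, prepended to the payload.
public class BatchHeaderSketch {
    static final int HEADER_LENGTH = 16; // assumed: four 4-byte ints

    // Prepend the header to the payload.
    static byte[] encode(int mapId, int attemptId, int batchId, byte[] payload) {
        ByteBuffer buf = ByteBuffer.allocate(HEADER_LENGTH + payload.length);
        buf.putInt(mapId).putInt(attemptId).putInt(batchId).putInt(payload.length);
        buf.put(payload);
        return buf.array();
    }

    // Read back {mapId, attemptId, batchId, size}; the receiver can use
    // (mapId, attemptId, batchId) to de-duplicate re-pushed batches.
    static int[] decodeHeader(byte[] batch) {
        ByteBuffer buf = ByteBuffer.wrap(batch);
        return new int[] { buf.getInt(), buf.getInt(), buf.getInt(), buf.getInt() };
    }
}
```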
Push or Merge?
If the size of the data to be pushed is small, say hundreds of bytes, sending it over the wire is very inefficient.
So ShuffleClient offers another API, mergeData, to batch data locally before sending it to a Worker.
mergeData merges data with the same target into DataBatches; the same target means the destinations of both the
primary and the replica are the same. When the size of a DataBatches exceeds a threshold (defaults to 64 KiB),
ShuffleClient triggers pushing and sends PushMergedData to the destination.
Upon receiving PushMergedData, Worker unpacks it into data segments each for a specific PartitionLocation, then
stores them accordingly.
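A minimal sketch of the merge-before-push idea: buffer data per destination and hand back the merged bytes once the documented 64 KiB default threshold is crossed. The class, the string-keyed destination, and the return-null-while-batching convention are assumptions for illustration, not Celeborn's actual mergeData signature.

```java
import java.io.ByteArrayOutputStream;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of batching small pushes per destination.
public class MergeSketch {
    static final int PUSH_THRESHOLD = 64 * 1024; // 64 KiB, the documented default

    private final Map<String, ByteArrayOutputStream> buffers = new HashMap<>();

    // Returns the merged bytes to push once the buffer for this destination
    // crosses the threshold, or null while still batching.
    public byte[] mergeData(String destination, byte[] data) {
        ByteArrayOutputStream buf =
            buffers.computeIfAbsent(destination, d -> new ByteArrayOutputStream());
        buf.write(data, 0, data.length);
        if (buf.size() >= PUSH_THRESHOLD) {
            buffers.remove(destination);
            return buf.toByteArray(); // would be sent as PushMergedData
        }
        return null;
    }
}
```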
Async Push
Celeborn's ShuffleClient does not block the compute engine's execution: pushing is asynchronous, implemented in
DataPusher.
Whenever the compute engine decides to push data, it calls DataPusher#addTask. DataPusher creates a PushTask that
contains the data and adds it to a non-blocking queue. DataPusher continuously polls the queue
and invokes ShuffleClient#pushData to do the actual push.
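The hand-off can be sketched as a producer/consumer pair: the caller enqueues and returns immediately, while a pusher thread drains the queue. This sketch uses a blocking queue and an empty-array poison pill for simplicity; the class names and shutdown convention are assumptions, not DataPusher's real implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;

// Simplified sketch of DataPusher-style asynchronous pushing.
public class AsyncPusherSketch {
    static class PushTask {
        final byte[] data;
        PushTask(byte[] data) { this.data = data; }
    }

    private final LinkedBlockingQueue<PushTask> queue = new LinkedBlockingQueue<>();
    final List<Integer> pushedSizes = new ArrayList<>(); // records what was "pushed"

    // Compute engine side: hand off the data and return immediately.
    public void addTask(byte[] data) throws InterruptedException {
        queue.put(new PushTask(data));
    }

    // Pusher side: drain the queue until the poison pill (empty array) arrives.
    public void runPusher() throws InterruptedException {
        while (true) {
            PushTask task = queue.take();
            if (task.data.length == 0) return; // shutdown signal
            pushedSizes.add(task.data.length); // stand-in for ShuffleClient#pushData
        }
    }
}
```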
Split
As mentioned before, Celeborn splits a PartitionLocation when any of the following conditions is met:
- `PartitionLocation` file size exceeds the threshold (defaults to 1 GiB)
- Usable space of the local disk is less than the threshold (defaults to 5 GiB)
- `Worker` is in `Graceful Shutdown` state
- Push data fails
For the first three cases, Worker informs ShuffleClient that it should trigger a split; for the last case,
ShuffleClient triggers the split itself.
There are two kinds of Split:
- `HARD_SPLIT`: the old `PartitionLocation` epoch refuses to accept any data, and further data for the `PartitionLocation` will only be pushed after the new `PartitionLocation` epoch is ready
- `SOFT_SPLIT`: the old `PartitionLocation` epoch continues to accept data; when the new epoch is ready, `ShuffleClient` switches to the new location transparently
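The difference between the two kinds can be sketched as epoch bookkeeping on the client side. The enum, the `-1` "push must wait" convention, and the boolean readiness flag are all illustrative assumptions, not Celeborn's actual split protocol.

```java
// Hypothetical sketch of how a client might react to the two split kinds.
public class SplitSketch {
    enum SplitMode { SOFT_SPLIT, HARD_SPLIT }

    private int epoch = 0;
    private boolean oldEpochAccepting = true;

    // Returns the epoch a new batch should target after a split notice;
    // -1 means the push must wait until the new epoch is ready.
    public int onSplit(SplitMode mode, boolean newEpochReady) {
        if (mode == SplitMode.HARD_SPLIT) {
            // Old epoch refuses data; only the new epoch may be targeted.
            oldEpochAccepting = false;
            return newEpochReady ? ++epoch : -1;
        }
        // SOFT_SPLIT: old epoch keeps accepting data until the new one is ready,
        // then the client switches transparently.
        return newEpochReady ? ++epoch : epoch;
    }

    public boolean oldEpochAccepting() { return oldEpochAccepting; }
}
```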
The process of SOFT_SPLIT is as follows:
LifecycleManager keeps the split information and tells the reducer to read from all splits of the PartitionLocation
to guarantee that no data is lost.