celeborn/docs/migration.md
mingji 5e77b851c9 [CELEBORN-1081] Client support celeborn.storage.activeTypes config
### What changes were proposed in this pull request?
1.To support `celeborn.storage.activeTypes` in Client.
2.Master will ignore slots for "UNKNOWN_DISK".

### Why are the changes needed?
Enable client application to select storage types to use.

### Does this PR introduce _any_ user-facing change?
Yes.

### How was this patch tested?
GA and cluster.

Closes #2045 from FMX/B1081.

Authored-by: mingji <fengmingxiao.fmx@alibaba-inc.com>
Signed-off-by: Shuang <lvshuang.tb@gmail.com>
2023-11-03 20:03:11 +08:00

---
hide:
  - navigation
license: |
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements. See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License. You may obtain a copy of the License at
  https://www.apache.org/licenses/LICENSE-2.0
  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
---
# Migration Guide
## Upgrading from 0.3 to 0.4
- Since 0.4.0, Celeborn is no longer compatible with Celeborn clients below version 0.3.0.
  Note: it is strongly recommended to use the same version of Client and Celeborn Master/Worker in production.
- Since 0.4.0, Celeborn no longer supports `org.apache.spark.shuffle.celeborn.RssShuffleManager`.
- Since 0.4.0, Celeborn changed the default value of `celeborn.<module>.io.numConnectionsPerPeer` from `2` to `1`.
- Since 0.4.0, Celeborn has changed the names of the prometheus master and worker configuration as shown in the table below:

    | Key Before v0.4.0                         | Key After v0.4.0            |
    |-------------------------------------------|-----------------------------|
    | `celeborn.metrics.master.prometheus.host` | `celeborn.master.http.host` |
    | `celeborn.metrics.master.prometheus.port` | `celeborn.master.http.port` |
    | `celeborn.metrics.worker.prometheus.host` | `celeborn.worker.http.host` |
    | `celeborn.metrics.worker.prometheus.port` | `celeborn.worker.http.port` |

- Since 0.4.0, Celeborn deprecates `celeborn.worker.storage.baseDir.prefix` and `celeborn.worker.storage.baseDir.number`.
  Please use `celeborn.worker.storage.dirs` instead.
- Since 0.4.0, Celeborn deprecates `celeborn.storage.activeTypes`. Please use `celeborn.storage.availableTypes` instead.
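
The one-to-one key renames in the 0.4 notes above can be applied to an existing configuration file mechanically. The following is a minimal sketch, not an official tool: the helper name `migrate_conf_text` is hypothetical, and it assumes each setting sits on its own line as `key value` or `key=value`. The `celeborn.worker.storage.baseDir.*` keys are deliberately left out, because replacing them with `celeborn.worker.storage.dirs` is not a plain rename.

```python
import re

# Old key -> new key, copied from the 0.4 notes above.
RENAMED_KEYS_0_4 = {
    "celeborn.metrics.master.prometheus.host": "celeborn.master.http.host",
    "celeborn.metrics.master.prometheus.port": "celeborn.master.http.port",
    "celeborn.metrics.worker.prometheus.host": "celeborn.worker.http.host",
    "celeborn.metrics.worker.prometheus.port": "celeborn.worker.http.port",
    "celeborn.storage.activeTypes": "celeborn.storage.availableTypes",
}


def migrate_conf_text(text: str) -> str:
    """Rename deprecated keys line by line; values and comments are untouched.

    Hypothetical helper: assumes one `key value` or `key=value` pair per line.
    """
    migrated = []
    for line in text.splitlines():
        stripped = line.lstrip()
        if stripped and not stripped.startswith("#"):
            # The key is the first token, ending at whitespace or '='.
            key = re.split(r"[\s=]", stripped, maxsplit=1)[0]
            if key in RENAMED_KEYS_0_4:
                line = line.replace(key, RENAMED_KEYS_0_4[key], 1)
        migrated.append(line)
    return "\n".join(migrated)
```

Calling `migrate_conf_text` on the contents of a configuration file returns the rewritten text; reviewing the result before replacing the original file is advisable.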
## Upgrading from 0.3.1 to 0.3.2
- Since 0.3.1, Celeborn changed the default value of `raft.client.rpc.request.timeout` from `3s` to `10s`.
- Since 0.3.1, Celeborn changed the default value of `raft.client.rpc.watch.request.timeout` from `10s` to `20s`.
## Upgrading from 0.3.0 to 0.3.1
- Since 0.3.1, Celeborn changed the default value of `celeborn.worker.directMemoryRatioToResume` from `0.5` to `0.7`.
- Since 0.3.1, Celeborn changed the default value of `celeborn.worker.monitor.disk.check.interval` from `60` to `30`.
- Since 0.3.1, the names of the JVM metrics have changed; see CELEBORN-1007 for details.
## Upgrading from 0.2 to 0.3
- The Celeborn 0.2 Client is compatible with the 0.3 Master/Worker, so Master/Worker can be upgraded first and the Client afterwards.
  Note: it is strongly recommended to use the same version of Client and Celeborn Master/Worker in production.
- Since 0.3.0, support for the deprecated `rss.*` configurations is removed.
  All configurations listed in the 0.2.1 docs still take effect, but some of them are deprecated as well. Please read
  the bootstrap logs and follow the suggestions to migrate to the new configurations.
- Since 0.3.0, the default value of `celeborn.client.push.replicate.enabled` is changed from `true` to `false`. Users
  who want replication must enable it explicitly. For example, Spark users can enable replication by adding
  the following Spark config when submitting a job: `spark.celeborn.client.push.replicate.enabled=true`
- Since 0.3.0, the default value of `celeborn.worker.storage.workingDir` is changed from `hadoop/rss-worker/shuffle_data` to `celeborn-worker/shuffle_data`.
  Users who want to keep the original working directory path should set this configuration explicitly.
- Since 0.3.0, the configuration namespace `celeborn.ha.master` is deprecated and will be removed in a future version.
  All `celeborn.ha.master.*` configurations should be migrated to `celeborn.master.ha.*`.
- Since 0.3.0, the environment variables `CELEBORN_MASTER_HOST` and `CELEBORN_MASTER_PORT` are removed.
  Instead, `CELEBORN_LOCAL_HOSTNAME` works on both master and worker, and takes higher priority than configurations defined in the properties file.
- Since 0.3.0, the Celeborn Master URL scheme is changed from `rss://` to `celeborn://`. Users who start a Worker with
  `sbin/start-worker.sh rss://<master-host>:<master-port>` should migrate to `sbin/start-worker.sh celeborn://<master-host>:<master-port>`.
- Since 0.3.0, Celeborn supports overriding Hadoop configuration (`core-site.xml`, `hdfs-site.xml`, etc.) from Celeborn configuration with the additional prefix `celeborn.hadoop.`.
  On the Spark client side, users should set Hadoop configuration like `spark.celeborn.hadoop.foo=bar`; note that `spark.hadoop.foo=bar` does not take effect.
  On the Flink client and Celeborn Master/Worker side, users should set it like `celeborn.hadoop.foo=bar`.
- Since 0.3.0, the Celeborn master metric `BlacklistedWorkerCount` is renamed to `ExcludedWorkerCount`.
- Since 0.3.0, the Celeborn master HTTP request URL `/blacklistedWorkers` is renamed to `/excludedWorkers`.
- Since 0.3.0, Celeborn updates the terminology for worker data replication, replacing the previous `master/slave` terms with `primary/replica`. The corresponding metric names have been adjusted to match.
  The following table lists the changes:

    | Key Before v0.3.0             | Key After v0.3.0               |
    |-------------------------------|--------------------------------|
    | `MasterPushDataTime`          | `PrimaryPushDataTime`          |
    | `MasterPushDataHandshakeTime` | `PrimaryPushDataHandshakeTime` |
    | `MasterRegionStartTime`       | `PrimaryRegionStartTime`       |
    | `MasterRegionFinishTime`      | `PrimaryRegionFinishTime`      |
    | `SlavePushDataTime`           | `ReplicaPushDataTime`          |
    | `SlavePushDataHandshakeTime`  | `ReplicaPushDataHandshakeTime` |
    | `SlaveRegionStartTime`        | `ReplicaRegionStartTime`       |
    | `SlaveRegionFinishTime`       | `ReplicaRegionFinishTime`      |

- Since 0.3.0, Celeborn's Spark shuffle manager is changed from `org.apache.spark.shuffle.celeborn.RssShuffleManager` to `org.apache.spark.shuffle.celeborn.SparkShuffleManager`. Users can set the Spark property `spark.shuffle.manager` to `org.apache.spark.shuffle.celeborn.SparkShuffleManager` to use the Celeborn remote shuffle service.
  In 0.3.0, Celeborn still supports `org.apache.spark.shuffle.celeborn.RssShuffleManager`; it will be removed in 0.4.0.
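
Dashboards and alert rules that reference the metric names renamed in 0.3.0 (the `master/slave` metrics and `BlacklistedWorkerCount` listed in this section) need the same renames applied. The sketch below is one possible way to rewrite such text in a single pass. It is only an illustration: the function name `rename_metrics` is hypothetical, and the `metrics_` prefix and `_Count` suffix in the usage example are assumptions about how a monitoring system might expose the names, not something this guide specifies.

```python
import re

# Old metric name -> new metric name, copied from the 0.3 notes above.
RENAMED_METRICS_0_3 = {
    "MasterPushDataTime": "PrimaryPushDataTime",
    "MasterPushDataHandshakeTime": "PrimaryPushDataHandshakeTime",
    "MasterRegionStartTime": "PrimaryRegionStartTime",
    "MasterRegionFinishTime": "PrimaryRegionFinishTime",
    "SlavePushDataTime": "ReplicaPushDataTime",
    "SlavePushDataHandshakeTime": "ReplicaPushDataHandshakeTime",
    "SlaveRegionStartTime": "ReplicaRegionStartTime",
    "SlaveRegionFinishTime": "ReplicaRegionFinishTime",
    "BlacklistedWorkerCount": "ExcludedWorkerCount",
}

# One alternation over all old names, longest first, so that the text is
# scanned once and already-renamed names are never rewritten again.
_OLD_NAMES = re.compile(
    "|".join(
        re.escape(name)
        for name in sorted(RENAMED_METRICS_0_3, key=len, reverse=True)
    )
)


def rename_metrics(text: str) -> str:
    """Replace every pre-0.3.0 metric name found in `text` (hypothetical helper)."""
    return _OLD_NAMES.sub(lambda m: RENAMED_METRICS_0_3[m.group(0)], text)


# Usage: the query string below is a made-up example.
query = "rate(metrics_MasterPushDataTime_Count[1m])"
print(rename_metrics(query))
```

Plain substring matching is used instead of word boundaries because metric names are often embedded between underscores, where `\b` would not match.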