celeborn/docs/configuration
Xianming Lei 9131c1e07a [CELEBORN-1792] MemoryManager resume should use pinnedDirectMemory instead of usedDirectMemory
### What changes were proposed in this pull request?
Congestion and MemoryManager should use pinnedDirectMemory instead of usedDirectMemory

### Why are the changes needed?
In our production environment, after a worker pauses, usedDirectMemory stays high and does not decrease. The worker node is then permanently blacklisted and cannot be used.

This problem has been bothering us for a long time. Even when the thread cache is turned off, **after ctx.channel().config().setAutoRead(false), the netty framework will still hold some ByteBufs**. These ByteBufs prevent a large number of PoolChunks from being released.

In netty, if a chunk is 16M and 8k of this chunk has been allocated, then the pinnedMemory is 8k and the activeMemory is 16M. The remaining (16M-8k) is available for allocation but has not been allocated yet. Because netty allocates and releases memory in chunk units, the 8k that has been allocated keeps the whole 16M from being returned to the operating system.
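The difference between the two counters can be illustrated with a toy model. This is a minimal sketch and not netty's implementation: `ChunkAccountingSketch`, `Chunk`, `usedMemory()` and `pinnedMemory()` are hypothetical names chosen for illustration, and the numbers are the 16M / 8k example above.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of chunk-based pooling; NOT netty's actual implementation. */
final class ChunkAccountingSketch {

    /** A pooled chunk: reserved as a whole, handed out to buffers in pieces. */
    static final class Chunk {
        final long capacity; // bytes reserved for the whole chunk
        long pinned;         // bytes currently handed out to live buffers

        Chunk(long capacity) {
            this.capacity = capacity;
        }
    }

    final List<Chunk> chunks = new ArrayList<>();

    /** Sum of whole chunk capacities, no matter how little of each chunk is in use. */
    long usedMemory() {
        long sum = 0;
        for (Chunk c : chunks) {
            sum += c.capacity;
        }
        return sum;
    }

    /** Sum of the bytes that live buffers actually occupy. */
    long pinnedMemory() {
        long sum = 0;
        for (Chunk c : chunks) {
            sum += c.pinned;
        }
        return sum;
    }

    public static void main(String[] args) {
        ChunkAccountingSketch pool = new ChunkAccountingSketch();
        Chunk chunk = new Chunk(16L * 1024 * 1024); // one 16M chunk
        chunk.pinned = 8L * 1024;                   // 8k of it is allocated
        pool.chunks.add(chunk);
        // prints: used=16777216 pinned=8192
        System.out.println("used=" + pool.usedMemory() + " pinned=" + pool.pinnedMemory());
    }
}
```

In this model a single small live buffer is enough to keep a whole chunk counted as used, which matches the behavior described above.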

Here are some observations from our production/test environment:

We configured 10gb of off-heap memory for the worker; the other configs are as follows:
```
celeborn.network.memory.allocator.allowCache                         false
celeborn.worker.monitor.memory.check.interval                         100ms
celeborn.worker.monitor.memory.report.interval                        10s
celeborn.worker.directMemoryRatioToPauseReceive                       0.75
celeborn.worker.directMemoryRatioToPauseReplicate                     0.85
celeborn.worker.directMemoryRatioToResume                             0.5
```

Under high traffic, the worker's usedDirectMemory increases. After trim and pause are triggered, usedDirectMemory still does not fall to the resume threshold, and the worker is excluded.
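For reference, the byte thresholds implied by the ratios above can be worked out as follows. This is only an arithmetic sketch: `ThresholdSketch` and `threshold` are hypothetical names, and it assumes that "10gb" means 10 GiB and that each ratio is applied to the full off-heap size.

```java
/** Arithmetic sketch only; the class and method names are hypothetical. */
final class ThresholdSketch {
    static final long GIB = 1024L * 1024 * 1024;

    /** Converts a ratio of the direct-memory budget into a byte threshold. */
    static long threshold(long maxDirectMemory, double ratio) {
        // Round instead of truncating so ratios like 0.85 do not lose a byte
        // to floating-point error.
        return Math.round(maxDirectMemory * ratio);
    }

    public static void main(String[] args) {
        long maxDirect = 10 * GIB; // assumption: "10gb" is 10 GiB

        System.out.println("pauseReceive   = " + threshold(maxDirect, 0.75));
        System.out.println("pauseReplicate = " + threshold(maxDirect, 0.85));
        System.out.println("resume         = " + threshold(maxDirect, 0.5));
    }
}
```

Under these assumptions, receiving pauses at 7.5 GiB, replication pauses at 8.5 GiB, and the worker resumes only once the measured memory drops below 5 GiB.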

![image](https://github.com/user-attachments/assets/40f5609e-fbf9-4841-84ec-69a69256edf4)

So we checked a heap snapshot of the abnormal worker and saw a large number of DirectByteBuffers in heap memory. These DirectByteBuffers are all 4mb in size, which is exactly the chunk size. According to the path to GC root, each DirectByteBuffer is held by a PoolChunk, and these 4m chunks have only 160k of pinnedBytes.

![image](https://github.com/user-attachments/assets/3d755ef3-164c-4b5b-bec1-aaf039c0c0a5)

![image](https://github.com/user-attachments/assets/17907753-2f42-4617-a95e-1ee980553fb9)

There are many ByteBufs that have not been released.

![image](https://github.com/user-attachments/assets/b87eb1a9-313f-4f42-baa8-227fd49c19b6)

The stack shows that these ByteBufs were allocated by netty.
![image](https://github.com/user-attachments/assets/f8783f99-507a-44a8-9de5-7215a5eed1db)

We tried to reproduce this situation in the test environment. When the same problem occurred, we added a RESTful API to the worker to force it to resume. After the resume, the worker returned to normal, and PushDataHandler handled many delayed requests.

![image](https://github.com/user-attachments/assets/be37039b-97b8-4ae8-a64f-a2003bea613e)

![image](https://github.com/user-attachments/assets/24b1c8ad-131c-4bd6-adcb-bad655cfbdbf)

So I think that when pinnedMemory is not high enough, we should not trigger pause and congestion, because at this time a large part of the memory can still be allocated.
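The effect of switching the input of the resume check can be sketched as below. This illustrates the idea only and is not the code in this patch: `ResumeDecisionSketch` and `belowResumeThreshold` are hypothetical names, the chunk count of 2000 is made up, 160k pinned per 4m chunk is assumed to hold for every chunk, and the strict `<` comparison is an assumption.

```java
/** Illustration of the idea only; NOT the MemoryManager code from this patch. */
final class ResumeDecisionSketch {
    static final long KIB = 1024L;
    static final long MIB = 1024L * KIB;
    static final long GIB = 1024L * MIB;

    /** True when the measured memory is below the resume threshold (assumed strict). */
    static boolean belowResumeThreshold(long memoryBytes, long maxDirectMemory, double resumeRatio) {
        return memoryBytes < maxDirectMemory * resumeRatio;
    }

    public static void main(String[] args) {
        long maxDirect = 10 * GIB;  // assumption: "10gb" is 10 GiB
        double resumeRatio = 0.5;   // celeborn.worker.directMemoryRatioToResume

        long chunkCount = 2000;                // hypothetical number of retained chunks
        long used = chunkCount * 4 * MIB;      // whole 4m chunks are counted
        long pinned = chunkCount * 160 * KIB;  // assumed 160k pinned per chunk

        // Based on used memory the worker stays paused ...
        System.out.println("resume by used:   " + belowResumeThreshold(used, maxDirect, resumeRatio));
        // ... while pinned memory is far below the threshold.
        System.out.println("resume by pinned: " + belowResumeThreshold(pinned, maxDirect, resumeRatio));
    }
}
```

With these made-up numbers, used memory is about 7.8 GiB while pinned memory is about 312 MiB, so only the pinned-based check lets the worker resume.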

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing UTs.

Closes #3018 from leixm/CELEBORN-1792.

Authored-by: Xianming Lei <31424839+leixm@users.noreply.github.com>
Signed-off-by: Shuang <lvshuang.xjs@alibaba-inc.com>
2025-01-22 14:30:20 +08:00
client.md [CELEBORN-1748] Deprecate identity provider configs tied with quota 2024-12-04 09:28:40 +08:00
columnar-shuffle.md [CELEBORN-1051] Add isDynamic property for CelebornConf 2024-02-20 14:20:44 +08:00
ha.md [CELEBORN-1400] Bump Ratis version from 2.5.1 to 3.0.1 2024-05-30 17:22:22 +08:00
index.md [MINOR] Add documentation for CELEBORN_NO_DAEMONIZE 2024-12-23 10:31:37 +08:00
master.md [CELEBORN-1811] Update default value for celeborn.master.slot.assign.extraSlots 2024-12-31 15:37:28 +08:00
metrics.md [CELEBORN-1745] Remove application top disk usage code 2024-11-28 10:55:34 +08:00
network-module.md [CELEBORN-1353] Document Celeborn security - authentication and SSL support 2024-04-30 14:37:56 +08:00
network.md [CELEBORN-1774][FOLLOWUP] Change celeborn.<module>.io.mode optional to explain default behavior in description 2025-01-02 21:15:19 +08:00
quota.md [CELEBORN-1748] Deprecate identity provider configs tied with quota 2024-12-04 09:28:40 +08:00
worker.md [CELEBORN-1792] MemoryManager resume should use pinnedDirectMemory instead of usedDirectMemory 2025-01-22 14:30:20 +08:00