kyuubi/docs/tools/spark_block_cleaner.md

Kubernetes Tools Spark Block Cleaner

Requirements

Before using spark-block-cleaner, you should be familiar with Spark on Kubernetes, Docker image builds, and kubectl, all of which the steps below rely on.

Scenarios

When you run Spark on Kubernetes in client mode and do not use emptyDir as the Spark local-dir type, executor pods may be deleted without all of their block files being cleaned up. Over time this can fill the disk.

Spark Block Cleaner addresses this by clearing the block files that Spark leaves behind.

Principle

When deploying Spark Block Cleaner, you configure volumes for the target folders. Spark Block Cleaner locates these folders through the CACHE_DIRS parameter.

Spark Block Cleaner cleans the configured folders in a loop, sleeping for a fixed interval between passes (configured by SCHEDULE_INTERVAL). It selects folders whose names start with blockmgr and spark for deletion, following the naming logic Spark uses when it creates those folders.
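The folder selection described above can be sketched as follows. This is only an illustration, not the cleaner's actual implementation: the function name `select_spark_dirs` is hypothetical, and a plain "name starts with `blockmgr` or `spark`" test is assumed from the description.

```python
import os

# Prefixes named in the text; the exact matching rule is an assumption.
PREFIXES = ("blockmgr", "spark")


def select_spark_dirs(cache_dir):
    """Return sub-directories of cache_dir whose names start with a known prefix."""
    selected = []
    for entry in sorted(os.listdir(cache_dir)):
        path = os.path.join(cache_dir, entry)
        # Only directories are candidates; plain files at the top level are skipped.
        if os.path.isdir(path) and entry.startswith(PREFIXES):
            selected.append(path)
    return selected
```

Calling `select_spark_dirs("/data/data1")` would return only the matching sub-directories, leaving any unrelated data in the same mount untouched.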

Before deleting a file, Spark Block Cleaner checks whether it was modified recently. Only files whose last modification is older than the interval configured by FILE_EXPIRED_TIME are deleted.
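A minimal sketch of this expiry check, assuming the comparison is "current time minus last modified time is greater than the configured value". The helper name `is_expired` and its `now` parameter are hypothetical and exist only for illustration.

```python
import os
import time


def is_expired(path, expired_seconds, now=None):
    """True if the file's last modification is older than expired_seconds."""
    # `now` can be injected for testing; by default the wall clock is used.
    now = time.time() if now is None else now
    return now - os.path.getmtime(path) > expired_seconds
```

With the default FILE_EXPIRED_TIME of 604800 seconds, a file last modified eight days ago would be considered expired, while one modified yesterday would be kept.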

After a clean, Spark Block Cleaner checks the disk utilization. If the remaining free space is below the configured value (FREE_SPACE_THRESHOLD), it triggers a deep clean, which uses its own file expiry time (DEEP_CLEAN_FILE_EXPIRED_TIME).
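The deep-clean trigger can be pictured with the sketch below. It assumes that the threshold is a percentage of free space and that the comparison is strict; both function names are hypothetical, and the real cleaner may measure utilization differently.

```python
import shutil


def free_space_percent(path):
    """Percentage of the filesystem holding `path` that is still free."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Defensive guard for pseudo-filesystems that report no capacity.
        return 0.0
    return usage.free * 100.0 / usage.total


def needs_deep_clean(free_percent, threshold_percent):
    """Deep clean is triggered when free space falls below the threshold."""
    return free_percent < threshold_percent
```

For example, with FREE_SPACE_THRESHOLD set to 60, a disk with 40% free space after the first pass would trigger a deep clean, while one with 75% free would not.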

Usage

Before using Spark Block Cleaner, build its Docker image.

Build Block Cleaner Docker Image

From the KYUUBI_HOME directory, run the following command to build the Docker image.

docker build ./tools/spark-block-cleaner/kubernetes/docker

Modify spark-block-cleaner.yml

Modify ${KYUUBI_HOME}/tools/spark-block-cleaner/kubernetes/spark-block-cleaner.yml to fit your environment.

We recommend running Spark Block Cleaner as a DaemonSet, and Kyuubi provides a default YAML file for that deployment.

Base file structure:

apiVersion
kind
metadata
  name
  namespace
spec
  selector
  template
    metadata
    spec
      containers
      - image
        volumeMounts
        env
      volumes

You can tune the behavior of Spark Block Cleaner through the parameters in the containers env section of spark-block-cleaner.yml.

env:
  - name: CACHE_DIRS
    value: /data/data1,/data/data2
  # Kubernetes requires env values to be strings, so numeric values are quoted.
  - name: FILE_EXPIRED_TIME
    value: "604800"
  - name: DEEP_CLEAN_FILE_EXPIRED_TIME
    value: "432000"
  - name: FREE_SPACE_THRESHOLD
    value: "60"
  - name: SCHEDULE_INTERVAL
    value: "3600"
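To make the meaning of these variables concrete, the sketch below shows one way a process could read them, using the values above as defaults. It is an assumption-laden illustration: `load_config` and the dictionary keys it returns are hypothetical names, not part of Spark Block Cleaner.

```python
import os

# Defaults copied from the env block above.
DEFAULTS = {
    "CACHE_DIRS": "/data/data1,/data/data2",
    "FILE_EXPIRED_TIME": "604800",
    "DEEP_CLEAN_FILE_EXPIRED_TIME": "432000",
    "FREE_SPACE_THRESHOLD": "60",
    "SCHEDULE_INTERVAL": "3600",
}


def load_config(env=None):
    """Parse cleaner settings from an environment mapping (os.environ by default)."""
    env = os.environ if env is None else env
    raw = {key: env.get(key, default) for key, default in DEFAULTS.items()}
    return {
        # CACHE_DIRS is a comma-separated list of container paths.
        "cache_dirs": [d.strip() for d in raw["CACHE_DIRS"].split(",") if d.strip()],
        "file_expired_time": int(raw["FILE_EXPIRED_TIME"]),  # seconds
        "deep_clean_file_expired_time": int(raw["DEEP_CLEAN_FILE_EXPIRED_TIME"]),  # seconds
        "free_space_threshold": int(raw["FREE_SPACE_THRESHOLD"]),  # percent
        "schedule_interval": int(raw["SCHEDULE_INTERVAL"]),  # seconds
    }
```

Passing an explicit mapping, such as `load_config({"CACHE_DIRS": "/data/data1"})`, makes it easy to see how a single overridden variable combines with the remaining defaults.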

Most importantly, configure volumeMounts and volumes that correspond to the Spark local-dirs.

For example, if Spark uses /spark/shuffle1 as its local-dir, you can configure:

volumes:
  - name: block-files-dir-1
    hostPath:
      path: /spark/shuffle1
volumeMounts:
  - name: block-files-dir-1
    mountPath: /data/data1
env:
  - name: CACHE_DIRS
    value: /data/data1
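The three pieces in this example have to line up: each volumeMount refers to a volume by name, and each CACHE_DIRS entry should be a path mounted in the container. The hypothetical helper below sketches a sanity check you could run on the values before applying the file. It assumes CACHE_DIRS entries match a mountPath exactly, which is a simplification.

```python
def check_mounts(volumes, volume_mounts, cache_dirs):
    """Return a list of human-readable problems; an empty list means consistent."""
    problems = []
    volume_names = {v["name"] for v in volumes}
    mount_paths = {m["mountPath"] for m in volume_mounts}
    # Every volumeMount must point at a declared volume.
    for mount in volume_mounts:
        if mount["name"] not in volume_names:
            problems.append(f"volumeMount {mount['name']} has no matching volume")
    # Every CACHE_DIRS entry must be reachable inside the container.
    for cache_dir in cache_dirs:
        if cache_dir not in mount_paths:
            problems.append(f"CACHE_DIRS entry {cache_dir} is not a mountPath")
    return problems
```

For the example above, the volume `block-files-dir-1`, its mount at `/data/data1`, and `CACHE_DIRS=/data/data1` are consistent, so the function returns an empty list.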

Start DaemonSet

After you finish modifying the file, run kubectl apply -f ${KYUUBI_HOME}/tools/spark-block-cleaner/kubernetes/spark-block-cleaner.yml to start the DaemonSet.

| Name | Default | Unit | Meaning |
|------|---------|------|---------|
| CACHE_DIRS | /data/data1,/data/data2 | | The target directories, as container paths, in which block files are cleaned. |
| FILE_EXPIRED_TIME | 604800 | seconds | The cleaner deletes block files for which current time minus last modified time exceeds this value. |
| DEEP_CLEAN_FILE_EXPIRED_TIME | 432000 | seconds | A deep clean deletes block files for which current time minus last modified time exceeds this value. |
| FREE_SPACE_THRESHOLD | 60 | % | After the first clean, a deep clean is triggered if free space is lower than this threshold. |
| SCHEDULE_INTERVAL | 3600 | seconds | How long the cleaner sleeps between cleaning passes. |