kyuubi/extensions/spark/kyuubi-spark-lineage
xglv1985 7c110b68f8
[KYUUBI #6912][LINEAGE] Properly handle empty attribute set on mergeRelationColumnLineage
# Why are the changes needed?
## Issue reference:
https://github.com/apache/kyuubi/issues/6912

## How to reproduce the issue?
This PR fixes an incorrect result when building an instance of org.apache.kyuubi.plugin.lineage.Lineage in the following case:
step 1: create a temporary view from a file
step 2: insert into a table by selecting from the temporary view in step 1
step 3: generate the lineage when executing the insert statement in step 2
For details, see the unit test added in this patch.

## The issue analysis
Consider the current code that builds the Lineage object by resolving a LogicalPlan object:
<img width="694" alt="image" src="https://github.com/user-attachments/assets/65256a0d-320d-4271-968f-59eafb74de9f" />

With the above logic, this case yields None instead of an org.apache.kyuubi.plugin.lineage.Lineage object, because of the protective "try-catch". The None value causes problems in the following two scenarios:
### Unit Test Environment
In the unit test, a "None.get" exception is raised when the code reaches this point:
<img width="682" alt="image" src="https://github.com/user-attachments/assets/102dc9bd-294f-4b1e-b1c6-01b6fee50fed" />

Here's the runtime exception stack:
```
None.get
java.util.NoSuchElementException: None.get
	at scala.None$.get(Option.scala:529)
	at scala.None$.get(Option.scala:527)
	at org.apache.kyuubi.plugin.lineage.helper.SparkSQLLineageParserHelperSuite.extractLineageWithoutExecuting(SparkSQLLineageParserHelperSuite.scala:1485)
	at org.apache.kyuubi.plugin.lineage.helper.SparkSQLLineageParserHelperSuite.$anonfun$new$83(SparkSQLLineageParserHelperSuite.scala:1465)
```
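The failure mode behind this stack trace can be sketched in isolation. This is a minimal illustration, not the plugin's code: `Lineage`, `resolve`, and `parse` below are hypothetical stand-ins, and the only assumption is that a protective wrapper turns any exception into `None`, which a caller then unwraps with `.get`.

```scala
import scala.util.Try

object NoneGetSketch {
  // Hypothetical stand-in for the real Lineage class; the field is illustrative only.
  final case class Lineage(outputColumns: List[String])

  // Hypothetical resolution step that fails on empty input,
  // the way an unguarded `.head` on an empty collection does.
  def resolve(columns: List[String]): Lineage = {
    val first = columns.head // throws NoSuchElementException when `columns` is empty
    Lineage(first :: columns.tail)
  }

  // Protective wrapper: any non-fatal exception is swallowed and becomes None.
  def parse(columns: List[String]): Option[Lineage] =
    Try(resolve(columns)).toOption
}
```

Calling `NoneGetSketch.parse(Nil)` returns `None` without any visible error, so the problem only surfaces later when a caller invokes `.get` on the result. This is why the exception in the trace points at the test helper rather than at the code that actually failed.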
### Production Environment
The result cannot be used in the production environment because it is None and therefore lacks the necessary lineage information. For the case above, the correct content of the Lineage instance should be:
```
inputTables(List())
outputTables(List(spark_catalog.test_db.test_table_from_dir))
columnLineage(List(ColumnLineage(spark_catalog.test_db.test_table_from_dir.a0,Set()), ColumnLineage(spark_catalog.test_db.test_table_from_dir.b0,Set())))
```

A newly added test case ("test directory to table") passes once this issue is fixed.

# How to fix the issue?
Add an emptiness check for the attribute set in mergeRelationColumnLineage. For details, see the code changes in this patch.
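The idea of such an emptiness check can be sketched as follows. This is a simplified illustration under stated assumptions, not the code from the patch: columns are modelled as plain strings, and `mergeUnguarded` and `mergeGuarded` are hypothetical helpers rather than the real `mergeRelationColumnLineage` signature.

```scala
object EmptyAttributeSetSketch {
  // Hypothetical simplification: each output column maps to the set of
  // source columns it was derived from.
  type ColumnLineage = Map[String, Set[String]]

  // Unguarded merge: `reduce` throws when there are no source sets at all,
  // for example when the relation is read straight from a file.
  def mergeUnguarded(output: Seq[String], sources: Seq[Set[String]]): ColumnLineage = {
    val merged = sources.reduce(_ ++ _)
    output.map(col => col -> merged).toMap
  }

  // Guarded merge: an empty input falls back to an empty source set,
  // so every output column still gets a lineage entry.
  def mergeGuarded(output: Seq[String], sources: Seq[Set[String]]): ColumnLineage = {
    val merged = sources.reduceOption(_ ++ _).getOrElse(Set.empty[String])
    output.map(col => col -> merged).toMap
  }
}
```

With the guard in place, `mergeGuarded(Seq("a0", "b0"), Nil)` maps both columns to an empty set, which has the same shape as the expected `ColumnLineage(..., Set())` entries listed earlier.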

# How was this patch tested?
1. By adding a new unit test case and making sure it passes.
2. By submitting a Spark application that includes the SQL from this case in the production environment and confirming that a correct Lineage instance is generated instead of None.

# Was this patch authored or co-authored using generative AI tooling?
No

Closes #6911 from xglv1985/fix_spark_lineage_runtime_exception.

Closes #6912

13a71075d [Cheng Pan] Update extensions/spark/kyuubi-spark-lineage/src/test/scala/org/apache/kyuubi/plugin/lineage/helper/SparkSQLLineageParserHelperSuite.scala
4e89b95cd [Cheng Pan] Update extensions/spark/kyuubi-spark-lineage/src/test/scala/org/apache/kyuubi/plugin/lineage/helper/SparkSQLLineageParserHelperSuite.scala
59b350bfb [xglv1985] fix a runtime exception when generate column lineage tuple--more readable code
52bc0288d [xglv1985] fix a runtime exception when generate column lineage tuple--spotless sytle
fea6bbc0d [xglv1985] fix a runtime exception when generate column lineage tuple--remove tab from UT code
901879095 [xglv1985] fix a runtime exception when generate column lineage tuple--unit test
fbb4df879 [xglv1985] fix a runtime exception when generate column lineage tuple

Lead-authored-by: xglv1985 <xglv1985@gmail.com>
Co-authored-by: Cheng Pan <pan3793@gmail.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>
2025-02-14 10:27:51 +08:00
src [KYUUBI #6912][LINEAGE] Properly handle empty attribute set on mergeRelationColumnLineage 2025-02-14 10:27:51 +08:00
pom.xml [KYUUBI #6769] [RELEASE] Bump 1.11.0-SNAPSHOT 2024-10-23 17:10:56 +08:00
README.md [KYUUBI #6163] Set default Spark version to 3.5 2024-03-12 16:22:37 +08:00

Kyuubi Spark Listener Extension

Functions

  • All listener extensions can be implemented in this module, like QueryExecutionListener and ExtraListener
  • SparkOperationLineageQueryExecutionListener extends Spark's QueryExecutionListener
  • SQL lineage parsing is triggered after SQL execution, and the result is written to the JSON logger file

Build

build/mvn clean package -DskipTests -pl :kyuubi-spark-lineage_2.12 -am -Dspark.version=3.2.1

Supported Apache Spark Versions

-Dspark.version=

  • master
  • 3.5.x (default)
  • 3.4.x
  • 3.3.x
  • 3.2.x
  • 3.1.x