[ML-2918] Call count() in default score() to improve timing of transform() (#159)
For Models and Transformers that are not tested with Evaluators, I think we are not timing transform() correctly here:
spark-sql-perf/src/main/scala/com/databricks/spark/sql/perf/mllib/MLPipelineStageBenchmarkable.scala
Line 65 in aa1587f
transformer.transform(trainingData)
Since transform() is lazy, that call only builds a query plan, so we need to materialize the result while the timer is running. This PR currently just calls count() in the default implementation of score().
* call count() in score()
* changed count to UDF
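To illustrate the timing problem described above, here is a minimal sketch that is not part of this PR. It assumes a local SparkSession and uses `Binarizer` as a stand-in Transformer; the `timeMs` helper and the object name are hypothetical. It times `transform()` on its own and then `transform()` followed by an action.

```scala
import org.apache.spark.ml.feature.Binarizer
import org.apache.spark.sql.SparkSession

object LazyTransformTiming {
  // Hypothetical helper: returns the block's result and its wall-clock time in milliseconds.
  def timeMs[A](block: => A): (A, Double) = {
    val start = System.nanoTime()
    val result = block
    (result, (System.nanoTime() - start) / 1e6)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("lazy-transform-timing")
      .getOrCreate()
    import spark.implicits._

    // Synthetic input: one double column named "feature".
    val data = spark.range(0, 1000000)
      .select(($"id" % 10).cast("double").as("feature"))

    // Stand-in Transformer (an assumption; any Transformer would do).
    val binarizer = new Binarizer()
      .setInputCol("feature")
      .setOutputCol("binarized")
      .setThreshold(4.5)

    // transform() alone only builds a query plan; no rows are processed here.
    val (_, planOnlyMs) = timeMs(binarizer.transform(data))

    // An action forces execution, so this measurement includes the actual work.
    val (rows, withActionMs) = timeMs(binarizer.transform(data).count())

    println(f"transform() only: $planOnlyMs%.1f ms")
    println(f"transform() + count(): $withActionMs%.1f ms ($rows rows)")

    spark.stop()
  }
}
```

The absolute numbers depend on the machine and on JVM warm-up, so the sketch prints them rather than asserting a ratio.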
parent 1798b12077
commit 30c50dddbb
@@ -35,12 +35,21 @@ trait BenchmarkAlgorithm extends Logging {
   /**
    * The unnormalized score of the training procedure on a dataset. The normalization is
    * performed by the caller.
+   * This calls `count()` on the transformed data to attempt to materialize the result for
+   * recording timing metrics.
    */
   @throws[Exception]("if scoring fails")
   def score(
       ctx: MLBenchContext,
       testSet: DataFrame,
-      model: Transformer): MLMetric = MLMetric.Invalid
+      model: Transformer): MLMetric = {
+    val output = model.transform(testSet)
+    // We create a useless UDF to make sure the entire DataFrame is instantiated.
+    val fakeUDF = udf { (_: Any) => 0 }
+    val columns = testSet.columns
+    output.select(sum(fakeUDF(struct(columns.map(col) : _*)))).first()
+    MLMetric.Invalid
+  }
 
   def name: String = {
     this.getClass.getCanonicalName.replace("$", "")
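The materialization trick in the diff can also be sketched as a standalone helper. This is an illustrative variant, not the commit's code: the object and function names are hypothetical, the UDF takes a `Row` rather than `Any`, and it reads the columns of whichever DataFrame it is given (the diff uses `testSet.columns`). Passing every column through a UDF is meant to stop the optimizer from skipping columns that a plain `count()` would not need; how much actually gets pruned may depend on the Spark version.

```scala
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.{col, struct, sum, udf}

object Materialize {
  // Hypothetical helper, not part of spark-sql-perf.
  // Feeds all columns of `df` through a trivial UDF and sums the results,
  // which runs a job that has to read every listed column of every row.
  // Returns the number of rows visited.
  // Assumes column names contain no dots or backticks, since they go through col().
  def forceAllColumns(df: DataFrame): Long = {
    val touchRow = udf { (_: Row) => 1L }
    val allColumns = df.columns.map(col)
    val result = df.select(sum(touchRow(struct(allColumns: _*)))).first()
    // sum() over an empty DataFrame yields null, so guard before reading.
    if (result.isNullAt(0)) 0L else result.getLong(0)
  }
}
```

A caller would wrap `Materialize.forceAllColumns(model.transform(testSet))` inside the timed region, so that the measurement covers the work done by `transform()` and not just plan construction.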