[ML-2918] Call count() in default score() to improve timing of transform() (#159)

For Models and Transformers which are not tested with Evaluators, I think we are not timing transform() correctly here:

spark-sql-perf/src/main/scala/com/databricks/spark/sql/perf/mllib/MLPipelineStageBenchmarkable.scala, line 65 in aa1587f:

    transformer.transform(trainingData)

Since transform() is lazy, we need to materialize it during timing. This PR currently just calls count() in the default implementation of score().

* call count() in score()
* changed count to UDF
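The laziness described above can be reproduced in isolation. The following is a minimal sketch, not code from this repository: `LazyTransformDemo`, `timed`, `run`, and the `square` UDF are hypothetical names, and `withColumn` with a UDF stands in for `model.transform(testSet)`. It assumes Spark SQL is on the classpath and a local master is acceptable. An accumulator counts the rows the stand-in transform has processed, which should stay at zero until an action runs. The materialization step follows the same pattern as the diff (a no-op UDF over a struct of columns, summed to one row), except that this sketch passes the output's columns so the hypothetical `squared` column is certainly referenced.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, struct, sum, udf}

object LazyTransformDemo {
  /** Runs `body` and returns its result with the elapsed wall-clock time in milliseconds. */
  def timed[A](body: => A): (A, Double) = {
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1e6)
  }

  /** Returns (rows processed before the action, rows processed after the action). */
  def run(spark: SparkSession, numRows: Long): (Long, Long) = {
    // Counts how many rows the stand-in "transform" has actually processed.
    val processed = spark.sparkContext.longAccumulator("processedRows")
    val square = udf { (x: Long) => processed.add(1L); x * x }

    val input = spark.range(numRows).toDF("id")

    // Hypothetical stand-in for model.transform(testSet): this only builds a query plan.
    val (output, planMs) = timed { input.withColumn("squared", square(col("id"))) }
    val before = processed.sum

    // Same pattern as the diff: a no-op UDF over a struct of columns, summed to a single row.
    val fakeUDF = udf { (_: Any) => 0 }
    val (_, actionMs) = timed {
      output.select(sum(fakeUDF(struct(output.columns.map(col): _*)))).first()
    }
    val after = processed.sum

    println(f"plan only: $planMs%.1f ms, with action: $actionMs%.1f ms")
    (before, after)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("lazy-transform-demo")
      .getOrCreate()
    try {
      val (before, after) = run(spark, 1000L)
      println(s"rows processed before action: $before, after action: $after")
    } finally {
      spark.stop()
    }
  }
}
```

Timing only the first `timed` block would measure plan construction rather than the transform itself, which is the problem this commit addresses. The exact timings depend on the environment, so none are shown here.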
2018-07-08 16:09:24 -07:00
commit 30c50dddbb
parent 1798b12077
1 changed file with 10 additions and 1 deletion
--- a/src/main/scala/com/databricks/spark/sql/perf/mllib/BenchmarkAlgorithm.scala
+++ b/src/main/scala/com/databricks/spark/sql/perf/mllib/BenchmarkAlgorithm.scala
@@ -35,12 +35,21 @@ trait BenchmarkAlgorithm extends Logging {
  /**
   * The unnormalized score of the training procedure on a dataset. The normalization is
   * performed by the caller.
+   * This calls `count()` on the transformed data to attempt to materialize the result for
+   * recording timing metrics.
   */
  @throws[Exception]("if scoring fails")
  def score(
      ctx: MLBenchContext,
      testSet: DataFrame,
-      model: Transformer): MLMetric = MLMetric.Invalid
+      model: Transformer): MLMetric = {
+    val output = model.transform(testSet)
+    // We create a useless UDF to make sure the entire DataFrame is instantiated.
+    val fakeUDF = udf { (_: Any) => 0 }
+    val columns = testSet.columns
+    output.select(sum(fakeUDF(struct(columns.map(col) : _*)))).first()
+    MLMetric.Invalid
+  }

  def name: String = {
    this.getClass.getCanonicalName.replace("$", "")