[ML-3585] Added benchmarks to mllib-large.yaml for clustering (#149)

Adds clustering benchmarks to mllib-large.yaml: GaussianMixture, KMeans, and LDA. BisectingKMeans is currently missing from spark-sql-perf; this needs to be fixed in the follow-up JIRA (https://databricks.atlassian.net/browse/ML-3834). The parameters are based on the previous benchmarks for the Spark 2.2 QA.
2018-06-08 12:06:52 -07:00
commit 9ab2a8bb14
parent 62b173d779
1 changed file with 22 additions and 0 deletions
--- a/src/main/resources/com/databricks/spark/sql/perf/mllib/config/mllib-large.yaml
+++ b/src/main/resources/com/databricks/spark/sql/perf/mllib/config/mllib-large.yaml
@@ -37,6 +37,28 @@ benchmarks:
      numFeatures: 5000
      numClasses: 2
      smoothing: 1.0
+  - name: clustering.GaussianMixture
+    params:
+      numExamples: 100000
+      numTestExamples: 100000
+      numFeatures: 1000
+      k: 10
+      maxIter: 10
+      tol: 0.01
+  - name: clustering.KMeans
+    params:
+      k: 50
+      maxIter: 20
+      tol: 1e-3
+  - name: clustering.LDA
+    params:
+      docLength: 100
+      vocabSize: 5000
+      k: 60
+      maxIter: 20
+      optimizer:
+        - em
+        - online
  - name: recommendation.ALS
    params:
      numExamples: 50000000
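
Note on the list-valued `optimizer` parameter in the clustering.LDA entry: the sketch below shows one way such a list could expand into separate benchmark runs. It assumes the benchmark runner treats a list value as a set of alternatives and runs one configuration per combination; that behavior is an assumption and is not stated in this commit. The helper `expand_params` is hypothetical and is not part of spark-sql-perf.

```python
from itertools import product

# Parameters copied from the clustering.LDA entry in the diff above.
lda_params = {
    "docLength": 100,
    "vocabSize": 5000,
    "k": 60,
    "maxIter": 20,
    "optimizer": ["em", "online"],
}


def expand_params(params):
    """Hypothetical helper: return one dict per combination of list-valued parameters."""
    keys = list(params)
    # Wrap scalars in a one-element list so every value can be iterated uniformly.
    choices = [v if isinstance(v, list) else [v] for v in params.values()]
    return [dict(zip(keys, combo)) for combo in product(*choices)]


runs = expand_params(lda_params)
for run in runs:
    # Under the stated assumption, each line corresponds to one benchmark run.
    print(run["optimizer"], run["k"], run["maxIter"])
```

Under that assumption the LDA entry would produce two runs, one per optimizer, with all other parameters held fixed.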