Currently the dataset size is `numExamples * numFeatures * numInputCols`, which is much bigger than in other ML perf tests. This PR updates the implementation to slice vectors more efficiently.
Tested with mllib-big.yaml; 3 runs finished in under 2 minutes.
Benchmark for regression is added to mllib-large.yaml.
DecisionTreeRegression, GLMRegression, LinearRegression, and RandomForestRegression are added.
GBT, AFTSurvivalRegression, and IsotonicRegression are missing in spark-sql-perf.
Benchmark for clustering is added to mllib-large.yaml.
GaussianMixture, KMeans, and LDA are added. BisectingKMeans is currently missing in spark-sql-perf; it will be fixed in the follow-up JIRA: https://databricks.atlassian.net/browse/ML-3834
The parameters are based on the previous benchmarks for the Spark 2.2 QA.
Benchmark for ALS is added to mllib-large.yaml.
The parameters are based on the previous benchmarks for the Spark 2.2 QA. Stability has been tested under the same cluster setting as the other benchmarks (classification).
Change the function `MLParams.toMap` so that it does not output `Option` values in the map.
We will no longer get `Option` values in the params of the output result.
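A minimal sketch of the intended behavior, with invented field names (not the actual `MLParams` members): `Option` parameters are unwrapped when building the map, and `None` entries are dropped entirely.

```scala
// Sketch: flatten Option-valued parameters when building the params map,
// so None entries are dropped and Some(x) becomes the bare x.
// The field names here are illustrative, not the real MLParams fields.
case class Params(numExamples: Option[Long], numFeatures: Option[Int], seed: Long) {
  def toMap: Map[String, Any] = {
    val raw: Map[String, Any] = Map(
      "numExamples" -> numExamples,
      "numFeatures" -> numFeatures,
      "seed"        -> seed)
    raw.collect {
      case (k, Some(v))                         => k -> v  // unwrap Some
      case (k, v) if !v.isInstanceOf[Option[_]] => k -> v  // keep plain values
    }                                                      // None entries vanish
  }
}

val m = Params(numExamples = Some(100000L), numFeatures = None, seed = 42L).toMap
```

With this shape, downstream logging sees `numExamples -> 100000` and `seed -> 42`, and never a literal `Some(...)` or `None` string.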
* change the `Option`-wrapped numbers to direct numbers
* update based on the comments
* Add new case class "MLMetric" to save all different metrics
* Change "mlResult" in BenchmarkResult to Array[MLMetric]
* score function will return MLMetric
* Add mllib unit test to run mllib-small.yaml.
* Check results in unit tests and fail tests if failures are present.
* Rename to be MLLibSuite instead of MLLibTest.
* Better error message on failed benchmarks.
* Move mllib config file to resources.
* Add DecisionTreeClassification as first benchmark in mllib-large.yaml.
* Read config files as streams to be jar compatible.
* PR feedback #140.
Add tests for additional methods of some ML algorithms.
In this PR, I add `associationRules` in `FPGrowth` and `findSynonyms`.
After the design is accepted, I will add other methods.
Add an interface in `BenchmarkableAlgorithm`:
```scala
def testAdditionalMethods(ctx: MLBenchContext, model: Transformer): Map[String, () => _]
```
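A hypothetical sketch of how an algorithm could implement this interface: each extra model method is exposed as a named no-arg thunk, so the harness can invoke and time each one uniformly. The Spark `MLBenchContext` and `Transformer` types are stubbed out here, and `FPGrowthModelStub` is invented for illustration.

```scala
// FPGrowthModelStub stands in for the real Spark FPGrowthModel.
class FPGrowthModelStub {
  def associationRules: Seq[(String, String)] = Seq("bread" -> "butter")
}

// Return a map of named thunks, one per additional method under test.
def testAdditionalMethods(model: FPGrowthModelStub): Map[String, () => _] =
  Map("associationRules" -> (() => model.associationRules))

val methods = testAdditionalMethods(new FPGrowthModelStub)
val rules   = methods("associationRules")()  // the harness would time this call
```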
MinHashLSH and BucketedRandomProjectionLSH benchmark added.
Future questions:
* Whether we need to improve test data generation for MinHashLSH (and add more control params, such as the max/min number of elements in each input set)
* Whether we need to add benchmarks for approxNearestNeighbors and approxSimilarityJoin
In Spark 2.3 some default param values were moved from Models to the matching Estimators. I added explicit sets for these values in our tests to avoid errors. Also renamed ModelBuilder to ModelBuildersSSP to avoid a name conflict with dbml-local, which is included in Databricks Runtime.
Do not clean blocks after each run in the generic Benchmarkable trait.
It seems to have been there since #33, and an option `spark.databricks.benchmark.cleanBlocksAfter` to turn it off was added in #98, specifically so that parallel TPCDS streams do not wipe each other's blocks. But that option is quite hidden and obscure, and as a SparkContext config option it can only be set during cluster creation, so it is not friendly to use.
Cleaning up the blocks doesn't seem necessary for the Query Benchmarkables used for TPCDS and TPCH. Remove it from there, and leave it only for MLPipelineStageBenchmarkable.
Note:
Add an `ItemSetGenerator` class that uses the following algorithm:
1. Create P = `numItems` items (integers 0 to P-1).
2. Generate `numExample` rows, where each row (an itemset) is selected as follows:
   1. Choose the size of the itemset from a Poisson distribution.
   2. Generate `size - 2` items by choosing integers from a Poisson distribution, eliminating duplicates as needed.
   3. Add 2 new items in order to create actual association rules:
      1. For each itemset, pick the first item and compute a new item = `(firstItem + P / 2) % P`; add the new item to the set.
      2. For each itemset, pick the first 2 items (integers) and add them together (modulo P) to compute a new item to add to the set.
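The steps above can be sketched in Scala as follows. Parameter names, the Poisson means, and the helper structure are illustrative only; the real `ItemSetGenerator` lives in spark-sql-perf.

```scala
import scala.util.Random

// Knuth's method for Poisson sampling: multiply uniforms until the
// running product drops below e^-lambda.
def poisson(lambda: Double, rng: Random): Int = {
  val limit = math.exp(-lambda)
  var k = 0; var p = 1.0
  do { k += 1; p *= rng.nextDouble() } while (p > limit)
  k - 1
}

def generateItemSet(numItems: Int, avgSize: Double, rng: Random): Set[Int] = {
  val size = math.max(2, poisson(avgSize, rng))                   // step 2.1
  val base = Iterator
    .continually(poisson(numItems / 2.0, rng) % numItems)
    .take(size - 2).toSet                                         // step 2.2 (Set dedups)
  val sorted = base.toSeq.sorted
  // step 2.3: inject two items that create actual association rules
  val rule1 = sorted.headOption.map(first => (first + numItems / 2) % numItems)
  val rule2 = if (sorted.size >= 2) Some((sorted(0) + sorted(1)) % numItems)
              else None
  base ++ rule1 ++ rule2
}

val rng  = new Random(7)
val rows = Seq.fill(50)(generateItemSet(numItems = 20, avgSize = 6.0, rng))
```

Because step 2.3 derives its items deterministically from the first item(s), frequent patterns such as `{firstItem, (firstItem + P/2) % P}` appear with high confidence, giving FPGrowth real rules to mine.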
In #109, coalescing of non-partitioned tables into 1 file seems to have been accidentally removed.
Put it back, but only when `clusterByPartitionedColumns == true`.
Since we coalesce partitions only when that setting is true, it seems consistent to apply it to non-partitioned tables as well.
It may be better to rename the parameter, but that would change the interface, so it should probably be left for a future cleanup.
This PR follows up on #112, adding new performance tests for DecisionTreeRegression, RandomForestRegression, GMM, and HashingTF.
Summary of changes:
* Added new performance tests
* Updated configs in mllib-small.yaml
** Alphabetized configs
** Added new configs for: RandomForestRegression, DecisionTreeRegression, GMM, HashingTF
* Refactored TreeOrForestClassification into a trait (TreeOrForestEstimator) exposing methods for all tree/forest estimator performance tests.
** Copied code from DecisionTreeClassification.scala into TreeOrForestEstimator.scala
I tested this PR by running the performance tests specified in mllib-small.yaml
## What changes are proposed in this pull request?
Investigating OOMs during TPCDS data generation:
it turned out that the Scala standard library's ProcessBuilder.lineStream by default creates a LinkedBlockingQueue buffer with Integer.MAX_VALUE capacity.
This surfaced after we implemented a 10x improvement to dsdgen speed in https://github.com/databricks/tpcds-kit/pull/2:
spark-sql-perf no longer keeps up with ingesting data from dsdgen, and the buffer causes OOMs.
Pulled out pieces of ProcessBuilderImpl and ProcessImpl just to create a LinkedBlockingQueue with maxQueueSize=65536 instead.
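The essence of the fix, sketched with a plain bounded producer/consumer (the real patch uses capacity 65536; a tiny capacity here just makes the throttling visible):

```scala
import java.util.concurrent.LinkedBlockingQueue

// With a bounded capacity, put() blocks once the queue is full, so a fast
// producer (dsdgen in the real setup) is throttled instead of filling an
// effectively unbounded buffer (the Integer.MAX_VALUE default behind the OOMs).
val queue = new LinkedBlockingQueue[String](4)

val producer = new Thread(() => {
  for (i <- 1 to 20) queue.put(s"line-$i")  // blocks whenever 4 items are pending
  queue.put("EOF")                          // sentinel to stop the consumer
})
producer.start()

// Consumer drains the queue on the main thread.
val consumed = Iterator.continually(queue.take()).takeWhile(_ != "EOF").toVector
producer.join()
```

Nothing is lost: the producer simply waits in `put()` until the consumer frees a slot, which is exactly the backpressure dsdgen now gets.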
Also submitted https://github.com/scala/scala/pull/6052
## How was this patch tested?
- ssh'd onto the worker and verified that dsdgen is now being throttled and Java memory doesn't explode.
- tested that TPCDS SF100 generated correctly.
A case class (MLParams) is currently used to store/access parameters for ML tests in spark-sql-perf. With the addition of new ML tests to spark-sql-perf (in this PR: #112), the number of ML-related test params will be > 22, but Scala only allows up to 22 params in a case class.
This PR addresses the issue by:
* Introducing a new MLParameters class (class MLParameters) that provides access to the same parameters as MLParams, except as a class instead of a case class.
* Replacing usages of MLParams with MLParameters
* Storing the members of MLParameters in BenchmarkResult.parameters for logging/persistence.
Tested by running default performance tests in src/main/scala/configs/mllib-small.yaml.
* Made small updates in Benchmark.scala and Query.scala for Spark 2.2
* Added tests for NaiveBayesModel and Bucketizer
* Changed BenchmarkAlgorithm.getEstimator() -> BenchmarkAlgorithm.getPipelineStage() to allow for the benchmarking of Estimators and Transformers instead of just Estimators
Commits:
* Changes made so that spark-sql-perf compiles with Spark 2.2
* Updates for running ML tests from the command line + added Naive Bayes test
* Add Bucketizer test as example of Featurizer test; change getEstimator() to getPipelineStage() in
BenchmarkAlgorithm to allow for testing of transformers in addition to estimators.
* Add comment for main method in MLlib.scala
* Rename MLTransformerBenchmarkable --> MLPipelineStageBenchmarkable, fix issue with NaiveBayes param
* Add UnaryTransformer trait for common data/methods to be shared across all objects
testing featurizers that operate on a single column (StringIndexer, OneHotEncoder, Bucketizer, HashingTF, etc)
* Respond to review comments:
* bin/run-ml: Add newline at EOF
* Query.scala: organized imports
* MLlib.scala: organized imports, fixed SparkContext initialization
* NaiveBayes.scala: removed unused temp val, improved probability calculation in trueModel()
* Bucketizer.scala: use DataGenerator.generateContinuousFeatures instead of generating data on the driver
* Fix bug in Bucketizer.scala
* Precompute log of sum of unnormalized probabilities in NaiveBayes.scala, add NaiveBayes and Bucketizer tests to mllib-small.yaml
* Update Query.scala to use p() to access SparkPlans under a given SparkPlan
* Update README to indicate that spark-sql-perf only works with Spark 2.2+ after this PR
## What changes were proposed in this pull request?
Since SPARK-20690 and SPARK-20916, Spark requires all subqueries in the FROM clause to have an alias.
## How was this patch tested?
Tested on SF1.
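For illustration, a hypothetical query (not an actual TPCDS query) showing the now-required trailing alias `s` on a FROM-clause subquery, written as the kind of string the benchmark would hand to `spark.sql`:

```scala
// Without the trailing `s` after the parenthesized subquery, Spark since
// SPARK-20690/SPARK-20916 rejects the query with an analysis error.
val query = """
  SELECT s.item_id, s.total
  FROM (SELECT item_id, SUM(price) AS total
        FROM sales
        GROUP BY item_id) s
  WHERE s.total > 100
"""
```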
Data generation:
* Add an option to change Dates to Strings, and specify it in the Tables object constructor.
* Add discovering partitions to createExternalTables
* Add analyzeTables function that gathers statistics.
Benchmark execution:
* Perform collect() on the DataFrame so that it is recorded by the SQL tab of the Spark UI.