186 Commits

Author SHA1 Message Date
Juliusz Sompolski
91604a3ab0 Update README to specify that TPCDS kit needs to be installed on all nodes. 2018-02-27 12:06:12 +01:00
Juliusz Sompolski
31f34beee5 Update README to do sql("use database") (#123) 2017-11-07 20:38:26 +01:00
Juliusz Sompolski
7bf2d45b0f Don't clean blocks after every run in Benchmarkable (#119)
Do not clean blocks after each run in the generic Benchmarkable trait.
The cleanup seems to have been there since #33. An option to turn it off, spark.databricks.benchmark.cleanBlocksAfter, was added in #98, specifically so that parallel TPCDS streams would not wipe each other's blocks. But that option is well hidden, and as a SparkContext config option it can only be set during cluster creation, so it is not friendly to use.

Cleaning up the blocks doesn't seem necessary for the Query Benchmarkables used for TPCDS and TPCH. Remove it from there, and leave it only for MLPipelineStageBenchmarkable.
2017-09-18 11:51:12 +02:00
Juliusz Sompolski
fdd0e38717 TPCDS notebooks in source, not binary format (#121) 2017-09-13 14:57:59 +02:00
Nico Poggi
006f096562 Merge pull request #120 from juliuszsompolski/tpcds_notebooks
Add example notebooks for running TPCDS and update readme
2017-09-12 17:22:38 +02:00
Juliusz Sompolski
5ebb9cfb12 add some more comments 2017-09-12 16:51:26 +02:00
Juliusz Sompolski
c78f2b3a9b update readme 2017-09-12 16:40:23 +02:00
Juliusz Sompolski
ae8bcdb292 add notebooks 2017-09-12 15:43:08 +02:00
WeichenXu
f08bf31d18 add benchmark for FPGrowth (#113)
Note: 
Add an `ItemSetGenerator` class that uses the following algorithm:

1. Create P=`numItems` items (integers 0 to P-1)
2. Generate `numExample` rows, where each row (an itemset) is selected as follows:
  2.1 Choose the size of the itemset from a Poisson distribution
  2.2 Generate `size - 2` items by choosing integers from a Poisson distribution. Eliminate duplicates as needed.
  2.3 Add 2 new items in order to create actual association rules.
    2.3.1 For each itemset, pick the first item, and compute a new item = (firstItem + P / 2) % P, add new item to the set.
    2.3.2 For each itemset, pick the first 2 items (integers) and add them together (modulo P) to compute a new item to add to the set.
2017-09-04 10:48:05 -07:00
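
The generation steps listed in f08bf31d18 can be sketched in plain Scala as below. This is only an illustration, not the `ItemSetGenerator` from the repository: the object name, the `avgSize` parameter, the Poisson means, the minimum itemset size of 3, and the fallback when de-duplication leaves a single item are all assumptions made here so the example is self-contained.

```scala
import scala.util.Random

object ItemSetSketch {
  // Knuth's method for sampling a Poisson-distributed integer; adequate for small means.
  def poisson(mean: Double, rng: Random): Int = {
    val limit = math.exp(-mean)
    var k = 0
    var p = rng.nextDouble()
    while (p > limit) {
      k += 1
      p *= rng.nextDouble()
    }
    k
  }

  // Builds one itemset over the items 0 until numItems (step 2 of the commit message).
  def generateItemSet(numItems: Int, avgSize: Double, rng: Random): Seq[Int] = {
    // 2.1: choose the itemset size; forced to at least 3 (assumption) so size - 2 >= 1.
    val size = math.max(3, poisson(avgSize, rng))
    // 2.2: draw size - 2 items from a Poisson distribution and drop duplicates.
    val base = Seq.fill(size - 2)(poisson(numItems / 2.0, rng) % numItems).distinct
    val first = base.head
    // 2.3.1: new item derived from the first item.
    val shifted = (first + numItems / 2) % numItems
    // 2.3.2: new item derived from the first two items; if only one item
    // survived de-duplication, the first item is used twice (assumption).
    val second = if (base.length > 1) base(1) else first
    val summed = (first + second) % numItems
    (base ++ Seq(shifted, summed)).distinct
  }

  // Generates numExamples itemsets from a fixed seed so runs are repeatable.
  def generate(numItems: Int, numExamples: Int, avgSize: Double, seed: Long): Seq[Seq[Int]] = {
    val rng = new Random(seed)
    Seq.fill(numExamples)(generateItemSet(numItems, avgSize, rng))
  }
}
```

Because every itemset contains both its first item and `(firstItem + P / 2) % P`, that pair co-occurs in every row, which is what gives FPGrowth real association rules to find.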
Juliusz Sompolski
bcda8fc1e5 Coalesce non-partitioned tables. (#118)
In #109, coalescing of non-partitioned tables into 1 file seems to have been accidentally removed.
Put it back, but only when clusterByPartitionedColumns == true.
Since we coalesce partitions only when that setting is true, it seems consistent to use it for non-partitioned tables too.

It may be better to change the name of the parameter, but that changes the interface, and possibly should be left for some future clean up.
2017-09-04 18:05:42 +02:00
Siddharth Murching
3e1bbd00ed [ML-2847] Add new tests for (DecisionTree, RandomForest)Regression, GMM, HashingTF (#116)
This PR follows up on #112, adding new performance tests for DecisionTreeRegression, RandomForestRegression, GMM, and HashingTF.

Summary of changes:
* Added new performance tests
* Updated configs in mllib-small.yaml
** Alphabetized configs
** Added new configs for: RandomForestRegression, DecisionTreeRegression, GMM, HashingTF
* Refactored TreeOrForestClassification into a trait (TreeOrForestEstimator) exposing methods for all tree/forest estimator performance tests.
** Copied code from DecisionTreeClassification.scala into TreeOrForestEstimator.scala

I tested this PR by running the performance tests specified in mllib-small.yaml
2017-09-03 22:26:20 -07:00
WeichenXu
19c41464c7 fix df.drop in VectorAssembler (#117)
fix df.drop in VectorAssembler to return correct DataFrame
2017-09-01 13:51:05 -07:00
WeichenXu
6ec83fd0f7 Add benchmark for LinearSVC/OneHotEncoder/VectorSlicer/VectorAssembler/StringIndexer/Tokenizer (#112)
Add benchmark for:

LinearSVC
OneHotEncoder
VectorSlicer
VectorAssembler
StringIndexer
Tokenizer
2017-08-31 13:56:43 -07:00
Juliusz Sompolski
737a1bc355 BlockingLineStream (#115)
## What changes are proposed in this pull request?

While investigating OOMs during TPCDS data generation, it turned out that the Scala standard library's ProcessBuilder.lineStream by default creates a LinkedBlockingQueue buffer of Integer.MAX_VALUE capacity.
This surfaced after we implemented 10x improvements to dsdgen speed in https://github.com/databricks/tpcds-kit/pull/2.
Now spark-sql-perf does not keep up with ingesting data from dsdgen, and the effectively unbounded buffer causes OOMs.

Pulled out pieces of ProcessBuilderImpl and ProcessImpl just to create a LinkedBlockingQueue with maxQueueSize=65536 instead.

Also submitted https://github.com/scala/scala/pull/6052

## How was this patch tested?

- ssh'ed into the worker and saw that dsdgen is now being throttled and Java memory doesn't explode.
- tested that TPCDS SF100 generated correctly.
2017-08-31 15:16:22 +02:00
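
The backpressure idea behind 737a1bc355 can be illustrated with a small, self-contained sketch. It is not the code from the patch (which pulled out pieces of ProcessBuilderImpl and ProcessImpl); the object name, the `pump` helper, and the end-of-stream sentinel are hypothetical. It only shows that a `LinkedBlockingQueue` with a fixed capacity makes a fast producer block in `put()` instead of growing the buffer without bound.

```scala
import java.util.concurrent.LinkedBlockingQueue

object BoundedLineBuffer {
  // Marks the end of the stream; compared by reference, so no real line can collide with it.
  private val EndOfStream = new String("EOS")

  // Pumps `lines` through a queue holding at most `maxQueueSize` entries and returns
  // the consumed lines together with the largest queue size the consumer observed.
  def pump(lines: Iterator[String], maxQueueSize: Int): (Vector[String], Int) = {
    val queue = new LinkedBlockingQueue[String](maxQueueSize)
    val producer = new Thread(new Runnable {
      def run(): Unit = {
        // put() blocks while the queue is full, which throttles the producer.
        lines.foreach(line => queue.put(line))
        queue.put(EndOfStream)
      }
    })
    producer.start()

    val consumed = Vector.newBuilder[String]
    var maxSeen = 0
    var done = false
    while (!done) {
      maxSeen = math.max(maxSeen, queue.size())
      val line = queue.take()
      if (line eq EndOfStream) done = true else consumed += line
    }
    producer.join()
    (consumed.result(), maxSeen)
  }
}
```

For example, `BoundedLineBuffer.pump(Iterator.tabulate(1000)(i => s"row-$i"), 8)` returns all 1000 lines in order while never holding more than 8 of them in the queue. The commit message uses a capacity of 65536 for the same purpose.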
Siddharth Murching
9febc34f66 Refactor MLParams for spark-sql-perf (#114)
A case class (MLParams) is currently used to store/access parameters for ML tests in spark-sql-perf. With the addition of new ML tests to spark-sql-perf (in this PR: #112), the number of ML-related test params will be > 22, but Scala only allows up to 22 params in a case class.

This PR addresses the issue by:
* Introducing a new MLParameters class (class MLParameters) that provides access to the same parameters as MLParams, except as a class instead of a case class.
* Replacing usages of MLParams with MLParameters
* Storing the members of MLParameters in BenchmarkResult.parameters for logging/persistence.

Tested by running default performance tests in src/main/scala/configs/mllib-small.yaml.
2017-08-28 13:23:59 -07:00
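
A minimal sketch of the approach described in 9febc34f66: a plain class with optional fields is not subject to the case-class parameter limit the commit message mentions, and its members can still be flattened into a map for logging. The class name `MLParametersSketch`, the three fields, and the `toMap` helper are illustrative assumptions, not the actual `MLParameters` definition.

```scala
// Hypothetical stand-in for the MLParameters class described above; the field
// names are illustrative and not taken from the repository.
class MLParametersSketch(
    val numExamples: Option[Long] = None,
    val numFeatures: Option[Int] = None,
    val regParam: Option[Double] = None) {

  // Flattens the parameters that are set into name -> string pairs for logging.
  def toMap: Map[String, String] =
    Seq(
      "numExamples" -> numExamples,
      "numFeatures" -> numFeatures,
      "regParam" -> regParam
    ).collect { case (name, Some(value)) => name -> value.toString }.toMap
}
```

With this shape, `new MLParametersSketch(numExamples = Some(100L)).toMap` contains only the `numExamples` entry, so unset parameters do not clutter the persisted result.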
Siddharth Murching
d0de5ae8aa Update tests to run with Spark 2.2, add NaiveBayes & Bucketizer ML tests (#110)
* Made small updates in Benchmark.scala and Query.scala for Spark 2.2
* Added tests for NaiveBayesModel and Bucketizer
* Changed BenchmarkAlgorithm.getEstimator() -> BenchmarkAlgorithm.getPipelineStage() to allow for the benchmarking of Estimators and Transformers instead of just Estimators

Commits:
* Changes made so that spark-sql-perf compiles with Spark 2.2

* Updates for running ML tests from the command line + added Naive Bayes test

* Add Bucketizer test as example of Featurizer test; change getEstimator() to getPipelineStage() in BenchmarkAlgorithm to allow for testing of transformers in addition to estimators.

* Add comment for main method in MLlib.scala

* Rename MLTransformerBenchmarkable --> MLPipelineStageBenchmarkable, fix issue with NaiveBayes param

* Add UnaryTransformer trait for common data/methods to be shared across all objects testing featurizers that operate on a single column (StringIndexer, OneHotEncoder, Bucketizer, HashingTF, etc)

* Respond to review comments:

* bin/run-ml: Add newline at EOF
* Query.scala: organized imports
* MLlib.scala: organized imports, fixed SparkContext initialization
* NaiveBayes.scala: removed unused temp val, improved probability calculation in trueModel()
* Bucketizer.scala: use DataGenerator.generateContinuousFeatures instead of generating data on the driver

* Fix bug in Bucketizer.scala

* Precompute log of sum of unnormalized probabilities in NaiveBayes.scala, add NaiveBayes and Bucketizer tests to mllib-small.yaml

* Update Query.scala to use p() to access SparkPlans under a given SparkPlan

* Update README to indicate that spark-sql-perf only works with Spark 2.2+ after this PR
2017-08-21 15:07:46 -07:00
Yin Huai
b3a6ed79b3 Start the development 0.5.0-SNAPSHOT 2017-08-21 14:21:19 -07:00
Bogdan Raducanu
4e7a2363b9 Support for TPC-H benchmark
Refactored TPC-DS code to be able to reuse it for TPC-H.
Added TPC-H query texts adapted for Spark.
2017-08-09 12:26:32 +02:00
Kevin
fdcde7595c Update README (#107)
Little update for the README
2017-07-13 10:45:24 +02:00
Juliusz Sompolski
6488d74d23 tpcds_2_4: Add alias names to subqueries in FROM clause.
## What changes were proposed in this pull request?

Since SPARK-20690 and SPARK-20916, Spark requires all subqueries in the FROM clause to have an alias name.

## How was this patch tested?

Tested on SF1.
2017-06-29 16:04:08 +02:00
Juliusz Sompolski
bff6b34f62 Tweaks and improvements (#106)
Data generation:
* Add an option to change Dates to Strings, and specify it in Tables object creator.
* Add discovering partitions to createExternalTables
* Add analyzeTables function that gathers statistics.

Benchmark execution:
* Perform collect() on the DataFrame, so that it is recorded by the SQL tab of the Spark UI.
2017-06-13 11:42:14 +02:00
Juliusz Sompolski
75f3876e59 Merge pull request #103 from juliuszsompolski/fixtypes
Correct types of keys in schema
2017-05-26 11:53:19 +02:00
Juliusz Sompolski
2ddd521ab5 ok, make it long only where really needed. 2017-05-26 10:36:40 +02:00
Juliusz Sompolski
1bca964a3d Correct types of keys 2017-05-25 17:12:47 +02:00
Volodymyr Lyubinets
beec62844d Merge pull request #101 from vlyubin/master
Add tpcds 2.4 queries
2017-05-16 10:35:35 +02:00
vlyubin
c0bd21c2ec Add ss_max 2017-05-16 10:29:00 +02:00
vlyubin
e5dc6f338f Updated queries 23 2017-05-15 17:30:20 +02:00
vlyubin
e8f85b0b0e Moved queries into a separate folder 2017-05-15 14:22:37 +02:00
vlyubin
96bf10bffc Add tpcds 2.4 queries 2017-05-12 11:54:32 +02:00
Eric Liang
c12b14b013 Merge pull request #98 from databricks/parallel-runs
Add option to avoid cleaning after each run, to enable parallel runs
2017-03-15 13:50:41 -07:00
Eric Liang
64728c7cff Add option to avoid cleaning after each run, to enable parallel runs 2017-03-14 19:45:27 -07:00
Timothy Hunter
53091a1935 Removes labels from tree data generation (#82)
* changes

* removes labels

* reset scala version

* adding metadata

* bumping spark release
2016-12-13 16:47:31 -08:00
srinathshankar
685c50d9dc Cross build with Scala 2.11 (#91)
* Cross build with Scala 2.11

* Update snapshot version
2016-10-03 17:01:17 -07:00
srinathshankar
0eaa4b1d57 [SC-4409] Correct query 41 in TPCDS kit (#90) 2016-09-30 18:02:39 -07:00
Josh Rosen
c2224f37e5 Depend on non-snapshot Spark now that 2.0.0 is released
Now that Spark 2.0.0 is released, we need to update the build to use a released version instead of the snapshot (which is no longer available).

Fixes #84.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #85 from JoshRosen/fix-spark-dep.
2016-08-17 17:53:30 -07:00
Timothy Hunter
948c8369e7 Fixes issues with scala 2.11
Updates the usual scala-logging issues to make the source code cross-compilable between scala 2.10 and scala 2.11.

Tests:
A scala 2.11 version of the code has been run against the official Spark 2.0.0 RC4 binary release (Scala 2.11)
A scala 2.10 version has been run against the official Spark 1.6.2 release

Author: Timothy Hunter <timhunter@databricks.com>

Closes #81 from thunterdb/1607-scala211.
2016-07-19 11:19:52 -07:00
Timothy Hunter
8830bffd46 Merge pull request #79 from jkbradley/tree-test-fix
Fixed tree, forest, GBT tests by adding metadata to DataFrames
2016-07-11 10:42:19 -07:00
Joseph K. Bradley
51469a34d6 Fixed tree, forest, GBT tests by adding metadata to DataFrames 2016-07-11 10:33:19 -07:00
Timothy Hunter
1fcc366cec Merge pull request #78 from thunterdb/1607-fixes
Adding parameters in case of failures
2016-07-06 11:34:05 -07:00
Timothy Hunter
c7d42d3626 adding parameters 2016-07-06 11:23:07 -07:00
Timothy Hunter
2672bcd5b7 ALS algorithm for spark-sql-perf
This has been tested locally with a small amount of data.

I have not bothered to reimplement a more robust version of the ALS synthetic data generation, so it will still require some manual parameter tweaking as before.

Author: Timothy Hunter <timhunter@databricks.com>

Closes #76 from thunterdb/1607-als.
2016-07-05 15:54:08 -07:00
Timothy Hunter
93c0407bbe Merge pull request #77 from thunterdb/1607-linear
Linear regression
2016-07-05 15:41:35 -07:00
Timothy Hunter
40e97ca3c0 comment 2016-07-05 15:01:50 -07:00
Timothy Hunter
ce7e20ae6d set the solver 2016-07-05 13:46:19 -07:00
Timothy Hunter
def20479a1 linear regression 2016-07-05 13:42:56 -07:00
Timothy Hunter
979ebd5d0f Merge pull request #75 from jkbradley/kmeans
Added kmeans test
2016-07-05 10:14:11 -07:00
Joseph K. Bradley
9d11a601c3 added kmeans test 2016-07-01 18:00:49 -07:00
jkbradley
3d3443791c Merge pull request #74 from jkbradley/dt-tests
Decision tree, random forest, GBT classification perf tests
2016-07-01 17:40:16 -07:00
Joseph K. Bradley
495e2716c4 updated per code review. works in local tests 2016-07-01 17:39:28 -07:00
jkbradley
c2f0a35db4 Merge pull request #1 from thunterdb/1606-trees
adding experiments to the yaml file
2016-07-01 11:46:41 -07:00