23 Commits

Author SHA1 Message Date
Nico Poggi
6136ecea6e
TPC-H datagenerator and instructions (#136)
* Adding basic partitioning to TPCH tables following VectorH paper as baseline
* Multi datagen (TPC-H and DS) and multi scale factor notebook/script.
Generates all the selected scale factors and benchmarks in one run.
* TPCH runner notebook or script for spark-shell
* Adding basic TPCH documentation
2018-09-10 23:18:33 +02:00
Juliusz Sompolski
91604a3ab0 Update README to specify that TPCDS kit needs to be installed on all nodes. 2018-02-27 12:06:12 +01:00
Juliusz Sompolski
31f34beee5
Update README to do sql("use database") (#123) 2017-11-07 20:38:26 +01:00
Juliusz Sompolski
5ebb9cfb12 add some more comments 2017-09-12 16:51:26 +02:00
Juliusz Sompolski
c78f2b3a9b update readme 2017-09-12 16:40:23 +02:00
Siddharth Murching
d0de5ae8aa Update tests to run with Spark 2.2, add NaiveBayes & Bucketizer ML tests (#110)
* Made small updates in Benchmark.scala and Query.scala for Spark 2.2
* Added tests for NaiveBayesModel and Bucketizer
* Changed BenchmarkAlgorithm.getEstimator() -> BenchmarkAlgorithm.getPipelineStage() to allow for the benchmarking of Estimators and Transformers instead of just Estimators

Commits:
* Changes made so that spark-sql-perf compiles with Spark 2.2

* Updates for running ML tests from the command line + added Naive Bayes test

* Add Bucketizer test as example of Featurizer test; change getEstimator() to getPipelineStage() in
BenchmarkAlgorithm to allow for testing of transformers in addition to estimators.

* Add comment for main method in MLlib.scala

* Rename MLTransformerBenchmarkable --> MLPipelineStageBenchmarkable, fix issue with NaiveBayes param

* Add UnaryTransformer trait for common data/methods to be shared across all objects
testing featurizers that operate on a single column (StringIndexer, OneHotEncoder, Bucketizer, HashingTF, etc)

* Respond to review comments:

  * bin/run-ml: Add newline at EOF
  * Query.scala: organized imports
  * MLlib.scala: organized imports, fixed SparkContext initialization
  * NaiveBayes.scala: removed unused temp val, improved probability calculation in trueModel()
  * Bucketizer.scala: use DataGenerator.generateContinuousFeatures instead of generating data on the driver

* Fix bug in Bucketizer.scala

* Precompute log of sum of unnormalized probabilities in NaiveBayes.scala, add NaiveBayes and Bucketizer tests to mllib-small.yaml

* Update Query.scala to use p() to access SparkPlans under a given SparkPlan

* Update README to indicate that spark-sql-perf only works with Spark 2.2+ after this PR
2017-08-21 15:07:46 -07:00
Kevin
fdcde7595c Update README (#107)
Little update for the README
2017-07-13 10:45:24 +02:00
Michael Armbrust
663ca7560e Main Class for running Benchmarks from the command line
This PR adds the ability to run performance tests locally as a standalone program that reports the results to the console:

```
$ bin/run --help
spark-sql-perf 0.2.0
Usage: spark-sql-perf [options]

  -b <value> | --benchmark <value>
        the name of the benchmark to run
  -f <value> | --filter <value>
        a filter on the name of the queries to run
  -i <value> | --iterations <value>
        the number of iterations to run
  --help
        prints this usage text

$ bin/run --benchmark DatasetPerformance
```

Author: Michael Armbrust <michael@databricks.com>

Closes #47 from marmbrus/MainClass.
2016-01-19 12:37:51 -08:00
Davies Liu
cec648ac0f try to run all TPCDS queries in benchmark (even can't be parsed) 2016-01-08 15:03:44 -08:00
Nong Li
1aa5bfc838 Add remaining tpcds tables.
Author: Nong Li <nongli@gmail.com>

Closes #34 from nongli/tpcds.
2015-11-19 13:50:00 -08:00
Cheng Lian
50808c436b Fixes typos in README.md
Author: Cheng Lian <lian@databricks.com>

Closes #25 from liancheng/readme-fix.
2015-11-11 12:05:44 -08:00
Michael Armbrust
ddeead18ce Add compilation testing with travis
There are no tests yet... but this at least tests compilation.

Author: Michael Armbrust <michael@databricks.com>

Closes #15 from marmbrus/travis.
2015-09-09 21:36:26 -07:00
Yin Huai
34f66a0a10 Add a option of filter rows with null partition column values. 2015-08-26 11:14:19 -07:00
Yin Huai
06eb11f326 Fix the seed to 100 and use distribute by instead of order by. 2015-08-25 20:44:14 -07:00
Yin Huai
9936d49239 Add a option to orderBy partition columns. 2015-08-25 20:44:14 -07:00
Yin Huai
58188c6711 Allow users to use double instead of decimal for generated tables. 2015-08-25 20:44:14 -07:00
Yin Huai
88aadb45a4 Update README. 2015-08-25 20:44:14 -07:00
Yin Huai
97093a45cd Update readme and register temp tables. 2015-08-25 20:44:13 -07:00
Michael Armbrust
a239da90a2 more cleanup, update readme 2015-08-11 15:51:34 -07:00
Yin Huai
fb9939b136 includeBreakdown is a parameter of runExperiment. 2015-04-20 10:03:41 -07:00
Yin Huai
6c5657b609 Refactoring and doc. 2015-04-16 18:10:57 -07:00
Yin Huai
930751810e Initial port. 2015-04-15 20:03:14 -07:00
Yin Huai
e81669ab3b Initial commit. 2015-04-15 20:02:32 -07:00