
61 Commits

Author SHA1 Message Date
Josh Rosen
42a415e8d4 Extract Query class from Benchmark into its own top-level class and make SparkContext field transient
This patch extracts `Query` into its own top-level class and makes its `sparkContext` field transient in order to fix `NotSerializableException`s.

Author: Josh Rosen <rosenville@gmail.com>

Closes #53 from JoshRosen/make-query-into-top-level-class.
2016-02-22 18:23:06 -08:00
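The fix in 42a415e8d4 can be illustrated with a minimal sketch. It assumes one common cause of `NotSerializableException`: a nested class carries a hidden reference to its enclosing instance, so serializing it pulls in the outer object and any context it holds. `FakeContext`, `OuterBenchmark`, `NestedQuery`, `TopLevelQuery`, and `SerializationCheck` are hypothetical names, not the library's classes, and the actual cause in spark-sql-perf may differ.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-in for a non-serializable handle such as SparkContext.
class FakeContext(val appName: String)

// Nested version: NestedQuery uses the enclosing instance, so it keeps a
// reference to OuterBenchmark, which is not Serializable.
class OuterBenchmark(val context: FakeContext) {
  class NestedQuery(val name: String) extends Serializable {
    def appName: String = context.appName
  }
}

// Top-level version: no enclosing instance, and the context field is
// skipped during serialization because it is marked @transient.
class TopLevelQuery(val name: String, @transient val context: FakeContext)
  extends Serializable

object SerializationCheck {
  // Returns true if Java serialization of `value` succeeds.
  def canSerialize(value: AnyRef): Boolean = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try {
      out.writeObject(value)
      true
    } catch {
      case _: NotSerializableException => false
    } finally {
      out.close()
    }
  }
}
```

Note that a `@transient` field comes back as `null` after deserialization, so code running on the receiving side must not rely on it.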
Josh Rosen
7e38b77c50 Update to compile against Spark 2.0.0-SNAPSHOT and bump version to 0.4.0-SNAPSHOT
Author: Josh Rosen <rosenville@gmail.com>

Closes #51 from JoshRosen/spark-2.0.0.
2016-02-19 13:02:29 -08:00
Josh Rosen
685ed9e488 Add TPCDS(sqlContext) constructor for backwards-compatibility
This patch adds additional constructors to `TPCDS` to maintain backwards-compatibility with code which calls `new TPCDS(anExistingSqlContext)`. This constructor was removed in #47.

The motivation for backwards-compatibility here is to simplify the gradual roll-out of an updated spark-sql-perf library to some existing jobs which share the same notebook.

Author: Josh Rosen <rosenville@gmail.com>

Closes #52 from JoshRosen/backwards-compatible-tpcds-constructor.
2016-02-19 13:01:23 -08:00
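A rough sketch of the backwards-compatibility approach described in 685ed9e488, using Scala auxiliary constructors. The real `TPCDS` signature is not shown in this log, so `TpcdsLike`, `FakeSqlContext`, the `resultsLocation` parameter, and the default path are all hypothetical.

```scala
// Hypothetical stand-in for SQLContext so the sketch compiles without Spark.
class FakeSqlContext(val name: String)

object FakeSqlContext {
  // Hypothetical shared instance, standing in for however the library
  // obtains a context when the caller does not pass one.
  lazy val default: FakeSqlContext = new FakeSqlContext("default")
}

// Hypothetical benchmark class with a primary constructor taking both values.
class TpcdsLike(val resultsLocation: String, val sqlContext: FakeSqlContext) {
  // Newer call style: the context is looked up rather than passed in.
  def this(resultsLocation: String) =
    this(resultsLocation, FakeSqlContext.default)

  // Older call style kept so that `new TpcdsLike(anExistingContext)` still compiles.
  def this(sqlContext: FakeSqlContext) =
    this("/tmp/results", sqlContext)
}
```

Keeping the old constructor as a thin overload lets existing callers upgrade the library without changing their code at the same time.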
Michael Armbrust
9d3347e949 Improvements to running the benchmark
- Scripts for running the benchmark either while working on spark-sql-perf (bin/run) or while working on Spark (bin/spark-perf). The latter uses Spark's sbt build to compile Spark and downloads the most recent published version of spark-sql-perf.
- Adds a `--compare` option that can be used to compare the results with a baseline run.

Author: Michael Armbrust <michael@databricks.com>

Closes #49 from marmbrus/runner.
2016-01-24 20:24:54 -08:00
Michael Armbrust
663ca7560e Main Class for running Benchmarks from the command line
This PR adds the ability to run performance tests locally as a standalone program that reports the results to the console:

```
$ bin/run --help
spark-sql-perf 0.2.0
Usage: spark-sql-perf [options]

  -b <value> | --benchmark <value>
        the name of the benchmark to run
  -f <value> | --filter <value>
        a filter on the name of the queries to run
  -i <value> | --iterations <value>
        the number of iterations to run
  --help
        prints this usage text

$ bin/run --benchmark DatasetPerformance
```

Author: Michael Armbrust <michael@databricks.com>

Closes #47 from marmbrus/MainClass.
2016-01-19 12:37:51 -08:00
Davies Liu
cec648ac0f Try to run all TPCDS queries in the benchmark (even those that can't be parsed) 2016-01-08 15:03:44 -08:00
Davies Liu
3105219fb0 Merge commit '11d1f9dd7237ea2a09ecfa61f09d7623ad52fd47' 2016-01-08 11:29:07 -08:00
Davies Liu
11d1f9dd72 update some queries:
" -> `
   fill some values
2016-01-08 11:27:50 -08:00
Michael Armbrust
9269f8f594 Capture BuildInfo when available
Author: Michael Armbrust <michael@databricks.com>

Closes #45 from marmbrus/buildInfo.
2015-12-23 11:03:06 -08:00
Michael Armbrust
f8aa93d968 Initial set of tests for Datasets
Author: Michael Armbrust <michael@databricks.com>

Closes #42 from marmbrus/dataset-tests.
2015-12-08 16:04:42 -08:00
Michael Armbrust
0aa2569a18 Write only one file per run
Author: Michael Armbrust <michael@databricks.com>

Closes #35 from marmbrus/oneResultFile.
2015-12-08 15:46:20 -08:00
Yin Huai
3af656defa Make ExecutionMode.HashResults handle null value
In Spark 1.6, `getLong` throws an exception if the value is null; before 1.6, it returned 0. With this PR, we check whether the result is null and, if it is, return null instead of 0.

Author: Yin Huai <yhuai@databricks.com>

Closes #41 from yhuai/fixSumHash.
2015-12-08 15:28:48 -08:00
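The null check described in 3af656defa might look roughly like the sketch below. `FakeRow` is a hypothetical stand-in that mimics the `isNullAt` and `getLong` accessors of Spark's `Row` so the example runs without Spark, and `HashResult.read` is an illustrative helper, not the project's code. The sketch returns an `Option` where the commit message says null is returned.

```scala
// Hypothetical one-column row; getLong throws on null, as the commit
// message describes for Spark 1.6.
final case class FakeRow(value: Option[Long]) {
  def isNullAt(i: Int): Boolean = value.isEmpty
  def getLong(i: Int): Long =
    value.getOrElse(throw new NullPointerException(s"Value at index $i is null"))
}

object HashResult {
  // Check for null before calling getLong, so a null result is reported
  // as missing instead of throwing or being mistaken for 0.
  def read(row: FakeRow): Option[Long] =
    if (row.isNullAt(0)) None else Some(row.getLong(0))
}
```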
Nong Li
43c2f23bb9 Fixes for Q34 and Q73 to return results deterministically.
Author: Nong Li <nong@databricks.com>

Closes #38 from nongli/tpcds.
2015-11-25 15:03:33 -08:00
Nong
70e0dbe656 Add official TPCDS 1.4 queries.
Author: Nong <nong@cloudera.com>

Closes #36 from nongli/tpcds.
2015-11-24 13:12:46 -08:00
Nong Li
1aa5bfc838 Add remaining tpcds tables.
Author: Nong Li <nongli@gmail.com>

Closes #34 from nongli/tpcds.
2015-11-19 13:50:00 -08:00
Nong Li
8d9e8ce9a3 Add another fact table and updates to load a single table at a time.
Author: Nong Li <nongli@gmail.com>

Closes #31 from nongli/more_tables.
2015-11-18 11:12:01 -08:00
Andrew Or
426ae30a2e Increase integration surface area with Spark perf
The changes in this PR are centered around making `Benchmark#runExperiment` accept things other than `Query`s. In particular, in spark-perf we don't always have a DataFrame or an RDD to work with and may want to run arbitrary code (e.g. ALS.train). This PR makes it possible to use the same code in `Benchmark` to do this.

I tested this on dogfood and it works well there.

Author: Andrew Or <andrew@databricks.com>

Closes #33 from andrewor14/spark-perf.
2015-11-18 10:50:46 -08:00
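The idea in 426ae30a2e, letting the experiment runner accept work that is not a query, can be sketched with a small trait. Every name here (`Workload`, `QueryWorkload`, `CodeWorkload`, `Timing`, `MiniBenchmark`) is hypothetical; the actual types used by `Benchmark#runExperiment` are not shown in this log.

```scala
// Hypothetical abstraction: anything with a name that can be executed and timed.
trait Workload {
  def name: String
  def run(): Unit
}

// A query-like workload that processes some rows.
final class QueryWorkload(val name: String, rows: Seq[Int]) extends Workload {
  var lastResult: Int = 0
  def run(): Unit = { lastResult = rows.map(_ * 2).sum }
}

// An arbitrary block of code, e.g. a model-training call.
final class CodeWorkload(val name: String, body: () => Unit) extends Workload {
  def run(): Unit = body()
}

final case class Timing(name: String, millis: Double)

object MiniBenchmark {
  // Runs each workload once and records its wall-clock time in milliseconds.
  def runExperiment(workloads: Seq[Workload]): Seq[Timing] =
    workloads.map { w =>
      val start = System.nanoTime()
      w.run()
      Timing(w.name, (System.nanoTime() - start) / 1e6)
    }
}
```

Because the runner depends only on `Workload`, results from queries and from arbitrary code end up in the same `Timing` format.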
Andrew Or
172ae79f8d Introduce small integration point with Spark perf
This allows us to report Spark perf results in the same format as SQL benchmark results. marmbrus

Author: Andrew Or <andrew@databricks.com>

Closes #30 from andrewor14/spark-perf.
2015-11-16 17:46:53 -08:00
Michael Armbrust
344b31ed69 Update to Spark 1.6
Some internal interfaces changed, so we need to bump the Spark version to run tests on Spark 1.6.

Author: Michael Armbrust <michael@databricks.com>

Closes #29 from marmbrus/spark16.
2015-11-13 12:40:00 -08:00
Nong Li
dc48f2e49b Support generating the data as "text".
This previously failed since text only supports a single column. Having the option of text output is useful to quickly see what the generator is doing.

Author: Nong Li <nongli@gmail.com>

Closes #27 from nongli/text.
2015-11-11 12:05:14 -08:00
bit1129
f63d40ce9f Add 2 queries
Author: bit1129 <bit1129@gmail.com>

Closes #22 from bit1129/master.
2015-09-16 10:10:20 -07:00
Michael Armbrust
40d085f1c7 Add dashboard notebook
Author: Michael Armbrust <michael@databricks.com>

Closes #21 from marmbrus/master.
2015-09-11 17:46:07 -07:00
Michael Armbrust
f03b3af719 Fail gracefully when invalid CPU logs are encountered
Author: Michael Armbrust <michael@databricks.com>

Closes #18 from marmbrus/parseCpuFail.
2015-09-09 22:02:23 -07:00
Michael Armbrust
e2dc749480 Add more tests for join performance
Author: Michael Armbrust <michael@databricks.com>

Closes #17 from marmbrus/joinPerf.
2015-09-09 21:56:47 -07:00
Michael Armbrust
08cb68ca20 Make it easier to write benchmarks in notebooks
Author: Michael Armbrust <michael@databricks.com>

Closes #19 from marmbrus/notebookTests.
2015-09-09 21:49:50 -07:00
Yin Huai
34f66a0a10 Add an option to filter rows with null partition column values. 2015-08-26 11:14:19 -07:00
Yin Huai
f4e20af107 fix typo 2015-08-25 23:31:50 -07:00
Yin Huai
06eb11f326 Fix the seed to 100 and use distribute by instead of order by. 2015-08-25 20:44:14 -07:00
Yin Huai
9936d49239 Add an option to orderBy partition columns. 2015-08-25 20:44:14 -07:00
Yin Huai
58188c6711 Allow users to use double instead of decimal for generated tables. 2015-08-25 20:44:14 -07:00
Yin Huai
77fbe22b7b address comments. 2015-08-25 20:44:13 -07:00
Yin Huai
97093a45cd Update readme and register temp tables. 2015-08-25 20:44:13 -07:00
Yin Huai
edb4daba80 Bug fix. 2015-08-25 20:44:13 -07:00
Yin Huai
544adce70f Add methods to genData. 2015-08-25 20:44:13 -07:00
Michael Armbrust
32215e05ee Block completion of cpu collection 2015-08-24 16:13:26 -07:00
Michael Armbrust
00aa49e8e4 Add support for CPU Profiling. 2015-08-20 16:46:12 -07:00
Yin Huai
249157f6a6 Fix typo. 2015-08-17 12:56:35 -07:00
Yin Huai
d5c3104ec6 address comments. 2015-08-14 11:39:06 -07:00
Yin Huai
51546868f4 You can specify the perf result location. 2015-08-13 18:43:50 -07:00
Yin Huai
11bfdc7c5a Add an ExecutionMode to check query results. 2015-08-13 18:43:49 -07:00
Michael Armbrust
ed8ddfedcd yins comments 2015-08-13 17:54:00 -07:00
Michael Armbrust
4101a1e968 Fixes to breakdown calculation and table creation. 2015-08-13 15:47:01 -07:00
Michael Armbrust
a239da90a2 more cleanup, update readme 2015-08-11 15:51:34 -07:00
Michael Armbrust
51b9dcb5b5 Merge remote-tracking branch 'origin/master' into refactor
Conflicts:
	src/main/scala/com/databricks/spark/sql/perf/bigdata/Queries.scala
	src/main/scala/com/databricks/spark/sql/perf/query.scala
	src/main/scala/com/databricks/spark/sql/perf/runBenchmarks.scala
	src/main/scala/com/databricks/spark/sql/perf/table.scala
	src/main/scala/com/databricks/spark/sql/perf/tpcds/queries/ImpalaKitQueries.scala
	src/main/scala/com/databricks/spark/sql/perf/tpcds/queries/SimpleQueries.scala
2015-08-07 15:31:32 -07:00
Jean-Yves Stephan
9421522820 Closing bracket 2015-07-22 15:03:43 -07:00
Yin Huai
a50fedd5bc Merge pull request #2 from jystephan/master
Allow saving benchmark queries results as parquet files
2015-07-22 13:40:39 -07:00
Jean-Yves Stephan
653d82134d No collect before saveAsParquet 2015-07-22 13:30:40 -07:00
Michael Armbrust
f00ad77985 with data generation 2015-07-22 00:29:58 -07:00
Jean-Yves Stephan
a4a53b8a73 Took Aaron's comments 2015-07-21 20:05:53 -07:00
Jean-Yves Stephan
d866cce1a1 Format 2015-07-21 13:27:50 -07:00