Commit Graph

81 Commits

Timothy Hunter
ce7e20ae6d set the solver 2016-07-05 13:46:19 -07:00
Timothy Hunter
def20479a1 linear regression 2016-07-05 13:42:56 -07:00
Joseph K. Bradley
9d11a601c3 added kmeans test 2016-07-01 18:00:49 -07:00
Joseph K. Bradley
495e2716c4 updated per code review. works in local tests 2016-07-01 17:39:28 -07:00
Timothy Hunter
813bd8ad59 adding more experiments 2016-07-01 10:34:42 -07:00
Joseph K. Bradley
c15d083fe7 cleanups 2016-06-30 10:45:15 -07:00
Joseph K. Bradley
ecf2eedbb8 Added decision tree, forest, GBT tests 2016-06-30 10:38:24 -07:00
Joseph K. Bradley
33a1e55366 partly done adding decision tree tests 2016-06-29 17:06:27 -07:00
Timothy Hunter
353dc0c873 comment 2016-06-28 12:00:04 -07:00
Timothy Hunter
5c1990e4ff no normalization 2016-06-27 13:32:38 -07:00
Timothy Hunter
87dc42a466 work on GLM and some notebooks 2016-06-23 12:13:11 -07:00
Timothy Hunter
1388722b81 Initial commit for adding MLlib reporting in spark-sql-perf
This PR adds basic MLlib infrastructure to run some benchmarks against ML pipelines.

There are 2 ways to describe and run ML pipelines:
 - programmatically, in Scala (see MLBenchmarks.scala)
 - using a simple YAML file (see mllib-small.yaml for an example)
The YAML approach is preferred because it programmatically generates the Cartesian product of all the experiments to run and validates the types of the objects in the YAML file, as sketched below.
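
To make the "Cartesian product" concrete, here is a minimal Scala sketch of the idea; the parameter names and values are illustrative assumptions, not the contents of mllib-small.yaml:

```scala
// Hypothetical sketch: a config listing several values per parameter
// expands into one experiment per combination of values.
val numRows     = Seq(100000L, 1000000L)   // assumed parameter
val numFeatures = Seq(10, 100)             // assumed parameter

val experiments = for {
  n <- numRows
  k <- numFeatures
} yield (n, k)
// 2 x 2 = 4 experiments: (100000,10), (100000,100), (1000000,10), (1000000,100)
```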

In both cases, all the ML experiments are standard benchmarks.

This PR also moves some code in `Benchmark.scala`: the current code generates path-dependent structural signatures and confuses IntelliJ.

It does not include tests, but some small benchmarks can be run locally against a Spark 2 installation:

```
$SPARK_HOME/bin/spark-shell --jars $PWD/target/scala-2.10/spark-sql-perf-assembly-0.4.9-SNAPSHOT.jar
```
and then:

```scala
com.databricks.spark.sql.perf.mllib.MLLib.run(yamlFile="src/main/scala/configs/mllib-small.yaml")
```

Author: Timothy Hunter <timhunter@databricks.com>

Closes #69 from thunterdb/1605-mllib2.
2016-06-22 16:59:49 -07:00
Davies Liu
ea342c6165 fix checking results and bump to 0.4.9 2016-06-17 12:53:12 -07:00
Eric Liang
0d1e9043f1 [SC-3547] Fix various typos in queries and bump version to 0.4.8 2016-06-14 12:27:24 -07:00
Davies Liu
c087b68a5c make number of partitions configurable 2016-05-24 10:40:51 -07:00
Sameer Agarwal
1840fd9f21 Fix/rewrite some TPC-DS 1.4 queries
This patch ports upstream query modifications from apache/spark#13188
2016-05-23 14:02:47 -07:00
Sameer Agarwal
0355fc4ee7 Fix build and switch to jdk8
* Fix Build

* more memory

* switch to jdk8

* old memory settings
2016-05-23 12:54:07 -07:00
Sameer Agarwal
10b90c0d2b Fix q8 in ImpalaKit 2016-04-29 14:07:31 -07:00
Davies Liu
656f1bdb17 fix writing results 2016-03-30 11:56:55 -07:00
Michael Armbrust
5912673b0d Fix JoinPerformance compilation
Author: Michael Armbrust <michael@databricks.com>

Closes #55 from marmbrus/fixJoinPerf.
2016-03-25 11:46:36 -07:00
Josh Rosen
42a415e8d4 Extract Query class from Benchmark into its own top-level class and make SparkContext field transient
This patch extracts `Query` into its own top-level class and makes its `sparkContext` field transient in order to fix `NotSerializableException`s.
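
A minimal sketch of the pattern (field names assumed, not the PR's exact code):

```scala
import org.apache.spark.SparkContext

// Marking the context @transient keeps it out of serialized closures,
// avoiding NotSerializableException when a Query is shipped to executors.
class Query(@transient val sparkContext: SparkContext, val sqlText: String)
  extends Serializable
```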

Author: Josh Rosen <rosenville@gmail.com>

Closes #53 from JoshRosen/make-query-into-top-level-class.
2016-02-22 18:23:06 -08:00
Josh Rosen
7e38b77c50 Update to compile against Spark 2.0.0-SNAPSHOT and bump version to 0.4.0-SNAPSHOT
Author: Josh Rosen <rosenville@gmail.com>

Closes #51 from JoshRosen/spark-2.0.0.
2016-02-19 13:02:29 -08:00
Josh Rosen
685ed9e488 Add TPCDS(sqlContext) constructor for backwards-compatibility
This patch adds additional constructors to `TPCDS` to maintain backwards-compatibility with code which calls `new TPCDS(anExistingSqlContext)`. This constructor was removed in #47.

The motivation for backwards-compatibility here is to simplify the gradual roll-out of an updated spark-sql-perf library to some existing jobs which share the same notebook.
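
A hedged sketch of the compatibility shim; the primary constructor's extra parameter and the default value are assumptions:

```scala
import org.apache.spark.sql.SQLContext

class TPCDS(sqlContext: SQLContext, resultsLocation: String) { // assumed primary signature
  // Auxiliary constructor restoring the pre-#47 call site
  // `new TPCDS(anExistingSqlContext)` with a default results location.
  def this(sqlContext: SQLContext) = this(sqlContext, "/tmp/results")
}
```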

Author: Josh Rosen <rosenville@gmail.com>

Closes #52 from JoshRosen/backwards-compatible-tpcds-constructor.
2016-02-19 13:01:23 -08:00
Michael Armbrust
9d3347e949 Improvements to running the benchmark
- Scripts for running the benchmark either while working on spark-sql-perf (bin/run) or while working on Spark (bin/spark-perf). The latter uses Spark's sbt build to compile Spark and downloads the most recent published version of spark-sql-perf.
- Adds a `--compare` flag that can be used to compare the results with a baseline run.

Author: Michael Armbrust <michael@databricks.com>

Closes #49 from marmbrus/runner.
2016-01-24 20:24:54 -08:00
Michael Armbrust
663ca7560e Main Class for running Benchmarks from the command line
This PR adds the ability to run performance tests locally as a standalone program that reports the results to the console:

```
$ bin/run --help
spark-sql-perf 0.2.0
Usage: spark-sql-perf [options]

  -b <value> | --benchmark <value>
        the name of the benchmark to run
  -f <value> | --filter <value>
        a filter on the name of the queries to run
  -i <value> | --iterations <value>
        the number of iterations to run
  --help
        prints this usage text

$ bin/run --benchmark DatasetPerformance
```

Author: Michael Armbrust <michael@databricks.com>

Closes #47 from marmbrus/MainClass.
2016-01-19 12:37:51 -08:00
Davies Liu
cec648ac0f try to run all TPC-DS queries in the benchmark (even those that can't be parsed) 2016-01-08 15:03:44 -08:00
Davies Liu
3105219fb0 Merge commit '11d1f9dd7237ea2a09ecfa61f09d7623ad52fd47' 2016-01-08 11:29:07 -08:00
Davies Liu
11d1f9dd72 update some queries:
 - replace `"` with `` ` ``
 - fill some values
2016-01-08 11:27:50 -08:00
Michael Armbrust
9269f8f594 Capture BuildInfo when available
Author: Michael Armbrust <michael@databricks.com>

Closes #45 from marmbrus/buildInfo.
2015-12-23 11:03:06 -08:00
Michael Armbrust
f8aa93d968 Initial set of tests for Datasets
Author: Michael Armbrust <michael@databricks.com>

Closes #42 from marmbrus/dataset-tests.
2015-12-08 16:04:42 -08:00
Michael Armbrust
0aa2569a18 Write only one file per run
Author: Michael Armbrust <michael@databricks.com>

Closes #35 from marmbrus/oneResultFile.
2015-12-08 15:46:20 -08:00
Yin Huai
3af656defa Make ExecutionMode.HashResults handle null value
In Spark 1.6, if a value is null, `getLong` will throw an exception; before 1.6, it would return 0. With this PR, we check whether the result is null and, if so, return null instead of 0.
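
A sketch of the null-safe access (assumed shape of the check, using the standard `Row` API):

```scala
import org.apache.spark.sql.Row

// Row.getLong throws on null in Spark 1.6, so test with isNullAt first.
def hashedValue(row: Row, i: Int): Any =
  if (row.isNullAt(i)) null else row.getLong(i)
```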

Author: Yin Huai <yhuai@databricks.com>

Closes #41 from yhuai/fixSumHash.
2015-12-08 15:28:48 -08:00
Nong Li
43c2f23bb9 Fixes for Q34 and Q73 to return results deterministically.
Author: Nong Li <nong@databricks.com>

Closes #38 from nongli/tpcds.
2015-11-25 15:03:33 -08:00
Nong
70e0dbe656 Add official TPCDS 1.4 queries.
Author: Nong <nong@cloudera.com>

Closes #36 from nongli/tpcds.
2015-11-24 13:12:46 -08:00
Nong Li
1aa5bfc838 Add remaining tpcds tables.
Author: Nong Li <nongli@gmail.com>

Closes #34 from nongli/tpcds.
2015-11-19 13:50:00 -08:00
Nong Li
8d9e8ce9a3 Add another fact table and updates to load a single table at a time.
Author: Nong Li <nongli@gmail.com>

Closes #31 from nongli/more_tables.
2015-11-18 11:12:01 -08:00
Andrew Or
426ae30a2e Increase integration surface area with Spark perf
The changes in this PR are centered around making `Benchmark#runExperiment` accept things other than `Query`s. In particular, in spark-perf we don't always have a DataFrame or an RDD to work with and may want to run arbitrary code (e.g. ALS.train). This PR makes it possible to use the same code in `Benchmark` to do this.
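
A hypothetical sketch of what such a generalization can look like; the trait and method names are assumptions, not the PR's API:

```scala
// An abstraction over "things that can be benchmarked", so arbitrary
// code fits the same runExperiment machinery as a SQL query.
trait Benchmarkable {
  def name: String
  def doBenchmark(): Unit
}

class CodeBenchmark(val name: String, body: () => Unit) extends Benchmarkable {
  def doBenchmark(): Unit = body()
}

// e.g. wrapping model training:
//   new CodeBenchmark("als-train", () => ALS.train(ratings, 10, 5))
```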

I tested this on dogfood and it works well there.

Author: Andrew Or <andrew@databricks.com>

Closes #33 from andrewor14/spark-perf.
2015-11-18 10:50:46 -08:00
Andrew Or
172ae79f8d Introduce small integration point with Spark perf
This allows us to report Spark perf results in the same format as SQL benchmark results. marmbrus

Author: Andrew Or <andrew@databricks.com>

Closes #30 from andrewor14/spark-perf.
2015-11-16 17:46:53 -08:00
Michael Armbrust
344b31ed69 Update to Spark 1.6
Some internal interfaces changed, so we need to bump the Spark version to run tests on Spark 1.6.

Author: Michael Armbrust <michael@databricks.com>

Closes #29 from marmbrus/spark16.
2015-11-13 12:40:00 -08:00
Nong Li
dc48f2e49b Support generating the data as "text".
This previously failed since the text source only supports a single column. Having the option of text output is useful for quickly seeing what the generator is doing.
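
One hedged way to do this (a sketch, not the PR's code: the delimiter and column handling are assumptions) is to concatenate all columns into a single string column before writing:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, concat_ws}

// The text data source accepts exactly one string column, so fold each
// row into a single delimited string first.
def writeAsText(df: DataFrame, path: String): Unit =
  df.select(concat_ws("|", df.columns.map(col): _*)).write.text(path)
```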

Author: Nong Li <nongli@gmail.com>

Closes #27 from nongli/text.
2015-11-11 12:05:14 -08:00
bit1129
f63d40ce9f Add 2 queries
Author: bit1129 <bit1129@gmail.com>

Closes #22 from bit1129/master.
2015-09-16 10:10:20 -07:00
Michael Armbrust
40d085f1c7 Add dashboard notebook
Author: Michael Armbrust <michael@databricks.com>

Closes #21 from marmbrus/master.
2015-09-11 17:46:07 -07:00
Michael Armbrust
f03b3af719 Fail gracefully when invalid CPU logs are encountered
Author: Michael Armbrust <michael@databricks.com>

Closes #18 from marmbrus/parseCpuFail.
2015-09-09 22:02:23 -07:00
Michael Armbrust
e2dc749480 Add more tests for join performance
Author: Michael Armbrust <michael@databricks.com>

Closes #17 from marmbrus/joinPerf.
2015-09-09 21:56:47 -07:00
Michael Armbrust
08cb68ca20 Make it easier to write benchmarks in notebooks
Author: Michael Armbrust <michael@databricks.com>

Closes #19 from marmbrus/notebookTests.
2015-09-09 21:49:50 -07:00
Yin Huai
34f66a0a10 Add an option to filter rows with null partition column values. 2015-08-26 11:14:19 -07:00
Yin Huai
f4e20af107 fix typo 2015-08-25 23:31:50 -07:00
Yin Huai
06eb11f326 Fix the seed to 100 and use distribute by instead of order by. 2015-08-25 20:44:14 -07:00
Yin Huai
9936d49239 Add an option to orderBy partition columns. 2015-08-25 20:44:14 -07:00
Yin Huai
58188c6711 Allow users to use double instead of decimal for generated tables. 2015-08-25 20:44:14 -07:00