Commit Graph

168 Commits

Kevin
fdcde7595c Update README (#107)
Little update for the README
2017-07-13 10:45:24 +02:00
Juliusz Sompolski
6488d74d23 tpcds_2_4: Add alias names to subqueries in FROM clause.
## What changes were proposed in this pull request?

Since SPARK-20690 and SPARK-20916, Spark requires all subqueries in the FROM clause to have an alias.
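For illustration, the aliased shape this commit adds looks like the following. This is a hypothetical sketch, not code from the patch; it uses SQLite (which, unlike Spark 2.2+, also accepts the alias-less form) only so that the snippet runs anywhere:

```python
import sqlite3

# Hypothetical illustration of the FROM-clause aliasing rule.
# Spark 2.2+ (SPARK-20690 / SPARK-20916) rejects a derived table in FROM
# that has no alias; the aliased form below is the shape this commit gives
# to the tpcds_2_4 queries. SQLite is used here only to make the example
# self-contained and accepts both forms.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE store_sales (ss_quantity INTEGER)")
conn.executemany("INSERT INTO store_sales VALUES (?)", [(1,), (2,), (3,)])

# Subquery in FROM with an alias ("sub") -- the form Spark requires:
row = conn.execute(
    "SELECT avg(x) FROM (SELECT ss_quantity AS x FROM store_sales) sub"
).fetchone()
print(row[0])
```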

## How was this patch tested?

Tested on SF1.
2017-06-29 16:04:08 +02:00
Juliusz Sompolski
bff6b34f62 Tweaks and improvements (#106)
Data generation:
* Add an option to change Dates to Strings, specified when creating the Tables object.
* Add partition discovery to createExternalTables.
* Add an analyzeTables function that gathers statistics.

Benchmark execution:
* Perform collect() on the DataFrame so that the query is recorded in the SQL tab of the Spark UI.
2017-06-13 11:42:14 +02:00
Juliusz Sompolski
75f3876e59 Merge pull request #103 from juliuszsompolski/fixtypes
Correct types of keys in schema
2017-05-26 11:53:19 +02:00
Juliusz Sompolski
2ddd521ab5 ok, make it long only where really needed. 2017-05-26 10:36:40 +02:00
Juliusz Sompolski
1bca964a3d Correct types of keys 2017-05-25 17:12:47 +02:00
Volodymyr Lyubinets
beec62844d Merge pull request #101 from vlyubin/master
Add tpcds 2.4 queries
2017-05-16 10:35:35 +02:00
vlyubin
c0bd21c2ec Add ss_max 2017-05-16 10:29:00 +02:00
vlyubin
e5dc6f338f Updated queries 23 2017-05-15 17:30:20 +02:00
vlyubin
e8f85b0b0e Moved queries into a separate folder 2017-05-15 14:22:37 +02:00
vlyubin
96bf10bffc Add tpcds 2.4 queries 2017-05-12 11:54:32 +02:00
Eric Liang
c12b14b013 Merge pull request #98 from databricks/parallel-runs
Add option to avoid cleaning after each run, to enable parallel runs
2017-03-15 13:50:41 -07:00
Eric Liang
64728c7cff Add option to avoid cleaning after each run, to enable parallel runs 2017-03-14 19:45:27 -07:00
Timothy Hunter
53091a1935 Removes labels from tree data generation (#82)
* changes

* removes labels

* reset scala version

* adding metadata

* bumping spark release
2016-12-13 16:47:31 -08:00
srinathshankar
685c50d9dc Cross build with Scala 2.11 (#91)
* Cross build with Scala 2.11

* Update snapshot version
2016-10-03 17:01:17 -07:00
srinathshankar
0eaa4b1d57 [SC-4409] Correct query 41 in TPCDS kit (#90) 2016-09-30 18:02:39 -07:00
Josh Rosen
c2224f37e5 Depend on non-snapshot Spark now that 2.0.0 is released
Now that Spark 2.0.0 is released, we need to update the build to use a released version instead of the snapshot (which is no longer available).

Fixes #84.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #85 from JoshRosen/fix-spark-dep.
2016-08-17 17:53:30 -07:00
Timothy Hunter
948c8369e7 Fixes issues with scala 2.11
Fixes the usual scala-logging issues to make the source code cross-compile between Scala 2.10 and Scala 2.11.

Tests:
A Scala 2.11 build of the code has been run against the official Spark 2.0.0 RC4 binary release (Scala 2.11).
A Scala 2.10 build has been run against the official Spark 1.6.2 release.

Author: Timothy Hunter <timhunter@databricks.com>

Closes #81 from thunterdb/1607-scala211.
2016-07-19 11:19:52 -07:00
Timothy Hunter
8830bffd46 Merge pull request #79 from jkbradley/tree-test-fix
Fixed tree, forest, GBT tests by adding metadata to DataFrames
2016-07-11 10:42:19 -07:00
Joseph K. Bradley
51469a34d6 Fixed tree, forest, GBT tests by adding metadata to DataFrames 2016-07-11 10:33:19 -07:00
Timothy Hunter
1fcc366cec Merge pull request #78 from thunterdb/1607-fixes
Adding parameters in case of failures
2016-07-06 11:34:05 -07:00
Timothy Hunter
c7d42d3626 adding parameters 2016-07-06 11:23:07 -07:00
Timothy Hunter
2672bcd5b7 ALS algorithm for spark-sql-perf
This has been tested locally with a small amount of data.

I have not bothered to reimplement a more robust version of the ALS synthetic data generation, so it will still require some manual parameter tweaking as before.

Author: Timothy Hunter <timhunter@databricks.com>

Closes #76 from thunterdb/1607-als.
2016-07-05 15:54:08 -07:00
Timothy Hunter
93c0407bbe Merge pull request #77 from thunterdb/1607-linear
Linear regression
2016-07-05 15:41:35 -07:00
Timothy Hunter
40e97ca3c0 comment 2016-07-05 15:01:50 -07:00
Timothy Hunter
ce7e20ae6d set the solver 2016-07-05 13:46:19 -07:00
Timothy Hunter
def20479a1 linear regression 2016-07-05 13:42:56 -07:00
Timothy Hunter
979ebd5d0f Merge pull request #75 from jkbradley/kmeans
Added kmeans test
2016-07-05 10:14:11 -07:00
Joseph K. Bradley
9d11a601c3 added kmeans test 2016-07-01 18:00:49 -07:00
jkbradley
3d3443791c Merge pull request #74 from jkbradley/dt-tests
Decision tree, random forest, GBT classification perf tests
2016-07-01 17:40:16 -07:00
Joseph K. Bradley
495e2716c4 updated per code review. works in local tests 2016-07-01 17:39:28 -07:00
jkbradley
c2f0a35db4 Merge pull request #1 from thunterdb/1606-trees
adding experiments to the yaml file
2016-07-01 11:46:41 -07:00
Timothy Hunter
813bd8ad59 adding more experiments 2016-07-01 10:34:42 -07:00
Joseph K. Bradley
c15d083fe7 cleanups 2016-06-30 10:45:15 -07:00
Joseph K. Bradley
ecf2eedbb8 Added decision tree, forest, GBT tests 2016-06-30 10:38:24 -07:00
Joseph K. Bradley
33a1e55366 partly done adding decision tree tests 2016-06-29 17:06:27 -07:00
jkbradley
26a685b97e Merge pull request #72 from thunterdb/1606-glms
Generalized linear models performance tests
2016-06-28 14:37:45 -07:00
Timothy Hunter
353dc0c873 comment 2016-06-28 12:00:04 -07:00
Timothy Hunter
5c1990e4ff no normalization 2016-06-27 13:32:38 -07:00
Timothy Hunter
87dc42a466 work on GLM, and some notebooks 2016-06-23 12:13:11 -07:00
Timothy Hunter
1388722b81 Initial commit for adding MLlib reporting in spark-sql-perf
This PR adds basic MLlib infrastructure to run some benchmarks against ML pipelines.

There are two ways to describe and run ML pipelines:
 - programmatically, in Scala (see MLBenchmarks.scala)
 - using a simple YAML file (see mllib-small.yaml for an example)
The YAML approach is preferred because it programmatically generates the Cartesian product of all the experiments to run and validates the types of the objects in the YAML file.

In both cases, all the ML experiments are standard benchmarks.

This PR also moves some code in `Benchmark.scala`: the previous code generated path-dependent structural signatures and confused IntelliJ.

It does not include tests, but some small benchmarks can be run locally against a Spark 2 installation:

```
$SPARK_HOME/bin/spark-shell --jars $PWD/target/scala-2.10/spark-sql-perf-assembly-0.4.9-SNAPSHOT.jar
```
and then:

```scala
com.databricks.spark.sql.perf.mllib.MLLib.run(yamlFile="src/main/scala/configs/mllib-small.yaml")
```

Author: Timothy Hunter <timhunter@databricks.com>

Closes #69 from thunterdb/1605-mllib2.
2016-06-22 16:59:49 -07:00
Davies Liu
ea342c6165 fix checking results and bump to 0.4.9 2016-06-17 12:53:12 -07:00
Eric Liang
0d1e9043f1 [SC-3547] Fix various typos in queries and bump version to 0.48 2016-06-14 12:27:24 -07:00
Davies Liu
cc50104194 bump to 0.4.7 2016-05-24 10:41:21 -07:00
Davies Liu
c087b68a5c make number of partitions configurable 2016-05-24 10:40:51 -07:00
Sameer Agarwal
375e116b1a bump to 0.4.6 2016-05-23 14:08:02 -07:00
Sameer Agarwal
1840fd9f21 Fix/rewrite some TPC-DS 1.4 queries
This patch ports upstream query modifications from apache/spark#13188
2016-05-23 14:02:47 -07:00
Sameer Agarwal
0355fc4ee7 Fix build and switch to jdk8
* Fix Build

* more memory

* switch to jdk8

* old memory settings
2016-05-23 12:54:07 -07:00
Sameer Agarwal
10b90c0d2b Fix q8 in ImpalaKit 2016-04-29 14:07:31 -07:00
Davies Liu
b8a90621cf bump to 0.4.5 2016-03-30 11:57:35 -07:00