How to use it:
```
build/sbt "test:runMain com.databricks.spark.sql.perf.tpcds.GenTPCDSData -d /root/tmp/tpcds-kit/tools -s 5 -l /root/tmp/tpcds5g -f parquet"
```
```
[root@spark-3267648 spark-sql-perf]# build/sbt "test:runMain com.databricks.spark.sql.perf.tpcds.GenTPCDSData --help"
[info] Running com.databricks.spark.sql.perf.tpcds.GenTPCDSData --help
[info] Usage: Gen-TPC-DS-data [options]
[info]
[info] -m, --master <value> the Spark master to use, default to local[*]
[info] -d, --dsdgenDir <value> location of dsdgen
[info] -s, --scaleFactor <value>
[info] scaleFactor defines the size of the dataset to generate (in GB)
[info] -l, --location <value> root directory of location to create data in
[info] -f, --format <value> valid spark format, Parquet, ORC ...
[info] -i, --useDoubleForDecimal <value>
[info] true to replace DecimalType with DoubleType
[info] -e, --useStringForDate <value>
[info] true to replace DateType with StringType
[info] -o, --overwrite <value> overwrite the data that is already there
[info] -p, --partitionTables <value>
[info] create the partitioned fact tables
[info] -c, --clusterByPartitionColumns <value>
[info] shuffle to get partitions coalesced into single files
[info] -v, --filterOutNullPartitionValues <value>
[info] true to filter out the partition with NULL key value
[info] -t, --tableFilter <value>
[info] "" means generate all tables
[info] -n, --numPartitions <value>
[info] how many dsdgen partitions to run - number of input tasks.
[info] --help prints this usage text
```
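The same generation can also be driven programmatically from `spark-shell` or a notebook. A sketch mirroring the command above, using this repo's `TPCDSTables` class (argument names follow the README of this project but may vary between versions, so treat this as illustrative rather than canonical):

```scala
import com.databricks.spark.sql.perf.tpcds.TPCDSTables

// Same settings as the CLI example: dsdgen from tpcds-kit, 5 GB scale factor.
val tables = new TPCDSTables(sqlContext,
  dsdgenDir = "/root/tmp/tpcds-kit/tools",  // location of the dsdgen binary
  scaleFactor = "5",                        // dataset size in GB
  useDoubleForDecimal = false,              // keep DecimalType columns
  useStringForDate = false)                 // keep DateType columns

tables.genData(
  location = "/root/tmp/tpcds5g",           // root directory for the output
  format = "parquet",                       // any valid Spark data source format
  overwrite = true,                         // replace data that is already there
  partitionTables = true,                   // create partitioned fact tables
  clusterByPartitionColumns = true,         // shuffle to coalesce partitions into single files
  filterOutNullPartitionValues = false,     // keep partitions with NULL key values
  tableFilter = "",                         // "" means generate all tables
  numPartitions = 100)                      // number of dsdgen input tasks
```

Each `genData` parameter corresponds one-to-one with a CLI flag in the usage text above.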
* Refactor deprecated `getOrCreate()` for Spark 3
* Compile with Scala 2.12
* Update usage text related to obsolete/deprecated features
* Remove scala-logging; use slf4j directly
* Add basic partitioning to TPC-H tables, following the VectorH paper as a baseline
* Multi data generation (TPC-H and TPC-DS) and multi scale factor notebook/script:
  generates all the selected scale factors and benchmarks in one run
* TPC-H runner notebook or script for spark-shell
* Add basic TPC-H documentation
* Made small updates in Benchmark.scala and Query.scala for Spark 2.2
* Added tests for NaiveBayesModel and Bucketizer
* Changed BenchmarkAlgorithm.getEstimator() to BenchmarkAlgorithm.getPipelineStage() so that both Estimators and Transformers can be benchmarked, not just Estimators
Commits:
* Changes made so that spark-sql-perf compiles with Spark 2.2
* Updates for running ML tests from the command line + added Naive Bayes test
* Add Bucketizer test as example of Featurizer test; change getEstimator() to getPipelineStage() in
BenchmarkAlgorithm to allow for testing of transformers in addition to estimators.
* Add comment for main method in MLlib.scala
* Rename MLTransformerBenchmarkable --> MLPipelineStageBenchmarkable, fix issue with NaiveBayes param
* Add UnaryTransformer trait for common data/methods to be shared across all objects
testing featurizers that operate on a single column (StringIndexer, OneHotEncoder, Bucketizer, HashingTF, etc)
* Respond to review comments:
  * bin/run-ml: add newline at EOF
  * Query.scala: organize imports
  * MLlib.scala: organize imports, fix SparkContext initialization
  * NaiveBayes.scala: remove unused temp val, improve probability calculation in trueModel()
  * Bucketizer.scala: use DataGenerator.generateContinuousFeatures instead of generating data on the driver
* Fix bug in Bucketizer.scala
* Precompute log of sum of unnormalized probabilities in NaiveBayes.scala, add NaiveBayes and Bucketizer tests to mllib-small.yaml
* Update Query.scala to use p() to access SparkPlans under a given SparkPlan
* Update README to indicate that spark-sql-perf only works with Spark 2.2+ after this PR
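The getEstimator() → getPipelineStage() change above works because in Spark ML both `Estimator` and `Transformer` extend the common parent `PipelineStage`. A minimal sketch of the idea (the trait and object names here are illustrative, not the repo's exact definitions):

```scala
import org.apache.spark.ml.PipelineStage
import org.apache.spark.ml.feature.Bucketizer

// Before: a benchmark had to return an Estimator, so pure
// Transformers (featurizers like Bucketizer) could not be benchmarked:
//   def getEstimator(): Estimator[_]

// After: returning the common supertype PipelineStage admits both.
trait BenchmarkAlgorithmSketch {
  def getPipelineStage(): PipelineStage
}

object BucketizerBenchmarkSketch extends BenchmarkAlgorithmSketch {
  // Bucketizer is a Transformer, yet it satisfies the new signature.
  override def getPipelineStage(): PipelineStage =
    new Bucketizer().setSplits(
      Array(Double.NegativeInfinity, 0.0, Double.PositiveInfinity))
}
```

The same widening is what lets the Bucketizer and other featurizer tests (StringIndexer, OneHotEncoder, HashingTF, etc.) share the benchmarking harness with model-fitting tests like NaiveBayes.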
This PR adds the ability to run performance tests locally as a standalone program that reports the results to the console:
```
$ bin/run --help
spark-sql-perf 0.2.0
Usage: spark-sql-perf [options]
-b <value> | --benchmark <value>
the name of the benchmark to run
-f <value> | --filter <value>
a filter on the name of the queries to run
-i <value> | --iterations <value>
the number of iterations to run
--help
prints this usage text
$ bin/run --benchmark DatasetPerformance
```
Author: Michael Armbrust <michael@databricks.com>
Closes #47 from marmbrus/MainClass.