Make Spark Run Faster and Faster

  • Cluster Optimization
  • Parameter Optimization
  • Code Optimization

Cluster Optimization

Locality Level

Data locality is how close data is to the code processing it. There are several levels of locality based on the data’s current location. In order from closest to farthest:

  • PROCESS_LOCAL data is in the same JVM as the running code. This is the best locality possible
  • NODE_LOCAL data is on the same node. Examples might be in HDFS on the same node, or in another executor on the same node. This is a little slower than PROCESS_LOCAL because the data has to travel between processes
  • NO_PREF data is accessed equally quickly from anywhere and has no locality preference
  • RACK_LOCAL data is on the same rack of servers. Data is on a different server on the same rack so needs to be sent over the network, typically through a single switch
  • ANY data is elsewhere on the network and not in the same rack

Performance: PROCESS_LOCAL > NODE_LOCAL > NO_PREF > RACK_LOCAL > ANY

Locality settings
  • spark.locality.wait.process
  • spark.locality.wait.node
  • spark.locality.wait.rack
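The scheduler briefly holds a task hoping for a slot at a better locality level before falling back to the next one; each `spark.locality.wait.*` key controls that wait for its level. A minimal sketch of setting them at session startup (the 3s values are just the defaults, not a recommendation):

```python
from pyspark.sql import SparkSession

# Values are illustrative; each level defaults to spark.locality.wait (3s).
spark = (
    SparkSession.builder
    .config("spark.locality.wait.process", "3s")  # wait for PROCESS_LOCAL
    .config("spark.locality.wait.node", "3s")     # wait for NODE_LOCAL
    .config("spark.locality.wait.rack", "3s")     # wait for RACK_LOCAL
    .getOrCreate()
)
```

Raising a wait trades scheduling delay for better locality; lowering it helps when executors are busy and tasks sit idle waiting for a local slot.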

Data Format

  • text : row-based, no columnar compression; slowest to scan
  • orc : columnar with compression and indexes; supports predicate pushdown
  • parquet : columnar with compression; Spark SQL's default source format
  • avro : row-based with an embedded schema; suited to write-heavy workloads

Format settings
  • spark.sql.hive.convertCTAS : when true, CREATE TABLE AS SELECT writes the default data source format instead of Hive's SerDe
  • spark.sql.sources.default : default is parquet
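Both settings can be fixed when the session is built. A sketch, assuming you want parquet everywhere:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # CREATE TABLE AS SELECT uses the default source format, not Hive's SerDe
    .config("spark.sql.hive.convertCTAS", "true")
    # parquet is already the default; shown here for explicitness
    .config("spark.sql.sources.default", "parquet")
    .getOrCreate()
)
```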

Parallelism

  • spark.sql.shuffle.partitions : default is 200; the number of partitions used when shuffling data for joins and aggregations
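This is a SQL conf, so it can be changed per job at runtime. A sketch, where 400 is purely illustrative (a common rule of thumb is to target roughly 100–200MB of shuffle data per partition):

```python
# default is 200; raise for large shuffles, lower for small ones
spark.conf.set("spark.sql.shuffle.partitions", "400")
```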

computing

  • --executor-memory : default is 1G. Too large a value can throttle other applications' resources on the cluster; too small a value can kill tasks with out-of-memory errors
  • --executor-cores : default is 1. Too many cores per executor can cause I/O contention; too few cores slow down computing
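The same knobs are available as conf keys when building the session, which is handy outside spark-submit. The 4g/4 sizing below is illustrative only:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.executor.memory", "4g")  # same as --executor-memory
    .config("spark.executor.cores", "4")    # same as --executor-cores
    .getOrCreate()
)
```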

memory

  • spark.executor.memoryOverhead : off-heap memory per executor; default is the larger of 384MB and 10% of executor memory
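The resource manager allocates heap plus overhead per executor, so the overhead must fit inside the container limit. A quick pure-Python check of the default arithmetic (4GB heap is illustrative):

```python
executor_memory_gb = 4  # spark.executor.memory

# default overhead: the larger of 384MB and 10% of executor memory
overhead_gb = max(0.384, 0.10 * executor_memory_gb)

# what the resource manager actually allocates per executor
container_gb = executor_memory_gb + overhead_gb
print(container_gb)  # 4.4
```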

table join

  • spark.sql.autoBroadcastJoinThreshold : default is 10MB; tables smaller than this are broadcast to every executor when joined
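The threshold can be tuned at runtime, and a join can also be forced with an explicit `broadcast()` hint regardless of the threshold. A sketch, assuming `df` and `small_df` already exist (the 50MB figure is illustrative):

```python
from pyspark.sql.functions import broadcast

# raise the automatic threshold from 10MB to 50MB (illustrative)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(50 * 1024 * 1024))

# or force a broadcast join explicitly, bypassing the threshold
result = df.join(broadcast(small_df), "key")
```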

predicate push down in Spark SQL queries

  • spark.sql.parquet.filterPushdown : default is true
  • spark.sql.orc.filterPushdown : default is false; set to true to enable
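Enabling pushdown lets the file reader evaluate filters against file-level statistics and skip data that cannot match, instead of filtering after the read. A sketch of turning both on:

```python
spark.conf.set("spark.sql.parquet.filterPushdown", "true")  # already the default
spark.conf.set("spark.sql.orc.filterPushdown", "true")
```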

reuse RDD

    from pyspark import StorageLevel

    # persist so repeated actions reuse the cached data instead of recomputing it
    df.persist(StorageLevel.MEMORY_ONLY)

Spark operators

  • shuffle operators

    • minimise operators that trigger a shuffle, such as reduceByKey, join, distinct and repartition
    • broadcast small datasets instead of joining them
  • High performance operator

    • reduceByKey > groupByKey (reduceByKey combines values on the map side before shuffling)
    • mapPartitions > map (one function call per partition instead of per record)
    • treeReduce > reduce (treeReduce works on the executors, not the driver)
      • treeReduce & reduce return the same result to the driver
      • treeReduce does more work on the executors, while reduce brings everything back to the driver
    • foreachPartition > foreach (one function call per partition instead of per record)
    • filter + coalesce (after a filter drops most of the data, coalesce to reduce the number of partitions and tasks)
    • repartitionAndSortWithinPartitions > repartition + sort (sorting happens during the shuffle)
    • broadcast variables for small datasets (up to roughly 100MB)
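Why reduceByKey beats groupByKey: it merges values per key within each partition before the shuffle, so far fewer records cross the network. A pure-Python sketch of the idea (no Spark required; the function names just mirror the two operators), counting how many records each strategy would shuffle:

```python
from collections import defaultdict

def shuffled_records_groupByKey(partitions):
    # groupByKey ships every (key, value) record to the reducers as-is
    return sum(len(part) for part in partitions)

def shuffled_records_reduceByKey(partitions):
    # reduceByKey first merges values per key within each partition,
    # then ships one (key, partial_sum) record per key per partition
    total = 0
    for part in partitions:
        combined = defaultdict(int)
        for key, value in part:
            combined[key] += value  # map-side combine
        total += len(combined)
    return total

partitions = [
    [("a", 1), ("a", 1), ("b", 1), ("a", 1)],
    [("b", 1), ("b", 1), ("a", 1)],
]

print(shuffled_records_groupByKey(partitions))   # 7 records shuffled
print(shuffled_records_reduceByKey(partitions))  # 4 records shuffled
```

The gap widens with the number of duplicate keys per partition, which is why word-count-style jobs benefit so much.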

shuffle

  • spark.shuffle.sort.bypassMergeThreshold : default is 200; with no map-side aggregation and fewer reduce partitions than this, the sort-based shuffle skips sorting
  • spark.shuffle.io.retryWait : default is 5s; how long to wait between shuffle fetch retries
  • spark.shuffle.io.maxRetries : default is 3; how many times a failed shuffle fetch is retried
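A sketch of setting these at session startup; the retry values below are illustrative, chosen to tolerate flakier networks than the defaults assume:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # skip map-side sorting when there are few reducers and no aggregation
    .config("spark.shuffle.sort.bypassMergeThreshold", "200")  # default
    # be more patient with failed shuffle fetches (illustrative values)
    .config("spark.shuffle.io.retryWait", "10s")  # default 5s
    .config("spark.shuffle.io.maxRetries", "6")   # default 3
    .getOrCreate()
)
```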

TBC