Shiao-An Yuan
@sayuan
2016-08-11
Spark Overview
Cluster Manager (aka Master)
Worker (aka Slave)
Driver
Executor
http://spark.apache.org/docs/latest/cluster-overview.html
RDD (Resilient Distributed Dataset)
A fault-tolerant collection of elements that can be operated on in parallel
Word Count
val sc: SparkContext = ...
val result = sc.textFile(file)     // RDD[String]
  .flatMap(_.split(" "))           // RDD[String]
  .map(_ -> 1)                     // RDD[(String, Int)]
  .groupByKey()                    // RDD[(String, Iterable[Int])]
  .map(x => (x._1, x._2.sum))      // RDD[(String, Int)]
  .collect()                       // Array[(String, Int)]
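The `groupByKey()` step above moves every `(word, 1)` pair through the shuffle before anything is summed. A common alternative is `reduceByKey`, which sums within each partition first and shuffles only the partial counts. The sketch below shows that variant; the names `WordCountSketch` and `countWords` are illustrative and do not come from the slides.

```scala
import org.apache.spark.rdd.RDD

object WordCountSketch {
  // Hypothetical helper: counts words in an RDD of text lines.
  def countWords(lines: RDD[String]): RDD[(String, Int)] =
    lines
      .flatMap(_.split(" "))    // RDD[String], one element per word
      .map(word => (word, 1))   // RDD[(String, Int)]
      .reduceByKey(_ + _)       // partial sums per partition, then one shuffle
}
```

With the `sc` from the slide, this could be called as `WordCountSketch.countWords(sc.textFile(file)).collect()`.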
Lazy, Transformation, Action, Job
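To make these four terms concrete, here is a small sketch that counts how often a `map` function runs, using an accumulator. It assumes Spark 2.0 or later for `longAccumulator`; the names `LazinessSketch` and `demo` are illustrative. The transformation alone starts no job, and the count only moves once the action runs.

```scala
import org.apache.spark.SparkContext

object LazinessSketch {
  // Returns (map calls seen before the action, map calls seen after it).
  def demo(sc: SparkContext): (Long, Long) = {
    val calls = sc.longAccumulator("map calls")

    // Transformation: only recorded in the lineage, nothing executes yet.
    val doubled = sc.parallelize(1 to 10, 2).map { x => calls.add(1); x * 2 }
    val before: Long = calls.value

    // Action: Spark submits a job, which runs the map on every element.
    doubled.count()
    val after: Long = calls.value

    (before, after)
  }
}
```

Accumulator updates made inside transformations can be applied more than once if a task is retried, so treat this as a way to observe laziness rather than as a reliable counter.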
Kryo serialization
Much faster than Java serialization
Class registration needed for best performance
http://spark.apache.org/docs/latest/programming-guide.html#which-storage-level-to-choose
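A minimal sketch of switching to Kryo and registering a class, using `SparkConf.set` and `SparkConf.registerKryoClasses`. The `Click` case class and the `KryoConfSketch` object are hypothetical stand-ins for your own application types.

```scala
import org.apache.spark.SparkConf

object KryoConfSketch {
  // Hypothetical application class, present only to show registration.
  case class Click(userId: Long, url: String)

  def kryoConf(appName: String): SparkConf =
    new SparkConf()
      .setAppName(appName)
      // Replace the default Java serializer with Kryo.
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Register the classes that will be serialized.
      .registerKryoClasses(Array(classOf[Click]))
}
```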
Common Failures
Large shuffle blocks
java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
Increase partition count
MetadataFetchFailedException, FetchFailedException
Increase partition count
Increase `spark.executor.memory`
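Both failures above list a higher partition count as a remedy. As a sketch, there are two usual ways to get one: pass the count to the shuffle operation itself, or call `repartition` on an existing RDD. The object and method names here are illustrative.

```scala
import org.apache.spark.rdd.RDD

object PartitionCountSketch {
  // Sums values per key and spreads the shuffle output over `numPartitions`
  // partitions, so each shuffle block holds less data.
  def sumByKey(pairs: RDD[(String, Int)], numPartitions: Int): RDD[(String, Int)] =
    pairs.reduceByKey(_ + _, numPartitions)

  // Redistributes an existing RDD. Note that repartition is itself a shuffle.
  def spread[T](rdd: RDD[T], numPartitions: Int): RDD[T] =
    rdd.repartition(numPartitions)
}
```

Passing the count to the shuffle operation is usually the cheaper of the two, since it avoids an extra shuffle.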
java.lang.OutOfMemoryError: GC overhead limit exceeded
May be caused by shuffle spill
java.lang.OutOfMemoryError: Java heap space
Driver
Increase `spark.driver.memory`
collect()
take()
saveAsTextFile()
Executor
Increase `spark.executor.memory`
More nodes
java.io.IOException: No space left on device
SPARK_WORKER_DIR
SPARK_LOCAL_DIRS, spark.local.dir
Shuffle files
Only deleted after the RDD object has been garbage-collected
Other Tips
Event logs
spark.eventLog.enabled=true
${SPARK_HOME}/sbin/start-history-server.sh
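Event logging can also be switched on from application code. The sketch below sets `spark.eventLog.enabled` together with `spark.eventLog.dir`; the helper name `withEventLog` is illustrative, and the directory is a placeholder supplied by the caller.

```scala
import org.apache.spark.SparkConf

object EventLogConfSketch {
  // `logDir` is a placeholder (for example an HDFS path).
  // The directory should exist before the application starts.
  def withEventLog(conf: SparkConf, logDir: String): SparkConf =
    conf
      .set("spark.eventLog.enabled", "true")
      .set("spark.eventLog.dir", logDir)
}
```

For the history server to show these applications, its `spark.history.fs.logDirectory` setting should point at the same directory.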
Partitions
Rule of thumb: ~128 MB per partition
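If the partition count is just under 2000, bump it to slightly above 2000 (Spark uses a more compact structure for shuffle bookkeeping above 2000 partitions)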
http://www.slideshare.net/cloudera/top-5-mistakes-to-avoid-when-writing-apache-spark-applications
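The two rules of thumb above can be combined into a small calculation. This is only a sketch: the slide does not say how close to 2000 counts as "close", so `nearFraction` (default 0.9) is an assumption, as are the names `PartitionSizing` and `suggestPartitions`.

```scala
object PartitionSizing {
  val TargetPartitionBytes: Long = 128L * 1024 * 1024 // ~128 MB per partition
  val Threshold: Int = 2000                           // threshold from the slide

  // Suggests a partition count for `inputBytes` of data.
  // `nearFraction` is an assumed cut-off for "close to 2000".
  def suggestPartitions(inputBytes: Long, nearFraction: Double = 0.9): Int = {
    // Ceiling division, with a minimum of one partition.
    val bySize =
      math.max(1L, (inputBytes + TargetPartitionBytes - 1) / TargetPartitionBytes).toInt
    val closeToThreshold = bySize <= Threshold && bySize >= Threshold * nearFraction
    if (closeToThreshold) Threshold + 1 else bySize
  }
}
```

The result can be passed as the `numPartitions` argument of a shuffle operation or to `repartition`.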
Executors, Cores, Memory!?
32 nodes
16 cores each
64 GB of RAM each
If an application needs 32 cores, what is the correct setting?
http://www.slideshare.net/cloudera/top-5-mistakes-to-avoid-when-writing-apache-spark-applications
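The slide leaves the question open, and the sketch below does not answer it. It only enumerates the ways 32 cores can be divided into equally sized executors that each fit on one 16-core node, which shows how many candidate settings there are. The names `ExecutorLayouts` and `layouts` are illustrative.

```scala
object ExecutorLayouts {
  // All (executors, coresPerExecutor) pairs whose product is `totalCores`,
  // with no single executor larger than one node.
  def layouts(totalCores: Int, coresPerNode: Int): Seq[(Int, Int)] =
    for {
      coresPerExecutor <- 1 to coresPerNode
      if totalCores % coresPerExecutor == 0
    } yield (totalCores / coresPerExecutor, coresPerExecutor)
}
```

For the cluster above, `ExecutorLayouts.layouts(32, 16)` ranges from 32 single-core executors to 2 executors with 16 cores each. Which of these performs best depends on memory per executor and on the workload.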
Why Is Spark Debugging / Tuning Hard?
Distributed
Lazy
Hard to benchmark
Spark is sensitive
Conclusion
When in doubt, repartition!
Avoid shuffle if you can
Choose a reasonable partition count
Premature optimization is the root of all evil -- Donald Knuth
Reference
Tuning and Debugging in Apache Spark
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
How-to: Tune Your Apache Spark Jobs (Part 1)
How-to: Tune Your Apache Spark Jobs (Part 2)