Lesson 4

In this lesson, I peel back the layers of Spark and show you some of its internals.

We start with a primer on distributed systems and the theory that underlies them.

From there, I walk through Spark's execution context and how Spark actually executes your code in a distributed fashion on a cluster.
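To make the execution model concrete, here is a minimal PySpark sketch (the data and app name are illustrative) of the core idea: transformations are lazy and only describe a computation, while an action triggers an actual distributed job.

```python
from pyspark import SparkContext

# A local context with 4 worker threads; later in the lesson we
# point this at a real cluster instead.
sc = SparkContext("local[4]", "execution-context-demo")

# Transformations are lazy: none of these lines runs any work yet.
lines = sc.parallelize(["spark is fast", "spark is distributed"])
words = lines.flatMap(lambda line: line.split())
pairs = words.map(lambda word: (word, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)

# Only an action (collect) makes Spark build a DAG of stages and
# ship tasks out to the executors.
print(counts.collect())

sc.stop()
```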

By understanding some of the intricacies of how Spark coordinates program execution, you will learn how to improve the performance of your own Spark applications.
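One of those intricacies is partitioning: the number of partitions in an RDD caps the parallelism of each stage. Here is a small sketch, with arbitrary partition counts, of inspecting and changing it:

```python
from pyspark import SparkContext

sc = SparkContext("local[4]", "partitions-demo")

# Ask for 8 partitions up front when creating the RDD.
rdd = sc.parallelize(range(1000), 8)
print(rdd.getNumPartitions())    # 8

# repartition() performs a full shuffle to redistribute the data,
# changing the degree of parallelism for downstream stages.
wider = rdd.repartition(16)
print(wider.getNumPartitions())  # 16

sc.stop()
```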

Finally, we set up our own cluster on Amazon Web Services, enabling us to fully leverage the performance gains of a distributed system.
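Once the cluster is up, pointing an application at it is a one-line change: instead of a local master, the SparkContext is given the master's URL. A hedged sketch (the host name is a placeholder for the address of the master you will launch in section 4.8):

```python
from pyspark import SparkContext

# "ec2-master-host" is hypothetical; substitute the public DNS of
# your standalone master. 7077 is the default master port.
sc = SparkContext("spark://ec2-master-host:7077", "cluster-demo")

# This sum is now computed by executors spread across the cluster.
rdd = sc.parallelize(range(10**6))
print(rdd.sum())

sc.stop()
```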

Objectives

  • Understand the basics of distributed systems and how they help us scale data storage as well as computation
  • See how Spark and its execution context efficiently run code in a distributed manner
  • Understand the RDD abstraction and the interface it gives us to manipulate distributed datasets
  • Deploy your own Spark cluster on Amazon Web Services
  • Monitor your Spark jobs to tune them and optimize their execution
  • See how Spark can leverage the memory of a cluster to cache data that is used repeatedly (see the sketch after this list)
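
As a preview of the caching material, here is a minimal sketch (the dataset is illustrative) of how persisting an RDD in memory pays off when it is reused across several actions:

```python
from pyspark import SparkContext

sc = SparkContext("local[4]", "caching-demo")

# An RDD we intend to reuse; without cache(), every action below
# would recompute the map from scratch.
squares = sc.parallelize(range(10**6)).map(lambda x: x * x)
squares.cache()

print(squares.count())  # first action: computes and materializes the partitions
print(squares.sum())    # subsequent actions read the cached partitions
print(squares.max())

sc.stop()
```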

Examples

References

4.1: Introduction to Distributed Systems

4.2: Building Systems that Scale

4.3: The Spark Execution Context

4.4: RDD Deep Dive: Dependencies and Lineage

4.5: A Day in the Life of a Spark Application

4.6: How Code Runs: Stages, Tasks, and the Shuffle

4.7: Spark Deployment: Local and Cluster Modes

4.8: Setting Up Your Own Cluster

4.9: Spark Performance: Monitoring and Optimization

4.10: Tuning Your Spark Application

4.11: Making Spark Fly: Parallelism

4.12: Making Spark Fly: Caching