In this lesson, I peel back the layers of Spark and show you some of its internals.
We start with a primer on distributed systems and the theory that underlies them.
From there, I walk through Spark's execution context and how Spark actually executes your code in a distributed fashion on a cluster.
By understanding some of the intricacies of how Spark coordinates program execution, you'll learn how to improve the performance of your Spark applications.
And finally, we finish by setting up our own cluster on Amazon Web Services, enabling us to fully leverage the performance gains of a distributed system.
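To make the execution context concrete before we dive in, here is a minimal PySpark sketch; the app name, thread count, and data are illustrative assumptions, not taken from the lesson materials:

```python
from pyspark import SparkConf, SparkContext

# The SparkContext is the entry point to Spark's execution engine.
# "local[4]" runs Spark locally with 4 worker threads; on a real
# cluster this would instead be a master URL (e.g. spark://host:7077).
conf = SparkConf().setAppName("lesson-demo").setMaster("local[4]")
sc = SparkContext(conf=conf)

# Distribute a small collection across 4 partitions and compute the
# sum of squares in parallel across those partitions.
squares = sc.parallelize(range(10), 4).map(lambda x: x * x).sum()
print(squares)  # 285

sc.stop()
```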
- Understand the basics of distributed systems and how they help us scale data storage as well as computation
- See how Spark and its execution context efficiently run code in a distributed manner
- Understand the RDD abstraction and the interface it gives us to manipulate distributed datasets (sketched in the example after this list)
- Deploy your own Spark cluster on Amazon Web Services
- Monitor your Spark jobs to tune them and optimize their execution
- See how Spark can leverage the memory of a cluster to cache data that is used repeatedly (the sketch after this list shows caching in action)
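To preview both the RDD interface and caching, here is a short sketch that reuses the `sc` context from the example above; the input strings are made up for illustration:

```python
# Transformations (flatMap, map, reduceByKey) are lazy: they only
# build up a lineage describing the computation. Nothing runs on the
# cluster until an action (like collect) forces execution.
lines = sc.parallelize(["spark is fast", "spark is distributed"])
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

# cache() asks Spark to keep the computed RDD in cluster memory after
# the first action materializes it, so later actions reuse the cached
# data instead of recomputing the whole lineage from scratch.
counts.cache()
print(counts.collect())                               # computes and caches
print(counts.filter(lambda kv: kv[1] > 1).collect())  # served from cache
```

This split between lazy transformations and eager actions is what lets Spark plan an entire job at once, a theme that comes back in the performance-tuning material below.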
- 4.5 - 4.6: spark-internals.ipynb
- 4.9 - 4.12: performance-tuning.ipynb
- Distributed Systems: for fun and profit
- Resilience Engineering: Learning to Embrace Failure
- Chaos Monkey
- CAP Theorem: Revisited
- Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services (original paper)
- CAP Twelve Years Later: How the "Rules" Have Changed
- You Can't Sacrifice Partition Tolerance
- How to Beat the CAP Theorem
- Questioning the Lambda Architecture