In this lesson, I peel back the layers of Spark and show you some of its internals.
We start with a primer on distributed systems and the theory that underlies them.
From there, I walk through Spark's execution context and how Spark actually executes your code in a distributed fashion on a cluster.
By understanding some of the intricacies of how Spark coordinates program execution, you'll learn how to improve the performance of your Spark applications.
And finally, we finish by setting up our own cluster on Amazon Web Services, enabling us to fully leverage the performance gains of a distributed system.
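To make the execution context concrete before we dive in, here is a minimal PySpark sketch; the app name, thread count, and data are illustrative assumptions, not taken from the lesson materials:

```python
from pyspark import SparkConf, SparkContext

# The SparkContext is the entry point to Spark's execution engine.
# "local[4]" runs Spark locally with 4 worker threads; on a real
# cluster this would instead be a master URL (e.g. spark://host:7077).
conf = SparkConf().setAppName("lesson-demo").setMaster("local[4]")
sc = SparkContext(conf=conf)

# Distribute a small collection across 4 partitions and compute the
# sum of squares in parallel across those partitions.
squares = sc.parallelize(range(10), 4).map(lambda x: x * x).sum()
print(squares)  # 285

sc.stop()
```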
- Understand the basics of distributed systems and how they help us scale data storage as well as computation
- See how Spark and its execution context efficiently run code in a distributed manner
- Understand the RDD abstraction and the interface it gives us to manipulate distributed datasets (sketched in the example after this list)
- Deploy your own Spark cluster on Amazon Web Services
- Monitor your Spark jobs to tune them and optimize their execution
- See how Spark can leverage the memory of a cluster to cache data that is used repeatedly (the sketch after this list shows caching in action)
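To preview both the RDD interface and caching, here is a short sketch that reuses the `sc` context from the example above; the input strings are made up for illustration:

```python
# Transformations (flatMap, map, reduceByKey) are lazy: they only
# build up a lineage describing the computation. Nothing runs on the
# cluster until an action (like collect) forces execution.
lines = sc.parallelize(["spark is fast", "spark is distributed"])
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

# cache() asks Spark to keep the computed RDD in cluster memory after
# the first action materializes it, so later actions reuse the cached
# data instead of recomputing the whole lineage from scratch.
counts.cache()
print(counts.collect())                               # computes and caches
print(counts.filter(lambda kv: kv[1] > 1).collect())  # served from cache
```

This split between lazy transformations and eager actions is what lets Spark plan an entire job at once, a theme that comes back in the performance-tuning material below.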
- 4.5 - 4.6: spark-internals.ipynb
- 4.9 - 4.12: performance-tuning.ipynb
- Distributed Systems: for fun and profit
- Resilience Engineering: Learning to Embrace Failure
- Chaos Monkey
- CAP Theorem: Revisited
- Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services (original paper)
- CAP Twelve Years Later: How the "Rules" Have Changed
- You Can't Sacrifice Partition Tolerance
- How to Beat the CAP Theorem
- Questioning the Lambda Architecture