lucca-miorelli/spacex-data-engineering

SpaceX API - Data Engineering Project

ETL for /launches endpoint

👋 Hey! This repository contains an application that extracts, transforms, and loads launch data from the SpaceX API into a Postgres database, and then runs a few queries on top of it. The tools used are Python, Postgres, Docker, Docker Compose, and MinIO.

It also contains some theoretical explanations regarding AWS architecture, Kubernetes orchestration, and Terraform development. Follow this README file and click on the links as directed for detailed explanations of each section.

Table of Contents

  1. How to run this application?
  2. Part 1 - Data Infrastructure on AWS
  3. Part 2 - Orchestration and Container Management with Kubernetes
  4. Part 3 - ETL Pipeline
  5. Part 4 - SQL Queries

How to run this application?

Please refer to docs/setup.md for further instructions.

Data Infrastructure on AWS

Refer to docs/part_1_architecture_and_security.md for architectural diagrams and security discussions in AWS.

For a discussion on configuring Terraform to create a simple Redshift cluster on AWS, refer to docs/part_1_terraform_redshit.md.

Orchestration and Container Management with Kubernetes

For steps on deploying a Kubernetes cluster and configuring monitoring and logging, refer to docs/part_2_kubernetes.md.

ETL Pipeline

The script that extracts, transforms, and loads data can be found at app/processing/launches.py.

For further explanation, refer to docs/part_3_etl.md.

For information on scheduling and monitoring this pipeline using Apache Airflow, refer to docs/part_3_airflow.md.
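
As a rough illustration of the extract and transform steps, here is a minimal sketch. It is not the code in app/processing/launches.py. It assumes the v4 endpoint URL and the `id`, `date_utc`, and `cores` fields of a launch record, and the function names `extract` and `transform` are hypothetical.

```python
import json
from urllib.request import urlopen

# Assumed endpoint (v4 of the public SpaceX API); the project may use another version.
LAUNCHES_URL = "https://api.spacexdata.com/v4/launches"


def extract(url=LAUNCHES_URL):
    """Download the raw list of launch records as Python dicts."""
    with urlopen(url, timeout=30) as response:
        return json.load(response)


def transform(launches):
    """Flatten launches into one (launch_id, core_id, flight, date_utc) row per core.

    The field names are assumptions about the API payload, not taken from the repo.
    """
    rows = []
    for launch in launches:
        for core in launch.get("cores") or []:
            # Some launches list a core slot with no core assigned; skip those.
            if core.get("core") is None:
                continue
            rows.append(
                (launch["id"], core["core"], core.get("flight"), launch.get("date_utc"))
            )
    return rows
```

Calling `transform(extract())` would yield rows ready to be inserted into a Postgres table by whichever driver the project uses; the load step is left out here because its table layout is described in docs/part_3_etl.md.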

SQL Queries

SQL queries for

  • finding the maximum number of times a core has been reused;
  • finding the cores that have been reused less than 50 days after their previous launch

can be found at max_core_reuse.sql and cores_less_than_50_days.sql, respectively.
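
To give a feel for the two queries, here is a minimal sketch that runs comparable SQL from Python. It is not the content of the .sql files. It swaps Postgres for an in-memory SQLite database so that it runs standalone, and the `launch_cores` table, its columns, and the sample rows are hypothetical. It also assumes that "times reused" means the number of flights minus one.

```python
import sqlite3

# Hypothetical table: one row per (launch, core) pair.
SCHEMA = """
CREATE TABLE launch_cores (
    launch_id TEXT,
    core_id   TEXT,
    date_utc  TEXT  -- ISO 8601 date
);
"""

# Assumption: a core flown N times has been reused N - 1 times.
MAX_REUSE_SQL = """
SELECT core_id, COUNT(*) - 1 AS times_reused
FROM launch_cores
GROUP BY core_id
ORDER BY times_reused DESC, core_id
LIMIT 1;
"""

# LAG fetches the previous launch date of the same core; julianday is
# SQLite-specific (in Postgres, subtracting two dates gives the day count).
QUICK_REUSE_SQL = """
SELECT core_id, launch_id, days_since_previous
FROM (
    SELECT core_id,
           launch_id,
           julianday(date_utc)
             - julianday(LAG(date_utc) OVER (
                   PARTITION BY core_id ORDER BY date_utc))
             AS days_since_previous
    FROM launch_cores
)
WHERE days_since_previous < 50
ORDER BY core_id, launch_id;
"""


def run_demo():
    """Load made-up sample rows and return the results of both queries."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT INTO launch_cores VALUES (?, ?, ?)",
        [
            ("L1", "A", "2020-01-01"),
            ("L2", "A", "2020-01-31"),  # 30 days after L1
            ("L3", "A", "2020-06-01"),  # well over 50 days after L2
            ("L4", "B", "2020-03-01"),  # first flight of core B
        ],
    )
    max_reuse = conn.execute(MAX_REUSE_SQL).fetchone()
    quick_reuse = conn.execute(QUICK_REUSE_SQL).fetchall()
    conn.close()
    return max_reuse, quick_reuse
```

The window function keeps the second query to a single pass over the table; a self-join on consecutive flights would be an alternative.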


Extra: The image below shows the core that has been reused the most (14 times) and each flight it was used on.

(Image: network_flights_core)

About

Extract and analyze SpaceX API data.
