GitHub - alanchn31/Data-Engineering-Projects: Personal Data Engineering Projects

Description

This project looks at data modelling for a fictitious music startup, Sparkify. It applies a star schema to the ingested data to simplify the queries that answer business questions the product owner may have.
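As a rough illustration of the star schema idea, the sketch below builds one fact table and two dimension tables and answers a single business question with a join. It uses Python's built-in `sqlite3` purely so that it runs anywhere; the table names, columns, and rows are assumptions made for the example, not the project's actual schema.

```python
import sqlite3

# Hypothetical star schema: one fact table (songplays) surrounded by
# dimension tables (users, songs). All names are illustrative only.
SCHEMA = """
CREATE TABLE users (
    user_id INTEGER PRIMARY KEY,
    name    TEXT,
    level   TEXT
);
CREATE TABLE songs (
    song_id TEXT PRIMARY KEY,
    title   TEXT,
    artist  TEXT
);
CREATE TABLE songplays (
    songplay_id INTEGER PRIMARY KEY,
    user_id     INTEGER REFERENCES users(user_id),
    song_id     TEXT REFERENCES songs(song_id),
    start_time  TEXT
);
"""


def build_demo_db():
    """Create an in-memory database and load a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO users VALUES (?, ?, ?)",
        [(1, "Ana", "paid"), (2, "Ben", "free")],
    )
    conn.executemany(
        "INSERT INTO songs VALUES (?, ?, ?)",
        [("S1", "Song A", "Artist X"), ("S2", "Song B", "Artist Y")],
    )
    conn.executemany(
        "INSERT INTO songplays (user_id, song_id, start_time) VALUES (?, ?, ?)",
        [
            (1, "S1", "2018-11-01T10:00"),
            (1, "S2", "2018-11-01T10:05"),
            (2, "S1", "2018-11-02T09:30"),
        ],
    )
    conn.commit()
    return conn


def plays_per_song(conn):
    """Business question: how many times was each song played?"""
    query = """
        SELECT s.title, COUNT(*) AS plays
        FROM songplays sp
        JOIN songs s ON s.song_id = sp.song_id
        GROUP BY s.title
        ORDER BY plays DESC, s.title
    """
    return conn.execute(query).fetchall()
```

Calling `plays_per_song(build_demo_db())` joins the fact table to a single dimension, which is the query shape a star schema is meant to keep simple.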

In the realm of big data, Cassandra helps to ingest large amounts of data in a NoSQL context. This project adopts a query-centric approach to ingesting data into Cassandra tables, so that they answer business questions about a music app.
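A query-centric design starts from the question and derives the table from it. The sketch below is one way to picture that, under stated assumptions: the question, the CQL table, and all column names are hypothetical, and the class is a plain in-memory stand-in rather than a Cassandra client.

```python
from collections import defaultdict

# The query comes first (hypothetical example):
#   "Which songs were played in a given session, in order?"
# A matching CQL table might look like this (shown for reference, not executed):
SESSION_SONGS_CQL = """
CREATE TABLE IF NOT EXISTS songs_by_session (
    session_id      int,
    item_in_session int,
    artist          text,
    song            text,
    PRIMARY KEY (session_id, item_in_session)
)
"""


class SongsBySession:
    """In-memory stand-in for the table above.

    session_id plays the role of the partition key and item_in_session
    the clustering column, so one lookup reads a single partition and
    returns rows in clustering order.
    """

    def __init__(self):
        self._partitions = defaultdict(dict)

    def insert(self, session_id, item_in_session, artist, song):
        # Writing the same key twice overwrites the row (an upsert).
        self._partitions[session_id][item_in_session] = (artist, song)

    def select(self, session_id):
        rows = self._partitions.get(session_id, {})
        return [rows[key] for key in sorted(rows)]
```

A second business question would usually get its own table with a different primary key, rather than a join or a secondary filter on this one.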

One way to store semi-structured data is as documents. MongoDB makes this possible: a collection holds related documents, and each document contains fields of data that can be queried.
In this project, data is scraped from a book listing website using Scrapy. The fields of each book, such as its price, rating, and availability, are stored in a document in the books collection in MongoDB.
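To make the document shape concrete, here is a minimal sketch that turns raw scraped strings into a dictionary of typed fields. The field names and the sample values are assumptions for illustration; the project's actual item definition may differ.

```python
import json
import re


def to_book_document(raw):
    """Turn raw scraped strings into a document for a books collection."""
    # Pull the numeric part out of a price string such as "£51.77".
    price_match = re.search(r"\d+(?:\.\d+)?", raw["price"])
    return {
        "title": raw["title"].strip(),
        "price": float(price_match.group()) if price_match else None,
        "rating": int(raw["rating"]),
        "in_stock": "in stock" in raw["availability"].lower(),
    }


# Made-up scraped item, not taken from the project.
raw_item = {
    "title": "  Example Book ",
    "price": "£51.77",
    "rating": "3",
    "availability": "In stock (22 available)",
}
document = to_book_document(raw_item)
print(json.dumps(document, indent=2))
```

With the `pymongo` driver, a dictionary like `document` could be passed to a collection's `insert_one` method; that step is left out here so the sketch runs without a database.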

This project creates a data warehouse in AWS Redshift. A data warehouse provides a reliable and consistent foundation for users to query and answer business questions based on requirements.

This project creates a data lake in AWS S3 using Spark.
Why create a data lake? A data lake provides a reliable store for large amounts of data, from unstructured to semi-structured and even structured data. In this project, we ingest JSON files, denormalize them into fact and dimension tables, and upload them to an AWS S3 data lake as Parquet files.
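The reshaping step can be pictured with a small standard-library sketch. It splits JSON event lines into a deduplicated dimension table and a fact table; the project does this with Spark and writes Parquet to S3, which this sketch does not attempt. The event fields and values are made up for the example.

```python
import json

# Made-up JSON lines standing in for the raw event files.
RAW_EVENTS = """\
{"user_id": 1, "name": "Ana", "song": "Song A", "ts": 1541106106796}
{"user_id": 2, "name": "Ben", "song": "Song A", "ts": 1541106352796}
{"user_id": 1, "name": "Ana", "song": "Song B", "ts": 1541106496796}
"""


def split_events(lines):
    """Split raw events into a users dimension and a songplays fact table."""
    users = {}      # keyed by user_id so each user appears once
    songplays = []  # one row per event
    for line in lines.splitlines():
        if not line.strip():
            continue
        event = json.loads(line)
        users[event["user_id"]] = {
            "user_id": event["user_id"],
            "name": event["name"],
        }
        songplays.append(
            {"user_id": event["user_id"], "song": event["song"], "ts": event["ts"]}
        )
    return list(users.values()), songplays


users_dim, songplays_fact = split_events(RAW_EVENTS)
```

The fact rows keep only the `user_id` key, so descriptive attributes such as `name` live in one place, the dimension table.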

This project schedules data pipelines that perform ETL from JSON files in S3 to Redshift using Airflow.
Why use Airflow? Airflow allows workflows to be defined as code, which makes them more maintainable, versionable, testable, and collaborative.
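The idea of a workflow defined as code can be sketched without Airflow itself. The example below describes a pipeline as a dependency graph and derives a valid execution order with the standard library's `graphlib` (Python 3.9+). It is not Airflow's API, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical task names; each task maps to the set of tasks it depends on.
PIPELINE = {
    "stage_events": {"begin"},
    "stage_songs": {"begin"},
    "load_fact_table": {"stage_events", "stage_songs"},
    "load_dimension_tables": {"load_fact_table"},
    "run_quality_checks": {"load_dimension_tables"},
}


def execution_order(pipeline):
    """Return one order in which every task runs after its dependencies."""
    return list(TopologicalSorter(pipeline).static_order())


order = execution_order(PIPELINE)
```

Because the graph is ordinary code, it can be reviewed, versioned, and unit tested like any other module, which is the benefit the paragraph above describes.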

This project is the finale of Udacity's data engineering nanodegree. Udacity provides a default dataset; however, I chose to embark on my own project.
My project builds a movies data warehouse, which can be used to build a movie recommendation system as well as to predict box-office earnings. View the project here: Movies Data Warehouse

0. Back to Basics
1. Postgres ETL
2. Cassandra ETL
3. Web Scraping using Scrapy, Mongo ETL
4. Data Warehousing with AWS Redshift
5. Data Lake with Spark & AWS S3
6. Data Pipelining with Airflow
.gitignore
README.md