A simple, illustrative Amazon Data Pipeline example. It runs a containerized software package, i.e., an instance of a container image that is pulled from Docker Hub. A version of the pipeline's definition code outlines the function of each `ShellCommandActivity` node; each node runs one of the augmentation pipeline scripts.
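As a rough illustration of what such a definition might look like, the sketch below assembles a minimal pipeline definition in which each `ShellCommandActivity` node runs one script inside the container. It is not the project's actual definition: the image name, script names, node identifiers, the `docker run` invocation, and the sequential `dependsOn` chaining are all assumptions made for the example, and a deployable definition would also need a schedule, roles, and resource settings.

```python
import json


def shell_activity(activity_id, command, resource_id, depends_on=None):
    """Return one ShellCommandActivity object for a pipeline definition."""
    node = {
        "id": activity_id,
        "name": activity_id,
        "type": "ShellCommandActivity",
        "command": command,
        "runsOn": {"ref": resource_id},
    }
    # Assumption: the scripts run in sequence, so each node waits for the previous one.
    if depends_on is not None:
        node["dependsOn"] = {"ref": depends_on}
    return node


def build_definition(image, scripts, resource_id="Ec2Instance"):
    """Create one activity per script; each runs the script in the container image."""
    # Hypothetical, incomplete resource object; real definitions need more fields.
    objects = [{"id": resource_id, "name": resource_id, "type": "Ec2Resource"}]
    previous = None
    for index, script in enumerate(scripts, start=1):
        activity_id = f"Step{index}"
        command = f"docker run --rm {image} {script}"
        objects.append(shell_activity(activity_id, command, resource_id, previous))
        previous = activity_id
    return {"objects": objects}


if __name__ == "__main__":
    # Hypothetical image and script names, for illustration only.
    definition = build_definition(
        "example/augmentation:latest", ["prepare.sh", "augment.sh"]
    )
    print(json.dumps(definition, indent=2))
```

Keeping the node construction in a small function means the definition can be regenerated whenever the list of scripts changes, instead of editing the JSON by hand.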
In brief, the projects herein aim to simplify the repeated use of a variety of (a) frameworks, (b) cloud services, and (c) cloud platforms. The notes and examples are for reference purposes, and the notes will be updated continuously.
These projects focus on templates and small software packages that ease the use of cloud services and platforms:
- A program that assigns a specified VPC Elastic IP address to an EMR cluster during the cluster's launch.
- An Amazon Data Pipeline architecture example; this is an infrastructure-as-code illustration.
- An Amazon EMR (Elastic MapReduce) cluster launch example.
A wide variety of frameworks are used for data science, statistics, data engineering, and machine learning engineering, e.g., Apache Spark and Apache Hive.
Each framework project below is akin to a suite of programs that can be used for any data project that uses the framework, saving time and reducing or eliminating some repetitive steps.
For example, the hive project simplifies the process of projecting a structure onto a repository of data files via Apache Hive; it more or less parameterises steps that are repeated per data project.
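To make the idea of parameterising repeated Hive steps concrete, here is a minimal sketch that generates a `CREATE EXTERNAL TABLE` statement from a few per-project parameters. It is not the hive project's code: the function name, its parameters, the delimited text file format, and the example database, table, and location are assumptions chosen for illustration.

```python
def external_table_ddl(database, table, columns, location, delimiter=","):
    """Build a Hive CREATE EXTERNAL TABLE statement from per-project parameters.

    columns is a list of (name, hive_type) pairs, e.g. [("id", "INT")].
    Assumption: the data files are delimited text files.
    """
    if not columns:
        raise ValueError("at least one column is required")
    fields = ",\n".join(f"  `{name}` {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{database}`.`{table}` (\n"
        f"{fields}\n"
        ")\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        "STORED AS TEXTFILE\n"
        f"LOCATION '{location}'"
    )


if __name__ == "__main__":
    # Hypothetical database, table, columns, and location.
    print(
        external_table_ddl(
            "flights",
            "trips",
            [("trip_id", "INT"), ("started_at", "STRING")],
            "s3://example-bucket/trips/",
        )
    )
```

Only the arguments change from one data project to the next; the statement template stays the same, which is the kind of repetition the project aims to remove.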
Each project's notes are within that project's section.