[AWS] Data Pipeline

AWS Data Pipeline is a managed ETL (Extract, Transform, Load) service for moving and transforming data between AWS compute and storage services.

Features

Data-driven workflow
Parameters for data transformations
Highly available with distributed infrastructure
Automatic retries for failed activities
- Notification with SNS
Integration with AWS storage and database services
- DynamoDB, RDS, Redshift, S3
Integration with AWS compute services
- EC2 and EMR
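
The retry and notification features above are configured as fields on the objects in a pipeline definition. The sketch below shows roughly how an activity could declare a retry limit and point to an SNS alarm on failure. The object IDs, the topic ARN, and the IAM role name are placeholders chosen for illustration, not values from a real account.

```python
import json

# SNS alarm object. The topic ARN and role name are placeholders (assumptions).
failure_alarm = {
    "id": "FailureAlarm",
    "name": "FailureAlarm",
    "type": "SnsAlarm",
    "topicArn": "arn:aws:sns:us-east-1:123456789012:example-topic",
    "subject": "Data Pipeline activity failed",
    "message": "The copy activity failed after all retries.",
    "role": "DataPipelineDefaultRole",
}

# Activity that retries up to three times, then triggers the alarm.
# Input, output, and compute fields are omitted to keep the focus on
# retries and notification.
copy_activity = {
    "id": "CopyData",
    "name": "CopyData",
    "type": "CopyActivity",
    "maximumRetries": "3",
    "onFail": {"ref": failure_alarm["id"]},
}

# Pipeline definitions are JSON documents with a top-level "objects" list.
fragment = {"objects": [failure_alarm, copy_activity]}
print(json.dumps(fragment, indent=2))
```

Objects refer to each other by ID through {"ref": "..."}, which is how the activity is linked to its alarm here.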

Components

Pipeline Definition
Compute Resource
- Create new EC2 instances or reuse existing ones
Task Runners
Data Nodes
- Locations and types of data
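
To make the components above concrete, here is a minimal sketch of a pipeline definition built in Python: two data nodes, one compute resource, and one activity that ties them together. The bucket paths, object IDs, instance type, and role names are hypothetical. The helper unresolved_refs is an illustrative check written for this example, not part of any AWS SDK.

```python
import json


def build_definition(source_path, dest_path):
    """Return a pipeline definition that copies one S3 file to another."""
    objects = [
        # Pipeline-wide defaults (role names are placeholders).
        {
            "id": "Default",
            "name": "Default",
            "scheduleType": "ondemand",
            "role": "DataPipelineDefaultRole",
            "resourceRole": "DataPipelineDefaultResourceRole",
        },
        # Data nodes: where the data lives and what kind of store it is.
        {"id": "InputData", "name": "InputData",
         "type": "S3DataNode", "filePath": source_path},
        {"id": "OutputData", "name": "OutputData",
         "type": "S3DataNode", "filePath": dest_path},
        # Compute resource: an EC2 instance launched for this pipeline.
        {"id": "CopyInstance", "name": "CopyInstance",
         "type": "Ec2Resource", "instanceType": "t2.micro",
         "terminateAfter": "1 Hour"},
        # Activity: the work to do, wired to its data nodes and resource.
        {"id": "CopyData", "name": "CopyData", "type": "CopyActivity",
         "input": {"ref": "InputData"},
         "output": {"ref": "OutputData"},
         "runsOn": {"ref": "CopyInstance"}},
    ]
    return {"objects": objects}


def unresolved_refs(definition):
    """List (object id, missing id) pairs for refs that point nowhere."""
    ids = {obj["id"] for obj in definition["objects"]}
    missing = []
    for obj in definition["objects"]:
        for value in obj.values():
            if isinstance(value, dict) and value.get("ref") not in ids:
                missing.append((obj["id"], value.get("ref")))
    return missing


definition = build_definition(
    "s3://example-source-bucket/data/input.csv",
    "s3://example-dest-bucket/data/output.csv",
)
print(json.dumps(definition, indent=2))
print("Unresolved references:", unresolved_refs(definition))
```

No Task Runner object appears in the definition. For resources that Data Pipeline launches itself, such as the Ec2Resource here, the service is expected to run Task Runner on the instance to poll for and execute the activity.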

Use Cases

Processing data in EMR
Importing or exporting data to/from DynamoDB
Copying data files between S3 buckets
Copying data to Redshift
Exporting data from RDS to S3

Published by P. L.
