AWS Data Pipeline is a managed ETL (Extract, Transform, Load) service that automates the movement and transformation of data between AWS compute and storage services.
Features
- Data-driven workflow
- Parameters for data transformations
- Highly available, running on distributed infrastructure
- Automatic retry of failed activities
- Failure and success notifications via Amazon SNS
- Integration with AWS storage services
  - DynamoDB, RDS, Redshift, S3
- Integration with AWS compute services
  - EC2 and EMR
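The retry and notification features above are configured per activity in the pipeline definition. A minimal sketch, assuming a hypothetical copy activity and SNS topic; the field names (`maximumRetries`, `onFail`, `topicArn`, and the `#{node.name}` expression) follow AWS Data Pipeline's definition syntax:

```python
# Sketch of pipeline-definition objects showing automatic retries and
# SNS notification. Ids and the topic ARN are illustrative placeholders.

failure_alarm = {
    "id": "FailureAlarm",
    "type": "SnsAlarm",
    "topicArn": "arn:aws:sns:us-east-1:123456789012:pipeline-alerts",
    "subject": "Pipeline activity failed",
    "message": "Activity #{node.name} failed.",
}

copy_activity = {
    "id": "MyCopyActivity",
    "type": "CopyActivity",
    "maximumRetries": "3",              # retried up to 3 times on failure
    "onFail": {"ref": "FailureAlarm"},  # SNS notification when retries are exhausted
}
```

An `onSuccess` reference to a second SnsAlarm object works the same way for success notifications.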
Components
- Pipeline Definition
  - Specifies the business logic of the pipeline
- Compute Resources
  - Create new EC2 instances or reuse existing ones
- Task Runners
  - Poll the pipeline for tasks and execute them
- Data Nodes
  - Locations and types of input/output data
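These components all appear as objects in the pipeline definition, which boto3's `datapipeline` client accepts in an `{"id", "name", "fields": [...]}` shape via `put_pipeline_definition`. A sketch with a small helper doing the conversion; object ids, instance type, and the S3 path are illustrative placeholders:

```python
# Convert plain {key: value} definition objects into the
# {"id", "name", "fields": [...]} shape expected by boto3's
# datapipeline.put_pipeline_definition(). References to other
# objects become refValue entries; everything else is a stringValue.

def to_pipeline_object(obj_id, name, fields):
    out = []
    for key, value in fields.items():
        if isinstance(value, dict) and "ref" in value:
            out.append({"key": key, "refValue": value["ref"]})
        else:
            out.append({"key": key, "stringValue": value})
    return {"id": obj_id, "name": name, "fields": out}

ec2 = to_pipeline_object("MyEC2", "MyEC2", {
    "type": "Ec2Resource",            # compute resource
    "instanceType": "t2.micro",
    "terminateAfter": "1 Hour",
})

input_node = to_pipeline_object("InputS3", "InputS3", {
    "type": "S3DataNode",             # data node: location and type of data
    "directoryPath": "s3://example-bucket/input/",
})

copy = to_pipeline_object("CopyData", "CopyData", {
    "type": "CopyActivity",
    "input": {"ref": "InputS3"},
    "runsOn": {"ref": "MyEC2"},       # a task runner on this EC2 resource executes it
})

# Registering the definition would then be (requires AWS credentials):
# import boto3
# boto3.client("datapipeline").put_pipeline_definition(
#     pipelineId=pipeline_id, pipelineObjects=[ec2, input_node, copy])
```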
Use Cases
- Processing data in EMR
- Importing or exporting data to/from DynamoDB
- Copying data files between S3 buckets
- Copying data to Redshift
- Exporting data from RDS to S3
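The RDS-to-S3 export case, for example, combines the components above: an SqlDataNode reads from the RDS table and a CopyActivity writes the result to an S3DataNode. A sketch using Data Pipeline's object types; table, query, and bucket names are illustrative placeholders, and the referenced database and EC2 objects are assumed to be defined elsewhere:

```python
# Sketch of an "export RDS to S3" pipeline definition, expressed as the
# JSON-style objects AWS Data Pipeline uses. All names are placeholders.

rds_source = {
    "id": "SourceTable",
    "type": "SqlDataNode",
    "database": {"ref": "MyRdsDatabase"},  # an RdsDatabase object, defined elsewhere
    "table": "orders",
    "selectQuery": "select * from orders",
}

s3_output = {
    "id": "OutputFile",
    "type": "S3DataNode",
    "filePath": "s3://example-bucket/exports/orders.csv",
}

export = {
    "id": "RdsToS3",
    "type": "CopyActivity",
    "input": {"ref": "SourceTable"},
    "output": {"ref": "OutputFile"},
    "runsOn": {"ref": "MyEC2"},            # EC2 compute resource, defined elsewhere
}

definition = {"objects": [rds_source, s3_output, export]}
```

The other use cases swap the node types: S3DataNodes on both sides for bucket-to-bucket copies, or a RedshiftDataNode as the output for loads into Redshift.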