[AWS] Data Pipeline

AWS Data Pipeline is a managed ETL (Extract, Transform, Load) service for scheduling data movement and processing across AWS compute and storage services.


Features

  • Data-driven workflow
  • Parameters for data transformations
  • Highly available with distributed infrastructure
  • Automatic retries of failed activities
    • Failure notifications via SNS
  • Integration with AWS Storage services
    • DynamoDB, RDS, Redshift, S3
  • Integration with AWS Compute
    • EC2 and EMR

Components

  • Pipeline Definition
  • Compute Resource
    • Create new EC2 instances or reuse existing ones
  • Task Runners
  • Data Nodes
    • Locations and types of data
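
To show how these components fit together, here is a minimal sketch of a pipeline definition built as a Python dict and printed as JSON. It wires two data nodes, one EC2 compute resource, a copy activity with retries, and an SNS alarm. The function name `build_copy_pipeline`, the bucket paths, the topic ARN, the instance type, and the role names are all assumptions for illustration; check the object and field names against the Data Pipeline object reference before using them.

```python
import json


def build_copy_pipeline(source_dir, dest_dir, topic_arn):
    """Build a pipeline definition dict (all argument values are hypothetical)."""
    return {
        "objects": [
            # Default object: settings inherited by the other objects.
            {
                "id": "Default",
                "name": "Default",
                "scheduleType": "ondemand",
                "role": "DataPipelineDefaultRole",  # assumed role name
                "resourceRole": "DataPipelineDefaultResourceRole",  # assumed
            },
            # Data nodes: location and type of the data.
            {"id": "SourceData", "name": "SourceData",
             "type": "S3DataNode", "directoryPath": source_dir},
            {"id": "DestData", "name": "DestData",
             "type": "S3DataNode", "directoryPath": dest_dir},
            # Compute resource: a new EC2 instance that runs the activity.
            {"id": "CopyInstance", "name": "CopyInstance",
             "type": "Ec2Resource", "instanceType": "t2.micro",
             "terminateAfter": "1 Hour"},
            # Notification sent when the activity finally fails.
            {"id": "FailureAlarm", "name": "FailureAlarm",
             "type": "SnsAlarm", "topicArn": topic_arn,
             "subject": "Copy failed", "message": "The copy activity failed.",
             "role": "DataPipelineDefaultRole"},
            # Activity: retried automatically before the alarm fires.
            {"id": "CopyFiles", "name": "CopyFiles",
             "type": "CopyActivity",
             "input": {"ref": "SourceData"},
             "output": {"ref": "DestData"},
             "runsOn": {"ref": "CopyInstance"},
             "maximumRetries": "3",
             "onFail": {"ref": "FailureAlarm"}},
        ]
    }


if __name__ == "__main__":
    definition = build_copy_pipeline(
        "s3://example-source-bucket/in",   # hypothetical bucket
        "s3://example-dest-bucket/out",    # hypothetical bucket
        "arn:aws:sns:us-east-1:111122223333:example-topic",  # hypothetical ARN
    )
    print(json.dumps(definition, indent=2))
```

Objects refer to each other by `id` through `{"ref": ...}` values, which is how the activity is tied to its data nodes, its compute resource, and its failure alarm.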

Use Cases

  • Processing data in EMR
  • Importing or exporting data to/from DynamoDB
  • Copying data files between S3 buckets
  • Copying data to Redshift
  • Exporting data from RDS to S3
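
Any of these use cases follows the same steps: create a pipeline, upload a definition, and activate it. The sketch below assumes boto3 and its `datapipeline` client; `to_api_objects` and `submit_pipeline` are hypothetical helper names. The API expects each object as an `id`, a `name`, and a list of `fields`, so the helper converts a JSON-style definition into that shape.

```python
def to_api_objects(definition):
    """Convert a JSON-style definition into the API's pipelineObjects list."""
    api_objects = []
    for obj in definition["objects"]:
        fields = []
        for key, value in obj.items():
            if key in ("id", "name"):
                continue
            if isinstance(value, dict) and "ref" in value:
                # Reference to another pipeline object.
                fields.append({"key": key, "refValue": value["ref"]})
            else:
                fields.append({"key": key, "stringValue": str(value)})
        api_objects.append({
            "id": obj["id"],
            "name": obj.get("name", obj["id"]),
            "fields": fields,
        })
    return api_objects


def submit_pipeline(name, unique_id, definition):
    """Create, define, and activate a pipeline (requires AWS credentials)."""
    import boto3  # imported here so the converter works without boto3

    client = boto3.client("datapipeline")
    pipeline_id = client.create_pipeline(
        name=name, uniqueId=unique_id)["pipelineId"]
    result = client.put_pipeline_definition(
        pipelineId=pipeline_id,
        pipelineObjects=to_api_objects(definition),
    )
    if result["errored"]:
        raise RuntimeError(result["validationErrors"])
    client.activate_pipeline(pipelineId=pipeline_id)
    return pipeline_id
```

Calling `submit_pipeline("copy-files", "copy-files-001", definition)` with a definition dict would create and start the pipeline; both string arguments here are placeholder values.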
