AWS Glue is a serverless data integration service that provides fully managed extract, transform, and load (ETL) functionality.
Overview
You point AWS Glue at your data stored on AWS, and AWS Glue discovers the data and stores the associated metadata (e.g. table definitions and schemas) in the AWS Glue Data Catalog. Once cataloged, the data is searchable, queryable, and available for ETL.
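The "point AWS Glue at your data" step can be sketched with boto3. This is a minimal sketch, not a definitive setup: the crawler name, IAM role ARN, database name, and S3 path are all hypothetical placeholders, and the region default is an assumption. `create_crawler` and `start_crawler` are existing boto3 Glue client calls; the helper functions around them are invented for illustration.

```python
from typing import Any, Dict


def build_crawler_request(name: str, role_arn: str, database: str, s3_path: str) -> Dict[str, Any]:
    """Build the keyword arguments for the Glue CreateCrawler API call."""
    return {
        "Name": name,
        "Role": role_arn,          # IAM role the crawler assumes to read the data
        "DatabaseName": database,  # Data Catalog database that receives the tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_and_start_crawler(request: Dict[str, Any], region: str = "us-east-1") -> None:
    """Create the crawler and start one run (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the request builder works without boto3

    glue = boto3.client("glue", region_name=region)
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


# Hypothetical values, for illustration only.
request = build_crawler_request(
    name="sales-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
    database="sales_db",
    s3_path="s3://example-bucket/sales/",
)
```

Calling `create_and_start_crawler(request)` against a real account would register the crawler and trigger a run; the tables it creates should then appear in the `sales_db` database of the Data Catalog.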
Components
- Source Data Store
  - e.g. Amazon S3, Amazon RDS, Amazon DynamoDB
- Crawler
  - A program that connects to a data store, determines the schema of your data, and creates metadata tables in the Data Catalog.
- Data Catalog
  - Persistent metadata store
  - Each AWS account has one AWS Glue Data Catalog per AWS Region.
  - A Data Catalog is a collection of databases, each of which is a collection of tables.
  - A Database is used to organize metadata tables.
  - A Table is the metadata representation of a collection of your structured or semi-structured data.
- Job
  - The business logic to perform the ETL task
  - Written in Python or Scala
- Source Data Store -> (Crawler) -> Data Catalog -> (Job) -> Output Data Store
- Data Store
  - Data source and Data target
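The Data Catalog hierarchy described above (catalog -> databases -> tables) can be walked with the boto3 Glue paginators. This is a rough sketch under assumptions: `list_catalog_tables` is a hypothetical helper name, and the client is passed in so the function can be tried with any object that exposes `get_paginator`. The `get_databases` and `get_tables` paginators and the `DatabaseList`, `TableList`, and `StorageDescriptor` response keys are part of the Glue API.

```python
from typing import Any, Dict, List


def list_catalog_tables(glue_client: Any) -> Dict[str, Dict[str, List[str]]]:
    """Return {database: {table: [column names]}} for the client's account and Region."""
    catalog: Dict[str, Dict[str, List[str]]] = {}
    # One Data Catalog per account per Region: iterate over its databases.
    for db_page in glue_client.get_paginator("get_databases").paginate():
        for database in db_page["DatabaseList"]:
            db_name = database["Name"]
            catalog[db_name] = {}
            # Each database organizes a set of metadata tables.
            table_pages = glue_client.get_paginator("get_tables").paginate(DatabaseName=db_name)
            for table_page in table_pages:
                for table in table_page["TableList"]:
                    # Some tables carry no column metadata, so default to an empty list.
                    columns = table.get("StorageDescriptor", {}).get("Columns", [])
                    catalog[db_name][table["Name"]] = [col["Name"] for col in columns]
    return catalog
```

With real credentials this would be called as `list_catalog_tables(boto3.client("glue", region_name="us-east-1"))`; the region is an assumption, and the result only covers the catalog of that one Region.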
Working with Glue Jobs
Job Parameters

