[AWS] AWS Glue

AWS Glue is a serverless data integration service that provides fully managed extract, transform, and load (ETL) functionality.

Overview

You simply point AWS Glue to your data stored on AWS, and AWS Glue discovers your data and stores the associated metadata (e.g. table definition and schema) in the AWS Glue Data Catalog. Once cataloged, your data is immediately searchable, queryable, and available for ETL.
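As a rough illustration of what "cataloged" means in practice, the sketch below reads back the table definitions and column schemas stored in one Data Catalog database. It assumes a boto3 Glue client is passed in as `glue`; the helper name `list_table_schemas` and the database name `sales_db` are hypothetical. `get_tables` and the `TableList`, `StorageDescriptor`, `Columns`, and `NextToken` response fields belong to the Glue API, while error handling is left out for brevity.

```python
def list_table_schemas(glue, database_name):
    """Return {table_name: [(column_name, column_type), ...]} for one database.

    `glue` is assumed to behave like boto3.client("glue").
    """
    schemas = {}
    kwargs = {"DatabaseName": database_name}
    while True:
        page = glue.get_tables(**kwargs)
        for table in page.get("TableList", []):
            # Column definitions live under the table's StorageDescriptor.
            columns = table.get("StorageDescriptor", {}).get("Columns", [])
            schemas[table["Name"]] = [(c["Name"], c.get("Type", "")) for c in columns]
        # Follow NextToken until the service stops returning one.
        token = page.get("NextToken")
        if not token:
            return schemas
        kwargs["NextToken"] = token

# Possible usage (requires AWS credentials; "sales_db" is a made-up name):
#   import boto3
#   schemas = list_table_schemas(boto3.client("glue"), "sales_db")
```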

Components

Source Data Store
- S3, RDS, DynamoDB
Crawler
- A program that connects to a data store, determines the schema of your data, and creates metadata tables in the Data Catalog.
Data Catalog
- Persistent metadata store
- Each AWS account has one AWS Glue Data Catalog per AWS Region.
- A Data Catalog is a collection of databases, each of which is a collection of tables.
  - A Database is used to organize metadata tables.
  - A Table is the metadata representation of a collection of structured or semi-structured data; it describes the data but does not contain it.
Job
- The business logic that performs the ETL task
  - Written as a script in Python or Scala
- Source Data Store -> (Crawler) -> Data Catalog -> (Job) -> Output Data Store
Data Store
- A repository that holds your data; it can serve as a data source or as a data target.
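The flow Source Data Store -> (Crawler) -> Data Catalog -> (Job) -> Output Data Store can be sketched with boto3 as follows. This is a minimal sketch, not a production workflow: it assumes the crawler and the job already exist, that `glue` behaves like `boto3.client("glue")`, and that polling the crawler state is acceptable. The helper name `run_pipeline` and its parameters are hypothetical; `start_crawler`, `get_crawler`, and `start_job_run` are Glue API calls.

```python
import time

def run_pipeline(glue, crawler_name, job_name, job_arguments=None, poll_seconds=30):
    """Run an existing crawler, wait until it is idle, then start an existing job.

    `glue` is assumed to behave like boto3.client("glue").
    Returns the JobRunId of the started job run.
    """
    # Source Data Store -> (Crawler) -> Data Catalog
    glue.start_crawler(Name=crawler_name)
    # A crawler reports the READY state again once its run has finished.
    # This sketch does not check whether the crawl itself succeeded.
    while glue.get_crawler(Name=crawler_name)["Crawler"]["State"] != "READY":
        time.sleep(poll_seconds)

    # Data Catalog -> (Job) -> Output Data Store
    params = {"JobName": job_name}
    if job_arguments:
        # Job arguments are passed as a string-to-string mapping.
        params["Arguments"] = job_arguments
    return glue.start_job_run(**params)["JobRunId"]

# Possible usage (requires AWS credentials; all names are made up):
#   import boto3
#   run_id = run_pipeline(boto3.client("glue"), "sales-crawler", "sales-etl")
```

Taking the client as a parameter keeps the helper easy to exercise with a stub and leaves region and credential setup to the caller.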

Working with Glue Jobs

Job Parameters

Published by P. L.
