In this lab, we will learn how to use AWS Glue to create a data pipeline.
Overview
- S3
  - Source data
- Glue
  - Create a crawler to retrieve the metadata
  - Create a table
  - Create a job to transform the data
- Athena
  - Query the data through the Glue table
- S3
  - Check the transformed output
1. Text Editor – Create a CSV file
Using any text editor, create a “people.csv” file with the following content:
firstName,lastName,age
Paul,Henderson,20
Grace,Carr,32
Julia,Jackson,14
Alexandra,May,43
Anna,Jones,29
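If you prefer to script this step, the same file can be generated with a short Python snippet (the rows mirror the sample above):

```python
import csv

# Rows matching the sample data above
rows = [
    ("firstName", "lastName", "age"),
    ("Paul", "Henderson", 20),
    ("Grace", "Carr", 32),
    ("Julia", "Jackson", 14),
    ("Alexandra", "May", 43),
    ("Anna", "Jones", 29),
]

# newline="" prevents the csv module from inserting blank lines on Windows
with open("people.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```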
2. S3 – Create a bucket and upload the CSV file
- Click “Create bucket”
- Bucket name: any unique name like “my-bucket-2023-12-25“
- Accept all defaults -> It creates a bucket with no public access.
- Click “Create bucket“
- Click the bucket name in the list
- Click “Upload“
- Click “Add files” and select “people.csv“
- Click “Upload“
3. Glue – Create a Crawler
- Click “Crawlers” on the left pane
- Click “Create crawler”
- Properties
  - Name: “People-Crawler“
- Data source
  - Is your data already mapped to Glue tables? “Not yet“
  - Click “Add a data source”
    - Data source: “S3“
    - Network connection: leave empty
    - Location of S3 data: “In this account“
    - S3 path: Click “Browse” and select the bucket you created, such as “s3://my-bucket-2023-12-25“
    - Subsequent crawler runs: “Crawl all sub-folders“
    - Click “Add an S3 data source“
- Security settings
  - Click “Create new IAM role“
    - Name: “AWSGlueServiceRole-People“
    - Click “Create“
- Output and scheduling
  - Target database
    - Click “Add database“
      - Name: “people“
      - Click “Create database“
    - Select the “people” database
  - Schedule
    - Frequency: “On demand“
- Review and create
  - Click “Create crawler“
4. IAM – Check the Role
- Click “Roles” on the left pane
- Search “Glue“
- Click “AWSGlueServiceRole-People“
- Confirm the role has the AWS managed policy “AWSGlueServiceRole” plus a policy granting access to your S3 bucket
5. Glue – Run crawler
- Click “Crawlers” on the left pane
- Click “People-Crawler“
- Click “Run crawler”
- Check the status until it says “Completed“
6. Glue – Check Database table in the Data Catalog
- Click “Databases” on the left pane
- Click the “people” database
- You can see the table. If not, click the refresh button.
- Click the table name
- You can see the metadata and the schema of the table.
7. Athena – Query data via Glue table
- Click “Launch query editor”
- If prompted, set a query result location in S3 under “Settings” before running your first query
- Data source: “AwsDataCatalog“
- Database: “people“
- Table: “my_bucket_2023_12_25” or the one the crawler created
- Query: Use the query below
- Click “Run“
SELECT firstname, lastname, age
FROM people.my_bucket_2023_12_25
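Before running anything in Athena, you can sanity-check the shape of the query locally. This sketch loads the sample rows into an in-memory SQLite table; the table name only mimics the Glue Catalog table, it is not the real Athena data source:

```python
import sqlite3

# In-memory stand-in for the Glue Catalog table
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_bucket_2023_12_25 (firstname TEXT, lastname TEXT, age INTEGER)"
)
conn.executemany(
    "INSERT INTO my_bucket_2023_12_25 VALUES (?, ?, ?)",
    [
        ("Paul", "Henderson", 20),
        ("Grace", "Carr", 32),
        ("Julia", "Jackson", 14),
        ("Alexandra", "May", 43),
        ("Anna", "Jones", 29),
    ],
)

# Same projection as the Athena query (minus the database prefix)
result = conn.execute(
    "SELECT firstname, lastname, age FROM my_bucket_2023_12_25"
).fetchall()
```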
8. S3 – Create a bucket for output
- Click “Create bucket”
- Bucket name: any unique name like “my-bucket-2023-12-25-output“
- Accept all defaults -> It creates a bucket with no public access.
- Click “Create bucket“
9. Glue – Create a Job
- Click “Jobs” on the left pane
- Create job: “Visual with a source and target”
- Source: “S3“
- Target: “S3“
- Click “Create“
- Visual tab
  - Click the “Source” node
    - Check “Data Catalog table“
    - Database: “people“
    - Table: “my_bucket_2023_12_25” or the one the crawler created
  - Click the “Transform – ApplyMapping” node
    - Node properties
      - Node type: “SQL Query“
    - Transform
      - Use the query below
    - Output schema
      - Click “Edit”
        - fullname, string
        - age, long
      - Click “Apply“
  - Click the “Target” node
    - Format: “CSV“
    - Compression: “None“
    - S3 Target Location: Click “Browse” and select “s3://my-bucket-2023-12-25-output“
- Job details tab
  - Name: “PeopleDataJob“
  - IAM Role: Select the one you created before, “AWSGlueServiceRole-People“
  - Under “Advanced properties”
    - Script filename: “PeopleDataJob.py“
- Click “Save“
- Click “Run”
- Check the run status until it says “Succeeded“
SELECT CONCAT(firstname, ' ', lastname) AS fullname, age FROM myDataSource
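The transform above simply concatenates the two name columns (`myDataSource` is the default alias Glue Studio gives the SQL node's input). A minimal Python equivalent of what the job computes per row, using a couple of the sample rows:

```python
# Input rows as the job sees them (sample rows from the people table)
source_rows = [
    {"firstname": "Paul", "lastname": "Henderson", "age": 20},
    {"firstname": "Grace", "lastname": "Carr", "age": 32},
]

# Equivalent of: SELECT CONCAT(firstname, ' ', lastname) AS fullname, age
output_rows = [
    {"fullname": f"{r['firstname']} {r['lastname']}", "age": r["age"]}
    for r in source_rows
]
```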
10. S3 – Check the output file
- Go to the S3 bucket “my-bucket-2023-12-25-output“
- Check the file exists
- Click “Download” to download the file
fullname,age
"Paul Henderson",20
"Grace Carr",32
"Julia Jackson",14
"Alexandra May",43
"Anna Jones",29
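The values are quoted in the output, but Python's csv module handles the quotes transparently, so a quick local check of the downloaded file might look like this (the string below stands in for the file contents shown above):

```python
import csv
import io

# Contents of the downloaded output file (sample shown above)
downloaded = """fullname,age
"Paul Henderson",20
"Grace Carr",32
"Julia Jackson",14
"Alexandra May",43
"Anna Jones",29
"""

# DictReader strips the quoting and maps each row by header
out_rows = list(csv.DictReader(io.StringIO(downloaded)))
names = [r["fullname"] for r in out_rows]
```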