[AWS Lab] Analytics – S3, Glue, Athena

In this lab, we will learn how to use AWS Glue to create a data pipeline.

Overview

  • S3
    • Source data
  • Glue
    • Create crawler to retrieve the metadata
    • Create a table
  • Athena
    • Query the data through the Glue table

1. Text Editor – Create a CSV file

Using any text editor, create a “people.csv” file.

firstName,lastName,age
Paul,Henderson,20
Grace,Carr,32
Julia,Jackson,14
Alexandra,May,43
Anna,Jones,29

2. S3 – Create a bucket and upload the CSV file

  • Click “Create bucket
    • Bucket name: any unique name like “my-bucket-2023-12-25
    • Accept all defaults -> It creates a bucket with no public access.
    • Click “Create bucket
  • Click the bucket name in the list
    • Click “Upload
    • Click “Add files” and select “people.csv
    • Click “Upload

3. Glue – Create a Crawler

  • Click “Crawlers” on the left pane
  • Click “Create crawler
    • Properties
      • Name: “People-Crawler
    • Data source
      • Already mapped to tables: “Not yet
      • Click “Add a data source”
        • Data source: “S3
        • Network connection: empty
        • Location of S3 data: “In this account
        • S3 path: Click “Browse” and select the one you created such as “s3://my-bucket-2023-12-25
        • Subsequent crawlers runs: “Crawl all sub-folders
        • Click “Add an S3 data source
    • Security settings
      • Click “Create new IAM role
      • Name: “AWSGlueServiceRole-People
      • Click “Create
    • Output and scheduling
      • Target database
        • Click “Add database
        • Name: “people
        • Click “Create database
        • Select the “people” database
      • Schedule
        • Frequency: “On demand
    • Review and create
      • Click “Create crawler

4. IAM – Check the Role

  • Click “Roles” on the left pane
  • Search “Glue
  • Click “AWSGlueServiceRole-People

5. Glue – Run crawler

  • Click “Crawlers” on the left pane
  • Click “People-Crawler
  • Click “Run crawler
    • Check the status until it says “Completed

6. Glue – Check Database table in the Data Catalog

  • Click “Databases” on the left pane
  • Click the “people” database
  • You can see the table. If not, click the refresh button.
  • Click the table name
    • You can see the meta data and the schema of the table.

7. Athena – Query data via Glue table

  • Click “Launch query editor
    • Data source: “AwsDataCatalog
    • Database: “people
    • Table: “my_bucket_2023_12_25” or what you created
    • Query: Use the query below
    • Click “Run
SELECT firstname, lastname, age
FROM people.my_bucket_2023_12_25

8. S3 – Create a bucket for output

  • Click “Create bucket
    • Bucket name: any unique name like “my-bucket-2023-12-25-output
    • Accept all defaults -> It creates a bucket with no public access.
    • Click “Create bucket

9. Glue – Create a Job

  • Click “Jobs” on the left pane
  • Create job: “Visual with a source and target
    • Source: “S3
    • Target: “S3
  • Click “Create
  • Visual Tab
    • Click “Source
      • Check “Data Catalog table
      • Database “people
      • Table “my_bucket_2023_12_25” or what you created
    • Click “Transform – ApplyMapping
      • Node Properties
        • Node Type: “SQL Query
      • Transform
        • Use the query below
      • Output schema
        • Click “Edit
          • fullname, string
          • age, long
        • Click “Apply
    • Click “Target
      • Format: “CSV
      • Compression: “None
      • S3: Browse and select it “s3://my-bucket-2023-12-25-output
  • Job details Table
    • Name: “PeopleDataJob
    • IAM Role: Select the one you created before “AWSGlueServiceRole-People
    • Under the “Advanced properties
      • Script filename: “PeopleDataJob.py
  • Click “Save
  • Click “Run
    • Check the Run status
select CONCAT(firstname, ' ', lastname) AS fullname, age from myDataSource

10. S3 – Check the output file

  • Go to the S3 bucket “my-bucket-2023-12-25-output
  • Check the file exists
  • Click “Download” to download the file
fullname,age
"Paul Henderson",20
"Grace Carr",32
"Julia Jackson",14
"Alexandra May",43
"Anna Jones",29

Leave a Comment

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s