In this lab, we will learn how to use AWS Glue to create a data pipeline.
Overview
- S3
  - Source data
- Glue
  - Create a crawler to retrieve the metadata
  - Create a table
  - Create a job to transform the data
- Athena
  - Query the data through the Glue table
- S3
  - Check the transformed output
1. Text Editor – Create a CSV file
Using any text editor, create a “people.csv” file with the following content:
firstName,lastName,age
Paul,Henderson,20
Grace,Carr,32
Julia,Jackson,14
Alexandra,May,43
Anna,Jones,29
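If you prefer to script this step, the same file can be generated with a short Python snippet (the rows mirror the sample above):

```python
import csv

# Rows matching the sample data above
rows = [
    ("firstName", "lastName", "age"),
    ("Paul", "Henderson", 20),
    ("Grace", "Carr", 32),
    ("Julia", "Jackson", 14),
    ("Alexandra", "May", 43),
    ("Anna", "Jones", 29),
]

# newline="" prevents the csv module from inserting blank lines on Windows
with open("people.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```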
2. S3 – Create a bucket and upload the CSV file
- Click “Create bucket”
- Bucket name: any unique name like “my-bucket-2023-12-25“
- Accept all defaults -> It creates a bucket with no public access.
- Click “Create bucket“
- Click the bucket name in the list
- Click “Upload“
- Click “Add files” and select “people.csv“
- Click “Upload“
3. Glue – Create a Crawler
- Click “Crawlers” on the left pane
- Click “Create crawler”
- Properties
  - Name: “People-Crawler“
- Data source
  - Is your data already mapped to Glue tables? “Not yet“
  - Click “Add a data source”
    - Data source: “S3“
    - Network connection: leave empty
    - Location of S3 data: “In this account“
    - S3 path: Click “Browse” and select the bucket you created, such as “s3://my-bucket-2023-12-25“
    - Subsequent crawler runs: “Crawl all sub-folders“
    - Click “Add an S3 data source“
- Security settings
  - Click “Create new IAM role“
    - Name: “AWSGlueServiceRole-People“
    - Click “Create“
- Output and scheduling
  - Target database
    - Click “Add database“
      - Name: “people“
      - Click “Create database“
    - Select the “people” database
  - Schedule
    - Frequency: “On demand“
- Review and create
  - Click “Create crawler“
4. IAM – Check the Role
- Click “Roles” on the left pane
- Search “Glue“
- Click “AWSGlueServiceRole-People“
- Confirm the role has the AWS managed policy “AWSGlueServiceRole” plus a policy granting access to your S3 bucket
5. Glue – Run crawler
- Click “Crawlers” on the left pane
- Click “People-Crawler“
- Click “Run crawler”
- Check the status until it says “Completed“
6. Glue – Check Database table in the Data Catalog
- Click “Databases” on the left pane
- Click the “people” database
- You can see the table. If not, click the refresh button.
- Click the table name
- You can see the metadata and the schema of the table.
7. Athena – Query data via Glue table
- Click “Launch query editor”
- If prompted, set a query result location in S3 under “Settings” before running your first query
- Data source: “AwsDataCatalog“
- Database: “people“
- Table: “my_bucket_2023_12_25” or the one the crawler created
- Query: Use the query below
- Click “Run“
SELECT firstname, lastname, age
FROM people.my_bucket_2023_12_25
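Before running anything in Athena, you can sanity-check the shape of the query locally. This sketch loads the sample rows into an in-memory SQLite table; the table name only mimics the Glue Catalog table, it is not the real Athena data source:

```python
import sqlite3

# In-memory stand-in for the Glue Catalog table
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_bucket_2023_12_25 (firstname TEXT, lastname TEXT, age INTEGER)"
)
conn.executemany(
    "INSERT INTO my_bucket_2023_12_25 VALUES (?, ?, ?)",
    [
        ("Paul", "Henderson", 20),
        ("Grace", "Carr", 32),
        ("Julia", "Jackson", 14),
        ("Alexandra", "May", 43),
        ("Anna", "Jones", 29),
    ],
)

# Same projection as the Athena query (minus the database prefix)
result = conn.execute(
    "SELECT firstname, lastname, age FROM my_bucket_2023_12_25"
).fetchall()
```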
8. S3 – Create a bucket for output
- Click “Create bucket”
- Bucket name: any unique name like “my-bucket-2023-12-25-output“
- Accept all defaults -> It creates a bucket with no public access.
- Click “Create bucket“
9. Glue – Create a Job
- Click “Jobs” on the left pane
- Create job: “Visual with a source and target”
- Source: “S3“
- Target: “S3“
- Click “Create“
- Visual tab
  - Click the “Source” node
    - Check “Data Catalog table“
    - Database: “people“
    - Table: “my_bucket_2023_12_25” or the one the crawler created
  - Click the “Transform – ApplyMapping” node
    - Node properties
      - Node type: “SQL Query“
    - Transform
      - Use the query below
    - Output schema
      - Click “Edit”
        - fullname, string
        - age, long
      - Click “Apply“
  - Click the “Target” node
    - Format: “CSV“
    - Compression: “None“
    - S3 Target Location: Click “Browse” and select “s3://my-bucket-2023-12-25-output“
- Job details tab
  - Name: “PeopleDataJob“
  - IAM Role: Select the one you created before, “AWSGlueServiceRole-People“
  - Under “Advanced properties”
    - Script filename: “PeopleDataJob.py“
- Click “Save“
- Click “Run”
- Check the run status until it says “Succeeded“
SELECT CONCAT(firstname, ' ', lastname) AS fullname, age FROM myDataSource
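The transform above simply concatenates the two name columns (`myDataSource` is the default alias Glue Studio gives the SQL node's input). A minimal Python equivalent of what the job computes per row, using a couple of the sample rows:

```python
# Input rows as the job sees them (sample rows from the people table)
source_rows = [
    {"firstname": "Paul", "lastname": "Henderson", "age": 20},
    {"firstname": "Grace", "lastname": "Carr", "age": 32},
]

# Equivalent of: SELECT CONCAT(firstname, ' ', lastname) AS fullname, age
output_rows = [
    {"fullname": f"{r['firstname']} {r['lastname']}", "age": r["age"]}
    for r in source_rows
]
```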
10. S3 – Check the output file
- Go to the S3 bucket “my-bucket-2023-12-25-output“
- Check the file exists
- Click “Download” to download the file
fullname,age
"Paul Henderson",20
"Grace Carr",32
"Julia Jackson",14
"Alexandra May",43
"Anna Jones",29
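The values are quoted in the output, but Python's csv module handles the quotes transparently, so a quick local check of the downloaded file might look like this (the string below stands in for the file contents shown above):

```python
import csv
import io

# Contents of the downloaded output file (sample shown above)
downloaded = """fullname,age
"Paul Henderson",20
"Grace Carr",32
"Julia Jackson",14
"Alexandra May",43
"Anna Jones",29
"""

# DictReader strips the quoting and maps each row by header
out_rows = list(csv.DictReader(io.StringIO(downloaded)))
names = [r["fullname"] for r in out_rows]
```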