AWS Academy Data Engineering Capstone Project

Parvati Jayakumar
13 min read · Dec 7, 2024


In today’s fast-paced world of data engineering, hands-on experience with cloud technologies has become a must for anyone aiming to excel at building modern data solutions. The AWS Academy Data Engineering Capstone Project offers a fantastic opportunity to bring these skills together, providing a real-world challenge that immerses participants in the process of creating robust data infrastructure on Amazon Web Services (AWS).

This project is centered around fisheries data from the Sea Around Us project, offering a meaningful context to explore the power of data management in tackling real-world issues like environmental sustainability. We get to work with a variety of AWS tools, including AWS Cloud9, Amazon S3, AWS Glue, and Amazon Athena, to manage and analyze large datasets.

By the end of this capstone, we will have gained not just technical knowledge but also a deeper understanding of how modern cloud technologies can empower data professionals to drive impactful solutions.

Final Architecture

Dataset

The Sea Around Us website provides a dataset with extensive historical information about fisheries in every part of every ocean. The data covers yearly fishery catches from 1950 to 2018.

The data can be downloaded in CSV format from the Sea Around Us website. The dataset includes columns of information for each year, including which countries caught which types of fish in which areas. The data also indicates how many tonnes of fish were caught and what the value of the catch was, measured in 2010 US dollars.

To understand the data, it will be helpful to understand what is meant by open seas areas and EEZ areas:

  • Open seas (also called high seas): Areas of the ocean that are at least 200 nautical miles away from any single country’s shoreline. The resources, including the fish, in these areas are generally accepted as not belonging to any one country. The following map, which is a screen capture from the Sea Around Us website, shows how the dataset divides the high seas (areas highlighted in dark gray) into unique high seas areas.
High Seas
  • Exclusive Economic Zones (EEZs): Areas within 200 nautical miles of a country’s shoreline. Each country typically claims exclusive access to the resources in the zones, including the fish within them. The following map, which is a screen capture from the Sea Around Us website, shows the EEZs of the world.
EEZs

I worked with three data files from the Sea Around Us website:

  • The first file contains data from all open seas areas.
  • The second file contains data from a single open seas area in the Pacific Ocean, referred to as Pacific, Western Central, which is not far from Fiji and many other countries.
  • The third file contains data from the EEZ of a single country (Fiji), which is near the Pacific, Western Central open seas area.

Configuring the development environment

In this first part of the capstone, we will set up the development environment.

First, let us review the details of the CapstoneGlueRole AWS Identity and Access Management (IAM) role that has been created for us.

IAM Roles
CapstoneGlueRole AWS Identity

Next, let us create an AWS Cloud9 environment with the following settings:

  • Name the environment CapstoneIDE
  • Create a new t2.micro EC2 instance for the environment.
  • Deploy the instance in the Capstone VPC, in the Capstone public subnet, and allow SSH connections to it.
  • Keep all other default settings.

Next, let us create two S3 buckets with the following settings:

  • Create the buckets in the us-east-1 Region.
  • Name the first bucket data-source-##### where ##### is a random number.
  • Name the second bucket query-results-##### where ##### is also a random number.
  • Keep all other default settings.
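
If you prefer the command line, the buckets could also be created from the AWS Cloud9 terminal with the AWS CLI. A minimal sketch, assuming placeholder bucket names (substitute your own random suffixes):

# Create the two buckets in the us-east-1 Region (the ##### suffixes are placeholders)
aws s3 mb s3://data-source-##### --region us-east-1
aws s3 mb s3://query-results-##### --region us-east-1
# Confirm that both buckets exist
aws s3 ls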

Now we will download the three .csv source data files. Run the following commands in the terminal of the AWS Cloud9 IDE:

wget https://aws-tc-largeobjects.s3.us-west-2.amazonaws.com/CUR-TF-200-ACDENG-1-91570/lab-capstone/s3/SAU-GLOBAL-1-v48-0.csv

wget https://aws-tc-largeobjects.s3.us-west-2.amazonaws.com/CUR-TF-200-ACDENG-1-91570/lab-capstone/s3/SAU-HighSeas-71-v48-0.csv

wget https://aws-tc-largeobjects.s3.us-west-2.amazonaws.com/CUR-TF-200-ACDENG-1-91570/lab-capstone/s3/SAU-EEZ-242-v48-0.csv

To observe the column header row and the first five rows of data in the SAU-GLOBAL-1-v48-0.csv file, run the following command:

head -6 SAU-GLOBAL-1-v48-0.csv

Analysis: Among other details, each line of data in this dataset includes:

  • The year that the fishing occurred
  • The country (fishing_entity) that did the fishing
  • The tonnes of fish caught that year by that country
  • The value in 2010 US dollars (landed_value) of the fish caught that year by that country

Note:

  • This dataset contains 561,675 lines. Tip: To confirm this, run wc -l SAU-GLOBAL-1-v48-0.csv.
  • The dataset includes reported and “best guess” data for all fishing that occurred in the global high seas (meaning, not in any one country’s EEZ) between 1950 and 2018.
  • EEZ areas include the ocean waters within 200 nautical miles of a country’s shoreline. Therefore, the fishing reported in this dataset occurred at least 200 nautical miles offshore from any country.

Next, we will convert the SAU-GLOBAL-1-v48-0.csv file to Parquet format.

First, we need to install some tools in the AWS Cloud9 IDE. Run the following commands:
sudo pip3 install pandas pyarrow fastparquet
# Start the python interactive shell 
python3
# Use pandas to convert the file to parquet format
import pandas as pd
df = pd.read_csv('SAU-GLOBAL-1-v48-0.csv')
df.to_parquet('SAU-GLOBAL-1-v48-0.parquet')
exit()
  • Note: Pandas is a useful tool for working with data files. For more information, see the pandas website.

Next, to upload the SAU-GLOBAL-1-v48-0.parquet file to the data-source bucket, use an AWS Command Line Interface (AWS CLI) command in the AWS Cloud9 terminal.
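
A minimal sketch of that upload, assuming the bucket is named data-source-##### as described earlier:

# Copy the Parquet file into the data-source bucket (bucket name is a placeholder)
aws s3 cp SAU-GLOBAL-1-v48-0.parquet s3://data-source-#####/
# Verify that the object arrived
aws s3 ls s3://data-source-#####/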

Using an AWS Glue crawler and querying multiple files with Athena

A query against a single data file, like the one we just uploaded, works well. However, what if we need to query a larger dataset that consists of more than one file?

In this second part of the capstone, we will query data that is stored in multiple files. To do this, we will configure an AWS Glue crawler to discover the structure of the data and then use Athena to query the data.

To observe the column header row and first few lines of data from the SAU-HighSeas-71-v48-0.csv file, use the head command.

Recall that we already downloaded the file to the AWS Cloud9 IDE.

The file contains the same columns as the SAU-GLOBAL-1-v48-0.csv file but has additional columns.

We will also convert the SAU-HighSeas-71-v48-0.csv file to Parquet format and upload it to the data-source bucket.
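
The conversion and upload follow the same pattern as before; a sketch, again assuming the data-source-##### bucket name:

# Convert the HighSeas CSV to Parquet with pandas (same approach as the previous step)
python3 -c "import pandas as pd; pd.read_csv('SAU-HighSeas-71-v48-0.csv').to_parquet('SAU-HighSeas-71-v48-0.parquet')"
# Upload the Parquet file to the data-source bucket (bucket name is a placeholder)
aws s3 cp SAU-HighSeas-71-v48-0.parquet s3://data-source-#####/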

Analysis:

  • Like the SAU-GLOBAL-1-v48-0 dataset that we already uploaded to Amazon S3 and queried, the SAU-HighSeas-71-v48-0 dataset also describes fish catches in the high seas. However, the HighSeas dataset includes data only from one high seas area, known as Pacific, Western Central.
  • Of the additional columns in the HighSeas dataset, two are of particular interest: (1) the area_name column contains the “Pacific, Western Central” value in every row, and (2) the common_name column contains values that describe certain types of fish (for example, “Mackerels, tunas, bonitos”).

Next, we will create an AWS Glue database and an AWS Glue crawler with the following settings:

  • Name the database fishdb
  • Name the crawler fishcrawler
  • Configure the crawler to use the CapstoneGlueRole IAM role to crawl the contents of the data-source S3 bucket.
  • Output the results of the crawler to the fishdb database.
  • Set the crawler frequency to On demand.
Create fishdb
Create fishcrawler
Crawl the contents of the data-source S3 bucket
Set output and scheduling
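
The screenshots above show these steps in the AWS Glue console; the same database and crawler could presumably also be created from the AWS CLI. A sketch, assuming the data-source-##### bucket name used earlier and the CapstoneGlueRole role:

# Create the Glue database
aws glue create-database --database-input '{"Name": "fishdb"}'
# Create an on-demand crawler that targets the data-source bucket and writes to fishdb
aws glue create-crawler \
  --name fishcrawler \
  --role CapstoneGlueRole \
  --database-name fishdb \
  --targets '{"S3Targets": [{"Path": "s3://data-source-#####/"}]}'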

Run the crawler to create a table that contains metadata in the AWS Glue database. Verify that the expected table is created.
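
If you prefer the terminal, the crawler can also be started and checked with the AWS CLI; a sketch:

# Start the crawler, then poll its state until it returns to READY
aws glue start-crawler --name fishcrawler
aws glue get-crawler --name fishcrawler --query 'Crawler.State'
# List the tables that the crawler created in the fishdb database
aws glue get-tables --database-name fishdb --query 'TableList[].Name'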

To confirm that the table properly categorized the data, use Athena to run SQL queries against each column in the new table.

Important: Before you run the first query in Athena, configure the Athena Query Editor to output data to the query-results bucket.

Example query:

SELECT DISTINCT area_name FROM fishdb.data_source_#####;

Note: The example query returns two results. For this column, every row in the dataset contains either the value “Pacific, Western Central” (for rows pulled from SAU-HighSeas-71-v48-0.parquet) or a null value (for rows pulled from SAU-GLOBAL-1-v48-0.parquet).

Now that our data table is defined, run queries to confirm that it provides useful results.

  • To find the value in US dollars of all fish caught by the country Fiji from the Pacific, Western Central high seas area since 2001, organized by year, use the following query (be sure to replace <FMI_1> with the proper value):
SELECT year, fishing_entity AS Country, CAST(CAST(SUM(landed_value) AS DOUBLE) AS DECIMAL(38,2)) AS ValuePacificWCSeasCatch
FROM <FMI_1>
WHERE area_name LIKE '%Pacific%' AND fishing_entity = 'Fiji' AND year > 2000
GROUP BY year, fishing_entity
ORDER BY year;

Note: The CAST(CAST(SUM(landed_value) AS DOUBLE) AS DECIMAL(38,2)) part of the query ensures that the returned data from the landed_value column displays in a reader-friendly format (dollars and cents) instead of scientific notation.

  • Next, find the value in US dollars of all fish caught by the country Fiji from all high seas areas since 2001, organized by year. In the output results, name the US dollar value column ValueAllHighSeasCatch:
SELECT year, fishing_entity AS Country, CAST(CAST(SUM(landed_value) AS DOUBLE) AS DECIMAL(38,2)) AS ValueAllHighSeasCatch
FROM fishdb.data_source_56473
WHERE area_name IS NULL AND fishing_entity = 'Fiji' AND year > 2000
GROUP BY year, fishing_entity
ORDER BY year;

After creating and running the correct query and seeing the results displayed, create a view based on the query:

  • Choose Create > View from query.
  • Name the view challenge
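
Equivalently, the view could be created directly with SQL. A sketch based on the query above (the table name shown is the author’s; yours will differ):

CREATE OR REPLACE VIEW challenge AS
SELECT year, fishing_entity AS Country, CAST(CAST(SUM(landed_value) AS DOUBLE) AS DECIMAL(38,2)) AS ValueAllHighSeasCatch
FROM fishdb.data_source_56473
WHERE area_name IS NULL AND fishing_entity = 'Fiji' AND year > 2000
GROUP BY year, fishing_entity;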

Transforming a new file and adding it to the dataset

In this part of the capstone, we will add the SAU-EEZ-242-v48-0.csv data file to the dataset in Amazon S3. Compare the columns that it contains with the columns that the other data files contain.

Use the same technique that you used earlier in the lab to discover the column names for the EEZ file.

Tip: Most of the column names match between the three files. However, two of the column names in the EEZ file are not an exact match. The following chart shows the columns that are contained in each file.

Analysis: The data in the fish_name column needs to be merged with the data in the common_name column from the HighSeas dataset. Likewise, the data in the country column needs to be merged with the data in the fishing_entity column from the HighSeas dataset.

Use the Python data analysis library, pandas, to fix the column names. In addition, convert the EEZ file to the Parquet format.

To accomplish these tasks, run all of the commands in the following code block.

However, before you run the df.rename command, replace the <FMI_#> placeholders with the correct values.

Tips:

  • For example, <FMI_1> should be set to one of the column names you want to change, and <FMI_2> should be set to what you want to change it to.
  • It will be easier to read the output of the print lines if you make your browser window as wide as possible.
# Make a backup of the file before you modify it in place
cp SAU-EEZ-242-v48-0.csv SAU-EEZ-242-v48-0-old.csv

# Start the python interactive shell
python3
import pandas as pd

# Load the backup version of the file
data_location = 'SAU-EEZ-242-v48-0-old.csv'

# Use Pandas to read the CSV into a dataframe
df = pd.read_csv(data_location)

# View the current column names
print(df.head(1))

# Change the names of the 'fish_name' and 'country' columns to match the column names where this data appears in the other data files already in your data-source bucket
df.rename(columns = {"<FMI_1>": "<FMI_2>", "<FMI_3>": "<FMI_4>"}, inplace = True)

# Verify the column names have been changed
print(df.head(1))

# Write the changes to disk
df.to_csv('SAU-EEZ-242-v48-0.csv', header=True, index=False)
df.to_parquet('SAU-EEZ-242-v48-0.parquet')
exit()
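
Based on the analysis above (fish_name should become common_name, and country should become fishing_entity), the rename call would presumably look like the following; treat this as a hint rather than the definitive answer to the challenge:

# Presumed mapping, derived from the column analysis above
df.rename(columns = {"fish_name": "common_name", "country": "fishing_entity"}, inplace = True)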

Next, we will upload the new EEZ data file to the data-source bucket.

To update the table metadata with the additional columns that are now part of your dataset, run the AWS Glue crawler again.

Now, let us run some queries in Athena.

Note: For all of these queries, replace data_source_##### with the name of your data_source table.

  • To verify the values in the area_name column, as we did before, use the following query:
SELECT DISTINCT area_name FROM fishdb.data_source_#####;

With the addition of the EEZ file to the dataset, this query now returns three results, including the result for rows where the area_name column doesn’t have any data. (Recall that this query returned only two results previously.)

  • To find the value in US dollars of all fish caught by Fiji from the open seas since 2001, organized by year, use the following query:
SELECT year, fishing_entity AS Country, CAST(CAST(SUM(landed_value) AS DOUBLE) AS DECIMAL(38,2)) AS ValueOpenSeasCatch
FROM fishdb.data_source_#####
WHERE area_name IS NULL AND fishing_entity='Fiji' AND year > 2000
GROUP BY year, fishing_entity
ORDER BY year;

To find the value in US dollars of all fish caught by Fiji from the Fiji EEZ since 2001, organized by year, use the following query:

SELECT year, fishing_entity AS Country, CAST(CAST(SUM(landed_value) AS DOUBLE) AS DECIMAL(38,2)) AS ValueEEZCatch
FROM fishdb.data_source_#####
WHERE area_name LIKE '%Fiji%' AND fishing_entity='Fiji' AND year > 2000
GROUP BY year, fishing_entity
ORDER BY year;

To find the value in US dollars of all fish caught by Fiji from either the Fiji EEZ or the open seas since 2001, organized by year, use the following query:

SELECT year, fishing_entity AS Country, CAST(CAST(SUM(landed_value) AS DOUBLE) AS DECIMAL(38,2)) AS ValueEEZAndOpenSeasCatch
FROM fishdb.data_source_#####
WHERE (area_name LIKE '%Fiji%' OR area_name IS NULL) AND fishing_entity='Fiji' AND year > 2000
GROUP BY year, fishing_entity
ORDER BY year;

Analysis: If our data is formatted well and the AWS Glue crawler properly updated the metadata table, then the results that we get from the first two queries in this step should add up to the results that we get from the third query.

For example, if we add the 2001 ValueOpenSeasCatch value and the 2001 ValueEEZCatch value, the total should equal the 2001 ValueEEZAndOpenSeasCatch value. If the results are consistent with this description, then it is a good indication that the solution is working as intended.
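
One way to sanity-check this, sketched below under the assumption that your table is named fishdb.data_source_#####, is to compute the open seas and EEZ sums in a single query and compare the combined value per year with the output of the third query:

WITH open_seas AS (
  SELECT year, SUM(landed_value) AS open_seas_value
  FROM fishdb.data_source_#####
  WHERE area_name IS NULL AND fishing_entity = 'Fiji' AND year > 2000
  GROUP BY year
),
eez AS (
  SELECT year, SUM(landed_value) AS eez_value
  FROM fishdb.data_source_#####
  WHERE area_name LIKE '%Fiji%' AND fishing_entity = 'Fiji' AND year > 2000
  GROUP BY year
)
SELECT o.year,
       CAST(CAST(o.open_seas_value + e.eez_value AS DOUBLE) AS DECIMAL(38,2)) AS CombinedValue
FROM open_seas o
JOIN eez e ON o.year = e.year
ORDER BY o.year;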

Next, we will create a view in Athena, which will be useful to review the data in the next section of this capstone.

  • Run the following query. Replace data_source_##### with the name of your data_source table:
CREATE OR REPLACE VIEW MackerelsCatch AS
SELECT year, area_name AS WhereCaught, fishing_entity AS Country, SUM(tonnes) AS TotalWeight
FROM fishdb.data_source_#####
WHERE common_name LIKE '%Mackerels%' AND year > 2014
GROUP BY year, area_name, fishing_entity
ORDER BY TotalWeight DESC;

To verify that the view has data, in the Data panel, under Views, choose the ellipsis (three dot) icon to the right of the mackerelscatch view, and choose Preview View.

The preview shows the following data: tonnes of mackerel caught by year, by country.

  • Return to the Athena query editor, and run the following SQL query to identify the countries with the highest mackerel catch each year.
SELECT year, Country, MAX(TotalWeight) AS Weight
FROM fishdb.mackerelscatch
GROUP BY year, Country
ORDER BY year, Weight DESC;
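
The query above returns one row per country for each year, ordered so the heaviest catches appear first within each year. To return only the single top country per year, a window function such as RANK() can be used. A sketch, assuming the mackerelscatch view created earlier and summing each country’s catch across fishing areas:

SELECT year, Country, Weight
FROM (
  SELECT year, Country, SUM(TotalWeight) AS Weight,
         RANK() OVER (PARTITION BY year ORDER BY SUM(TotalWeight) DESC) AS rnk
  FROM fishdb.mackerelscatch
  GROUP BY year, Country
) ranked
WHERE rnk = 1
ORDER BY year;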

To view the MackerelsCatch data for a particular country, for example China, run the following query:

SELECT * FROM "fishdb"."mackerelscatch"
WHERE Country IN ('China');

Conclusion

The AWS Academy Data Engineering Capstone Project represents a significant opportunity to apply the knowledge and skills acquired throughout the course in a practical, hands-on setting. By building a comprehensive infrastructure for managing and analyzing fisheries data from the Sea Around Us dataset, we not only gained experience with various AWS services but also came to understand the real-world implications of data engineering in the context of environmental sustainability. Each task, from configuring the AWS Cloud9 environment to transforming data formats, using AWS Glue crawlers, and querying with Amazon Athena, deepened our comprehension of data workflows and the importance of data accuracy. The capstone not only reinforced technical skills but also emphasized critical thinking and problem solving, as we were challenged to navigate complex datasets and derive meaningful insights.
