Kmetz Weekly Insights

Advancing Environmental Justice Through Data Analysis

Unveiling Inequities with Pandas and Visualization Tools

Ryan Kmetz
August 22, 2024

Data analysis is a critical tool for uncovering and addressing systemic environmental inequalities. Environmental justice focuses on ensuring that all communities, regardless of race, income, or geographic location, have equal protection from environmental hazards and equal access to environmental benefits. With data analysis, we can identify patterns of environmental harm, track pollution levels, and highlight disparities in how environmental policies and practices affect different communities.

Using tools like Pandas in Python allows us to dive deep into datasets that reveal the disproportionate impact of environmental issues on marginalized communities. For instance, analyzing data on air quality, water contamination, or exposure to hazardous waste can uncover which communities are most at risk and why. This information is vital for policymakers, activists, and researchers working to create equitable solutions that address both the root causes and the symptoms of environmental injustice.

The ability to visualize data through libraries like Matplotlib enables clearer communication of these issues to a broader audience. By presenting data in a visual format, we can make complex environmental justice concerns more accessible and compelling to the public, driving home the urgency of taking action to protect vulnerable populations. Data-driven approaches not only provide the evidence needed to advocate for change but also empower communities.

EJ-PY Panda Edition

Data Analysis with Pandas: Day 4

Introduction

Pandas is a powerful and flexible Python library used for data manipulation and analysis. This article will cover the basics of the Pandas library, creating and manipulating DataFrames, basic data analysis operations such as grouping, filtering, and sorting, and simple data visualization using Matplotlib. Additionally, we'll provide a practical exercise to analyze a dataset on global CO2 emissions and create visualizations showing emissions trends for top polluting countries over time.

Introduction to Pandas Library

Pandas is an open-source data analysis and manipulation library for Python, built on top of NumPy. Its two main data structures, Series (one-dimensional) and DataFrame (two-dimensional), come with functions for cleaning, reshaping, and summarizing structured data.

Installing Pandas

You can install Pandas using pip:

pip install pandas

Importing Pandas

To use Pandas in your Python script, you need to import it:

import pandas as pd

Creating and Manipulating DataFrames

A DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns). You can think of it as a spreadsheet or SQL table.
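To make the terms "labeled axes" and "heterogeneous" concrete, here is a minimal sketch that inspects a small DataFrame. The column names and values are hypothetical and serve only as an illustration.

```python
import pandas as pd

# Hypothetical sample values, chosen only to illustrate the structure
df = pd.DataFrame(
    {
        "Site": ["A", "B", "C"],          # text column
        "PM25": [12.5, 35.1, 8.9],        # floating-point column
        "Exceeds": [False, True, False],  # boolean column
    }
)

print(df.shape)          # (number of rows, number of columns)
print(list(df.columns))  # column labels
print(list(df.index))    # row labels (a default integer range here)
print(df.dtypes)         # each column keeps its own data type
```

Each column holds a single data type, but different columns can hold different types, which is what "heterogeneous" refers to.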

Creating a DataFrame

You can create a DataFrame from various sources, such as lists, dictionaries, or other DataFrames.

import pandas as pd

# Creating a DataFrame from a dictionary
data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["New York", "Los Angeles", "Chicago"]
}
df = pd.DataFrame(data)
print(df)

# Output:
#       Name  Age         City
# 0    Alice   25     New York
# 1      Bob   30  Los Angeles
# 2  Charlie   35      Chicago
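The dictionary form above is only one option. As a small sketch of the other sources mentioned (lists and existing DataFrames), the following builds the same table in two more ways and then copies it. The variable names are illustrative.

```python
import pandas as pd

# From a list of dictionaries: each dictionary becomes one row
records = [
    {"Name": "Alice", "Age": 25},
    {"Name": "Bob", "Age": 30},
]
df_records = pd.DataFrame(records)

# From a list of lists: column names are supplied separately
rows = [["Alice", 25], ["Bob", 30]]
df_rows = pd.DataFrame(rows, columns=["Name", "Age"])

# From another DataFrame: copy() gives an independent object
df_copy = df_records.copy()
df_copy["Age"] = df_copy["Age"] + 1  # does not change df_records

print(df_records.equals(df_rows))
print(df_records["Age"].tolist(), df_copy["Age"].tolist())
```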

Manipulating DataFrames

You can perform various operations on DataFrames, such as selecting, adding, and modifying columns.

# Selecting a column
ages = df["Age"]
print(ages)

# Adding a new column
df["Salary"] = [70000, 80000, 90000]
print(df)

# Modifying a column
df["Age"] = df["Age"] + 5
print(df)
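A few more manipulations tend to come up quickly in practice. The sketch below rebuilds the example table so that it runs on its own; the MonthlySalary and Monthly column names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["Alice", "Bob", "Charlie"],
        "Age": [25, 30, 35],
        "Salary": [70000, 80000, 90000],
    }
)

# Selecting several columns returns a new DataFrame
subset = df[["Name", "Salary"]]

# Deriving a column from an existing one
df["MonthlySalary"] = df["Salary"] / 12

# Updating a value only in the rows that match a condition
df.loc[df["Name"] == "Bob", "Age"] = 31

# rename() and drop() return new DataFrames, so the result is reassigned
df = df.rename(columns={"MonthlySalary": "Monthly"}).drop(columns=["Salary"])
print(df)
```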

Basic Data Analysis Operations

Pandas provides several functions for data analysis, including grouping, filtering, and sorting.

Grouping Data

You can group rows that share a value with the groupby method, then apply an aggregation such as mean() to each group.

# Creating a sample DataFrame
data = {
    "Department": ["HR", "Tech", "HR", "Tech", "HR", "Tech"],
    "Salary": [50000, 80000, 60000, 90000, 55000, 85000]
}
df = pd.DataFrame(data)

# Grouping by Department and calculating the mean salary
grouped = df.groupby("Department").mean()
print(grouped)

# Output:
#             Salary
# Department
# HR         55000.0
# Tech       85000.0
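The mean is only one possible aggregation. As a sketch, agg() can compute several statistics per group in one call; the same sample data is repeated here so the snippet runs on its own.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Department": ["HR", "Tech", "HR", "Tech", "HR", "Tech"],
        "Salary": [50000, 80000, 60000, 90000, 55000, 85000],
    }
)

# Select the column first, then request several statistics per department
summary = df.groupby("Department")["Salary"].agg(["mean", "min", "max", "count"])
print(summary)
```

The result has one row per department and one column per statistic.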

Filtering Data

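You can filter data using boolean indexing. A comparison such as df["Salary"] > 60000 produces a Series of True/False values, and indexing the DataFrame with that Series keeps only the rows where the value is True.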

# Filtering rows where Salary is greater than 60000
high_salary = df[df["Salary"] > 60000]
print(high_salary)

# Output:
#   Department  Salary
# 1       Tech   80000
# 3       Tech   90000
# 5       Tech   85000
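Conditions can also be combined. The following sketch uses & (and), | (or), isin(), and query() on the same sample data. Note that each condition needs its own parentheses when combined with & or |.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Department": ["HR", "Tech", "HR", "Tech", "HR", "Tech"],
        "Salary": [50000, 80000, 60000, 90000, 55000, 85000],
    }
)

# Both conditions must hold
high_tech = df[(df["Department"] == "Tech") & (df["Salary"] > 80000)]

# Either condition may hold
extremes = df[(df["Salary"] < 55000) | (df["Salary"] >= 90000)]

# Membership test against a list of allowed values
in_hr = df[df["Department"].isin(["HR"])]

# query() expresses the first filter as a string
queried = df.query("Department == 'Tech' and Salary > 80000")

print(high_tech)
print(extremes)
```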

Sorting Data

You can sort rows by one or more columns using the sort_values method.

# Sorting the DataFrame by Salary in descending order
sorted_df = df.sort_values(by="Salary", ascending=False)
print(sorted_df)

# Output:
#   Department  Salary
# 3       Tech   90000
# 5       Tech   85000
# 1       Tech   80000
# 2         HR   60000
# 4         HR   55000
# 0         HR   50000
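Sorting by more than one column works the same way. This sketch orders departments alphabetically and, within each department, salaries from highest to lowest. It also shows nlargest() as a shortcut when only the top rows are needed.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Department": ["HR", "Tech", "HR", "Tech", "HR", "Tech"],
        "Salary": [50000, 80000, 60000, 90000, 55000, 85000],
    }
)

# One ascending flag per sort column
ordered = df.sort_values(by=["Department", "Salary"], ascending=[True, False])

# The original row labels travel with the rows; reset them if that is unwanted
ordered_reset = ordered.reset_index(drop=True)

# The two highest salaries without sorting the whole table
top_two = df.nlargest(2, "Salary")

print(ordered)
print(top_two)
```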

Simple Data Visualization Using Matplotlib

Matplotlib is a comprehensive plotting library for Python that provides tools for creating static, animated, and interactive visualizations.

Installing Matplotlib

You can install Matplotlib using pip:

pip install matplotlib

Importing Matplotlib

To use Matplotlib in your Python script, you need to import it:

import matplotlib.pyplot as plt

Creating a Simple Plot

You can create various plots using Matplotlib, such as line plots, bar plots, and histograms.

import matplotlib.pyplot as plt

# Creating sample data
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 7, 11]

# Creating a line plot
plt.plot(x, y)
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.title("Simple Line Plot")
plt.show()
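Bar plots and histograms follow the same pattern as the line plot. The sketch below reuses the salary figures from the earlier examples and draws both side by side. It saves the figure to a file (the file name is arbitrary) instead of opening a window.

```python
import matplotlib.pyplot as plt

departments = ["HR", "Tech"]
mean_salaries = [55000, 85000]
salaries = [50000, 80000, 60000, 90000, 55000, 85000]

# One figure with two panels next to each other
fig, (ax_bar, ax_hist) = plt.subplots(1, 2, figsize=(10, 4))

# Bar plot: one bar per category
ax_bar.bar(departments, mean_salaries)
ax_bar.set_xlabel("Department")
ax_bar.set_ylabel("Mean salary")
ax_bar.set_title("Bar Plot")

# Histogram: counts of values falling into equal-width bins
counts, bin_edges, _ = ax_hist.hist(salaries, bins=4)
ax_hist.set_xlabel("Salary")
ax_hist.set_ylabel("Number of employees")
ax_hist.set_title("Histogram")

fig.tight_layout()
fig.savefig("bar_and_histogram.png")
```

Replacing the last line with plt.show() displays the figure interactively, as in the line plot example.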

Practical Exercise: Analyzing Global CO2 Emissions

Let's use Pandas to analyze a dataset on global CO2 emissions and create visualizations showing emissions trends for top polluting countries over time.

Step-by-Step Instructions

Download the dataset:
- Save the dataset as global_co2_emissions.csv in the same directory as your script. The sample code below assumes it has columns named Country, Year, and Emissions.
Read the dataset:
- Use Pandas to read the data into a DataFrame.
Analyze the data:
- Group the data by country and year, and sum the emissions in each group.
- Rank countries by their total emissions across all years and keep the top five.
Visualize the data:
- Use Matplotlib to create line plots showing emissions trends for the top polluting countries.

Sample Code

import pandas as pd
import matplotlib.pyplot as plt

# Step 1: Read the dataset
df = pd.read_csv('global_co2_emissions.csv')

# Step 2: Display the first few rows of the dataset
print(df.head())

# Step 3: Group the data by country and year, and calculate the sum of emissions
grouped_df = df.groupby(['Country', 'Year'])['Emissions'].sum().reset_index()

# Step 4: Find the five countries with the highest total emissions across all years
country_totals = grouped_df.groupby('Country')['Emissions'].sum()
top_countries = country_totals.sort_values(ascending=False).head(5).index
top_countries_df = grouped_df[grouped_df['Country'].isin(top_countries)]

# Step 5: Pivot the data so each year is a row and each country is a column
pivot_df = top_countries_df.pivot(index='Year', columns='Country', values='Emissions')

# Step 6: Plot the data
pivot_df.plot(kind='line', figsize=(10, 6))
plt.xlabel('Year')
plt.ylabel('Emissions')
plt.title('CO2 Emissions Trends for Top Polluting Countries')
plt.legend(title='Country')
plt.show()
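The sample code assumes that the file has Country, Year, and Emissions columns and that every emissions value is numeric. Real files may not match that exactly, so a check before grouping can save some confusion. The following is a minimal sketch: clean_emissions is a hypothetical helper, not part of Pandas, and the rows are placeholders rather than real measurements.

```python
import io

import pandas as pd

# Column names assumed by the sample code
REQUIRED_COLUMNS = ["Country", "Year", "Emissions"]


def clean_emissions(df):
    """Check the expected columns and drop rows that cannot be summed."""
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"Dataset is missing columns: {missing}")
    cleaned = df.copy()
    # Non-numeric entries become NaN so they can be dropped explicitly
    cleaned["Emissions"] = pd.to_numeric(cleaned["Emissions"], errors="coerce")
    cleaned = cleaned.dropna(subset=REQUIRED_COLUMNS)
    cleaned["Year"] = cleaned["Year"].astype(int)
    return cleaned


# Placeholder rows (not real measurements) standing in for the CSV file
sample_csv = io.StringIO(
    "Country,Year,Emissions\n"
    "Country A,2000,10.0\n"
    "Country A,2001,unknown\n"
    "Country B,2000,4.5\n"
)
cleaned = clean_emissions(pd.read_csv(sample_csv))
print(cleaned)
```

Whether dropping incomplete rows is appropriate depends on the dataset. For some analyses it may be better to report or investigate them instead.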

Explanation of the Code

Import required libraries:
- pandas for data manipulation and matplotlib.pyplot for data visualization.
Read the dataset:
- Use pd.read_csv('global_co2_emissions.csv') to read the dataset into a DataFrame.
Display the first few rows:
- Use df.head() to check the structure and contents of the dataset.
Group the data:
- Group the data by Country and Year and sum the Emissions column in each group with groupby() and sum(). reset_index() turns the group labels back into regular columns.
Find the top polluting countries:
- Calculate the total emissions for each country and select the top 5 polluting countries.
Pivot the data:
- Use pivot to put years on the rows and countries on the columns, so that each country is drawn as its own line.
Plot the data:
- Use plot to create line plots showing emissions trends for the top polluting countries. Customize the plot with labels, title, and legend.
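To see what each step described above produces without downloading anything, the same pipeline can be run on a tiny in-memory table. This is a sketch under stated assumptions: the country labels and numbers are placeholders, not real emissions data, and only the top two countries are kept because the table has just three.

```python
import pandas as pd

# Placeholder values (not real measurements); some country/year pairs
# appear twice so that the grouping step has something to sum
df = pd.DataFrame(
    {
        "Country": ["A", "A", "A", "A", "B", "B", "C", "C"],
        "Year": [2000, 2000, 2001, 2001, 2000, 2001, 2000, 2001],
        "Emissions": [5, 3, 6, 4, 2, 3, 7, 8],
    }
)

# One row per country and year
grouped_df = df.groupby(["Country", "Year"])["Emissions"].sum().reset_index()

# Total per country, largest first
totals = grouped_df.groupby("Country")["Emissions"].sum().sort_values(ascending=False)
top_countries = totals.head(2).index

# Keep only the top countries, then reshape: years as rows, countries as columns
top_countries_df = grouped_df[grouped_df["Country"].isin(top_countries)]
pivot_df = top_countries_df.pivot(index="Year", columns="Country", values="Emissions")

print(grouped_df)
print(pivot_df)
```

Calling pivot_df.plot(kind="line") on the result draws one line per remaining country, as in the sample code.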

FAQ

Q1: How do I install Pandas in Python?

A: You can install Pandas using pip: pip install pandas.

Q2: How do I create a DataFrame in Pandas?

A: You can create a DataFrame from various sources, such as lists, dictionaries, or other DataFrames. Use the pd.DataFrame() function.

Q3: What are some basic data analysis operations in Pandas?

A: Basic data analysis operations in Pandas include grouping with groupby, filtering with boolean indexing, and sorting with sort_values.

Q4: How do I visualize data using Matplotlib?

A: Matplotlib provides functions to create various plots, such as line plots, bar plots, and histograms. Use plt.plot(), plt.bar(), and plt.hist() to create these plots.

Q5: How can I analyze global CO2 emissions using Pandas?

A: Read the dataset into a DataFrame, group and sum the data by country and year, find the top polluting countries, pivot the data, and use Matplotlib to visualize emissions trends.