Analysis of Water Quality - An Exploratory Data Analysis Project¶

Jessica Hooker¶

January 2023¶

Introduction¶

Water scarcity is one of the most serious problems humanity is facing. According to UNICEF:

  • Four billion people — almost two thirds of the world’s population — experience severe water scarcity for at least one month each year.
  • Over two billion people live in countries where water supply is inadequate.
  • Half of the world’s population could be living in areas facing water scarcity by as early as 2025.
  • Some 700 million people could be displaced by intense water scarcity by 2030.
  • By 2040, roughly 1 in 4 children worldwide will be living in areas of extremely high water stress.

With the effect of global warming and extreme weather events, day by day, amount of drinkable water decreases. There are different metrics that determines the quality of water. Organizations like WHO has measures defining the limits of these metrics to analyze if a water source is drinkable or not.

Goal: This is an exploratory data analysis project. The primary objective is to explore the features in relation to whether the water sample is drinkable or not.

A Brief Description of the Dataset¶

This dataset contains 9 numerical measurements belonging to 3276 different water samples and a dichotomous label indicating if the water sample is potable water or not.

1. pH value:
PH is an important parameter in evaluating the acid–base balance of water. It is also the indicator of acidic or alkaline condition of water status. WHO has recommended maximum permissible limit of pH from 6.5 to 8.5. The current investigation ranges were 6.52–6.83 which are in the range of WHO standards.

2. Hardness:
Hardness is mainly caused by calcium and magnesium salts. These salts are dissolved from geologic deposits through which water travels. The length of time water is in contact with hardness producing material helps determine how much hardness there is in raw water. Hardness was originally defined as the capacity of water to precipitate soap caused by Calcium and Magnesium.

3. Solids (Total dissolved solids - TDS):
Water has the ability to dissolve a wide range of inorganic and some organic minerals or salts such as potassium, calcium, sodium, bicarbonates, chlorides, magnesium, sulfates etc. These minerals produced un-wanted taste and diluted color in appearance of water. This is the important parameter for the use of water. The water with high TDS value indicates that water is highly mineralized. Desirable limit for TDS is 500 mg/l and maximum limit is 1000 mg/l which prescribed for drinking purpose.

4. Chloramines:
Chlorine and chloramine are the major disinfectants used in public water systems. Chloramines are most commonly formed when ammonia is added to chlorine to treat drinking water. Chlorine levels up to 4 milligrams per liter (mg/L or 4 parts per million (ppm)) are considered safe in drinking water.

5. Sulfate:
Sulfates are naturally occurring substances that are found in minerals, soil, and rocks. They are present in ambient air, groundwater, plants, and food. The principal commercial use of sulfate is in the chemical industry. Sulfate concentration in seawater is about 2,700 milligrams per liter (mg/L). It ranges from 3 to 30 mg/L in most freshwater supplies, although much higher concentrations (1000 mg/L) are found in some geographic locations.

6. Conductivity:
Pure water is not a good conductor of electric current rather’s a good insulator. Increase in ions concentration enhances the electrical conductivity of water. Generally, the amount of dissolved solids in water determines the electrical conductivity. Electrical conductivity (EC) actually measures the ionic process of a solution that enables it to transmit current. According to WHO standards, EC value should not exceeded 400 μS/cm.

7. Organic_carbon:
Total Organic Carbon (TOC) in source waters comes from decaying natural organic matter (NOM) as well as synthetic sources. TOC is a measure of the total amount of carbon in organic compounds in pure water. According to US EPA < 2 mg/L as TOC in treated / drinking water, and < 4 mg/Lit in source water which is use for treatment.

8. Trihalomethanes:
THMs are chemicals which may be found in water treated with chlorine. The concentration of THMs in drinking water varies according to the level of organic material in the water, the amount of chlorine required to treat the water, and the temperature of the water that is being treated. THM levels up to 80 ppm is considered safe in drinking water.

9. Turbidity:
The turbidity of water depends on the quantity of solid matter present in the suspended state. It is a measure of light emitting properties of water and the test is used to indicate the quality of waste discharge with respect to colloidal matter. The mean turbidity value obtained for Wondo Genet Campus (0.98 NTU) is lower than the WHO recommended value of 5.00 NTU.

10. Potability:
Indicates if water is safe for human consumption where 1 means Potable and 0 means Not potable.

Libraries & Visualization Parameters¶

Import all the libraries you use throughout the project below.

In [1]:
# Basic Libraries
import pandas as pd
import numpy as np 

# Visualizations Libraries
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
sns.set()
In [2]:
dataframe = pd.read_csv('water_potability.csv')
df = dataframe.copy()
df.shape
Out[2]:
(3276, 10)

Exploring the Structure of Dataset¶

Below is a preliminary examination of the general structure of the data set and characteristics of each feature. This data set comprises 3276 water samples with 9 water potability features.

In [3]:
# get a sample of 10 data points from the dataframe
df.sample(10)
Out[3]:
ph Hardness Solids Chloramines Sulfate Conductivity Organic_carbon Trihalomethanes Turbidity Potability
646 NaN 221.620058 11954.700730 6.657053 NaN 391.238714 12.961433 NaN 3.282061 0
2873 NaN 231.305127 15629.762335 7.018067 297.748867 406.340425 11.835535 49.679696 4.464244 0
595 8.269945 192.003776 11889.217747 8.377233 345.987354 405.340609 17.395662 85.444985 3.076480 0
1113 8.775486 157.354289 22113.391452 7.931503 304.807845 480.715208 15.692839 82.929345 4.393944 1
2325 5.685216 154.570002 17628.043059 6.664444 390.094731 324.427772 17.352586 67.407300 4.066157 1
1883 7.456251 226.538289 9163.294448 5.726224 NaN 417.125210 10.066076 67.304224 4.279697 0
919 5.606856 211.513097 32423.462685 6.812943 NaN 446.386404 20.926950 52.218979 4.070966 0
26 3.445062 207.926260 33424.768678 8.782147 384.007006 441.785876 13.805902 30.284597 4.184397 0
1117 NaN 178.154855 34383.597639 6.425770 311.526890 570.378765 17.592126 68.992741 4.005893 1
1837 NaN 216.271624 26964.317181 6.383548 418.494244 573.159591 19.892680 50.170522 4.501025 0
In [4]:
#describe the entire data frame
df.describe()
Out[4]:
ph Hardness Solids Chloramines Sulfate Conductivity Organic_carbon Trihalomethanes Turbidity Potability
count 2785.000000 3276.000000 3276.000000 3276.000000 2495.000000 3276.000000 3276.000000 3114.000000 3276.000000 3276.000000
mean 7.080795 196.369496 22014.092526 7.122277 333.775777 426.205111 14.284970 66.396293 3.966786 0.390110
std 1.594320 32.879761 8768.570828 1.583085 41.416840 80.824064 3.308162 16.175008 0.780382 0.487849
min 0.000000 47.432000 320.942611 0.352000 129.000000 181.483754 2.200000 0.738000 1.450000 0.000000
25% 6.093092 176.850538 15666.690297 6.127421 307.699498 365.734414 12.065801 55.844536 3.439711 0.000000
50% 7.036752 196.967627 20927.833607 7.130299 333.073546 421.884968 14.218338 66.622485 3.955028 0.000000
75% 8.062066 216.667456 27332.762127 8.114887 359.950170 481.792304 16.557652 77.337473 4.500320 1.000000
max 14.000000 323.124000 61227.196008 13.127000 481.030642 753.342620 28.300000 124.000000 6.739000 1.000000
In [5]:
#desscribe the data frame by samples that are potable
df.loc[df.Potability==1,].describe()
Out[5]:
ph Hardness Solids Chloramines Sulfate Conductivity Organic_carbon Trihalomethanes Turbidity Potability
count 1101.000000 1278.000000 1278.000000 1278.000000 985.000000 1278.000000 1278.000000 1223.000000 1278.000000 1278.0
mean 7.073783 195.800744 22383.991018 7.169338 332.566990 425.383800 14.160893 66.539684 3.968328 1.0
std 1.448048 35.547041 9101.010208 1.702988 47.692818 82.048446 3.263907 16.327419 0.780842 0.0
min 0.227499 47.432000 728.750830 0.352000 129.000000 201.619737 2.200000 8.175876 1.492207 1.0
25% 6.179312 174.330531 15668.985035 6.094134 300.763772 360.939023 12.033897 56.014249 3.430909 1.0
50% 7.036752 196.632907 21199.386614 7.215163 331.838167 420.712729 14.162809 66.678214 3.958576 1.0
75% 7.933068 218.003420 27973.236446 8.199261 365.941346 484.155911 16.356245 77.380975 4.509569 1.0
max 13.175402 323.124000 56488.672413 13.127000 481.030642 695.369528 23.604298 124.000000 6.494249 1.0
In [6]:
#confirm the data types of each of the variables in the data frame
df.dtypes
Out[6]:
ph                 float64
Hardness           float64
Solids             float64
Chloramines        float64
Sulfate            float64
Conductivity       float64
Organic_carbon     float64
Trihalomethanes    float64
Turbidity          float64
Potability           int64
dtype: object
In [7]:
#assess the number of unique values for each variable in the data frame
df.nunique()
Out[7]:
ph                 2785
Hardness           3276
Solids             3276
Chloramines        3276
Sulfate            2495
Conductivity       3276
Organic_carbon     3276
Trihalomethanes    3114
Turbidity          3276
Potability            2
dtype: int64
In [8]:
#plot unique values for features
sns.set_style("white")
plt.figure(figsize = (12,10))
nu = df.nunique().reset_index().iloc[0:9]
nu.columns = ['feature', 'nunique']
ax = sns.barplot(x = 'feature', y = 'nunique', data = nu)
plt.ylabel("Number of unique values", fontsize = 14, weight = "bold")
plt.title("Snapshot of Unique Values across Water Potability Features", fontsize = 16, weight = "bold")
plt.xlabel("Variable", fontsize = 14, weight = "bold")
plt.xticks(rotation = 45, fontsize = 12)
plt.show()

Preliminary Notes¶

Based on a preliminary look, there are some characteristics worth noting. First, the data set contains missing data across three features: ph, Sulfate, and Trihalomethanes. There are a large number of unique values across variables, with a the number of unique values matching the number of samples for the variables without missing data. This is reflected by the number of values after the decimal place across variables.

Data Quality and Preprocessing¶

Check for Data Duplication¶

In [9]:
# Calculate number of duplicated rows
print("Duplicated rows:")
duplicates = df[df.duplicated()]
duplicates
Duplicated rows:
Out[9]:
ph Hardness Solids Chloramines Sulfate Conductivity Organic_carbon Trihalomethanes Turbidity Potability

There does not appear to be any duplicate rows of data

Distribution of nulls¶

In [10]:
# Write a function calculating what percentage of each feature is null.
def na_per(boop):
    na_sums = boop.isna().sum()
    na_per = round(na_sums/len(boop)*100,2)
    print(f"{na_per}% of {boop.name} are missing values(n = {na_sums})")

for i, feature in enumerate(df.columns):
    na_per(df[feature])
14.99% of ph are missing values(n = 491)
0.0% of Hardness are missing values(n = 0)
0.0% of Solids are missing values(n = 0)
0.0% of Chloramines are missing values(n = 0)
23.84% of Sulfate are missing values(n = 781)
0.0% of Conductivity are missing values(n = 0)
0.0% of Organic_carbon are missing values(n = 0)
4.95% of Trihalomethanes are missing values(n = 162)
0.0% of Turbidity are missing values(n = 0)
0.0% of Potability are missing values(n = 0)

Missingness for Not Potable Samples

In [11]:
x1 = df.loc[df.Potability==0,]
for i, feature in enumerate(x1.columns):
    na_per(x1[feature])
15.72% of ph are missing values(n = 314)
0.0% of Hardness are missing values(n = 0)
0.0% of Solids are missing values(n = 0)
0.0% of Chloramines are missing values(n = 0)
24.42% of Sulfate are missing values(n = 488)
0.0% of Conductivity are missing values(n = 0)
0.0% of Organic_carbon are missing values(n = 0)
5.36% of Trihalomethanes are missing values(n = 107)
0.0% of Turbidity are missing values(n = 0)
0.0% of Potability are missing values(n = 0)

Missingness for Potable Samples

In [12]:
x2 = df.loc[df.Potability==1,]

for i, feature in enumerate(x2.columns):
    na_per(x2[feature])
13.85% of ph are missing values(n = 177)
0.0% of Hardness are missing values(n = 0)
0.0% of Solids are missing values(n = 0)
0.0% of Chloramines are missing values(n = 0)
22.93% of Sulfate are missing values(n = 293)
0.0% of Conductivity are missing values(n = 0)
0.0% of Organic_carbon are missing values(n = 0)
4.3% of Trihalomethanes are missing values(n = 55)
0.0% of Turbidity are missing values(n = 0)
0.0% of Potability are missing values(n = 0)

Overall, the Sulfate variable has the largest amount of missing data at nearly one-quarter. Missing data appears to be fairly distributed across the potable and non-potable samples.

In [13]:
# Comparison of features with dependent variable (Potabilitiy)
df.groupby('Potability').mean()
Out[13]:
ph Hardness Solids Chloramines Sulfate Conductivity Organic_carbon Trihalomethanes Turbidity
Potability
0 7.085378 196.733292 21777.490788 7.092175 334.56429 426.730454 14.364335 66.303555 3.965800
1 7.073783 195.800744 22383.991018 7.169338 332.56699 425.383800 14.160893 66.539684 3.968328

Comment on your findings. Is there strong or weak correlation with dependent variable?

In [14]:
# Correlation of features with each other
round(df.corr(),3)
Out[14]:
ph Hardness Solids Chloramines Sulfate Conductivity Organic_carbon Trihalomethanes Turbidity Potability
ph 1.000 0.082 -0.089 -0.034 0.018 0.019 0.044 0.003 -0.039 -0.004
Hardness 0.082 1.000 -0.047 -0.030 -0.107 -0.024 0.004 -0.013 -0.014 -0.014
Solids -0.089 -0.047 1.000 -0.070 -0.172 0.014 0.010 -0.009 0.020 0.034
Chloramines -0.034 -0.030 -0.070 1.000 0.027 -0.020 -0.013 0.017 0.002 0.024
Sulfate 0.018 -0.107 -0.172 0.027 1.000 -0.016 0.031 -0.030 -0.011 -0.024
Conductivity 0.019 -0.024 0.014 -0.020 -0.016 1.000 0.021 0.001 0.006 -0.008
Organic_carbon 0.044 0.004 0.010 -0.013 0.031 0.021 1.000 -0.013 -0.027 -0.030
Trihalomethanes 0.003 -0.013 -0.009 0.017 -0.030 0.001 -0.013 1.000 -0.022 0.007
Turbidity -0.039 -0.014 0.020 0.002 -0.011 0.006 -0.027 -0.022 1.000 0.002
Potability -0.004 -0.014 0.034 0.024 -0.024 -0.008 -0.030 0.007 0.002 1.000
In [15]:
#count the number of samples across potable and non-potable samples
potability_series=df['Potability'].value_counts()
potability_per=pd.Series([(potability_series[0]/len(df)*100),(potability_series[1]/len(df)*100)])
potability_labels = pd.Series(['Not potable','Potable'])
df2=pd.DataFrame({"Potability":potability_labels, 'Count':potability_series, 'Percent':round(potability_per,2)})
df2
Out[15]:
Potability Count Percent
0 Not potable 1998 60.99
1 Potable 1278 39.01
In [16]:
#plot the number of samples across potable and non-potable samples
plt.bar(x = df2["Potability"],
        height = df2["Count"],
       color = "midnightblue" )
plt.title("Potability of Samples", size = 16, fontweight = "bold")
plt.ylabel("Number of Samples", size = 13)
plt.yticks(size = 13)
plt.xticks( size = 13)
plt.show()

Based on a prelinimary look, there does not appear to be notable differences between potable and non-potable sample across each of the the nine features. Potable samples comprise a majority of the data set at roughly 60%. Examination of the linear associations between variables indicated a general lack of association, with two minor exceptions. There are weak associations between Sulfate and Hardness and Sulfate and Solids (r= -.11 and -.17, respectively).

Feature Visualization¶

Below are visualizations of each feature for the whole data set as well as by Potability.

Feature Distibution¶

In [17]:
#adjust figure space
plt.figure(figsize = (15,10), tight_layout = True).text(-0.01, 0.5, 'Number of Samples', va='center', rotation='vertical', 
                                                       fontsize = 18, weight = "bold")
# plot histogram for each of the features 
for i, feature in enumerate(df):
    if feature != 'Potability':
        plt.subplot(3,3,i+1)
        plt.hist(df[feature],
             bins = 30,
             color = "#108A99")
        plt.title(f"Distribution of {feature}", fontsize = 14, weight = "bold")
        plt.xlabel("")
        plt.ylabel("")
        sns.despine()

Feature Distribution by Potability of Sample¶

In [18]:
plt.figure(figsize = (15,10), tight_layout = True).text(-0.01, 0.5, 'Number of Samples', va='center', rotation='vertical', 
                                                       fontsize = 18, weight = "bold")
colors = ["steelblue","firebrick"]
names = ["Potable","Not potable"]
for i, feature in enumerate(df):
    if feature != 'Potability':
        plt.subplot(3,3,i+1)
        x1 = df.loc[df.Potability==0,feature]
        x2 = df.loc[df.Potability==1,feature]
        plt.hist([x2,x1], alpha = 0.8, bins = 30, color = colors, label = names, density = True, stacked = False)
        plt.legend()
        plt.title(f"{feature} by Water Potability", fontsize = 14, weight = "bold")
        plt.xlabel("")
        plt.ylabel("")
        sns.despine()

Across the sample as a whole, each of the features appears normally distributed. When distributions are compared by the water potability, they largely overlap across each variable.

Addressing Missing Values¶

Let's examine the descriptive statistics for both drinkable and non-drinkable water samples for these features. This way we will see how different they are from each other and decide on methodology to fill nulls.

In [19]:
# fill null values with feature mean
df3=df.copy()
df3['ph'].fillna(df3['ph'].mean(), inplace=True)
df3['Sulfate'].fillna(df3['Sulfate'].mean(), inplace=True)
df3['Trihalomethanes'].fillna(df3['Trihalomethanes'].mean(), inplace=True)
In [20]:
#check for any remaining missing values
for i, feature in enumerate(df3.columns):
    na_per(df3[feature])
0.0% of ph are missing values(n = 0)
0.0% of Hardness are missing values(n = 0)
0.0% of Solids are missing values(n = 0)
0.0% of Chloramines are missing values(n = 0)
0.0% of Sulfate are missing values(n = 0)
0.0% of Conductivity are missing values(n = 0)
0.0% of Organic_carbon are missing values(n = 0)
0.0% of Trihalomethanes are missing values(n = 0)
0.0% of Turbidity are missing values(n = 0)
0.0% of Potability are missing values(n = 0)

Predicting Water Potability¶

Below are logistic regression plots of each feature predicting water potability.

In [21]:
plt.figure(figsize = (20,15), tight_layout = True)
y= df3['Potability'].astype(bool)
for i, feature in enumerate(df3.columns):
    if feature != 'Potability':
        plt.subplot(3,3,i+1)
        sns.regplot(x=df3[feature],y=df3['Potability'].astype(bool),data=df3,logistic=True,
        scatter_kws={'color':"steelblue"}, line_kws={'color':"midnightblue"})
        plt.title(f"{feature}", fontsize = 14, weight = "bold")

Overall, results of the logistic regression plots largely align with descriptive and visual examinations of each feature. Across all features, there does not appear to be predictive relationship with water potability.

Exploring associations between features in relation to water potability.¶

In [22]:
g=sns.pairplot(df, hue="Potability")
new_labels = ['Non-potable', ' Potable']
for t, l in zip(g._legend.texts, new_labels):
    t.set_text(l)

Similar to the results above, there is a large overlap of data points across potable and non-potable samples within the majority of pairplots. Some exceptions include clear linear associations between variables for the potable samples specifically. For example, there appears to be a strong negative linear association between ph and Sulfate for the potable samples but not for the non-potable samples.