Understanding Outliers: Exploratory Data Analysis with Python

Asmita Pradhan
Mar 22, 2024
4 min read

Exploratory Data Analysis (EDA) is the cornerstone of any data-driven investigation. It is the crucial first step in understanding and interpreting a dataset.

At its core, EDA is about delving into raw data to uncover patterns, trends, anomalies, outliers, missing values, incorrect data types and relationships that could otherwise go unnoticed. In short, EDA is the process of cleaning and reviewing data to derive insights: drawing statistical inferences, finding correlations, identifying outliers and generating hypotheses for experiments and analysis. EDA acts as the compass that guides analysts through the labyrinth of raw data. Through a careful blend of statistical summaries, visualizations and data manipulations, EDA transforms raw data into actionable insights.

EDA encompasses several processes, and one of them is identifying outliers in a dataset. As the name suggests, outliers are points that deviate significantly from the rest of the observations. They can appear as unusually high or low values and often stand out as anomalies when compared to the majority of the data. While outliers may sometimes indicate genuine variations or rare events in the data, they can also result from errors in data collection, measurement, or entry.

Identifying outliers involves combining statistical and graphical methods to detect observations that deviate significantly from the rest of the dataset. Statistical techniques such as calculating z-scores or using the interquartile range (IQR) quantify how far individual data points sit from the central tendency. Graphical tools like box plots, scatter plots, and histograms show the data distribution visually, making outliers easier to spot.
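As a minimal sketch of the z-score technique mentioned above, the snippet below uses a small made-up list of fares (not the airline dataset) and flags values whose absolute z-score exceeds 2. The `prices` values and the cutoff of 2 are assumptions for illustration; on larger samples a cutoff of 3 is a common convention.

```python
import pandas as pd

# Hypothetical fares, invented for illustration only
prices = pd.Series([4000, 5000, 6000, 5500, 4500, 5200, 4800, 30000])

# z-score: how many sample standard deviations a value lies from the mean
z_scores = (prices - prices.mean()) / prices.std()

# Assumed cutoff of 2, chosen because the sample is tiny
z_outliers = prices[z_scores.abs() > 2]
print(z_outliers.tolist())
```

Because the mean and standard deviation are themselves pulled by extreme values, z-scores tend to work best when the data is roughly bell-shaped.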

A good starting point for identifying outliers is the pandas .describe() method:

planes["Price"].describe()

count     8140.000000
mean      9038.868796
std       4424.176124
min       1759.000000
25%       5228.000000
50%       8366.000000
75%      12373.000000
max      54826.000000
Name: Price, dtype: float64

Here we can see that the maximum price (54826) is about 6 times the mean and about 6.5 times the median, which points to extreme values at the upper end.
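The ratios quoted above can be checked with a few lines of arithmetic. This is only a sketch: the three numbers are copied by hand from the .describe() output, and the variable names are made up for the example.

```python
# Summary statistics copied from the .describe() output above
max_price = 54826.0
mean_price = 9038.868796
median_price = 8366.0

# How many times larger the maximum is than each measure of centre
max_to_mean = max_price / mean_price
max_to_median = max_price / median_price

print(f"max / mean   = {max_to_mean:.2f}")
print(f"max / median = {max_to_median:.2f}")
```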

Outliers can also be defined mathematically using the interquartile range (IQR), which is the difference between the 75th and 25th percentiles. In a box plot, the box spans these two percentiles, the whiskers reach the most extreme points within 1.5 * IQR of the box, and the diamonds drawn beyond the whiskers are the outliers.

Consider an airline dataset where we want to understand how prices differ between airlines. We can draw a box plot with the Seaborn library as below:

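import matplotlib.pyplot as plt
import seaborn as sns

# One box per airline, with ticket price on the y-axis
sns.boxplot(data=planes, x="Airline", y="Price")

# Rotate the airline names so they do not overlap
plt.xticks(rotation=45)

plt.show()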

Identifying the IQR:

IQR = 75th percentile - 25th percentile = 12373 - 5228 = 7145

Upper outliers > 75th percentile + (1.5 * IQR) = 23090.5

Lower outliers < 25th percentile - (1.5 * IQR) = -5489.5

Data points above the upper bound or below the lower bound are flagged as potential outliers and warrant further investigation. Here the lower bound is negative while the minimum price is 1759, so every flagged point lies above the upper bound.
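The bound calculation above can be sketched in code as follows. The helper name `iqr_bounds` and the toy `fares` Series are assumptions made up for this example. The first call reuses the quartiles from the .describe() output, and the second shows how the same bounds could filter a pandas Series.

```python
import pandas as pd

def iqr_bounds(q1, q3, k=1.5):
    """Return (lower, upper) outlier bounds from the two quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles copied from the .describe() output for Price
lower, upper = iqr_bounds(5228, 12373)
print(lower, upper)

# Hypothetical fares, invented for illustration only
fares = pd.Series([2000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 60000])
toy_lower, toy_upper = iqr_bounds(fares.quantile(0.25), fares.quantile(0.75))

# Keep only the values that fall outside the bounds
iqr_outliers = fares[(fares < toy_lower) | (fares > toy_upper)]
print(iqr_outliers.tolist())
```

On the real data, the same mask could be applied to planes["Price"] to list the flagged tickets, assuming planes is loaded as in the rest of this post.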

Explanation: The box plot shows how prices vary between airlines. The quartiles fall in a similar range, but the outliers vary widely. This can be attributed to a large proportion of Jet Airways tickets being priced for First Class, whereas the majority of SpiceJet and IndiGo tickets are priced for Economy.

Jet Airways exhibits the most extreme range of outliers, which we can confirm by calculating the standard deviation for each airline. To do this, we create a new column by calling .groupby() and then the pandas .transform() method. Inside .transform() we apply the lambda function lambda x: x.std(), which calculates the standard deviation of Price within each Airline:

planes["airline_price_st_dev"] = planes.groupby("Airline")["Price"].transform(lambda x: x.std())

print(planes[["Airline", "airline_price_st_dev"]].value_counts())

Airline            airline_price_st_dev
Jet Airways        4256.192641             3082
IndiGo             2296.581819             1632
Air India          3773.043594             1399
Multiple carriers  3653.796305              959
SpiceJet           1827.006113              653
Vistara            2932.181455              376
Air Asia           2005.894258              260
GoAir              2792.066916              147
dtype: int64

Calculating the standard deviation lets us gauge price variability and compare it across airlines. A high standard deviation can signal the presence of outliers, although it can also reflect a generally wide spread of prices, so it is best read alongside the box plot. Here Jet Airways has the highest value (about 4256), which is consistent with the box plot. Together with visualization, this helps reveal patterns, trends and anomalies that may not be apparent from the raw data alone.

This statistical measure gives further insight into how far data points deviate from the mean, and so into how much weight the outliers carry in the dataset.
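To see how a single extreme value inflates the standard deviation, here is a small sketch on made-up data. The airlines "A" and "B" and their fares are hypothetical. The example also uses .agg() to get one row per airline, which is an alternative to the .transform() approach used in this post, not a replacement for it.

```python
import pandas as pd

# Hypothetical fares for two made-up airlines; "B" contains one extreme fare
toy = pd.DataFrame({
    "Airline": ["A"] * 5 + ["B"] * 5,
    "Price": [5000, 5200, 4800, 5100, 4900,
              5000, 5200, 4800, 5100, 25000],
})

# One row per airline: sample standard deviation and number of tickets
summary = toy.groupby("Airline")["Price"].agg(["std", "count"])
print(summary)

# The same statistic broadcast back to every row, as .transform() does
toy["airline_price_st_dev"] = toy.groupby("Airline")["Price"].transform("std")
```

Four of the five fares are identical across the two airlines, yet the standard deviation for "B" is many times larger, which is why a high value is a prompt to look closer rather than proof of outliers.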

In conclusion, outliers play a crucial role in data analysis, offering valuable insights into the underlying characteristics and potential complexities of a dataset. While they may initially appear as anomalies or errors, outliers often reveal hidden patterns, trends, or rare events that can significantly impact decision-making and predictive models. By employing robust statistical techniques such as the Interquartile Range (IQR) method and visualization tools like boxplots, analysts can effectively identify and understand outliers, leading to more accurate and meaningful interpretations of the data. Embracing outliers as integral components of the data exploration process enables us to uncover untapped insights and refine our understanding of complex real-world phenomena. Ultimately, recognizing and appropriately handling outliers empowers data-driven organizations to make informed decisions and drive innovation in an increasingly data-centric world.

As always, thank you for following my journey. Your feedback and comments are appreciated!

credits: datasets and info @datacamp
