Exploratory Data Analysis (EDA) is the cornerstone of any data-driven investigation and a crucial first step in understanding and interpreting a dataset.
At its core, EDA is about delving into raw data to uncover patterns, trends, anomalies, outliers, missing values, incorrect data types and relationships that could otherwise go unnoticed. In short, EDA is the process of cleaning and reviewing data to derive insights such as statistical inferences and correlations, to identify outliers, and to generate hypotheses for experiments and analysis. It acts as the compass that guides analysts through the labyrinth of raw data. Through a blend of statistical summaries, visualizations and data manipulations, EDA transforms raw data into actionable insights.
EDA encompasses several processes, and one of them is identifying outliers in a dataset. As the name suggests, outliers are points that deviate significantly from the rest of the observations. They can appear as unusually high or low values and often stand out as anomalies when compared to the majority of the data. While outliers may sometimes indicate genuine variation or rare events, they can also result from errors in data collection, measurement, or entry.
Identifying outliers involves a combination of statistical and graphical methods. Statistical techniques such as calculating z-scores or using the interquartile range (IQR) help quantify how far individual data points sit from the central tendency. Graphical tools like box plots, scatter plots, and histograms provide visual representations of the data distribution, making outliers easier to spot.
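As a rough illustration of the z-score approach mentioned above, here is a minimal sketch. The prices are hypothetical toy values made up for this example (they are not from the airline dataset used later), and the cutoff of 2 is an assumption; 3 is also common on larger samples.

```python
import pandas as pd

# Hypothetical toy prices, made up for illustration only
prices = pd.Series(
    [4200, 4800, 5100, 5300, 5600, 5900, 6100, 6400, 6800, 25000],
    name="Price",
)

# z-score: how many sample standard deviations a value sits from the mean
z_scores = (prices - prices.mean()) / prices.std()

# Assumed cutoff; with only 10 points a z-score can never reach 3
threshold = 2
outliers = prices[z_scores.abs() > threshold]
print(outliers)
```

On this toy series only the value 25000 is flagged. Note that a single extreme value inflates both the mean and the standard deviation, which is one reason the IQR method is often preferred for skewed data.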
A good starting point for identifying outliers is the pandas .describe() method:
planes["Price"].describe()
count 8140.000000
mean 9038.868796
std 4424.176124
min 1759.000000
25% 5228.000000
50% 8366.000000
75% 12373.000000
max 54826.000000
Name: Price, dtype: float64
Here we can see that the maximum price is roughly 6 times the mean and 6.5 times the median, which indicates an extreme value.
Outliers can also be defined mathematically using the interquartile range (IQR), which is the difference between the 75th and 25th percentiles. In a box plot, the box spans these two percentiles, and the diamonds drawn beyond the whiskers are the outliers.
Consider an airline dataset in which we want to understand how prices differ between airlines. We can draw a box plot with the Seaborn library as follows:
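import matplotlib.pyplot as plt
import seaborn as sns
sns.boxplot(data=planes, x="Airline", y="Price")
plt.xticks(rotation=45)  # rotate the airline names so they do not overlap
plt.show()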
Identifying the IQR:
IQR = 75th percentile - 25th percentile = 12373 - 5228 = 7145
Upper outliers > 75th percentile + (1.5 * IQR) = 23090.5
Lower outliers < 25th percentile - (1.5 * IQR) = -5489.5
Data points above the upper bound or below the lower bound are flagged as potential outliers and warrant further investigation. Because the lower bound here is negative and the minimum price is 1759, only high prices can be flagged in this dataset.
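The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch: iqr_bounds is a hypothetical helper name, and the small Series simply reuses the min, quartile and max values from the .describe() output as stand-in rows, since the full planes DataFrame is not reproduced here.

```python
import pandas as pd

def iqr_bounds(q1, q3, k=1.5):
    """Return the (lower, upper) bounds of the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles taken from the .describe() output
lower, upper = iqr_bounds(5228.0, 12373.0)
print(lower, upper)  # -5489.5 23090.5

# Stand-in rows: min, 25%, 50%, 75% and max from .describe()
sample = pd.Series([1759, 5228, 8366, 12373, 54826], name="Price")

# Keep only the values that fall outside the bounds
flagged = sample[~sample.between(lower, upper)]
print(flagged)
```

With the real DataFrame, the quartiles would come from planes["Price"].quantile(0.25) and planes["Price"].quantile(0.75), and the same boolean mask could be applied to planes["Price"].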
Explanation: The box plot shows the price variation between airlines. Although the quartiles fall in a similar range, the outliers vary widely. This can be attributed to a large proportion of Jet Airways tickets being priced for First Class, whereas the majority of SpiceJet and IndiGo tickets are priced for Economy. Jet Airways exhibits an extreme range of outliers, which we can confirm by calculating the standard deviation. To do this, we create a new column by grouping with .groupby() and then calling the pandas method .transform(). Inside .transform() we apply the lambda function lambda x: x.std(), which calculates the standard deviation of Price for each airline:
planes["airline_price_st_dev"] = planes.groupby("Airline")["Price"].transform(lambda x: x.std())
print(planes[["Airline", "airline_price_st_dev"]].value_counts())
Airline airline_price_st_dev
Jet Airways 4256.192641 3082
IndiGo 2296.581819 1632
Air India 3773.043594 1399
Multiple carriers 3653.796305 959
SpiceJet 1827.006113 653
Vistara 2932.181455 376
Air Asia 2005.894258 260
GoAir 2792.066916 147
dtype: int64
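To see why .transform() is used here instead of a plain aggregation, the sketch below compares the two on a tiny frame. The airline names match the article, but the prices are hypothetical values made up for illustration. The string "std" is passed instead of the lambda; for this purpose it should behave the same as lambda x: x.std().

```python
import pandas as pd

# Hypothetical toy data: real airline names, made-up prices
toy = pd.DataFrame({
    "Airline": ["Jet Airways"] * 3 + ["SpiceJet"] * 3,
    "Price": [8000, 12000, 16000, 3000, 4000, 5000],
})

# .agg() collapses each group to a single row (one value per airline)
per_airline = toy.groupby("Airline")["Price"].agg("std")
print(per_airline)

# .transform() returns one value per original row, aligned to the
# original index, so the result can be assigned as a new column
toy["airline_price_st_dev"] = toy.groupby("Airline")["Price"].transform("std")
print(toy)
```

Because every row of an airline receives the same standard deviation, calling .value_counts() on the two columns yields one line per airline together with its ticket count, which is the shape of the output shown above.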
Calculating the standard deviation helps us understand price variability and compare it across airlines. A high standard deviation points to a wide spread of prices and can signal the presence of outliers; Jet Airways has the highest value here, at about 4256.
It also supports data exploration and visualization by revealing patterns, trends and anomalies that may not be apparent from the raw data alone.
This statistical measure gives further insight into how far data points deviate from the mean, highlighting the significance of the outliers in the dataset.
In conclusion, outliers play a crucial role in data analysis, offering valuable insights into the underlying characteristics and potential complexities of a dataset. While they may initially appear as anomalies or errors, outliers often reveal hidden patterns, trends, or rare events that can significantly impact decision-making and predictive models. By employing robust statistical techniques such as the Interquartile Range (IQR) method and visualization tools like boxplots, analysts can effectively identify and understand outliers, leading to more accurate and meaningful interpretations of the data. Embracing outliers as integral components of the data exploration process enables us to uncover untapped insights and refine our understanding of complex real-world phenomena. Ultimately, recognizing and appropriately handling outliers empowers data-driven organizations to make informed decisions and drive innovation in an increasingly data-centric world.
As always, thank you for following my journey. Your feedback and comments are appreciated!
credits: datasets and info @datacamp