EDA at One-Stop: Unveiling the Power of Exploratory Data Analysis
Greetings, everyone! Exploratory Data Analysis (EDA) is a pivotal task in the data industry: it shapes raw data into a state that is fit for model training. Regardless of a project's scale, EDA can demand considerable time and effort, yet it is indispensable for unearthing insights and readying the data for modeling.
Approaches to tackling EDA vary greatly. While navigating diverse EDA and machine learning Kaggle projects in my own data journey, I realized that I needed a template, one crafted to help both newcomers entering the data industry and seasoned experts move through the analysis more efficiently.
Taking inspiration from a variety of Kaggle projects and insights from Krish Naik, I've forged an EDA template that harmonizes with various business use cases. Now, let's delve into how the template operates. I'll walk through the process using a bike sales use case, supplemented by brief Python code snippets for practical implementation.
You can also check out the complete project here: github
- Identifying and Fixing the Type of Variables: Upon receiving the data, my first step is to scrutinize the raw data and understand its features using fundamental functions such as info() and describe(). I then make sure each feature is assigned the appropriate data type. For example, with the bike sales data below, I inspect the data type of each feature and cast any that are mis-typed (a small conversion sketch follows the output).
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   datetime    10886 non-null  datetime64[ns]
 1   season      10886 non-null  int64
 2   holiday     10886 non-null  int64
 3   workingday  10886 non-null  int64
 4   weather     10886 non-null  int64
 5   temp        10886 non-null  float64
 6   atemp       10886 non-null  float64
 7   humidity    10886 non-null  int64
 8   windspeed   10886 non-null  float64
 9   casual      10886 non-null  int64
 10  registered  10886 non-null  int64
 11  count       10886 non-null  int64
dtypes: datetime64[ns](1), float64(3), int64(8)
memory usage: 1020.7 KB
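As a minimal sketch of that first pass, assuming the data is loaded from the competition's train.csv, I cast the datetime column explicitly since it arrives as a plain object:
import pandas as pd

# Load the raw data and fix the datetime column's type
train = pd.read_csv('train.csv')
train['datetime'] = pd.to_datetime(train['datetime'])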
- Distinguishing Numerical and Categorical Features: After rectifying the data types, I create separate lists of the numerical and categorical variables in the dataset; this split streamlines the subsequent univariate and bivariate analyses. As seen above, all of our variables are numerical, so no further separation is required here, though I've sketched the generic split below for reference.
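For datasets that do mix types, a minimal sketch of the split I usually start from simply groups the columns by dtype:
# Numeric columns form the numerical list; object/category columns are treated as categorical
numerical_features = train.select_dtypes(include=['number']).columns.tolist()
categorical_features = train.select_dtypes(include=['object', 'category']).columns.tolist()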
- Detecting Missing Values and Outliers: Managing missing values and outliers is of paramount importance in the early stages. I conduct a thorough check for missing values and dig into the reasons behind them. At the same time, I check for outliers, since some machine learning algorithms are sensitive to them. Following this process for our problem statement, I observed that the target feature, count, exhibits right skewness.
train.isnull().sum()
datetime      0
season        0
holiday       0
workingday    0
weather       0
temp          0
atemp         0
humidity      0
windspeed     0
casual        0
registered    0
count         0
dtype: int64
import matplotlib.pyplot as plt
import seaborn as sns

# Plot the distributions of the main numerical features to inspect their skewness
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
fig.suptitle("Bikes Analysis")
sns.distplot(train['temp'], ax=axes[0, 0], label="temp")
axes[0, 0].set_xlabel('Temperature')
sns.distplot(train['atemp'], ax=axes[0, 1], label="atemp")
axes[0, 1].set_xlabel('Feels-like Temperature')
sns.distplot(train['humidity'], ax=axes[1, 0], label='humidity')
axes[1, 0].set_xlabel('Humidity')
sns.distplot(train['windspeed'], ax=axes[1, 1], label='windspeed')
axes[1, 1].set_xlabel('Windspeed')
plt.show()
- Dropping Unnecessary Columns: In this phase, I review column values to ensure accuracy. For instance, I might strip a prefix like 's' from the 'season' column to turn it into a numerical format (a hypothetical cleanup is sketched after the snippet below). For this use case, I performed some preprocessing and dropped the columns I deemed irrelevant.
def drop_columns(df, columns):
    df.drop(columns, axis=1, inplace=True)
    return df

drop_columns(train, ["datetime", "casual", "registered"])
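If the 'season' column really did carry a stray prefix (say, values like 's1' or 's2'), a hypothetical cleanup could look like this; our column is already numeric, so it is purely illustrative:
# Hypothetical: strip a leading 's' and cast the column back to integer
train['season'] = train['season'].astype(str).str.lstrip('s').astype(int)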
- Handling Missing Values and Outliers: If missing values are identified, I address them with imputation techniques or by discarding the affected columns (a simple imputation sketch follows the outlier code below). Outliers I handle using the interquartile range (IQR), and that is the method I applied here.
# Calculate the IQR
Q1 = train['windspeed'].quantile(0.25)
Q3 = train['windspeed'].quantile(0.75)
IQR = Q3 - Q1
# Calculate the lower and upper bounds
lower_bound = Q1 - (IQR * 1.5)
upper_bound = Q3 + (IQR * 1.5)
print("Lower Bound:", lower_bound)
print("Upper Bound:", upper_bound)
# Cap any windspeed values above the upper bound
train.loc[train['windspeed'] > upper_bound, 'windspeed'] = upper_bound
sns.distplot(train['windspeed'], label='windspeed')
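Had the earlier check surfaced any missing values, a simple median imputation would be my first resort; a minimal sketch (not needed here, since nothing is missing):
# Hypothetical: fill any missing windspeed values with the column median
train['windspeed'] = train['windspeed'].fillna(train['windspeed'].median())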
- Conducting Univariate and Bivariate Analysis: With the categorical and numerical variables distinguished, I conduct univariate and bivariate analyses, exploring different combinations of features against the target to extract valuable insights, using tools like Seaborn and Matplotlib. You can review some of the analyses I performed on the features here; one example of the kind of bivariate plot I lean on follows.
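As a rough sketch using the columns from this dataset, a box plot of the target against season gives a quick bivariate read on seasonality:
import matplotlib.pyplot as plt
import seaborn as sns

# Bivariate view: how rental counts vary across the four seasons
sns.boxplot(x='season', y='count', data=train)
plt.xlabel('Season')
plt.ylabel('Rental count')
plt.show()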
- Handling Categorical Features: Although categorical features carry useful information, machine learning models require numerical input. To address this, I employ encoding methods such as one-hot encoding or label encoding. Since this use case has no categorical variables, we can skip the step; a hypothetical sketch follows for completeness.
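For use cases that do include categorical columns, a minimal one-hot encoding sketch with pandas might look like this (the 'city' column is purely hypothetical):
import pandas as pd

# Hypothetical categorical column encoded as one-hot (dummy) variables
df = pd.DataFrame({'city': ['London', 'Paris', 'Berlin', 'London']})
encoded = pd.get_dummies(df, columns=['city'], drop_first=True)
print(encoded)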
- Feature Scaling: The next step involves feature scaling, where I normalize features within a specific range. This mitigates issues related to varying feature magnitudes during modeling. I have used MinMax Scaler to scale our features.
from sklearn.preprocessing import MinMaxScaler

num_cols = ['temp', 'atemp', 'humidity', 'windspeed']
scaler = MinMaxScaler()

def scaling(df, fit=False):
    # Fit the scaler on the training set only, then reuse it for the test set
    # so both share the same scale without leaking test information
    df[num_cols] = scaler.fit_transform(df[num_cols]) if fit else scaler.transform(df[num_cols])
    return df

train = scaling(train, fit=True)
test = scaling(test)
- Addressing Data Imbalance: Scrutinizing data imbalance is crucial for understanding whether the dataset leans heavily towards a specific output class. Resampling techniques or ensemble algorithms can be deployed to rectify imbalances. Since our bike demand problem is a regression task rather than a classification one, this step does not apply here; a hypothetical sketch for an imbalanced classification target follows.
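For a classification target that is skewed, a resampling sketch with imbalanced-learn's SMOTE could look like the following; the toy data is purely illustrative:
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Toy imbalanced data (roughly 90/10) just to illustrate the resampling call
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X, y)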
- Feature Selection: Typically performed at the end to combat the curse of dimensionality, feature selection draws on techniques such as heat maps, correlation methods, or genetic algorithms, which help identify the essential features for the subsequent model-building stage. Upon analyzing the heat map (see the sketch below), I found that I needed all of my features, so I proceeded to the modeling stage without dropping any.
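A minimal sketch of the correlation heat map I rely on at this stage, computed over the remaining numerical features:
import matplotlib.pyplot as plt
import seaborn as sns

# Correlation heat map across the remaining features
plt.figure(figsize=(8, 6))
sns.heatmap(train.corr(), annot=True, cmap='coolwarm')
plt.show()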
These are the precise steps I follow in both small Kaggle projects and comprehensive end-to-end data science ventures. I trust this guide proves beneficial for newcomers and serves as a refresher for those already well acquainted with EDA. If you have additional steps to suggest, please share them in the comments for the benefit of others.
Thanks for joining the EDA journey!