Effective Strategies for Managing Null Values in Datasets
Chapter 1: Introduction to Null Value Handling
Dealing with null values is a significant challenge in machine learning and deep learning. When using libraries like sklearn or TensorFlow, it is crucial to address these null values before feeding the data into any model; failing to do so typically produces confusing error messages that hinder your progress.
In this article, we will explore several methods for managing null values, starting with simple techniques and gradually progressing to more sophisticated and efficient strategies. We will demonstrate these techniques using the well-known Titanic dataset.
To get started, let's load the dataset:
import pandas as pd
import numpy as np
import seaborn as sns
titanic = sns.load_dataset("titanic")
titanic.head()
A quick look at the first rows reveals several null values (shown as NaN). Let's count the nulls present in each column:
titanic.isnull().sum()
The output indicates that the 'age' column has 177 missing values and 'embark_town' has 2. Notably, the 'deck' column is missing 688 of its 891 entries, which would ordinarily justify removing it entirely; we will return to it in Chapter 3.
We will concentrate on the 'age' and 'embark_town' columns, tackling their null values.
Chapter 2: Basic Techniques for Handling Null Values
Section 2.1: Deleting Null Entries
The simplest method to manage null values is to remove any rows that contain them, assuming you have a sufficiently large dataset. This can be accomplished with the following command:
titanic.dropna()
However, note that dropna returns a new DataFrame rather than modifying titanic in place. More importantly, since the Titanic dataset is relatively small, deleting rows often isn't practical, as it can significantly reduce the data available for analysis.
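As a minimal sketch (on a small synthetic frame rather than the Titanic data), dropna can also be restricted to specific columns via the subset parameter, so rows are dropped only when the columns you actually care about are missing:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [22.0, np.nan, 38.0, np.nan],
    "deck": ["C", np.nan, np.nan, "E"],
})

dropped_all = df.dropna()                 # drops any row with a null anywhere
dropped_age = df.dropna(subset=["age"])   # drops only rows where 'age' is null

print(len(dropped_all))  # 1
print(len(dropped_age))  # 2
```

Restricting the subset keeps rows whose nulls sit in columns you intend to drop or impute anyway.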
Section 2.2: Filling Nulls with Zero
An alternative approach is to replace all null values with zeros. For instance, we can fill in the nulls in the 'age' column like this:
titanic['age'].fillna(0)
This method, while straightforward, is rather simplistic and may not be appropriate for all cases. Age, for instance, cannot logically be zero.
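A quick synthetic example makes the distortion concrete: pandas ignores nulls when computing statistics, but zeros are real values and drag the average down:

```python
import numpy as np
import pandas as pd

age = pd.Series([20.0, 30.0, np.nan, np.nan])

print(age.mean())            # 25.0 -- nulls are simply ignored
print(age.fillna(0).mean())  # 12.5 -- the zeros pull the average down
```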
Section 2.3: Forward and Backward Fill
Another common technique is to fill null values using either forward or backward filling. Forward fill replaces a null value with the preceding value in the series, while backward fill uses the subsequent value:
titanic['age'].ffill()
titanic['age'].bfill()
These methods can be useful, but they can also introduce inaccuracies if not applied carefully.
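A small sketch on a synthetic series shows exactly how the two fills behave, including one caveat: a null at the start of the series has no predecessor, so ffill leaves it untouched (and symmetrically, bfill leaves a trailing null):

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, 3.0])

print(s.ffill().tolist())  # [nan, 1.0, 1.0, 3.0] -- leading null survives ffill
print(s.bfill().tolist())  # [1.0, 1.0, 3.0, 3.0]
```

Note that these fills only make sense when row order carries meaning (e.g. time series); for the Titanic data, the row order is arbitrary.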
Section 2.4: Mean and Median Imputation
A more refined approach involves filling null values with the mean or median of the column. Using the median is often preferred due to its robustness against outliers:
titanic['age'] = titanic['age'].fillna(titanic['age'].median())
This method provides a more realistic estimate for missing entries.
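On a synthetic column with an outlier, the difference between the two statistics is easy to see, which is why the median is usually the safer default:

```python
import numpy as np
import pandas as pd

age = pd.Series([20.0, 30.0, np.nan, 80.0])  # 80 is an outlier

print(age.median())  # 30.0 -- unaffected by the outlier
print(age.mean())    # 43.33... -- pulled upward by the outlier

filled = age.fillna(age.median())
print(filled.tolist())  # [20.0, 30.0, 30.0, 80.0]
```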
Section 2.5: Grouped Mean and Median Filling
Instead of using the overall mean or median, you can achieve greater accuracy by calculating these values based on specific groups. For instance, filling in the 'age' column by the mean age of each passenger class and survival status:
titanic['age'] = titanic['age'].fillna(titanic.groupby(['pclass', 'alive'])['age'].transform('mean'))
This method tailors the imputation to the specific characteristics of each group, improving the quality of the data.
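The mechanics are easiest to see on a tiny synthetic frame: transform('mean') returns a series aligned with the original rows, carrying each row's own group mean, so fillna picks the right value per group:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "pclass": [1, 1, 3, 3],
    "age": [40.0, np.nan, 20.0, np.nan],
})

# Each null receives the mean age of its own pclass group
df["age"] = df["age"].fillna(df.groupby("pclass")["age"].transform("mean"))
print(df["age"].tolist())  # [40.0, 40.0, 20.0, 20.0]
```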
Section 2.6: Imputing Categorical Values
For categorical columns such as 'embark_town', a similar approach can be applied. First, convert the categorical values to numerical codes. One caution: cat.codes encodes missing values as -1 rather than NaN, so they must be restored to NaN, or the subsequent fillna will find nothing to fill:
titanic['embark_town'] = titanic['embark_town'].astype('category')
titanic['embark_town'] = titanic['embark_town'].cat.codes
titanic['embark_town'] = titanic['embark_town'].replace(-1, np.nan)
Then, fill the nulls using the median of the respective groups:
titanic['embark_town'] = titanic['embark_town'].fillna(titanic.groupby(['pclass', 'alive'])['embark_town'].transform('median'))
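Here is the same pipeline end to end on synthetic data (column and category names chosen purely for illustration), including the restore-to-NaN step that cat.codes makes necessary:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "pclass": [1, 1, 1, 3, 3, 3],
    "town": ["Cherbourg", "Cherbourg", np.nan, "Southampton", np.nan, "Southampton"],
})

# Encode categories as integer codes; cat.codes marks missing values as -1
df["town"] = df["town"].astype("category").cat.codes
df["town"] = df["town"].replace(-1, np.nan)

# Fill each null with the median code of its own pclass group
df["town"] = df["town"].fillna(df.groupby("pclass")["town"].transform("median"))
print(df["town"].tolist())  # [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
```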
Chapter 3: Advanced Techniques for Null Value Management
Section 3.1: Iterative Imputation Using Machine Learning
A highly effective technique for imputing null values involves using a machine learning model. This method utilizes the information from non-null entries to predict the missing values. For this demonstration, we will incorporate the 'deck' column in our analysis, even though it has a high number of nulls.
Using the RandomForestRegressor from sklearn, we can perform iterative imputation:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor
# Encode 'deck' as numeric codes, restoring NaN for missing entries so the imputer can see them
titanic['deck'] = titanic['deck'].astype('category').cat.codes.replace(-1, np.nan)
titanic1 = titanic[['survived', 'pclass', 'age', 'sibsp', 'fare', 'embark_town', 'deck']]
imptr = IterativeImputer(RandomForestRegressor(), max_iter=10, random_state=0)
titanic2 = pd.DataFrame(imptr.fit_transform(titanic1), columns=titanic1.columns)
After running the imputation, we can check for any remaining null values:
titanic2.isnull().sum()
The output confirms that all columns are now free from null values.
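One follow-up detail worth handling (a sketch independent of sklearn, with hypothetical values): a regression-based imputer produces continuous estimates, so imputed codes for a categorical column land between integers and should be rounded and clipped back to valid codes before decoding them into labels:

```python
import pandas as pd

# Suppose the imputer produced these fractional codes for 'embark_town'
imputed = pd.Series([2.0, 0.7, 1.0])
categories = ["Cherbourg", "Queenstown", "Southampton"]

# Snap estimates back onto valid integer codes, then decode to labels
codes = imputed.round().clip(0, len(categories) - 1).astype(int)
towns = codes.map(lambda c: categories[c])
print(towns.tolist())  # ['Southampton', 'Queenstown', 'Queenstown']
```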
Conclusion
In practice, you can mix and match these strategies, tailoring the imputation to each column's characteristics and your specific needs. If you have discovered alternative methods that improve null value handling, please share your insights.
For additional resources and discussions, feel free to connect with me on Twitter and visit my Facebook page for more updates.
Chapter 4: Video Resources
For further insights, check out the following video tutorials:
Learn about handling null values for machine learning in Python with effective strategies.
Discover how to use the coalesce function in Power Query to manage null values seamlessly.