How to Use Excel's CLEAN Function in Pandas
Pandas' data cleaning capabilities are vital for preparing datasets for analysis. By cleaning data, analysts ensure that their datasets are accurate and that their analyses are reliable.
A helpful, but sometimes forgotten Excel function is CLEAN. This function removes all non-printable characters from text, including line breaks, tabs, and other non-printable characters.
This page guides you through the process repllicating Excel's CLEAN function in pandas.
Implementing the Data Cleaning function in Pandas#
To remove non-printable characters in pandas, you often need to employ regular expressions alongside pandas' string methods. This ensures that your DataFrame only contains printable and useful data.
Remove Non-Printable Characters#
The Excel CLEAN function removes removes the first 32 non-printable characters from the provided text. To implement the same functionality in pandas, you need to use a regular expression to match non-printable characters and replace them with an empty string.
# Remove non-printable characters
df['A'] = df['A'].str.replace(r'[\x00-\x1F]', '', regex=True)
Common mistakes when using CLEAN in Python#
While cleaning data with pandas is a powerful process, there are pitfalls that can lead to incorrect data manipulation or loss. Here's how to avoid some of the most common mistakes.
Data Loss During Cleaning#
It's possible to accidentally remove valuable data when cleaning. For instance, using a broad regex pattern might catch and remove characters that are actually part of the data you need.
Ensure your regex patterns in the str.replace method are specific and test them on a subset of data to prevent unintended data loss
For example, the following code removes all non-printable characters and all extra white spaces from the column
# Test the pattern on a small subset first
df['test_column'] = df['Column'].str.replace('[\x00-\x1F\x7F]', '', regex=True)
Understanding the Data Cleaning Formula in Excel#
The CLEAN function in Excel removes all non-printable characters from text. That is, it removes the first 32 non-printable characters in the 7-bit ASCII code (values 0 through 31), if any are found, and returns the result. Text without these characters is returned unchanged.
CLEAN Excel Syntax
|The text from which you want to remove non-printable characters.
|=CLEAN("This is a test\n")
|Removes the line break from the text.
|"This is a test"
|Removes non-printable characters from the text in cell A1.
Don't want to re-implement Excel's functionality in Python?
Edit a spreadsheet.
Mito is the easiest way to write Excel formulas in Python. Every edit you make in the Mito spreadsheet is automatically converted to Python code.View all 100+ transformations →