Excel to Python: CLEAN Function - A Complete Guide | Mito
Home Icon
divider

Functions

divider

Text

divider

CLEAN

How to Use Excel's CLEAN Function in Pandas

Pandas' data cleaning capabilities are vital for preparing datasets for analysis. By cleaning data, analysts ensure that their datasets are accurate and that their analyses are reliable.

A helpful, but sometimes forgotten Excel function is CLEAN. This function removes all non-printable characters from text, including line breaks, tabs, and other non-printable characters.

This page guides you through the process repllicating Excel's CLEAN function in pandas.

Mito is an open source library that lets you write Excel formulas in Python. Either write the formula directly in Python or use the CLEAN formula in the Mito Spreadsheet and generate the equivalent Python code automatically.

Mito's CLEAN function works exactly like it does in Excel. That means you don't need worry about managing data types, handling errors, or the edge case differences between Excel and Python formulas.

Install Mito to start using Excel formulas in Python.

# Import the mitosheet Excel functions
from mitosheet.public.v3 import *;

# Use Mito's CLEAN function
df['A'] = CLEAN(df['A'])
Copy!
Clipboard

To remove non-printable characters in pandas, you often need to employ regular expressions alongside pandas' string methods. This ensures that your DataFrame only contains printable and useful data.

The Excel CLEAN function removes removes the first 32 non-printable characters from the provided text. To implement the same functionality in pandas, you need to use a regular expression to match non-printable characters and replace them with an empty string.

# Remove non-printable characters
df['A'] = df['A'].str.replace(r'[\x00-\x1F]', '', regex=True)
Copy!
Clipboard

While cleaning data with pandas is a powerful process, there are pitfalls that can lead to incorrect data manipulation or loss. Here's how to avoid some of the most common mistakes.

It's possible to accidentally remove valuable data when cleaning. For instance, using a broad regex pattern might catch and remove characters that are actually part of the data you need.

Ensure your regex patterns in the str.replace method are specific and test them on a subset of data to prevent unintended data loss

For example, the following code removes all non-printable characters and all extra white spaces from the column

# Test the pattern on a small subset first
df['test_column'] = df['Column'].str.replace('[\x00-\x1F\x7F]', '', regex=True)
Copy!
Clipboard

The CLEAN function in Excel removes all non-printable characters from text. That is, it removes the first 32 non-printable characters in the 7-bit ASCII code (values 0 through 31), if any are found, and returns the result. Text without these characters is returned unchanged.

=CLEAN(text)

CLEAN Excel Syntax

ParameterDescriptionData Type
textThe text from which you want to remove non-printable characters.string

Examples

FormulaDescriptionResult
=CLEAN("This is a test\n")Removes the line break from the text."This is a test"
=CLEAN(A1)Removes non-printable characters from the text in cell A1.Cleaned text

Don't re-invent the wheel. Use Excel formulas in Python.

Install Mito

Don't want to re-implement Excel's functionality in Python?

Automate analysis with Mito