How to remove (or replace) \xa0 from a text in Python

When you parse HTML files using Beautiful Soup in Python, you might get many \xa0 characters appearing in your text string.

The \xa0 character is the Unicode non-breaking space (U+00A0), which HTML writes as the &nbsp; entity.
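To see where the character comes from, here is a minimal sketch that uses the standard library's html.unescape() function to decode an entity. The sample string is made up for illustration:

```python
import html

# Hypothetical text as it might appear in an HTML source file
raw = "Python&nbsp;is&nbsp;awesome"

# html.unescape() turns the &nbsp; entity into the character U+00A0
decoded = html.unescape(raw)

print(repr(decoded))  # 'Python\xa0is\xa0awesome'

# \xa0 looks like a space on screen, but it is a different character
print(decoded == "Python is awesome")  # False
```

This is why a string that looks fine when printed can still fail comparisons or splits on a regular space.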

There are several ways you can replace the characters with white spaces in your text string:

Use unicodedata.normalize() method
Use BeautifulSoup.get_text() method with strip=True argument
Use replace() method
Use decode() method for Python 2

This tutorial shows how to use each solution to remove the \xa0 characters in practice.

1. Use `unicodedata.normalize()` method

The unicodedata.normalize() function in the unicodedata module can replace \xa0 characters with regular spaces when you use one of the compatibility normalization forms.

The method requires two arguments:

The normalization form used in the process (‘NFC’, ‘NFKC’, ‘NFD’, and ‘NFKD’)
The string you want to normalize

For the purpose of removing \xa0 characters, use one of the compatibility forms (‘NFKC’ or ‘NFKD’). The canonical forms ‘NFC’ and ‘NFD’ leave \xa0 unchanged:

import unicodedata

test_str = "Python\xa0is\xa0awesome"

new_str = unicodedata.normalize("NFKC", test_str)

print(new_str)  # Python is awesome

Notice that the \xa0 characters are gone from the new_str in the example above.
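The choice of normalization form matters here. The short check below, a sketch using only the standard library, runs the same string through all four forms and records whether \xa0 survives. Only the compatibility forms (NFKC and NFKD) replace it:

```python
import unicodedata

test_str = "Python\xa0is\xa0awesome"

# For each form, record whether \xa0 is still present after normalizing
still_has_nbsp = {
    form: "\xa0" in unicodedata.normalize(form, test_str)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}
print(still_has_nbsp)
# {'NFC': True, 'NFD': True, 'NFKC': False, 'NFKD': False}

# Compatibility forms also rewrite other characters, such as superscripts
squared = unicodedata.normalize("NFKC", "m\u00b2")
print(squared)  # m2
```

Because NFKC and NFKD change more than just spaces, str.replace() may be the safer choice when the rest of the text must stay exactly as it is.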

2. Use `BeautifulSoup.get_text()` method

If you’re processing files using Beautiful Soup, then you can use the BeautifulSoup.get_text() method to extract the text and strip surrounding whitespace, including \xa0 characters, from each piece of text.

You only need to pass the argument strip=True when calling the get_text() method as follows:

from bs4 import BeautifulSoup

with open("index.html") as fp:
    soup = BeautifulSoup(fp, 'html.parser')
    text = soup.get_text(strip=True)

The strip=True argument removes whitespace, including \xa0, from the beginning and end of each piece of text. Any \xa0 characters that sit between words are left in place.

If you still see the \xa0 characters, then you need to use unicodedata.normalize() to process the string.
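The Beautiful Soup documentation describes strip=True as stripping whitespace from the beginning and end of each bit of text. The sketch below imitates that step with plain str.strip() on a made-up fragment. It does not call Beautiful Soup itself, so treat it as an illustration of the behavior rather than the library's actual code:

```python
# A made-up fragment, standing in for one piece of text from a parsed page
fragment = "\xa0Python\xa0is\xa0awesome\xa0"

# str.strip() treats \xa0 as whitespace, but it only trims the two ends
stripped = fragment.strip()
print(repr(stripped))  # 'Python\xa0is\xa0awesome'

# A follow-up replace() handles the characters left between words
cleaned = stripped.replace("\xa0", " ")
print(cleaned)  # Python is awesome
```

In practice this means a call to get_text(strip=True) is often followed by one more cleanup step on the returned string.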

3. Use `replace()` method

You can use the str.replace() method to remove \xa0 characters as follows:

test_str = "Python\xa0is\xa0awesome"

new_str = test_str.replace('\xa0', ' ')

print(new_str)  # Python is awesome

This method works for both Python 2 and Python 3. Another benefit of using str.replace() is that you can use this method when you have a list of strings that have \xa0 characters.

The example below shows how to use list comprehension and str.replace() to remove \xa0 characters in the strings:

text_list = ["Python\xa0is\xa0awesome", "Learning\xa0HTML"]

new_list = [t.replace('\xa0', ' ') for t in text_list]

print(new_list)
# ['Python is awesome', 'Learning HTML']
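If your text contains more than one kind of special space, calling replace() once per character gets repetitive. One possible alternative is str.translate() with a table built by str.maketrans(). The characters in the table below are an illustrative selection, not a complete list:

```python
# Map several space-like characters to a regular space.
# The selection here is an assumption for the example, not exhaustive.
SPACE_TABLE = str.maketrans({
    "\xa0": " ",    # non-breaking space
    "\u2009": " ",  # thin space
    "\u202f": " ",  # narrow no-break space
})

text_list = ["Python\xa0is\u2009awesome", "Learning\u202fHTML"]

# translate() applies every mapping in a single pass over each string
new_list = [t.translate(SPACE_TABLE) for t in text_list]

print(new_list)
# ['Python is awesome', 'Learning HTML']
```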

4. Use `decode()` method for Python 2

If you’re using Python 2, then you can also use the str.decode() method to decode the byte string as ASCII and ignore any non-ASCII bytes:

test_str = "Python\xa0is\xa0awesome"
new_str = test_str.decode('ascii', 'ignore')
print(new_str)  # Pythonisawesome

This solution isn’t recommended because the special characters can only be ignored, resulting in a string with no spaces. It’s better to use replace() instead.
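In Python 3, str objects have no decode() method. A rough equivalent, shown here only for comparison, is to encode to ASCII with errors ignored and then decode the bytes again. It has the same drawback of dropping the characters instead of replacing them:

```python
test_str = "Python\xa0is\xa0awesome"

# encode() drops every character outside ASCII,
# then decode() turns the remaining bytes back into a str
new_str = test_str.encode("ascii", "ignore").decode("ascii")

print(new_str)  # Pythonisawesome
```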

Conclusion

You’ve learned different ways you can remove the Unicode characters \xa0 from a text string in Python.

The \xa0 characters can be normalized using unicodedata.normalize() with the ‘NFKC’ or ‘NFKD’ form, which replaces them with regular spaces. If you’re processing text from HTML or XML files, BeautifulSoup.get_text() with strip=True removes them from the start and end of each piece of text.

Alternatively, you can use str.replace(), but this method requires you to specify the exact character to replace, so you need to repeat the process for every special character you have in your string.

I hope this tutorial helps. Happy coding! 😉
