How to remove (or replace) \xa0 from a text in Python

When you parse HTML files using Beautiful Soup in Python, you might get many \xa0 characters appearing in your text string.

The \xa0 escape is how Python displays U+00A0, the Unicode non-breaking space, which HTML writes as the &nbsp; entity.
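To see the connection for yourself, here is a small sketch that uses only the standard library. It assumes Python 3, and the variable name nbsp is just for illustration:

```python
import html
import unicodedata

# Decode the HTML entity into the character it stands for
nbsp = html.unescape("&nbsp;")

print(repr(nbsp))              # '\xa0'
print(unicodedata.name(nbsp))  # NO-BREAK SPACE
print(nbsp == " ")             # False: it is not a regular space
```

Because the character is not a regular space, comparisons and splits on " " will not match it until you replace it.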

There are several ways to replace these characters with regular spaces in your text string:

  1. Use unicodedata.normalize() method
  2. Use BeautifulSoup.get_text() method with strip=True argument
  3. Use replace() method
  4. Use decode() method for Python 2

This tutorial shows how to use each solution to remove the \xa0 characters in practice.

1. Use unicodedata.normalize() method

The unicodedata.normalize() function in the unicodedata module normalizes a Unicode string. With a compatibility form, it replaces each \xa0 character with a regular space.

The method requires two arguments:

  1. The normalization form used in the process (‘NFC’, ‘NFKC’, ‘NFD’, and ‘NFKD’)
  2. The string you want to normalize

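To replace \xa0 characters, you need one of the compatibility forms, 'NFKC' or 'NFKD'. The canonical forms 'NFC' and 'NFD' leave \xa0 untouched. The example below uses 'NFKD':

import unicodedata

test_str = "Python\xa0is\xa0awesome"

# NFKD is a compatibility form, so \xa0 becomes a regular space
new_str = unicodedata.normalize("NFKD", test_str)

print(new_str)  # Python is awesome

Notice that the \xa0 characters are gone from new_str in the example above. Keep in mind that compatibility normalization also rewrites other characters, such as ligatures and superscripts.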
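If you’re unsure which form to pick, the quick check below may help. It is a sketch that assumes Python 3 and loops over the four forms to see which ones actually remove \xa0. The results dictionary is a name made up for this example:

```python
import unicodedata

test_str = "Python\xa0is\xa0awesome"

# Record, for each form, whether \xa0 is still present after normalizing
results = {}
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    results[form] = "\xa0" in unicodedata.normalize(form, test_str)

print(results)
# {'NFC': True, 'NFD': True, 'NFKC': False, 'NFKD': False}

# Compatibility forms also change other characters, for example superscripts
print(unicodedata.normalize("NFKC", "x\u00b2"))  # x2
```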

2. Use BeautifulSoup.get_text() method

If you’re processing files using Beautiful Soup, then you can use the BeautifulSoup.get_text() method to extract the text from the document. By the time you call it, the parser has already converted HTML entities such as &nbsp; into their characters.

Passing the argument strip=True trims whitespace, including \xa0, from the beginning and end of each piece of text:

from bs4 import BeautifulSoup

with open("index.html") as fp:
    soup = BeautifulSoup(fp, 'html.parser')
    text = soup.get_text(strip=True)

The result string will have leading and trailing whitespace removed from each piece of text, which takes care of \xa0 characters at the edges. Any \xa0 characters between words are left in place.

If you still see the \xa0 characters, then you need to use unicodedata.normalize() or str.replace() to process the string.
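The sketch below combines get_text() with a follow-up cleanup on a small inline HTML string instead of a file. It assumes the bs4 package is installed. The sample markup and the separator=" " argument are choices made for this example, not something your own code needs:

```python
from bs4 import BeautifulSoup

# Sample markup with &nbsp; both between words and at the edges
html_doc = "<p>Python&nbsp;is&nbsp;awesome</p><p>&nbsp;Learning&nbsp;HTML&nbsp;</p>"

soup = BeautifulSoup(html_doc, "html.parser")

# strip=True trims each piece of text; separator keeps the pieces apart
text = soup.get_text(separator=" ", strip=True)

# \xa0 between words survives get_text(), so replace it afterwards
clean_text = text.replace("\xa0", " ")

print(clean_text)  # Python is awesome Learning HTML
```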

3. Use replace() method

You can use the str.replace() method to remove \xa0 characters as follows:

test_str = "Python\xa0is\xa0awesome"

new_str = test_str.replace('\xa0', ' ')

print(new_str)  # Python is awesome

This method works for both Python 2 and Python 3. Another benefit of using str.replace() is that you can use this method when you have a list of strings that have \xa0 characters.

The example below shows how to use a list comprehension and str.replace() to remove \xa0 characters from every string in a list:

text_list = ["Python\xa0is\xa0awesome", "Learning\xa0HTML"]

new_list = [t.replace('\xa0', ' ') for t in text_list]

print(new_list)
# ['Python is awesome', 'Learning HTML']
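If your strings contain more than one kind of special space, calling replace() once per character gets repetitive. One alternative, which is a different technique from the replace() call shown above, is str.translate() with a mapping table. The sketch below assumes Python 3. The SPACE_LIKE mapping and the to_plain_spaces() helper are hypothetical names, and which characters belong in the mapping depends on your data:

```python
# Hypothetical mapping: the choice of characters here is an assumption
SPACE_LIKE = {
    "\xa0": " ",    # no-break space
    "\u2009": " ",  # thin space
    "\u202f": " ",  # narrow no-break space
}

# Build the translation table once and reuse it for every string
TABLE = str.maketrans(SPACE_LIKE)


def to_plain_spaces(text):
    """Return text with every mapped character turned into a regular space."""
    return text.translate(TABLE)


text_list = ["Python\xa0is\u2009awesome", "Learning\u202fHTML"]

new_list = [to_plain_spaces(t) for t in text_list]

print(new_list)
# ['Python is awesome', 'Learning HTML']
```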

4. Use decode() method for Python 2

If you’re using Python 2, then you can also use the str.decode() method to decode the byte string as ASCII and ignore any non-ASCII bytes such as \xa0:

test_str = "Python\xa0is\xa0awesome"
new_str = test_str.decode('ascii', 'ignore')
print(new_str)  # Pythonisawesome

This solution isn’t recommended because the \xa0 characters are dropped rather than replaced, so the words run together with no spaces between them. It’s better to use replace() instead.
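In Python 3, str objects have no decode() method, so the code above raises an AttributeError. A rough equivalent, shown here as a sketch, is to encode to ASCII while ignoring errors and then decode the bytes back into a string:

```python
test_str = "Python\xa0is\xa0awesome"

# Encoding with errors="ignore" drops every character ASCII cannot represent
ascii_bytes = test_str.encode("ascii", "ignore")

# Decode the remaining bytes back into a str
new_str = ascii_bytes.decode("ascii")

print(new_str)  # Pythonisawesome
```

The result has the same drawback: the \xa0 characters are removed, not turned into spaces.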

Conclusion

You’ve learned different ways you can remove the Unicode character \xa0 from a text string in Python.

The characters can be normalized using the unicodedata.normalize() function with the 'NFKC' or 'NFKD' form, which replaces them with regular spaces. If you’re processing text from HTML or XML files, you can also pass strip=True to BeautifulSoup.get_text() to trim them from the edges of each piece of text.

Alternatively, you can use str.replace(), but this method requires you to specify the exact character to replace, so you need to repeat the process for every special character you have in your string.

I hope this tutorial helps. Happy coding! 😉
