When you parse HTML files using Beautiful Soup in Python, you might get many
\xa0 characters appearing in your text string.
\xa0 is the Unicode character U+00A0, the non-breaking space.
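To make this concrete, here is a small sketch that uses only the standard library. It looks up the character's Unicode name and shows that the HTML entity &nbsp; decodes to the same character. The variable names are chosen for illustration only.

```python
import html
import unicodedata

nbsp = "\xa0"

# Look up the official Unicode name and code point of the character.
char_name = unicodedata.name(nbsp)
code_point = f"U+{ord(nbsp):04X}"

# The HTML entity &nbsp; decodes to the same character.
decoded = html.unescape("Python&nbsp;is&nbsp;awesome")

print(char_name, code_point)  # NO-BREAK SPACE U+00A0
print(repr(decoded))          # 'Python\xa0is\xa0awesome'
```

This is why text extracted from HTML pages tends to contain \xa0: every &nbsp; in the markup becomes one.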
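There are several ways you can replace these characters with white spaces in your text string:
- The unicodedata.normalize() method
- The BeautifulSoup.get_text() method
- The str.replace() method
- The decode() method for Python 2
This tutorial shows how to use each solution to remove the characters in practice.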
The unicodedata.normalize() method in the
unicodedata module can be used to normalize the
\xa0 characters, replacing them with white spaces.
The method requires two arguments:
- The normalization form used in the process (‘NFC’, ‘NFKC’, ‘NFD’, or ‘NFKD’)
- The string you want to normalize
For the purpose of removing
\xa0 characters, you need one of the compatibility forms, ‘NFKC’ or ‘NFKD’. The ‘NFC’ and ‘NFD’ forms leave \xa0 unchanged:
import unicodedata

test_str = "Python\xa0is\xa0awesome"
new_str = unicodedata.normalize("NFKC", test_str)
print(new_str)  # Python is awesome
Notice that the
\xa0 characters are gone from the
new_str in the example above.
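The choice of normalization form matters here. As a quick check, the sketch below runs the same string through all four forms and records whether any \xa0 survives. The results dictionary is just an illustrative helper.

```python
import unicodedata

test_str = "Python\xa0is\xa0awesome"

# For each form, record whether any \xa0 survives normalization.
results = {}
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    normalized = unicodedata.normalize(form, test_str)
    results[form] = "\xa0" in normalized

for form, still_present in results.items():
    status = "keeps \\xa0" if still_present else "replaces \\xa0 with a space"
    print(form, status)
```

Only the compatibility forms (NFKC and NFKD) replace the non-breaking space. Keep in mind that these forms also rewrite other compatibility characters, for example the ligature "ﬁ" becomes "fi", so they change more than just \xa0.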
If you’re processing files using Beautiful Soup, then you can use the
BeautifulSoup.get_text() method to strip surrounding whitespace, including \xa0 characters, from the extracted text.
You only need to pass the argument
strip=True when calling the
get_text() method as follows:
from bs4 import BeautifulSoup

with open("index.html") as fp:
    soup = BeautifulSoup(fp, 'html.parser')

text = soup.get_text(strip=True)
The result string will have whitespace, including \xa0 characters, stripped from the beginning and end of each piece of text.
\xa0 characters in the middle of a piece of text are kept.
If you still see the
\xa0 characters, then you need to use
unicodedata.normalize() to process the string.
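A minimal sketch of that fallback step follows. The helper name clean_extracted_text is hypothetical, and the raw string stands in for whatever get_text(strip=True) returned, so the example runs without Beautiful Soup or an HTML file.

```python
import unicodedata


def clean_extracted_text(text: str) -> str:
    """Replace remaining non-breaking spaces, then trim both ends."""
    # NFKC turns \xa0 into a regular space (NFC and NFD would not).
    return unicodedata.normalize("NFKC", text).strip()


# Stand-in for the return value of soup.get_text(strip=True).
raw = "Python\xa0is\xa0awesome"
cleaned = clean_extracted_text(raw)
print(cleaned)  # Python is awesome
```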
You can use the
str.replace() method to remove
\xa0 characters as follows:
test_str = "Python\xa0is\xa0awesome"
new_str = test_str.replace('\xa0', ' ')
print(new_str)  # Python is awesome
This method works for both Python 2 and Python 3. Another benefit of using
str.replace() is that you can use this method when you have a list of strings that have
\xa0 characters.
The example below shows how to use list comprehension and
str.replace() to remove
\xa0 characters in the strings:
text_list = ["Python\xa0is\xa0awesome", "Learning\xa0HTML"]
new_list = [t.replace('\xa0', ' ') for t in text_list]
print(new_list)  # ['Python is awesome', 'Learning HTML']
decode() method for Python 2
If you’re using Python 2, then you can also choose to use the
str.decode() method to decode the string as
ascii and ignore the characters that can’t be decoded:
test_str = "Python\xa0is\xa0awesome"
new_str = test_str.decode('ascii', 'ignore')
print(new_str)  # Pythonisawesome
This solution isn’t recommended because the special characters can only be ignored, resulting in a string with no spaces. It’s better to use one of the other solutions shown earlier.
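Python 3 strings have no decode() method, but the same drawback can be reproduced by encoding to ASCII and back. The sketch below is an assumed Python 3 equivalent rather than part of the original Python 2 solution, and it contrasts the result with a plain replacement.

```python
test_str = "Python\xa0is\xa0awesome"

# Encoding to ASCII with errors="ignore" silently drops every
# non-ASCII character, so the word boundaries are lost.
ascii_only = test_str.encode("ascii", "ignore").decode("ascii")
print(ascii_only)  # Pythonisawesome

# Replacing the character instead keeps the words separated.
spaced = test_str.replace("\xa0", " ")
print(spaced)  # Python is awesome
```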
You’ve learned different ways you can remove the Unicode character
\xa0 from a text string in Python.
The characters can be normalized using the
unicodedata.normalize() method with the ‘NFKC’ or ‘NFKD’ form, replacing them with white spaces. You can also use
BeautifulSoup.get_text() if you’re processing text from HTML or XML files.
Alternatively, you can also use the
str.replace() method, but this method requires you to specify the string portion to replace, so you need to repeat the process for every Unicode character you have in your string.
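When several different characters need replacing, one way to avoid chaining many replace() calls is a translation table built with str.maketrans(). This is a sketch under assumptions: the SPACE_LIKE mapping is a hypothetical selection, so adjust it to the characters that actually appear in your data.

```python
# Hypothetical selection of characters to turn into plain spaces.
SPACE_LIKE = {
    "\xa0": " ",    # no-break space
    "\u2009": " ",  # thin space
    "\u202f": " ",  # narrow no-break space
}

# Build the table once, then reuse it for every string.
table = str.maketrans(SPACE_LIKE)

sample = "Python\xa0is\u2009really\u202fawesome"
cleaned = sample.translate(table)
print(cleaned)  # Python is really awesome
```

Unlike the ‘NFKC’ normalization form, this approach touches only the characters listed in the mapping.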
I hope this tutorial helps. Happy coding! 😉