
When you parse HTML files using Beautiful Soup in Python, you might get many \xa0 characters appearing in your text string.
The \xa0 escape sequence represents U+00A0, the Unicode non-breaking space character, which is written as the &nbsp; entity in HTML.
There are several ways you can replace the characters with white spaces in your text string:
- Use the unicodedata.normalize() method
- Use the BeautifulSoup.get_text() method with the strip=True argument
- Use the replace() method
- Use the decode() method for Python 2
This tutorial shows how to use each solution to remove the \xa0 characters in practice.
1. Use unicodedata.normalize() method
The unicodedata.normalize() method in the unicodedata module can be used to normalize the \xa0 characters, replacing them with white spaces.
The method requires two arguments:
- The normalization form used in the process (‘NFC’, ‘NFKC’, ‘NFD’, and ‘NFKD’)
- The string you want to normalize
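To remove \xa0 characters, you need one of the compatibility forms, "NFKC" or "NFKD", because the "NFC" and "NFD" forms leave \xa0 unchanged. The example below uses "NFKC":
import unicodedata
test_str = "Python\xa0is\xa0awesome"
# NFKC maps the non-breaking space to a regular space
new_str = unicodedata.normalize("NFKC", test_str)
print(new_str) # Python is awesome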
Notice that the \xa0 characters are gone from the new_str in the example above.
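Because a non-breaking space looks like a regular space when printed, it can be hard to tell whether the normalization worked. The short sketch below uses repr() to make the difference between the canonical and compatibility forms visible. The variable names nfc_str and nfkc_str are only for illustration:

```python
import unicodedata

test_str = "Python\xa0is\xa0awesome"

# Canonical forms keep the non-breaking space as it is
nfc_str = unicodedata.normalize("NFC", test_str)
print(repr(nfc_str))   # 'Python\xa0is\xa0awesome'

# Compatibility forms map it to a regular space (U+0020)
nfkc_str = unicodedata.normalize("NFKC", test_str)
print(repr(nfkc_str))  # 'Python is awesome'

# Compatibility forms also rewrite other characters, such as superscripts
print(unicodedata.normalize("NFKC", "x\u00b2"))  # x2
```

Keep the last line in mind: compatibility normalization changes more than just \xa0, so it may not be the right choice when you need to preserve the other characters exactly.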
2. Use BeautifulSoup.get_text() method
If you’re processing files using Beautiful Soup, then you can call the get_text() method with strip=True to trim whitespace from the extracted text.
Beautiful Soup converts each &nbsp; entity into a \xa0 character when it parses the document. Passing strip=True removes leading and trailing whitespace, including \xa0, from each piece of text:
from bs4 import BeautifulSoup
with open("index.html") as fp:
    soup = BeautifulSoup(fp, 'html.parser')
text = soup.get_text(strip=True)
Note that strip=True only trims the start and end of each piece of text. Any \xa0 characters in the middle of the text remain in the result string.
If you still see the \xa0 characters, then you need to process the string with unicodedata.normalize() using the "NFKC" or "NFKD" form, or with str.replace().
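To see what get_text() does with non-breaking spaces, here is a minimal sketch that parses a small inline document instead of index.html. The html string and the variable names are made up for this example, and the last step uses str.replace() to clean whatever is left:

```python
from bs4 import BeautifulSoup

# Hypothetical inline document, used here instead of index.html
html = "<p>&nbsp;Python&nbsp;is&nbsp;awesome&nbsp;</p>"
soup = BeautifulSoup(html, "html.parser")

# Without strip, every &nbsp; shows up as \xa0
raw_text = soup.get_text()
print(repr(raw_text))       # '\xa0Python\xa0is\xa0awesome\xa0'

# strip=True trims only the start and end of each piece of text
stripped_text = soup.get_text(strip=True)
print(repr(stripped_text))  # 'Python\xa0is\xa0awesome'

# Replace the characters that remain inside the text
clean_text = stripped_text.replace("\xa0", " ")
print(repr(clean_text))     # 'Python is awesome'
```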
3. Use replace() method
You can use the str.replace() method to remove \xa0 characters as follows:
test_str = "Python\xa0is\xa0awesome"
new_str = test_str.replace('\xa0', ' ')
print(new_str) # Python is awesome
This method works for both Python 2 and Python 3. Another benefit of using str.replace() is that you can use this method when you have a list of strings that have \xa0 characters.
The example below shows how to use list comprehension and str.replace() to remove \xa0 characters in the strings:
text_list = ["Python\xa0is\xa0awesome", "Learning\xa0HTML"]
new_list = [t.replace('\xa0', ' ') for t in text_list]
print(new_list)
# ['Python is awesome', 'Learning HTML']
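If your strings contain more than one kind of special space, chaining several replace() calls gets tedious. As an alternative, the sketch below uses str.translate() with a table built by str.maketrans() to replace several characters in one pass. The SPACE_TABLE name and the choice of characters are assumptions for this example, not a complete list:

```python
# Map each special space to a regular space
# The characters listed here are illustrative, not exhaustive
SPACE_TABLE = str.maketrans({
    "\xa0": " ",    # non-breaking space
    "\u2007": " ",  # figure space
    "\u202f": " ",  # narrow non-breaking space
})

text_list = ["Python\xa0is\u202fawesome", "Learning\u2007HTML"]
new_list = [t.translate(SPACE_TABLE) for t in text_list]
print(new_list)
# ['Python is awesome', 'Learning HTML']
```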
4. Use decode() method for Python 2
If you’re using Python 2, then you can also use the str.decode() method to decode the byte string as ASCII and ignore any bytes outside the ASCII range:
test_str = "Python\xa0is\xa0awesome"
new_str = test_str.decode('ascii', 'ignore')
print(new_str) # Pythonisawesome
This solution isn’t recommended because the special characters can only be ignored, resulting in a string with no spaces. It’s better to use replace() instead.
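In Python 3, str objects have no decode() method, so the code above raises an AttributeError. A rough equivalent, shown here only for comparison, is to encode the string to ASCII with the "ignore" error handler and then decode the bytes back:

```python
test_str = "Python\xa0is\xa0awesome"

# Encode to ASCII bytes, dropping anything outside ASCII, then decode back
new_str = test_str.encode("ascii", "ignore").decode("ascii")
print(new_str)  # Pythonisawesome
```

It has the same drawback as the Python 2 version: the characters are dropped rather than replaced, so the words run together.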
Conclusion
You’ve learned different ways you can remove the Unicode characters \xa0 from a text string in Python.
The characters can be normalized using the unicodedata.normalize() method with the "NFKC" or "NFKD" form, which replaces them with regular spaces. If you’re processing text from HTML or XML files, you can also use BeautifulSoup.get_text() with strip=True to trim them from the start and end of each piece of text.
Alternatively, you can use the str.replace() method, but this method requires you to specify the exact character to replace, so you need to repeat the process for every Unicode character you want to remove from your string.
I hope this tutorial helps. Happy coding! 😉