When you parse HTML files using Beautiful Soup in Python, you might see many \xa0 characters in your text strings.
\xa0 is the Unicode non-breaking space character (U+00A0), which is what the HTML entity &nbsp; represents.
There are several ways you can replace the characters with white spaces in your text string:
- Use the unicodedata.normalize() method
- Use the BeautifulSoup.get_text() method with the strip=True argument
- Use the replace() method
- Use the decode() method for Python 2
This tutorial shows how to use each solution to remove the characters in practice.
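To see where these characters come from, here is a minimal sketch that decodes the &nbsp; entity with the standard library's html.unescape() function. The sample string is made up for illustration, and Beautiful Soup is not involved here; the point is only to show what the character looks like once it's in a Python string.

```python
import html

# Made-up sample markup text containing the &nbsp; entity.
raw = "Python&nbsp;is&nbsp;awesome"

# html.unescape() converts HTML entities into the characters they stand for.
decoded = html.unescape(raw)

print(repr(decoded))                   # 'Python\xa0is\xa0awesome'
print(decoded == "Python is awesome")  # False: \xa0 is not a regular space
```

Printing the string directly looks normal on screen, but repr() and printed lists show the \xa0 escape, and comparisons against text with regular spaces fail.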
1. Use unicodedata.normalize() method
The unicodedata.normalize() function in the unicodedata module can be used to normalize the \xa0 characters, replacing them with regular spaces.
The function requires two arguments:
- The normalization form used in the process ('NFC', 'NFKC', 'NFD', or 'NFKD')
- The string you want to normalize
To replace \xa0 characters, you need one of the compatibility forms, 'NFKC' or 'NFKD'. The 'NFC' and 'NFD' forms leave \xa0 unchanged:
import unicodedata
test_str = "Python\xa0is\xa0awesome"
new_str = unicodedata.normalize("NFKC", test_str)
print(new_str) # Python is awesome
Notice that the \xa0 characters are gone from new_str in the example above.
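As a quick check of how the normalization form affects the result, the sketch below runs the same string through all four forms and records whether \xa0 is still present. The variable names are only illustrative.

```python
import unicodedata

test_str = "Python\xa0is\xa0awesome"

# For each normalization form, record whether \xa0 survives normalization.
still_has_nbsp = {
    form: "\xa0" in unicodedata.normalize(form, test_str)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}
print(still_has_nbsp)
# {'NFC': True, 'NFD': True, 'NFKC': False, 'NFKD': False}

# The compatibility forms also rewrite other characters, such as ligatures.
print(unicodedata.normalize("NFKC", "\ufb01le"))  # file
```

Because 'NFKC' and 'NFKD' change more than just \xa0, they are a good fit when you want plain text overall, and less so when the other characters must stay exactly as they are.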
2. Use BeautifulSoup.get_text() method
If you’re processing files using Beautiful Soup, then you can use the BeautifulSoup.get_text() method to extract the text from the document.
Pass the argument strip=True when calling the get_text() method as follows:
from bs4 import BeautifulSoup
with open("index.html") as fp:
    soup = BeautifulSoup(fp, 'html.parser')
text = soup.get_text(strip=True)
With strip=True, whitespace is stripped from the beginning and end of each piece of text, and that includes leading and trailing \xa0 characters.
The \xa0 characters in the middle of a piece of text are not affected. If you still see them, then you need to use unicodedata.normalize() or str.replace() to process the string.
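The sketch below shows this behavior on a small inline HTML string instead of a file. It assumes the beautifulsoup4 package is installed; the markup is made up for illustration.

```python
import unicodedata
from bs4 import BeautifulSoup  # assumes the beautifulsoup4 package is installed

# Made-up markup with &nbsp; at the edges and in the middle of the text.
html_doc = "<p>&nbsp;Python&nbsp;is&nbsp;awesome&nbsp;</p>"
soup = BeautifulSoup(html_doc, "html.parser")

# strip=True removes the leading and trailing \xa0, but not the inner ones.
stripped_text = soup.get_text(strip=True)
print(repr(stripped_text))  # 'Python\xa0is\xa0awesome'

# A second pass replaces the remaining \xa0 characters with regular spaces.
clean_text = unicodedata.normalize("NFKC", stripped_text)
print(repr(clean_text))  # 'Python is awesome'
```

Combining get_text(strip=True) with a second cleanup step is one way to handle both the edges and the interior of the text.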
3. Use replace() method
You can use the str.replace() method to remove \xa0 characters as follows:
test_str = "Python\xa0is\xa0awesome"
new_str = test_str.replace('\xa0', ' ')
print(new_str) # Python is awesome
This method works for both Python 2 and Python 3. Another benefit of using str.replace() is that you can use this method when you have a list of strings that have \xa0 characters.
The example below shows how to use a list comprehension and str.replace() to remove \xa0 characters in the strings:
text_list = ["Python\xa0is\xa0awesome", "Learning\xa0HTML"]
new_list = [t.replace('\xa0', ' ') for t in text_list]
print(new_list)
# ['Python is awesome', 'Learning HTML']
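Since str.replace() handles one character per call, a string with several kinds of special spaces needs several calls. As a possible alternative, here is a minimal sketch that uses str.translate() to replace a few of them in one pass. The helper name replace_special_spaces and the choice of characters in the table are assumptions made for illustration.

```python
# Translation table mapping some special space characters to a regular space.
# The set of characters listed here is an illustrative assumption.
SPACE_MAP = str.maketrans({
    "\xa0": " ",    # no-break space
    "\u2007": " ",  # figure space
    "\u202f": " ",  # narrow no-break space
})

def replace_special_spaces(text):
    """Return text with the characters in SPACE_MAP replaced by spaces."""
    return text.translate(SPACE_MAP)

print(replace_special_spaces("Python\xa0is\u202fawesome"))  # Python is awesome
```

The table is built once and can be reused for every string you process, including inside a list comprehension.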
4. Use decode() method for Python 2
If you’re using Python 2, then you can also use the str.decode() method to decode the string as ASCII and ignore any non-ASCII characters:
test_str = "Python\xa0is\xa0awesome"
new_str = test_str.decode('ascii', 'ignore')
print(new_str) # Pythonisawesome
This solution isn’t recommended because the \xa0 characters are dropped rather than replaced, resulting in a string with no spaces. It’s better to use replace() instead.
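In Python 3, str objects have no decode() method. A rough equivalent, shown here only as a sketch for comparison, is to encode to ASCII while ignoring errors and then decode the bytes back into a string. It has the same drawback of dropping the characters instead of replacing them.

```python
test_str = "Python\xa0is\xa0awesome"

# Encoding to ASCII with "ignore" drops every non-ASCII character,
# including \xa0; decoding turns the resulting bytes back into a str.
new_str = test_str.encode("ascii", "ignore").decode("ascii")
print(new_str)  # Pythonisawesome
```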
Conclusion
You’ve learned different ways to remove the Unicode character \xa0 from a text string in Python.
The characters can be normalized using the unicodedata.normalize() function with the 'NFKC' or 'NFKD' form, which replaces them with regular spaces. If you’re processing text from HTML or XML files, BeautifulSoup.get_text() with strip=True removes them from the beginning and end of each piece of text.
Alternatively, you can use the str.replace() method, but it requires you to specify the exact character to replace, so you need to repeat the process for every Unicode character you want to remove from your string.
I hope this tutorial helps. Happy coding! 😉