One error that you might encounter when working with Python is:
UnicodeDecodeError: invalid continuation byte
This error occurs when you try to decode a bytes object using an encoding that is different from the one the bytes were written in, so the decoder finds a byte sequence that is invalid in that encoding.
This tutorial shows an example that causes this error and how to fix it.
How to reproduce this error
Suppose you have a bytes object in your Python code as follows:
bytes_obj = b"\xe1 b c"
Next, you want to decode the bytes object using the utf-8 encoding like this:
str_obj = bytes_obj.decode('utf-8')
Traceback (most recent call last):
  File "main.py", line 3, in <module>
    str_obj = bytes_obj.decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 0: invalid continuation byte
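You get an error because the byte \xe1 in the bytes object is the á character encoded using latin-1, not utf-8. In utf-8, the byte 0xe1 starts a three-byte sequence, so the decoder expects a continuation byte next but finds a space instead.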
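To see the difference between the two encodings, the short sketch below compares how á is encoded in each one. It only uses built-in str and bytes methods, and the variable names are illustrative.

```python
# How the character á is represented in each encoding
latin1_bytes = "á".encode("latin-1")
utf8_bytes = "á".encode("utf-8")

print(latin1_bytes)  # b'\xe1'
print(utf8_bytes)    # b'\xc3\xa1'

# In UTF-8, 0xe1 is the lead byte of a three-byte sequence, so the decoder
# expects the next byte to be a continuation byte (0x80 to 0xbf).
# Here the next byte is a space (0x20), which is why decoding fails.
try:
    b"\xe1 b c".decode("utf-8")
except UnicodeDecodeError as error:
    print(error.reason)  # invalid continuation byte
```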
How to fix this error
To resolve this error, you need to change the encoding used in the decode() method to latin-1 as follows:
bytes_obj = b"\xe1 b c"
str_obj = bytes_obj.decode('latin-1')
print(str_obj)  # á b c
Note that this time the decode() method runs without any error.
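If you do not know in advance which encoding the bytes use, one possible approach is to try utf-8 first and fall back to latin-1. The helper below, decode_with_fallback, is a hypothetical name used for illustration and is not part of the standard library. Keep in mind that latin-1 maps every byte to a character, so the fallback never raises an error, but the result is only meaningful if the data really is latin-1.

```python
def decode_with_fallback(data: bytes) -> str:
    """Try UTF-8 first, then fall back to latin-1 (hypothetical helper)."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # latin-1 assigns a character to every byte value, so this cannot fail
        return data.decode("latin-1")


print(decode_with_fallback(b"\xe1 b c"))      # á b c
print(decode_with_fallback(b"\xc3\xa1 b c"))  # á b c
```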
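You can also get this error when running other methods that read files, such as a pandas reader like read_csv().
You need to specify the encoding the file uses by passing the encoding argument to the method, for example encoding='latin-1'.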
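As a sketch of the pandas case, the example below writes a small latin-1 encoded CSV file to a temporary directory and reads it back with read_csv(). The file name and its contents are made up for illustration, and the example assumes pandas is installed.

```python
import os
import tempfile

import pandas as pd

# Create a small CSV file encoded in latin-1 (hypothetical sample data)
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "example.csv")
with open(csv_path, "w", encoding="latin-1") as file:
    file.write("name,city\nJosé,Málaga\n")

# Tell pandas which encoding the file uses
df = pd.read_csv(csv_path, encoding="latin-1")
print(df)
```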
The same also works when you use the open() function to work with files:
csv_file = open('example.csv', encoding='latin-1')
# or:
with open('example.csv', encoding='latin-1') as file:
    content = file.read()
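To try this without an existing file, here is a self-contained sketch that first creates a latin-1 encoded file and then reads it with and without the matching encoding. The file name and its contents are placeholders.

```python
import os
import tempfile

# Create a latin-1 encoded file to read from (placeholder content)
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "example.csv")
with open(path, "w", encoding="latin-1") as file:
    file.write("á,b,c\n")

# Reading with the wrong encoding raises UnicodeDecodeError
try:
    with open(path, encoding="utf-8") as file:
        file.read()
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Reading with the matching encoding works
with open(path, encoding="latin-1") as file:
    content = file.read()

print(utf8_failed)  # True
print(content)
```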
If you only want to read the file without decoding its content, you can use the open() function in rb (read binary) mode.
Here’s an example when you parse an HTML file using Beautiful Soup:
from bs4 import BeautifulSoup

with open('index.html', 'rb') as file:
    soup = BeautifulSoup(file, 'html.parser')
print(soup.get_text())
When you decode a bytes object, you need to use the same encoding that was used to create it.
If you don’t want to decode the content when opening a file, you need to open it in binary mode: rb to read, or wb to write.
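As an illustration of binary mode, the sketch below copies a file byte for byte without decoding it. The file names are placeholders, and the files are created in a temporary directory.

```python
import os
import tempfile

tmp_dir = tempfile.mkdtemp()
source_path = os.path.join(tmp_dir, "source.txt")
copy_path = os.path.join(tmp_dir, "copy.txt")

# Write raw bytes; no encoding is involved in binary mode
with open(source_path, "wb") as file:
    file.write(b"\xe1 b c")

# 'rb' returns bytes, so nothing is decoded and no UnicodeDecodeError is raised
with open(source_path, "rb") as file:
    data = file.read()

# 'wb' writes the bytes back unchanged
with open(copy_path, "wb") as file:
    file.write(data)

print(data)  # b'\xe1 b c'
```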
I hope this tutorial helps. See you in other tutorials! 👍