Python String Encoding¶
The Python documentation includes a Unicode HOWTO that covers the details of Unicode character processing.
- Python 3: https://docs.python.org/3/howto/unicode.html
- Python 2: https://docs.python.org/2/howto/unicode.html
The following notes are intended to help answer some common questions and issues that developers frequently encounter while learning to properly work with different character encodings in Python.
Does ChatterBot handle non-ascii characters?¶
ChatterBot is able to handle unicode values correctly. You can pass it non-encoded data (unicode strings) and it should process them properly; you will need to make sure that you decode the output that is returned.
Below is one of ChatterBot’s tests from tests/test_chatbot.py. It is a simple check that a unicode response can be processed.
def test_get_response_unicode(self):
    """
    Test the case that a unicode string is passed in.
    """
    response = self.chatbot.get_response(u'سلام')
    self.assertGreater(len(response.text), 0)
This test passes on Python 3. It also verifies that ChatterBot can take unicode input without issue.
How do I fix Python encoding errors?¶
When working with string type data in Python, it is possible to encounter errors such as the following.
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 48: invalid start byte
Depending on what your code looks like, there are a few things that you can do to prevent errors like this.
# -*- coding: utf-8 -*-
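Another common cause of the error above is decoding bytes with the wrong codec. The sketch below is a minimal illustration, not ChatterBot code: the sample bytes and the choice of Windows-1252 as the real encoding are assumptions made for the example. Byte 0x92 is a curly apostrophe in Windows-1252 but cannot start a character in UTF-8.

```python
# Hypothetical sample data: text that was saved as Windows-1252, not UTF-8.
raw = b"It\x92s working"

try:
    raw.decode("utf-8")
except UnicodeDecodeError as error:
    # 0x92 cannot start a UTF-8 sequence, so decoding fails here.
    print("utf-8 failed:", error.reason)

# Decoding with the codec the data was actually written in succeeds.
text = raw.decode("cp1252")

# If the real encoding is unknown, errors="replace" avoids the exception
# at the cost of substituting U+FFFD for each undecodable byte.
lossy = raw.decode("utf-8", errors="replace")
```

Decoding with an explicit, correct codec is usually preferable to errors="replace", since the replacement silently loses the original character.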
When to use the unicode header¶
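If your strings use escaped unicode characters (they look like
u'\u0420\u043e\u0441\u0441\u0438\u044f') then
you do not need to add the header. If you use strings like
'ØÆÅ' then you are required
to use the header in Python 2. Python 3 assumes that source files are UTF-8 by default, so the header is only needed there for files saved in another encoding.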
If you are using this header it must be the first or second line in your Python file.
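To make the distinction concrete, the following sketch compares the two spellings of the same string. It assumes the file is saved as UTF-8, and the variable names are illustrative only.

```python
# -*- coding: utf-8 -*-
# Escaped form: the source file contains only ASCII characters.
escaped = u'\u00d8\u00c6\u00c5'

# Literal form: the source file contains non-ASCII characters, so Python 2
# needs the coding header above to know how to decode the file.
literal = u'ØÆÅ'

# Both spellings produce the same string.
assert escaped == literal
```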
Unicode escape characters¶
>>> print(u'\u0420\u043e\u0441\u0441\u0438\u044f')
Россия
When to use escape characters¶
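Prefix your strings with the unicode string prefix
u'...' when you are
using escaped unicode characters. In Python 2 the prefix is required for \u escapes to be interpreted; in Python 3 every string is already unicode, so the prefix is optional.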
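As a rough illustration of how the prefix and the escapes relate, the sketch below builds the string from the example above in two ways and converts it back to its escaped spelling. It targets Python 3 and uses only built-in functions; nothing here is specific to ChatterBot.

```python
# In Python 3 every str is unicode, so the u prefix is accepted but optional.
with_prefix = u'\u0420\u043e\u0441\u0441\u0438\u044f'
without_prefix = '\u0420\u043e\u0441\u0441\u0438\u044f'
assert with_prefix == without_prefix

# ascii() returns the escaped spelling of a string, which can be pasted into
# a source file that must stay pure ASCII.
escaped_spelling = ascii(with_prefix)
print(escaped_spelling)  # prints '\u0420\u043e\u0441\u0441\u0438\u044f'

# A raw string keeps the backslash, so no escape processing happens.
raw = r'\u0420'
assert len(raw) == 6
```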