Handling Unicode Strings in Python
Created On: 2016-08-25 Updated On: 2020-01-27
I am a seasoned python developer. I have seen many a UnicodeDecodeError myself, and I have seen many new pythonistas struggle with problems related to unicode strings. Understanding and handling text data in a computer is never easy, and sometimes the programming language makes it even harder. In this post, I will try to explain everything about text and unicode handling in python.
Text Representation in Python
In python, text can be represented using either unicode strings or bytes. Unicode is a standard for encoding characters. A unicode string is a python data structure that can store zero or more unicode characters; it is designed to store text data. Bytes, on the other hand, are just a sequence of bytes, which can store arbitrary binary data. When you work on strings in RAM, you can probably do it with unicode strings alone. Once you need to do IO, you need a binary representation of the string. Typical IO includes reading from and writing to the console, files, and network sockets.
Unicode string literals, byte literals and their types differ between python 2 and python 3, as shown in the following table.
|                        | python2.7          | python3.4+                          |
|------------------------|--------------------|-------------------------------------|
| unicode string literal | u"✓ means check"   | "✓ means check" or u"✓ means check" |
| unicode string type    | unicode            | str                                 |
| byte literal           | "abc" or b"abc"    | b"abc"                              |
| byte type              | str                | bytes                               |
You can get python3.4's string literal behavior in python2.7 using a `__future__` import:
```python
from __future__ import unicode_literals
```
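With that import in effect, plain string literals in python2.7 produce unicode objects. A quick REPL sketch:

```
>>> from __future__ import unicode_literals
>>> type("abc")     # without the import this would be <type 'str'>
<type 'unicode'>
>>> type(b"abc")    # byte literals are not affected
<type 'str'>
```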
When you use unicode string literals that include non-ascii characters in python source code, you need to specify a source file encoding at the beginning of the file:
```python
#!/usr/bin/env python
# coding=utf-8
```
This coding declaration should match the real encoding of the text file. On linux, it's usually utf-8.
I recommend you always put the coding line there. Just configure your IDE to insert it when you create a new python source file.
Converting Between Unicode Strings and Bytes
A unicode string can be encoded to bytes using a well-defined encoding like UTF-8, UTF-16, etc. Bytes can be decoded to a unicode string, but decoding may fail, because not all byte sequences are valid strings in a given encoding.
Converting between unicode and bytes is done via the `encode` and `decode` methods:
>>> u"✓ means check".encode("utf-8") b'\xe2\x9c\x93 means check' >>> u"✓ means check".encode("utf-8").decode("utf-8") '✓ means check' >>>
Bytes decoding can fail; you can choose how to handle failures using the errors parameter. The default behavior is to raise a UnicodeDecodeError exception. If you leave it that way, you should catch the exception and handle it.
```
>>> help(b''.decode)
Help on built-in function decode:

decode(...)
    S.decode([encoding[,errors]]) -> object

    Decodes S using the codec registered for encoding. encoding defaults
    to the default encoding. errors may be given to set a different error
    handling scheme. Default is 'strict' meaning that encoding errors raise
    a UnicodeDecodeError. Other possible values are 'ignore' and 'replace'
    as well as any other name registered with codecs.register_error that is
    able to handle UnicodeDecodeErrors.
```
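For example, here is how the default 'strict', the 'ignore' and the 'replace' handlers behave on an invalid utf-8 byte (python3 shown):

```
>>> b'abc\xff'.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 3: invalid start byte
>>> b'abc\xff'.decode('utf-8', errors='ignore')
'abc'
>>> b'abc\xff'.decode('utf-8', errors='replace')
'abc�'
```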
Displaying Unicode String in REPL
In python2, if you `print` a unicode string that is inside some container (list, tuple, dict, etc.) in the REPL, non-ascii characters may be displayed as "\uxxxx". This is not an encoding/decoding issue; it's how the REPL shows non-ascii characters. When you want to display the character glyphs, you have to unpack the data structure using `join` or loops.
Example code:
>>> a = [u"✓ means check", "abc"] >>> print a [u'\u2713 means check', 'abc'] >>> print u", ".join(a) ✓ means check, abc >>> for s in a: ... print s ... ✓ means check abc >>>
Only a bare unicode string is printed as glyphs; unicode strings nested at any level inside a container are printed as "\uxxxx".
IO Boundary Issue
When doing IO, we need to leave the comfortable unicode string zone and deal with raw bytes, so some encoding/decoding must be done at these system boundaries. This is what I call the IO boundary issue.
When we read from an IO device, we usually get bytes. If we are actually dealing with strings, we need to know the source encoding and decode accordingly.

In pure logic code, we always deal with unicode strings.

When we write to an IO device, we need to specify an encoding and convert unicode strings to bytes.
For beginners, I recommend you write all logic code to handle unicode strings and do explicit encode/decode at the IO boundaries. When dealing with strings, pure logic code should accept unicode strings as input and return unicode strings as output. Some libraries can do the encoding and decoding for you; read the library manual and pay attention when using them. They can save you some typing, but under the hood they still do the encoding and decoding at the boundaries.
For experienced programmers, sometimes you may prefer to skip some encoding/decoding for performance. When this is the case, document the types that you expect and return in the function docstring.
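For example, a small sketch of what such a docstring might look like (count_lines is a made-up function for illustration):

```python
def count_lines(data):
    """count lines in raw file content.

    Args:
        data: bytes, undecoded file content. Decoding is skipped on
            purpose: counting newline bytes does not require knowing
            the text's meaning.

    Return:
        int, the number of newline characters in data.

    """
    return data.count(b"\n")
```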
IO Boundary Issue: Concrete Case Studies
Handling File IO
When reading a text file in python2, you should decode each line to get a unicode string, and encode unicode strings before writing them to a file.
Example:
```python
def process_line(line):
    """this is an example function that works on a unicode string and
    returns a unicode string.

    """
    return line[::-1]


def reverse_all_lines(src, dest):
    """reverse all lines in src file, write result to dest file.

    Args:
        src: source text file name.
        dest: target text file name.

    Return:
        None. This function is for side-effects only.

    """
    with open(dest, "w") as fout:
        with open(src, "r") as fin:
            for line in fin:
                fout.write(process_line(line.decode("utf-8")).encode("utf-8"))
```
The same code in python3:
```python
def process_line(line):
    """this is an example function that works on a unicode string and
    returns a unicode string.

    """
    return line[::-1]


def reverse_all_lines(src, dest):
    """reverse all lines in src file, write result to dest file.

    Args:
        src: source text file name.
        dest: target text file name.

    Return:
        None. This function is for side-effects only.

    """
    with open(dest, "w", encoding="utf-8") as fout:
        with open(src, "r", encoding="utf-8") as fin:
            for line in fin:
                fout.write(process_line(line))
```
In python3, the `open` function supports an encoding keyword parameter, so decoding/encoding can happen under the hood automatically. You can just work with unicode strings.
On the other hand, if you open the file in binary mode ("rb"/"wb") instead of passing an encoding, you should do explicit encoding/decoding as in the python2 example.
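A minimal sketch of that explicit style in python3, assuming a utf-8 encoded file named example.txt (a hypothetical file name):

```python
# open in binary mode and decode explicitly at the IO boundary
with open("example.txt", "rb") as fin:
    for raw_line in fin:
        line = raw_line.decode("utf-8")    # bytes -> unicode string
        print(line.rstrip("\n"))
```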
Handling Database IO
Reading data from a database is similar to reading from a file: decode when reading, process, encode when writing. However, some python database libraries do this for you automatically. sqlite3, MySQLdb and psycopg2 all allow you to pass a unicode string directly to an INSERT or SELECT statement. When you specify the string encoding when creating the connection, returned strings are also decoded to unicode strings automatically.
Here is a psycopg2 example:
```python
#!/usr/bin/env python
# coding=utf-8

"""
postgres database read/write example

"""

import psycopg2


def get_conn():
    return psycopg2.connect(host="localhost", database="t1", user="t1",
                            password="fNfwREMqO69TB9YqE+/OzF5/k+s=")


def write():
    with get_conn() as conn:
        cur = conn.cursor()
        cur.execute(u"""\
        CREATE TABLE IF NOT EXISTS t1 (id integer, data text);
        """)
        cur.execute(u"""\
        DELETE FROM t1
        """)
        cur.execute(u"""\
        INSERT INTO t1 VALUES (%s, %s)
        """, (1, u"✓"))


def read():
    with get_conn() as conn:
        cur = conn.cursor()
        cur.execute(u"""\
        SELECT id, data FROM t1
        """)
        for row in cur:
            data = row[1].decode('utf-8')
            print(type(data), data)


def main():
    write()
    read()


if __name__ == '__main__':
    main()
```
Read more in Psycopg2 Unicode Handling.
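For comparison, here is a minimal sqlite3 sketch (sqlite3 ships with python; in python3 it returns unicode str for TEXT columns by default):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (id INTEGER, data TEXT)")
conn.execute("INSERT INTO t1 VALUES (?, ?)", (1, u"✓"))
for row in conn.execute("SELECT id, data FROM t1"):
    print(type(row[1]), row[1])    # python3: <class 'str'> ✓
conn.close()
```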
Handling HTTP Requests and Responses
When sending an HTTP request, data should be encoded according to the HTTP standards. The easiest way to encode data is to use the requests library.
When reading an HTTP response, data should be decoded according to the response content-type and content encoding. Sometimes the encoding of an HTML body can't be inferred and decoding may fail. If you are working with text in HTML, you should handle these cases; for example, you could choose to ignore the bad bytes or log the error, as in the sketch below.
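One possible pattern for that, a sketch that falls back to utf-8 with errors="ignore" when strict decoding fails (the fallback choice is an assumption; pick what fits your application):

```python
import requests

r = requests.get("https://example.com/")
try:
    text = r.content.decode(r.encoding or "utf-8")
except (UnicodeDecodeError, LookupError):
    # strict decoding failed, or the declared codec is unknown;
    # drop undecodable bytes and carry on (consider logging this).
    text = r.content.decode("utf-8", errors="ignore")
```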
Here are examples of using the requests library:
```python
#!/usr/bin/env python
# coding=utf-8

"""
sending HTTP requests using requests library

"""

import json

import requests


def test_get_response():
    r = requests.get("https://www.gnu.org/software/emacs/")
    assert type(r.content) is bytes    # r.content is response body in raw bytes
    assert type(r.text) is unicode    # r.text is decoded response body


def test_encode_data_for_get():
    # get request data is encoded using query parameters
    r = requests.get("https://api.github.com/repos/sylecn/ff-nextpage/issues",
                     {"state": "closed"},
                     headers={"Accept": "application/vnd.github.v3+json"})
    for issue in r.json():
        assert type(issue['title']) is unicode


def test_encode_data_for_post_form_urlencoded():
    """visit http://requestb.in/14d9thu1?inspect to see what the request
    looks like.

    """
    # post data is encoded using application/x-www-form-urlencoded
    r = requests.post("http://requestb.in/14d9thu1",
                      {"keyword": u"日光灯", "limit": 20})
    assert r.status_code == 200


def test_encode_data_for_post_raw():
    """visit http://requestb.in/14d9thu1?inspect to see what the request
    looks like.

    """
    data = json.dumps({"keyword": u"日光灯", "limit": 20})
    assert type(data) is bytes
    # raw body is also supported
    r = requests.post("http://requestb.in/14d9thu1", data)
    assert r.status_code == 200
```
Logging
Python's logging module is complex to configure, but I won't talk about its configuration here. When you want to log some text, you should just use unicode strings and let logging handle the encoding conversions.
If you only have bytes, decode them to a unicode string before passing them to the logger functions. Otherwise, the program may crash, because python will try to decode using the ascii codec by default.
Example code:
```python
#!/usr/bin/env python
# coding=utf-8

"""
logging text data

"""

import logging

logging.basicConfig(format='%(levelname)-8s %(message)s',
                    level=logging.DEBUG)
logger = logging.getLogger(__name__)


def reverse(line):
    logger.debug(u"reverse line: %s", line)
    return line[::-1]


def main():
    print(reverse(u"✓ correct"))


if __name__ == '__main__':
    main()
```
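If the text arrives as bytes, decode it first. A two-line sketch reusing the logger above:

```python
raw = b'\xe2\x9c\x93 correct'    # bytes from some IO boundary (utf-8 encoded ✓)
logger.debug(u"got: %s", raw.decode('utf-8'))    # decode, then log
```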
Handling Strings in JSON Encoding and Decoding
When encoding a python object to JSON, keep using unicode strings. When decoding a JSON string to a python object, you will get unicode strings.
```python
#!/usr/bin/env python
# coding=utf-8

"""
json encode/decode example

"""

from __future__ import unicode_literals

import json


def test_main():
    o = {"correct": "✓", "incorrect": "❌"}
    assert json.dumps(o)
    r = json.loads(json.dumps(o))
    assert "correct" in r
    assert type(r["correct"]) is unicode


if __name__ == '__main__':
    test_main()
```
When a python object is encoded to JSON, non-ascii characters are encoded as \uxxxx. This is just one valid syntax for JSON's string data type, and it can provide better cross-platform/language compatibility.
If you don't want to see \uxxxx in the resulting JSON string, you may use the ensure_ascii=False parameter of `json.dumps`; this will return a unicode JSON string.
```python
#!/usr/bin/env python
# coding=utf-8

"""
json encode/decode example

"""

from __future__ import unicode_literals

import json


def test_json_unicode():
    o = {"correct": "✓", "incorrect": "❌"}
    json_string = json.dumps(o, ensure_ascii=False)
    assert type(json_string) is unicode
    r = json.loads(json.dumps(o))
    assert "correct" in r
    assert type(r["correct"]) is unicode
```
Handling Strings When Using Redis
In Redis, string values can contain arbitrary binary data; for instance, you can store a jpeg image. When you store text as a string in redis and retrieve it, you will get bytes back. If you want to get a unicode string back, use decode_responses=True when creating the redis connection/instance.
Also, in Redis there is no integer, double or boolean type; these are all stored as string values. When you store a number under a redis key, what you get back is a string, either bytes or unicode, as seen in the examples:
```python
#!/usr/bin/env python
# coding=utf-8

"""redis example in python2

"""

import redis


def test_redis():
    conn = redis.StrictRedis(host='localhost', port=6379, db=0)
    conn.set(u'somestring', u'✓ correct')
    assert type(conn.get(u'somestring')) is str
    assert conn.get(u'somestring') == b'✓ correct'

    # non string types
    conn.set(u'someint', 123)
    assert type(conn.get(u'someint')) is str
    assert conn.get(u'someint') == b'123'

    conn.set(u'somedouble', 123.1)
    assert type(conn.get(u'somedouble')) is str
    assert conn.get(u'somedouble') == b'123.1'

    conn.set(u'somebool', True)    # don't do this.
    assert type(conn.get(u'somebool')) is str
    assert conn.get(u'somebool') == b'True'

    conn.hset(u"somehash", "key1", '✓ correct')
    conn.hset(u"somehash", "key2", '❌ wrong')
    d = conn.hgetall(u"somehash")
    assert "key1" in d
    assert u'key1' in d
    assert type(d['key1']) is bytes
    assert d['key1'] == u'✓ correct'.encode('utf-8')
    assert d['key1'] != u'✓ correct'


def test_redis_auto_decode():
    conn = redis.StrictRedis(host='localhost', port=6379, db=0,
                             decode_responses=True)
    conn.set(u'somestring', u'✓ correct')
    assert type(conn.get(u'somestring')) is unicode
    assert conn.get(u'somestring') == u'✓ correct'

    # non string types
    conn.set(u'someint', 123)
    assert type(conn.get(u'someint')) is unicode
    assert conn.get(u'someint') == u'123'

    conn.set(u'somedouble', 123.1)
    assert type(conn.get(u'somedouble')) is unicode
    assert conn.get(u'somedouble') == u'123.1'

    conn.hset(u"somehash", "key1", '✓ correct')
    conn.hset(u"somehash", "key2", '❌ wrong')
    d = conn.hgetall(u"somehash")
    assert "key1" in d
    assert u'key1' in d
    assert type(d['key1']) is unicode
    assert d['key1'] == u'✓ correct'
    assert d['key1'] != u'✓ correct'.encode('utf-8')
```
Things get a little nasty in python3. In python3, redis keys and values are strictly bytes. This is especially tricky when dealing with hashes.
```python
#!/usr/bin/env python3
# coding=utf-8

"""redis example in python3

"""

import redis


def test_redis():
    conn = redis.StrictRedis(host='localhost', port=6379, db=0)
    conn.set('somestring', '✓ correct')
    assert type(conn.get('somestring')) is bytes
    assert conn.get('somestring') == '✓ correct'.encode('utf-8')

    # non string types
    conn.set('someint', 123)
    assert type(conn.get('someint')) is bytes
    assert conn.get('someint') == b'123'

    conn.set('somedouble', 123.1)
    assert type(conn.get('somedouble')) is bytes
    assert conn.get('somedouble') == b'123.1'

    conn.set('somebool', True)    # don't do this.
    assert type(conn.get('somebool')) is bytes
    assert conn.get('somebool') == b'True'

    conn.hset(u"somehash", "key1", '✓ correct')
    conn.hset(u"somehash", "key2", '❌ wrong')
    d = conn.hgetall(u"somehash")
    assert "key1" not in d
    assert b'key1' in d
    assert type(d[b'key1']) is bytes
    assert d[b'key1'] == '✓ correct'.encode('utf-8')


def test_redis_auto_decode():
    conn = redis.StrictRedis(host='localhost', port=6379, db=0,
                             decode_responses=True)
    conn.set('somestring', '✓ correct')
    assert type(conn.get('somestring')) is str
    assert conn.get('somestring') == '✓ correct'

    # non string types
    conn.set('someint', 123)
    assert type(conn.get('someint')) is str
    assert conn.get('someint') == '123'

    conn.set('somedouble', 123.1)
    assert type(conn.get('somedouble')) is str
    assert conn.get('somedouble') == '123.1'

    conn.hset("somehash", "key1", '✓ correct')
    conn.hset("somehash", "key2", '❌ wrong')
    d = conn.hgetall("somehash")
    assert "key1" in d
    assert b'key1' not in d
    assert type(d['key1']) is str
    assert d['key1'] == u'✓ correct'
```
Handling Text in PyQt
In PyQt, you should use unicode strings or QString; PyQt will accept both (and more). When reading data from other sources, convert it to a unicode string or QString first.
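A minimal sketch, assuming PyQt4 (where QString still exists; PyQt5 dropped QString and uses plain unicode strings everywhere):

```python
# coding=utf-8
from PyQt4 import QtGui

app = QtGui.QApplication([])
label = QtGui.QLabel(u"✓ means check")    # a unicode string is accepted directly
label.show()
app.exec_()
```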
Running Python in Apache2 mod_wsgi
Apache2 uses the C locale by default, which can cause lots of problems in python programs that deal with non-ascii text. To change that, you need to update /etc/apache2/envvars to set a proper LANG.
```
## Uncomment the following line to use the system default locale instead:
. /etc/default/locale
export LANG
```
Then restart apache2.
Running Python in upstart or systemd
Programs started by upstart or systemd are direct children of PID 1. Many environment variables and resource limit settings you may expect are not in effect, which can cause mysterious problems at run time. I recommend you set at least the following options in upstart or systemd.
Upstart:
```
env LANG=en_US.UTF-8
env PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
limit nofile 65535 65535
```
Systemd:
```
[Service]
Environment="LANG=en_US.UTF-8"
Environment="PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
LimitNOFILE=65535
```
The LANG variable affects string encoding and decoding. The max number of open files often affects servers with lots of connections or file descriptors.
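You can check what python actually picked up from the environment:

```
>>> import locale, sys
>>> locale.getpreferredencoding()
'UTF-8'
>>> sys.stdout.encoding
'UTF-8'
```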
Running Python in Docker
When running a python program in a docker container, you should add these lines to the Dockerfile:
```
ENV LANG "C.UTF-8"
ENV LC_ALL "C.UTF-8"
```
If this is not set, on most base images the default system locale will be C. Unicode decoding could fail, and you may not be able to print unicode strings to the console.
Summary
Writing software that handles unicode well is great. Seeing a UnicodeDecodeError is awful, and seeing software or a library that other people wrote throw a UnicodeDecodeError can be frustrating. Get it correct from day one if you care about i18n and l10n.
This post is meant to help you understand unicode in python, both the basics and the practical use cases. If you know a very different use case or trap that is not covered above, please leave a comment so this article can be improved.
Also Read
There is another great post about unicode in python that I recommend: Pragmatic Unicode by Ned Batchelder, from 2012.