Handling Unicode Strings in Python

Created On: 2016-08-25 Updated On: 2016-11-15

I am a seasoned Python developer. I have run into many UnicodeDecodeErrors myself, and I have seen many new Pythonistas struggle with problems related to unicode strings. Understanding and handling text data in a computer is never easy, and sometimes the programming language makes it even harder. In this post, I will try to explain everything about text and unicode handling in Python.

Text Representation in Python

In Python, text can be represented using either a unicode string or bytes. Unicode is a standard for encoding characters. A unicode string is a Python data structure that can store zero or more unicode characters; it is designed for storing text data. Bytes, on the other hand, are just a sequence of bytes, which can store arbitrary binary data. When you work on strings in RAM, you can usually do it with unicode strings alone. Once you need to do IO, you need a binary representation of the string. Typical IO includes reading from and writing to the console, files, and network sockets.

Unicode string literals, byte literals and their types differ between Python 2 and Python 3, as shown in the following table.

                          python2.7            python3.4+
  unicode string literal  u"✓ means check"     "✓ means check" or u"✓ means check"
  unicode string type     unicode              str
  byte literal            "abc" or b"abc"      b"abc"
  byte type               str                  bytes
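For example, running under Python 3 you can verify the right-hand column of the table directly:

```python
# Python 3: a plain string literal is already a unicode str,
# and b"..." produces a bytes object.
s = "✓ means check"
b = b"abc"

assert isinstance(s, str)      # unicode string type in Python 3
assert isinstance(b, bytes)    # byte type in Python 3
assert s.encode("utf-8") == b"\xe2\x9c\x93 means check"
```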

You can get Python 3.4's string literal behavior in Python 2.7 using a future import:

from __future__ import unicode_literals

When you use unicode string literals that include non-ascii characters in Python source code, you need to specify a source file encoding at the beginning of the file:

#!/usr/bin/env python
# coding=utf-8

This coding declaration should match the real encoding of the text file. On Linux, it's usually utf-8.

It's recommended that you always put the coding declaration there. Just configure your IDE to insert the block when you create a new Python source file.

Converting Between Unicode Strings and Bytes

A unicode string can be encoded to bytes using a pre-defined encoding such as UTF-8 or UTF-16. Bytes can be decoded to a unicode string, but decoding may fail, because not all byte sequences are valid strings in a given encoding.

Converting between unicode and bytes is done via the encode and decode methods:

>>> u"✓ means check".encode("utf-8")
b'\xe2\x9c\x93 means check'
>>> u"✓ means check".encode("utf-8").decode("utf-8")
'✓ means check'
>>>

Bytes decoding can fail, and you can choose how to handle failure using the errors parameter. The default behavior is to raise a UnicodeDecodeError exception. If you leave it that way, you should catch the exception and handle it.

>>> help(b''.decode)
Help on built-in function decode:

decode(...)
    S.decode([encoding[,errors]]) -> object

    Decodes S using the codec registered for encoding. encoding defaults
    to the default encoding. errors may be given to set a different error
    handling scheme. Default is 'strict' meaning that encoding errors raise
    a UnicodeDecodeError. Other possible values are 'ignore' and 'replace'
    as well as any other name registered with codecs.register_error that is
    able to handle UnicodeDecodeErrors.
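A short sketch of the different errors modes; the byte string here is just an illustration, a valid UTF-8 check mark followed by one invalid byte:

```python
raw = b"\xe2\x9c\x93 ok \xff"   # valid UTF-8 for "✓ ok " plus one invalid byte

# 'strict' (the default) raises UnicodeDecodeError
try:
    raw.decode("utf-8")
except UnicodeDecodeError:
    pass

# 'replace' substitutes U+FFFD (the replacement character) for the bad byte
assert raw.decode("utf-8", errors="replace") == u"✓ ok \ufffd"

# 'ignore' silently drops the bad byte
assert raw.decode("utf-8", errors="ignore") == u"✓ ok "
```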

Displaying Unicode String in REPL

In Python 2, if you print a unicode string that is inside some container (list, tuple, dict etc.) in the REPL, non-ascii characters may be displayed as "\uxxxx". This is not an encoding/decoding issue; it's how the REPL shows non-ascii characters. When you want to display the character glyphs, you have to unpack the data structure using join or loops.

Example code:

>>> a = [u"✓ means check", "abc"]
>>> print a
[u'\u2713 means check', 'abc']
>>> print u", ".join(a)
✓ means check, abc
>>> for s in a:
...     print s
...
✓ means check
abc
>>>

Only a raw unicode string is printed as glyphs; unicode strings at any level of nesting are printed as "\uxxxx".

IO boundary issue

When doing IO, we need to leave the comfortable unicode string zone and deal with raw bytes, so some encoding/decoding must be done at these system boundaries. This is called the IO boundary issue.

When we read from IO device, we usually get bytes. If we are actually dealing with string, we need to know the source encoding, and decode accordingly.

In pure logic code, we always deal with unicode string.

When we write to IO device, we need to specify an encoding and convert unicode string to bytes.

[Figure: python-io-boundary-issue.svg — decode at the input boundary, process unicode strings in logic code, encode at the output boundary]

For beginners, it's recommended that you write all logic code against unicode strings and do explicit encode/decode at IO boundaries. When dealing with strings, pure logic code should accept unicode strings as input and return unicode strings as output. Some libraries can do the encode and decode for you; read the library manual and pay attention when using them. This can save you some typing, but under the hood the encoding and decoding still happens at the boundaries.
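This pattern can be sketched as follows; greet and handle_request are hypothetical names, and utf-8 is an assumed boundary encoding:

```python
def greet(name):
    """Pure logic: accepts and returns unicode strings only."""
    return u"hello, " + name


def handle_request(raw_name):
    """IO boundary wrapper: decode input bytes, run the logic, encode output."""
    name = raw_name.decode("utf-8")    # decode at the input boundary
    reply = greet(name)                # pure unicode string logic in between
    return reply.encode("utf-8")       # encode at the output boundary


result = handle_request(u"✓ user".encode("utf-8"))
assert result == u"hello, ✓ user".encode("utf-8")
```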

For experienced programmers, sometimes you may prefer to skip some encoding/decoding for performance. When this is the case, document the types that you expect and return in the function docstring.

IO boundary issue: Concrete Case Studies

Handling File IO

When reading a text file in Python 2, you should decode each line to get a unicode string, and encode each line before writing it to a file.

Example:

def process_line(line):
    """this is an example function that works on unicode string and return
    unicode string.

    """
    return line[::-1]


def reverse_all_lines(src, dest):
    """reverse all lines in src file, write result to dest file.

    Args:
        src: source text file name.
        dest: target text file name.

    Return:
        None. This function is for side-effects only.

    """
    with open(dest, "w") as fout:
        with open(src, "r") as fin:
            for line in fin:
                fout.write(process_line(line.decode("utf-8")).encode("utf-8"))

The same code in python3:

def process_line(line):
    """this is an example function that works on unicode string and return
    unicode string.

    """
    return line[::-1]


def reverse_all_lines(src, dest):
    """reverse all lines in src file, write result to dest file.

    Args:
        src: source text file name.
        dest: target text file name.

    Return:
        None. This function is for side-effects only.

    """
    with open(dest, "w", encoding="utf-8") as fout:
        with open(src, "r", encoding="utf-8") as fin:
            for line in fin:
                fout.write(process_line(line))

In Python 3, the open function supports an encoding keyword parameter, so decoding/encoding can happen under the hood automatically. You can just work with unicode strings.

On the other hand, if you open the file in binary mode ("rb"/"wb"), you should do explicit encoding/decoding as in Python 2. (Note that if you merely omit the encoding parameter in text mode, Python 3 falls back to the locale's preferred encoding, which may not be what you want.)
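For example, a binary-mode reader could look like this; a sketch assuming utf-8 files, with read_lines_utf8 being a hypothetical helper:

```python
import os
import tempfile


def read_lines_utf8(path):
    """Read a file opened in binary mode, decoding each line explicitly."""
    with open(path, "rb") as f:
        return [line.decode("utf-8") for line in f]


# demo: write UTF-8 bytes, then read them back as unicode strings
with tempfile.NamedTemporaryFile(mode="wb", suffix=".txt", delete=False) as tmp:
    tmp.write(u"✓ means check\n".encode("utf-8"))
    path = tmp.name
lines = read_lines_utf8(path)
os.unlink(path)
assert lines == [u"✓ means check\n"]
```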

Handling Database IO

Reading data from a database is similar to reading from a file: decode when reading, process, and encode when writing. However, some Python database libraries do this for you automatically. sqlite3, MySQLdb and psycopg2 all allow you to pass unicode strings directly to INSERT or SELECT statements. When you configure the connection's string encoding appropriately, returned strings are also decoded to unicode strings automatically.

Here is a psycopg2 example:

#!/usr/bin/env python
# coding=utf-8

"""
postgres database read/write example
"""

import psycopg2


def get_conn():
    return psycopg2.connect(host="localhost",
                            database="t1",
                            user="t1",
                            password="fNfwREMqO69TB9YqE+/OzF5/k+s=")


def write():
    with get_conn() as conn:
        cur = conn.cursor()
        cur.execute(u"""\
CREATE TABLE IF NOT EXISTS t1
(id integer,
 data text);
""")
        cur.execute(u"""\
DELETE FROM t1
""")
        cur.execute(u"""\
INSERT INTO t1 VALUES (%s, %s)
""", (1, u"✓"))


def read():
    with get_conn() as conn:
        cur = conn.cursor()
        cur.execute(u"""\
SELECT id, data FROM t1
""")
        for row in cur:
            # psycopg2 on Python 2 returns utf-8 encoded byte strings by
            # default, so we decode manually. (After registering the UNICODE
            # type caster via psycopg2.extensions.register_type, rows already
            # contain unicode strings and this decode is unnecessary.)
            data = row[1].decode('utf-8')
            print(type(data), data)


def main():
    write()
    read()


if __name__ == '__main__':
    main()

Read more in Psycopg2 Unicode Handling.

Handling HTTP request and response

When sending an HTTP request, data should be encoded according to HTTP standards. The easiest way to encode data is to use the requests library.

When reading an HTTP response, data should be decoded according to the response content-type and content encoding. Sometimes an HTML body's encoding can't be inferred and decoding may fail. If you are working with text in HTML, you should handle these cases; for example, you could choose to ignore the bad bytes or log the error.
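One way to make such decoding robust is a small fallback helper; decode_body is a hypothetical name, and the fallback encoding and error policy are assumptions you should tune for your application:

```python
def decode_body(content, declared_encoding=None):
    """Decode an HTTP response body to a unicode string.

    Try the declared encoding first; on failure, fall back to utf-8 with
    replacement characters so one bad page never crashes the pipeline.
    """
    encoding = declared_encoding or "utf-8"
    try:
        return content.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        return content.decode("utf-8", errors="replace")


assert decode_body(u"✓".encode("utf-8"), "utf-8") == u"✓"
# declared encoding is wrong for these bytes: fall back instead of crashing
assert decode_body(b"\xff\xfe", "utf-8") == u"\ufffd\ufffd"
```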

Here are some examples (Python 2) of using the requests library:

#!/usr/bin/env python
# coding=utf-8

"""
sending HTTP requests using requests library
"""

import json
import requests


def test_get_response():
    r = requests.get("https://www.gnu.org/software/emacs/")
    assert type(r.content) is bytes    # r.content is response body in raw bytes
    assert type(r.text) is unicode    # r.text is decoded response body


def test_encode_data_for_get():
    r = requests.get("https://api.github.com/repos/sylecn/ff-nextpage/issues",
                     {"state": "closed"},    # get request data is encoded using query parameter
                     headers={"Accept": "application/vnd.github.v3+json"})
    for issue in r.json():
        assert type(issue['title']) is unicode


def test_encode_data_for_post_form_urlencoded():
    """visit http://requestb.in/14d9thu1?inspect to see how the request looks like.

    """
    r = requests.post("http://requestb.in/14d9thu1",
                      {"keyword": u"日光灯",
                       "limit": 20})    # post data is encoded using application/x-www-form-urlencoded
    assert r.status_code == 200


def test_encode_data_for_post_raw():
    """visit http://requestb.in/14d9thu1?inspect to see how the request looks like.

    """
    data = json.dumps({"keyword": u"日光灯",
                       "limit": 20})
    assert type(data) is bytes
    r = requests.post("http://requestb.in/14d9thu1", data)    # raw body is also supported
    assert r.status_code == 200

Logging

Python's logging module is complex to configure, but I won't talk about its configuration here. When you want to log some text, just use unicode strings and let logging handle the encoding conversions.

If you only have bytes, decode them to a unicode string before passing them to the logger functions. Otherwise, the program may crash, because Python will try to decode using the ascii codec by default.

Example code:

#!/usr/bin/env python
# coding=utf-8

"""
logging text data
"""

import logging

logging.basicConfig(format='%(levelname)-8s %(message)s',
                    level=logging.DEBUG)
logger = logging.getLogger(__name__)


def reverse(line):
    logger.debug(u"reverse line: %s", line)
    return line[::-1]


def main():
    print(reverse(u"✓ correct"))


if __name__ == '__main__':
    main()
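To illustrate the bytes case: decode first, then hand the unicode string to the logger (the source of the bytes here is made up):

```python
import logging

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger("example")

raw = b"\xe2\x9c\x93 from a socket"            # bytes from some IO boundary
msg = raw.decode("utf-8", errors="replace")    # decode first...
logger.debug(u"received: %s", msg)             # ...then log unicode

assert msg == u"✓ from a socket"
```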

Handling String in JSON encoding and decoding

When encoding a Python object to JSON, keep using unicode strings. When decoding a JSON string to a Python object, you will get unicode strings.

#!/usr/bin/env python
# coding=utf-8

"""
json encode/decode example
"""

from __future__ import unicode_literals

import json


def test_main():
    o = {"correct": "✓",
         "incorrect": "❌"}
    assert json.dumps(o)
    r = json.loads(json.dumps(o))
    assert "correct" in r
    assert type(r["correct"]) is unicode


if __name__ == '__main__':
    test_main()

When a Python object is encoded to JSON, non-ascii characters are encoded as \uxxxx by default. This is just one valid syntax for JSON's string data type, and it can provide better cross-platform/language compatibility.
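A quick check of this behavior:

```python
import json

# non-ascii characters become \uxxxx escapes by default
assert json.dumps(u"✓") == '"\\u2713"'
# the escaped form decodes back to the same unicode string
assert json.loads('"\\u2713"') == u"✓"
```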

If you don't want to see \uxxxx in the resulting JSON string, you can use the ensure_ascii=False parameter of json.dumps; this returns a unicode JSON string.

#!/usr/bin/env python
# coding=utf-8

"""
json encode/decode example
"""

from __future__ import unicode_literals

import json


def test_json_unicode():
    o = {"correct": "✓",
         "incorrect": "❌"}
    json_string = json.dumps(o, ensure_ascii=False)
    assert type(json_string) is unicode
    r = json.loads(json_string)
    assert "correct" in r
    assert type(r["correct"]) is unicode

Handling Strings When Using Redis

In Redis, string values can contain arbitrary binary data; for instance, you can store a jpeg image. When you store text as a string in Redis and retrieve it, you will get a bytes object back. If you want to get a unicode string back, use decode_responses=True when creating the Redis connection/instance.

Also, Redis has no integer, double or boolean type; these are all stored as string values. When you store a number in a Redis key, what you get back is a string, either bytes or unicode, as seen in the example:

#!/usr/bin/env python
# coding=utf-8

"""redis example in python2

"""

import redis


def test_redis():
    conn = redis.StrictRedis(host='localhost', port=6379, db=0)
    conn.set(u'somestring', u'✓ correct')
    assert type(conn.get(u'somestring')) is str
    assert conn.get(u'somestring') == b'✓ correct'

    # non string types

    conn.set(u'someint', 123)
    assert type(conn.get(u'someint')) is str
    assert conn.get(u'someint') == b'123'

    conn.set(u'somedouble', 123.1)
    assert type(conn.get(u'somedouble')) is str
    assert conn.get(u'somedouble') == b'123.1'

    conn.set(u'somebool', True)    # don't do this.
    assert type(conn.get(u'somebool')) is str
    assert conn.get(u'somebool') == b'True'

    conn.hset(u"somehash", "key1", '✓ correct')
    conn.hset(u"somehash", "key2", '❌ wrong')
    d = conn.hgetall(u"somehash")
    assert "key1" in d
    assert u'key1' in d
    assert type(d['key1']) is bytes
    assert d['key1'] == u'✓ correct'.encode('utf-8')
    assert d['key1'] != u'✓ correct'


def test_redis_auto_decode():
    conn = redis.StrictRedis(host='localhost', port=6379, db=0,
                             decode_responses=True)
    conn.set(u'somestring', u'✓ correct')
    assert type(conn.get(u'somestring')) is unicode
    assert conn.get(u'somestring') == u'✓ correct'

    # non string types

    conn.set(u'someint', 123)
    assert type(conn.get(u'someint')) is unicode
    assert conn.get(u'someint') == u'123'

    conn.set(u'somedouble', 123.1)
    assert type(conn.get(u'somedouble')) is unicode
    assert conn.get(u'somedouble') == u'123.1'

    conn.hset(u"somehash", "key1", '✓ correct')
    conn.hset(u"somehash", "key2", '❌ wrong')
    d = conn.hgetall(u"somehash")
    assert "key1" in d
    assert u'key1' in d
    assert type(d['key1']) is unicode
    assert d['key1'] == u'✓ correct'
    assert d['key1'] != u'✓ correct'.encode('utf-8')

Things get a little nasty in Python 3. In Python 3, Redis keys and values are strictly bytes by default. This is especially tricky when dealing with hashes.

#!/usr/bin/env python3
# coding=utf-8

"""redis example in python3

"""

import redis


def test_redis():
    conn = redis.StrictRedis(host='localhost', port=6379, db=0)
    conn.set('somestring', '✓ correct')
    assert type(conn.get('somestring')) is bytes
    assert conn.get('somestring') == '✓ correct'.encode('utf-8')

    # non string types

    conn.set('someint', 123)
    assert type(conn.get('someint')) is bytes
    assert conn.get('someint') == b'123'

    conn.set('somedouble', 123.1)
    assert type(conn.get('somedouble')) is bytes
    assert conn.get('somedouble') == b'123.1'

    conn.set('somebool', True)    # don't do this.
    assert type(conn.get('somebool')) is bytes
    assert conn.get('somebool') == b'True'

    conn.hset(u"somehash", "key1", '✓ correct')
    conn.hset(u"somehash", "key2", '❌ wrong')
    d = conn.hgetall(u"somehash")
    assert "key1" not in d
    assert b'key1' in d
    assert type(d[b'key1']) is bytes
    assert d[b'key1'] == '✓ correct'.encode('utf-8')


def test_redis_auto_decode():
    conn = redis.StrictRedis(host='localhost', port=6379, db=0,
                             decode_responses=True)
    conn.set('somestring', '✓ correct')
    assert type(conn.get('somestring')) is str
    assert conn.get('somestring') == '✓ correct'

    # non string types

    conn.set('someint', 123)
    assert type(conn.get('someint')) is str
    assert conn.get('someint') == '123'

    conn.set('somedouble', 123.1)
    assert type(conn.get('somedouble')) is str
    assert conn.get('somedouble') == '123.1'

    conn.hset("somehash", "key1", '✓ correct')
    conn.hset("somehash", "key2", '❌ wrong')
    d = conn.hgetall("somehash")
    assert "key1" in d
    assert b'key1' not in d
    assert type(d['key1']) is str
    assert d['key1'] == u'✓ correct'

Handling Text in PyQt

In PyQt, you should use unicode strings or QString; PyQt will accept both (and more). When reading data from other sources, convert it to a unicode string or QString first.

Running Python in Apache2 mod_wsgi

Apache2 uses the C locale by default, which can cause lots of problems in Python programs that deal with non-ascii text. To change that, you need to update /etc/apache2/envvars to set a proper LANG.

## Uncomment the following line to use the system default locale instead:
. /etc/default/locale
export LANG

Then restart apache2.

Running Python in upstart, systemd

Programs started by upstart or systemd are direct children of PID 1. Many environment variables and resource limit settings that you expect are often not in effect, which can cause mysterious problems at run time. I recommend you set at least the following options in your upstart or systemd configuration.

Upstart:

env LANG=en_US.UTF-8
env PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

limit nofile 65535 65535

Systemd:

[Service]
Environment="LANG=en_US.UTF-8"
Environment="PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"

LimitNOFILE=65535

The LANG variable affects string encoding and decoding. The maximum number of open files often matters for servers with lots of connections or file descriptors.

Summary

Writing software that handles unicode well is great. Seeing a UnicodeDecodeError is awful, and seeing software or a library that other people wrote throw a UnicodeDecodeError can be frustrating. Get it correct from day one if you care about i18n and l10n.

This post is meant to help you understand unicode in Python, covering both the basics and practical use cases. If you know of a very different use case or trap that is not covered above, please leave a comment so this article can be improved.

Also Read

There is another great post about unicode in Python that I recommend: Pragmatic Unicode by Ned Batchelder (2012).
