Python - Encoding and Unicode


1 - Default

import sys
print sys.getdefaultencoding()
Cp1252

In PyDev, you can change it in the Run Configuration. The same statement then prints:

UTF-8

2 - How to

2.1 - get the console encoding

stdout:

import sys
print sys.stdout.encoding
Cp1252

2.2 - get the file system encoding

print sys.getfilesystemencoding()
mbcs
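
The three calls above can be gathered in one place to compare them quickly. The following is a minimal sketch, not part of the original page: the helper name report_encodings is hypothetical, and the code is written so that it runs on both Python 2.7 and Python 3.

```python
from __future__ import print_function  # keeps print() working on Python 2 and 3
import sys


def report_encodings():
    """Collect the three encodings discussed above in one dictionary."""
    return {
        "default": sys.getdefaultencoding(),
        # On Python 2 this can be None when the output is redirected.
        "stdout": getattr(sys.stdout, "encoding", None),
        "filesystem": sys.getfilesystemencoding(),
    }


for name, value in sorted(report_encodings().items()):
    print("%-10s %s" % (name, value))
```

The values depend on the platform and on how the interpreter was started, so they may differ from the Cp1252 and mbcs shown above.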

2.3 - get rid of the BOM

s = u"This is a unicode string".encode('utf-8-sig')  # the encoded bytes start with the BOM (EF BB BF)
print s # You will see the BOM
print s.decode('utf-8-sig')  # decoding with utf-8-sig strips the BOM
This is a unicode string
This is a unicode string
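
When the data must stay a byte string, the BOM can also be removed without decoding. The sketch below assumes UTF-8 input; the helper name strip_utf8_bom is hypothetical, while codecs.BOM_UTF8 is the standard library constant for the three BOM bytes.

```python
from __future__ import print_function
import codecs


def strip_utf8_bom(data):
    """Return the byte string without a leading UTF-8 BOM, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


with_bom = u"This is a unicode string".encode("utf-8-sig")
print(repr(with_bom[:3]))              # the three BOM bytes
print(repr(strip_utf8_bom(with_bom)))  # same bytes without the BOM
```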

3 - Environment variable

set PYTHONIOENCODING=UTF-8
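
The effect of the variable can be checked by starting a child interpreter with it set. This is only a sketch: the helper name child_stdout_encoding is hypothetical, and the exact spelling of the reported name (UTF-8 or utf-8) varies between Python versions.

```python
from __future__ import print_function
import os
import subprocess
import sys


def child_stdout_encoding(io_encoding):
    """Run a child interpreter with PYTHONIOENCODING set and return its sys.stdout.encoding."""
    env = dict(os.environ)
    env["PYTHONIOENCODING"] = io_encoding
    output = subprocess.check_output(
        [sys.executable, "-c",
         "import sys; sys.stdout.write(str(sys.stdout.encoding))"],
        env=env,
    )
    return output.decode("ascii").strip()


print(child_stdout_encoding("UTF-8"))
```

The variable is read when the interpreter starts, which is why the check runs in a new process instead of changing os.environ in the current one.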

4 - Support

4.1 - 'charmap' codec can't encode character u'\ufeff'

UnicodeEncodeError: 'charmap' codec can't encode character u'\ufeff' in position 0: character maps to <undefined>

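The character \ufeff is the byte order mark (BOM). The error is raised when a string that still starts with a BOM is encoded with a charmap codec such as Cp1252, which has no mapping for it. See section 2.3 to strip the BOM.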
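
A common way to end up with a leading \ufeff is to read a UTF-8 file that was saved with a BOM. The sketch below illustrates this with a temporary file whose content is made up for the example: reading it as 'utf-8' keeps the BOM as the first character, while 'utf-8-sig' drops it.

```python
from __future__ import print_function
import codecs
import io
import os
import tempfile

# Create a small UTF-8 file that starts with a BOM (sample content).
handle, path = tempfile.mkstemp(suffix=".txt")
os.close(handle)
with io.open(path, "wb") as f:
    f.write(codecs.BOM_UTF8 + u"hello".encode("utf-8"))

# Plain 'utf-8' keeps the BOM as the first character of the text.
with io.open(path, "r", encoding="utf-8") as f:
    raw_text = f.read()

# 'utf-8-sig' removes the BOM while decoding.
with io.open(path, "r", encoding="utf-8-sig") as f:
    clean_text = f.read()

os.remove(path)

print(repr(raw_text))
print(repr(clean_text))
```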

4.2 - UnicodeEncodeError: 'charmap' codec can't encode character

A UnicodeEncodeError is raised when a unicode string is encoded into an encoding that cannot represent one of its characters.

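When you print a unicode string, Python first encodes it with the encoding of the output stream (sys.stdout.encoding). The statement:

print u"\u20AC"

is therefore equivalent, on a Windows platform where that encoding is Cp1252, to:

print u"\u20AC".encode('Cp1252')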

U+20AC is the Euro sign, which is part of code page (cp) 1252.

An encoding maps only a limited set of unicode characters to str bytes. When a character has no mapping in the character set, the codec's encode() fails, because the character set does not support every character.

For instance, the White heart suit (U+2661) is not present in the Cp1252 character set.

If you then try to print it, you will get a UnicodeEncodeError.

print u"\u2661"
Traceback (most recent call last):
  File "D:\workspace\PythonWorkpsace\mypackage\Test.py", line 1, in <module>
    print u"\u2661"
  File "C:\Python27\lib\encodings\cp1252.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2661' in position 0: character maps to <undefined>
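
Whether a character is supported can be tested before printing by attempting the encoding and catching the error. This is a minimal sketch; the helper name can_encode is hypothetical.

```python
from __future__ import print_function


def can_encode(text, encoding):
    """Return True if every character of text has a mapping in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


print(can_encode(u"\u20AC", "cp1252"))  # Euro sign: True
print(can_encode(u"\u2661", "cp1252"))  # White heart suit: False
```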

To resolve this problem, you can:

  • encode it with a character set that supports it.
print u"\u2661".encode('utf-8')
  • use the 'replace' error handler of the encode function. It replaces every character that cannot be encoded with a ?
print u"\u2661".encode(sys.getdefaultencoding(), 'replace')
?
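
Besides 'replace', the encode function accepts other standard error handlers. The sketch below compares four of them on the same character, using cp1252 explicitly so that the result does not depend on the local default encoding.

```python
from __future__ import print_function

heart = u"\u2661"  # White heart suit, absent from Cp1252

for handler in ("replace", "ignore", "xmlcharrefreplace", "backslashreplace"):
    encoded = heart.encode("cp1252", handler)
    print("%-18s %r" % (handler, encoded))
```

'replace' yields a question mark, 'ignore' drops the character, 'xmlcharrefreplace' writes the numeric reference &#9825; and 'backslashreplace' writes the escape \u2661, so the last two keep enough information to recover the original character.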

5 - Documentation / Reference