Python Standard Encodings

2022-4-2 12:37 | Posted by: Hocassian | Views: 383 | Comments: 0

Summary: Python comes with a number of codecs built-in, either implemented as C functions or with dictionaries as mapping tables.

Python comes with a number of codecs built-in, either implemented as C functions or with dictionaries as mapping tables. The following table lists the codecs by name, together with a few common aliases, and the languages for which the encoding is likely used. Neither the list of aliases nor the list of languages is meant to be exhaustive. Notice that spelling alternatives that only differ in case or use a hyphen instead of an underscore are also valid aliases; therefore, e.g. 'utf-8' is a valid alias for the 'utf_8' codec.
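
The alias rule described above can be checked against the codec registry. The following is a minimal sketch, not part of the original text: the helper name `canonical_name` and the sample strings are illustrative assumptions, and the only library call used is `codecs.lookup()`.

```python
import codecs


def canonical_name(alias):
    """Return the name the codec registry reports for an alias."""
    return codecs.lookup(alias).name


# Spellings that differ only in case, or in hyphen versus underscore,
# plus two aliases from the table, should all resolve to one codec.
spellings = ["utf_8", "utf-8", "UTF-8", "utf8", "U8"]
resolved = {canonical_name(s) for s in spellings}
print(resolved)

# The encoded bytes are the same whichever spelling is used.
text = u"caf\u00e9"
encoded = {text.encode(s) for s in spellings}
print(encoded)
```

Both sets should collapse to a single element, because every spelling ends up at the same codec.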

Many of the character sets support the same languages. They vary in individual characters (e.g. whether the EURO SIGN is supported or not), and in the assignment of characters to code positions. For the European languages in particular, the following variants typically exist:

  • an ISO 8859 codeset
  • a Microsoft Windows code page, which is typically derived from an 8859 codeset, but replaces control characters with additional graphic characters
  • an IBM EBCDIC code page
  • an IBM PC code page, which is ASCII compatible
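
To make the differences between these variants concrete, the sketch below encodes the EURO SIGN and the letter A with a few codecs from each family. It assumes Python 3 string semantics; the helper `try_encode` and the choice of codecs are illustrative, not taken from the original text.

```python
EURO = u"\u20ac"  # EURO SIGN


def try_encode(text, codec):
    """Return the encoded bytes, or None if the codec cannot represent the text."""
    try:
        return text.encode(codec)
    except UnicodeEncodeError:
        return None


# Support for the EURO SIGN, and its code position, varies between codesets.
euro_bytes = {
    codec: try_encode(EURO, codec)
    for codec in ("latin_1", "iso8859_15", "cp1252", "cp850", "cp858")
}
print(euro_bytes)

# ASCII-compatible code pages keep 'A' at 0x41; the EBCDIC page cp037 does not.
letter_a = {
    codec: u"A".encode(codec)
    for codec in ("latin_1", "cp1252", "cp437", "cp037")
}
print(letter_a)
```

A result of `None` means the codeset has no EURO SIGN at all, which is the case for `latin_1` and `cp850` here.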

| Codec | Aliases | Languages |
|---|---|---|
| ascii | 646, us-ascii | English |
| big5 | big5-tw, csbig5 | Traditional Chinese |
| big5hkscs | big5-hkscs, hkscs | Traditional Chinese |
| cp037 | IBM037, IBM039 | English |
| cp424 | EBCDIC-CP-HE, IBM424 | Hebrew |
| cp437 | 437, IBM437 | English |
| cp500 | EBCDIC-CP-BE, EBCDIC-CP-CH, IBM500 | Western Europe |
| cp720 | | Arabic |
| cp737 | | Greek |
| cp775 | IBM775 | Baltic languages |
| cp850 | 850, IBM850 | Western Europe |
| cp852 | 852, IBM852 | Central and Eastern Europe |
| cp855 | 855, IBM855 | Bulgarian, Byelorussian, Macedonian, Russian, Serbian |
| cp856 | | Hebrew |
| cp857 | 857, IBM857 | Turkish |
| cp858 | 858, IBM858 | Western Europe |
| cp860 | 860, IBM860 | Portuguese |
| cp861 | 861, CP-IS, IBM861 | Icelandic |
| cp862 | 862, IBM862 | Hebrew |
| cp863 | 863, IBM863 | Canadian |
| cp864 | IBM864 | Arabic |
| cp865 | 865, IBM865 | Danish, Norwegian |
| cp866 | 866, IBM866 | Russian |
| cp869 | 869, CP-GR, IBM869 | Greek |
| cp874 | | Thai |
| cp875 | | Greek |
| cp932 | 932, ms932, mskanji, ms-kanji | Japanese |
| cp949 | 949, ms949, uhc | Korean |
| cp950 | 950, ms950 | Traditional Chinese |
| cp1006 | | Urdu |
| cp1026 | ibm1026 | Turkish |
| cp1140 | ibm1140 | Western Europe |
| cp1250 | windows-1250 | Central and Eastern Europe |
| cp1251 | windows-1251 | Bulgarian, Byelorussian, Macedonian, Russian, Serbian |
| cp1252 | windows-1252 | Western Europe |
| cp1253 | windows-1253 | Greek |
| cp1254 | windows-1254 | Turkish |
| cp1255 | windows-1255 | Hebrew |
| cp1256 | windows-1256 | Arabic |
| cp1257 | windows-1257 | Baltic languages |
| cp1258 | windows-1258 | Vietnamese |
| euc_jp | eucjp, ujis, u-jis | Japanese |
| euc_jis_2004 | jisx0213, eucjis2004 | Japanese |
| euc_jisx0213 | eucjisx0213 | Japanese |
| euc_kr | euckr, korean, ksc5601, ks_c-5601, ks_c-5601-1987, ksx1001, ks_x-1001 | Korean |
| gb2312 | chinese, csiso58gb231280, euc-cn, euccn, eucgb2312-cn, gb2312-1980, gb2312-80, iso-ir-58 | Simplified Chinese |
| gbk | 936, cp936, ms936 | Unified Chinese |
| gb18030 | gb18030-2000 | Unified Chinese |
| hz | hzgb, hz-gb, hz-gb-2312 | Simplified Chinese |
| iso2022_jp | csiso2022jp, iso2022jp, iso-2022-jp | Japanese |
| iso2022_jp_1 | iso2022jp-1, iso-2022-jp-1 | Japanese |
| iso2022_jp_2 | iso2022jp-2, iso-2022-jp-2 | Japanese, Korean, Simplified Chinese, Western Europe, Greek |
| iso2022_jp_2004 | iso2022jp-2004, iso-2022-jp-2004 | Japanese |
| iso2022_jp_3 | iso2022jp-3, iso-2022-jp-3 | Japanese |
| iso2022_jp_ext | iso2022jp-ext, iso-2022-jp-ext | Japanese |
| iso2022_kr | csiso2022kr, iso2022kr, iso-2022-kr | Korean |
| latin_1 | iso-8859-1, iso8859-1, 8859, cp819, latin, latin1, L1 | West Europe |
| iso8859_2 | iso-8859-2, latin2, L2 | Central and Eastern Europe |
| iso8859_3 | iso-8859-3, latin3, L3 | Esperanto, Maltese |
| iso8859_4 | iso-8859-4, latin4, L4 | Baltic languages |
| iso8859_5 | iso-8859-5, cyrillic | Bulgarian, Byelorussian, Macedonian, Russian, Serbian |
| iso8859_6 | iso-8859-6, arabic | Arabic |
| iso8859_7 | iso-8859-7, greek, greek8 | Greek |
| iso8859_8 | iso-8859-8, hebrew | Hebrew |
| iso8859_9 | iso-8859-9, latin5, L5 | Turkish |
| iso8859_10 | iso-8859-10, latin6, L6 | Nordic languages |
| iso8859_11 | iso-8859-11, thai | Thai languages |
| iso8859_13 | iso-8859-13, latin7, L7 | Baltic languages |
| iso8859_14 | iso-8859-14, latin8, L8 | Celtic languages |
| iso8859_15 | iso-8859-15, latin9, L9 | Western Europe |
| iso8859_16 | iso-8859-16, latin10, L10 | South-Eastern Europe |
| johab | cp1361, ms1361 | Korean |
| koi8_r | | Russian |
| koi8_u | | Ukrainian |
| mac_cyrillic | maccyrillic | Bulgarian, Byelorussian, Macedonian, Russian, Serbian |
| mac_greek | macgreek | Greek |
| mac_iceland | maciceland | Icelandic |
| mac_latin2 | maclatin2, maccentraleurope | Central and Eastern Europe |
| mac_roman | macroman | Western Europe |
| mac_turkish | macturkish | Turkish |
| ptcp154 | csptcp154, pt154, cp154, cyrillic-asian | Kazakh |
| shift_jis | csshiftjis, shiftjis, sjis, s_jis | Japanese |
| shift_jis_2004 | shiftjis2004, sjis_2004, sjis2004 | Japanese |
| shift_jisx0213 | shiftjisx0213, sjisx0213, s_jisx0213 | Japanese |
| utf_32 | U32, utf32 | all languages |
| utf_32_be | UTF-32BE | all languages |
| utf_32_le | UTF-32LE | all languages |
| utf_16 | U16, utf16 | all languages |
| utf_16_be | UTF-16BE | all languages (BMP only) |
| utf_16_le | UTF-16LE | all languages (BMP only) |
| utf_7 | U7, unicode-1-1-utf-7 | all languages |
| utf_8 | U8, UTF, utf8 | all languages |
| utf_8_sig | | all languages |
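
The alias column can be cross-checked against the interpreter itself. The sketch below assumes CPython, where the `encodings.aliases` module exposes a dictionary named `aliases` that maps normalized alias spellings (lowercase, with underscores instead of hyphens) to codec names. The helper `aliases_by_codec` is illustrative, and the exact contents of the mapping may differ between Python versions.

```python
from collections import defaultdict
from encodings.aliases import aliases  # normalized alias -> codec name


def aliases_by_codec():
    """Group the interpreter's alias mapping by target codec."""
    grouped = defaultdict(list)
    for alias, codec in aliases.items():
        grouped[codec].append(alias)
    return {codec: sorted(names) for codec, names in grouped.items()}


table = aliases_by_codec()
for codec in ("ascii", "gbk", "latin_1", "shift_jis"):
    print(codec, table[codec])
```

The printed lists are usually longer than the table above, which matches the statement that the list of aliases is not meant to be exhaustive.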

Python Specific Encodings

A number of predefined codecs are specific to Python, so their codec names have no meaning outside Python. These are listed in the tables below based on the expected input and output types (note that while text encodings are the most common use case for codecs, the underlying codec infrastructure supports arbitrary data transforms rather than just text encodings). For asymmetric codecs, the stated purpose describes the encoding direction.

The following codecs provide unicode-to-str encoding [1] and str-to-unicode decoding [2], similar to the Unicode text encodings.

| Codec | Aliases | Purpose |
|---|---|---|
| idna | | Implements RFC 3490, see also encodings.idna |
| mbcs | dbcs | Windows only: Encode operand according to the ANSI codepage (CP_ACP) |
| palmos | | Encoding of PalmOS 3.5 |
| punycode | | Implements RFC 3492 |
| raw_unicode_escape | | Produce a string that is suitable as raw Unicode literal in Python source code |
| rot_13 | rot13 | Returns the Caesar-cypher encryption of the operand |
| undefined | | Raise an exception for all conversions. Can be used as the system encoding if no automatic coercion between byte and Unicode strings is desired. |
| unicode_escape | | Produce a string that is suitable as Unicode literal in Python source code |
| unicode_internal | | Return the internal representation of the operand |

New in version 2.3: The idna and punycode encodings.
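
A few of these codecs can be tried directly. The sketch below assumes Python 3, where encoding text yields `bytes` and where `rot_13` is a text-to-text transform reached through `codecs.encode()` rather than a string method. The sample strings are illustrative. Codecs such as `unicode_internal` and `mbcs` are left out because they are not available on every version or platform.

```python
import codecs

# idna: internationalized domain labels become ASCII "xn--" labels (RFC 3490).
idna_bytes = u"b\u00fccher.example".encode("idna")

# punycode: the underlying RFC 3492 transform, without the "xn--" prefix.
puny_bytes = u"b\u00fccher".encode("punycode")

# unicode_escape: output suitable for a Unicode literal in Python source.
escaped = u"\u00e9\n".encode("unicode_escape")

# raw_unicode_escape: only characters outside Latin-1 become \uXXXX escapes.
raw_escaped = u"\u20ac".encode("raw_unicode_escape")

# rot_13: each ASCII letter is rotated by 13 places.
rot = codecs.encode(u"Hello", "rot_13")

print(idna_bytes, puny_bytes, escaped, raw_escaped, rot)
```

Decoding with the same codec name reverses each transform, for example `idna_bytes.decode("idna")`.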

The following codecs provide str-to-str encoding and decoding [2].

| Codec | Aliases | Purpose | Encoder/decoder |
|---|---|---|---|
| base64_codec | base64, base-64 | Convert operand to multiline MIME base64 (the result always includes a trailing '\n') | base64.encodestring(), base64.decodestring() |
| bz2_codec | bz2 | Compress the operand using bz2 | bz2.compress(), bz2.decompress() |
| hex_codec | hex | Convert operand to hexadecimal representation, with two digits per byte | binascii.b2a_hex(), binascii.a2b_hex() |
| quopri_codec | quopri, quoted-printable, quotedprintable | Convert operand to MIME quoted printable | quopri.encode() with quotetabs=True, quopri.decode() |
| string_escape | | Produce a string that is suitable as string literal in Python source code | |
| uu_codec | uu | Convert the operand using uuencode | uu.encode(), uu.decode() |
| zlib_codec | zip, zlib | Compress the operand using gzip | zlib.compress(), zlib.decompress() |

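
The relationship between these codecs and the listed encoder and decoder functions can be sketched as follows. This example assumes Python 3, where the transforms operate on `bytes` and are reached through `codecs.encode()` and `codecs.decode()` instead of the string methods described in this document. It also sticks to the full `*_codec` names, since alias support differs between versions, and it omits `string_escape`, which Python 3 does not provide. The sample data is illustrative.

```python
import binascii
import codecs
import zlib

data = b"hello"

# hex_codec: two hexadecimal digits per input byte, as binascii.b2a_hex() does.
hex_bytes = codecs.encode(data, "hex_codec")

# base64_codec: MIME base64, with a trailing newline in the result.
b64_bytes = codecs.encode(data, "base64_codec")

# zlib_codec: the output can be read back with zlib.decompress().
payload = data * 100
packed = codecs.encode(payload, "zlib_codec")
unpacked = codecs.decode(packed, "zlib_codec")

print(hex_bytes, b64_bytes, len(payload), len(packed))
```

Because each codec is a thin wrapper, data written through the codec can be read back with the plain module function and vice versa.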
[1] str objects are also accepted as input in place of unicode objects. They are implicitly converted to unicode by decoding them using the default encoding. If this conversion fails, it may lead to encoding operations raising UnicodeDecodeError.
[2] unicode objects are also accepted as input in place of str objects. They are implicitly converted to str by encoding them using the default encoding. If this conversion fails, it may lead to decoding operations raising UnicodeEncodeError.

