Linux Learning: December 2017

https://docs.python.org/3/howto/unicode.html
The original ASCII code covers only the values 0 to 127 (7 bits), which leaves no room for additional special characters.
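Personal computers in the 1980s were almost all 8-bit, so they could handle ASCII and still had the values 128 to 255 available, but different machines assigned different characters to those codes.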
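Unicode started out using 16 bits in place of the original 8, giving 2^16 (65,536) values, with the goal of covering every language. That turned out not to be enough, so the range was later extended to 0 through 1,114,111 (0x10FFFF in base 16).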
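The range above can be checked from Python itself. This is a small sketch using sys.maxunicode and chr(); the variable names are only illustrative.

```python
import sys

# The largest code point Python accepts matches the Unicode limit.
highest = sys.maxunicode
print(hex(highest))  # 0x10ffff

# chr() accepts every integer in the range 0..0x10FFFF ...
last_char = chr(0x10FFFF)

# ... and raises ValueError for anything beyond it.
try:
    chr(0x110000)
    out_of_range_rejected = False
except ValueError:
    out_of_range_rejected = True

print(out_of_range_rejected)  # True
```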
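For each character, Unicode defines a code point, which is that character's numeric value. A code point is written in the form U+12CA, meaning the character whose value is 0x12ca.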
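To make the U+ notation concrete, here is a minimal sketch that formats a character's code point. The helper name code_point_label is hypothetical, not something from the HOWTO.

```python
def code_point_label(ch: str) -> str:
    """Return the U+XXXX label for a single character (hypothetical helper)."""
    # ord() gives the code point; format it as at least four uppercase hex digits.
    return f"U+{ord(ch):04X}"

print(code_point_label("\u12ca"))  # U+12CA
print(code_point_label("A"))       # U+0041
```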
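An encoding is the set of rules for turning a string into a sequence of bytes. UTF-8 is the most widely used encoding: UTF stands for Unicode Transformation Format, and the 8 means that 8-bit values are used in the encoding. The rules are: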

If the code point is < 128, it’s represented by the corresponding byte value.
If the code point is >= 128, it’s turned into a sequence of two, three, or four bytes, where each byte of the sequence is between 128 and 255.
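The two rules can be observed by encoding a few characters and counting the bytes. This is just a sketch; the sample characters are my own choice.

```python
# One sample character for each UTF-8 length: 1, 2, 3 and 4 bytes.
samples = ["A", "\u00e9", "\u20ac", "\U0001f600"]

lengths = []
for ch in samples:
    encoded = ch.encode("utf-8")
    lengths.append(len(encoded))
    # Show the code point, the byte count, and the byte values.
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {list(encoded)}")

print(lengths)  # [1, 2, 3, 4]
```

Every byte of the multi-byte sequences is in the range 128 to 255, while the ASCII character "A" keeps its single byte value 65.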

Python's default source code encoding is UTF-8. It can be changed with a special comment on the first or second line of the file:

# -*- coding: <encoding name> -*-

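Python supports identifiers (such as variable names) that contain Unicode characters.
To keep Python source code ASCII-only, write non-ASCII characters with the escape sequences \u, \U, or \N{character name}: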
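As a short illustration of Unicode identifiers (the names and values below are made up for the example):

```python
# Non-ASCII letters are allowed in identifiers.
répertoire = "/tmp/records.log"
溫度 = 25.5

print(répertoire)
print(溫度 + 1)  # 26.5
```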

>>> "\N{GREEK CAPITAL LETTER DELTA}"  # Using the character name
'\u0394'
>>> "\u0394"                          # Using a 16-bit hex value
'\u0394'
>>> "\U00000394"                      # Using a 32-bit hex value
'\u0394'

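In addition, the bytes method decode() produces a Unicode string. It takes an encoding argument and an error-handling argument (see the link at the top).
Python 3.2 comes with roughly 100 different encodings.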
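A brief sketch of how the error-handling argument of decode() behaves, assuming UTF-8 input containing one invalid byte (the sample bytes are illustrative):

```python
raw = b"\x80abc"  # 0x80 is not a valid start byte in UTF-8

# The default handler ("strict") raises UnicodeDecodeError on invalid input.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Other handlers substitute, drop, or escape the offending byte.
replaced = raw.decode("utf-8", "replace")           # '\ufffdabc'
ignored = raw.decode("utf-8", "ignore")             # 'abc'
escaped = raw.decode("utf-8", "backslashreplace")   # '\\x80abc'

print(strict_failed, repr(replaced), repr(ignored), repr(escaped))
```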
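For single characters, chr(int) returns the one-character Unicode string for a code point, and ord(str) returns the code point of a one-character string.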
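A small sketch of the two functions working as inverses, using the Greek capital delta as the sample character:

```python
delta = chr(0x0394)       # code point -> one-character string
code_point = ord(delta)   # one-character string -> code point

print(delta)       # Δ
print(code_point)  # 916

# ord() only accepts a single character; longer strings raise TypeError.
try:
    ord("ab")
    multi_char_rejected = False
except TypeError:
    multi_char_rejected = True
```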
Conversion between strings and byte strings is done with bytes.decode() and str.encode().
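A minimal round trip between the two types, assuming UTF-8 as the encoding (the sample text is illustrative):

```python
text = "caf\u00e9"               # 4 characters
data = text.encode("utf-8")      # str -> bytes
restored = data.decode("utf-8")  # bytes -> str

print(data)              # b'caf\xc3\xa9'
print(len(text), len(data))  # 4 5
print(restored == text)  # True
```

The character count and the byte count differ because the last character takes two bytes in UTF-8.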

The most important tip is:

Software should only work with Unicode strings internally, decoding the input data as soon as possible and encoding the output only at the end.
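One way to picture this tip is a function that decodes at its boundary, works on str, and encodes only when returning. The function name shout and the sample input are hypothetical, and UTF-8 is assumed as the encoding.

```python
def shout(raw: bytes, encoding: str = "utf-8") -> bytes:
    """Uppercase text received as bytes (hypothetical helper)."""
    text = raw.decode(encoding)      # decode the input as soon as possible
    result = text.upper()            # work only with str internally
    return result.encode(encoding)   # encode the output only at the end

raw_input = "caf\u00e9".encode("utf-8")

print(shout(raw_input))   # b'CAF\xc3\x89'
print(raw_input.upper())  # b'CAF\xc3\xa9'
```

Calling upper() directly on the bytes leaves the accented letter unchanged, because bytes methods only treat ASCII letters as letters; operating on the decoded string handles it correctly.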

Wednesday, December 20, 2017

Python's Unicode HOWTO