2024 iThome Ironman Contest (鐵人賽)


Reading the CPython Source Code for Yourself (為你自己讀 CPython 原始碼) series, part 13

Day 13 - A Tour of the Bytecode Factory


高見龍

2024-09-27 23:36:57


This article is also published as "為你自己學 Python - 參觀 Bytecode 工廠" (Learn Python for Yourself - A Tour of the Bytecode Factory).

A Tour of the Bytecode Factory


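Python is usually classified as an interpreted language, but as earlier chapters have mentioned several times, Python code is compiled to bytecode before it runs. A .pyc file is where that compiled bytecode gets saved, and its main purpose is to let Python skip the compile step next time, so programs load faster (the bytecode itself does not run any faster for having come from a .pyc). In this chapter we'll look at how .pyc files are produced and what interesting things are packed inside.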

Actually, a `.pyc` is all you need

For the experiment, I'll first prepare a really impressive (not really) hello module:

# File: hello.py

def greeting(name):
    print(f"Hello, {name}!")

And the main program, app.py:

# File: app.py

from hello import greeting
greeting("Kitty")

There's nothing nutritious in this code; it's for demonstration only. After running python app.py, you should find a new __pycache__ folder in the directory, containing a file named hello.cpython-312.pyc:

├── __pycache__
│   └── hello.cpython-312.pyc
├── app.py
└── hello.py

The naming scheme of hello.cpython-312.pyc is not hard to guess: it records which Python implementation and version did the compiling, here CPython 3.12.
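The file name can also be reproduced from Python itself. Here is a minimal sketch, assuming the default cache layout (no custom pycache prefix): importlib.util.cache_from_source() builds the path, and sys.implementation.cache_tag supplies the cpython-312 part.

```python
import importlib.util
import sys

# The tag embedded in cached file names, e.g. "cpython-312" on CPython 3.12.
tag = sys.implementation.cache_tag

# Where the import system would cache the bytecode for hello.py.
pyc_path = importlib.util.cache_from_source("hello.py")

print(tag)
print(pyc_path)
```

On a different Python version the tag changes, which is why caches from several versions can sit side by side in the same __pycache__ folder.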

This file is the bytecode that Python compiled. Notice, though, that only hello.py got a .pyc file. If you also want one for app.py, the built-in py_compile module can do it:

$ python -m py_compile app.py

Or use the more convenient compileall module:

$ python -m compileall .

which compiles every .py file into a .pyc file in one go.

Next, delete the .py files, go into the __pycache__ directory, rename app.cpython-312.pyc and hello.cpython-312.pyc to app.pyc and hello.pyc respectively, and run python app.pyc. The program still runs fine:

$ python app.pyc
Hello, Kitty!
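The same experiment can be scripted. This is only a sketch under a few assumptions: it works in a throwaway temp directory, and instead of renaming files it uses the cfile argument of py_compile.compile() to write hello.pyc and app.pyc directly. The helper name run_without_sources is made up for illustration.

```python
import py_compile
import subprocess
import sys
import tempfile
from pathlib import Path

HELLO_SRC = 'def greeting(name):\n    print(f"Hello, {name}!")\n'
APP_SRC = 'from hello import greeting\ngreeting("Kitty")\n'


def run_without_sources() -> str:
    """Compile hello.py and app.py, delete the sources, then run app.pyc."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for name, src in (("hello", HELLO_SRC), ("app", APP_SRC)):
            source = root / f"{name}.py"
            source.write_text(src)
            # cfile picks the output name directly, so no renaming is needed.
            py_compile.compile(
                str(source), cfile=str(root / f"{name}.pyc"), doraise=True
            )
            source.unlink()  # only the .pyc files remain
        result = subprocess.run(
            [sys.executable, "app.pyc"],
            cwd=root, capture_output=True, text=True, check=True,
        )
        return result.stdout


print(run_without_sources())
```

Note that both .pyc files must come from the same Python that runs them, which is why the sketch uses sys.executable for everything.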

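There's nothing mysterious about this. In the previous chapter we mentioned that while Python reads a program file, a function called maybe_pyc_file() checks whether it is a .pyc file, and if so the file is read in binary mode and executed.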

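So when, for whatever reason, you don't want to hand out the .py source, shipping only the .pyc files still works. Why you wouldn't want to provide the source isn't my concern here. What I care about is why running python app.py produces no .pyc for the main program app.py, while the hello module imported by it does get one.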

To answer that, we have to look at Python's import mechanism...

"Maybe" a .pyc file?

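We traced CPython's import mechanism in chapter 7, "匯入模組的時候..." (When importing a module...): the first half of the work happens in Python/import.c, and the second half is handed over to Lib/importlib/_bootstrap.py. It is during that import process that .pyc files get written.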

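When running the main program, however, we saw in the previous chapter that nothing between the very first Py_BytesMain() and the final run_eval_code_obj() ever writes a .pyc file. Python does check whether the file it was asked to run is itself a .pyc, in which case it reads the file in binary mode, and otherwise it calls pyrun_file(). Most of the time a main program is run only once, so although saving it as a .pyc to speed up the next run would be possible, it doesn't seem worth much, and Python doesn't bother producing a .pyc when running the main program. Modules that get imported, on the other hand, may be reused by many different programs, so turning them into .pyc files to speed up later runs makes sense.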
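This difference between the main program and imported modules is easy to observe. The following is a rough sketch, assuming a writable temp directory and a child interpreter started from sys.executable; the helper name cached_modules_after_run is invented for this example.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path


def cached_modules_after_run() -> list[str]:
    """Run app.py in a scratch directory and list the .pyc files that appear."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "hello.py").write_text(
            'def greeting(name):\n    print(f"Hello, {name}!")\n'
        )
        (root / "app.py").write_text(
            'from hello import greeting\ngreeting("Kitty")\n'
        )
        # Drop settings that would suppress or relocate bytecode caching.
        env = {
            key: value for key, value in os.environ.items()
            if key not in ("PYTHONDONTWRITEBYTECODE", "PYTHONPYCACHEPREFIX")
        }
        subprocess.run(
            [sys.executable, "app.py"],
            cwd=root, env=env, check=True, capture_output=True,
        )
        return sorted(p.name for p in (root / "__pycache__").glob("*.pyc"))


print(cached_modules_after_run())
```

Only a hello.*.pyc entry should show up; there is no cached file for app.py because it was run as the main program instead of being imported.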

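Why does maybe_pyc_file() sound so unsure of itself? Isn't almost everything in programming black or white, 0 or 1? How can something "maybe" be a .pyc? Let's see what's going on: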

// File: Python/pythonrun.c

static int
maybe_pyc_file(FILE *fp, PyObject *filename, int closeit)
{
    PyObject *ext = PyUnicode_FromString(".pyc");
    if (ext == NULL) {
        return -1;
    }
    Py_ssize_t endswith = PyUnicode_Tailmatch(filename, ext, 0, PY_SSIZE_T_MAX, +1);
    Py_DECREF(ext);
    if (endswith) {
        return 1;
    }

    // ... omitted ...

    /* Read only two bytes of the magic. If the file was opened in
       text mode, the bytes 3 and 4 of the magic (\r\n) might not
       be read as they are on disk. */
    unsigned int halfmagic = PyImport_GetMagicNumber() & 0xFFFF;
    unsigned char buf[2];
    /* Mess:  In case of -x, the stream is NOT at its start now,
       and ungetc() was used to push back the first newline,
       which makes the current stream position formally undefined,
       and a x-platform nightmare.
       Unfortunately, we have no direct way to know whether -x
       was specified.  So we use a terrible hack:  if the current
       stream position is not 0, we assume -x was specified, and
       give up.  Bug 132850 on SourceForge spells out the
       hopelessness of trying anything else (fseek and ftell
       don't work predictably x-platform for text-mode files).
    */
    int ispyc = 0;
    if (ftell(fp) == 0) {
        if (fread(buf, 1, 2, fp) == 2 &&
            ((unsigned int)buf[1]<<8 | buf[0]) == halfmagic)
            ispyc = 1;
        rewind(fp);
    }
    return ispyc;
}

It checks:

1. Whether the file extension is .pyc.
2. If the extension is not .pyc, whether the first two bytes of the file match Python's "magic number".
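Those two checks can be mirrored in a few lines of Python. This is only a sketch of the idea, not a port of the C function: it skips the -x handling entirely, and maybe_pyc is a name made up for this example. It relies on importlib.util.MAGIC_NUMBER, the public copy of the magic bytes.

```python
import importlib.util
import os
import py_compile
import tempfile


def maybe_pyc(path: str) -> bool:
    """Rough Python counterpart of the two checks in maybe_pyc_file()."""
    # Check 1: trust the file extension.
    if path.endswith(".pyc"):
        return True
    # Check 2: compare the first two bytes with the first half of the magic
    # number (the C code avoids bytes 3 and 4 because they are \r\n).
    with open(path, "rb") as f:
        return f.read(2) == importlib.util.MAGIC_NUMBER[:2]


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "hello.py")
    with open(src, "w") as f:
        f.write("x = 1\n")
    # Compile to a file whose name does NOT end in .pyc.
    disguised = py_compile.compile(
        src, cfile=os.path.join(tmp, "hello.bin"), doraise=True
    )
    source_result = maybe_pyc(src)
    disguised_result = maybe_pyc(disguised)

print(source_result, disguised_result)
```

The second file has the wrong extension but is still recognized, because its first two bytes give it away.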

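We'll get to the magic number in a moment. The long comment in the middle describes a messy special case: when the -x option is used, the stream is no longer at its start, and because fseek and ftell don't behave predictably across platforms for text-mode files, the function simply gives up whenever the stream position is not 0. The full story is in Bug 132850 on SourceForge. I think that uncertainty is why the function's name starts with maybe_.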

Magic number, Magic!

So what is this "magic number"? Let's see how PyImport_GetMagicNumber() obtains it:

// File: Python/import.c

long
PyImport_GetMagicNumber(void)
{
    long res;
    PyInterpreterState *interp = _PyInterpreterState_GET();
    PyObject *external, *pyc_magic;

    external = PyObject_GetAttrString(IMPORTLIB(interp), "_bootstrap_external");
    if (external == NULL)
        return -1;
    pyc_magic = PyObject_GetAttrString(external, "_RAW_MAGIC_NUMBER");
    Py_DECREF(external);
    if (pyc_magic == NULL)
        return -1;
    res = PyLong_AsLong(pyc_magic);
    Py_DECREF(pyc_magic);
    return res;
}

It looks like it borrows Python's importlib._bootstrap_external module to fetch the _RAW_MAGIC_NUMBER attribute. Let's keep tracing:

# File: Lib/importlib/_bootstrap_external.py

MAGIC_NUMBER = (3531).to_bytes(2, 'little') + b'\r\n'
_RAW_MAGIC_NUMBER = int.from_bytes(MAGIC_NUMBER, 'little')  # For import.c

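Hey, friendly Python code again. The second argument of .to_bytes() is little, meaning the bytes are laid out in little-endian order. Besides little there is also big-endian, and the two produce different results. Let me work it out with the number 9527: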

9527 in hexadecimal is 0x2537.
Big-endian puts the most significant byte (MSB) first, so the result is \x25\x37.
Little-endian puts the least significant byte (LSB) first, giving \x37\x25.
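The arithmetic above is easy to check in Python. A small sketch: the first half repeats the 9527 example, and the second half applies the same recipe to 3531 and then masks the low 16 bits the way maybe_pyc_file() does for its halfmagic.

```python
number = 9527
print(hex(number))  # 0x2537

big = number.to_bytes(2, "big")        # most significant byte first
little = number.to_bytes(2, "little")  # least significant byte first
# Python shows these as b'%7' and b'7%', because 0x25 and 0x37 happen to be
# the ASCII codes of '%' and '7'.
print(big, little)

# The same recipe applied to 3531, the value used by Python 3.12.
magic = (3531).to_bytes(2, "little") + b"\r\n"
raw = int.from_bytes(magic, "little")
# Keeping only the low 16 bits recovers 3531.
halfmagic = raw & 0xFFFF
print(magic, hex(raw), halfmagic)
```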

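Back to the original code: it converts the number 3531 into 2 bytes and then appends \r\n, and that is Python's "magic number". Scroll up a little in the same file and you'll see where 3531 comes from:

# File: Lib/importlib/_bootstrap_external.py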

# Known values:
#  Python 1.5:   20121
#  Python 1.5.1: 20121
#     Python 1.5.2: 20121
#     Python 1.6:   50428
#     Python 2.0:   50823
#     ... omitted ...
#     Python 2.7a0  62211 (introduce MAP_ADD and SET_ADD)
#     Python 3000:   3000
#                    3010 (removed UNARY_CONVERT)
#                    3020 (added BUILD_SET)
#     ... omitted ...
#     Python 3.12b1 3530 (Shrink the LOAD_SUPER_ATTR caches)
#     Python 3.12b1 3531 (Add PEP 695 changes)
#
#     Python 3.13 will start with 3550

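In other words, each Python version has its own corresponding "magic number" (it is bumped whenever the bytecode format changes). We can look it up in the REPL, and different versions of Python give different values: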

# Python 3.11.9
>>> from importlib._bootstrap_external import MAGIC_NUMBER
>>> MAGIC_NUMBER
b'\xa7\r\r\n'

# Python 3.12.6
>>> from importlib._bootstrap_external import MAGIC_NUMBER
>>> MAGIC_NUMBER
b'\xcb\r\r\n'

Here I used 3.11 and 3.12, and you can see that the magic numbers differ. Earlier we used the py_compile module to produce .pyc files, so let's see how it does that:

# File: Lib/py_compile.py

def compile(file, cfile=None, dfile=None, doraise=False, optimize=-1,
            invalidation_mode=None, quiet=0):
    # ... omitted ...

    if invalidation_mode == PycInvalidationMode.TIMESTAMP:
        source_stats = loader.path_stats(file)
        bytecode = importlib._bootstrap_external._code_to_timestamp_pyc(
            code, source_stats['mtime'], source_stats['size'])
    else:
        source_hash = importlib.util.source_hash(source_bytes)
        bytecode = importlib._bootstrap_external._code_to_hash_pyc(
            code,
            source_hash,
            (invalidation_mode == PycInvalidationMode.CHECKED_HASH),
        )
    # ... omitted ...

The compile function makes a few decisions and then builds the file contents with either _code_to_timestamp_pyc() or _code_to_hash_pyc(). Here are those two functions:

# File: Lib/importlib/_bootstrap_external.py

def _code_to_timestamp_pyc(code, mtime=0, source_size=0):
    data = bytearray(MAGIC_NUMBER)
    data.extend(_pack_uint32(0))
    data.extend(_pack_uint32(mtime))
    data.extend(_pack_uint32(source_size))
    data.extend(marshal.dumps(code))
    return data

def _code_to_hash_pyc(code, source_hash, checked=True):
    data = bytearray(MAGIC_NUMBER)
    flags = 0b1 | checked << 1
    data.extend(_pack_uint32(flags))
    assert len(source_hash) == 8
    data.extend(source_hash)
    data.extend(marshal.dumps(code))
    return data

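The two functions put different data in the middle, but both start with the "magic number", which goes at the very front of the bytearray when the .pyc is built. In other words, the .pyc files produced by different versions of Python are different. I ran a simple experiment: I compiled app.pyc with Python 3.11 and hello.pyc with Python 3.12, and running it gave this result: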

$ python app.pyc
RuntimeError: Bad magic number in .pyc file

which shows that bytecode compiled by different versions of Python is not interchangeable.
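Based on the two _code_to_*_pyc() functions shown earlier, the first 16 bytes of a .pyc can be taken apart by hand. The sketch below assumes that layout (4 bytes of magic, 4 bytes of flags, then 8 bytes of either mtime plus size or a source hash); read_pyc_header is a hypothetical helper written for this example, not something from the standard library.

```python
import os
import py_compile
import struct
import tempfile
from importlib.util import MAGIC_NUMBER


def read_pyc_header(path: str) -> dict:
    """Split the 16-byte .pyc header into its fields."""
    with open(path, "rb") as f:
        header = f.read(16)
    (flags,) = struct.unpack("<I", header[4:8])
    fields = {"magic": header[:4], "flags": flags}
    if flags & 0b1:
        # Hash-based .pyc: the remaining 8 bytes are the source hash.
        fields["source_hash"] = header[8:16]
    else:
        # Timestamp-based .pyc: two little-endian 32-bit integers.
        fields["mtime"], fields["source_size"] = struct.unpack(
            "<II", header[8:16]
        )
    return fields


source_text = 'def greeting(name):\n    print(f"Hello, {name}!")\n'
with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "hello.py")
    with open(source, "w", newline="") as f:
        f.write(source_text)
    pyc = py_compile.compile(
        source,
        cfile=os.path.join(tmp, "hello.pyc"),
        doraise=True,
        invalidation_mode=py_compile.PycInvalidationMode.TIMESTAMP,
    )
    header = read_pyc_header(pyc)

print(header)
# When this comparison fails, the file is rejected with "Bad magic number".
print(header["magic"] == MAGIC_NUMBER)
```

Comparing the first four bytes against the running interpreter's MAGIC_NUMBER is the check that failed in the 3.11 versus 3.12 experiment.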

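What about the compileall module? If you trace its source you'll find that it essentially runs a for loop and calls py_compile.compile() on each file. Once you understand that trick, we can also call py_compile.compile() directly to produce a .pyc file: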

$ python
>>> from py_compile import compile
>>> compile("hello.py")
'__pycache__/hello.cpython-312.pyc'

It really works!
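To make the "for loop plus py_compile.compile()" idea concrete, here is a simplified stand-in. It is not the actual compileall implementation, which also handles options such as skipping files that are already up to date; compile_tree is a name invented for this sketch.

```python
import os
import py_compile
import tempfile


def compile_tree(root: str) -> list[str]:
    """Compile every .py file under root; a simplified stand-in for compileall."""
    compiled = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in sorted(filenames):
            if filename.endswith(".py"):
                source = os.path.join(dirpath, filename)
                # compile() returns the path of the .pyc that was written.
                compiled.append(py_compile.compile(source, doraise=True))
    return compiled


with tempfile.TemporaryDirectory() as tmp:
    for name in ("hello.py", "app.py"):
        with open(os.path.join(tmp, name), "w") as f:
            f.write("print('hi')\n")
    written = [os.path.basename(path) for path in compile_tree(tmp)]

print(written)
```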

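At this point you might think: Python turns source code into an AST, then into bytecode, and may save it as a .pyc along the way, so a .pyc file is just bytecode. Well, the general direction is right, but let's open a .pyc file and see what's really inside.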

Unpacking the `.pyc` file

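You probably noticed that when a .pyc file is produced, the compiled code is turned into binary data with marshal.dumps(), so in principle marshal.loads() should be able to turn that binary data back. Let's write a small program: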

with open("hello.pyc", "rb") as f:
    print(f.read())

Assuming my .pyc file is hello.pyc, this simply reads the file in binary mode. The output looks like this:

b'\xcb\r\r\n\x00\x00\x00\x00S\xac\xf6f1\x00\...

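Do the first few bytes look familiar? Right, that's the "magic number". The real content comes after the header, which is 16 bytes in total: the 4-byte magic number, a 4-byte flags field, and 8 more bytes holding either the source mtime and size or the source hash. So I skip the first 16 bytes and then use marshal.load() to read the data: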

import marshal

with open("hello.pyc", "rb") as f:
    f.read(16)  # skip the 16-byte header (magic, flags, mtime+size or hash)
    content = marshal.load(f)
    print(type(content))

Run it and you'll find that this thing is a code object. We'll cover code objects in detail in later chapters. The object has a .co_code attribute that exposes its bytecode, and printing it gives this:

b'\x97\x00d\x00\x84\x00Z\x00y\x01'

Is this byte string source code? It isn't. Let me turn it into a list:

>>> list(b'\x97\x00d\x00\x84\x00Z\x00y\x01')
[151, 0, 100, 0, 132, 0, 90, 0, 121, 1]

What do these numbers mean? Remember how we sometimes borrow the dis module to expand code into bytecode? Do the following instructions look familiar?

// File: Include/opcode.h

#define CACHE                                    0
#define POP_TOP                                  1
#define PUSH_NULL                                2
#define INTERPRETER_EXIT                         3
// ... omitted ...
#define SWAP                                    99
#define LOAD_CONST                             100
#define LOAD_NAME                              101
// ... omitted ...
#define YIELD_VALUE                            150
#define RESUME                                 151
#define MATCH_CLASS                            152
#define FORMAT_VALUE                           155
// ... omitted ...

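Every instruction corresponds to a number, and they are all defined in opcode.h. opcode is short for Operation Code, and Python's virtual machine (VM) executes the program according to these opcodes. Let's borrow the dis module to look up the name for each number in the list: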

>>> import dis
>>> ops = [151, 0, 100, 0, 132, 0, 90, 0, 121, 1]
>>> for op in ops:
...   print(dis.opname[op])
...
RESUME
CACHE
LOAD_CONST
CACHE
MAKE_FUNCTION
CACHE
STORE_NAME
CACHE
RETURN_CONST
POP_TOP

Now let's use the dis module to look at the bytecode of hello.py:

$ python -m dis hello.py
  0           0 RESUME                   0
  1           2 LOAD_CONST               0 (<code object greeting at 0x102b8d020, file "hello.py", line 1>)
              4 MAKE_FUNCTION            0
              6 STORE_NAME               0 (greeting)
              8 RETURN_CONST             1 (None)

... omitted ...

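The number for CACHE is 0, but the zeros here are not real CACHE instructions. Since Python 3.6 every instruction occupies two bytes, an opcode followed by its argument, so the loop above also looked up the argument bytes as if they were opcodes: the 0s came out as CACHE, and the final 1 (the argument of RETURN_CONST, pointing at None) came out as POP_TOP. Keeping only the opcode bytes leaves these instructions:

RESUME
LOAD_CONST
MAKE_FUNCTION
STORE_NAME
RETURN_CONST

which is the same instruction sequence that the dis module printed.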

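So, to put it more precisely, a .pyc file stores a marshalled code object whose bytecode is a byte string made up of a series of opcodes and their arguments, and in the end Python's VM executes the program by following those opcodes.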
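Putting the whole chapter together, here is a rough end-to-end sketch: compile a source file, skip the 16-byte header, load the code object with marshal, and read co_code two bytes at a time. It is deliberately naive (it does not handle EXTENDED_ARG, and on Python 3.11 and later some instructions are followed by real CACHE entries), so treat it as an illustration rather than a replacement for dis.

```python
import dis
import marshal
import os
import py_compile
import tempfile

source_text = 'def greeting(name):\n    print(f"Hello, {name}!")\n'

with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "hello.py")
    with open(source, "w") as f:
        f.write(source_text)
    pyc = py_compile.compile(
        source, cfile=os.path.join(tmp, "hello.pyc"), doraise=True
    )
    with open(pyc, "rb") as f:
        f.read(16)              # skip the header
        code = marshal.load(f)  # the module's code object

raw = code.co_code
# Even offsets hold opcodes, odd offsets hold each opcode's argument.
decoded = [(dis.opname[raw[i]], raw[i + 1]) for i in range(0, len(raw), 2)]
for name, arg in decoded:
    print(f"{name:<15} {arg}")
```

The exact instruction names depend on the Python version that runs the sketch, since opcodes change between releases just like the magic number does.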
