Notes on a pitfall in Python's struct module

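Introduction

Yesterday, while working on a Python socket programming assignment, I ran into an interesting behavior of the struct module, so I am writing it down here.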

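Main text

Here is the situation: I wanted to pack a positive integer, a string of length 3, and another positive integer into bytes.

The intuitive approach was: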

import struct
struct.pack("I3sI", 1, b'abc', 1)
# b'\x01\x00\x00\x00abc\x00\x01\x00\x00\x00'

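But this output was not what I expected. A positive integer (4 bytes) + a string of length 3 (3 bytes) + a positive integer (4 bytes) should add up to 11 bytes, yet the output b'\x01\x00\x00\x00abc\x00\x01\x00\x00\x00' is 12 bytes.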

So I checked with calcsize:

struct.calcsize("I3sI")  # 12

It is indeed 12 bytes, and the extra byte shows up right after the string in the middle, which now effectively occupies 4 bytes.

So I ran a few experiments:

struct.calcsize("3s")
# 3
struct.calcsize("3sI")
# 8
struct.calcsize("4sI")
# 8
struct.calcsize("5sI")
# 12

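So 3s by itself really is 3 bytes, but when more fields follow it, it gets padded to 4 bytes. With 4sI the total is still 8 bytes, so nothing is added there, while 5sI grows to 12 bytes.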

My first guess was that struct was automatically aligning fields to a 4-byte word.
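
One way to test this guess is to predict the size by hand and compare it with calcsize. The sketch below is only an illustration: the helper names padded_offset and predicted_size are made up here, and it assumes that the alignment of I equals its size, which is the case on common platforms but is not guaranteed everywhere.

```python
import struct


def padded_offset(offset: int, alignment: int) -> int:
    """Round offset up to the next multiple of alignment."""
    return (offset + alignment - 1) // alignment * alignment


def predicted_size(n: int) -> int:
    """Predicted native size of the format f"{n}sI".

    Assumption: the alignment of I equals struct.calcsize("I").
    """
    int_size = struct.calcsize("I")
    # The string occupies bytes 0..n-1; the I must start on an aligned offset.
    return padded_offset(n, int_size) + int_size


for n in (3, 4, 5):
    # Print the hand-computed size next to what struct reports.
    print(n, predicted_size(n), struct.calcsize(f"{n}sI"))
```

If the two numbers match for every n, the alignment guess is consistent with what struct is doing.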

I looked it up in the documentation:

By default, C types are represented in the machine’s native format and byte order, and properly aligned by skipping pad bytes if necessary (according to the rules used by the C compiler).
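
A small related check, as a sketch: in native mode the pad bytes are inserted before a field that needs alignment, not after the last field. The struct documentation describes ending the format with a zero-repeat type code (such as 0I) to pad the end of the structure. The variable names below are only illustrative.

```python
import struct

# No field follows the 3-byte string, so no pad byte is added after it.
without_trailing = struct.calcsize("I3s")

# A zero-repeat I adds no data, but forces the end to be aligned for I.
with_trailing = struct.calcsize("I3s0I")

print(without_trailing, with_trailing)
```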

At this point the mystery is solved, but the problem itself is not.

The fix is to put a prefix character at the start of fmt that tells the struct module whether or not to apply automatic alignment.

struct.pack("3sI", b'abc', 1)
# b'abc\x00\x01\x00\x00\x00'
struct.pack("@3sI", b'abc', 1)
# b'abc\x00\x01\x00\x00\x00'
# The two calls above are equivalent: @ is the default and means automatic alignment

struct.pack("=3sI", b'abc', 1)
# b'abc\x01\x00\x00\x00'
# With =, alignment is turned off (native byte order, standard sizes)
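
Applied to the format from the beginning of this post, the same prefix gives the 11 bytes I originally expected. This is just a quick round-trip check; the exact byte content depends on the machine's byte order, so only the length and the unpacked values are shown.

```python
import struct

# = keeps native byte order but uses standard sizes and no padding.
packed = struct.pack("=I3sI", 1, b"abc", 1)
print(len(packed))  # 11

# Unpacking with the same format returns the original values.
values = struct.unpack("=I3sI", packed)
print(values)  # (1, b'abc', 1)
```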

The documentation also thoughtfully notes:

The form ‘!’ is available for those poor souls who claim they can’t remember whether network byte order is big-endian or little-endian.

struct.pack("!3sI", b'abc', 1)
# b'abc\x00\x00\x00\x01'
struct.pack(">3sI", b'abc', 1)
# b'abc\x00\x00\x00\x01'
# For anything sent over the network you can just use !, so there is no need to remember that network byte order is big-endian
struct.pack("<3sI", b'abc', 1)
# b'abc\x01\x00\x00\x00'
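
For the socket assignment that started all this, one possible way to use the ! prefix is to wrap the format in a struct.Struct and share it between sender and receiver. This is a minimal sketch: the message layout and the names HEADER, encode and decode are hypothetical, not taken from the actual assignment.

```python
import struct

# Hypothetical message layout: unsigned int, 3-byte tag, unsigned int,
# in network byte order (big-endian) with no padding.
HEADER = struct.Struct("!I3sI")


def encode(first: int, tag: bytes, second: int) -> bytes:
    """Pack the three fields into HEADER.size bytes."""
    return HEADER.pack(first, tag, second)


def decode(data: bytes) -> tuple:
    """Unpack exactly HEADER.size bytes back into the three fields."""
    return HEADER.unpack(data)


message = encode(1, b"abc", 1)
print(HEADER.size)      # 11
print(message)          # b'\x00\x00\x00\x01abc\x00\x00\x00\x01'
print(decode(message))  # (1, b'abc', 1)
```

Because the size is fixed at 11 bytes regardless of platform, the receiving side knows exactly how many bytes to read before calling decode.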

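The available prefixes are @, =, <, >, and !; see the official documentation if you are interested. Interestingly, only the first one (@), which is also the default mode, applies automatic alignment!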
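
To see the difference at a glance, the five prefixes can be compared on the same format. A small sketch (the helper name sizes_by_prefix is made up for this post); the value for @ depends on the platform and was 12 on my machine, while the other four use standard sizes without padding.

```python
import struct


def sizes_by_prefix(fmt: str) -> dict:
    """Return calcsize for fmt under each of the five prefix characters."""
    return {prefix: struct.calcsize(prefix + fmt) for prefix in "@=<>!"}


sizes = sizes_by_prefix("I3sI")
for prefix, size in sizes.items():
    print(prefix, size)
```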