Day19 | Rosalind 生資解題 - 010. CONS（Consensus and Profile）+max()用法 +collections.Counter - 下篇

17th鐵人賽

gjlmotea

2025-09-30 21:56:56

92 瀏覽

分享至

Day19 | Rosalind 生資解題 - 010. CONS（Consensus and Profile）+max()用法 +collections.Counter - 下篇

延續前一天的內容

補充：`max()`、`min()`的用法

通常會說max()、min()是取得一組當中最大、最小的值
但更準確來說，是在iterable（可叠代物件）之中以特定方法（key）來比較，返回最大、最小的物件

有這幾種用法，參數可帶入key、default：

max(iterable)
max(iterable, key)
max(iterable, default)
max(iterable, key, default)

## ==== min, max 用法研究 ====

## max(), min()
data = [1, 2, 3, 100, 50, 50]
print(min(data))  # 100
print(max(data))  # 1


## default用法
data = [] # default僅在「iterable為空」時有用。ex: data={}, (), set(), ""
print(min(data, default=9999))  # 9999
print(min(data, default=float("inf")))  # inf
print(max(data, default=0))  # 0
print(max(data, default=float("-inf")))  # -inf


## key用法：count
data = [1, 2, 3, 100, 50, 50]
print(data.count(100)) # 1 -> 100出現了1次
print(data.count(50)) # 2 -> 50出現了2次

for i in set(data):  # 用set移除重複
    print(f"{i} 出現了 {data.count(i)} 次")
    # 1 出現了 1 次
    # 2 出現了 1 次
    # 3 出現了 1 次
    # 100 出現了 1 次
    # 50 出現了 2 次

most_frequent_value = max(set(data), key=data.count)
most_frequent_count = data.count(most_frequent_value)

print(most_frequent_value)  # 50
print(most_frequent_count)  # 2


## key用法：len
data = ["apple", "book", "cat", "dog"]
print(len(data[1])) # 4

for i in data:
    print(f"{i} 長度為 {len(i)}")

longest_item = max(data, key=len)
longest_length = len(longest_item)

print(longest_item) # apple
print(longest_length) # 5


## key用法：dict字典（key-value）
data = {
    "Alice": 88,
    "Bob": 95,
    "Charlie": 90,
    "Diana": 60,
}

print(max(data))  # Diana -> 挑出key字母順序最大者
print(max(data.keys()))  # Diana -> 同上，dict叠代預設就是keys()
print(max(data.values()))  # 95 -> 挑出最高分（value最大）

# ⚠️這幾種方法，要嘛挑出key（捨棄value的資訊）、要嘛挑出value（捨棄key的資訊）
# 無法得知value是哪個 key
# 需要反查dict，才能從「最高分」反推出「學生」


## key用法：dict.get字典查值（同時保留 key 與 value 資訊）
data = {
    "Alice": 88,
    "Bob": 95,
    "Charlie": 90,
    "Diana": 60,
}

for i in data:
    print(f"{i} 的分數是 {data.get(i)}")
    # Alice 的分數是 88
    # Bob 的分數是 95
    # Charlie 的分數是 90
    # Diana 的分數是 60

highest_scoring_student = max(data, key=data.get)
highest_score = data[highest_scoring_student]

print(highest_scoring_student)  # Bob
print(highest_score)            # 95

words = ["apple", "Banana", "cherry", "Apricot"]

# 按照轉大寫後的字母順序來比大小
print(max(words, key=str.upper))  # cherry
print(min(words, key=str.upper))  # apple

順帶一提，因為key=...是把「函式」交出去給max()函式使用
只要指定函式名稱，叫他用這個函式即可
ex: key=len、key=str.upper、key=data.get...

而且不是我們要立刻呼叫的，所以不用寫括號()、也不能帶參數
ex:key=data.get()、key=data.get(0)、key=data.get("Alice")...
已經外包出去了，他怎麼用與我們無關

解法四：zip解法+Counter

前一篇文章提及三種解法

到目前為止，都是我們手動+1或者用count來計算次數
最後我們引入collections.counter來幫我們自動數數

補充：collections.Counter 操作

先來看看 collections.Counter 的用法
索引取值（陣列符號取值）與get()的差異，在於當查不到 key 時的行為不同

## ==== collections.Counter 用法研究 ====

from collections import Counter

# 範例一：字串
c = Counter("AAGCAACC")
print(c)  # Counter({'A': 4, 'C': 3, 'G': 1}) 依出現次數高到低排序

print(
    c["A"]
)  # 4 -> 等價寫法 print(c.__getitem__("A")) 👉索引取值`[]`的寫法就是__getitem__函式語法糖
print(c.get("A"))  # 4

print(c["T"])  # 0 -> 查不到 "T"
print(c.get("T"))  # None -> 查不到 "T"⚠️小心出現None！

# 取出現次數最多的前n項、以及出現次數
print(c.most_common(1))  # [('A', 4)]
print(c.most_common(2))  # [('A', 4), ('C', 3)]


# 範例二：list串列
c = Counter(["", " ", "a", "a", "b", "c", "c", "c"])
print(c)  # Counter({'c': 3, 'a': 2, '': 1, ' ': 1, 'b': 1})
print(sum(c.values()))  # 總元素數: 8


# 範例三：dict字典（Counter本質上就是dict）
# 倒推還原，由統計資料導回原本的元素內容
c = Counter({"A": 4, "C": 3, "G": 1})  # dict只能帶入int數字
print(c)  # Counter({'A': 4, 'C': 3, 'G': 1})

print(list(c.elements()))  # ['A', 'A', 'A', 'A', 'C', 'C', 'C', 'G']


# 範例四：集合運算
c = Counter({"A": 4, "C": 3, "G": 1})
d = Counter({"A": 1, "C": 10})

print(c + d) # Counter({'C': 13, 'A': 5, 'G': 1}) 加法
print(c - d) # Counter({'A': 3, 'G': 1}) 減法。負值、零值會被過濾掉
print(c & d) # Counter({'C': 3, 'A': 1}) 交集（取min）
print(c | d) # Counter({'C': 10, 'A': 4, 'G': 1}) 聯集（取max）

c.subtract(d) # 原地減法(in-place subtraction)，會保留負值、零值
print(c) # Counter({'A': 3, 'G': 1, 'C': -7})

順帶一提，補充一下set、dict能做的運算，來做個對比

補充：set、dict能做的運算

## ==== set、dict 能做的運算 ====

# set 可以做交集、聯集、差集
c = {"A", "C", "G"}
d = {"A", "C", "G", "T"}
print(c & d) # {'C', 'G', 'A'} （交集and運算）
print(c | d) # {'C', 'T', 'G', 'A'}（聯集or運算）
print(c - d) # set() 被減光了（差集diff運算）
print(c ^ d) # {'T'}（異或xor運算）


# dict 只能做 合併字典
c = {"A": 4, "C": 3, "G": 1}
d = {"A": 1, "C": 10}

print(c | d) # {'A': 1, 'C': 10, 'G': 1}
# dict union 有相同元素時，右邊會覆蓋左邊

e = {**c, **d} # 合併字典（倒出元素到新dict）
print(e) # {'A': 1, 'C': 10, 'G': 1}

c.update(d) # 合併（更新）字典
print(c) # {'A': 1, 'C': 10, 'G': 1}

最終完整程式碼：

from collections import Counter

def parse_fasta(data: str) -> dict:
    sequences = {}
    label = ""
    for line in data.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            label = line[1:].strip()
            if not label:
                label = ""
                continue
            sequences[label] = ""
        elif label:
            sequences[label] += line

    return sequences


def build_profile_and_consensus(sequences: list[str]):
    profile = {nt: [] for nt in "ACGT"}  # => {'A': [], 'C': [], 'G': [], 'T': []}
    consensus = ""

    for column in zip(*sequences):
        nt_counts = Counter(column)

        # print(nt_counts)
        # ex: Counter({'A': 5, 'G': 1, 'T': 1})
        # Counter只記錄有出現的nt，像上面就漏掉了'C'

        for nt in "ACGT":
            profile[nt].append(nt_counts.get(nt, 0)) # 沒出現的base記得要補0，否則會出現None

        most_common = nt_counts.most_common(1) # 取最常出現的前1項1個
        consensus += most_common[0][0] # 只取字母（tuple第0項的第0個元素）
        # 或者 傳遞自訂函數lambda，補0防止key不存在時回傳None
        # consensus += max("ACGT", key=lambda x: nt_counts.get(x, 0))
        # 或者 用__getitem__
        # consensus += max("ACGT", key=nt_counts.__getitem__)

    return consensus, profile


# 測試資料
data = """
>Rosalind_1
ATCCAGCT
>Rosalind_2
GGGCAACT
>Rosalind_3
ATGGATCT
>Rosalind_4
AAGCAACC
>Rosalind_5
TTGGAACT
>Rosalind_6
ATGCCATT
>Rosalind_7
ATGGCACT
"""

sequences = list(parse_fasta(data).values())
consensus, profile = build_profile_and_consensus(sequences)

print(consensus)
for nt in "ACGT":
    print(f"{nt}: {' '.join(str(x) for x in profile[nt])}")