Day16 | Rosalind 生資解題 - 009. SUBS（Finding a Motif in DNA）+KMP演算法 - 上篇

17th鐵人賽

gjlmotea

2025-09-27 23:55:13

58 瀏覽

分享至

Day16 | Rosalind 生資解題 - 009. SUBS（Finding a Motif in DNA）+KMP演算法 - 上篇

題目連結：https://rosalind.info/problems/subs/

motif的中文直接翻譯是「主題」
在生物資訊中，motif中文可稱作模體，意思是在DNA或RNA或蛋白質上，重複出現的片段、小區段，通常是被蛋白質「辨識、結合」的區域。
這些片段不需要完全一模一樣，可以容忍有一些不同

motif這名詞取的真不錯
要比喻的話，我會稱他是「旋律」
像是慶祝生日的這首「生日快樂歌」

每個人的發音不同、樂器的音色不同，甚至有節奏差異
但旋律一聽我們耳熟能詳，能辨識出對方在彈唱的音樂
甚至把「祝你生日快樂」唱成「豬你生ｚ怪了」有一些詞彙上的差異
只要旋律對了，我們還是能聽的懂，不影響認知

或者在金融市場上，看到特定pattern的K線圖、均線排列
代表這支股票正在多頭走勢，就像聽到股票即將起飛的旋律

不過胡扯一通
其實這一題與上面的概念都無關，單純只要尋找「子字串匹配」就可以了
不用什麼模糊匹配、模糊搜尋、模糊辨識

子字串匹配只是其中一種尋找Motif的方法
是最簡單、最基本的搜尋方式，但是通常「不夠準確」

輸入：

GATATATGCATATACTT
ATAT

輸出：

2 4 10

給一段序列GATATATGCATATACTT
尋找並記錄ATAT重複片段的起始點位置

程式碼：

暴力破解，雙層for迴圈處理字元比對

s = "GATATATGCATATACTT"
sub = "ATAT"

positions = []

for i in range(0, len(s) - len(sub) + 1):
    match = True
    for j in range(len(sub)):
        if s[i + j] != sub[j]:
            match = False
            break
    if match:
        positions.append(i + 1)

for m in positions:
    print(m, end=" ")

暴力破解，寫成函式

利用python字串比對的寫法

if s1 == s2:
或者
if s[i:i+len(t)] == t:

但其實底層也是跑一層for迴圈（處理字元比對），效率相同

def find_substring_locations(s, t):
    positions = []
    for i in range(len(s) - len(t) + 1):
        if s[i:i + len(t)] == t:
            positions.append(i + 1)  # 位置從1開始計算位置
    return positions

# 等價的寫法，列表推導式
# def find_substring_locations(s, t):
#     return [
#         i + 1 # 位置從1開始計算位置
#         for i in range(len(s) - len(t) + 1)
#         if s[i:len(t) + i] == t
#     ]

s = "GATATATGCATATACTT"
sub = "ATAT"

print(" ".join(map(str, find_substring_locations(s, sub))))