iT邦幫忙

2025 iThome 鐵人賽

DAY 25
0
AI & Data

Rosalind 生物資訊解題系統系列 第 25

Day25 | Rosalind 生資解題 - 014. LCSM(Finding a Shared Motif)+跳出迴圈三種辦法

  • 分享至 

  • xImage
  •  

Day25 | Rosalind 生資解題 - 014. LCSM(Finding a Shared Motif)+跳出迴圈三種辦法

題目連結:https://rosalind.info/problems/lcsm/

https://ithelp.ithome.com.tw/upload/images/20251007/20125192GfNXn1nwF4.png

輸入:

>Rosalind_1
GATTACA
>Rosalind_2
TAGACCA
>Rosalind_3
ATACA

輸出:

AC

以這題來說,答案輸出AC或者TACA均可
因為這幾個長度相等(長度都為2),都同時是最長子字串

今天這題難囉!

https://ithelp.ithome.com.tw/upload/images/20251007/20125192OK8MuHtedF.png

展開絲路之旅

解法一

絲路如下:
既然找最長子字串,那一定是每個序列中都要具有的字串
所以直接拿「最短序列」來當模板做比對!

再來,怎麼比對呢?
假設已有的最短序列是AACT
我們讓「候選模板」自動跑過各一輪排列組合:
長度1:A、A、C、T
長度2:AA、AC、CT
長度3:AAC、ACT
長度4:AACT
遍歷排列組合用兩指針(變數i、j)來做巢狀迴圈

data = """
>Rosalind_1
GATTACA
>Rosalind_2
TAGACCA
>Rosalind_3
ATACA
"""


def parse_fasta(data: str) -> dict:
    sequences = {}
    label = ""
    for line in data.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            label = line[1:].strip()
            if not label:
                label = ""
                continue
            sequences[label] = ""
        elif label:
            sequences[label] += line

    return sequences


seqs = parse_fasta(data)
s_seq = min(seqs.values(), key=len)  # 找出最短字串作為模板

best = ""  # longest common substring
for i in range(len(s_seq)):
    for j in range(i+1, len(s_seq)+1):
        template = s_seq[i:j]
        is_common = True

        for seq in seqs.values():
            if template in seq:
                continue
            else:
                is_common = False
        if is_common:
            if len(best) < len(template):
                best = template
        # 以上比對,也可簡化成all()寫法:
        # if all(template in seq for seq in seqs.values()):
        #     if len(template) > len(best):
        #         best = template

print(best)

但是這樣子寫效率很慢
測資的序列有夠長
電腦跑了整整1分半...我都還以為當機了

解法二

延續絲路一,繼續改進強化
比對時改為由長比對到短,當比對成功即可返回
這樣一找到,就可以結束程式了!

當然也可以反向,由短比對到長,當一不符合就結束程式
可是我們一開始的 解法一 並不是按照長短順序的,但這裡就不補充寫法了

怎麼做?
先計算出長度L,再用i紀錄要從哪個字元開始數L個字

exit跳出迴圈

data = """
>Rosalind_1
GATTACA
>Rosalind_2
TAGACCA
>Rosalind_3
ATACA
"""


def parse_fasta(data: str) -> dict:
    sequences = {}
    label = ""
    for line in data.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            label = line[1:].strip()
            if not label:
                label = ""
                continue
            sequences[label] = ""
        elif label:
            sequences[label] += line

    return sequences


seqs = parse_fasta(data)
s_seq = min(seqs.values(), key=len)  # 找出最短字串作為模板

best = ""  # longest common substring
n = len(s_seq)

# 由長找到短
for L in range(n, 0, -1):
    for i in range(n-L+1):
        candidate = s_seq[i:i+L]
        if all(candidate in seq for seq in seqs.values()):
            best = candidate
            print(best)
            exit(0) # 因為python沒有break to label,所以直接終止。但寫法不好

flag + break 跳出迴圈

這個寫法比較pythonic,看得時候也更直觀

data = """
>Rosalind_1
GATTACA
>Rosalind_2
TAGACCA
>Rosalind_3
ATACA
"""


def parse_fasta(data: str) -> dict:
    sequences = {}
    label = ""
    for line in data.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            label = line[1:].strip()
            if not label:
                label = ""
                continue
            sequences[label] = ""
        elif label:
            sequences[label] += line

    return sequences


seqs = parse_fasta(data)
s_seq = min(seqs.values(), key=len)  # 找出最短字串作為模板

best = ""  # longest common substring
n = len(s_seq)

# 由長找到短
found = False
for L in range(n, 0, -1):
    for i in range(n-L+1):
        candidate = s_seq[i:i+L]
        if all(candidate in seq for seq in seqs.values()):
            best = candidate
            found = True
            break
    if found:
        break

print(best)

func + return 跳出迴圈

def parse_fasta(data: str) -> dict:
    sequences = {}
    label = ""
    for line in data.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            label = line[1:].strip()
            if not label:
                label = ""
                continue
            sequences[label] = ""
        elif label:
            sequences[label] += line

    return sequences

def longest_common_substring(seqs: dict) -> str:
    s_seq = min(seqs.values(), key=len)
    n = len(s_seq)

    for L in range(n, 0, -1):
        for i in range(n - L + 1):
            candidate = s_seq[i:i+L]
            if all(candidate in seq for seq in seqs.values()):
                return candidate

data = """
>Rosalind_1
GATTACA
>Rosalind_2
TAGACCA
>Rosalind_3
ATACA
"""

seqs = parse_fasta(data)
print(longest_common_substring(seqs))


這種由長到短的比對方式,實測大約只需跑5秒鐘
可大幅縮短程式碼運行時間!

那,還有沒有更好的解法呢?

一定有的,這種刁難的題目
只是絲路不容易...駱駝還在路上

解法三、解法四

(待續


上一篇
Day24 | Rosalind 生資解題 - 013. IEV(Calculating Expected Offspring) + numpy dtype
下一篇
Day26 | Rosalind 生資解題 - 014. LCSM(Finding a Shared Motif)最長子字串比對 - 下篇
系列文
Rosalind 生物資訊解題系統28
圖片
  熱門推薦
圖片
{{ item.channelVendor }} | {{ item.webinarstarted }} |
{{ formatDate(item.duration) }}
直播中

尚未有邦友留言

立即登入留言