iT邦幫忙

2025 iThome 鐵人賽

DAY 23
0
AI & Data

Rosalind 生物資訊解題系統系列 第 23

Day23 | Rosalind 生資解題 - 012. GRPH(Overlap Graphs) - 下篇

  • 分享至 

  • xImage
  •  

Day23 | Rosalind 生資解題 - 012. GRPH(Overlap Graphs) - 下篇

將昨天的 解法一 改成function版本
加入可以適配邊長的功能,並自動帶入k=3
這樣未來也手動可以改成k=4、k=5

from typing import List, Tuple


def parse_fasta(data: str) -> dict:
    sequences = {}
    label = ""
    for line in data.strip().splitlines():
        line = line.strip()  # 縮排防呆
        if not line:  # 跳過空行
            continue
        if line.startswith(">"):
            label = line[1:].strip()  # 去除label頭尾空白
            if not label:  # 跳過空標籤(不合法)
                label = ""
                continue
            sequences[label] = ""
        elif label:
            sequences[label] += line

    return sequences


def find_overlap_edges(data: str, k: int = 3) -> List[Tuple[str, str]]:
    seqs = parse_fasta(data)
    edges: List[Tuple[str, str]] = []
    for s1 in seqs:
        for s2 in seqs:
            if s1 == s2:
                continue
            if seqs[s1][-k:] == seqs[s2][:k]:
                edges.append((s1, s2))
    return edges


data = """
>Rosalind_0498
AAATAAA
>Rosalind_2391
AAATTTT
>Rosalind_2323
TTTTCCC
>Rosalind_0442
AAATCCC
>Rosalind_5013
GGGTGGG
"""

for u, v in find_overlap_edges(data, k=3):
    print(u, v)

這題除了暴力解以外,有沒有更快的解法呢?

還有的,思路是「建立索引」,我們替每一個序列的頭跟尾都建立起索引
想像一下O3的情景
AAA、CCC、GGG、TTT等等,當長度為3時,不管再多的序列,全部最多也只會有4^3=64種可能性
我們先替這64種可能性建立好索引,以備未來使用
以dict的key來存放每一種索引(O3序列),以dict的value來存放對應的序列label,因為有多個所以用list

解法二:雙邊索引法(Prefix/Suffix Map)

from typing import List, Tuple


def parse_fasta(data: str) -> dict:
    sequences = {}
    label = ""
    for line in data.strip().splitlines():
        line = line.strip()  # 縮排防呆
        if not line:  # 跳過空行
            continue
        if line.startswith(">"):
            label = line[1:].strip()  # 去除label頭尾空白
            if not label:  # 跳過空標籤(不合法)
                label = ""
                continue
            sequences[label] = ""
        elif label:
            sequences[label] += line

    return sequences


def find_overlap_edges(data: str, k: int = 3) -> List[Tuple[str, str]]:
    seqs = parse_fasta(data)
    edges: List[Tuple[str, str]] = []
    for s1 in seqs:
        for s2 in seqs:
            if s1 == s2:
                continue
            if seqs[s1][-k:] == seqs[s2][:k]:
                edges.append((s1, s2))
    return edges


data = """
>Rosalind_0498
AAATAAA
>Rosalind_2391
AAATTTT
>Rosalind_2323
TTTTCCC
>Rosalind_0442
AAATCCC
>Rosalind_5013
GGGTGGG
"""

for u, v in find_overlap_edges(data, k=3):
    print(u, v)

解法三:單邊索引法

上一個方法中,我們各別為頭、尾都建立了索引,總共建立了兩次
但實際上呢,只需要建立一次(頭部、或者尾部擇一)即可

這樣子,我們就可以省去建立一張表的步驟

from collections import defaultdict
from typing import Tuple, Iterator

def parse_fasta(data: str) -> dict:
    sequences = {}
    label = ""
    for line in data.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            label = line[1:].strip()
            if not label:
                label = ""
                continue
            sequences[label] = ""
        elif label:
            sequences[label] += line

    return sequences


def find_overlap_edges(data: str, k: int = 3) -> Iterator[Tuple[str, str]]:
    seqs = parse_fasta(data)
    prefix = defaultdict(list) # 建立prefix index
    for label, seq in seqs.items():
        if len(seq) >= k:
            prefix[seq[:k]].append(label)


    for src, seq in seqs.items():
        suffix = seq[-k:]
        for dst in prefix.get(suffix, ()):
            if dst != src:  # 剔除掉自我配對(self-loop)
                yield src, dst
    ## 上面 yield 寫法,等價以下寫法
    # edges: list[tuple[str, str]] = []
    # for src, seq in seqs.items():
    #     suffix = seq[-k:]
    #     for dst in prefix.get(suffix, ()):
    #         if src != dst:  # 剔除掉自我配對(self-loop)
    #             edges.append((src, dst))
    # return edges

data = """
>Rosalind_0498
AAATAAA
>Rosalind_2391
AAATTTT
>Rosalind_2323
TTTTCCC
>Rosalind_0442
AAATCCC
>Rosalind_5013
GGGTGGG
"""

edges = sorted(find_overlap_edges(data, 3))
for u, v in edges:
    print(u, v)

複雜度比對

將三種不同解法的複雜度比對一下

方法 時間複雜度 空間
暴力破解(兩兩比對) (O(n^2,k)) (O(1))
雙邊索引(key 交集) (O(nk + M))(常數較大) (O(n))
單邊索引(建索引prefix索引,拿suffix 比對) (O(nk + M)) (O(n))

單邊索引與雙邊索引使用分箱法、分桶法(bucketing/binning)
把每條序列的開頭/結尾k個字(k-mer)分箱,再在同箱內配對

索引作法把問題從暴力破解(兩兩比對)的Θ(n²)降低為「先分桶再配對」的 O(nk+M)
但若k太小(邊太短 ex:k=1)、桶子太小時,M可能爆成Θ(n²)

https://ithelp.ithome.com.tw/upload/images/20251005/20125192OUDLZhyuRg.png


上一篇
Day22 | Rosalind 生資解題 - 012. GRPH(Overlap Graphs) - 上篇
下一篇
Day24 | Rosalind 生資解題 - 013. IEV(Calculating Expected Offspring) + numpy dtype
系列文
Rosalind 生物資訊解題系統24
圖片
  熱門推薦
圖片
{{ item.channelVendor }} | {{ item.webinarstarted }} |
{{ formatDate(item.duration) }}
直播中

尚未有邦友留言

立即登入留言