將昨天的 解法一 改成function版本
加入可以適配邊長的功能,並自動帶入k=3
這樣未來也手動可以改成k=4、k=5
from typing import List, Tuple
def parse_fasta(data: str) -> dict:
sequences = {}
label = ""
for line in data.strip().splitlines():
line = line.strip() # 縮排防呆
if not line: # 跳過空行
continue
if line.startswith(">"):
label = line[1:].strip() # 去除label頭尾空白
if not label: # 跳過空標籤(不合法)
label = ""
continue
sequences[label] = ""
elif label:
sequences[label] += line
return sequences
def find_overlap_edges(data: str, k: int = 3) -> List[Tuple[str, str]]:
seqs = parse_fasta(data)
edges: List[Tuple[str, str]] = []
for s1 in seqs:
for s2 in seqs:
if s1 == s2:
continue
if seqs[s1][-k:] == seqs[s2][:k]:
edges.append((s1, s2))
return edges
data = """
>Rosalind_0498
AAATAAA
>Rosalind_2391
AAATTTT
>Rosalind_2323
TTTTCCC
>Rosalind_0442
AAATCCC
>Rosalind_5013
GGGTGGG
"""
for u, v in find_overlap_edges(data, k=3):
print(u, v)
這題除了暴力解以外,有沒有更快的解法呢?
還有的,思路是「建立索引」,我們替每一個序列的頭跟尾都建立起索引
想像一下O3的情景
AAA、CCC、GGG、TTT等等,當長度為3時,不管再多的序列,全部最多也只會有4^3=64種可能性
我們先替這64種可能性建立好索引,以備未來使用
以dict的key來存放每一種索引(O3序列),以dict的value來存放對應的序列label,因為有多個所以用list
from typing import List, Tuple
def parse_fasta(data: str) -> dict:
sequences = {}
label = ""
for line in data.strip().splitlines():
line = line.strip() # 縮排防呆
if not line: # 跳過空行
continue
if line.startswith(">"):
label = line[1:].strip() # 去除label頭尾空白
if not label: # 跳過空標籤(不合法)
label = ""
continue
sequences[label] = ""
elif label:
sequences[label] += line
return sequences
def find_overlap_edges(data: str, k: int = 3) -> List[Tuple[str, str]]:
seqs = parse_fasta(data)
edges: List[Tuple[str, str]] = []
for s1 in seqs:
for s2 in seqs:
if s1 == s2:
continue
if seqs[s1][-k:] == seqs[s2][:k]:
edges.append((s1, s2))
return edges
data = """
>Rosalind_0498
AAATAAA
>Rosalind_2391
AAATTTT
>Rosalind_2323
TTTTCCC
>Rosalind_0442
AAATCCC
>Rosalind_5013
GGGTGGG
"""
for u, v in find_overlap_edges(data, k=3):
print(u, v)
上一個方法中,我們各別為頭、尾都建立了索引,總共建立了兩次
但實際上呢,只需要建立一次(頭部、或者尾部擇一)即可
這樣子,我們就可以省去建立一張表的步驟
from collections import defaultdict
from typing import Tuple, Iterator
def parse_fasta(data: str) -> dict:
sequences = {}
label = ""
for line in data.strip().splitlines():
line = line.strip()
if not line:
continue
if line.startswith(">"):
label = line[1:].strip()
if not label:
label = ""
continue
sequences[label] = ""
elif label:
sequences[label] += line
return sequences
def find_overlap_edges(data: str, k: int = 3) -> Iterator[Tuple[str, str]]:
seqs = parse_fasta(data)
prefix = defaultdict(list) # 建立prefix index
for label, seq in seqs.items():
if len(seq) >= k:
prefix[seq[:k]].append(label)
for src, seq in seqs.items():
suffix = seq[-k:]
for dst in prefix.get(suffix, ()):
if dst != src: # 剔除掉自我配對(self-loop)
yield src, dst
## 上面 yield 寫法,等價以下寫法
# edges: list[tuple[str, str]] = []
# for src, seq in seqs.items():
# suffix = seq[-k:]
# for dst in prefix.get(suffix, ()):
# if src != dst: # 剔除掉自我配對(self-loop)
# edges.append((src, dst))
# return edges
data = """
>Rosalind_0498
AAATAAA
>Rosalind_2391
AAATTTT
>Rosalind_2323
TTTTCCC
>Rosalind_0442
AAATCCC
>Rosalind_5013
GGGTGGG
"""
edges = sorted(find_overlap_edges(data, 3))
for u, v in edges:
print(u, v)
將三種不同解法的複雜度比對一下
方法 | 時間複雜度 | 空間 |
---|---|---|
暴力破解(兩兩比對) | (O(n^2,k)) | (O(1)) |
雙邊索引(key 交集) | (O(nk + M))(常數較大) | (O(n)) |
單邊索引(建索引prefix索引,拿suffix 比對) | (O(nk + M)) | (O(n)) |
單邊索引與雙邊索引使用分箱法、分桶法(bucketing/binning)
把每條序列的開頭/結尾k個字(k-mer)分箱,再在同箱內配對
索引作法把問題從暴力破解(兩兩比對)的Θ(n²)降低為「先分桶再配對」的 O(nk+M)
但若k太小(邊太短 ex:k=1)、桶子太小時,M可能爆成Θ(n²)