題目連結:https://rosalind.info/problems/lcsm/
輸入:
>Rosalind_1
GATTACA
>Rosalind_2
TAGACCA
>Rosalind_3
ATACA
輸出:
AC
以這題來說,答案輸出AC
或者TA
、CA
均可
因為這幾個長度相等(長度都為2),都同時是最長子字串
今天這題難囉!
展開絲路之旅
絲路如下:
既然找最長子字串,那一定是每個序列中都要具有的字串
所以直接拿「最短序列」來當模板做比對!
再來,怎麼比對呢?
假設已有的最短序列是AACT
我們讓「候選模板」自動跑過各一輪排列組合:
長度1:A、A、C、T
長度2:AA、AC、CT
長度3:AAC、ACT
長度4:AACT
遍歷排列組合用兩指針(變數i、j)來做巢狀迴圈
data = """
>Rosalind_1
GATTACA
>Rosalind_2
TAGACCA
>Rosalind_3
ATACA
"""
def parse_fasta(data: str) -> dict:
sequences = {}
label = ""
for line in data.strip().splitlines():
line = line.strip()
if not line:
continue
if line.startswith(">"):
label = line[1:].strip()
if not label:
label = ""
continue
sequences[label] = ""
elif label:
sequences[label] += line
return sequences
seqs = parse_fasta(data)
s_seq = min(seqs.values(), key=len) # 找出最短字串作為模板
best = "" # longest common substring
for i in range(len(s_seq)):
for j in range(i+1, len(s_seq)+1):
template = s_seq[i:j]
is_common = True
for seq in seqs.values():
if template in seq:
continue
else:
is_common = False
if is_common:
if len(best) < len(template):
best = template
# 以上比對,也可簡化成all()寫法:
# if all(template in seq for seq in seqs.values()):
# if len(template) > len(best):
# best = template
print(best)
但是這樣子寫效率很慢
測資的序列有夠長
電腦跑了整整1分半...我都還以為當機了
延續絲路一,繼續改進強化
比對時改為由長比對到短,當比對成功即可返回
這樣一找到,就可以結束程式了!
當然也可以反向,由短比對到長,當一不符合就結束程式
可是我們一開始的 解法一 並不是按照長短順序的,但這裡就不補充寫法了
怎麼做?
先計算出長度L,再用i紀錄要從哪個字元開始數L個字
data = """
>Rosalind_1
GATTACA
>Rosalind_2
TAGACCA
>Rosalind_3
ATACA
"""
def parse_fasta(data: str) -> dict:
sequences = {}
label = ""
for line in data.strip().splitlines():
line = line.strip()
if not line:
continue
if line.startswith(">"):
label = line[1:].strip()
if not label:
label = ""
continue
sequences[label] = ""
elif label:
sequences[label] += line
return sequences
seqs = parse_fasta(data)
s_seq = min(seqs.values(), key=len) # 找出最短字串作為模板
best = "" # longest common substring
n = len(s_seq)
# 由長找到短
for L in range(n, 0, -1):
for i in range(n-L+1):
candidate = s_seq[i:i+L]
if all(candidate in seq for seq in seqs.values()):
best = candidate
print(best)
exit(0) # 因為python沒有break to label,所以直接終止。但寫法不好
這個寫法比較pythonic,看得時候也更直觀
data = """
>Rosalind_1
GATTACA
>Rosalind_2
TAGACCA
>Rosalind_3
ATACA
"""
def parse_fasta(data: str) -> dict:
sequences = {}
label = ""
for line in data.strip().splitlines():
line = line.strip()
if not line:
continue
if line.startswith(">"):
label = line[1:].strip()
if not label:
label = ""
continue
sequences[label] = ""
elif label:
sequences[label] += line
return sequences
seqs = parse_fasta(data)
s_seq = min(seqs.values(), key=len) # 找出最短字串作為模板
best = "" # longest common substring
n = len(s_seq)
# 由長找到短
found = False
for L in range(n, 0, -1):
for i in range(n-L+1):
candidate = s_seq[i:i+L]
if all(candidate in seq for seq in seqs.values()):
best = candidate
found = True
break
if found:
break
print(best)
def parse_fasta(data: str) -> dict:
sequences = {}
label = ""
for line in data.strip().splitlines():
line = line.strip()
if not line:
continue
if line.startswith(">"):
label = line[1:].strip()
if not label:
label = ""
continue
sequences[label] = ""
elif label:
sequences[label] += line
return sequences
def longest_common_substring(seqs: dict) -> str:
s_seq = min(seqs.values(), key=len)
n = len(s_seq)
for L in range(n, 0, -1):
for i in range(n - L + 1):
candidate = s_seq[i:i+L]
if all(candidate in seq for seq in seqs.values()):
return candidate
data = """
>Rosalind_1
GATTACA
>Rosalind_2
TAGACCA
>Rosalind_3
ATACA
"""
seqs = parse_fasta(data)
print(longest_common_substring(seqs))
這種由長到短的比對方式,實測大約只需跑5秒鐘
可大幅縮短程式碼運行時間!
那,還有沒有更好的解法呢?
一定有的,這種刁難的題目
只是絲路不容易...駱駝還在路上
(待續