話不多說,今天忙到現在才有空寫,就直接接續昨天的話題,不過今天我就直接貼上這些步驟所需的commands了。
由於我目前最常碰到的是使用Oxford Nanopore Technology這間公司所開發的奈米孔定序儀所產生的資料,所以處理步驟也都是參考他們公司所提供的技術白皮書來安排的。跟以往NGS時期那些設備所產生的資料最大的不同就是—Nanopore sequencer一次能夠讀出的序列長度幾乎都在4000~6000bp,而以往NGS的設備所產生的序列長度幾乎只有前者的二十分之一。關於Nanopore reads
的部份,之後有機會再詳細介紹。
# Statistics of read properties
NanoPlot -t 4 --fastq input.fastq --plots hex dot
# Adapter trimming
porechop -t 4 -i input.fastq -o trimmed.fastq
# Quality trimming and too-short-read filtering
NanoFilt -q 9 -l 200 < trimmed.fastq > cleaned.fastq
序列比對這部份,根據目前我看到的文獻,只要是使用Nanopore設備去做細菌16s rRNA基因全長定序(長度約在1200kb上下),那在序列比對的部份就會使用LAST或是BLAST來做alignment。但如果是用LAST
,最後產生的會是一個被稱作MAF
格式的文字檔,我會使用LAST
這個軟體包裡頭的maf-convert
這個工具將結果轉換成BLAST
輸出的格式。另外,我這邊所使用的16S rRNA基因的序列都是從NCBI上下載的。
# Building DB
lastdb -uYASS -R01 microbialdb 16S.fasta
# Alignment
lastal -P 4 -q 1 -b 1 -Q 0 -a 1 -e 45 microbialdb cleaned.fastq > aligned.maf
# Convert maf to BLAST tab-text
maf-convert blasttab aligned.maf > aligned.txt
在比對完之後,由於不同種的細菌之間(特別是分類上相當接近的種類)有些序列是相當保守的,意思就是在它們的共同祖先演化的過程當中即使演化出不同種的細菌,但當中仍有些基因的序列並未發生改變。這種序列保守的特性造成了我們在進行序列比對的時候有一定的麻煩,所以我們會根據一些條件來做結果的篩選,像是序列之間的一致的程度、比對錯誤的數量及比對上的長度等等。這邊我暫時先放上我最早使用R
所寫的script,因為我對julia
的掌握還沒有R
來的熟悉,加上在R
裡頭有相當方便的tidyverse
工具包。
# Load required Pkgs
suppressMessages(library(data.table))
suppressMessages(library(tidyverse))
# Set input locations
id_mapping_file <- "id_mapping.txt"
blastOutputs <- "aligned.txt"
# Import inputs
id_mapping <- fread(id_mapping_file, header = FALSE, sep = "\t", stringsAsFactors = FALSE)
setkey(id_mapping, V1)
blastoutput <- fread(blastOutputs, header = FALSE, sep = "\t", stringsAsFactors = FALSE)
blastoutput_filtered <- blastoutput %>%
dplyr::filter(V3>=90 & V4>=700) %>%
group_by(V1) %>%
dplyr::filter(V3 == max(V3)) %>%
dplyr::filter(V12 == max(V12))
allreadsnum <- length(unique(blastoutput$V1))
readStats <- table(blastoutput_filtered$V2)
refseqIds <- names(readStats)
df <- data.frame(RefseqID = refseqIds,
Lineage = NA,
ReadsNum = NA,
OrgName = NA,
ReadsPerc = NA)
for (i in 1:length(refseqIds)) {
refseqid <- refseqIds[i]
tmp <- id_mapping[.(refseqid)]
lineage <- tmp$V3
orgname <- tmp$V4
readnum <- readStats[refseqid]
perc <- readnum*100/allreadsnum
df$Lineage[i] <- lineage
df$ReadsNum[i] <- readnum
df$ReadsPerc[i] <- perc
df$OrgName[i] <- orgname
}
taxonomy <- paste(df$Lineage, df$OrgName, sep="; ")
output_df <- data.frame(Taxonomy = taxonomy, ReadsNumber = df$ReadsNum)
output_df <- aggregate(output_df$ReadsNumber,
by = list(Taxonomy=output_df$Taxonomy),
FUN = sum)
colnames(output_df)[2] <- "ReadsNumber"
ReadPercentage <- output_df$ReadsNumber*100/allreadsnum
output_df <- cbind(output_df, ReadPercentage)
output_df <- output_df[order(output_df$ReadPercentage, decreasing = TRUE),]
output <- paste0("tax_", sub("\\.\\/Alignment\\/", "", file))
write.table(output_df, output, sep = "\t", row.names = FALSE, quote = FALSE)
這裡面需要用到一個id_mapping.txt
,這邊就是用我前面幾天提到的BioServices.jl
,是爬取NCBI上面的資料產生的結果,看起來像是這個樣子:
NG_041960.1 37372 Bacteria; Proteobacteria; Epsilonproteobacteria; Campylobacterales; Helicobacteraceae; Helicobacter Helicobacter bilis 16S ribosomal RNA
NG_042070.1 86170 Bacteria; Firmicutes; Clostridia; Clostridiales; Syntrophomonadaceae; Syntrophothermus Syntrophothermus lipocalidus 16S ribosomal RNA
NG_042071.1 59610 Bacteria; Firmicutes; Clostridia; Clostridiales; Peptococcaceae; Desulfotomaculum Desulfotomaculum reducens 16S ribosomal RNA
NG_042880.1 215 Bacteria; Proteobacteria; Epsilonproteobacteria; Campylobacterales; Helicobacteraceae; Helicobacter Helicobacter fennelliae 16S ribosomal RNA
NG_042881.1 215 Bacteria; Proteobacteria; Epsilonproteobacteria; Campylobacterales; Helicobacteraceae; Helicobacter Helicobacter fennelliae 16S ribosomal RNA
NG_042882.1 287948 Bacteria; Proteobacteria; Epsilonproteobacteria; Campylobacterales; Helicobacteraceae; Helicobacter Helicobacter mastomyrinus
16S ribosomal RNA
NG_042883.1 76936 Bacteria; Proteobacteria; Epsilonproteobacteria; Campylobacterales; Helicobacteraceae; Helicobacter Helicobacter typhlonius 16S ribosomal RNA
NG_042884.1 398626 Bacteria; Proteobacteria; Epsilonproteobacteria; Campylobacterales; Helicobacteraceae; Helicobacter Helicobacter macacae 16S ribosomal RNA
NR_024570.1 562 Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacterales; Enterobacteriaceae; Escherichia Escherichia coli 16S ribosomal RNA
NR_024632.1 71388 Bacteria; Proteobacteria; Gammaproteobacteria; Vibrionales; Vibrionaceae; Vibrio Vibrio halioticoli 16S ribosomal RNA
NR_024633.1 1329 Bacteria; Firmicutes; Bacilli; Lactobacillales; Streptococcaceae; Streptococcus Streptococcus canis 16S ribosomal RNA
NR_024634.1 1340 Bacteria; Firmicutes; Bacilli; Lactobacillales; Streptococcaceae; Streptococcus Streptococcus porcinus 16S ribosomal RNA
NR_024635.1 77524 Bacteria; Proteobacteria; Gammaproteobacteria; Alteromonadales; Colwelliaceae; Colwellia Colwellia maris 16S ribosomal RNA
NR_024636.1 69367 Bacteria; Actinobacteria; Micrococcales; Microbacteriaceae; Microbacterium Microbacterium luteolum 16S ribosomal RNA
NR_024637.1 69368 Bacteria; Actinobacteria; Micrococcales; Microbacteriaceae; Microbacterium Microbacterium saperdae 16S ribosomal RNA
NR_024639.1 1108045 Bacteria; Actinobacteria; Corynebacteriales; Gordoniaceae; Gordonia Gordonia rhizosphera NBRC 16068 16S ribosomal RNA
NR_024640.1 61645 Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacterales; Enterobacteriaceae; Enterobacter; Enterobacter cloacae complex Enterobacter asburiae 16S ribosomal RNA
今天就先這樣啦~ 圖明早再補上 T_T