iT邦幫忙

2019 iT 邦幫忙鐵人賽

DAY 8
0

話不多說,今天忙到現在才有空寫,就直接接續昨天的話題,不過今天我就直接貼上這些步驟所需的commands了。

Pre-processing

由於我目前最常碰到的是使用Oxford Nanopore Technology這間公司所開發的奈米孔定序儀所產生的資料,所以處理步驟也都是參考他們公司所提供的技術白皮書來安排的。跟以往NGS時期那些設備所產生的資料最大的不同就是—Nanopore sequencer一次能夠讀出的序列長度幾乎都在4000~6000bp,而以往NGS的設備所產生的序列長度幾乎只有前者的二十分之一。關於Nanopore reads的部份,之後有機會再詳細介紹。

# Statistics of read properties
NanoPlot -t 4 --fastq input.fastq --plots hex dot

# Adapter trimming
porechop -t 4 -i input.fastq -o trimmed.fastq

# Quality trimming and too-short-read filtering
NanoFilt -q 9 -l 200 < trimmed.fastq > cleaned.fastq

Alignment

序列比對這部份,根據目前我看到的文獻,只要是使用Nanopore設備去做細菌16s rRNA基因全長定序(長度約在1200kb上下),那在序列比對的部份就會使用LAST或是BLAST來做alignment。但如果是用LAST,最後產生的會是一個被稱作MAF格式的文字檔,我會使用LAST這個軟體包裡頭的maf-convert這個工具將結果轉換成BLAST輸出的格式。另外,我這邊所使用的16S rRNA基因的序列都是從NCBI上下載的。

# Building DB
lastdb -uYASS -R01 microbialdb 16S.fasta

# Alignment
lastal -P 4 -q 1 -b 1 -Q 0 -a 1 -e 45 microbialdb cleaned.fastq > aligned.maf

# Convert maf to BLAST tab-text
maf-convert blasttab aligned.maf > aligned.txt

Filtering alignment output & building abundance table

在比對完之後,由於不同種的細菌之間(特別是分類上相當接近的種類)有些序列是相當保守的,意思就是在它們的共同祖先演化的過程當中即使演化出不同種的細菌,但當中仍有些基因的序列並未發生改變。這種序列保守的特性造成了我們在進行序列比對的時候有一定的麻煩,所以我們會根據一些條件來做結果的篩選,像是序列之間的一致的程度、比對錯誤的數量及比對上的長度等等。這邊我暫時先放上我最早使用R所寫的script,因為我對julia的掌握還沒有R來的熟悉,加上在R裡頭有相當方便的tidyverse工具包。

# Load required Pkgs
suppressMessages(library(data.table))
suppressMessages(library(tidyverse))


# Set input locations
id_mapping_file <- "id_mapping.txt"
blastOutputs <- "aligned.txt"

# Import inputs
id_mapping <- fread(id_mapping_file, header = FALSE, sep = "\t", stringsAsFactors = FALSE)
setkey(id_mapping, V1)

blastoutput <- fread(blastOutputs, header = FALSE, sep = "\t", stringsAsFactors = FALSE)
blastoutput_filtered <- blastoutput %>% 
                        dplyr::filter(V3>=90 & V4>=700) %>% 
                        group_by(V1) %>% 
                        dplyr::filter(V3 == max(V3)) %>% 
                        dplyr::filter(V12 == max(V12))
allreadsnum <- length(unique(blastoutput$V1))
readStats <- table(blastoutput_filtered$V2)
refseqIds <- names(readStats)
df <- data.frame(RefseqID = refseqIds, 
                Lineage = NA, 
                ReadsNum = NA, 
                OrgName = NA, 
                ReadsPerc = NA)

for (i in 1:length(refseqIds)) {
    refseqid <- refseqIds[i]
    tmp <- id_mapping[.(refseqid)]
    lineage <- tmp$V3
    orgname <- tmp$V4
    readnum <- readStats[refseqid]
    perc <- readnum*100/allreadsnum
    df$Lineage[i] <- lineage
    df$ReadsNum[i] <- readnum
    df$ReadsPerc[i] <- perc
    df$OrgName[i] <- orgname
}

taxonomy <- paste(df$Lineage, df$OrgName, sep="; ")
output_df <- data.frame(Taxonomy = taxonomy, ReadsNumber = df$ReadsNum)
output_df <- aggregate(output_df$ReadsNumber, 
                       by = list(Taxonomy=output_df$Taxonomy), 
                       FUN = sum)
colnames(output_df)[2] <- "ReadsNumber"
ReadPercentage <- output_df$ReadsNumber*100/allreadsnum
output_df <- cbind(output_df, ReadPercentage)
output_df <- output_df[order(output_df$ReadPercentage, decreasing = TRUE),]
output <- paste0("tax_", sub("\\.\\/Alignment\\/", "", file))
write.table(output_df, output, sep = "\t", row.names = FALSE, quote = FALSE)

這裡面需要用到一個id_mapping.txt,這邊就是用我前面幾天提到的BioServices.jl,是爬取NCBI上面的資料產生的結果,看起來像是這個樣子:

NG_041960.1     37372   Bacteria; Proteobacteria; Epsilonproteobacteria; Campylobacterales; Helicobacteraceae; Helicobacter     Helicobacter bilis      16S ribosomal RNA
NG_042070.1     86170   Bacteria; Firmicutes; Clostridia; Clostridiales; Syntrophomonadaceae; Syntrophothermus  Syntrophothermus lipocalidus    16S ribosomal RNA
NG_042071.1     59610   Bacteria; Firmicutes; Clostridia; Clostridiales; Peptococcaceae; Desulfotomaculum       Desulfotomaculum reducens       16S ribosomal RNA
NG_042880.1     215     Bacteria; Proteobacteria; Epsilonproteobacteria; Campylobacterales; Helicobacteraceae; Helicobacter     Helicobacter fennelliae 16S ribosomal RNA
NG_042881.1     215     Bacteria; Proteobacteria; Epsilonproteobacteria; Campylobacterales; Helicobacteraceae; Helicobacter     Helicobacter fennelliae 16S ribosomal RNA
NG_042882.1     287948  Bacteria; Proteobacteria; Epsilonproteobacteria; Campylobacterales; Helicobacteraceae; Helicobacter     Helicobacter mastomyrinus
       16S ribosomal RNA
NG_042883.1     76936   Bacteria; Proteobacteria; Epsilonproteobacteria; Campylobacterales; Helicobacteraceae; Helicobacter     Helicobacter typhlonius 16S ribosomal RNA
NG_042884.1     398626  Bacteria; Proteobacteria; Epsilonproteobacteria; Campylobacterales; Helicobacteraceae; Helicobacter     Helicobacter macacae    16S ribosomal RNA
NR_024570.1     562     Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacterales; Enterobacteriaceae; Escherichia        Escherichia coli        16S ribosomal RNA
NR_024632.1     71388   Bacteria; Proteobacteria; Gammaproteobacteria; Vibrionales; Vibrionaceae; Vibrio        Vibrio halioticoli      16S ribosomal RNA
NR_024633.1     1329    Bacteria; Firmicutes; Bacilli; Lactobacillales; Streptococcaceae; Streptococcus Streptococcus canis     16S ribosomal RNA
NR_024634.1     1340    Bacteria; Firmicutes; Bacilli; Lactobacillales; Streptococcaceae; Streptococcus Streptococcus porcinus  16S ribosomal RNA
NR_024635.1     77524   Bacteria; Proteobacteria; Gammaproteobacteria; Alteromonadales; Colwelliaceae; Colwellia        Colwellia maris 16S ribosomal RNA
NR_024636.1     69367   Bacteria; Actinobacteria; Micrococcales; Microbacteriaceae; Microbacterium      Microbacterium luteolum 16S ribosomal RNA
NR_024637.1     69368   Bacteria; Actinobacteria; Micrococcales; Microbacteriaceae; Microbacterium      Microbacterium saperdae 16S ribosomal RNA
NR_024639.1     1108045 Bacteria; Actinobacteria; Corynebacteriales; Gordoniaceae; Gordonia     Gordonia rhizosphera NBRC 16068 16S ribosomal RNA
NR_024640.1     61645   Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacterales; Enterobacteriaceae; Enterobacter; Enterobacter cloacae complex Enterobacter asburiae   16S ribosomal RNA

今天就先這樣啦~ 圖明早再補上 T_T


上一篇
[Day 8] 從Fastq出發到abundance table (一)
下一篇
[Day 10] 使用Julia在目標物種當中搜尋一段DNA序列
系列文
When Bioinfo met Julia: Bioinformatician的30天Julia學習之路32
圖片
  直播研討會
圖片
{{ item.channelVendor }} {{ item.webinarstarted }} |
{{ formatDate(item.duration) }}
直播中

尚未有邦友留言

立即登入留言