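The previous post, "Our Genomic Era: AI, Data, and Bioinformatics, Day 24: Analyzing Genomic Data with tidyverse Concepts: plyranges", covered data processing built on tidyverse ideas, arguably the mainstream style in R over the past five years. Michael Lawrence, who created the IRanges/GRanges infrastructure, also helped build the plyranges package on tidyverse principles. In this post we keep combining the earlier methods with plyranges and turn to gene annotation data.
Earlier we introduced the AnnotationDbi interface for accessing annotation packages: functions such as columns and keytypes show what a package contains and which kinds of keys can be used to look records up. Here we introduce a way to reach much larger collections of data: AnnotationHub. At the time of writing it offers at least 57,231 datasets from 52 organizations or projects, including UCSC, Ensembl, NIH, EMBL-EBI, NCBI, the ENCODE Project, the Broad Institute, and Stanford. The data cover 2,643 species (humans, bacteria, fungi, parasites, and more) and come in 28 data formats, from plain lists and data.frames to the GRanges and TxDb objects we covered over the past few days, and even sequence data (TwoBit). It is a very rich resource.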
## load library
library(AnnotationHub)
## construct annotationHub
ah <- AnnotationHub()
## see the elements and resources in the annotationHub
length(ah)
snapshotDate(ah)
mcols(ah)
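The counts quoted above (datasets, providers, species, data formats) depend on the hub snapshot, so it can be useful to recompute them yourself. Below is a minimal sketch, assuming the metadata table returned by mcols(ah) has the columns dataprovider, species, and rdataclass; the helper name summarise_hub_metadata is made up for this post and is not part of AnnotationHub.

```r
# Hypothetical helper (not part of AnnotationHub): count resources and the
# distinct providers, species, and R data classes in a hub metadata table.
# `meta` is any data.frame-like table with the columns dataprovider,
# species, and rdataclass, for example mcols(ah).
summarise_hub_metadata <- function(meta) {
  # count distinct non-missing values
  n_distinct <- function(x) length(unique(x[!is.na(x)]))
  c(
    resources   = nrow(meta),
    providers   = n_distinct(meta$dataprovider),
    species     = n_distinct(meta$species),
    dataclasses = n_distinct(meta$rdataclass)
  )
}

# Usage against the live hub (needs network access on first run):
# library(AnnotationHub)
# ah <- AnnotationHub()
# summarise_hub_metadata(mcols(ah))
```

The numbers you get will most likely differ from the ones in this post, since the hub keeps growing between snapshots.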
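We can use mcols to view details of these resources, such as title, dataprovider, species, taxonomyid, genome, description, maintainer, rdatadateadded, preparerclass, and so on. (The %>% pipe in the transcript below assumes a package that provides it, such as magrittr or dplyr, is loaded.)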
> ah %>% mcols() %>% colnames
[1] "title" "dataprovider" "species" "taxonomyid" "genome"
[6] "description" "coordinate_1_based" "maintainer" "rdatadateadded" "preparerclass"
[11] "tags" "rdataclass" "rdatapath" "sourceurl" "sourcetype"
Next, we can use query to find resources whose metadata matches every one of the given search strings, and then browse what came back.
ahs.gtf.resources <- query(ah, c("Homo sapiens", "GRCh38", "gtf"))
ahs.2bit.resources <- query(ah, c("Homo sapiens", "GRCh38", "2bit"))
ahs.gtf.resources.df <- data.frame(ah_id = ahs.gtf.resources$ah_id,
                                   title = ahs.gtf.resources$title,
                                   provider = ahs.gtf.resources$dataprovider,
                                   genome = ahs.gtf.resources$genome,
                                   date = ahs.gtf.resources$rdatadateadded,
                                   dataType = ahs.gtf.resources$rdataclass)
ahs.2bit.resources.df <- data.frame(ah_id = ahs.2bit.resources$ah_id,
                                    title = ahs.2bit.resources$title,
                                    provider = ahs.2bit.resources$dataprovider,
                                    genome = ahs.2bit.resources$genome,
                                    date = ahs.2bit.resources$rdatadateadded,
                                    dataType = ahs.2bit.resources$rdataclass)
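The two data.frame calls above differ only in which query result they read from, so they could be folded into one small function. This is a sketch under the assumption that the object passed in supports `$` access to the same metadata fields used above; hub_summary is a name invented here, not an AnnotationHub function.

```r
# Hypothetical helper: build the compact summary table used in this post
# from any object that exposes the metadata fields through `$`
# (as the results of query() do in the code above).
hub_summary <- function(hub) {
  data.frame(
    ah_id    = hub$ah_id,
    title    = hub$title,
    provider = hub$dataprovider,
    genome   = hub$genome,
    date     = hub$rdatadateadded,
    dataType = hub$rdataclass,
    stringsAsFactors = FALSE
  )
}

# Usage with the query results from this post:
# ahs.gtf.resources.df  <- hub_summary(ahs.gtf.resources)
# ahs.2bit.resources.df <- hub_summary(ahs.2bit.resources)
```

Keeping the column selection in one place means that adding a field, for example species, only needs one change.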
> ahs.2bit.resources.df %>% head
ah_id title provider genome date dataType
1 AH49722 Homo_sapiens.GRCh38.cdna.all.2bit Ensembl GRCh38 2015-12-28 TwoBitFile
2 AH49723 Homo_sapiens.GRCh38.dna.primary_assembly.2bit Ensembl GRCh38 2015-12-28 TwoBitFile
3 AH49724 Homo_sapiens.GRCh38.dna_rm.primary_assembly.2bit Ensembl GRCh38 2015-12-28 TwoBitFile
4 AH49725 Homo_sapiens.GRCh38.dna_sm.primary_assembly.2bit Ensembl GRCh38 2015-12-28 TwoBitFile
5 AH49726 Homo_sapiens.GRCh38.ncrna.2bit Ensembl GRCh38 2015-12-28 TwoBitFile
6 AH50067 Homo_sapiens.GRCh38.cdna.all.2bit Ensembl GRCh38 2015-12-29 TwoBitFile
Finally, we can download a resource by its ah_id. The syntax is simple: for example, to download the dataset titled Homo_sapiens.GRCh38.cdna.all.2bit, we use its ah_id, AH49722, as follows:
Homo.sapiens.GRCh38.cdna.2bit <- ahs.2bit.resources[["AH49722"]]
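Note that double brackets (`[[`) are what trigger the download (or a read from the local cache), while single brackets (`[`) only subset the hub's metadata. Because an ID copied from an older post may not be present in the snapshot you are using, a small guard around the download can give a clearer error. The sketch below assumes only that the object supports names() and `[[`; fetch_resource is a hypothetical name, not part of AnnotationHub.

```r
# Hypothetical helper: fetch one resource by ID, stopping early with a
# readable message when the ID is not among names(hub).
fetch_resource <- function(hub, id) {
  if (!id %in% names(hub)) {
    stop("Resource ", id, " not found in this hub or subset", call. = FALSE)
  }
  # For an AnnotationHub object, `[[` downloads the resource
  # or loads it from the local cache.
  hub[[id]]
}

# Usage with the query result from this post:
# ahs.2bit.resources["AH49722"]    # metadata only, no download
# Homo.sapiens.GRCh38.cdna.2bit <- fetch_resource(ahs.2bit.resources, "AH49722")
```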
Further reading:
Pagès H, Carlson M, Falcon S, Li N (2021). AnnotationDbi: Manipulation of SQLite-based annotations in Bioconductor. R package version 1.54.1, https://bioconductor.org/packages/AnnotationDbi.
Morgan M, Pagès H, Obenchain V, Hayden N (2021). Rsamtools: Binary alignment (BAM), FASTA, variant call (BCF), and tabix file import. R package version 2.8.0, https://bioconductor.org/packages/Rsamtools.
Pagès H (2021). BSgenome: Software infrastructure for efficient representation of full genomes and their SNPs. R package version 1.60.0, https://bioconductor.org/packages/BSgenome.
Morgan M (2019). AnnotationHub: Client to access AnnotationHub resources. R package version 2.18.0.
2017. [Advanced GenomicRanges, RtrackLayer and Rsamtools](https://combine-australia.github.io/2017-05-19-bioconductor-melbourne/AdvGRanges_Rtracklayer_Rsamtools.html)
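The plan for this month is posted in the article "Our Genomic Era: AI, Data, and Bioinformatics: Overview", and I will keep adjusting it. Our Genomic Era (我們的基因體時代) is the blog I run; if you are interested in bioinformatics, laboratory medicine, data visualization, or R, come and exchange ideas!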