R语言之NCBI等数据库的序列批量下载和相关R包的使用
Biomartr包
报错信息
问题1:
download.database.all(db = "taxdb", path = "taxdb")
运行后报错, 这个包检测网络连接情况的connected.to.internet
函数,源码如下:
1 | connected.to.internet <- function() { |
因为是检测google连接情况来判定网络是否联通的,所以国内使用会出问题。 解决办法: _ 修改hosts: _ 修改R包代码,把www.google.com
换成www.bing.com
,从github下载源码,修改后本地安装。
问题2(未解决):
getCollection( db = "refseq", organism = "Drosophila melanogaster")
报错
1 | The FTP link: 'https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/215/GCF_000001215.4_Release_6_plus_ISO1_MT/GCF_000001215.4_Release_6_plus_ISO1_MT_assembly_stats.txt' seems not to be available at the moment. This might either be due to an instable internet connection, a firewall issue, a wrong organism name, or due to the fact that the specified organism is not available in the database you selected. If it is an internet connection issue, could you please try to re-run the function to see whether it works now? |
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/215/GCF_000001215.4_Release_6_plus_ISO1_MT/GCF_000001215.4_Release_6_plus_ISO1_MT_assembly_stats.txt可以在浏览器中正常下载
可能出问题的:custom_download.R,getGenome.R,getBioSet.R,readAssemblyDoc.R,Refseq_Genbank_post_processing.R
简介
为了在元基因组规模上自动检索基因组和它们的注释信息,biomartr
包提供了基因组序列检索和功能注释检索的接口功能。biomartr
的主要目标是促进计算再现性和大规模处理基因组数据(元)基因组分析。有了biomartr
,用户现在可以控制并了解他们自动检索的基因组版本。许多大规模基因组学研究缺乏这些信息。当基因组版本信息的记录被忽视时,再现和数据解释几乎是不可能的。 In detail, biomartr
automates genome, proteome, CDS, RNA, Repeats, GFF/GTF (annotation), genome assembly quality, and metagenome project data retrieval from the major biological databases such as NCBI RefSeq NCBI Genbank ENSEMBL ENSEMBLGENOMES(截止到2019年4月份 - ENSEMBL和ENSEMBLGENOMES已经合并了- see details here) UniProt
安装biomartr包
1 | # On Windows, this won't work - see ?build_github_devtools |
自带教程
1 | # source the biomartr package |
Meta-Genome Retrieval
-
meta.retrieval()
: Perform Meta-Genome Retieval from NCBI of species belonging to the same kingdom of life or to the same taxonomic subgroup -
meta.retrieval.all()
: Perform Meta-Genome Retieval from NCBI of the entire kingdom of life -
getMetaGenomes()
: Retrieve metagenomes from NCBI Genbank -
getMetaGenomeAnnotations()
: Retrieve annotation *.gff files for metagenomes from NCBI Genbank -
listMetaGenomes()
: List available metagenomes on NCBI Genbank -
getMetaGenomeSummary()
: Helper function to retrieve the assembly_summary.txt file from NCBI genbank metagenomes -
clean.retrieval()
: Format meta.retrieval output
Genome Retrieval
-
listGenomes()
: List all genomes available on NCBI and ENSEMBL servers -
listKingdoms()
: list the number of available species per kingdom of life on NCBI and ENSEMBL servers -
listGroups()
: list the number of available species per group on NCBI and ENSEMBL servers -
getKingdoms()
: Retrieve available kingdoms of life -
getGroups()
: Retrieve available groups for a kingdom of life -
is.genome.available()
: Check Genome Availability NCBI and ENSEMBL servers -
getCollection()
: Retrieve a Collection: Genome, Proteome, CDS, RNA, GFF, Repeat Masker, AssemblyStats -
getGenome()
: Download a specific genome stored on NCBI and ENSEMBL servers -
getGenomeSet()
: Genome Retrieval of multiple species -
getProteome()
: Download a specific proteome stored on NCBI and ENSEMBL servers -
getProteomeSet()
: Proteome Retrieval of multiple species -
getCDS()
: Download a specific CDS file (genome) stored on NCBI and ENSEMBL servers -
getCDSSet()
: CDS Retrieval of multiple species -
getRNA()
: Download a specific RNA file stored on NCBI and ENSEMBL servers -
getRNASet()
: RNA Retrieval of multiple species -
getGFF()
: Genome Annotation Retrieval from NCBI (*.gff
) and ENSEMBL (*.gff3
) servers -
getGTF()
: Genome Annotation Retrieval (*.gtf
) from ENSEMBL servers -
getRepeatMasker() :
Repeat Masker TE Annotation Retrieval -
getAssemblyStats()
: Genome Assembly Stats Retrieval from NCBI -
getKingdomAssemblySummary()
: Helper function to retrieve the assembly_summary.txt files from NCBI for all kingdoms -
getMetaGenomeSummary()
: Helper function to retrieve the assembly_summary.txt files from NCBI genbank metagenomes -
getSummaryFile()
: Helper function to retrieve the assembly_summary.txt file from NCBI for a specific kingdom -
getENSEMBLInfo()
: Retrieve ENSEMBL info file -
getGENOMEREPORT()
: Retrieve GENOME_REPORTS file from NCBI
Import Downloaded Files
-
read\_genome()
: Import genomes as Biostrings or data.table object -
read\_proteome()
: Import proteome as Biostrings or data.table object -
read\_cds()
: Import CDS as Biostrings or data.table object -
read\_gff()
: Import GFF file -
read\_rna()
: Import RNA file -
read\_rm()
: Import Repeat Masker output file -
read\_assemblystats()
: Import Genome Assembly Stats File
Database Retrieval
-
listNCBIDatabases()
: Retrieve a List of Available NCBI Databases for Download -
download.database()
: Download a NCBI database to your local hard drive -
download.database.all()
: Download a complete NCBI Database such as e.g.NCBI nr
to your local hard drive
BioMart Queries
-
biomart()
: Main function to query the BioMart database -
getMarts()
: Retrieve All Available BioMart Databases -
getDatasets()
: Retrieve All Available Datasets for a BioMart Database -
getAttributes()
: Retrieve All Available Attributes for a Specific Dataset -
getFilters()
: Retrieve All Available Filters for a Specific Dataset -
organismBM()
: Function for organism specific retrieval of available BioMart marts and datasets -
organismAttributes()
: Function for organism specific retrieval of available BioMart attributes -
organismFilters()
: Function for organism specific retrieval of available BioMart filters
Performing Gene Ontology queries
Gene Ontology
-
getGO()
: Function to retrieve GO terms for a given set of genes
Ensembl BioMart Examples
1 | library(myTAI) |
示例
1 | # retrieve a list of available sequence databases at NCBI |
1 | # download all Drosophila melanogaster genomes,用不了 |
BiomaRt包和biomartr包的区别
BiomaRt软件包与biomartr软件包的主要区别在于,biomartr扩展了BiomaRt的功能注释检索过程,并提供了对基因组、蛋白质组、编码序列、gff文件、RNA序列、Repeat Masker注释文件的有用检索功能,以及对NCBI nr等整个数据库的检索功能。
BiomaRt包
报错信息
问题1:
1 | Error in `collect()`: ! Failed to collect lazy table. |
由于dplyr和Biocfilecache之间的不兼容导致的错误问题,而Biomart是无辜的旁观者! 解决办法:更新你的 BioManager 版本
问题2:
1 | 错误: Your query has been redirected to http://status.ensembl.org indicating this Ensembl service is currently unavailable. |
浏览器打不开该网站,等服务器好了就行了
简介
biomaRt工具包是一个连接bioMart数据库的R语言接口,能够通过这个软件包自由地链接到bioMart数据库: 1. 查找某个基因在染色体上的位置。反之,给定染色体每一区间,返回该区间的基因s; 2. 通过EntrezGene的ID查找到相关序列的GO注释。反之,给定相关的GO注释,获取相关的EntrezGene的ID; 3. 通过EntrezGene的ID查找到相关序列的上游100bp序列(可能包含启动子等调控元件); 4. 查找人类染色体上每一段区域中已知的SNPs; 5. 给定一组的序列ID,获得其中具体的序列;
1 | library(biomaRt) |
选择数据库和对应的服务
1 | # 线虫数据库WormBase ParaSite host="parasite.wormbase.org" |
在ensembl.org网站上获取所有Ensembl mart的列表
1 | listEnsembl(mirror = "asia") |
listMarts, listDatasets and useMart for the Ensembl mirrors You can connect to the following Ensembl mirrors using the listMarts, listDatasets and useMart functions:
-
Ensembl US West: https://uswest.ensembl.org/index.html
-
Ensembl US East: https://useast.ensembl.org/index.html
-
Ensembl Asia: https://asia.ensembl.org/index.html
1 | listMarts(host="https://asia.ensembl.org") |
“listDatasets
” function will give you the list of all the species available (mart datasets) for a given mart getBM
允许使用mart过滤器和属性列表(filters和attributes)构建BioMart查询。
示例1:给定 EntrezGene,获取其5’ UTR序列
1 | entrez <- c("39454","50253","32871") |
示例2:获取Ensembl Gene, Transcript IDs and Uniprot Swissprot,其映射到果蝇Ensembl Gene ID “FBgn0038539”
1 | dm_swissprot <- biomaRt::getBM(attributes=c('external_synonym','ensembl_transcript_id','uniprotswissprot','genbank'), |
示例3:获取某一染色体上某一段位置的SNPs
1 | snpmart <- useMart(biomart = "ENSEMBL_MART_SNP", dataset="hsapiens_snp",host = "https://asia.ensembl.org") |
示例4:(出错了)将果蝇基因映射到人的基因上
1 | dm_ensembl <- useMart('ensembl',dataset="dmelanogaster_gene_ensembl",host = "https://asia.ensembl.org") |
R语言批量爬取NCBI基因注释数据
1 | #library(RCurl) |
导入要爬取的基因列表:
1 | Dm_Atg_genes <- read.table('Dm_Atgs.txt',stringsAsFactors = F) |
NCBI对于基因页面的索引方式都是 https://www.ncbi.nlm.nih.gov/gene/Entrze ID 的方式。
gene symbos
转为entrze ID
使用mget()
函数获取
1 | library("org.Dm.eg.db") |
需要将gene symbos
转为entrze ID
,这里使用clusterProfiler
包的bitr
函数进行转换:
1 | # 使用clusterProfiler将gene symbol转为entrze ID |
AnnotationDbi包的mapIds
函数
1 | # 使用AnnotationDbi将gene symbol转为entrze ID |
BiomaRt用起来更加灵活,可以不同版本参考基因组ID转换,不同类型基因ID转换,甚至同基因跨物种同源基因转换
1 | # 使用BiomaRt将gene symbol转为entrze ID |
1 | # 网址数据框: |
使用XML
包的getNodeSet()
函数需要两个参数,一个是根据URL获得的网页XML document
对象,另一个是要定位的节点(xpath
格式)。
Xpath:
//*[@id=“summaryDl”]/dd[1]/text()
1 | # Official Symbol的xpath://*[@id="summaryDl"]/dd[1]/text() |
1 | # 用于Rcurl |
当HTTP请求不包含用户代理字符串时,该网络服务器似乎返回403禁止错误。 默认情况下,RCurl
不传递用户代理。 您可以使用useragent
=参数设置一个。httr2
包比RCurl
更好一些,可以发出HTTP请求(并且它默认设置了一个用户代理字符串)。
1 |
|
rvest爬取数据
1 |
|
rentrez 获取 NCBI 数据库数据
1 | library(rentrez) |
pubmed文献计量
1 | search_year <- function(year, term){ |
核酸序列 fasta 文件下载
1 | # 函数 entrez_fetch 获取条目完整信息,常用于下载数据,如选择 nucleotide 数据库并返回类型为 fasta 时可以下载对应 fasta信息,返回为字符串写入到文件,即完成了核酸序列 fasta 文件下载。 |
下载果蝇atg8a蛋白的数据
1 |
|
参考
-
Drost HG, Paszkowski J. Biomartr: genomic data retrieval with R. Bioinformatics (2017) 33(8): 1216-1217. doi:10.1093/bioinformatics/btw821IF: 5.8 Q1 .
-
R语言批量爬取NCBI基因注释数据 - 简书 (jianshu.com)
-
关于r:Rcurl:如果url存在,则url.exists返回false | 码农家园 (codenong.com)
-
R语言biomaRt工具包学习笔记 - 知乎 (zhihu.com)
-
BiomaRt 包进行基因ID转换_biomart包id转化方法-CSDN博客
-
rentrez 获取 NCBI 数据库数据 - 简书 (jianshu.com)