R语言之NCBI等数据库的序列批量下载和相关R包的使用

Biomartr包

报错信息

问题1:

download.database.all(db = "taxdb", path = "taxdb")​运行后报错, 这个包检测网络连接情况的connected.to.internet​函数,源码如下:

1
2
3
4
5
6
7
8
9
connected.to.internet <- function() {
if (curl::curl_fetch_memory("www.google.com")$status_code == 200) {
return(TRUE)
} else {
message(
"It seems that you are not connected to the internet. A query to www.google.com was not successful. Could you please check?"
)
}
}

因为是检测google连接情况来判定网络是否联通的,所以国内使用会出问题。 解决办法: _ 修改hosts: _ 修改R包代码,把www.google.com​换成www.bing.com​,从github下载源码,修改后本地安装。

问题2(未解决):

getCollection( db = "refseq", organism = "Drosophila melanogaster")​报错

1
The FTP link: 'https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/215/GCF_000001215.4_Release_6_plus_ISO1_MT/GCF_000001215.4_Release_6_plus_ISO1_MT_assembly_stats.txt' seems not to be available at the moment. This might either be due to an instable internet connection, a firewall issue, a wrong organism name, or due to the fact that the specified organism is not available in the database you selected. If it is an internet connection issue, could you please try to re-run the function to see whether it works now?

https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/215/GCF_000001215.4_Release_6_plus_ISO1_MT/GCF_000001215.4_Release_6_plus_ISO1_MT_assembly_stats.txt可以在浏览器中正常下载

可能出问题的:custom_download.R,getGenome.R,getBioSet.R,readAssemblyDoc.R,Refseq_Genbank_post_processing.R

简介

为了在元基因组规模上自动检索基因组和它们的注释信息,biomartr​包提供了基因组序列检索和功能注释检索的接口功能。biomartr​的主要目标是促进计算再现性和大规模处理基因组数据(元)基因组分析。有了biomartr​,用户现在可以控制并了解他们自动检索的基因组版本。许多大规模基因组学研究缺乏这些信息。当基因组版本信息的记录被忽视时,再现和数据解释几乎是不可能的。 In detail, biomartr​ automates genome, proteome, CDS, RNA, Repeats, GFF/GTF (annotation), genome assembly quality, and metagenome project data retrieval from the major biological databases such as NCBI RefSeq NCBI Genbank ENSEMBL ENSEMBLGENOMES(截止到2019年4月份 - ENSEMBL和ENSEMBLGENOMES已经合并了- see details here) UniProt

安装biomartr包

1
2
3
4
5
6
7
8
9
10
11
12
13
# On Windows, this won't work - see ?build_github_devtools
install_github("HajkD/biomartr", build_vignettes = TRUE, dependencies = TRUE)

# When working with Windows, first you need to install the
# R package: rtools -> install.packages("rtools")

# Afterwards you can install devtools -> install.packages("devtools")
# and then you can run:

devtools::install_github("HajkD/biomartr", build_vignettes = TRUE, dependencies = TRUE)

# and then call it from the library
library("biomartr", lib.loc = "C:/Program Files/R/R-3.1.1/library")

自带教程

1
2
3
4
5
# source the biomartr package
library(biomartr)
# look for all tutorials (vignettes) available in the biomartr package
# this will open your web browser
browseVignettes("biomartr")

Meta-Genome Retrieval

  • meta.retrieval()​ : Perform Meta-Genome Retieval from NCBI of species belonging to the same kingdom of life or to the same taxonomic subgroup

  • meta.retrieval.all()​ : Perform Meta-Genome Retieval from NCBI of the entire kingdom of life

  • getMetaGenomes()​ : Retrieve metagenomes from NCBI Genbank

  • getMetaGenomeAnnotations()​ : Retrieve annotation *.gff files for metagenomes from NCBI Genbank

  • listMetaGenomes()​ : List available metagenomes on NCBI Genbank

  • getMetaGenomeSummary()​ : Helper function to retrieve the assembly_summary.txt file from NCBI genbank metagenomes

  • clean.retrieval()​: Format meta.retrieval output

Genome Retrieval

  • listGenomes()​ : List all genomes available on NCBI and ENSEMBL servers

  • listKingdoms()​ : list the number of available species per kingdom of life on NCBI and ENSEMBL servers

  • listGroups()​ : list the number of available species per group on NCBI and ENSEMBL servers

  • getKingdoms()​ : Retrieve available kingdoms of life

  • getGroups()​ : Retrieve available groups for a kingdom of life

  • is.genome.available()​ : Check Genome Availability NCBI and ENSEMBL servers

  • getCollection()​ : Retrieve a Collection: Genome, Proteome, CDS, RNA, GFF, Repeat Masker, AssemblyStats

  • getGenome()​ : Download a specific genome stored on NCBI and ENSEMBL servers

  • getGenomeSet()​ : Genome Retrieval of multiple species

  • getProteome()​ : Download a specific proteome stored on NCBI and ENSEMBL servers

  • getProteomeSet()​ : Proteome Retrieval of multiple species

  • getCDS()​ : Download a specific CDS file (genome) stored on NCBI and ENSEMBL servers

  • getCDSSet()​ : CDS Retrieval of multiple species

  • getRNA()​ : Download a specific RNA file stored on NCBI and ENSEMBL servers

  • getRNASet()​ : RNA Retrieval of multiple species

  • getGFF()​ : Genome Annotation Retrieval from NCBI (*.gff​) and ENSEMBL (*.gff3​) servers

  • getGTF()​ : Genome Annotation Retrieval (*.gtf​) from ENSEMBL servers

  • getRepeatMasker() :​ Repeat Masker TE Annotation Retrieval

  • getAssemblyStats()​ : Genome Assembly Stats Retrieval from NCBI

  • getKingdomAssemblySummary()​ : Helper function to retrieve the assembly_summary.txt files from NCBI for all kingdoms

  • getMetaGenomeSummary()​ : Helper function to retrieve the assembly_summary.txt files from NCBI genbank metagenomes

  • getSummaryFile()​ : Helper function to retrieve the assembly_summary.txt file from NCBI for a specific kingdom

  • getENSEMBLInfo()​ : Retrieve ENSEMBL info file

  • getGENOMEREPORT()​ : Retrieve GENOME_REPORTS file from NCBI

Import Downloaded Files

Database Retrieval

BioMart Queries

  • biomart()​ : Main function to query the BioMart database

  • getMarts()​ : Retrieve All Available BioMart Databases

  • getDatasets()​ : Retrieve All Available Datasets for a BioMart Database

  • getAttributes()​ : Retrieve All Available Attributes for a Specific Dataset

  • getFilters()​ : Retrieve All Available Filters for a Specific Dataset

  • organismBM()​ : Function for organism specific retrieval of available BioMart marts and datasets

  • organismAttributes()​ : Function for organism specific retrieval of available BioMart attributes

  • organismFilters()​ : Function for organism specific retrieval of available BioMart filters

Performing Gene Ontology queries

Gene Ontology

  • getGO()​ : Function to retrieve GO terms for a given set of genes

Ensembl BioMart Examples

1
2
library(myTAI)

示例

1
2
3
4
5
6
7
8
9
10
11
# retrieve a list of available sequence databases at NCBI
listNCBIDatabases(db = "all")
# show all NCBI nr files
listNCBIDatabases(db = "nr")

# download the entire NCBI taxonomy database
download.database.all(db = "taxdb", path = "taxdb")


# download collection for Saccharomyces cerevisiae,用不了
# getCollection( db = "refseq", organism = "Drosophila melanogaster")
1
2
# download all Drosophila melanogaster genomes,用不了
# meta.retrieval(kingdom = "invertebrate", db = "refseq", type = "genome")

BiomaRt包和biomartr包的区别

BiomaRt软件包与biomartr软件包的主要区别在于,biomartr扩展了BiomaRt的功能注释检索过程,并提供了对基因组、蛋白质组、编码序列、gff文件、RNA序列、Repeat Masker注释文件的有用检索功能,以及对NCBI nr等整个数据库的检索功能。

BiomaRt包

报错信息

问题1:

1
2
3
4
5
6
Error in `collect()`: ! Failed to collect lazy table. 
Caused by error in `db_collect()`: ! Arguments in `...` must be used.
Problematic argument: • ..1 = InfDid you misspell an argument name? Backtrace:
1. biomaRt::listEnsembl()
11. dbplyr:::collect.tbl_sql(., Inf)
16. dbplyr::db_collect(x$src$con, sql, n = n, warn_incomplete = warn_incomplete, ...)

由于dplyr和Biocfilecache之间的不兼容导致的错误问题,而Biomart是无辜的旁观者! 解决办法:更新你的 BioManager 版本

问题2:

1
2
错误: Your query has been redirected to http://status.ensembl.org indicating this Ensembl service is currently unavailable.
Look at ?useEnsembl for details on how to try a mirror site.

浏览器打不开该网站,等服务器好了就行了

简介

biomaRt工具包是一个连接bioMart数据库的R语言接口,能够通过这个软件包自由地链接到bioMart数据库: 1. 查找某个基因在染色体上的位置。反之,给定染色体每一区间,返回该区间的基因s; 2. 通过EntrezGene的ID查找到相关序列的GO注释。反之,给定相关的GO注释,获取相关的EntrezGene的ID; 3. 通过EntrezGene的ID查找到相关序列的上游100bp序列(可能包含启动子等调控元件); 4. 查找人类染色体上每一段区域中已知的SNPs; 5. 给定一组的序列ID,获得其中具体的序列;

1
library(biomaRt)

选择数据库和对应的服务

1
# 线虫数据库WormBase ParaSite host="parasite.wormbase.org"

在ensembl.org网站上获取所有Ensembl mart的列表

1
2
3
listEnsembl(mirror = "asia")
#更换版本
listEnsembl("GRCh=37")

listMarts, listDatasets and useMart for the Ensembl mirrors You can connect to the following Ensembl mirrors using the listMarts, listDatasets and useMart functions:

  • Ensembl US West: https://uswest.ensembl.org/index.html

  • Ensembl US East: https://useast.ensembl.org/index.html

  • Ensembl Asia: https://asia.ensembl.org/index.html

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
listMarts(host="https://asia.ensembl.org")
# 选取特定的服务
ensembl_asia <- useMart(biomart="ENSEMBL_MART_ENSEMBL", host = "https://asia.ensembl.org")
# 同ensembl_asia1 <- useMart("ensembl", host="https://asia.ensembl.org")
# 同ensembl_asia2 <- useEnsembl(biomart="ensembl",host = "http://asia.ensembl.org")
# useEnsembl函数专门用于ensembl数据库

#查看可选数据集
head(listDatasets(ensembl_asia))

# 限制为果蝇的基因注释数据集
dm_ensembl <- useDataset(dataset="dmelanogaster_gene_ensembl", mart=ensembl_asia)
# 查看该数据集下的filters
# filter:控制根据什么东西来过滤,可以是不同数据库的ID,也可以是染色体定位系统坐标
filters <- listFilters(dm_ensembl)
# 查看该数据集下的attributes
attrs <- listAttributes(dm_ensembl)
table(attrs$page)

listDatasets​” function will give you the list of all the species available (mart datasets) for a given mart getBM​允许使用mart过滤器和属性列表(filters和attributes)构建BioMart查询。

示例1:给定 EntrezGene,获取其5’ UTR序列

1
2
3
4
5
6
7
8
9
10
entrez <- c("39454","50253","32871")
utr5 <- getBM(attributes=c('entrezgene_id','5utr'), # 总共有55中序列类型,详见attrs数据框
filters="entrezgene_id",
values=entrez,
mart=dm_ensembl)
# 方法2: 针对ensembl库的通过getBM获取序列的简化版
utr5-2 <- getSequence(id=entrez,
type="entrezgene_id",
seqType='5utr',
mart=dm_ensembl)

示例2:获取Ensembl Gene, Transcript IDs and Uniprot Swissprot,其映射到果蝇Ensembl Gene ID “FBgn0038539”

1
2
3
4
5
dm_swissprot <- biomaRt::getBM(attributes=c('external_synonym','ensembl_transcript_id','uniprotswissprot','genbank'),
filters = 'external_synonym',
values = Dm_Atgs$SYMBOL,
mart = dm_ensembl)
dm_swissprot

示例3:获取某一染色体上某一段位置的SNPs

1
2
3
4
5
snpmart <- useMart(biomart = "ENSEMBL_MART_SNP", dataset="hsapiens_snp",host = "https://asia.ensembl.org")
snps <- getBM(attributes = c('refsnp_id','allele','chrom_start','chrom_strand'),
filters = c('chr_name','start','end'),
values = list(8,148350,148612),
mart = snpmart)

示例4:(出错了)将果蝇基因映射到人的基因上

1
2
3
4
5
6
7
8
9
10
11
dm_ensembl <- useMart('ensembl',dataset="dmelanogaster_gene_ensembl",host = "https://asia.ensembl.org")
human <- useMart('ensembl',dataset = "hsapiens_gene_ensembl",host = "https://asia.ensembl.org")
mouse <- useMart('ensembl',dataset = "mmusculus_gene_ensembl",host = "https://asia.ensembl.org")
dm2h.g <- getLDS(attributes = c("external_synonym"),
filters = "external_synonym",
values = Dm_Atg_genes_ID$SYMBOL,
mart = dm_ensembl,
attributesL = c("hgnc_symbol","chromosome_name","start_position"),
martL = human,
uniqueRows = T)

R语言批量爬取NCBI基因注释数据

1
2
3
4
5
6
#library(RCurl) 
#library(stringr)
#library(XML)
library(httr2)
library(rvest)
rm(list=ls())

导入要爬取的基因列表:

1
2
3
Dm_Atg_genes <- read.table('Dm_Atgs.txt',stringsAsFactors = F)
names(Dm_Atg_genes) <- 'SYMBOL'
Dm_Atg_genes

NCBI对于基因页面的索引方式都是 https://www.ncbi.nlm.nih.gov/gene/Entrze ID 的方式。

gene symbos​转为entrze ID

使用mget()​函数获取

1
2
3
4
5
6
7
8
9
10
library("org.Dm.eg.db")
AnnotationDbi::keytypes(org.Dm.eg.db)
mget(Dm_Atg_genes$SYMBOL, #需要转换的Symbol
org.Dm.egSYMBOL2EG, # Symbol转EntrezID
ifnotfound=NA)

# EntrezID转Symbol
mget(Dm_Atg_genes_ID$ENTREZID, #需要转换的EntrezID
org.Dm.egSYMBOL, #EntrezID转Symbol
ifnotfound=NA)

需要将gene symbos​转为entrze ID​,这里使用clusterProfiler​包的bitr​函数进行转换:

1
2
3
4
5
6
7
8
9
# 使用clusterProfiler将gene symbol转为entrze ID
library(clusterProfiler)
Dm_Atg_genes_ID <- bitr(Dm_Atg_genes$SYMBOL, fromType="SYMBOL", toType="ENTREZID", OrgDb="org.Dm.eg.db")

# EntrezID转Symbol
bitr(Dm_Atg_genes_ID$ENTREZID, #需要转换的基因ID
fromType = "ENTREZID", #需要转换的类型
toType = "SYMBOL", #需要转换为的类型
OrgDb = org.Dm.eg.db) #注释包

AnnotationDbi包的mapIds​函数

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
# 使用AnnotationDbi将gene symbol转为entrze ID
mapIds(x = org.Dm.eg.db,#注释包
keys = Dm_Atg_genes$SYMBOL, #需要转换的基因Symbol
keytype = "SYMBOL", #需要转换的类型
column = "ENTREZID") #需要转换为的类型

# EntrezID转Symbol
mapIds(x = org.Dm.eg.db,#注释包
keys = Dm_Atg_genes_ID$ENTREZID, #需要转换的基因ID
keytype = "ENTREZID", #需要转换的类型
column = "SYMBOL") #需要转换为的类型

# 与mapIds函数相比,该函数可同时将一种ID转换成多种类型的ID。
select(x = org.Dm.eg.db,keys = Dm_Atg_genes_ID$SYMBOL,columns = c("GENENAME","ENTREZID"),keytype = "SYMBOL")

# 将Gene Symbol转换为Gene Name,key为Akt,属于Gene Symbol,所以对应的keytype为SYMBOL
mapIds(x = org.Dm.eg.db,keys = "Akt",column = "GENENAME",keytype = "SYMBOL")

BiomaRt用起来更加灵活,可以不同版本参考基因组ID转换,不同类型基因ID转换,甚至同基因跨物种同源基因转换

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
# 使用BiomaRt将gene symbol转为entrze ID
library(biomaRt)

# 查看可用数据库
listMarts()
listEnsembl()

# 建立数据库链接,ensembl
mart<-useMart("ensembl")
# 寻找目标数据库,对应的物种及参考基因组版本
dataset_list=as.data.frame(listDatasets(mart))
# 选择果蝇数据库,重新建立数据库链接
mart_dm=useMart("ensembl", "dmelanogaster_gene_ensembl")

# 查看可用的输入数据类型
mart_dm_list <- listFilters(mart_dm)
write.csv(mart_dm_list,"mart_dm_list.csv")
# 查看可用的输出数据类型
listAttributes(mart_dm)

# ID 转换,不局限于ID转换,还可检索其他信息
list_gene <- getBM(attributes=c('entrezgene_id','ensembl_gene_id','external_synonym',"go_id", "flybase_gene_id", "external_gene_name","flybasename_gene"),
filters = 'entrezgene_id',
values = Dm_Atg_genes_ID$ENTREZID,
mart = mart_dm)
# gene symbol转为entrze ID
getBM(attributes=c('entrezgene_id','ensembl_gene_id','external_synonym',"go_id", "external_gene_name"),
filters = 'external_synonym',
values = Dm_Atg_genes_ID$SYMBOL,
mart = mart_dm)
# 需要注意的是,ensembl id转换时存在三种情况
# 1. 一个ensembl_gene_id转换为多个entrezgene_id,
# 2. 一个ensembl_gene_id转换为多个gene symbol
# 3. 同时还有许多基因ID无法成功转换,标记为NA
a=list_gene[!duplicated(list_gene$ensembl_gene_id),]
1
2
3
# 网址数据框:
Dm_Atg_genes_ID$NCBI_url <- paste("https://www.ncbi.nlm.nih.gov/gene/",Dm_Atg_genes_ID$ENTREZID,sep="")
head(Dm_Atg_genes_ID)

使用XML​包的getNodeSet()​函数需要两个参数,一个是根据URL获得的网页XML document​对象,另一个是要定位的节点(xpath​格式)。

Xpath:

//*[@id=“summaryDl”]/dd[1]/text()

1
2
3
4
5
6
7
8
# Official Symbol的xpath://*[@id="summaryDl"]/dd[1]/text()
# Official Full name://*[@id="summaryDl"]/dd[2]/text()
# Primary source://*[@id="summaryDl"]/dd[3]/a
# Locus tag://*[@id="summaryDl"]/dt[4]
# See related://*[@id="summaryDl"]/dd[5]/a
# Gene type://*[@id="summaryDl"]/dd[5]/text()
# Summary://*[@id="summaryDl"]/dd[10]/text()

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
# 用于Rcurl
# 根据xpath获取节点内容:
#getNodesTxt <- function(html_txt1,xpath_p){
# els1 = getNodeSet(html_txt1, xpath_p)
# # 获得Node的内容,并且去除空字符:
els1_txt <- sapply(els1,xmlValue)[!(sapply(els1,xmlValue)=="")]
# # 去除\n:
# str_replace_all(els1_txt,"(\\n )+","")
#}

## 处理节点格式,为character且长度为0的赋值为NA:
#dealNodeTxt <- function(NodeTxt){
# ifelse(is.character(NodeTxt)==T && length(NodeTxt)!=0 , NodeTxt , NA)
#}

当HTTP请求不包含用户代理字符串时,该网络服务器似乎返回403禁止错误。 默认情况下,RCurl​不传递用户代理。 您可以使用useragent​=参数设置一个。httr2​包比RCurl​更好一些,可以发出HTTP请求(并且它默认设置了一个用户代理字符串)。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32

#for(i in 1:nrow(Dm_Atg_genes_ID)){

# if(url.exists(Dm_Atg_genes_ID[1,"NCBI_url"],useragent="curl/5.2.0 Rcurl/1.98.1.14")) {
# cat("网址连接正常!\t")
# } else {cat("网址有误!停止运行!\t");break}
# # 获得网页:
# doc <- getURL(Dm_Atg_genes_ID[i,"NCBI_url"],useragent="curl/5.2.0 Rcurl/1.98.1.14")
# # 使用httr2获取页面
# doc <- resp_body_html(req_perform(request(Dm_Atg_genes_ID[i,"NCBI_url"])))
# cat("成功获得网页!\t")
# # 获得网页内容
# html_txt1 <- htmlParse(doc, asText = TRUE)

# # 获得Full Name:
# Dm_Atg_genes_ID[i,"FullName"] <- dealNodeTxt(getNodesTxt(html_txt1,'//*[@id="summaryDl"]/dd[2]/text()'))
# cat("写入基因\t")
# # 获得Gene ID:
# Dm_Atg_genes_ID[i,"Gene_ID"] <- str_replace_all(dealNodeTxt(getNodesTxt(html_txt1,'//*[@id="summaryDl"]#/dd[3]/a')),"HGNC|:","")
# cat("写入Gene_ID\t")
# # 获得Gene type:
# Dm_Atg_genes_ID[i,"GeneType"] <- dealNodeTxt(getNodesTxt(html_txt1,'//*[@id="summaryDl"]/dd[5]/text()'))
# cat("写入GeneType\t")
# # 获得summary:
# Dm_Atg_genes_ID[i,"Summary"] <- ifelse(length(getNodesTxt(html_txt1,'//*[@id="summaryDl"]/dd[10]/text()'))!=0,getNodesTxt(html_txt1,'//*[@id="summaryDl"]/dd[10]/text()'),NA)
# cat("写入Summary\n")

# print(paste("完成第",i,"个了!"))

#}

#xlsx::write.xlsx(Dm_Atg_genes_ID,file = 'Dm_Atg_genes_NCBI.xlsx',sheetName = "Dm_Atgs_NCBI")

rvest爬取数据

1
2
3
4

NCBI_gene_page <- read_html(Dm_Atg_genes_ID[1,"NCBI_url"], encoding = "UTF-8")
html_nodes(NCBI_gene_page,"p.summary")
html_text()

rentrez 获取 NCBI 数据库数据

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
library(rentrez)

# NCBI 默认有访问限制,未设置 api key 时限制 3 次每秒,设置后限制 10 次每秒。如果有,通过 set_entrez_key 函数设置。
#set_entrez_key("xxxxxx")

# 列举可用的数据库
entrez_dbs()
# 查看某数据库总结
entrez_db_summary("pubmed")
# 查看某数据库支持的搜索字段(fields)
entrez_db_searchable("pubmed")
# 函数 entrez_link 获取不同数据库的链接
entrez_db_links("pubmed")


# 在pubmed中搜索黑腹果蝇,用 entrez_search 函数,字段用 [] 包含。
es <- entrez_search(db = "pubmed", term = "Drosophila heart[titl] AND 2017:2024[PDAT]", retmax = 30)
es
# 相关文献的IDs
es$ids

# entrez_link()可用于查找交叉引用的记录
el <- entrez_link(id = 38237924, db = "",dbfrom = "pubmed") #Sources for the full text of the paper
el$links
entrez_link(id = 38237924, db = "nuccore",dbfrom = "pubmed")
# 函数 entrez_summary 取得条目总结信息,设置 always_return_list = TRUE 让函数就算一个请求也返回列表,防止在 apply 或循环时出现非预期行为。
es <- entrez_summary(db = "pubmed", id = "38237924")
# 函数 extract_from_esummary 从返回的总结对象提取需要信息。
extract_from_esummary(es, "title")

pubmed文献计量

1
2
3
4
5
6
7
8
search_year <- function(year, term){
query <- paste(term, "AND (", year, "[PDAT])")
entrez_search(db="pubmed", term=query, retmax=0)$count
}

year <- 2014:2024
papers <- sapply(year, search_year, term="Drosophila heart[titl]", USE.NAMES=FALSE)
plot(year, papers, type='b', main="The Rise of the Drosophila heart")

核酸序列 fasta 文件下载

1
2
3
4
5
6
7
8
9
# 函数 entrez_fetch 获取条目完整信息,常用于下载数据,如选择 nucleotide 数据库并返回类型为 fasta 时可以下载对应 fasta信息,返回为字符串写入到文件,即完成了核酸序列 fasta 文件下载。
el <- entrez_link(dbfrom = "taxonomy", id = "28285", db = "nucleotide")
nu5 <- el$links$taxonomy_nuccore[1:5]
nu5

ef <- entrez_fetch(db = "nucleotide", id = nu5, rettype = "fasta")
temp <- tempfile()
write(ef, temp)

下载果蝇atg8a蛋白的数据

1
2
3
4
5
6
7
8
9
10
11

all_the_links <- entrez_link(dbfrom='gene', id=32001, db='all')
all_the_links$links
all_the_links$links$gene_protein

nuc_links <- entrez_link(dbfrom='gene', id=32001, db='protein')
nuc_links
nuc_links$links
nuc_links$links$gene_protein_refseq
ef <- entrez_fetch(db = "protein", id = 24641085, rettype = "fasta")
write(ef, 'dm_pro_24641085.fasta')

参考

  1. Drost HG, Paszkowski J. Biomartr: genomic data retrieval with R. Bioinformatics (2017) 33(8): 1216-1217. doi:10.1093/bioinformatics/btw821IF: 5.8 Q1 .

  2. R语言批量爬取NCBI基因注释数据 - 简书 (jianshu.com)

  3. 关于r:Rcurl:如果url存在,则url.exists返回false | 码农家园 (codenong.com)

  4. R语言biomaRt工具包学习笔记 - 知乎 (zhihu.com)

  5. BiomaRt 包进行基因ID转换_biomart包id转化方法-CSDN博客

  6. rentrez 获取 NCBI 数据库数据 - 简书 (jianshu.com)