R语言之NCBI等数据库的序列批量下载和相关R包的使用

发表于 2024-12-08 分类于原创，科研阅读次数： Changyan：本文字数： 3.7k 阅读时长 ≈ 14 分钟

Biomartr包

报错信息

问题1：

download.database.all(db = "taxdb", path = "taxdb")运行后报错，这个包检测网络连接情况的connected.to.internet函数，源码如下:

connected.to.internet <- function() {
    if (curl::curl_fetch_memory("www.google.com")$status_code == 200) {
        return(TRUE)
    } else {
        message(
            "It seems that you are not connected to the internet. A query to www.google.com was not successful. Could you please check?"
        )
    }
}

因为是检测google连接情况来判定网络是否联通的，所以国内使用会出问题。解决办法： _ 修改hosts： _ 修改R包代码，把www.google.com换成www.bing.com，从github下载源码，修改后本地安装。

问题2（未解决）：

getCollection( db = "refseq", organism = "Drosophila melanogaster")报错

The FTP link: 'https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/215/GCF_000001215.4_Release_6_plus_ISO1_MT/GCF_000001215.4_Release_6_plus_ISO1_MT_assembly_stats.txt' seems not to be available at the moment. This might either be due to an instable internet connection, a firewall issue, a wrong organism name, or due to the fact that the specified organism is not available in the database you selected. If it is an internet connection issue, could you please try to re-run the function to see whether it works now?

https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/215/GCF_000001215.4_Release_6_plus_ISO1_MT/GCF_000001215.4_Release_6_plus_ISO1_MT_assembly_stats.txt可以在浏览器中正常下载

可能出问题的：custom_download.R，getGenome.R，getBioSet.R，readAssemblyDoc.R，Refseq_Genbank_post_processing.R

简介

为了在元基因组规模上自动检索基因组和它们的注释信息，biomartr包提供了基因组序列检索和功能注释检索的接口功能。biomartr的主要目标是促进计算再现性和大规模处理基因组数据(元)基因组分析。有了biomartr，用户现在可以控制并了解他们自动检索的基因组版本。许多大规模基因组学研究缺乏这些信息。当基因组版本信息的记录被忽视时，再现和数据解释几乎是不可能的。 In detail, biomartr automates genome, proteome, CDS, RNA, Repeats, GFF/GTF (annotation), genome assembly quality, and metagenome project data retrieval from the major biological databases such as NCBI RefSeq NCBI Genbank ENSEMBL ENSEMBLGENOMES(截止到2019年4月份 - ENSEMBL和ENSEMBLGENOMES已经合并了- see details here) UniProt

安装biomartr包

# On Windows, this won't work - see ?build_github_devtools
install_github("HajkD/biomartr", build_vignettes = TRUE, dependencies = TRUE)

# When working with Windows, first you need to install the
# R package: rtools -> install.packages("rtools")

# Afterwards you can install devtools -> install.packages("devtools")
# and then you can run:

devtools::install_github("HajkD/biomartr", build_vignettes = TRUE, dependencies = TRUE)

# and then call it from the library
library("biomartr", lib.loc = "C:/Program Files/R/R-3.1.1/library")

自带教程

# source the biomartr package
library(biomartr)
# look for all tutorials (vignettes) available in the biomartr package
# this will open your web browser
browseVignettes("biomartr")

Meta-Genome Retrieval

meta.retrieval() : Perform Meta-Genome Retieval from NCBI of species belonging to the same kingdom of life or to the same taxonomic subgroup
meta.retrieval.all() : Perform Meta-Genome Retieval from NCBI of the entire kingdom of life
getMetaGenomes() : Retrieve metagenomes from NCBI Genbank
getMetaGenomeAnnotations() : Retrieve annotation *.gff files for metagenomes from NCBI Genbank
listMetaGenomes() : List available metagenomes on NCBI Genbank
getMetaGenomeSummary() : Helper function to retrieve the assembly_summary.txt file from NCBI genbank metagenomes
clean.retrieval(): Format meta.retrieval output

Genome Retrieval

listGenomes() : List all genomes available on NCBI and ENSEMBL servers
listKingdoms() : list the number of available species per kingdom of life on NCBI and ENSEMBL servers
listGroups() : list the number of available species per group on NCBI and ENSEMBL servers
getKingdoms() : Retrieve available kingdoms of life
getGroups() : Retrieve available groups for a kingdom of life
is.genome.available() : Check Genome Availability NCBI and ENSEMBL servers
getCollection() : Retrieve a Collection: Genome, Proteome, CDS, RNA, GFF, Repeat Masker, AssemblyStats
getGenome() : Download a specific genome stored on NCBI and ENSEMBL servers
getGenomeSet() : Genome Retrieval of multiple species
getProteome() : Download a specific proteome stored on NCBI and ENSEMBL servers
getProteomeSet() : Proteome Retrieval of multiple species
getCDS() : Download a specific CDS file (genome) stored on NCBI and ENSEMBL servers
getCDSSet() : CDS Retrieval of multiple species
getRNA() : Download a specific RNA file stored on NCBI and ENSEMBL servers
getRNASet() : RNA Retrieval of multiple species
getGFF() : Genome Annotation Retrieval from NCBI (*.gff) and ENSEMBL (*.gff3) servers
getGTF() : Genome Annotation Retrieval (*.gtf) from ENSEMBL servers
getRepeatMasker() : Repeat Masker TE Annotation Retrieval
getAssemblyStats() : Genome Assembly Stats Retrieval from NCBI
getKingdomAssemblySummary() : Helper function to retrieve the assembly_summary.txt files from NCBI for all kingdoms
getMetaGenomeSummary() : Helper function to retrieve the assembly_summary.txt files from NCBI genbank metagenomes
getSummaryFile() : Helper function to retrieve the assembly_summary.txt file from NCBI for a specific kingdom
getENSEMBLInfo() : Retrieve ENSEMBL info file
getGENOMEREPORT() : Retrieve GENOME_REPORTS file from NCBI

Import Downloaded Files

read\_genome() : Import genomes as Biostrings or data.table object
read\_proteome() : Import proteome as Biostrings or data.table object
read\_cds() : Import CDS as Biostrings or data.table object
read\_gff() : Import GFF file
read\_rna() : Import RNA file
read\_rm() : Import Repeat Masker output file
read\_assemblystats() : Import Genome Assembly Stats File

Database Retrieval

listNCBIDatabases() : Retrieve a List of Available NCBI Databases for Download
download.database() : Download a NCBI database to your local hard drive
download.database.all() : Download a complete NCBI Database such as e.g. NCBI nr to your local hard drive

BioMart Queries

biomart() : Main function to query the BioMart database
getMarts() : Retrieve All Available BioMart Databases
getDatasets() : Retrieve All Available Datasets for a BioMart Database
getAttributes() : Retrieve All Available Attributes for a Specific Dataset
getFilters() : Retrieve All Available Filters for a Specific Dataset
organismBM() : Function for organism specific retrieval of available BioMart marts and datasets
organismAttributes() : Function for organism specific retrieval of available BioMart attributes
organismFilters() : Function for organism specific retrieval of available BioMart filters

Performing Gene Ontology queries

Gene Ontology

getGO() : Function to retrieve GO terms for a given set of genes

Ensembl BioMart Examples

1 2	library(myTAI)

示例

# retrieve a list of available sequence databases at NCBI
listNCBIDatabases(db = "all")
# show all NCBI nr files
listNCBIDatabases(db = "nr")

# download the entire NCBI taxonomy database
download.database.all(db = "taxdb", path = "taxdb")


# download collection for Saccharomyces cerevisiae，用不了
# getCollection( db = "refseq", organism = "Drosophila melanogaster")

1 2	# download all Drosophila melanogaster genomes,用不了 # meta.retrieval(kingdom = "invertebrate", db = "refseq", type = "genome")

BiomaRt包和biomartr包的区别

BiomaRt软件包与biomartr软件包的主要区别在于，biomartr扩展了BiomaRt的功能注释检索过程，并提供了对基因组、蛋白质组、编码序列、gff文件、RNA序列、Repeat Masker注释文件的有用检索功能，以及对NCBI nr等整个数据库的检索功能。

BiomaRt包

报错信息

问题1：

Error in `collect()`: ! Failed to collect lazy table. 
Caused by error in `db_collect()`: ! Arguments in `...` must be used. 
✖ Problematic argument: • ..1 = Inf ℹ Did you misspell an argument name? Backtrace:  
1. biomaRt::listEnsembl()  
11. dbplyr:::collect.tbl_sql(., Inf)  
16. dbplyr::db_collect(x$src$con, sql, n = n, warn_incomplete = warn_incomplete, ...)

由于dplyr和Biocfilecache之间的不兼容导致的错误问题，而Biomart是无辜的旁观者！ 解决办法：更新你的 BioManager 版本

问题2：

1 2	错误: Your query has been redirected to http://status.ensembl.org indicating this Ensembl service is currently unavailable. Look at ?useEnsembl for details on how to try a mirror site.

浏览器打不开该网站，等服务器好了就行了

简介

biomaRt工具包是一个连接bioMart数据库的R语言接口，能够通过这个软件包自由地链接到bioMart数据库： 1. 查找某个基因在染色体上的位置。反之，给定染色体每一区间，返回该区间的基因s； 2. 通过EntrezGene的ID查找到相关序列的GO注释。反之，给定相关的GO注释，获取相关的EntrezGene的ID； 3. 通过EntrezGene的ID查找到相关序列的上游100bp序列（可能包含启动子等调控元件）； 4. 查找人类染色体上每一段区域中已知的SNPs； 5. 给定一组的序列ID，获得其中具体的序列；

1	library(biomaRt)

选择数据库和对应的服务

1	# 线虫数据库WormBase ParaSite host="parasite.wormbase.org"

在ensembl.org网站上获取所有Ensembl mart的列表

1
2
3

listEnsembl(mirror = "asia")
#更换版本
listEnsembl("GRCh=37")

listMarts, listDatasets and useMart for the Ensembl mirrors You can connect to the following Ensembl mirrors using the listMarts, listDatasets and useMart functions:

Ensembl US West: https://uswest.ensembl.org/index.html
Ensembl US East: https://useast.ensembl.org/index.html
Ensembl Asia: https://asia.ensembl.org/index.html

listMarts(host="https://asia.ensembl.org")
# 选取特定的服务
ensembl_asia <- useMart(biomart="ENSEMBL_MART_ENSEMBL", host = "https://asia.ensembl.org")
# 同ensembl_asia1 <- useMart("ensembl", host="https://asia.ensembl.org")
# 同ensembl_asia2 <- useEnsembl(biomart="ensembl",host = "http://asia.ensembl.org")
# useEnsembl函数专门用于ensembl数据库

#查看可选数据集
head(listDatasets(ensembl_asia))

# 限制为果蝇的基因注释数据集
dm_ensembl <- useDataset(dataset="dmelanogaster_gene_ensembl", mart=ensembl_asia)
# 查看该数据集下的filters
# filter：控制根据什么东西来过滤，可以是不同数据库的ID，也可以是染色体定位系统坐标
filters <- listFilters(dm_ensembl)
# 查看该数据集下的attributes
attrs <- listAttributes(dm_ensembl)
table(attrs$page)

“listDatasets” function will give you the list of all the species available (mart datasets) for a given mart getBM允许使用mart过滤器和属性列表（filters和attributes）构建BioMart查询。

示例1：给定 EntrezGene，获取其5’ UTR序列

entrez <- c("39454","50253","32871")
utr5 <- getBM(attributes=c('entrezgene_id','5utr'), # 总共有55中序列类型，详见attrs数据框
              filters="entrezgene_id",
              values=entrez,
              mart=dm_ensembl)
# 方法2: 针对ensembl库的通过getBM获取序列的简化版
utr5-2 <- getSequence(id=entrez,
                    type="entrezgene_id",
                    seqType='5utr',
                    mart=dm_ensembl)

示例2：获取Ensembl Gene, Transcript IDs and Uniprot Swissprot，其映射到果蝇Ensembl Gene ID “FBgn0038539”

dm_swissprot <- biomaRt::getBM(attributes=c('external_synonym','ensembl_transcript_id','uniprotswissprot','genbank'),
                      filters = 'external_synonym', 
                      values = Dm_Atgs$SYMBOL, 
                      mart = dm_ensembl)
dm_swissprot

示例3：获取某一染色体上某一段位置的SNPs

snpmart <- useMart(biomart = "ENSEMBL_MART_SNP", dataset="hsapiens_snp",host = "https://asia.ensembl.org")
snps <- getBM(attributes = c('refsnp_id','allele','chrom_start','chrom_strand'),
              filters = c('chr_name','start','end'),
              values = list(8,148350,148612),
              mart = snpmart)

示例4：(出错了)将果蝇基因映射到人的基因上

dm_ensembl <- useMart('ensembl',dataset="dmelanogaster_gene_ensembl",host = "https://asia.ensembl.org")
human <- useMart('ensembl',dataset = "hsapiens_gene_ensembl",host = "https://asia.ensembl.org")
mouse <- useMart('ensembl',dataset = "mmusculus_gene_ensembl",host = "https://asia.ensembl.org")
dm2h.g <- getLDS(attributes = c("external_synonym"),
                filters = "external_synonym",
                values = Dm_Atg_genes_ID$SYMBOL,
                mart = dm_ensembl,
                attributesL = c("hgnc_symbol","chromosome_name","start_position"),
                martL = human,
                uniqueRows = T)

R语言批量爬取NCBI基因注释数据

#library(RCurl) 
#library(stringr)
#library(XML)
library(httr2)
library(rvest)
rm(list=ls())

导入要爬取的基因列表：

1
2
3

Dm_Atg_genes <- read.table('Dm_Atgs.txt',stringsAsFactors = F)
names(Dm_Atg_genes) <- 'SYMBOL'
Dm_Atg_genes

NCBI对于基因页面的索引方式都是 https://www.ncbi.nlm.nih.gov/gene/Entrze ID 的方式。

`gene symbos`转为`entrze ID`

使用mget()函数获取

library("org.Dm.eg.db")
AnnotationDbi::keytypes(org.Dm.eg.db)
mget(Dm_Atg_genes$SYMBOL, #需要转换的Symbol
    org.Dm.egSYMBOL2EG, # Symbol转EntrezID
    ifnotfound=NA)

# EntrezID转Symbol
mget(Dm_Atg_genes_ID$ENTREZID, #需要转换的EntrezID
    org.Dm.egSYMBOL, #EntrezID转Symbol
    ifnotfound=NA)

需要将gene symbos转为entrze ID，这里使用clusterProfiler包的bitr函数进行转换：

# 使用clusterProfiler将gene symbol转为entrze ID
library(clusterProfiler)
Dm_Atg_genes_ID <- bitr(Dm_Atg_genes$SYMBOL, fromType="SYMBOL", toType="ENTREZID", OrgDb="org.Dm.eg.db")

# EntrezID转Symbol
bitr(Dm_Atg_genes_ID$ENTREZID, #需要转换的基因ID
     fromType = "ENTREZID", #需要转换的类型
     toType = "SYMBOL", #需要转换为的类型
     OrgDb = org.Dm.eg.db) #注释包

AnnotationDbi包的mapIds函数

# 使用AnnotationDbi将gene symbol转为entrze ID
mapIds(x = org.Dm.eg.db,#注释包
      keys = Dm_Atg_genes$SYMBOL, #需要转换的基因Symbol
      keytype = "SYMBOL", #需要转换的类型
      column = "ENTREZID") #需要转换为的类型

# EntrezID转Symbol
mapIds(x = org.Dm.eg.db,#注释包
      keys = Dm_Atg_genes_ID$ENTREZID, #需要转换的基因ID
      keytype = "ENTREZID", #需要转换的类型
      column = "SYMBOL")  #需要转换为的类型

# 与mapIds函数相比，该函数可同时将一种ID转换成多种类型的ID。
select(x = org.Dm.eg.db,keys = Dm_Atg_genes_ID$SYMBOL,columns = c("GENENAME","ENTREZID"),keytype = "SYMBOL")

# 将Gene Symbol转换为Gene Name，key为Akt，属于Gene Symbol，所以对应的keytype为SYMBOL
mapIds(x = org.Dm.eg.db,keys = "Akt",column = "GENENAME",keytype = "SYMBOL")

BiomaRt用起来更加灵活，可以不同版本参考基因组ID转换，不同类型基因ID转换，甚至同基因跨物种同源基因转换

# 使用BiomaRt将gene symbol转为entrze ID
library(biomaRt)

# 查看可用数据库
listMarts()
listEnsembl()

# 建立数据库链接，ensembl
mart<-useMart("ensembl")
# 寻找目标数据库，对应的物种及参考基因组版本
dataset_list=as.data.frame(listDatasets(mart))
# 选择果蝇数据库，重新建立数据库链接
mart_dm=useMart("ensembl", "dmelanogaster_gene_ensembl")

# 查看可用的输入数据类型 
mart_dm_list <- listFilters(mart_dm)
write.csv(mart_dm_list,"mart_dm_list.csv")
# 查看可用的输出数据类型
listAttributes(mart_dm)

# ID 转换，不局限于ID转换，还可检索其他信息
list_gene <- getBM(attributes=c('entrezgene_id','ensembl_gene_id','external_synonym',"go_id", "flybase_gene_id", "external_gene_name","flybasename_gene"), 
                   filters = 'entrezgene_id', 
                   values = Dm_Atg_genes_ID$ENTREZID, 
                   mart = mart_dm)
# gene symbol转为entrze ID
getBM(attributes=c('entrezgene_id','ensembl_gene_id','external_synonym',"go_id", "external_gene_name"), 
                   filters = 'external_synonym', 
                   values = Dm_Atg_genes_ID$SYMBOL, 
                   mart = mart_dm)
# 需要注意的是，ensembl id转换时存在三种情况
# 1. 一个ensembl_gene_id转换为多个entrezgene_id，
# 2. 一个ensembl_gene_id转换为多个gene symbol
# 3. 同时还有许多基因ID无法成功转换，标记为NA
a=list_gene[!duplicated(list_gene$ensembl_gene_id),]

1
2
3

# 网址数据框：
Dm_Atg_genes_ID$NCBI_url <- paste("https://www.ncbi.nlm.nih.gov/gene/",Dm_Atg_genes_ID$ENTREZID,sep="")
head(Dm_Atg_genes_ID)

使用XML包的getNodeSet()函数需要两个参数，一个是根据URL获得的网页XML document对象，另一个是要定位的节点(xpath格式)。

Xpath:

//*[@id=“summaryDl”]/dd[1]/text()

# Official Symbol的xpath：//*[@id="summaryDl"]/dd[1]/text()
# Official Full name：//*[@id="summaryDl"]/dd[2]/text()
# Primary source：//*[@id="summaryDl"]/dd[3]/a
# Locus tag：//*[@id="summaryDl"]/dt[4]
# See related：//*[@id="summaryDl"]/dd[5]/a
# Gene type：//*[@id="summaryDl"]/dd[5]/text()
# Summary：//*[@id="summaryDl"]/dd[10]/text()

# 用于Rcurl
# 根据xpath获取节点内容：
#getNodesTxt <- function(html_txt1,xpath_p){
#  els1 = getNodeSet(html_txt1, xpath_p)
#  # 获得Node的内容，并且去除空字符：
  els1_txt <- sapply(els1,xmlValue)[!(sapply(els1,xmlValue)=="")]
#  # 去除\n：
#  str_replace_all(els1_txt,"(\\n )+","")
#}

## 处理节点格式，为character且长度为0的赋值为NA：
#dealNodeTxt <- function(NodeTxt){
#  ifelse(is.character(NodeTxt)==T && length(NodeTxt)!=0 , NodeTxt , NA)
#}

当HTTP请求不包含用户代理字符串时，该网络服务器似乎返回403禁止错误。默认情况下，RCurl不传递用户代理。您可以使用useragent=参数设置一个。httr2包比RCurl更好一些，可以发出HTTP请求(并且它默认设置了一个用户代理字符串)。


#for(i in 1:nrow(Dm_Atg_genes_ID)){
  
#  if(url.exists(Dm_Atg_genes_ID[1,"NCBI_url"],useragent="curl/5.2.0 Rcurl/1.98.1.14")) {
#    cat("网址连接正常！\t")
#  } else {cat("网址有误！停止运行！\t");break}
#  # 获得网页：
#  doc <- getURL(Dm_Atg_genes_ID[i,"NCBI_url"],useragent="curl/5.2.0 Rcurl/1.98.1.14")
#  # 使用httr2获取页面
#  doc <- resp_body_html(req_perform(request(Dm_Atg_genes_ID[i,"NCBI_url"])))
#  cat("成功获得网页！\t")
#  # 获得网页内容
#  html_txt1 <- htmlParse(doc, asText = TRUE)
  
#  # 获得Full Name:
#  Dm_Atg_genes_ID[i,"FullName"] <- dealNodeTxt(getNodesTxt(html_txt1,'//*[@id="summaryDl"]/dd[2]/text()'))
#  cat("写入基因\t")
#  # 获得Gene ID:
#  Dm_Atg_genes_ID[i,"Gene_ID"] <- str_replace_all(dealNodeTxt(getNodesTxt(html_txt1,'//*[@id="summaryDl"]#/dd[3]/a')),"HGNC|:","")
#  cat("写入Gene_ID\t")
#  # 获得Gene type:
#  Dm_Atg_genes_ID[i,"GeneType"] <- dealNodeTxt(getNodesTxt(html_txt1,'//*[@id="summaryDl"]/dd[5]/text()'))
#  cat("写入GeneType\t")
#  # 获得summary：
#  Dm_Atg_genes_ID[i,"Summary"] <- ifelse(length(getNodesTxt(html_txt1,'//*[@id="summaryDl"]/dd[10]/text()'))!=0,getNodesTxt(html_txt1,'//*[@id="summaryDl"]/dd[10]/text()'),NA)
#  cat("写入Summary\n")
  
#  print(paste("完成第",i,"个了！"))

#}

#xlsx::write.xlsx(Dm_Atg_genes_ID,file = 'Dm_Atg_genes_NCBI.xlsx',sheetName = "Dm_Atgs_NCBI")

rvest爬取数据


NCBI_gene_page <- read_html(Dm_Atg_genes_ID[1,"NCBI_url"], encoding = "UTF-8")
html_nodes(NCBI_gene_page,"p.summary")
html_text()

rentrez 获取 NCBI 数据库数据

library(rentrez)

# NCBI 默认有访问限制，未设置 api key 时限制 3 次每秒，设置后限制 10 次每秒。如果有，通过 set_entrez_key 函数设置。
#set_entrez_key("xxxxxx")

# 列举可用的数据库 
entrez_dbs()
# 查看某数据库总结 
entrez_db_summary("pubmed")
# 查看某数据库支持的搜索字段（fields）
entrez_db_searchable("pubmed")
# 函数 entrez_link 获取不同数据库的链接
entrez_db_links("pubmed")


# 在pubmed中搜索黑腹果蝇，用 entrez_search 函数，字段用 [] 包含。
es <- entrez_search(db = "pubmed", term = "Drosophila heart[titl] AND 2017:2024[PDAT]", retmax = 30)
es
# 相关文献的IDs
es$ids

# entrez_link()可用于查找交叉引用的记录
el <- entrez_link(id = 38237924, db = "",dbfrom = "pubmed") #Sources for the full text of the paper 
el$links
entrez_link(id = 38237924, db = "nuccore",dbfrom = "pubmed")
# 函数 entrez_summary 取得条目总结信息，设置 always_return_list = TRUE 让函数就算一个请求也返回列表，防止在 apply 或循环时出现非预期行为。
es <- entrez_summary(db = "pubmed", id = "38237924")
# 函数 extract_from_esummary 从返回的总结对象提取需要信息。
extract_from_esummary(es, "title")

pubmed文献计量

search_year <- function(year, term){
    query <- paste(term, "AND (", year, "[PDAT])")
    entrez_search(db="pubmed", term=query, retmax=0)$count
}

year <- 2014:2024
papers <- sapply(year, search_year, term="Drosophila heart[titl]", USE.NAMES=FALSE)
plot(year, papers, type='b', main="The Rise of the Drosophila heart")

核酸序列 fasta 文件下载

# 函数 entrez_fetch 获取条目完整信息，常用于下载数据，如选择 nucleotide 数据库并返回类型为 fasta 时可以下载对应 fasta信息，返回为字符串写入到文件，即完成了核酸序列 fasta 文件下载。
el <- entrez_link(dbfrom = "taxonomy", id = "28285", db = "nucleotide")
nu5 <- el$links$taxonomy_nuccore[1:5]
nu5

ef <- entrez_fetch(db = "nucleotide", id = nu5, rettype = "fasta")
temp <- tempfile()
write(ef, temp)

下载果蝇atg8a蛋白的数据


all_the_links <- entrez_link(dbfrom='gene', id=32001, db='all')
all_the_links$links
all_the_links$links$gene_protein

nuc_links <- entrez_link(dbfrom='gene', id=32001, db='protein')
nuc_links
nuc_links$links
nuc_links$links$gene_protein_refseq
ef <- entrez_fetch(db = "protein", id = 24641085, rettype = "fasta")
write(ef, 'dm_pro_24641085.fasta')

参考

Drost HG, Paszkowski J. Biomartr: genomic data retrieval with R. Bioinformatics (2017) 33(8): 1216-1217. doi:10.1093/bioinformatics/btw821IF: 5.8 Q1 .
R语言批量爬取NCBI基因注释数据 - 简书 (jianshu.com)
关于r：Rcurl：如果url存在，则url.exists返回false | 码农家园 (codenong.com)
R语言biomaRt工具包学习笔记 - 知乎 (zhihu.com)
BiomaRt 包进行基因ID转换_biomart包id转化方法-CSDN博客
rentrez 获取 NCBI 数据库数据 - 简书 (jianshu.com)

Biomartr包

报错信息

问题1：

问题2（未解决）：

简介

安装biomartr包

自带教程

Meta-Genome Retrieval

Genome Retrieval

Import Downloaded Files

Database Retrieval

BioMart Queries

Performing Gene Ontology queries

Ensembl BioMart Examples

示例

BiomaRt包和biomartr包的区别

BiomaRt包

报错信息

问题1：

问题2：

简介

选择数据库和对应的服务

示例1：给定 EntrezGene，获取其5’ UTR序列

示例2：获取Ensembl Gene, Transcript IDs and Uniprot Swissprot，其映射到果蝇Ensembl Gene ID “FBgn0038539”

示例3：获取某一染色体上某一段位置的SNPs

示例4：(出错了)将果蝇基因映射到人的基因上

R语言批量爬取NCBI基因注释数据

导入要爬取的基因列表：

gene symbos​转为entrze ID​

rvest爬取数据

rentrez 获取 NCBI 数据库数据

pubmed文献计量

核酸序列 fasta 文件下载

下载果蝇atg8a蛋白的数据

参考

`gene symbos`转为`entrze ID`