Extract gene annotation from Ensembl via AnnotationHub EnsDb
Source: R/get_ensdb_genes.R
Retrieves gene annotation for a given organism from the best-matching Ensembl database (EnsDb) available through Bioconductor's AnnotationHub, read with ensembldb.
Arguments
- organism_keyword
Character. Case-insensitive string that uniquely identifies the organism (e.g., "drosophila melanogaster").
- genome_build
Optional character. Genome build identifier to further restrict the EnsDb selection (e.g., "BDGP6").
- ensembl_version
Optional integer. Specific Ensembl version to fetch. If NULL, the latest available version is used.
- exclude_biotypes
Character vector. Gene biotypes to exclude from the result (default: c("transposable_element", "pseudogene")).
- include_gene_metadata
Character vector. Metadata columns to keep for each gene (default: c("gene_id", "gene_name")).
Value
List with:
- genes
A GRanges object of genes, carrying the metadata columns selected by include_gene_metadata.
- ensembl_version
Character. The Ensembl version string.
- genome_build
Character. Genome build identifier.
- species
Character. Latin binomial species name.
- common_name
Character. Species common name.
Details
This function queries AnnotationHub for EnsDb objects matching a supplied organism keyword, with optional filtering by genome build and Ensembl version. Genes matching excluded biotypes are filtered out. Only user-selected metadata fields are retained in the genes output.
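The flow described above can be sketched roughly as follows. This is not the package's actual source: `sketch_get_ensdb_genes()` and `parse_ensembl_version()` are hypothetical names, and the sketch assumes that the Ensembl version is read from record titles of the form "Ensembl 110 EnsDb for ...", that the genome build is matched as a substring, and that biotypes are excluded by exact match. It omits the species and common_name elements of the return value.

```r
# Illustrative sketch only; the real get_ensdb_genes() may differ.

# Hypothetical helper: extract the Ensembl release number from EnsDb titles
# such as "Ensembl 110 EnsDb for Drosophila melanogaster".
parse_ensembl_version <- function(titles) {
  as.integer(sub("^Ensembl ([0-9]+) EnsDb.*$", "\\1", titles))
}

# Hypothetical re-implementation of the documented behaviour.
sketch_get_ensdb_genes <- function(organism_keyword,
                                   genome_build = NULL,
                                   ensembl_version = NULL,
                                   exclude_biotypes = c("transposable_element",
                                                        "pseudogene"),
                                   include_gene_metadata = c("gene_id",
                                                             "gene_name")) {
  ah <- AnnotationHub::AnnotationHub()

  # query() matches its patterns case-insensitively by default.
  hits <- AnnotationHub::query(ah, c("EnsDb", organism_keyword))

  # Optional restriction by genome build (assumed substring match).
  if (!is.null(genome_build)) {
    hits <- hits[grepl(genome_build, hits$genome, fixed = TRUE)]
  }
  if (length(hits) == 0L) stop("No EnsDb record matches the query.")

  # Use the requested version, or fall back to the latest one available.
  versions <- parse_ensembl_version(hits$title)
  wanted <- if (is.null(ensembl_version)) {
    max(versions, na.rm = TRUE)
  } else {
    as.integer(ensembl_version)
  }
  hits <- hits[!is.na(versions) & versions == wanted]
  if (length(hits) != 1L) stop("Query did not resolve to exactly one EnsDb.")

  # Download (or load from cache) the EnsDb and extract genes.
  edb <- hits[[names(hits)[1]]]
  gr <- ensembldb::genes(
    edb,
    columns = unique(c(include_gene_metadata, "gene_biotype"))
  )

  # Drop excluded biotypes, then keep only the requested metadata columns.
  gr <- gr[!gr$gene_biotype %in% exclude_biotypes]
  S4Vectors::mcols(gr) <- S4Vectors::mcols(gr)[, include_gene_metadata,
                                               drop = FALSE]

  list(
    genes = gr,
    ensembl_version = as.character(wanted),
    genome_build = unique(hits$genome)
  )
}
```

Requesting gene_biotype alongside the user-selected columns and dropping it afterwards lets the biotype filter work even when the caller does not ask for that column.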
Examples
# \donttest{
# This example requires an internet connection and will download data.
# It is wrapped in \donttest{} so it is not run by automated checks.
dm_genes <- get_ensdb_genes(
  organism_keyword = "drosophila melanogaster",
  ensembl_version = 110
)
#> Finding genome versions ...
#> snapshotDate(): 2024-04-30
#> Loading Ensembl genome version 'Ensembl 110 EnsDb for Drosophila melanogaster'
#> loading from cache
# View the fetched genes GRanges object
dm_genes$genes
#> GRanges object with 18040 ranges and 2 metadata columns:
#> seqnames ranges strand | gene_id
#> <Rle> <IRanges> <Rle> | <character>
#> FBgn0259849 211000022278436 1551-2815 - | FBgn0259849
#> FBgn0085506 211000022278760 562-879 - | FBgn0085506
#> FBgn0259870 211000022279165 14-1118 - | FBgn0259870
#> FBgn0259817 211000022279188 1449-2180 + | FBgn0259817
#> FBgn0085511 211000022279264 180-614 - | FBgn0085511
#> ... ... ... ... . ...
#> FBtr0472933_df_nrg rDNA 70297-70318 + | FBtr0472933_df_nrg
#> FBtr0472934_df_nrg rDNA 70330-70351 + | FBtr0472934_df_nrg
#> FBgn0267502 rDNA 70389-70511 + | FBgn0267502
#> FBgn0267503 rDNA 70540-70569 + | FBgn0267503
#> FBgn0267504 rDNA 70955-74924 + | FBgn0267504
#> gene_name
#> <character>
#> FBgn0259849 Su(Ste):CR42418
#> FBgn0085506 CG40635
#> FBgn0259870 Su(Ste):CR42439
#> FBgn0259817 SteXh:CG42398
#> FBgn0085511 lncRNA:CR40719
#> ... ...
#> FBtr0472933_df_nrg
#> FBtr0472934_df_nrg
#> FBgn0267502 5.8SrRNA:CR45842
#> FBgn0267503 2SrRNA:CR45843
#> FBgn0267504 28SrRNA:CR45844
#> -------
#> seqinfo: 29 sequences (1 circular) from BDGP6.46 genome
# }
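The output above shows that the returned GRanges can include unplaced scaffolds and the rDNA sequence alongside the main chromosomes. A minimal sketch of restricting such an object to selected sequences and converting it to a data frame is given below. It uses a small toy GRanges as a stand-in for `dm_genes$genes` so that it runs offline; the gene identifiers are made up, and the chromosome names in `main_chroms` are an assumption about how the main Drosophila sequences are labelled in this EnsDb.

```r
library(GenomicRanges)

# Toy stand-in for dm_genes$genes (identifiers are invented for illustration).
toy_genes <- GRanges(
  seqnames = c("2L", "211000022278436", "rDNA", "X"),
  ranges = IRanges(
    start = c(1000, 1551, 70389, 5000),
    end   = c(2000, 2815, 70511, 6500)
  ),
  strand = c("+", "-", "+", "-"),
  gene_id = c("toy_gene_1", "toy_gene_2", "toy_gene_3", "toy_gene_4")
)

# Assumed names of the main chromosome arms; adjust to the EnsDb in use.
main_chroms <- c("2L", "2R", "3L", "3R", "4", "X", "Y")

# Keep only genes located on the selected sequences.
main_genes <- toy_genes[as.character(seqnames(toy_genes)) %in% main_chroms]

# Flatten to a plain data frame for downstream, non-Bioconductor tools.
main_df <- as.data.frame(main_genes)
main_df
```

With the real result, `toy_genes` would simply be replaced by `dm_genes$genes`.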