Perform Gene Ontology (GO) Enrichment Analysis for Differentially Bound/Expressed Regions
Source:R/analyse_go_terms.R
analyse_go_enrichment.Rd
This function performs Gene Ontology (GO) enrichment analysis using `clusterProfiler` for either the upCond1ulated or upCond2ulated regions/genes identified by `differential_binding()` or `differential_accessibility()`. It automatically extracts the relevant Flybase IDs (FBgnIDs) and the background universe from the input `DamIDResults` object.
Usage
analyse_go_enrichment(
diff_results,
direction = "cond1",
org_db = org.Dm.eg.db::org.Dm.eg.db,
ontology = "BP",
pvalue_cutoff = 0.05,
qvalue_cutoff = 0.2,
plot_title = NULL,
show_category = 12,
label_format_width = 30,
save = NULL,
save_results_path = NULL,
maxGSSize = 1000,
minGSSize = 10
)
Arguments
- diff_results
A `DamIDResults` object, as returned by `differential_binding()` or `differential_accessibility()`.
- direction
Character string. Specifies which set of genes to analyse, either using condition names, "cond1" or "cond2", or "all" (for all significantly enriched genes from either direction). Default is "cond1".
- org_db
An OrgDb object specifying the organism's annotation database. For Drosophila, use `org.Dm.eg.db::org.Dm.eg.db`.
- ontology
Character string. The GO ontology to use: "BP" (Biological Process), "MF" (Molecular Function), or "CC" (Cellular Component). Default is "BP".
- pvalue_cutoff
Numeric. Adjusted p-value cutoff for significance. Default: 0.05.
- qvalue_cutoff
Numeric. Q-value cutoff for significance. Default: 0.2.
- plot_title
Character string. Title for the generated dot plot.
- show_category
Integer. Number of top enriched GO categories to display in the plot. Default: 12.
- label_format_width
Integer. Max character length for GO term labels on the plot. Default: 30.
- save
List or `NULL`. Controls saving the plot to a file (dot plot). If `NULL`, `FALSE`, or `0`, the plot is not saved. If a `list`, it specifies saving parameters:
filename
(character): The path and base name for the output file. If not specified, the default name "damidBind_GSEA_dotplot" is used.format
(character): File format ("pdf", "svg", or "png"). Default is "pdf".width
(numeric): Width of the plot in inches. Default is 6.height
(numeric): Height of the plot in inches. Default is 6.
- save_results_path
Character string or NULL. If a path is provided (e.g., "go_results.csv"), the enrichment results table will be saved to this CSV file.
- maxGSSize
Integer. Maximum size of gene sets to consider. Default: 1000.
- minGSSize
Integer. Minimum size of gene sets to consider. Default: 10.
Value
A list containing:
- enrich_go_object
`enrichResult` object from `clusterProfiler`.
- results_table
Data frame of enrichment results.
- dot_plot
`ggplot` object of the dot plot.
NULL if no significant enrichment is found or if input validation fails.
Details
This function assumes that the `analysis` slot in the `diff_results` object contains a `gene_ids` column. If this column is not present, or cannot be processed, the function will return NULL.
The function includes an internal helper `clean_gene_symbols` which filters common ambiguous gene symbols (snoRNA, snRNA, tRNA) that may not be useful for GO enrichment.
Examples
# This example requires the 'org.Dm.eg.db' package
if (requireNamespace("org.Dm.eg.db", quietly = TRUE)) {
# Helper function to create a sample DamIDResults object
.generate_example_results <- function() {
# Define a mock gene object. Note: Real, mappable FlyBase IDs are
# used for the 'gene_id' column to ensure the example runs.
mock_genes_gr <- GenomicRanges::GRanges(
seqnames = S4Vectors::Rle("2L", 7),
ranges = IRanges::IRanges(
start = c(1000, 2000, 3000, 5000, 6000, 7000, 8000),
end = c(1500, 2500, 3500, 5500, 6500, 7500, 20000000)
),
gene_id = c(
"FBgn0034439", "FBgn0031267", "FBgn0051138", "FBgn0031265",
"FBgn0004655", "FBgn0000251", "FBgn0000252"
),
gene_name = c("ap", "dpr1", "side", "dpr2", "eg", "bi", "br")
)
data_dir <- system.file("extdata", package = "damidBind")
loaded_data <- load_data_peaks(
binding_profiles_path = data_dir,
peaks_path = data_dir,
ensdb_genes = mock_genes_gr,
quantile_norm = TRUE
)
diff_results <- differential_binding(
loaded_data,
cond = c("L4", "L5"),
cond_names = c("L4 Neurons", "L5 Neurons")
)
return(diff_results)
}
diff_results <- .generate_example_results()
# Run GO Enrichment for genes enriched in the first condition ("L4")
# Note: with tiny sample data, this may not find significant terms.
go_results <- analyse_go_enrichment(
diff_results,
direction = "L4",
org_db = org.Dm.eg.db::org.Dm.eg.db
)
# Print the results table if any enrichment was found
if (!is.null(go_results)) {
print(go_results$results_table)
}
}
#> Locating binding profile files
#> Building binding profile dataframe from input files ...
#> - Loaded: Bsh_Dam_L4_r1-ext300-vs-Dam.kde-norm
#> - Loaded: Bsh_Dam_L4_r2-ext300-vs-Dam.kde-norm
#> - Loaded: Bsh_Dam_L5_r1-ext300-vs-Dam.kde-norm
#> - Loaded: Bsh_Dam_L5_r2-ext300-vs-Dam.kde-norm
#> Applying quantile normalisation
#> Locating peak files
#> Calculating occupancy over peaks
#> Calculating average occupancy per region ...
#> Differential analysis setup:
#> Condition 1: 'L4' (display as 'L4 Neurons')
#> Found 2 replicates:
#> Bsh_Dam_L4_r1-ext300-vs-Dam.kde-norm_qnorm
#> Bsh_Dam_L4_r2-ext300-vs-Dam.kde-norm_qnorm
#> Condition 2: 'L5' (display as 'L5 Neurons')
#> Found 2 replicates:
#> Bsh_Dam_L5_r1-ext300-vs-Dam.kde-norm_qnorm
#> Bsh_Dam_L5_r2-ext300-vs-Dam.kde-norm_qnorm
#> limma contrasts: L4-L5
#>
#> 277 loci enriched in L4 Neurons
#> Highest-ranked genes:
#> br, br, br, br, br, br, br, br, br, br
#>
#> 173 loci enriched in L5 Neurons
#> Highest-ranked genes:
#> br, br, br, br, br, br, br, br, br, br
#> Universe of Flybase IDs is too small or invalid for enrichment analysis. Returning NULL.