Filter genes by FDR within a specific condition — filter_genes_by

Filters a list of genes to retain only those that meet a specified False Discovery Rate (FDR) threshold. If the input is a `DamIDResults` object and a combined condition-level FDR has been calculated, that value is used. Otherwise, the function falls back to filtering against individual replicates.

Usage

filter_genes_by_fdr(data, fdr = 0.05, condition, which = "any")

Arguments

data

A `DamIDResults` object or the `list` returned by `load_data_genes()`.

fdr

A numeric value between 0 and 1 specifying the FDR cutoff. (Default: 0.05)

condition

A character string identifying the experimental condition. This string should uniquely match the relevant sample columns (e.g., "L4" will match "L4_rep1_FDR" and "L4_rep2_FDR"). If `data` is a `DamIDResults` object, this can be either the internal identifier or the display name for the condition.

which

A character string, either `"any"` or `"all"`. Only applicable when falling back to individual replicate scores. (Default: `"any"`)

If `"any"`, a gene is kept if it meets the `fdr` threshold in at least one replicate of the specified `condition`.
If `"all"`, a gene is kept only if it meets the `fdr` threshold in all replicates of the specified `condition`.

Value

A `data.frame` containing the `gene_name`, `gene_id`, `avg_occ`, and the most significant FDR value found.

Details

This function is primarily used in workflows involving RNA Polymerase TaDa data, where an FDR is calculated for gene occupancy to determine if a gene is actively transcribed. It allows users to identify genes in a single condition that can be considered to be expressed (i.e. RNA Pol occupancy is significantly greater than background).

Note that while this is an effective proxy for gene expression, there are edge cases (e.g. paused polymerase, short genes directly adjacent to an expressed gene TSS or TES) where a gene may have significant occupancy but not, in fact, be transcribed.

The function locates the relevant FDR columns in the `occupancy` table by searching for column names that end with `_FDR` and also contain the `condition` string.

Examples

# Create a mock data object with an occupancy table containing FDR values,
# similar to the output of `load_data_genes(calculate_occupancy_pvals = TRUE)`.
.generate_fdr_example_results <- function() {
    occupancy_df <- data.frame(
        gene_name = c("geneA", "geneB", "geneC"),
        gene_id = c("FBgn01", "FBgn02", "FBgn03"),
        L4_rep1 = c(1.5, 0.2, 0.8),
        L4_rep2 = c(1.7, 0.9, 0.1),
        L5_rep1 = c(0.1, 0.1, 2.0),
        L4_rep1_FDR = c(0.01, 0.10, 0.04),
        L4_rep2_FDR = c(0.03, 0.02, 0.50),
        L5_rep1_FDR = c(0.80, 0.90, 0.01),
        row.names = c("geneA", "geneB", "geneC")
    )
    diff_results_base <- list(
        occupancy = occupancy_df,
        test_category = "expressed",
        matched_samples = list("L4" = c("L4_rep1", "L4_rep2"), "L5" = "L5_rep1")
    )
    new("DamIDResults",
        analysis = data.frame(row.names = rownames(occupancy_df)),
        upCond1 = data.frame(),
        upCond2 = data.frame(),
        cond = c("L4 mock" = "L4", "L5 mock" = "L5"),
        data = diff_results_base
    )
}
mock_data <- .generate_fdr_example_results()

# Example 1: Get genes with FDR <= 0.05 in ANY L4 replicate.
# geneA (0.01, 0.03), geneB (0.02), and geneC (0.04) pass.
expressed_in_L4_any <- filter_genes_by_fdr(
    mock_data,
    fdr = 0.05,
    condition = "L4",
    which = "any"
)
#> 3 genes passed the FDR filter using rule 'any'.
print(expressed_in_L4_any)
#>       gene_name gene_id avg_occ fdr_val
#> geneA     geneA  FBgn01    1.60    0.01
#> geneB     geneB  FBgn02    0.55    0.02
#> geneC     geneC  FBgn03    0.45    0.04

# Example 2: Get genes with FDR <= 0.05 in ALL L4 replicates.
# Only geneA (0.01, 0.03) passes.
expressed_in_L4_all <- filter_genes_by_fdr(
    mock_data,
    fdr = 0.05,
    condition = "L4",
    which = "all"
)
#> 1 genes passed the FDR filter using rule 'all'.
print(expressed_in_L4_all)
#>       gene_name gene_id avg_occ fdr_val
#> geneA     geneA  FBgn01     1.6    0.01

# Example 3: Get genes with FDR <= 0.05 in any L5 replicate.
# geneC (0.01) and geneD (0.02) pass.
expressed_in_L5 <- filter_genes_by_fdr(
    mock_data,
    fdr = 0.05,
    condition = "L5",
    which = "any"
)
#> 1 genes passed the FDR filter using rule 'any'.
print(expressed_in_L5)
#>       gene_name gene_id avg_occ fdr_val
#> geneC     geneC  FBgn03       2    0.01