Skip to contents

Filters a list of genes to retain only those that meet a specified False Discovery Rate (FDR) threshold. The filtering is applied to one or more replicates within a single experimental condition, based on the `_FDR` columns which must be present in the data (e.g., as generated by `load_data_genes(calculate_fdr = TRUE)`).

Usage

filter_genes_by_fdr(data, fdr = 0.05, condition, which = "any")

Arguments

data

A `DamIDResults` object or the `list` returned by `load_data_genes()`. The data object must contain an `occupancy` table with FDR columns (e.g., `SampleA_FDR`).

fdr

A numeric value between 0 and 1 specifying the FDR cutoff. Loci with an FDR less than or equal to this value will be considered. (Default: 0.05)

condition

A character string that identifies the experimental condition to filter on. This string should uniquely match the relevant sample columns (e.g., "L4" will match "L4_rep1_FDR" and "L4_rep2_FDR"). If `data` is a `DamIDResults` object, this can be either the internal identifier or the display name for the condition.

which

A character string, either `"any"` (the default) or `"all"`.

  • If `"any"`, a gene is kept if it meets the `fdr` threshold in at least one replicate of the specified `condition`.

  • If `"all"`, a gene is kept only if it meets the `fdr` threshold in all replicates of the specified `condition`.

Value

A `data.frame` containing the `gene_name` and `gene_id` of the genes that passed the filter. If no genes pass, an empty `data.frame` is returned.

Details

This function is primarily used in workflows involving RNA Polymerase TaDa data, where an FDR is calculated for gene occupancy to determine if a gene is actively transcribed. It allows users to identify genes in a single condition that can be considered to be expressed (i.e. RNA Pol occupancy is significantly greater than background).

Note that while this is an effective proxy for gene expression, there are edge cases (e.g. paused polymerase, short genes directly adjacent to an expressed gene TSS or TES) where a gene may have significant occupancy but not, in fact, be transcribed.

The function locates the relevant FDR columns in the `occupancy` table by searching for column names that end with `_FDR` and also contain the `condition` string.

Examples

# Create a mock data object with an occupancy table containing FDR values,
# similar to the output of `load_data_genes(calculate_fdr = TRUE)`.
.create_mock_fdr_data <- function() {
    occupancy_df <- data.frame(
        gene_name = c("geneA", "geneB", "geneC", "geneD"),
        gene_id = c("FBgn01", "FBgn02", "FBgn03", "FBgn04"),
        L4_rep1_FDR = c(0.01, 0.10, 0.04, 0.06),
        L4_rep2_FDR = c(0.03, 0.02, 0.50, 0.07),
        L5_rep1_FDR = c(0.80, 0.90, 0.01, 0.02),
        row.names = c("geneA", "geneB", "geneC", "geneD")
    )
    list(occupancy = occupancy_df, test_category = "expressed")
}
mock_data <- .create_mock_fdr_data()

# Example 1: Get genes with FDR <= 0.05 in ANY L4 replicate.
# geneA (0.01, 0.03), geneB (0.02), and geneC (0.04) pass.
expressed_in_L4_any <- filter_genes_by_fdr(
    mock_data,
    fdr = 0.05,
    condition = "L4",
    which = "any"
)
#> Found 2 FDR columns for condition 'L4': L4_rep1_FDR, L4_rep2_FDR
#> 3 genes/loci passed the FDR <= 0.05 filter for condition 'L4' with rule 'any'.
print(expressed_in_L4_any)
#>       gene_name gene_id avg_occ min_fdr
#> geneA     geneA  FBgn01     NaN    0.01
#> geneB     geneB  FBgn02     NaN    0.02
#> geneC     geneC  FBgn03     NaN    0.04

# Example 2: Get genes with FDR <= 0.05 in ALL L4 replicates.
# Only geneA (0.01, 0.03) passes.
expressed_in_L4_all <- filter_genes_by_fdr(
    mock_data,
    fdr = 0.05,
    condition = "L4",
    which = "all"
)
#> Found 2 FDR columns for condition 'L4': L4_rep1_FDR, L4_rep2_FDR
#> 1 genes/loci passed the FDR <= 0.05 filter for condition 'L4' with rule 'all'.
print(expressed_in_L4_all)
#>       gene_name gene_id avg_occ min_fdr
#> geneA     geneA  FBgn01     NaN    0.01

# Example 3: Get genes with FDR <= 0.05 in any L5 replicate.
# geneC (0.01) and geneD (0.02) pass.
expressed_in_L5 <- filter_genes_by_fdr(
    mock_data,
    fdr = 0.05,
    condition = "L5",
    which = "any"
)
#> Found 1 FDR columns for condition 'L5': L5_rep1_FDR
#> 2 genes/loci passed the FDR <= 0.05 filter for condition 'L5' with rule 'any'.
print(expressed_in_L5)
#>       gene_name gene_id avg_occ min_fdr
#> geneC     geneC  FBgn03     NaN    0.01
#> geneD     geneD  FBgn04     NaN    0.02