Calculate and add gene occupancy FDR

This function calculates a False Discovery Rate (FDR) for gene occupancy scores, which is often used as a proxy for determining whether a gene is actively expressed in RNA Polymerase TaDa experiments. It applies a permutation-based null model to each sample's binding profile and adds the resulting FDR values as new columns to the occupancy data frame.

Usage

calculate_and_add_fdr(
  binding_data,
  occupancy_df,
  fdr_iterations = 50000,
  BPPARAM = BiocParallel::bpparam(),
  seed = NULL
)

Arguments

binding_data: A `GRanges` object of binding profiles, where metadata columns represent samples. This is typically the `binding_profiles_data` element from the list returned by `load_data_genes`.
occupancy_df: A data frame of gene occupancies, typically the `occupancy` element from the list returned by `load_data_genes`. It must contain sample columns and a `nfrags` column.
fdr_iterations: Integer. The number of sampling iterations to build the FDR null model. Higher values are more accurate but slower. Default is 50000. For quick tests, a much lower value (e.g., 100) can be used.
BPPARAM: A `BiocParallelParam` instance specifying the parallel backend to use for computation. See `BiocParallel::bpparam()`. For full reproducibility, especially in examples or tests, use `BiocParallel::SerialParam()`.
seed: An optional integer. If provided, it is used to set the random seed before the sampling process begins, ensuring that the FDR calculations are fully reproducible. If `NULL` (default), results will vary between runs.

Value

An updated version of the input `occupancy_df` data frame, with new columns appended. For each sample column (e.g., `SampleA`), a corresponding FDR column (e.g., `SampleA_FDR`) is added.

Details

The FDR calculation algorithm is adapted from the method described in Southall et al. (2013), Dev Cell, 26(1), 101-12, and implemented in the `polii.gene.call` tool. It operates in several stages:

For each sample, a null distribution of mean occupancy scores is simulated by repeatedly sampling random fragments from the genome-wide binding profile. This is done for various numbers of fragments.
A two-tiered regression model is fitted to the simulation results. This creates a statistical model that can predict the FDR for any given occupancy score and fragment count.
This predictive model is then applied to the actual observed mean occupancy and fragment count for each gene in the `occupancy_df`.
The calculated FDR value for each gene in each sample is appended to the `occupancy_df` in a new column.

This function is typically not called directly by the user, but is instead invoked via `load_data_genes(calculate_fdr = TRUE)`. Exporting it allows for advanced use cases where FDR needs to be calculated on a pre-existing occupancy table.

Examples

if (requireNamespace("BiocParallel", quietly = TRUE)) {
    # Prepare sample binding data (GRanges object)
    # Here, we'll load one of the sample files included with the package
    # (this is a TF binding profile, which would not normally be used for
    # occupancy calculations)
    data_dir <- system.file("extdata", package = "damidBind")
    bgraph_file <- file.path(data_dir, "Bsh_Dam_L4_r1-ext300-vs-Dam.kde-norm.gatc.2L.bedgraph.gz")

    # The internal function `import_bedgraph_as_df` is used here for convenience.
    # The column name 'score' will be replaced with the sample name.
    binding_df <- damidBind:::import_bedgraph_as_df(bgraph_file, colname = "Sample_1")

    binding_gr <- GenomicRanges::GRanges(
        seqnames = binding_df$chr,
        ranges = IRanges::IRanges(start = binding_df$start, end = binding_df$end),
        Sample_1 = binding_df$Sample_1
    )

    # Create a mock occupancy data frame.
    # In a real analysis, this would be generated by `calculate_occupancy()`.
    mock_occupancy <- data.frame(
        name = c("geneA", "geneB", "geneC"),
        nfrags = c(5, 10, 2),
        Sample_1 = c(1.5, 0.8, -0.2),
        row.names = c("geneA", "geneB", "geneC")
    )

    # Calculate FDR.
    # We use a low fdr_iterations for speed, and a seed for reproducibility.
    # BiocParallel::SerialParam() is used to ensure deterministic, single-core execution.
    occupancy_with_fdr <- calculate_and_add_fdr(
        binding_data = binding_gr,
        occupancy_df = mock_occupancy,
        fdr_iterations = 100,
        BPPARAM = BiocParallel::SerialParam(),
        seed = 42
    )

    # View the result, which now includes an FDR column for Sample_1.
    print(occupancy_with_fdr)
}
#> Building FDR models ...
#>  - Sample_1
#> FDR models prepared for all samples.
#> Calculating FDR ...
#>  - Sample_1
#> FDR calculation complete.
#>        name nfrags Sample_1 Sample_1_FDR
#> geneA geneA      5      1.5   0.01823004
#> geneB geneB     10      0.8   0.04537998
#> geneC geneC      2     -0.2   0.57182762

Usage

Arguments

Value

Details

See also

Examples