Skip to contents

This function calculates a p-value for gene occupancy scores. When adjusted for multiple-hypothesis testing, the resulting FDR may used as a proxy for determining whether a gene is actively expressed in RNA Polymerase TaDa experiments. The function applies a permutation-based null model to each sample's binding profile to determine empirical p-values, and returns these as new columns in the input occupancy dataframe.

These p-values are then aggregated and

Usage

calculate_and_add_occupancy_pvals(
  binding_data,
  occupancy_df,
  null_model_iterations = 1e+05,
  return_per_replicate_fdr = FALSE,
  plot_diagnostics = FALSE,
  BPPARAM = BiocParallel::bpparam(),
  seed = NULL
)

Arguments

binding_data

A `GRanges` object of binding profiles, where metadata columns represent samples. This is typically the `binding_profiles_data` element from the list returned by `load_data_genes`.

occupancy_df

A data frame of gene occupancies, typically the `occupancy` element from the list returned by `load_data_genes`. It must contain sample columns and a `nfrags` column.

null_model_iterations

Integer. The number of sampling iterations to build the FDR null model. Higher values are more accurate but slower. Default is 100000.

return_per_replicate_fdr

Logical. Returns individual FDR scores by applying BH adjustment on each individual sample. Use this option if not intending to apply downstream condition-level p-value aggregation via `differential_binding()`. (Default: FALSE)

plot_diagnostics

Logical. If TRUE, will plot the Tier 2 regression fits for the p-value prediction slope, intercept and mean squared error (MSE). (Default: FALSE)

BPPARAM

A `BiocParallelParam` instance specifying the parallel backend to use for computation. See `BiocParallel::bpparam()`.

seed

An optional integer. If provided, it is used to set the random seed before the sampling process begins, ensuring that the FDR calculations are fully reproducible. If `NULL` (default), results may vary between runs.

Value

An updated version of the input `occupancy_df` data frame, with new columns appended. For each sample column (e.g., `SampleA`), a corresponding p-value column (e.g., `SampleA_pval`) is added.

Details

The algorithm used here is substantially revised and adapted from the method described in Southall et al. (2013), Dev Cell, 26(1), 101-12. The broad principle is as follows:

  1. For each sample, a null distribution of mean occupancy scores is simulated by repeatedly sampling random fragments from the genome-wide binding profile. This is done for various numbers of fragments.

  2. A two-tiered regression model is fitted to the simulation results. This creates a statistical model that can predict the empirical p-value for any given occupancy score and fragment count.

  3. This predictive model is then applied to the actual observed mean occupancy and fragment count for each gene in the `occupancy_df`.

  4. The final set of p-values is adjusted using BH correction to generate FDR scores for each gene.

  5. The FDR value for each gene in each sample is appended to the `occupancy_df` in a new column.

Key differences from the original Southall et al. algorithm include fitting natural spline models, using WLS to account for heterscedasticity of models, and sampling with replacement to generate the underlying null distribution.

This function is typically not called directly by the user, but is instead invoked via `load_data_genes(calculate_occupancy_pvals = TRUE)`.

See also

[load_data_genes()] which uses this function internally.

Examples


# Prepare sample binding data (GRanges object)
# Here, we'll load one of the sample files included with the package
# (this is a TF binding profile, which would not normally be used for
# occupancy calculations)
data_dir <- system.file("extdata", package = "damidBind")
bgraph_file <- file.path(data_dir, "Bsh_Dam_L4_r1-ext300-vs-Dam.kde-norm.gatc.2L.bedgraph.gz")

# The internal function `import_bedgraph_as_df` is used here for convenience.
# The column name 'score' will be replaced with the sample name.
binding_df <- damidBind:::import_bedgraph_as_df(bgraph_file, colname = "Sample_1")

binding_gr <- GenomicRanges::GRanges(
    seqnames = binding_df$chr,
    ranges = IRanges::IRanges(start = binding_df$start, end = binding_df$end),
    Sample_1 = binding_df$Sample_1
)

# Create a mock occupancy data frame.
# In a real analysis, this would be generated by `calculate_occupancy()`.
mock_occupancy <- data.frame(
    name = c("geneA", "geneB", "geneC"),
    nfrags = c(5, 10, 2),
    Sample_1 = c(1.5, 0.8, -0.2),
    row.names = c("geneA", "geneB", "geneC")
)

# Calculate pvals.
# We use a low null_model_iterations for speed, and a seed for reproducibility.
occupancy_with_pvals <- calculate_and_add_occupancy_pvals(
    binding_data = binding_gr,
    occupancy_df = mock_occupancy,
    null_model_iterations = 100,
    BPPARAM = BiocParallel::SerialParam(),
    seed = 42
)
#> 
#> Building occupancy models ...
#>  - Sample_1
#>    Thresholds: 0.22 0.44 0.59 0.79 1.03 1.33 1.68 2.17 2.89 3.31
#> Occupancy models prepared for all samples.
#> Calculating occupancy p-values ...
#>  - Sample_1
#> FDR calculation complete.

# View the result, which now includes a _pval column for Sample_1.
print(occupancy_with_pvals)
#>        name nfrags Sample_1 Sample_1_pval
#> geneA geneA      5      1.5    0.06271075
#> geneB geneB     10      0.8    0.15605316
#> geneC geneC      2     -0.2    0.48531928