Load genome-wide binding data for gene expression (RNA polymerase occupancy)

Reads RNA Polymerase DamID binding profiles either from bedGraph files or directly from a named list of GRanges objects. Calculates binding occupancy summarised over genes.

Usage

load_data_genes(
  binding_profiles_path = NULL,
  binding_profiles = NULL,
  quantile_norm = FALSE,
  organism = "drosophila melanogaster",
  calculate_fdr = FALSE,
  fdr_iterations = 50000,
  ensdb_genes = NULL,
  BPPARAM = BiocParallel::bpparam()
)

Arguments

binding_profiles_path: Character vector of directories or file globs containing log2 ratio binding tracks in bedGraph format. Wildcards ('*') supported.
binding_profiles: Named list of GRanges objects representing binding profiles.
quantile_norm: Logical (default: FALSE) quantile-normalise across all signal columns if TRUE.
organism: Organism string (lower case) to obtain genome annotation from (if not providing a custom `ensdb_genes` object) Defautls to "drosophila melanogaster".
calculate_fdr: Calculate FDR based on RNA Pol occupancy (see details) (default: FALSE)
fdr_iterations: Number of iterations to use to determine null model for FDR (default: 50000)
ensdb_genes: GRanges object: gene annotation. Automatically obtained from `organism` if NULL.
BPPARAM: BiocParallel function (defaults to BiocParallel::bpparam())

Value

List with elements:

binding_profiles_data: data.frame of merged binding profiles, with chr, start, end, sample columns.
occupancy: data.frame of occupancy values summarised over genes.
test_category: Character scalar; will be "expressed".

Details

One of `binding_profiles_path` or `binding_profiles` must be provided.

When supplying GRanges lists, each GRanges should contain exactly one numeric metadata column representing the signal, and `binding_profiles` must be a named list, with element names used as sample names.

The algorithm for determining gene occupancy FDR (as a proxy for gene expression) is based on `polii.gene.call`, which in turn was based on that described in Southall et al. (2013). Dev Cell, 26(1), 101–12. doi:10.1016/j.devcel.2013.05.020. Briefly, the algorithm establishes a null model by simulating the distribution of mean occupancy scores from random fragments. It fits a two-tiered regression to predict the False Discovery Rate (FDR), based on fragment count and score. For each gene, the true weighted mean occupancy and fragment count are calculated from the provided binding profile. Finally, the pre-computed regression models are used to assign a specific FDR to each gene based on its observed occupancy and fragment count.

Examples

# Create a mock GRanges object for gene annotations
# This object, based on the package's unit tests, avoids network access
# and includes a very long gene to ensure overlaps with sample data.
mock_genes_gr <- GenomicRanges::GRanges(
    seqnames = S4Vectors::Rle("2L", 7),
    ranges = IRanges::IRanges(
        start = c(1000, 2000, 3000, 5000, 6000, 7000, 8000),
        end = c(1500, 2500, 3500, 5500, 6500, 7500, 20000000)
    ),
    strand = S4Vectors::Rle(GenomicRanges::strand(c("+", "-", "+", "+", "-", "-", "+"))),
    gene_id = c("FBgn001", "FBgn002", "FBgn003", "FBgn004", "FBgn005", "FBgn006", "FBgn007"),
    gene_name = c("geneA", "geneB", "geneC", "geneD", "geneE", "geneF", "LargeTestGene")
)

# Get path to sample data files included with the package
data_dir <- system.file("extdata", package = "damidBind")

# Run loading function using sample files and mock gene annotations
# This calculates occupancy over genes instead of peaks.
loaded_data_genes <- load_data_genes(
    binding_profiles_path = data_dir,
    ensdb_genes = mock_genes_gr,
    quantile_norm = FALSE
)
#> Locating binding profile files
#> Building binding profile dataframe from input files ...
#>  - Loaded: Bsh_Dam_L4_r1-ext300-vs-Dam.kde-norm
#>  - Loaded: Bsh_Dam_L4_r2-ext300-vs-Dam.kde-norm
#>  - Loaded: Bsh_Dam_L5_r1-ext300-vs-Dam.kde-norm
#>  - Loaded: Bsh_Dam_L5_r2-ext300-vs-Dam.kde-norm
#> Calculating average occupancy for 7 regions...

# View the head of the occupancy table
head(loaded_data_genes$occupancy)
#>              gene_id gene_name         name nfrags
#> 2L:1000-1500 FBgn001     geneA 2L:1000-1500      4
#> 2L:2000-2500 FBgn002     geneB 2L:2000-2500      4
#> 2L:3000-3500 FBgn003     geneC 2L:3000-3500      4
#> 2L:5000-5500 FBgn004     geneD 2L:5000-5500      4
#> 2L:6000-6500 FBgn005     geneE 2L:6000-6500      2
#> 2L:7000-7500 FBgn006     geneF 2L:7000-7500      1
#>              Bsh_Dam_L4_r1-ext300-vs-Dam.kde-norm
#> 2L:1000-1500                             0.000000
#> 2L:2000-2500                             0.000000
#> 2L:3000-3500                             0.000000
#> 2L:5000-5500                             1.465469
#> 2L:6000-6500                             1.515389
#> 2L:7000-7500                            -0.130000
#>              Bsh_Dam_L4_r2-ext300-vs-Dam.kde-norm
#> 2L:1000-1500                             0.000000
#> 2L:2000-2500                             0.000000
#> 2L:3000-3500                             0.000000
#> 2L:5000-5500                             1.444371
#> 2L:6000-6500                             1.791916
#> 2L:7000-7500                            -0.110000
#>              Bsh_Dam_L5_r1-ext300-vs-Dam.kde-norm
#> 2L:1000-1500                            0.0000000
#> 2L:2000-2500                            0.0000000
#> 2L:3000-3500                            0.0000000
#> 2L:5000-5500                            1.2137325
#> 2L:6000-6500                            0.2373453
#> 2L:7000-7500                           -0.0300000
#>              Bsh_Dam_L5_r2-ext300-vs-Dam.kde-norm
#> 2L:1000-1500                            0.0000000
#> 2L:2000-2500                            0.0000000
#> 2L:3000-3500                            0.0000000
#> 2L:5000-5500                            0.8943313
#> 2L:6000-6500                            0.6489222
#> 2L:7000-7500                           -0.0900000