Load genome-wide binding data and associated peak files or GRanges objects

Reads DamID-seq log2 ratio binding data either from bedGraph files or directly from a list of GRanges objects, and associated peak regions either from GFF/bed files or from a list of GRanges objects. This function is suitable for transcription factor binding analyses. For peak discovery, use an external peak caller (e.g. 'find_peaks').

Usage

load_data_peaks(
  binding_profiles_path = NULL,
  peaks_path = NULL,
  binding_profiles = NULL,
  peaks = NULL,
  maxgap_loci = 1000,
  quantile_norm = FALSE,
  organism = "drosophila melanogaster",
  ensdb_genes = NULL,
  BPPARAM = BiocParallel::bpparam()
)

Arguments

binding_profiles_path: Character vector. Path(s) to directories or file globs containing log2 ratio binding tracks in bedGraph format. Wildcards ('*') supported.
peaks_path: Character vector. Path(s) to directories or file globs containing the peak calls in GFF or BED format.
binding_profiles: List of GRanges objects with binding profiles, one per sample.
peaks: List of GRanges objects representing peak regions.
maxgap_loci: Integer, the maximum bp distance between a peak boundary and a gene to associate that peak with the gene. Default: 1000.
quantile_norm: Logical (default: FALSE). If TRUE, quantile-normalise the signal columns across all datasets.
organism: Organism string (lower case) to obtain genome annotation from (if not providing a custom `ensdb_genes` object) Default: "drosophila melanogaster".
ensdb_genes: GRanges object: gene annotation. Automatically obtained from `organism` if NULL.
BPPARAM: BiocParallel function (defaults to BiocParallel::bpparam())

Value

A list with components:

binding_profiles_data: data.frame: Signal matrix for all regions, with columns chr, start, end, sample columns.
peaks: list(GRanges): All loaded peak regions from input files or directly supplied.
pr: GRanges: Reduced (union) peak regions across samples.
occupancy: data.frame: Binding values summarised over reduced peaks, with overlap annotations.
test_category: Character scalar; will be "bound".

Details

One of `binding_profiles_path` or `binding_profiles` must be provided. Similarly, one of `peaks_path` or `peaks` must be provided.

When supplying GRanges lists, each GRanges should contain exactly one numeric metadata column representing the binding signal, and all GRanges should be supplied as a named list, with element names used as sample names.

Examples

# Create a mock GRanges object for gene annotation
# This object, based on the package's unit tests, avoids network access
# and includes a very long gene to ensure overlaps with sample data.
mock_genes_gr <- GenomicRanges::GRanges(
    seqnames = S4Vectors::Rle("2L", 7),
    ranges = IRanges::IRanges(
        start = c(1000, 2000, 3000, 5000, 6000, 7000, 8000),
        end = c(1500, 2500, 3500, 5500, 6500, 7500, 20000000)
    ),
    strand = S4Vectors::Rle(GenomicRanges::strand(c("+", "-", "+", "+", "-", "-", "+"))),
    gene_id = c("FBgn001", "FBgn002", "FBgn003", "FBgn004", "FBgn005", "FBgn006", "FBgn007"),
    gene_name = c("geneA", "geneB", "geneC", "geneD", "geneE", "geneF", "LargeTestGene")
)

# Get path to sample data files included with the package
data_dir <- system.file("extdata", package = "damidBind")

# Run loading function using sample files and mock gene annotations
loaded_data <- load_data_peaks(
    binding_profiles_path = data_dir,
    peaks_path = data_dir,
    ensdb_genes = mock_genes_gr,
    quantile_norm = TRUE
)
#> Locating binding profile files
#> Building binding profile dataframe from input files ...
#>  - Loaded: Bsh_Dam_L4_r1-ext300-vs-Dam.kde-norm
#>  - Loaded: Bsh_Dam_L4_r2-ext300-vs-Dam.kde-norm
#>  - Loaded: Bsh_Dam_L5_r1-ext300-vs-Dam.kde-norm
#>  - Loaded: Bsh_Dam_L5_r2-ext300-vs-Dam.kde-norm
#> Applying quantile normalisation
#> Locating peak files
#> Calculating occupancy over peaks
#> Calculating average occupancy for 1208 regions...

# View the structure of the output
str(loaded_data, max.level = 1)
#> List of 5
#>  $ binding_profiles_data:Formal class 'GRanges' [package "GenomicRanges"] with 7 slots
#>  $ peaks                :List of 4
#>  $ pr                   :Formal class 'GRanges' [package "GenomicRanges"] with 7 slots
#>  $ occupancy            :'data.frame':	1208 obs. of  9 variables:
#>  $ test_category        : chr "bound"