Skip to contents

Reads DamID-seq log2 ratio binding data either from bedGraph files or directly from a list of GRanges objects, and associated peak regions either from GFF/bed files or from a list of GRanges objects. This function is suitable for transcription factor binding analyses. For peak discovery, use an external peak caller (e.g. 'find_peaks').

Usage

load_data_peaks(
  binding_profiles_path = NULL,
  peaks_path = NULL,
  binding_profiles = NULL,
  peaks = NULL,
  drop_samples = NULL,
  maxgap_loci = 1000,
  quantile_norm = FALSE,
  organism = "drosophila melanogaster",
  ensdb_genes = NULL,
  BPPARAM = BiocParallel::bpparam(),
  plot_diagnostics = interactive()
)

Arguments

binding_profiles_path

Character vector. Path(s) to directories or file globs containing log2 ratio binding tracks in bedGraph format. Wildcards ('*') supported.

peaks_path

Character vector. Path(s) to directories or file globs containing the peak calls in GFF or BED format.

binding_profiles

List of GRanges objects with binding profiles, one per sample.

peaks

List of GRanges objects representing peak regions.

drop_samples

A character vector of sample names or patterns to remove. Matching samples are removed from the analysis before normalisation and occupancy calculation. This can be useful for excluding samples that fail initial quality checks. Default: `NULL` (no samples are dropped).

maxgap_loci

Integer, the maximum bp distance between a peak boundary and a gene to associate that peak with the gene. Default: 1000.

quantile_norm

Logical (default: FALSE). If TRUE, quantile-normalise the signal columns across all datasets.

organism

Organism string (lower case) to obtain genome annotation from (if not providing a custom `ensdb_genes` object) Default: "drosophila melanogaster".

ensdb_genes

GRanges object: gene annotation. Automatically obtained from `organism` if NULL.

BPPARAM

BiocParallel function (defaults to BiocParallel::bpparam())

plot_diagnostics

Logical. If `TRUE` (the default in interactive sessions), diagnostic plots (PCA and correlation heatmap) will be generated and displayed for both the raw binding data and the summarised occupancy data.

Value

A list with components:

binding_profiles_data

data.frame: Signal matrix for all regions, with columns chr, start, end, sample columns.

peaks

list(GRanges): All loaded peak regions from input files or directly supplied.

pr

GRanges: Reduced (union) peak regions across samples.

occupancy

data.frame: Binding values summarised over reduced peaks, with overlap annotations.

test_category

Character scalar; will be "bound".

Details

One of `binding_profiles_path` or `binding_profiles` must be provided. Similarly, one of `peaks_path` or `peaks` must be provided.

When supplying GRanges lists, each GRanges should contain exactly one numeric metadata column representing the binding signal, and all GRanges should be supplied as a named list, with element names used as sample names.

Examples

# Create a mock GRanges object for gene annotation
# This object, based on the package's unit tests, avoids network access
# and includes a very long gene to ensure overlaps with sample data.
mock_genes_gr <- GenomicRanges::GRanges(
    seqnames = S4Vectors::Rle("2L", 7),
    ranges = IRanges::IRanges(
        start = c(1000, 2000, 3000, 5000, 6000, 7000, 8000),
        end = c(1500, 2500, 3500, 5500, 6500, 7500, 20000000)
    ),
    strand = S4Vectors::Rle(GenomicRanges::strand(c("+", "-", "+", "+", "-", "-", "+"))),
    gene_id = c("FBgn001", "FBgn002", "FBgn003", "FBgn004", "FBgn005", "FBgn006", "FBgn007"),
    gene_name = c("geneA", "geneB", "geneC", "geneD", "geneE", "geneF", "LargeTestGene")
)

# Get path to sample data files included with the package
data_dir <- system.file("extdata", package = "damidBind")

# Run loading function using sample files and mock gene annotations
loaded_data <- load_data_peaks(
    binding_profiles_path = data_dir,
    peaks_path = data_dir,
    ensdb_genes = mock_genes_gr,
    quantile_norm = TRUE
)
#> Locating binding profile files
#> Building binding profile dataframe from input files ...
#>  - Loaded: Bsh_Dam_L4_r1-ext300-vs-Dam.kde-norm
#>  - Loaded: Bsh_Dam_L4_r2-ext300-vs-Dam.kde-norm
#>  - Loaded: Bsh_Dam_L5_r1-ext300-vs-Dam.kde-norm
#>  - Loaded: Bsh_Dam_L5_r2-ext300-vs-Dam.kde-norm
#> Locating peak files
#> Applying quantile normalisation
#> Calculating occupancy over peaks
#> Calculating average occupancy for 1208 regions...
#> Generating diagnostic plots...



# View the structure of the output
str(loaded_data, max.level = 1)
#> List of 5
#>  $ binding_profiles_data:Formal class 'GRanges' [package "GenomicRanges"] with 7 slots
#>  $ peaks                :List of 4
#>  $ pr                   :Formal class 'GRanges' [package "GenomicRanges"] with 7 slots
#>  $ occupancy            :'data.frame':	1208 obs. of  9 variables:
#>  $ test_category        : chr "bound"