Extract unique sample names from complex labels — extract_unique_sample_ids

This function takes a character vector of complex sample labels and iteratively builds a simplified, unique name for each. It splits every label into blocks, identifies the blocks that differ across the sample set, and adds them to a base name one at a time until the base name combined with the replicate identifier is unique for every sample.
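The procedure described above can be sketched in base R roughly as follows. This is only an illustration under stated assumptions, not the package's implementation: the name `sketch_unique_ids` is hypothetical, every label is assumed to split into the same number of blocks, and blocks are joined with `_` because that is the separator seen in the example output.

```r
# Hypothetical sketch of the iterative naming idea; not the package's code.
sketch_unique_ids <- function(sample_names,
                              delimiter = "[-_\\.]",
                              replicate_pattern = "^(n|N|r|rep|replicate|sample)\\d+") {
  blocks <- strsplit(sample_names, delimiter)
  # Assumption: all labels split into the same number of blocks.
  stopifnot(length(unique(lengths(blocks))) == 1)
  mat <- do.call(rbind, blocks)  # one row per sample, one column per block

  # Replicate identifier: the first block of each label matching the pattern.
  reps <- apply(mat, 1, function(row) {
    hit <- grep(replicate_pattern, row, value = TRUE)
    if (length(hit) > 0) hit[1] else NA_character_
  })

  # Candidate columns: blocks that vary across samples and are not
  # themselves the replicate column.
  varies <- apply(mat, 2, function(col) length(unique(col)) > 1)
  is_rep <- apply(mat, 2, function(col) all(grepl(replicate_pattern, col)))
  candidates <- which(varies & !is_rep)

  base <- rep("", length(sample_names))
  ids <- sample_names
  for (j in candidates) {
    # Add one differing block at a time to the base name.
    base <- ifelse(base == "", mat[, j], paste(base, mat[, j], sep = "_"))
    ids <- paste(base, reps, sep = "_")
    if (!anyDuplicated(ids)) break  # stop as soon as every name is unique
  }

  # Fall back to the original label when no replicate block was found
  # or when the name is still not unique.
  dup <- duplicated(ids) | duplicated(ids, fromLast = TRUE)
  failed <- is.na(reps) | dup
  ids[failed] <- sample_names[failed]
  ids
}

ids <- sketch_unique_ids(c("A_ctrl-n1-x1", "A_ctrl-n2-x2", "A_trt-n1-x3"))
ids  # "ctrl_n1" "ctrl_n2" "trt_n1"
```

In this toy input the second block (`ctrl`/`trt`) is enough to make the names unique once the replicate block is appended, so the third differing block (`x1`, `x2`, `x3`) is never added.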

Usage

extract_unique_sample_ids(
  sample_names,
  delimiter = "[-_\\.]",
  replicate_pattern = "^(n|N|r|rep|replicate|sample)\\d+"
)

Arguments

sample_names: A character vector of sample labels.
delimiter: A regular expression used as a delimiter to split labels into blocks. (Default: `[-_\.]`)
replicate_pattern: A regular expression used to identify the replicate block. (Default: `^(n|N|r|rep|replicate|sample)\d+`)
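
To see what the two default patterns do to a single label, the base R calls below apply them directly. This only illustrates the regular expressions; it assumes nothing about how the function uses them internally, and the variable names are chosen for the example.

```r
label <- "RNAPII_elav-GSE77860-n1-SRR3164378-2017-vs-Dam.scaled.kde-norm"
delimiter <- "[-_\\.]"
replicate_pattern <- "^(n|N|r|rep|replicate|sample)\\d+"

# Split the label on any of "-", "_" or "."
blocks <- strsplit(label, delimiter)[[1]]
length(blocks)  # 11 blocks: "RNAPII", "elav", "GSE77860", "n1", ...

# Find the block(s) that look like a replicate identifier
is_replicate <- grepl(replicate_pattern, blocks)
rep_block <- blocks[is_replicate]
rep_block       # "n1"
```

The default `replicate_pattern` is anchored only at the start of a block, so a block such as `n1` matches while `norm` does not, because the prefix must be followed immediately by at least one digit.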

Value

A character vector of simplified, unique names. If a unique name cannot be formed for a sample, or essential information is missing from its label, the original label for that sample is returned unchanged as a fallback.
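
Because a fallback is returned silently, it can be useful to check which samples kept their original label. The helper below is a minimal sketch: `flag_fallbacks` is a hypothetical name, the input vectors are made-up illustrative values rather than real output of the function, and it assumes the result has the same length and order as the input.

```r
# Hypothetical helper: TRUE where the simplified name equals the original label.
flag_fallbacks <- function(original, simplified) {
  # Assumption: one simplified name per input label, in the same order.
  stopifnot(length(original) == length(simplified))
  original == simplified
}

# Illustrative values only, not real output of extract_unique_sample_ids().
original   <- c("ctrl-n1-a", "ctrl-n2-b", "unlabelled")
simplified <- c("ctrl_n1", "ctrl_n2", "unlabelled")

fell_back <- flag_fallbacks(original, simplified)
original[fell_back]  # "unlabelled"
```

A label that is already as simple as its simplified form would also be flagged, so the result is best treated as a list of samples to inspect by hand.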

Examples

labels <- c(
  "RNAPII_elav-GSE77860-n1-SRR3164378-2017-vs-Dam.scaled.kde-norm",
  "RNAPII_elav-GSE77860-n2-SRR3164379-2017-vs-Dam.scaled.kde-norm",
  "RNAPII_elav-GSE77860-n4-SRR3164380-2017-vs-Dam.scaled.kde-norm",
  "RNAPII_Wor-GSE77860-n1-SRR3164346-2017-vs-Dam.scaled.kde-norm",
  "RNAPII_Wor-GSE77860-n2-SRR3164347-2017-vs-Dam.scaled.kde-norm",
  "RNAPII_Wor-GSE77860-sample1-SRR2038537-2017-vs-Dam.scaled.kde-norm"
)
extract_unique_sample_ids(labels)
#> [1] "elav_n1"     "elav_n2"     "elav_n4"     "Wor_n1"      "Wor_n2"     
#> [6] "Wor_sample1"