Sample data points based on local isolation — sample_labels_by

An issue with labelling points on dense plots (e.g., volcano plots) is that high point density prevents clear labelling, even with tools like `ggrepel`. This function addresses this by retaining isolated points while sampling from points in higher-density regions. It takes a dataframe with Cartesian coordinates and returns a logical vector identifying which points to select for labelling. The result is a less cluttered plot where labels are present even in crowded areas, providing a better representation of the underlying data.

Usage

sample_labels_by_isolation(
  df,
  x_col,
  y_col,
  r,
  k_priority = 30,
  scale = TRUE,
  k_search = 30,
  k_for_r = 5
)

Arguments

df: A dataframe containing the point coordinates.
x_col: A character string with the name of the column containing x-coordinates.
y_col: A character string with the name of the column containing y-coordinates.
r: The exclusion radius. This can be a positive numeric value or the string "auto". If "auto", the radius is calculated as the median distance to the `k_for_r`-th nearest neighbour across all points. A smaller `r` will result in more points being kept. Note: The interpretation of `r` depends on whether `scale` is `TRUE`.
k_priority: An integer for calculating the isolation priority score. Must be less than or equal to `k_search`. Default: 30.
scale: A logical value. If `TRUE`, the coordinate data is centred and scaled (using `scale(center=TRUE, scale=TRUE)`) before distance calculations. Defaults: `TRUE`.
k_search: The maximum number of neighbours to find in the initial KNN search. This value must be greater than or equal to both `k_priority` and `k_for_r`. Default: 30.
k_for_r: An integer specifying which neighbour to use for the 'auto' `r` calculation. Default: 5.

Value

A logical vector of length `nrow(df)`. `TRUE` indicates the point at that index should be kept for labelling.

Details

The algorithm in detail: 1. If `scale = TRUE`, the coordinate data is centred and scaled. 2. An exact k-nearest neighbour (KNN) search for all points is conducted using the `dbscan::kNN` function. 3. A priority score is calculated for each point, defined as the median distance to its `k_priority` nearest neighbours, and the list of points sorted by this score. 4. The function iterates through the sorted list of points in descending order: a. If a point has not yet been processed, it is marked as 'processed' and 'kept'. b. A radius search is performed around this point using `dbscan::frNN`. All neighbours within the specified exclusion radius `r` are then marked as 'processed' and will be ignored in subsequent iterations. 5. A logical vector is returned, where `TRUE` corresponds to a point that should be kept for labelling.

Examples

library(ggplot2)
library(ggrepel)

# Generate sample data with a dense cluster
set.seed(42)
n_points <- 1000
cluster_data <- data.frame(
  x = rnorm(n_points, mean = 5, sd = 1),
  y = rnorm(n_points, mean = 5, sd = 1),
  label = paste("Point", 1:n_points)
)

# Use the function to get a logical vector for filtering
kept_labels <- sample_labels_by_isolation(
  df = cluster_data,
  x_col = "x",
  y_col = "y",
  scale = FALSE,
  r = "auto",
  k_priority = 30,
  k_search = 30,
  k_for_r = 5
)
#>   Label sampling: auto-calculated radius 'r' (median distance to 5-th nearest neighbour): 0.142

# Create the label dataframe for ggplot
label_df <- cluster_data[kept_labels, ]

# Plot the results
ggplot(cluster_data, aes(x = x, y = y)) +
  geom_point(colour = "grey70", alpha = 0.7) +
  geom_point(data = label_df, colour = "firebrick") +
  geom_text_repel(
    data = label_df,
    aes(label = label),
    min.segment.length = 0,
    box.padding = 0.25,
    max.overlaps = Inf
  ) +
  coord_fixed() +
  labs(
    title = "Sampled Labels",
    subtitle = paste(sum(kept_labels), "of", nrow(cluster_data), "points labelled"),
    caption = "Red points are selected for labelling."
  ) +
  theme_bw()