7 Hierarchical ascending classification

7.1 Principle

7.1.1 First strategy: Agglomerative Hierarchical Clustering

  • Start from the bottom of the dendrogram (each individual is a singleton class),

  • Merge the two closest classes at each step until a single class remains.

Source: @janssen2012

Source: Data analysis MOOC of Francois Husson

Note: Where to cut the dendrogram?

Rule of thumb

  • Cut where visual inspection of the tree shows a significant jump in the aggregation index. This jump marks the sudden passage from relatively homogeneous classes to much less homogeneous ones.
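This visual rule can be approximated numerically: `hclust` stores the successive merge heights in `$height`, so a large gap between consecutive heights suggests where to cut. A minimal sketch on simulated one-dimensional data (the data and variable names here are illustrative, not from this chapter):

```r
set.seed(1)
x <- c(rnorm(20, mean = 0), rnorm(20, mean = 6))  # two well-separated groups
hc <- hclust(dist(x), method = "ward.D2")

jumps <- diff(hc$height)           # gaps between successive merge heights
k <- length(x) - which.max(jumps)  # cut just before the biggest jump
k                                  # here the gap points to k = 2 classes
```

Whether the largest gap corresponds to a sensible partition should still be checked against the tree itself.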

7.1.2 Second strategy: Divisive Hierarchical Clustering

  • Start from the top of the dendrogram (a single class containing all individuals),

  • Split classes successively until every class is reduced to a singleton.
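This top-down strategy is implemented, for instance, by `diana()` (DIvisive ANAlysis) in the cluster package, which is loaded later in this chapter; a minimal sketch on simulated data (not the data used in the practical below):

```r
library(cluster)

set.seed(42)
pts <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 4), ncol = 2))

dia <- diana(pts, metric = "euclidean")  # top-down: start from one class, split
plot(dia, which.plots = 2)               # dendrogram of the successive divisions
groups <- cutree(as.hclust(dia), k = 2)  # recover 2 classes from the divisive tree
table(groups)
```

`cutree()` applies to the divisive tree exactly as it does to an agglomerative one, after conversion with `as.hclust()`.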

7.2 Weaknesses and strengths

Advantages

  • Based on simple distance and similarity considerations

  • No assumption on the number of classes

  • Can correspond to meaningful taxonomies

Disadvantages

  • Choice of the dendrogram cutoff.

  • The partition obtained at a step depends on that of the previous step.

  • Once a decision is made to combine classes, it cannot be undone.

  • Too slow for large datasets.

7.3 Practical

7.3.1 Example 1

We consider the following data table, where 4 individuals (here, points) A, B, C and D are described by two variables (X1 and X2):

    X1   X2
A    5    4
B    4    5
C    1   -2
D    0   -3
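The first merges can be checked by hand: d(A,B) = sqrt((5-4)^2 + (4-5)^2) = sqrt(2) ≈ 1.41, and likewise d(C,D) = sqrt(2), so {A,B} and {C,D} are formed first. A minimal sketch in R (single linkage is chosen here so the merge heights stay hand-checkable; the chapter's later example uses Ward):

```r
# Example 1 data: 4 individuals described by X1 and X2
m <- matrix(c(5,  4,
              4,  5,
              1, -2,
              0, -3),
            ncol = 2, byrow = TRUE,
            dimnames = list(c("A", "B", "C", "D"), c("X1", "X2")))

d <- dist(m, method = "euclidean")  # pairwise Euclidean distances
print(round(d, 2))
# {A,B} and {C,D} both merge at height sqrt(2) ~ 1.41; the two pairs
# then join at the smallest cross-pair distance, d(A,C) = sqrt(52) ~ 7.21

hc <- hclust(d, method = "single")
plot(hc, main = "Dendrogram for Example 1")
```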

7.3.2 Example 2

library(ggplot2)
library(cluster)
library(dendextend)
library(factoextra)
library(ggdendro)

Step 1: Data preparation

set.seed(123)
data <- data.frame(
  x = c(rnorm(50, mean = 2, sd = 0.5), rnorm(50, mean = 5, sd = 0.5)),
  y = c(rnorm(50, mean = 3, sd = 0.5), rnorm(50, mean = 6, sd = 0.5))
  )

ggplot(data, aes(x = x, y = y)) +
  geom_point(color = 'blue') +
  theme_minimal() +
  ggtitle("Initial data")

Step 2: HAC

Computation of the distance matrix

distance_matrix <- dist(data, method = "euclidean")

Hierarchical ascending classification

cah <- hclust(distance_matrix, method = "ward.D2")

# Convert to a ggplot2-friendly format
dendro_data <- ggdendro::dendro_data(cah)

# Extract the leaf labels
label_data <- dendro_data$labels

# Display the basic dendrogram with ggplot2
ggplot() +
  geom_segment(data = dendro_data$segments, aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_text(data = label_data, aes(x = x, y = y, label = label),
            hjust = 2, angle = 90, size = 2) +
  labs(title = "HAC dendrogram", x = "", y = "Height") +
  theme_minimal() +
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank(), panel.grid = element_blank())

Step 3: Visualization of the results with ggplot2

# Cut the tree into clusters

k <- 2 # number of desired clusters
clusters <- cutree(cah, k = k)

data$cluster <- as.factor(clusters)

# Cluster visualization
ggplot(data, aes(x = x, y = y, color = cluster)) +
  geom_point(size = 3) +
  theme_minimal() +
  ggtitle(paste("Classification in", k, "clusters"))
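Since factoextra is already loaded, the cut dendrogram itself can also be drawn with the clusters highlighted via `fviz_dend()`. A self-contained sketch, rebuilding the same simulated data as above:

```r
library(factoextra)

set.seed(123)
data <- data.frame(
  x = c(rnorm(50, mean = 2, sd = 0.5), rnorm(50, mean = 5, sd = 0.5)),
  y = c(rnorm(50, mean = 3, sd = 0.5), rnorm(50, mean = 6, sd = 0.5))
)
cah <- hclust(dist(data, method = "euclidean"), method = "ward.D2")

# Dendrogram with the k = 2 clusters boxed and colored
fviz_dend(cah, k = 2, rect = TRUE, show_labels = FALSE,
          main = "HAC dendrogram cut in 2 clusters")
```

This gives the same partition as `cutree(cah, k = 2)`, but displayed on the tree rather than in the scatter plot.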