hierarchical_clustering

See also: https://datatab.net/tutorial/hierarchical-cluster-analysis

Cluster distance

Single
Complete
Average
Centroid
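
The four cluster distances listed above correspond to linkage methods accepted by hclust() [stats package]. The following is a minimal sketch, assuming the built-in USArrests data and a cut into 4 groups; both choices are illustrative and not taken from these notes. Centroid linkage is given squared distances here, following the example in ?hclust.

```r
# Standardize the data and compute Euclidean distances
df.scaled <- scale(USArrests)
d <- dist(df.scaled, method = "euclidean")

# Fit one tree per linkage method.
# Assumption: centroid linkage is run on squared distances,
# as in the example shown in ?hclust.
fits <- list(
  single   = hclust(d,   method = "single"),
  complete = hclust(d,   method = "complete"),
  average  = hclust(d,   method = "average"),
  centroid = hclust(d^2, method = "centroid")
)

# Cut each tree into k = 4 groups (k chosen for illustration only)
# and tabulate the cluster sizes
sizes <- lapply(fits, function(h) table(cutree(h, k = 4)))
print(sizes)
```

Comparing the size tables gives a quick feel for how strongly the choice of linkage changes the grouping of the same distance matrix.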

Methods for computing distance

Euclidean distance
Manhattan (city-block) distance
Correlation distance
Eisen cosine correlation distance
Kendall correlation distance

\begin{eqnarray*}
d_{euc}(x, y) & = & \sqrt{ \sum_{i=1}^{n} (x_{i} - y_{i})^2 } \\
d_{man}(x, y) & = & \sum_{i=1}^{n} | x_{i} - y_{i} | \\
d_{cor}(x, y) & = & 1 - \frac{ \displaystyle \sum_{i=1}^{n} (x_{i} - \overline{x}) (y_{i} - \overline{y}) }{ \sqrt{ \displaystyle \sum_{i=1}^{n} (x_{i} - \overline{x})^2 \displaystyle \sum_{i=1}^{n} (y_{i} - \overline{y})^2 } } \\
d_{eisen}(x, y) & = & 1 - \frac{ \left| \displaystyle \sum_{i=1}^{n} x_{i} \; y_{i} \right| }{ \sqrt{ \displaystyle \sum_{i=1}^{n} x_{i}^{2} \displaystyle \sum_{i=1}^{n} y_{i}^{2} } } \\
d_{kend}(x, y) & = & 1 - \frac{ n_{c} - n_{d} }{ \displaystyle \frac{1}{2} n(n-1) }
\end{eqnarray*}
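
The formulas can be checked by computing each one directly in base R. This is a sketch on two made-up vectors x and y (illustrative values, not from the notes). In the Kendall formula, n_c and n_d are taken to be the numbers of concordant and discordant pairs, and the comparison with cor() assumes there are no ties.

```r
# Two small example vectors (made up for illustration, no ties)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 6)
n <- length(x)

# Euclidean and Manhattan distances
d_euc <- sqrt(sum((x - y)^2))
d_man <- sum(abs(x - y))

# Pearson correlation distance
d_cor <- 1 - sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# Eisen cosine correlation distance
d_eisen <- 1 - abs(sum(x * y)) / sqrt(sum(x^2) * sum(y^2))

# Kendall distance: count concordant and discordant pairs
idx <- combn(n, 2)
s <- sign(x[idx[1, ]] - x[idx[2, ]]) * sign(y[idx[1, ]] - y[idx[2, ]])
n_c <- sum(s > 0)   # concordant pairs
n_d <- sum(s < 0)   # discordant pairs
d_kend <- 1 - (n_c - n_d) / (0.5 * n * (n - 1))

# Cross-check against built-in functions
all.equal(d_euc, as.numeric(dist(rbind(x, y), method = "euclidean")))
all.equal(d_man, as.numeric(dist(rbind(x, y), method = "manhattan")))
all.equal(d_cor, 1 - cor(x, y, method = "pearson"))
all.equal(d_kend, 1 - cor(x, y, method = "kendall"))
```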

There are many R functions for computing distances between pairs of observations:

dist() R base function [stats package]: Accepts only numeric data as an input.
get_dist() function [factoextra package]: Accepts only numeric data as an input. Compared to the standard dist() function, it supports correlation-based distance measures including “pearson”, “kendall” and “spearman” methods.
daisy() function [cluster package]: Able to handle other variable types (e.g. nominal, ordinal, (a)symmetric binary). In that case, Gower’s coefficient is used automatically as the metric. It is one of the most popular measures of proximity for mixed data types. For more details, read the R documentation of the daisy() function (?daisy).

# Subset of the data
set.seed(123)
ss <- sample(1:50, 15)   # Take 15 random rows
df <- USArrests[ss, ]    # Subset the 15 rows
df.scaled <- scale(df)   # Standardize the variables

dist.eucl <- dist(df.scaled, method = "euclidean")
plot(dist.eucl)

# Reformat as a matrix, keep the first 3 rows and columns,
# and round the values
round(as.matrix(dist.eucl)[1:3, 1:3], 1)


# Compute Pearson correlation-based distances
library("factoextra")
dist.cor <- get_dist(df.scaled, method = "pearson")

# Display a subset
round(as.matrix(dist.cor)[1:3, 1:3], 1)

library(cluster)
# Load data
data(flower)
head(flower, 3)
# Data structure
str(flower)

# Distance matrix
dd <- daisy(flower)
round(as.matrix(dd)[1:3, 1:3], 2)
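
To see the automatic metric selection in action, the default call can be compared with one that requests Gower's coefficient explicitly. This is a small sketch; the object names dd.auto and dd.gower are made up for illustration, and the two results are expected to agree because flower contains non-numeric columns.

```r
library(cluster)
data(flower)

dd.auto  <- daisy(flower)                    # metric chosen automatically
dd.gower <- daisy(flower, metric = "gower")  # metric requested explicitly

# Compare the two dissimilarity matrices
all.equal(as.matrix(dd.auto), as.matrix(dd.gower))

# Gower dissimilarities lie between 0 and 1
range(dd.auto)
```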

# Visualize the Euclidean distance matrix as a heatmap
library(factoextra)
fviz_dist(dist.eucl)