hierarchical_clusterring_analysis

Differences

This shows you the differences between two versions of the page.

hierarchical_clusterring_analysis [2024/11/21 14:08] – created hkimscil
hierarchical_clusterring_analysis [2024/11/21 14:16] (current) hkimscil
Line 1: Line 1:
 +See also: https://datatab.net/tutorial/hierarchical-cluster-analysis
  
 +Cluster distance (linkage methods)
 +
 +  * Single
 +  * Complete
 +  * Average
 +  * Centroid
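
The four linkage choices above map onto the `method` argument of base R's hclust(). The following is a minimal sketch, assuming the built-in USArrests data and a cut into four groups; both choices are made only for illustration. Centroid linkage is applied to squared Euclidean distances, as the hclust() documentation recommends.

```r
# Standardize the built-in USArrests data and compute Euclidean distances
df.scaled <- scale(USArrests)
d <- dist(df.scaled, method = "euclidean")

# One dendrogram per linkage method
hc.single   <- hclust(d,   method = "single")    # nearest pair of points
hc.complete <- hclust(d,   method = "complete")  # farthest pair of points
hc.average  <- hclust(d,   method = "average")   # mean pairwise distance
hc.centroid <- hclust(d^2, method = "centroid")  # distance between centroids

# Cut the average-linkage tree into 4 groups (k = 4 is an arbitrary choice)
groups <- cutree(hc.average, k = 4)
table(groups)
```

Calling plot() on any of the hclust objects draws the corresponding dendrogram, which makes it easy to compare how the linkage choice changes the tree.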
 +
 +Methods for measuring distance
   * Euclidean distance
   * Manhattan (city-block) distance
Line 6: Line 15:
   * Kendall distance
  
-\begin{eqnarray*} 
  
 +\begin{eqnarray*}
 d_{euc} (x, y) & = & \sqrt{ \sum_{i=1}^{n}(x_{i} - y_{i})^2 } \\
 d_{man} (x, y) & = & \sum_{i=1}^{n} | (x_{i} - y_{i}) |  \\
Line 15: Line 24:
 \end{eqnarray*}
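
As a quick check of the Euclidean and Manhattan formulas, the sketch below computes both by hand in base R and compares them with dist(). The vectors `x` and `y` are made-up values used only for illustration.

```r
# Two hypothetical observations (illustrative values only)
x <- c(1, 2, 3)
y <- c(4, 6, 3)

# Direct translations of the formulas
d_euc <- sqrt(sum((x - y)^2))  # Euclidean: sqrt(9 + 16 + 0)
d_man <- sum(abs(x - y))       # Manhattan: 3 + 4 + 0

# Cross-check against dist(), which treats each row as an observation
m <- rbind(x, y)
all.equal(d_euc, as.numeric(dist(m, method = "euclidean")))
all.equal(d_man, as.numeric(dist(m, method = "manhattan")))
```

For these values the Euclidean distance is 5 and the Manhattan distance is 7.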
  
 +There are many R functions for computing distances between pairs of observations:
 +
 +  * dist() R base function [stats package]: Accepts only numeric data as an input.
 +  * get_dist() function [factoextra package]: Accepts only numeric data as an input. Compared to the standard dist() function, it supports correlation-based distance measures including “pearson”, “kendall” and “spearman” methods.
 +  * daisy() function [cluster package]: Able to handle other variable types (e.g. nominal, ordinal, (a)symmetric binary). In that case, Gower's coefficient is used automatically as the metric. It is one of the most popular proximity measures for mixed data types. For more details, read the R documentation of the daisy() function (?daisy).
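
A correlation-based distance is commonly defined as one minus the correlation between two observations. The base R sketch below illustrates that idea without factoextra. It assumes this common definition and is not taken from the get_dist() source, so results should be compared with get_dist() before relying on it. The name `dist.cor.manual` is hypothetical.

```r
# Same 15-row standardized subset of USArrests used elsewhere on this page
set.seed(123)
ss <- sample(1:50, 15)
df.scaled <- scale(USArrests[ss, ])

# cor() correlates columns, so transpose to correlate rows (observations)
row.cor <- cor(t(df.scaled), method = "pearson")

# Turn similarity into dissimilarity: 0 = perfectly correlated, 2 = perfectly anti-correlated
dist.cor.manual <- as.dist(1 - row.cor)

# Display a subset
round(as.matrix(dist.cor.manual)[1:3, 1:3], 1)
```

Replacing `method = "pearson"` with "spearman" or "kendall" in cor() gives the rank-based variants.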
 +
 +
 +<code>
 +# Subset of the data
 +set.seed(123)
 +ss <- sample(1:50, 15)   # Take 15 random rows
 +df <- USArrests[ss, ]    # Subset the 15 rows
 +df.scaled <- scale(df)   # Standardize the variables
 +
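 +# Euclidean distance between the standardized rows
 +dist.eucl <- dist(df.scaled, method = "euclidean")
 +# Quick index plot of the raw pairwise distances
 +plot(dist.eucl)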
 +
 +# Reformat as a matrix
 +# Show the first 3 rows and columns, rounded to 1 decimal place
 +round(as.matrix(dist.eucl)[1:3, 1:3], 1)
 +
 +
 +# Compute correlation-based (Pearson) distances
 +library("factoextra")
 +dist.cor <- get_dist(df.scaled, method = "pearson")
 +
 +# Display a subset
 +round(as.matrix(dist.cor)[1:3, 1:3], 1)
 +
 +library(cluster)
 +# Load data
 +data(flower)
 +head(flower, 3)
 +# Data structure
 +str(flower)
 +
 +# Distance matrix for mixed-type data (daisy() uses Gower's coefficient here)
 +dd <- daisy(flower)
 +round(as.matrix(dd)[1:3, 1:3], 2)
 +
 +# Visualize the Euclidean distance matrix as a heatmap
 +# (factoextra is already loaded above)
 +fviz_dist(dist.eucl)
 +
 +</code>
hierarchical_clusterring_analysis.1732165691.txt.gz · Last modified: 2024/11/21 14:08 by hkimscil
