Features are computed from a digitized image of a fine needle aspirate of a breast mass. They describe characteristics of the cell nuclei present in the image. Several features are highly correlated with each other, which suggests that it would be appropriate to preprocess the data with PCA (principal component analysis).
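As a minimal sketch of how the correlation between two features can be checked, the snippet below computes a Pearson coefficient in plain Python. The names `feature_a` and `feature_b` and their values are hypothetical toy data, not taken from the data set.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of centered cross-products and squares (the 1/n factors cancel).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy values: feature_b is an exact linear function of feature_a,
# so the two columns carry the same information.
feature_a = [1.0, 2.0, 3.0, 4.0]
feature_b = [2.0 * math.pi * v for v in feature_a]

r = pearson(feature_a, feature_b)
print(f"correlation = {r:.3f}")
```

A coefficient close to 1 or -1 for many feature pairs is the kind of redundancy that PCA is meant to remove.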
The variables are measured on different scales, which suggests that the data should be normalized before performing the clustering analysis.
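One common way to normalize is z-score standardization, sketched below with the standard library only. This is an assumption about the kind of normalization meant here, and the two columns are hypothetical toy values chosen only to show very different scales.

```python
import statistics

def standardize(column):
    """Rescale a column to mean 0 and sample standard deviation 1."""
    mean = statistics.fmean(column)
    sd = statistics.stdev(column)  # sample standard deviation (n - 1)
    return [(v - mean) / sd for v in column]

# Hypothetical toy columns on very different scales.
small_scale = [0.08, 0.10, 0.12, 0.09]
large_scale = [500.0, 1200.0, 650.0, 900.0]

z_small = standardize(small_scale)
z_large = standardize(large_scale)

# After standardization both columns are comparable in distance computations.
print([round(v, 2) for v in z_small])
print([round(v, 2) for v in z_large])
```

Without this step, a distance-based method would be dominated by whichever variable has the largest numeric range.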
Radius (mean of distances from center to points on the perimeter)
Texture (standard deviation of gray-scale values)
Six other real-valued features were computed for each cell nucleus.
Perform principal component analysis to reduce the dimensionality and eliminate the correlation between variables.
Diagnosis (M = malignant, B = benign)
With seven components, we can explain 90% of the variance in the data set.
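The number of components to keep is usually read off the cumulative share of explained variance. The sketch below shows that selection rule in plain Python; the function name `components_for_variance` and the variances in `toy_variances` are hypothetical and are not the values from this analysis.

```python
def components_for_variance(variances, threshold=0.90):
    """Smallest number of leading components whose cumulative
    share of the total variance reaches the threshold."""
    total = sum(variances)
    ordered = sorted(variances, reverse=True)  # largest variance first
    cumulative = 0.0
    for k, v in enumerate(ordered, start=1):
        cumulative += v
        if cumulative / total >= threshold:
            return k
    return len(ordered)

# Hypothetical per-component variances (eigenvalues), for illustration only.
toy_variances = [5.0, 2.0, 1.5, 1.0, 0.5]

k = components_for_variance(toy_variances, threshold=0.90)
print(f"components needed: {k}")
```

In a real analysis the variances would come from the PCA fit on the normalized data.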
Accuracy : 90%
Accuracy : 91%
Diagnosis | Cluster 1 | Cluster 2 |
---|---|---|
Hierarchical clustering | ||
B | 28 | 329 |
M | 188 | 24 |
K means clustering | ||
B | 343 | 14 |
M | 37 | 175 |
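An accuracy can be derived from a cross-table like this by assigning each cluster to its majority diagnosis and counting the matches. The sketch below applies that rule to the counts in the two-cluster table; the function name `majority_accuracy` is hypothetical, and the majority-label rule is an assumption about how the accuracy is computed.

```python
def majority_accuracy(table):
    """Share of observations whose diagnosis matches the majority
    diagnosis of their cluster.

    `table` maps each diagnosis label to its list of counts per cluster.
    """
    rows = list(table.values())
    n_clusters = len(rows[0])
    # For each cluster, the largest count is the number of "correct" cases.
    correct = sum(max(row[j] for row in rows) for j in range(n_clusters))
    total = sum(sum(row) for row in rows)
    return correct / total

# Counts copied from the two-cluster table.
hierarchical = {"B": [28, 329], "M": [188, 24]}
kmeans = {"B": [343, 14], "M": [37, 175]}

acc_h = majority_accuracy(hierarchical)
acc_k = majority_accuracy(kmeans)
print(f"hierarchical: {acc_h:.4f}")
print(f"k-means:      {acc_k:.4f}")
```

With more than two clusters the same quantity is usually called purity, since several clusters may map to the same diagnosis.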
Diagnosis | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 |
---|---|---|---|---|
Hierarchical clustering | ||||
B | 5 | 350 | 2 | 0 |
M | 113 | 97 | 0 | 2 |
K means clustering | ||||
B | 0 | 36 | 321 | 0 |
M | 37 | 54 | 24 | 97 |