Exploratory data analysis

Correlation between numerical variables

Features are computed from a digitized image of a fine needle aspirate of a breast mass and describe characteristics of the cell nuclei present in the image. Several features are highly correlated with each other, which suggests that it would be appropriate to preprocess the data with principal component analysis (PCA).
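One way to make "highly correlated" concrete is to list the feature pairs whose absolute Pearson correlation exceeds a cutoff. The sketch below uses NumPy on a small made-up table; the column values, the helper name `correlated_pairs`, and the 0.9 cutoff are assumptions for illustration, not the data or method behind this dashboard.

```python
import numpy as np

def correlated_pairs(X, names, threshold=0.9):
    """Return (name_i, name_j, r) for feature pairs with |r| >= threshold."""
    corr = np.corrcoef(X, rowvar=False)  # columns are features
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) >= threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs

# Toy data (hypothetical): perimeter is tied to radius by geometry,
# while texture is generated independently of both.
rng = np.random.default_rng(0)
radius = rng.uniform(7.0, 25.0, size=200)
perimeter = 2 * np.pi * radius + rng.normal(0.0, 1.0, size=200)
texture = rng.uniform(10.0, 35.0, size=200)
X = np.column_stack([radius, perimeter, texture])
names = ["radius", "perimeter", "texture"]

pairs = correlated_pairs(X, names)
for a, b, r in pairs:
    print(f"{a} ~ {b}: r = {r:.3f}")
```

On the real features, size-related measurements such as radius, perimeter, and area would be expected to show up in such a list, since they are geometrically linked.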

Relation by diagnosis (M = malignant, B = benign)

The variables are measured on different scales, which suggests that the data should be normalized before performing the clustering analysis.
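A common reading of "normalize" here is z-score standardization, where each variable is centered and divided by its standard deviation so that no single variable dominates the distance computations. The sketch below assumes that interpretation; the helper name `standardize` and the sample rows are hypothetical.

```python
import numpy as np

def standardize(X):
    """Center each column to mean 0 and scale it to unit standard deviation."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)  # sample standard deviation
    return (X - mean) / std

# Hypothetical rows: [radius, area]; area lives on a much larger scale.
X = np.array([[12.0, 450.0],
              [14.0, 600.0],
              [20.0, 1250.0],
              [18.0, 1000.0]])
Z = standardize(X)
print(Z.mean(axis=0))         # approximately 0 for every column
print(Z.std(axis=0, ddof=1))  # approximately 1 for every column
```

A constant column would produce a division by zero in this sketch, so such columns would need to be dropped or handled separately.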

Radius

Radius (mean of distances from center to points on the perimeter)

Texture

Texture (standard deviation of gray-scale values)

Perimeter

Area

Six more real-valued features were computed for each cell nucleus.

Perform PCA

First two components by diagnosis

Principal component analysis is performed to reduce the dimensionality of the data and to remove the correlation between variables.
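As a minimal sketch of what this step does, PCA can be written as a singular value decomposition of the standardized data matrix. The function name `pca` and the toy data are assumptions for illustration; the dashboard itself may rely on a library routine with different defaults.

```python
import numpy as np

def pca(X):
    """PCA via SVD of the standardized data matrix.

    Returns (scores, components, explained_variance_ratio).
    """
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = U * S                         # coordinates on the principal axes
    variances = S**2 / (Z.shape[0] - 1)    # variance carried by each component
    return scores, Vt, variances / variances.sum()

# Toy data (hypothetical): two strongly correlated columns plus an independent one.
rng = np.random.default_rng(1)
a = rng.normal(size=300)
X = np.column_stack([a,
                     2 * a + rng.normal(scale=0.1, size=300),
                     rng.normal(size=300)])

scores, components, ratio = pca(X)
print(np.round(ratio, 3))
# The component scores are uncorrelated: off-diagonal covariances are ~0.
cov = np.cov(scores, rowvar=False)
print(np.round(cov, 6))
```

The two correlated columns collapse largely onto the first component, which is the dimension-reduction effect described above, and the covariance matrix of the scores is diagonal, which is the decorrelation effect.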

First and third components by diagnosis

Second and third components by diagnosis

Diagnosis (M = malignant, B = benign)

Percentage of variance explained by each component

Cumulative percentage of variance explained by each component

With seven components, we can explain 90% of the variance in the data set.
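The number of components to keep can be read off the cumulative curve by finding the first point at which it reaches the target share. The sketch below shows that rule; the function name `components_needed` and the variance shares are made up for illustration and are not the values plotted in this dashboard.

```python
from itertools import accumulate

def components_needed(ratios, target=0.90):
    """Smallest number of leading components whose cumulative share reaches target."""
    for k, total in enumerate(accumulate(ratios), start=1):
        if total >= target:
            return k
    return len(ratios)  # target not reached: keep every component

# Hypothetical explained-variance shares, sorted from largest to smallest.
ratios = [0.50, 0.25, 0.12, 0.08, 0.05]
k = components_needed(ratios)
print(k)  # 4, since the cumulative shares are 0.50, 0.75, 0.87, 0.95
```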

Clustering analysis

Hierarchical clustering, Ward method with k=2

Accuracy: 90%

Hierarchical clustering, complete method with k=4

K-means clustering k=2

Accuracy: 91%

K-means clustering k=4

k=2

Number of clusters: 2

                          Diagnosis  Cluster 1  Cluster 2
Hierarchical clustering   B                 28        329
                          M                188         24
K-means clustering        B                343         14
                          M                 37        175
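The accuracy figures can be checked against the k=2 counts above by labeling each cluster with its majority diagnosis and counting the samples that match. That majority-vote mapping is an assumption about how the accuracy was obtained, and `cluster_accuracy` is a hypothetical helper name.

```python
def cluster_accuracy(table):
    """Share of samples that fall in their cluster's majority diagnosis.

    `table` maps each diagnosis to its list of per-cluster counts.
    """
    rows = list(table.values())
    total = sum(sum(row) for row in rows)
    # For each cluster (column), count the samples of its most common diagnosis.
    matched = sum(max(column) for column in zip(*rows))
    return matched / total

# Counts copied from the k=2 table above.
hierarchical = {"B": [28, 329], "M": [188, 24]}
kmeans = {"B": [343, 14], "M": [37, 175]}

print(f"hierarchical: {cluster_accuracy(hierarchical):.1%}")  # 90.9%
print(f"k-means:      {cluster_accuracy(kmeans):.1%}")        # 91.0%
```

Under this mapping, 517 of 569 samples are matched by hierarchical clustering and 518 of 569 by k-means, so the two methods differ by a single sample at k=2.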

k=4

Number of clusters: 4

                          Diagnosis  Cluster 1  Cluster 2  Cluster 3  Cluster 4
Hierarchical clustering   B                  5        350          2          0
                          M                113         97          0          2
K-means clustering        B                  0         36        321          0
                          M                 37         54         24         97