Group up¶
Topic: k means clustering, unsupervised learning, data grouping
- Task A:
Implement the k means clustering algorithm.
Create random sets of data with, e.g., 4 centers, and try to find them. You may also try changing the dimensionality of the problem.
- Task B:
Analyse the iris flower data with the k means algorithm.
Since we already know the species of each flower, the program prints a report on how well it did. Does k means recognize the flowers as well as the neural network did?
- Task C:
Get the questionnaire data and analyse it with the k means algorithm.
We have no idea how many categories describe this set of data or whether it can be described as clusters at all. What do you think? How many clusters would you use to split up this data set? How would you describe these groups?
Template: clustering.py
Data: The iris set combines the training and test sets from the neural network task. The questions.txt set was collected from university teaching staff at a Finnish university in 2011. The original questionnaire contained 30 questions on what motivates teachers in their work. The questions were grouped into 5 categories, and a total score between -1 (highly demotivating) and 1 (highly motivating) was calculated for each category. The categories were (1) personal benefit, (2) importance of work, (3) received feedback, (4) available resources, and (5) development possibilities.
clustering.py¶
- class clustering.Point(coordinates, true_cluster=-1)[source]¶
Class representing a data point.
Each point has coordinates. It also stores the index of the cluster it belongs to.
In test cases we may know how the points should be divided into different classes, so the index of the true class may also be recorded.
- Parameters:
coordinates (array) – coordinate vector
true_cluster (int) – index of the class the point really belongs to
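As a minimal sketch, the class could look like this (the actual template may differ; the cluster attribute holds the index assigned by k means, cf. find_clusters() below):

import numpy as np

class Point:
    """A data point with coordinates and cluster bookkeeping."""
    def __init__(self, coordinates, true_cluster=-1):
        self.coordinates = np.array(coordinates, dtype=float)
        self.true_cluster = true_cluster  # known class, -1 if unknown
        self.cluster = -1                 # cluster index assigned by k means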
- clustering.animate_profile(history, center_history, k_clusters)[source]¶
Animates the simulation as a profile plot.
- Parameters:
history (list) – list of systems at different times
center_history (list) – list of centers at different times
k_clusters (int) – number of clusters
- clustering.animate_scatter(history, center_history, k_clusters)[source]¶
Animates the simulation as a scatter plot.
- Parameters:
history (list) – list of systems at different times
center_history (list) – list of centers at different times
k_clusters (int) – number of clusters
- clustering.calculate_cluster_centers(data, k_clusters)[source]¶
Calculates cluster center coordinates.
The center of a cluster is calculated simply as the average of the coordinate vectors of all the points that belong to that cluster.
Mathematically, if we denote the set of points in cluster \(j\) as \(C_j\), the center is calculated as
\[\vec{c}_j = \frac{1}{n_j} \sum_{\vec{p} \in C_j} \vec{p},\]
where \(n_j\) is the number of points in cluster \(j\).
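A minimal sketch of this averaging, assuming Point objects with coordinates and cluster attributes as sketched above:

import numpy as np

def calculate_cluster_centers(data, k_clusters):
    """Return one center coordinate vector per cluster."""
    centers = []
    for j in range(k_clusters):
        members = [p.coordinates for p in data if p.cluster == j]
        # average of the coordinate vectors in cluster j
        # (an empty cluster is not handled in this sketch)
        centers.append(np.mean(members, axis=0))
    return centers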
- clustering.check_results(data, k)[source]¶
Prints performance report for a test case where the correct groups are known.
Performance is evaluated in two ways. First, the method checks each cluster found by the k means algorithm and counts the number of points from each group in that cluster. The group with the most points is declared to be the dominant group of that cluster, and all the points belonging to that group are evaluated as correctly grouped while the rest are incorrectly grouped. The homogeneity score is calculated as the number of correctly grouped points divided by the total number of points. This score is 100 % if each cluster contains points from only one group.
Second, the method checks each known group and counts the number of points of that group in each cluster found by k means. The cluster with the most points is declared to be the dominant cluster of that group, and points in that cluster are defined to be correctly clustered. The connectedness score is calculated as the number of correctly clustered points divided by the total number. This score is 100 % if the points of each group are contained in a single cluster.
For a perfect score, each cluster should contain exactly all the points from a single group. This is possible only if the number of clusters k equals the true number of groups in the data. Decreasing the number of clusters k tends to make the connectedness score better and the homogeneity score worse. (For \(k=1\), connectedness would be trivially 100 % because all points would be in the same cluster.) Increasing k tends to make homogeneity better and connectedness worse. (If k equaled the number of points, each point would be its own cluster, making homogeneity 100 % and connectedness close to 0 %.)
- Parameters:
data (list) – list of data points as Point objects
k (int) – number of clusters assumed by the algorithm
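The two scores could be computed, e.g., via a count matrix. This sketch assumes the Point attributes named above; count_matrix and the score functions are illustrative helpers, not part of the template:

import numpy as np

def count_matrix(data, k, n_groups):
    """counts[j, g]: number of points in cluster j whose true group is g."""
    counts = np.zeros((k, n_groups), dtype=int)
    for p in data:
        counts[p.cluster, p.true_cluster] += 1
    return counts

def homogeneity_score(counts):
    # the dominant group of each cluster counts as correctly grouped
    return counts.max(axis=1).sum() / counts.sum()

def connectedness_score(counts):
    # the dominant cluster of each group counts as correctly clustered
    return counts.max(axis=0).sum() / counts.sum()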
- clustering.create_data(n_centers, dimensions=2)[source]¶
Creates a random set of data points.
The method randomly draws \(k\) center points from \([0,10]^d\) and then 100 to 300 normally distributed data points around them.
- Parameters:
n_centers (int) – number of clusters to create, \(k\)
dimensions (int) – dimensionality of the data to create, \(d\)
- Returns:
generated data as Point objects
- Return type:
list
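A possible sketch of such a generator (the spread of the normal distribution is an assumption, and the 100 to 300 points are read here as per center):

import numpy as np

def create_data(n_centers, dimensions=2):
    """Draw random centers in [0,10]^d and scatter normal points around them."""
    rng = np.random.default_rng()
    data = []
    for i in range(n_centers):
        center = rng.uniform(0.0, 10.0, size=dimensions)
        n_points = rng.integers(100, 301)               # 100-300 points (assumed per center)
        for _ in range(n_points):
            coords = rng.normal(loc=center, scale=1.0)  # scale=1.0 is an assumption
            data.append(Point(coords, true_cluster=i))
    return data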
- clustering.data_to_matrix(data, cluster=None)[source]¶
Creates an array containing data point coordinates.
The coordinates are saved in a matrix where matrix[i,j] is the jth coordinate of the ith point.
- Parameters:
data (list) – list of data points as Point objects
cluster (int) – If None, all points are included. If a number, only the points in that cluster are included.
- Returns:
coordinates as a matrix
- Return type:
array
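A minimal sketch of this conversion:

import numpy as np

def data_to_matrix(data, cluster=None):
    """Stack coordinates so that matrix[i, j] is coordinate j of point i."""
    if cluster is None:
        points = data
    else:
        points = [p for p in data if p.cluster == cluster]
    return np.array([p.coordinates for p in points])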
- clustering.draw_profile(frame, history, center_history, k_clusters)[source]¶
Draws a profile plot of the data.
Used for animation.
- Parameters:
frame (int) – index of the frame to draw
history (list) – list of systems at different times
center_history (list) – list of centers at different times
k_clusters (int) – number of clusters
- clustering.draw_scatter(frame, history, center_history, k_clusters)[source]¶
Draws the data as a scatter plot.
Used for animation.
- Parameters:
frame (int) – index of the frame to draw
history (list) – list of systems at different times
center_history (list) – list of centers at different times
k_clusters (int) – number of clusters
- clustering.find_clusters(data, centers)[source]¶
Allocates each data point to a cluster.
The algorithm loops over all data points and checks which cluster center is closest to each one. The index of the closest center is saved in point.cluster.
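A sketch of this nearest-center assignment, using the Euclidean distance defined below:

import numpy as np

def find_clusters(data, centers):
    """Assign each point to the cluster whose center is nearest."""
    for p in data:
        distances = [np.linalg.norm(p.coordinates - c) for c in centers]
        p.cluster = int(np.argmin(distances))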
- clustering.k_means_clustering_algorithm(data, k_clusters, n_repeats=10)[source]¶
Divides the data points into k clusters.
In k means clustering, data points are divided into k different categories, i.e., clusters. The algorithm assumes all data points can be represented as numeric vectors in a (possibly very high dimensional) data space, \(\vec{p} = [x_0, x_1, x_2, \ldots, x_N]\). All the coordinates should also have similar scales, because the similarity of any two data points is measured by their Euclidean distance
\[d_{i,j} = |\vec{p}_j - \vec{p}_i| = \sqrt{ \sum_n^N (x_{n,j} - x_{n,i})^2 }.\]
(If this is not the case, it can always be achieved by appropriate scaling.)
No prior information about the categories is needed except their total number, \(k\).
Initially, \(k\) random points are chosen and their coordinates are used as the initial coordinates of the cluster centers, \(\vec{c}\). Then these steps are repeated:
- Each point \(\vec{p}_i\) is assigned to the cluster whose center \(\vec{c}_j\) is the closest. This is done with find_clusters().
- If every point remains in the same cluster where it was before, we have reached a stable solution and the run ends. If any changes have been made, we continue.
- New center coordinates are calculated for each cluster as the average of the coordinates of the points that belong to the cluster. This is done with calculate_cluster_centers().
The result is a self-consistent solution, where each point belongs to its nearest cluster and each cluster is centered in the center of its points.
However, the solution is typically not unique. Therefore the algorithm is run several times from different, randomly chosen initial centers, and the best solution is picked. Ranking is done according to the sum of squared distances between data points and cluster centers.
Note
This function is incomplete!
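Since completing this function is the actual task, the following is only one possible outline of the iteration described above, built on find_clusters() and calculate_cluster_centers():

import random
import numpy as np

def k_means_clustering_algorithm(data, k_clusters, n_repeats=10):
    """Run k means n_repeats times and keep the best clustering found."""
    best_error = None
    best_clusters = None
    for _ in range(n_repeats):
        # start from the coordinates of k randomly chosen data points
        centers = [p.coordinates.copy() for p in random.sample(data, k_clusters)]
        for p in data:
            p.cluster = -1  # reset so the first comparison never matches
        while True:
            old = [p.cluster for p in data]
            find_clusters(data, centers)  # assign points to nearest center
            if [p.cluster for p in data] == old:
                break                     # nothing changed: stable solution
            centers = calculate_cluster_centers(data, k_clusters)
        # rank solutions by the sum of squared point-to-center distances
        error = sum(np.sum((p.coordinates - centers[p.cluster]) ** 2) for p in data)
        if best_error is None or error < best_error:
            best_error, best_clusters = error, [p.cluster for p in data]
    for p, c in zip(data, best_clusters):  # restore the best assignment
        p.cluster = c
    return calculate_cluster_centers(data, k_clusters)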
- clustering.print_progress(step, total)[source]¶
Prints a progress bar.
- Parameters:
step (int) – progress counter
total (int) – counter at completion
- clustering.profile_plot(data, k_clusters, centers=None)[source]¶
Plots the data as a series of line plots.
Each line is drawn so that the x-coordinates are evenly spaced and the y-coordinates are the coordinates of the data point.
If cluster centers are given, they are also plotted as lines with a black outline.
- Parameters:
data (list) – list of data points as Point objects
k_clusters (int) – number of clusters
centers (list) – cluster center coordinates
- clustering.read_data(filename, data_cols=0, categories=0, delimiter=None)[source]¶
Reads data from a file.
Each line in the file should define one data point by listing its coordinates.
By default, the coordinates should be separated by spaces. If a delimiter is given, it is used as the separator instead.
Also by default, each column is assumed to represent a coordinate. If data_cols is given, each line is assumed to hold that many columns of coordinate data followed by columns specifying the true class of the point. The classification is given as columns of zeros and ones so that a one marks the class of the point.
The file could thus read, e.g.,
1.2 3.4 5.6 1 0 0
1.3 5.7 9.1 0 0 1
...
where the first point would be of the first class (index 0) and the second of the third class (index 2).
- Parameters:
filename (str) – name of file to read
data_cols (int) – Number of data columns, i.e., dimensionality of the data. Only give this argument if the file also contains category data.
categories (int) – number of known categories
delimiter (str) – delimiter, if something other than whitespace.
- Returns:
data read from the file as Point objects
- Return type:
list
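Hedged usage examples (questions.txt is named above; the iris file name below is a placeholder):

# questionnaire data: each line holds five category scores, no known classes
survey = read_data("questions.txt")

# iris data: four measurements per flower followed by three class columns
# ("iris.data" is a placeholder file name)
iris = read_data("iris.data", data_cols=4, categories=3)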
- clustering.scatter_plot(data, k_clusters, axes=[0, 1], centers=None)[source]¶
Plots the data as a 2D scatter plot.
The function assumes the data is sorted into k_clusters categories and plots each cluster in a different color.
Only two coordinates are plotted even if the data is high dimensional. The coordinates to plot can be chosen with the axes parameter, which must contain the indices of the coordinates to plot.
If cluster centers are given, they are also plotted as dots with a black outline.
- Parameters:
data (list) – list of data points as Point objects
k_clusters (int) – number of clusters
axes (array) – indices of the coordinates to plot
centers (list) – cluster center coordinates