Group up

  • Topic: k-means clustering, unsupervised learning, data grouping

  • Task A:
    1. Implement the k-means clustering algorithm.

    2. Create random sets of data with, e.g., 4 centers, and try to find them. You may also try changing the dimensionality of the problem.

  • Task B:
    1. Analyse the iris flower data with the k-means algorithm.

    2. Since we already know the species of each flower, the program prints a report on how well it did. Does k-means recognize the flowers as well as the neural network did?

  • Task C:
    1. Get the questionnaire data and analyse it with the k-means algorithm.

    2. We have no idea how many categories describe this set of data or whether it can be described as clusters at all. What do you think? How many clusters would you use to split up this data set? How would you describe these groups?

  • Template: clustering.py

  • Data: The iris set combines the training and test sets from the neural network task. The questions.txt set was collected from university teaching staff at a Finnish university in 2011. The original questionnaire contained 30 questions on what motivates teachers in their work. The questions were grouped into 5 categories, and a total score between -1 (highly demotivating) and 1 (highly motivating) was calculated for each. The categories were (1) personal benefit, (2) importance of work, (3) received feedback, (4) available resources, and (5) development possibilities.

clustering.py

class clustering.Point(coordinates, true_cluster=-1)[source]

Class representing a data point.

Each point has coordinates. It also stores the index of the cluster it belongs to.

In test cases we may know how the points should be divided into classes. The index of the true class can then also be recorded.

Parameters:
  • coordinates (array) – coordinate vector

  • true_cluster (int) – index of the class the point really belongs to

sq_distance(another_point)[source]

Calculates the squared distance between this point and another.

Parameters:

another_point (Point) – the other point

Returns:

squared distance between the points

Return type:

float
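
Pieced together from the documented behavior, the class can be pictured roughly like this minimal sketch (the actual template may differ in details):

import numpy as np

class Point:
    """Data point with coordinates and cluster bookkeeping."""

    def __init__(self, coordinates, true_cluster=-1):
        self.coordinates = np.asarray(coordinates, dtype=float)
        self.cluster = -1                  # cluster the point is currently assigned to
        self.true_cluster = true_cluster   # known class in test cases, -1 if unknown

    def sq_distance(self, another_point):
        # squared Euclidean distance (no square root needed for comparisons)
        diff = self.coordinates - another_point.coordinates
        return float(diff @ diff)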

clustering.animate_profile(history, center_history, k_clusters)[source]

Animates the simulation as a profile plot.

Parameters:
  • history (list) – list of systems at different times

  • center_history (list) – list of centers at different times

  • k_clusters (int) – number of clusters

clustering.animate_scatter(history, center_history, k_clusters)[source]

Animates the simulation as a scatter plot.

Parameters:
  • history (list) – list of systems at different times

  • center_history (list) – list of centers at different times

  • k_clusters (int) – number of clusters

clustering.calculate_cluster_centers(data, k_clusters)[source]

Calculates cluster center coordinates.

The center of a cluster is calculated simply as the average of all the coordinate vectors of the points that belong to that cluster.

Mathematically, if we denote the set of points in cluster \(j\) as \(C_j\), the center is calculated as the average

\[\vec{c}_j = \frac{1}{n_j} \sum_{\vec{p} \in C_j} \vec{p}\]

where \(n_j\) is the number of points in cluster \(j\).

Parameters:
  • data (list) – list of data points as Point objects

  • k_clusters (int) – number of clusters

Returns:

list of cluster centers as Point objects

Return type:

list
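
A sketch of this averaging, assuming the Point class documented above (the real template may handle details such as empty clusters differently):

import numpy as np
from clustering import Point

def calculate_cluster_centers(data, k_clusters):
    centers = []
    for j in range(k_clusters):
        # average of all coordinate vectors in cluster j
        members = [p.coordinates for p in data if p.cluster == j]
        # note: an empty cluster would need special handling here
        centers.append(Point(np.mean(members, axis=0)))
    return centers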

clustering.check_results(data, k)[source]

Prints a performance report for a test case where the correct groups are known.

Performance is evaluated in two ways. First, the method checks each cluster found by the k-means algorithm and counts the number of points from each group in that cluster. The group with the most points is declared the dominant group of that cluster, all the points belonging to that group are counted as correctly grouped, and the rest as incorrectly grouped. The homogeneity score is the number of correctly grouped points divided by the total number of points. This score is 100 % if each cluster contains points from only one group.

Second, the method checks each known group and counts the number of points of that group in each cluster found by k-means. The cluster with the most points is declared the dominant cluster of that group, and the points in that cluster are counted as correctly clustered. The connectedness score is the number of correctly clustered points divided by the total number. This score is 100 % if the points of each group are contained in a single cluster.

For a perfect score, each cluster should contain exactly all the points from a single group. This is possible only if the number of clusters k equals the true number of groups in the data. Decreasing k tends to improve the connectedness score and worsen the homogeneity score. (For \(k=1\), connectedness would trivially be 100 % because all points would be in the same cluster.) Increasing k tends to improve homogeneity and worsen connectedness. (If k equaled the number of points, each point would be its own cluster, making homogeneity 100 % and connectedness close to 0 %.) A sketch of both scores follows the parameter list below.

Parameters:
  • data (list) – list of data points as Point objects

  • k (int) – number of clusters assumed by the algorithm
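
The two scores could be computed along these lines (an illustrative sketch, not the template's exact code; the function name homogeneity_and_connectedness is hypothetical, and it returns the scores instead of printing a report):

from collections import Counter

def homogeneity_and_connectedness(data, k):
    n_total = len(data)
    # homogeneity: per cluster, count the points of its dominant group
    correct = 0
    for j in range(k):
        group_counts = Counter(p.true_cluster for p in data if p.cluster == j)
        if group_counts:
            correct += max(group_counts.values())
    homogeneity = correct / n_total
    # connectedness: per true group, count the points in its dominant cluster
    correct = 0
    for g in set(p.true_cluster for p in data):
        cluster_counts = Counter(p.cluster for p in data if p.true_cluster == g)
        correct += max(cluster_counts.values())
    connectedness = correct / n_total
    return homogeneity, connectedness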

clustering.create_data(n_centers, dimensions=2)[source]

Creates a random set of data points.

The method randomly draws \(k\) center points from \([0,10]^d\), and then 100–300 normally distributed data points around them.

Parameters:
  • n_centers (int) – number of clusters to create, \(k\)

  • dimensions (int) – dimensionality of the data to create, \(d\)

Returns:

generated data as Point objects

Return type:

list
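
Such a generator might look roughly like this (the spread of the blobs, and whether the point count is per center or in total, are assumptions here):

import numpy as np
from clustering import Point

def create_data(n_centers, dimensions=2):
    rng = np.random.default_rng()
    data = []
    for k in range(n_centers):
        # random center in [0, 10]^d
        center = rng.uniform(0.0, 10.0, size=dimensions)
        # assumed: 100-300 points per center with unit spread
        for _ in range(rng.integers(100, 301)):
            coords = center + rng.normal(0.0, 1.0, size=dimensions)
            data.append(Point(coords, true_cluster=k))
    return data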

clustering.data_to_matrix(data, cluster=None)[source]

Creates an array containing data point coordinates.

The coordinates are saved in a matrix where matrix[i,j] is the jth coordinate of the ith point.

Parameters:
  • data (list) – list of data points as Point objects

  • cluster (int) – If None, all points are included. If a number, only the points in that cluster are included.

Returns:

coordinates as a matrix

Return type:

array
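
One way to build such a matrix (a sketch assuming the documented Point objects):

import numpy as np

def data_to_matrix(data, cluster=None):
    if cluster is not None:
        data = [p for p in data if p.cluster == cluster]
    # matrix[i, j] is the jth coordinate of the ith selected point
    return np.array([p.coordinates for p in data])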

clustering.draw_profile(frame, history, center_history, k_clusters)[source]

Draws a profile plot of the data.

Used for animation.

Parameters:
  • frame (int) – index of the frame to draw

  • history (list) – list of systems at different times

  • center_history (list) – list of centers at different times

  • k_clusters (int) – number of clusters

clustering.draw_scatter(frame, history, center_history, k_clusters)[source]

Draws the data as a scatter plot.

Used for animation.

Parameters:
  • frame (int) – index of the frame to draw

  • history (list) – list of systems at different times

  • center_history (list) – list of centers at different times

  • k_clusters (int) – number of clusters

clustering.find_clusters(data, centers)[source]

Allocates each data point to a cluster.

The algorithm loops over all data points and finds, for each one, the closest cluster center. The index of the closest center is saved in point.cluster.

Parameters:
  • data (list) – list of data points as Point objects

  • centers (list) – list of cluster centers as Point objects

Returns:

True if any changes were made, sum of squares of point-center distances

Return type:

bool, float
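
A sketch matching the documented return values (relying on Point.sq_distance from above):

def find_clusters(data, centers):
    changed = False
    sq_sum = 0.0
    for point in data:
        # squared distance from this point to every center
        distances = [point.sq_distance(c) for c in centers]
        nearest = distances.index(min(distances))
        if nearest != point.cluster:
            changed = True   # at least one point switched clusters
        point.cluster = nearest
        sq_sum += distances[nearest]
    return changed, sq_sum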

clustering.k_means_clustering_algorithm(data, k_clusters, n_repeats=10)[source]

Divides the data points into k clusters.

In k-means clustering, data points are divided into k different categories, i.e., clusters. The algorithm assumes all data points can be represented as numeric vectors in a (possibly very high dimensional) data space, \(\vec{p} = [x_1, x_2, \ldots, x_N]\). All the coordinates should also have similar scales, because the similarity of any two data points is measured by their Euclidean distance

\[d_{i,j} = |\vec{p}_j - \vec{p}_i| = \sqrt{ \sum_{n=1}^{N} (x_{n,j} - x_{n,i})^2 }.\]

(If this is not the case, it can always be achieved by appropriate scaling.)

No prior information about the categories is needed except their total number, \(k\).

Initially, \(k\) random points are chosen and their coordinates are used as the initial coordinates of the cluster centers, \(\vec{c}\). Then these steps are repeated:

  • Each point \(\vec{p}_i\) is assigned to the cluster whose center \(\vec{c}_j\) is the closest. This is done with find_clusters().

  • If every point remains in the same cluster where it was before, we have reached a stable solution and the run ends. If any changes have been made, we continue.

  • New center coordinates are calculated for each cluster as the average of the coordinates of the points that belong to the cluster. This is done with calculate_cluster_centers().

The result is a self-consistent solution, where each point belongs to its nearest cluster and each cluster center is the average of its points.

The solution is typically not unique, though. Therefore the algorithm is run several times from different, randomly chosen initial centers, and the best solution is picked; a possible structure is sketched below. Ranking is done according to the sum of squared distances between data points and cluster centers.

Note

This function is incomplete!

Parameters:
  • data (list) – list of data points as Point objects

  • k_clusters (int) – number of clusters

  • n_repeats (int) – number of tries

Returns:

cluster centers as Point objects, evolution of data points, evolution of centers

Return type:

list, list, list
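
Since the template function is intentionally left incomplete, the following is only a rough sketch of the iteration described above, with the history bookkeeping for the animations omitted (find_clusters and calculate_cluster_centers are the documented functions; the name k_means_sketch is hypothetical):

import copy
import random
from clustering import Point, find_clusters, calculate_cluster_centers

def k_means_sketch(data, k_clusters, n_repeats=10):
    best = None
    for _ in range(n_repeats):
        for p in data:
            p.cluster = -1   # reset assignments for a fresh run
        # use the coordinates of k randomly chosen points as initial centers
        centers = [Point(p.coordinates.copy())
                   for p in random.sample(data, k_clusters)]
        while True:
            changed, sq_sum = find_clusters(data, centers)
            if not changed:
                break  # stable, self-consistent solution reached
            centers = calculate_cluster_centers(data, k_clusters)
        # keep the run with the smallest sum of squared distances
        if best is None or sq_sum < best[0]:
            best = (sq_sum, copy.deepcopy(centers), [p.cluster for p in data])
    sq_sum, centers, labels = best
    for point, label in zip(data, labels):
        point.cluster = label   # restore the assignments of the best run
    return centers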

clustering.print_progress(step, total)[source]

Prints a progress bar.

Parameters:
  • step (int) – progress counter

  • total (int) – counter at completion

clustering.profile_plot(data, k_clusters, centers=None)[source]

Plots the data as a series of line plots.

Each line is drawn so that the x-coordinates are evenly spaced and the y-coordinates are the coordinates of the data point.

If cluster centers are given, they are also plotted as lines with a black outline.

Parameters:
  • data (list) – list of data points as Point objects

  • k_clusters (int) – number of clusters

  • centers (list) – cluster center coordinates

clustering.read_data(filename, data_cols=0, categories=0, delimiter=None)[source]

Reads data from a file.

Each line in the file should define one data point by listing its coordinates.

By default, the coordinates should be separated by spaces. If a delimiter is given, it is used as the separator instead.

Also by default, each column is assumed to represent a coordinate. If data_cols is given, each line is assumed to hold that many columns of coordinate data followed by columns specifying the true class of the point. The classification is given as columns of zeros and ones so that a one marks the class of the point.

The file could thus read, e.g.

1.2 3.4 5.6 1 0 0
1.3 5.7 9.1 0 0 1
...

Here the first point belongs to the first class (index 0) and the second to the third class (index 2).

Parameters:
  • filename (str) – name of file to read

  • data_cols (int) – Number of data columns, i.e., dimensionality of the data. Only give this argument if the file also contains category data.

  • categories (int) – number of known categories

  • delimiter (str) – delimiter, if something other than whitespace.

Returns:

data read from the file as Point objects

Return type:

list
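
For instance, assuming the iris file stores 4 coordinate columns followed by 3 class-indicator columns (the iris file name below is a guess; questions.txt is named in the data description):

from clustering import read_data

iris_data = read_data("iris.txt", data_cols=4, categories=3)
survey_data = read_data("questions.txt")  # whitespace-separated by default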

clustering.scatter_plot(data, k_clusters, axes=[0, 1], centers=None)[source]

Plots the data as a 2D scatter plot.

The function assumes the data is sorted into k_clusters categories, and it plots each cluster in a different color.

Only two coordinates are plotted even if the data is high dimensional. The coordinates to plot can be chosen with the axes parameter, which must contain the indices of the coordinates to plot.

If cluster centers are given, they are also plotted as dots with a black outline.

Parameters:
  • data (list) – list of data points as Point objects

  • k_clusters (int) – number of clusters

  • axes (array) – indices of the coordinates to plot

  • centers (list) – cluster center coordinates
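
Putting the pieces together, a typical Task A run might look like this once the algorithm function has been completed (the cluster count 4 is just an example):

from clustering import (create_data, k_means_clustering_algorithm,
                        check_results, scatter_plot, animate_scatter)

data = create_data(4, dimensions=2)
centers, history, center_history = k_means_clustering_algorithm(data, 4)
check_results(data, 4)                       # report against the known groups
scatter_plot(data, 4, centers=centers)       # final clustering
animate_scatter(history, center_history, 4)  # evolution of the iteration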