K-Means algorithm in Matlab

In Matlab live scripts, the functions must be defined after they are called, so the main part of the code starts at the beginning of the report, and the utility functions will be defined later.

Let us see how we can use Matlab's built in k-means clustering algorithm

We need to define our variables for the function and create some random dataset. You can read more about what parameters it takes, here

% The variables:
k = 3; %number of cluster centers
n = 50; %number of datapoints
d = 2; %number of dimensions
minVal = 0; %minimum value in the dataset
maxVal = 10; %maximum value in the dataset

% The random dataset:
data = (maxVal - minVal) .* rand(n, d) + minVal;

% Cluster the dataset
[assignment, cluster_centers] = kmeans(data, k);

% Visualize how the clusters look like along with the cluster centers
plot_clusters(data, cluster_centers, assignment);

Own implementation

Let's start by defining the variables we need

n = 50; %number of datapoints
k = 3; %number of cluster centers
d = 2; %number of dimensions
minVal = 0; %minimum value in the dataset
maxVal = 10; %maximum value in the dataset

Then, generate some random dataset and the initial random cluster centers.

data = (maxVal - minVal) .* rand(n, d) + minVal;
cluster_centers = (maxVal - minVal) .* rand(k, d) + minVal;

After all of this is done, we can run the algorithm until convergence and see how the the cluster centers converge.

previous_centers = [];
while true
    if(size(data,2) ~= size(cluster_centers, 2))
        error('Number of dimensions doesn\''t match for the dataset and cluster centers')
    end
    if (length(previous_centers) ~= 0 && all(all(previous_centers == cluster_centers)))
        break
    end
    previous_centers = cluster_centers;
    assignment = assign_to_closest(data, cluster_centers);
    plot_clusters(data, cluster_centers, assignment);
    cluster_centers = new_cluster_centers(data, assignment);
end

Utility functions

Euclidean distance function calculating the distance between matrices

function distance = euclidean_distance(a,b)
    distance = sqrt(sum((a-b).^2, 2));
end
Then, we define the function, which assigns each datapoint to the closest
cluser center with respect to the distance function.
function closest = assign_to_closest(datapoints, c_centers)
    %get variables
    n = length(datapoints);
    d = size(c_centers, 2);
    k = length(c_centers);
    temp = zeros(n, d);
    for i = 1:k
        temp(:,i) = euclidean_distance(datapoints, c_centers(i,:));
    end
    [~, closest] = min(temp, [], 2);
end

Now, we will define the function, which calculates the new cluster centers (centroid of the datapoints in a given cluster).

function new_representatives = new_cluster_centers(datapoints, closest_cluster)
    %get variables
    d = size(datapoints, 2);
    k = length(unique(closest_cluster));
    new_representatives = zeros(k, d);
    for i = 1:k
        temp = find(closest_cluster == i);
        new_representatives(i,:) = sum(datapoints(temp,:)) / length(temp);
    end
end

The following function is not necessary, but is useful for visualizing how the cluster centers converge to a local minimum point and how the cluster assignment of the datapoints change during the learning, so we can get a better understanding of the algorithm.

function plot_clusters(datapoints, c_centers, assignment)
    k = length(c_centers);
    if(size(datapoints, 2) ~= 2)  %matlab: size(datapoints,2)
        printf('Number of dimensions %2d is not supported, only dimension %2d', size(datapoints, 2), 2);
        return
    end
    if(length(c_centers) > 3)
        printf('Number of cluster centers %2d is not supported, only clusters of %2d or less', k, 3);
    end
    datapoints_temp = [datapoints assignment];
    c_centers_temp = [c_centers (1:k)'];
    figure;
    voronoi(c_centers(:,1), c_centers(:,2));
    axis([min(datapoints(:, 1)) max(datapoints(:, 1)) min(datapoints(:, 2)) max(datapoints(:, 2))]);
    hold on;
    scatter(datapoints_temp(:,1), datapoints_temp(:,2), 10, datapoints_temp(:,3)/k, 'filled');
    scatter(c_centers_temp(:,1), c_centers_temp(:,2), 30, c_centers_temp(:,3)/k, 'Marker', 'X');
    hold off;
end
                                                                                              Péter Szinger