K-Means algorithm in Matlab
In MATLAB live scripts, local functions must be defined at the end of the file, after they are called. The main part of the code therefore comes at the beginning of the report, and the utility functions will be defined later.
First, we define the variables for MATLAB's built-in kmeans function and create a random dataset. You can read more about the parameters it takes in the MATLAB documentation.
% The variables:
k = 3;       % number of cluster centers
n = 50;      % number of datapoints
d = 2;       % number of dimensions
minVal = 0;  % minimum value in the dataset
maxVal = 10; % maximum value in the dataset

% The random dataset:
data = (maxVal - minVal) .* rand(n, d) + minVal;

% Cluster the dataset
[assignment, cluster_centers] = kmeans(data, k);

% Visualize the clusters along with the cluster centers
plot_clusters(data, cluster_centers, assignment);
Now let's implement the algorithm ourselves, starting with the variables we need.
n = 50;      % number of datapoints
k = 3;       % number of cluster centers
d = 2;       % number of dimensions
minVal = 0;  % minimum value in the dataset
maxVal = 10; % maximum value in the dataset
Then, generate a random dataset and the initial random cluster centers.
data = (maxVal - minVal) .* rand(n, d) + minVal;
cluster_centers = (maxVal - minVal) .* rand(k, d) + minVal;
After all of this is done, we can run the algorithm until convergence and see how the cluster centers converge.
previous_centers = [];
while true
    if size(data, 2) ~= size(cluster_centers, 2)
        error('Number of dimensions doesn''t match for the dataset and cluster centers')
    end
    % stop when the cluster centers no longer move
    if ~isempty(previous_centers) && all(all(previous_centers == cluster_centers))
        break
    end
    previous_centers = cluster_centers;
    assignment = assign_to_closest(data, cluster_centers);
    plot_clusters(data, cluster_centers, assignment);
    cluster_centers = new_cluster_centers(data, assignment);
end
The Euclidean distance function computes the distance from each row of a matrix to a given point.
function distance = euclidean_distance(a, b)
    distance = sqrt(sum((a - b).^2, 2));
end
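As a quick sanity check (a hypothetical snippet, not part of the original script), the row vector b is expanded against every row of a via MATLAB's implicit expansion (R2016b or later):

```matlab
a = [0 0; 3 4];  % two datapoints
b = [0 0];       % one reference point
euclidean_distance(a, b)
% second entry is sqrt(3^2 + 4^2) = 5
```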
Then, we define the function that assigns each datapoint to the closest cluster center with respect to the distance function.
function closest = assign_to_closest(datapoints, c_centers)
    % get variables
    n = size(datapoints, 1);
    k = size(c_centers, 1);
    % distance from every datapoint to every cluster center
    temp = zeros(n, k);
    for i = 1:k
        temp(:, i) = euclidean_distance(datapoints, c_centers(i, :));
    end
    [~, closest] = min(temp, [], 2);
end
Now, we will define the function that calculates the new cluster centers (the centroid of the datapoints in each cluster).
function new_representatives = new_cluster_centers(datapoints, closest_cluster)
    % get variables
    d = size(datapoints, 2);
    k = length(unique(closest_cluster));
    new_representatives = zeros(k, d);
    for i = 1:k
        members = find(closest_cluster == i);
        % mean along the first dimension, so a single-member cluster
        % is not accidentally summed across columns
        new_representatives(i, :) = mean(datapoints(members, :), 1);
    end
end
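To illustrate the centroid update (a hypothetical example, not part of the original script): with four points split evenly into two clusters, each new center is simply the mean of its members.

```matlab
pts = [0 0; 2 0; 10 10; 12 10];
asg = [1; 1; 2; 2];
new_cluster_centers(pts, asg)
% returns [1 0; 11 10]
```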
The following function is not strictly necessary, but it is useful for visualizing how the cluster centers converge to a local minimum and how the cluster assignments of the datapoints change during learning, giving us a better understanding of the algorithm.
function plot_clusters(datapoints, c_centers, assignment)
    k = size(c_centers, 1);
    if size(datapoints, 2) ~= 2
        fprintf('Number of dimensions %2d is not supported, only dimension %2d\n', size(datapoints, 2), 2);
        return
    end
    if k > 3
        fprintf('Number of cluster centers %2d is not supported, only clusters of %2d or less\n', k, 3);
    end
    datapoints_temp = [datapoints assignment];
    c_centers_temp = [c_centers (1:k)'];
    figure;
    voronoi(c_centers(:, 1), c_centers(:, 2));
    axis([min(datapoints(:, 1)) max(datapoints(:, 1)) ...
          min(datapoints(:, 2)) max(datapoints(:, 2))]);
    hold on;
    scatter(datapoints_temp(:, 1), datapoints_temp(:, 2), 10, datapoints_temp(:, 3)/k, 'filled');
    scatter(c_centers_temp(:, 1), c_centers_temp(:, 2), 30, c_centers_temp(:, 3)/k, 'Marker', 'X');
    hold off;
end
Péter Szinger