K-Means Document Clustering

November 6th, 2009  |  Published in Uncategorized

Building on our previous example, let’s take the next step and use the K-Means clustering algorithm to group the documents into their appropriate categories.

In the paper “Indexing by Latent Semantic Analysis” (Deerwester et al.), the authors give an example of nine paper titles grouped into two categories: “human computer interaction” and “graphs & trees”. So far, we’ve used Singular Value Decomposition (SVD) and Latent Semantic Indexing (LSI) to better understand the relationships between words and documents. In the last blog post we took the LSI results and plotted the words and documents on a two-dimensional Cartesian plane.

All of this is pretty interesting stuff in and of itself; however, the next step is to see which documents belong in each group. One way to do this is with K-Means clustering.

“Simply speaking k-means clustering is an algorithm to classify or to group your objects based on attributes/features into K number of group. K is positive integer number. The grouping is done by minimizing the sum of squares of distances between data and the corresponding cluster centroid. Thus the purpose of K-mean clustering is to classify the data.” [Kardi Teknomo]
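The “sum of squares of distances” in that description can be written out as an objective. This is a sketch in standard textbook notation, not something taken from the code in this post: the symbols $x_i$ (a data point), $C_j$ (the set of points in cluster $j$) and $\mu_j$ (that cluster's centroid) are introduced here for illustration only.

```latex
J = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```

Note that this form assumes Euclidean distance. The example in this post runs with Manhattan distance instead, so it follows the same assign-and-average procedure without minimizing exactly this quantity.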

A big chunk of the code is built on the same project we have been working on. I am using Aresh Saharkhiz's K-Means implementation in the project, with some minor changes and refactoring of my own.

Let's take a look at the code!

The first part is the display page (an ASP.NET app):

<%@ Page Language="C#" AutoEventWireup="true" CodeBehind="Default.aspx.cs" Inherits="LSITest._Default" %>
<%@ Register Assembly="DundasWebChart" Namespace="Dundas.Charting.WebControl" TagPrefix="DCWC" %>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml" >
<head runat="server">
    <title>LSI Test</title>
</head>
<body>
    <form id="form1" runat="server">
    <div>
        <DCWC:Chart ID="Chart1" runat="server" Height="400px" Width="400px"
            ImageType="Jpeg">
            <Legends>
                <DCWC:Legend Name="Default" Alignment="Center" Docking="Bottom"></DCWC:Legend>
            </Legends>
            <Titles>
                <DCWC:Title Name="Title1">
                </DCWC:Title>
            </Titles>
            <Series>
                <DCWC:Series Name="Series1" ChartType="Point" MarkerBorderColor="64, 64, 64"
                    ShadowOffset="1">
                </DCWC:Series>
                <DCWC:Series Name="Series2" ChartType="Point" MarkerBorderColor="64, 64, 64"
                    ShadowOffset="1">
                </DCWC:Series>
                <DCWC:Series Name="Series3" ChartType="Point" MarkerBorderColor="64, 64, 64"
                    ShadowOffset="1">
                </DCWC:Series>
            </Series>
            <ChartAreas>
                <DCWC:ChartArea Name="Series2">
                    <axisy interval="0.5" maximum="2" minimum="-1">
                        <majorgrid linecolor="Gray" linestyle="Dash" />
                    </axisy>
                    <axisx interval="0.5" maximum="2.5" minimum="-0.5">
                        <majorgrid linecolor="Gray" linestyle="Dash" />
                    </axisx>
                </DCWC:ChartArea>
            </ChartAreas>
        </DCWC:Chart>
    </div>
    </form>
</body>
</html>

This is the code-behind for the ASP.NET page. Because we are only dealing with two known categories, K-Means is run with k = 2 and ColorCodeDocuments assumes exactly two cluster centers. If you wanted more categories, you would have to rewrite the ColorCodeDocuments function.

using System;
using System.Data;
using System.Drawing;
using System.Web.UI;
using Dundas.Charting.WebControl;

namespace LSITest
{
    public partial class _Default : Page
    {
        /// <summary>
        /// Handles the Load event of the Page control.
        /// </summary>
        /// <param name="sender">The source of the event.</param>
        /// <param name="e">The <see cref="System.EventArgs"/> instance containing the event data.</param>
        protected void Page_Load(object sender, EventArgs e)
        {
            // Perform LSI
            var mylsi = new lsi();
            mylsi.LSITest();
            double[,] myDocs = mylsi.MyDocs;

            // Plot Documents and the k-means
            const string distanceType = "manhattan";
            PlotDocuments(myDocs, mylsi.MyDocRowCount);
            PlotKMeansPoints(myDocs, 2, distanceType);
            ColorCodeDocuments(distanceType);

            // To plot the words as well, un-comment the next two lines
            //double[,] myWords = mylsi.MyWords;
            //PlotWords(myWords, mylsi.MyWordsRowCount);

            // Comment this line out to show the words series in the legend
            Chart1.Series["Series2"].ShowInLegend = false;
        }

        /// <summary>
        /// Plots the words.
        /// </summary>
        /// <param name="myWords">My words.</param>
        /// <param name="myWordsRowCount">My words row count.</param>
        private void PlotWords(double[,] myWords, int myWordsRowCount)
        {
            for (int i = 0; i < myWordsRowCount; i++)
            {
                Chart1.Series["Series2"].Points.AddXY(myWords[i, 0], myWords[i, 1]);
            }

            // Set point colors and shapes
            Chart1.Series["Series2"].LegendText = "Words";
            Chart1.Series["Series2"].Color = Color.Gray;
            Chart1.Series["Series2"].MarkerStyle = MarkerStyle.Circle;
            Chart1.Series["Series2"].MarkerSize = 6;
        }

        /// <summary>
        /// Plots the documents.
        /// </summary>
        /// <param name="myDocs">My docs.</param>
        /// <param name="myDocRowCount">My doc row count.</param>
        private void PlotDocuments(double[,] myDocs, int myDocRowCount)
        {
            // Load documents
            for (int i = 0; i < myDocRowCount; i++)
            {
                Chart1.Series["Series1"].Points.AddXY(myDocs[i, 0], myDocs[i, 1]);
            }

            // Set point colors and shapes
            Chart1.Series["Series1"].LegendText = "Documents";
            Chart1.Series["Series1"].Color = Color.Red;
            Chart1.Series["Series1"].MarkerStyle = MarkerStyle.Diamond;
            Chart1.Series["Series1"].MarkerSize = 12;
        }

        /// <summary>
        /// Plots the K means points.
        /// </summary>
        /// <param name="items">The items.</param>
        /// <param name="k">The k.</param>
        /// <param name="distanceType">The distance metric name, e.g. "manhattan".</param>
        private void PlotKMeansPoints(double[,] items, int k, string distanceType)
        {
            ClusterCollection clusters = kmeans.ClusterDataSet(k, items, distanceType);

            for (int i = 0; i < clusters.Count; i++)
            {
                Chart1.Series["Series3"].Points.AddXY(clusters[i].ClusterMean[0], clusters[i].ClusterMean[1]);
            }

            // Set point colors and shapes
            Chart1.Series["Series3"].LegendText = "Cluster";
            Chart1.Series["Series3"].Color = Color.Gold;
            Chart1.Series["Series3"].MarkerStyle = MarkerStyle.Star6;
            Chart1.Series["Series3"].MarkerSize = 18;
        }

        /// <summary>
        /// Color-codes each document by its nearest cluster center.
        /// </summary>
        /// <param name="distanceType">Type of the distance.</param>
        private void ColorCodeDocuments(string distanceType)
        {
            var myDist = new similarity();

            // Extract data
            DataSet myDocs = Chart1.DataManipulator.ExportSeriesValues("Series1");
            DataSet myKMeansPoints = Chart1.DataManipulator.ExportSeriesValues("Series3");

            // Document counter
            int count = 0;

            // Get co-ordinates for k-means points
            double firstKMeansX = Convert.ToDouble(myKMeansPoints.Tables[0].Rows[0]["X"]);
            double firstKMeansY = Convert.ToDouble(myKMeansPoints.Tables[0].Rows[0]["Y"]);
            double secondKMeansX = Convert.ToDouble(myKMeansPoints.Tables[0].Rows[1]["X"]);
            double secondKMeansY = Convert.ToDouble(myKMeansPoints.Tables[0].Rows[1]["Y"]);

            foreach (DataRow docRow in myDocs.Tables[0].Rows)
            {
                // get co-ordinates for current doc
                double currentDocX = Convert.ToDouble(docRow["X"]);
                double currentDocY = Convert.ToDouble(docRow["Y"]);

                // load into arrays
                double[] currentDoc = {currentDocX, currentDocY};
                double[] firstCenter = {firstKMeansX, firstKMeansY};
                double[] secondCenter = {secondKMeansX, secondKMeansY};

                // find the distance from the document to each cluster center
                double firstDist = myDist.FindDistance(currentDoc, firstCenter, distanceType);
                double secondDist = myDist.FindDistance(currentDoc, secondCenter, distanceType);

                // Color accordingly
                Chart1.Series["Series1"].Points[count].Color = firstDist < secondDist ? Color.Blue : Color.Gray;
                count++;
            }
        }
    }
}
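ColorCodeDocuments above is hard-wired to two cluster centers. As a rough sketch of how the same idea could extend to any number of clusters, the helper below returns the index of the closest center for a point. The NearestCentroid class and its method names are hypothetical (they are not part of the project or of Aresh Saharkhiz's code), and it uses its own Manhattan distance so that it stands alone.

```csharp
using System;

// Hypothetical helper, not part of the original project.
public static class NearestCentroid
{
    // Manhattan distance between two points of equal length.
    public static double Manhattan(double[] a, double[] b)
    {
        if (a.Length != b.Length)
        {
            throw new ArgumentException("points must have the same number of elements");
        }

        double sum = 0.0;
        for (int i = 0; i < a.Length; i++)
        {
            sum += Math.Abs(a[i] - b[i]);
        }
        return sum;
    }

    // Returns the index of the centroid closest to the point.
    // Ties go to the centroid with the lower index.
    public static int Find(double[] point, double[][] centroids)
    {
        if (centroids.Length == 0)
        {
            throw new ArgumentException("at least one centroid is required");
        }

        int best = 0;
        double bestDistance = Manhattan(point, centroids[0]);

        for (int c = 1; c < centroids.Length; c++)
        {
            double distance = Manhattan(point, centroids[c]);
            if (distance < bestDistance)
            {
                bestDistance = distance;
                best = c;
            }
        }
        return best;
    }
}
```

With something like this in place, the coloring loop could look up each document's color in an array of colors indexed by the returned cluster number, instead of comparing exactly two distances.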

This is the K-Means class written by Aresh Saharkhiz with my changes

/// Most of this code was written by Aresh Saharkhiz
/// Re-organized by me
/// See Code Project: http://www.codeproject.com/KB/recipes/K-Mean_Clustering.aspx
using System;
using System.Collections;
using System.Data;
using System.Diagnostics;

namespace LSITest
{
    public class kmeans
    {
        /// <summary>
        /// Calculates The Mean Of A Cluster OR The Cluster Center
        /// </summary>
        /// <param name="cluster">
        /// A two-dimensional array containing a dataset of numeric values
        /// </param>
        /// <returns>
        /// Returns an Array Defining A Data Point Representing The Cluster Mean or Centroid
        /// </returns>
        public static double[] ClusterMean(double[,] cluster)
        {
            int rowCount = cluster.GetUpperBound(0) + 1;
            int fieldCount = cluster.GetUpperBound(1) + 1;
            var dataSum = new double[1,fieldCount];
            var centroid = new double[fieldCount];

            for (int j = 0; j < fieldCount; j++)
            {
                for (int i = 0; i < rowCount; i++)
                {
                    dataSum[0, j] = dataSum[0, j] + cluster[i, j];
                }

                centroid[j] = (dataSum[0, j]/rowCount);
            }

            return centroid;
        }

        /// <summary>
        /// Separates a dataset into clusters or groups with similar characteristics
        /// </summary>
        /// <param name="clusterCount">The number of clusters or groups to form</param>
        /// <param name="data">An array containing data that will be clustered</param>
        /// <param name="type">The distance metric name passed to similarity.FindDistance</param>
        /// <returns>A collection of clusters of data</returns>
        public static ClusterCollection ClusterDataSet(int clusterCount, double[,] data, string type)
        {
            int rowCount = data.GetUpperBound(0) + 1;
            int fieldCount = data.GetUpperBound(1) + 1;
            int stableClustersCount = 0;
            double[] dataPoint;
            var random = new Random();
            Cluster cluster;
            var clusters = new ClusterCollection();
            var clusterNumbers = new ArrayList(clusterCount);
            var myDist = new similarity();

            while (clusterNumbers.Count < clusterCount)
            {
                // Random.Next's upper bound is exclusive, so pass rowCount to make every row eligible as a seed
                int clusterNumber = random.Next(0, rowCount);

                if (!clusterNumbers.Contains(clusterNumber))
                {
                    cluster = new Cluster();
                    clusterNumbers.Add(clusterNumber);
                    dataPoint = new double[fieldCount];

                    for (int field = 0; field < fieldCount; field++)
                    {
                        dataPoint.SetValue((data[clusterNumber, field]), field);
                    }

                    cluster.Add(dataPoint);
                    clusters.Add(cluster);
                }
            }

            while (stableClustersCount != clusters.Count)
            {
                stableClustersCount = 0;
                ClusterCollection newClusters = ClusterDataSet(clusters, data, type);

                for (int clusterIndex = 0; clusterIndex < clusters.Count; clusterIndex++)
                {
                    if ((myDist.FindDistance(newClusters[clusterIndex].ClusterMean, clusters[clusterIndex].ClusterMean, type)) == 0)
                    {
                        stableClustersCount++;
                    }
                }

                clusters = newClusters;
            }

            return clusters;
        }

        /// <summary>
        /// Separates a dataset into clusters or groups with similar characteristics
        /// </summary>
        /// <param name="clusters">A collection of data clusters</param>
        /// <param name="data">An array containing data to be clustered</param>
        /// <param name="type">The distance metric name passed to similarity.FindDistance</param>
        /// <returns>A collection of clusters of data</returns>
        public static ClusterCollection ClusterDataSet(ClusterCollection clusters, double[,] data, string type)
        {
            double[] dataPoint;
            double firstClusterDistance = 0.0;
            int rowCount = data.GetUpperBound(0) + 1;
            int fieldCount = data.GetUpperBound(1) + 1;
            int position = 0;
            var myDist = new similarity();

            // create a new collection of clusters
            var newClusters = new ClusterCollection();

            for (int count = 0; count < clusters.Count; count++)
            {
                var newCluster = new Cluster();
                newClusters.Add(newCluster);
            }

            if (clusters.Count <= 0)
            {
                throw new SystemException("Cluster Count Cannot Be Zero!");
            }

            for (int row = 0; row < rowCount; row++)
            {
                dataPoint = new double[fieldCount];

                for (int field = 0; field < fieldCount; field++)
                {
                    dataPoint.SetValue((data[row, field]), field);
                }

                for (int cluster = 0; cluster < clusters.Count; cluster++)
                {
                    double[] clusterMean = clusters[cluster].ClusterMean;

                    if (cluster == 0)
                    {
                        firstClusterDistance = myDist.FindDistance(dataPoint, clusterMean, type);
                        position = cluster;
                    }
                    else
                    {
                        double secondClusterDistance = myDist.FindDistance(dataPoint, clusterMean, type);

                        if (firstClusterDistance > secondClusterDistance)
                        {
                            firstClusterDistance = secondClusterDistance;
                            position = cluster;
                        }
                    }
                }

                newClusters[position].Add(dataPoint);
            }

            return newClusters;
        }

        /// <summary>
        /// Converts the data table to array.
        /// </summary>
        /// <param name="table">The table.</param>
        /// <returns></returns>
        public static double[,] ConvertDataTableToArray(DataTable table)
        {
            int rowCount = table.Rows.Count;
            int fieldCount = table.Columns.Count;

            var dataPoints = new double[rowCount,fieldCount];

            for (int rowPosition = 0; rowPosition < rowCount; rowPosition++)
            {
                DataRow row = table.Rows[rowPosition];

                for (int fieldPosition = 0; fieldPosition < fieldCount; fieldPosition++)
                {
                    double fieldValue;
                    try
                    {
                        fieldValue = double.Parse(row[fieldPosition].ToString());
                    }
                    catch (Exception ex)
                    {
                        Debug.WriteLine(ex.ToString());
                        throw new InvalidCastException("Invalid row at " + rowPosition + " and field " + fieldPosition,
                                                       ex);
                    }

                    dataPoints[rowPosition, fieldPosition] = fieldValue;
                }
            }

            return dataPoints;
        }
    }

    /// <summary>
    /// A class containing a group of data with similar characteristics (cluster)
    /// </summary>
    [Serializable]
    public class Cluster : CollectionBase
    {
        private double[] _clusterMean;
        private double[] _clusterSum;

        /// <summary>
        /// The sum of all the data in the cluster
        /// </summary>
        public double[] ClusterSum
        {
            get { return _clusterSum; }
        }

        /// <summary>
        /// The mean of all the data in the cluster
        /// </summary>
        public double[] ClusterMean
        {
            get
            {
                for (int count = 0; count < this[0].Length; count++)
                {
                    _clusterMean[count] = (_clusterSum[count]/List.Count);
                }

                return _clusterMean;
            }
        }

        /// <summary>
        /// Returns the one dimensional array data located at the index
        /// </summary>
        public virtual double[] this[int index]
        {
            get
            {
                //return the data point at List[index]
                return (double[]) List[index];
            }
        }

        /// <summary>
        /// Adds a single dimension array data to the cluster
        /// </summary>
        /// <param name="data">A 1-dimensional array containing data that will be added to the cluster</param>
        public virtual void Add(double[] data)
        {
            List.Add(data);

            if (List.Count == 1)
            {
                _clusterSum = new double[data.Length];

                _clusterMean = new double[data.Length];
            }

            for (int count = 0; count < data.Length; count++)
            {
                _clusterSum[count] = _clusterSum[count] + data[count];
            }
        }
    }

    /// <summary>
    /// A collection of Cluster objects or Clusters
    /// </summary>
    [Serializable]
    public class ClusterCollection : CollectionBase
    {
        /// <summary>
        /// Returns the Cluster at this index
        /// </summary>
        public virtual Cluster this[int index]
        {
            get
            {
                //return the Cluster at List[index]
                return (Cluster) List[index];
            }
        }

        /// <summary>
        /// Adds a Cluster to the collection of Clusters
        /// </summary>
        /// <param name="cluster">A Cluster to be added to the collection of clusters</param>
        public virtual void Add(Cluster cluster)
        {
            List.Add(cluster);
        }
    }
}
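The class above alternates between two steps until the cluster means stop moving: assign every point to its nearest mean, then recompute each mean. The sketch below shows that loop in a compact, self-contained form. It is not Aresh Saharkhiz's code: the LloydSketch class is hypothetical, it uses squared Euclidean distance rather than the configurable metric, it takes the starting centroids as an argument instead of picking random rows, and it adds two guards that are my own assumptions (an iteration cap, and leaving a centroid untouched if its cluster ends up empty).

```csharp
using System;

// Hypothetical, simplified illustration of the assign/update loop.
public static class LloydSketch
{
    // Squared Euclidean distance between two points of equal length.
    private static double SquaredDistance(double[] a, double[] b)
    {
        double sum = 0.0;
        for (int i = 0; i < a.Length; i++)
        {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    // Returns the cluster index assigned to each point.
    // The centroids array is updated in place.
    public static int[] Cluster(double[][] points, double[][] centroids, int maxIterations)
    {
        var assignment = new int[points.Length];
        for (int p = 0; p < assignment.Length; p++)
        {
            assignment[p] = -1; // not yet assigned
        }

        for (int iteration = 0; iteration < maxIterations; iteration++)
        {
            // Assignment step: move each point to its nearest centroid.
            bool changed = false;
            for (int p = 0; p < points.Length; p++)
            {
                int best = 0;
                double bestDistance = SquaredDistance(points[p], centroids[0]);
                for (int c = 1; c < centroids.Length; c++)
                {
                    double distance = SquaredDistance(points[p], centroids[c]);
                    if (distance < bestDistance)
                    {
                        bestDistance = distance;
                        best = c;
                    }
                }
                if (assignment[p] != best)
                {
                    assignment[p] = best;
                    changed = true;
                }
            }

            // Stable assignments mean the centroids would not move again.
            if (!changed)
            {
                break;
            }

            // Update step: each centroid becomes the mean of its points.
            for (int c = 0; c < centroids.Length; c++)
            {
                var sum = new double[centroids[c].Length];
                int members = 0;
                for (int p = 0; p < points.Length; p++)
                {
                    if (assignment[p] != c) continue;
                    for (int dim = 0; dim < sum.Length; dim++)
                    {
                        sum[dim] += points[p][dim];
                    }
                    members++;
                }
                if (members == 0) continue; // empty cluster: keep the old centroid
                for (int dim = 0; dim < sum.Length; dim++)
                {
                    centroids[c][dim] = sum[dim] / members;
                }
            }
        }

        return assignment;
    }
}
```

The empty-cluster guard matters because the Cluster.ClusterMean getter above reads this[0], which would throw if a cluster ever received no points.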

Here is the similarity class that can calculate Euclidean, Manhattan, Chebyshev, and Minkowski distances.

/// Most of this code was written by Aresh Saharkhiz
/// Re-organized by me
/// See Code Project: http://www.codeproject.com/KB/recipes/Quantitative_Distances.aspx
using System;

namespace LSITest
{
    public class similarity
    {
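        /// <summary>
        /// Finds the distance between two points using the named metric.
        /// </summary>
        /// <param name="x">The first point.</param>
        /// <param name="y">The second point.</param>
        /// <param name="distanceType">The metric name: "euclidean", "manhattan", "minkowski", or "chebyshev" (case-insensitive).</param>
        /// <returns>The distance between x and y.</returns>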
        public double FindDistance(double[] x, double[] y, string distanceType)
        {
            double distance;

            switch (distanceType.ToLower())
            {
                case "euclidean":
                    distance = EuclideanDistance(x, y);
                    break;
                case "manhattan":
                    distance = ManhattanDistance(x, y);
                    break;
                case "minkowski":
                    distance = MinkowskiDistance(x, y, 1);
                    break;
                case "chebyshev":
                    distance = ChebyshevDistance(x, y);
                    break;
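                default:
                    // An unknown metric name used to return 0.0, which made every point look identical
                    throw new ArgumentException("Unknown distance type: " + distanceType, "distanceType");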
            }

            return distance;
        }

        /// <summary>
        /// Finds the Euclidean distance.
        /// </summary>
        /// <param name="x">The x.</param>
        /// <param name="y">The y.</param>
        /// <returns></returns>
        public double EuclideanDistance(double[] x, double[] y)
        {
            double sum = 0.0;

            if (x.GetUpperBound(0) != y.GetUpperBound(0))
            {
                throw new ArgumentException("the number of elements in x must match the number of elements in y");
            }

            int count = x.Length;

            for (int i = 0; i < count; i++)
            {
                sum += Math.Pow(Math.Abs(x[i] - y[i]), 2);
            }

            double distance = Math.Sqrt(sum);
            return distance;
        }

        /// <summary>
        /// Finds Manhattan distance.
        /// </summary>
        /// <param name="x">The x.</param>
        /// <param name="y">The y.</param>
        /// <returns></returns>
        public double ManhattanDistance(double[] x, double[] y)
        {
            double sum = 0.0;

            if (x.GetUpperBound(0) != y.GetUpperBound(0))
            {
                throw new ArgumentException("the number of elements in x must match the number of elements in y");
            }

            int count = x.Length;

            for (int i = 0; i < count; i++)
            {
                sum += Math.Abs(x[i] - y[i]);
            }

            double distance = sum;
            return distance;
        }

        /// <summary>
        /// Finds the Chebyshev distance.
        /// </summary>
        /// <param name="x">The x.</param>
        /// <param name="y">The y.</param>
        /// <returns></returns>
        public static double ChebyshevDistance(double[] x, double[] y)
        {
            if (x.GetUpperBound(0) != y.GetUpperBound(0))
            {
                throw new ArgumentException("the number of elements in x must match the number of elements in y");
            }
            int count = x.Length;
            var newData = new double[count];

            for (int i = 0; i < count; i++)
            {
                newData[i] = Math.Abs(x[i] - y[i]);
            }
            double max = double.MinValue;

            foreach (double num in newData)
            {
                if (num > max)
                {
                    max = num;
                }
            }
            return max;
        }

        /// <summary>
        /// Finds the Minkowski distance.
        /// </summary>
        /// <param name="x">The x.</param>
        /// <param name="y">The y.</param>
        /// <param name="order">The order.</param>
        /// <returns></returns>
        public double MinkowskiDistance(double[] x, double[] y, double order)
        {
            double sum = 0.0;

            if (x.GetUpperBound(0) != y.GetUpperBound(0))
            {
                throw new ArgumentException("the number of elements in x must match the number of elements in y");
            }
            int count = x.Length;

            for (int i = 0; i < count; i++)
            {
                sum = sum + Math.Pow(Math.Abs(x[i] - y[i]), order);
            }

            double distance = Math.Pow(sum, (1 / order));
            return distance;
        }
    }
}
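For reference, the four distances computed by the class above can be written as follows for two points $x$ and $y$ with $n$ elements each. The symbol $p$ stands for the `order` argument of MinkowskiDistance; the notation is mine rather than the original author's.

```latex
d_{\text{Euclidean}}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
\qquad
d_{\text{Manhattan}}(x, y) = \sum_{i=1}^{n} \lvert x_i - y_i \rvert

d_{\text{Chebyshev}}(x, y) = \max_{1 \le i \le n} \lvert x_i - y_i \rvert
\qquad
d_{\text{Minkowski}}(x, y) = \Bigl( \sum_{i=1}^{n} \lvert x_i - y_i \rvert^{p} \Bigr)^{1/p}
```

Minkowski with $p = 1$ is the Manhattan distance and with $p = 2$ the Euclidean distance. Since FindDistance calls MinkowskiDistance with an order of 1, asking for "minkowski" there gives the same result as "manhattan".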

And finally the same LSI class used in the previous examples.

using System;
using SmartMathLibrary;

namespace LSITest
{
    public class lsi
    {
        // Public results of the LSI run; ToPrint holds the formatted HTML output
        public int MyDocColumnCount;
        public int MyDocRowCount;
        public double[,] MyDocs;
        public double[,] MyWords;
        public int MyWordsColumnCount;
        public int MyWordsRowCount;
        public string ToPrint;

        /// <summary>
        /// Runs the LSI test on the example term-document matrix.
        /// </summary>
        public void LSITest()
        {
            //Create Matrix
            var testArray = new double[,]
                                {
                                    {1, 0, 0, 1, 0, 0, 0, 0, 0},
                                    {1, 0, 1, 0, 0, 0, 0, 0, 0},
                                    {1, 1, 0, 0, 0, 0, 0, 0, 0},
                                    {0, 1, 1, 0, 1, 0, 0, 0, 0},
                                    {0, 1, 1, 2, 0, 0, 0, 0, 0},
                                    {0, 1, 0, 0, 1, 0, 0, 0, 0},
                                    {0, 1, 0, 0, 1, 0, 0, 0, 0},
                                    {0, 0, 1, 1, 0, 0, 0, 0, 0},
                                    {0, 1, 0, 0, 0, 0, 0, 0, 1},
                                    {0, 0, 0, 0, 0, 1, 1, 1, 0},
                                    {0, 0, 0, 0, 0, 0, 1, 1, 1},
                                    {0, 0, 0, 0, 0, 0, 0, 1, 1}
                                };

            // Load array in to Matrix
            var a = new Matrix(testArray);

            // print original matrix
            PrintMatrix(a);

            // perform Latent Semantic Indexing
            GetDocumentWordPlots(a);
        }

        /// <summary>
        /// Prints the matrix.
        /// </summary>
        /// <param name="myMatrix">My matrix.</param>
        private void PrintMatrix(IMatrix myMatrix)
        {
            ToPrint += "<br /><br />";

            for (int i = 0; i < myMatrix.Rows; i++)
            {
                for (int j = 0; j < myMatrix.Columns; j++)
                {
                    ToPrint += String.Format("{0:0.##}", myMatrix.MatrixData[i, j]) + "\t";
                }
                ToPrint += "<br />";
            }
        }

        /// <summary>
        /// Gets the document word plots.
        /// </summary>
        /// <param name="myMatrix">My matrix.</param>
        private void GetDocumentWordPlots(Matrix myMatrix)
        {
            // Run singular value decomposition
            var svd = new SingularValueDecomposition(myMatrix);
            svd.ExecuteDecomposition();

            // Put components into individual matrices
            Matrix wordVector = svd.U.Copy();
            Matrix sigma = svd.S.ToMatrix();
            Matrix documentVector = svd.V.Copy();

            // get value of k
            // you can also manually set the value of k
            var k = (int) Math.Floor(Math.Sqrt(myMatrix.Columns));

            // reduce the vectors
            Matrix reducedWordVector = CopyMatrix(wordVector, wordVector.Rows, k - 1);
            Matrix reducedSigma = CreateSigmaMatrix(sigma, k - 1, k - 1);
            Matrix reducedDocumentVector = CopyMatrix(documentVector, documentVector.Rows, k - 1);

            // Recalculate the matrix
            Matrix docs = reducedDocumentVector*reducedSigma;
            Matrix words = reducedWordVector*reducedSigma;

            // Fill doc plot locations
            MyDocs = new double[docs.Rows,docs.Columns];
            for (int i = 0; i < docs.Rows; i++)
            {
                for (int j = 0; j < docs.Columns; j++)
                {
                    MyDocs[i, j] = docs.MatrixData[i, j];
                }
            }

            // Fill word plot locations
            MyWords = new double[words.Rows,words.Columns];
            for (int i = 0; i < words.Rows; i++)
            {
                for (int j = 0; j < words.Columns; j++)
                {
                    MyWords[i, j] = words.MatrixData[i, j];
                }
            }

            // Set counts for charts
            MyDocRowCount = docs.Rows;
            MyWordsRowCount = words.Rows;

            PrintMatrix(docs);
            PrintMatrix(words);
        }

        /// <summary>
        /// Creates the sigma matrix.
        /// </summary>
        /// <param name="matrix">The matrix.</param>
        /// <param name="rowEnd">The row end.</param>
        /// <param name="columnEnd">The column end.</param>
        /// <returns></returns>
        private static Matrix CreateSigmaMatrix(IMatrix matrix, int rowEnd, int columnEnd)
        {
            var copyMatrix = new Matrix(rowEnd, columnEnd);

            for (int i = 0; i < columnEnd; i++)
            {
                copyMatrix.MatrixData[i, i] = matrix.MatrixData[i, 0];
            }

            return copyMatrix;
        }

        /// <summary>
        /// Copies the matrix.
        /// </summary>
        /// <param name="myMatrix">My matrix.</param>
        /// <param name="rowEnd">The row end.</param>
        /// <param name="columnEnd">The column end.</param>
        /// <returns></returns>
        private static Matrix CopyMatrix(IMatrix myMatrix, int rowEnd, int columnEnd)
        {
            var copyMatrix = new Matrix(rowEnd, columnEnd);

            for (int i = 0; i < rowEnd; i++)
            {
                for (int j = 0; j < columnEnd; j++)
                {
                    copyMatrix.MatrixData[i, j] = myMatrix.MatrixData[i, j];
                }
            }

            return copyMatrix;
        }
    }
}
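In matrix terms, GetDocumentWordPlots keeps only the first few singular values and vectors and uses them as plot coordinates. A sketch of what the code computes, where $A$ is the 12 × 9 term-document matrix and $r$ is the number of dimensions kept (the subscript notation is mine, not from the code):

```latex
A \approx U_r \Sigma_r V_r^{T},
\qquad
\text{docs} = V_r \Sigma_r,
\qquad
\text{words} = U_r \Sigma_r
```

In the code, k is $\lfloor \sqrt{9} \rfloor = 3$ and the reduced matrices keep k - 1 columns, so $r = 2$. That gives each of the 9 documents and each of the 12 words an (x, y) position on the chart.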

And what do the results look like?

[Image: kmeansresults]

As you can see, the K-Means clustering algorithm correctly grouped the documents into the appropriate categories.

Thanks go to Aresh Saharkhiz for sharing his implementation of K-Means clustering; his Code Project article is recommended reading.


