<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>eric.ness.net &#187; Machine Learning</title>
	<atom:link href="http://eric.ness.net/archives/category/machine-learning/feed/" rel="self" type="application/rss+xml" />
	<link>http://eric.ness.net</link>
	<description>...I never learned to read.</description>
	<lastBuildDate>Fri, 23 Jul 2010 05:22:06 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
		<item>
		<title>Cryptanalysis Using n-Gram Probabilities</title>
		<link>http://eric.ness.net/archives/cryptanalysis-using-n-gram-probabilities/</link>
		<comments>http://eric.ness.net/archives/cryptanalysis-using-n-gram-probabilities/#comments</comments>
		<pubDate>Sat, 01 May 2010 09:35:31 +0000</pubDate>
		<dc:creator>Eric</dc:creator>
				<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Natural Language Processing]]></category>

		<guid isPermaLink="false">http://eric.ness.net/?p=472</guid>
		<description><![CDATA[Cryptanalysis Using Microsoft Web N-Gram Service]]></description>
			<content:encoded><![CDATA[


<p>One of my favorite programmers is <a href="http://norvig.com/">Peter Norvig</a>, who is currently Director of Research at Google. This summer I picked up a book called <a href="http://oreilly.com/catalog/9780596157128">Beautiful Data</a>, to which Norvig contributed a chapter called &#8220;Natural Language Corpus Data&#8221; that outlines a number of very cool things you can do with n-grams from the Google corpus. It covers the things you&#8217;d expect, such as spelling correction and word segmentation. The one application I had never considered was cryptanalysis.</p>
<p>The cool thing is that Google will give you their corpus to <a href="http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html">download</a>. The only problem is that it contains &#8220;1,024,908,267,229 words of running text&#8221; and is 24 GB compressed, which is a bit impractical to work with on your dev box. Enter Microsoft: the <a href="http://web-ngram.research.microsoft.com/info/">Microsoft Web N-gram Service</a> just went into beta and is now available to professors and students, so I immediately signed up, and I have to say that it is pretty cool!</p>
<p>I wanted to try out the new service on one of Norvig&#8217;s examples from the book: breaking a character-shift cipher using n-gram probabilities. This is a very basic type of encryption in which every letter is shifted by a fixed number of positions in the alphabet, so that with a shift of 13 an &#8216;a&#8217; becomes an &#8216;n&#8217;. To break it you simply run through all 26 possible shifts, look up the probability of each word in every candidate phrase, and pick the candidate whose words have the highest combined probability.</p>
<p>This project has a Service Reference connected to <a href="http://web-ngram.research.microsoft.com/info/">Microsoft&#8217;s N-gram Service</a>. The service requires an n-gram model and a user ID, which you get when you sign up (<a href="http://web-ngram.research.microsoft.com/info/quickstart.htm">see their quickstart tutorial</a>). So let&#8217;s take a look at some code:</p>
<pre class="brush: jscript;">
using System;
using System.Collections.Generic;
using System.Configuration;
using System.Linq;
using MicrosoftNGramTest.NGramService;

namespace MicrosoftNGramTest.classes
{
    internal class Shift
    {
        #region Variables

        private readonly string _alphabet = &quot;abcdefghijklmnopqrstuvwxyz&quot;;
        private readonly string _ngramModel = ConfigurationManager.AppSettings.Get(&quot;ngramModel&quot;);
        private readonly string _userToken = ConfigurationManager.AppSettings.Get(&quot;userToken&quot;);

        #endregion

        #region Run The Test

        /// &lt;summary&gt;
        /// Runs the test
        /// &lt;/summary&gt;
        public void Test()
        {
            // Print title
            Console.WriteLine(&quot;Character Shift Cryptanalysis&quot;);
            Console.WriteLine(&quot;#############################&quot;);

            // Local Variables
            const string phrase = &quot;Yvfgra, qb lbh jnag gb xabj n frperg?&quot;;
            string[] words = phrase.ToLower().Split(' ');
            var newPhrase = new string[26];
            var client = new LookupServiceClient();
            var result = new Dictionary&lt;string, int&gt;();

            try
            {
                // Loop the word variations
                foreach (string s in words)
                {
                    char[] currentWord = s.ToCharArray();

                    foreach (char c in currentWord)
                    {
                        for (int i = 0; i &lt; 26; i++)
                        {
                            newPhrase[i] += CharShift(c, i);
                        }
                    }

                    for (int i = 0; i &lt; newPhrase.Count(); i++)
                    {
                        newPhrase[i] += &quot; &quot;;
                    }
                }

                // Print phrases with probabilities
                foreach (string s in newPhrase)
                {
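                    // Drop the empty entry produced by the trailing space
                    string[] newWords = s.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);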
                    double prob = 0;
                    foreach (string word in newWords)
                    {
                        prob += client.GetProbability(_userToken, _ngramModel, word);
                    }
                    Console.WriteLine(s + &quot; &quot; + Convert.ToInt32(prob));
                    result.Add(s, Convert.ToInt32(prob));
                }

                // Print answer
                Console.WriteLine();
                Console.WriteLine(&quot;The answer is:&quot;);
                KeyValuePair&lt;string, int&gt; q = (from t in result
                                               orderby t.Value descending
                                               select t).FirstOrDefault();
                Console.WriteLine(q.Key + &quot; &quot; + q.Value);
            }
            finally
            {
                client.Close();
            }
        }

        #endregion

        #region Shifting

        /// &lt;summary&gt;
        /// Gets the alphabet array.
        /// &lt;/summary&gt;
        /// &lt;returns&gt;&lt;/returns&gt;
        private char[] GetAlphabetArray()
        {
            return _alphabet.ToCharArray();
        }

        /// &lt;summary&gt;
        /// Gets the current char array position.
        /// &lt;/summary&gt;
        /// &lt;param name=&quot;c&quot;&gt;The c.&lt;/param&gt;
        /// &lt;returns&gt;&lt;/returns&gt;
        private int GetCurrentCharArrayPosition(char c)
        {
            int position = 0;
            int count = 0;

            foreach (char letter in GetAlphabetArray())
            {
                if (letter == c)
                {
                    position = count;
                }
                count++;
            }
            return position;
        }

        /// &lt;summary&gt;
        /// Shifts the character.
        /// &lt;/summary&gt;
        /// &lt;param name=&quot;c&quot;&gt;The c.&lt;/param&gt;
        /// &lt;param name=&quot;increase&quot;&gt;The increase.&lt;/param&gt;
        /// &lt;returns&gt;&lt;/returns&gt;
        private char CharShift(char c, int increase)
        {
            const int numOfLetters = 26;
            char[] alphabet = GetAlphabetArray();
            int currentArrayPosition = GetCurrentCharArrayPosition(c);
            char letter = c;

            if (IsCharInArray(c))
            {
                // Wrap around the end of the alphabet with modulo arithmetic
                letter = alphabet[(currentArrayPosition + increase) % numOfLetters];
            }
            return letter;
        }

        /// &lt;summary&gt;
        /// Determines whether the char is in the array.
        /// &lt;/summary&gt;
        /// &lt;param name=&quot;c&quot;&gt;The c.&lt;/param&gt;
        /// &lt;returns&gt;
        /// 	&lt;c&gt;true&lt;/c&gt; if [is char in array] [the specified c]; otherwise, &lt;c&gt;false&lt;/c&gt;.
        /// &lt;/returns&gt;
        private bool IsCharInArray(char c)
        {
            bool isCharInArray = false;
            IEnumerable&lt;char&gt; q = (from t in GetAlphabetArray()
                                   where t == c
                                   select t);
            if (q.Count() &gt; 0)
            {
                isCharInArray = true;
            }
            return isCharInArray;
        }

        #endregion
    }
}
</pre>
<p>And here is the result!<br />
<img src="/wp-content/uploads/2010/05/crypt_results.jpg" width="577" alt="Results" /></p>
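<p>The shift-and-score idea can also be tried without access to the web service. The following is a minimal sketch in Python (not the C# project above), assuming a tiny hand-made table of word probabilities in place of the n-gram service. The names <code>WORD_PROBS</code>, <code>UNKNOWN_LOG_PROB</code>, <code>shift_text</code>, <code>score</code> and <code>crack</code> are made up for this illustration, and the probability values are invented rather than taken from any real corpus.</p>

```python
import math

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

# Hypothetical unigram probabilities standing in for the n-gram service.
# The numbers are invented for illustration only.
WORD_PROBS = {
    "listen": 1e-5, "do": 5e-3, "you": 8e-3, "want": 1e-3,
    "to": 2e-2, "know": 1e-3, "a": 2e-2, "secret": 5e-5,
}
UNKNOWN_LOG_PROB = math.log10(1e-10)  # assumed penalty for unseen words


def shift_text(text, shift):
    """Shift every letter a-z forward by shift; leave other characters alone."""
    shifted = []
    for ch in text.lower():
        if ch in ALPHABET:
            shifted.append(ALPHABET[(ALPHABET.index(ch) + shift) % 26])
        else:
            shifted.append(ch)
    return "".join(shifted)


def score(text):
    """Sum of log10 word probabilities; higher means more English-like."""
    total = 0.0
    for word in text.split():
        word = word.strip(",.?!")
        if word in WORD_PROBS:
            total += math.log10(WORD_PROBS[word])
        else:
            total += UNKNOWN_LOG_PROB
    return total


def crack(ciphertext):
    """Try all 26 shifts and return the highest-scoring candidate."""
    candidates = [shift_text(ciphertext, s) for s in range(26)]
    return max(candidates, key=score)


print(crack("Yvfgra, qb lbh jnag gb xabj n frperg?"))
```

<p>Adding log-probabilities is the same as multiplying the word probabilities, but it avoids numeric underflow on longer phrases. With a real language model the lookup table would be replaced by a call to the model.</p>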
]]></content:encoded>
			<wfw:commentRss>http://eric.ness.net/archives/cryptanalysis-using-n-gram-probabilities/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Apriori Algorithm</title>
		<link>http://eric.ness.net/archives/apriori-algorithm/</link>
		<comments>http://eric.ness.net/archives/apriori-algorithm/#comments</comments>
		<pubDate>Tue, 02 Mar 2010 00:43:31 +0000</pubDate>
		<dc:creator>Eric</dc:creator>
				<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Programming]]></category>

		<guid isPermaLink="false">http://eric.ness.net/?p=445</guid>
		<description><![CDATA[Review of Apriori algorithm and changes.]]></description>
			<content:encoded><![CDATA[


<p>I&#8217;ve been meaning to get into the <a href="http://datamining.codeplex.com/">Data Mining SDK</a> on CodePlex for a while, as it has a couple of good items in it. The one I was really interested in was the <a href="http://en.wikipedia.org/wiki/Apriori_algorithm">Apriori algorithm</a>.</p>
<p>Wikipedia describes Apriori:</p>
<blockquote><p>In computer science and data mining, Apriori is a classic algorithm for learning association rules. Apriori is designed to operate on databases containing transactions (for example, collections of items bought by customers, or details of a website frequentation). Other algorithms are designed for finding association rules in data having no transactions (Winepi and Minepi), or having no timestamps (DNA sequencing).</p></blockquote>
<p>The classic example: if you own a store and a customer buys milk, what is the probability that the same customer will also buy bread and eggs? Or, if a voter supported one ballot issue, what is the chance that they also voted for another? The applications for this approach are pretty much limitless.</p>
<p>The code in the SDK is pretty good, with a couple of exceptions: there is little documentation, and it only supports XML files and OleDb data connections. I have reworked it so that it can also connect to a Microsoft SQL Server database.</p>
<p>For this test application I created a simple C# console application and imported the &#8220;APriori&#8221; project into the solution. You will need to add the following two methods to classes in the APriori project.</p>
<p>Add this method to DataAccessLayer.cs:</p>
<pre class="brush: jscript;">
        // Requires: using System.Data.SqlClient;
        public Data GetTransactionsData(string rdbmsConnectionString, string dataSource)
        {
            myDatabase = new Data();
            // dataSource is concatenated into the SQL text, so pass only trusted table names
            string query = &quot;SELECT * FROM &quot; + dataSource;

            using (var myConn = new SqlConnection(rdbmsConnectionString))
            using (var myDBAdapter = new SqlDataAdapter(query, myConn))
            {
                // Fill opens the connection if it is closed and closes it again afterwards
                myDBAdapter.Fill(myDatabase, &quot;TransactionTable&quot;);
            }
            return myDatabase;
        }
</pre>
<p>Add this method to DataMining.cs:</p>
<pre class="brush: jscript;">
public Data MarketBasedAnalysis(double supportCount, double minimumConfidence, string connectionString, string dataSource)
        {

            Database database = new Database();
            ItemsetCandidate Item = new ItemsetCandidate();

            this.AP = new APriori.Apriori();
            this.AP.ProgressMonitorEvent += new ProgressMonitorEventHandler(this.OnProgressMonitoringCompletedEvent);
            this.dataBase = database.GetTransactionsData(connectionString, dataSource);
            database.Transactions = this.dataBase;
            this.transactionsCount = this.dataBase.TransactionTable.Count;

            supportCount = ((supportCount / 100) * this.transactionsCount);

            minimumConfidence = (minimumConfidence / 100);

            string support = &quot;SupportCount &gt;= &quot; + supportCount + &quot; AND Level &gt; 1&quot;;

            string sort = &quot;SupportCount, Level&quot;;
            ItemsetCandidate uniqueItems = AP.CreateOneItemsets(database);
            AP.AprioriGenerator(uniqueItems, database, Convert.ToInt32(supportCount));
            ItemsetArrayList[] keys = database.GetItemset(support, sort);
            string msg = &quot;Creating Frequent Subsets for Items&quot;;
            ProgressMonitorEventArgs e = new ProgressMonitorEventArgs(1, 100, 95, &quot;DataMining.MarketBasedAnalysis(3)&quot;, msg);
            this.OnProgressMonitorEvent(e);

            for (int counter = 0; counter &lt; keys.Length; counter++)
            {
                AP.CreateItemsetSubsets(0, keys[counter], null, database);
            }

            msg = &quot;Completed C#.NET Data Mining Market Based Analysis&quot;;
            e = new ProgressMonitorEventArgs(1, 100, 100, &quot;DataMining.MarketBasedAnalysis(3)&quot;, msg);
            this.OnProgressMonitorEvent(e);

            //Set the public properties of the class
            this.minimumSupportCount = supportCount;
            this.minimumConfidence = minimumConfidence;
            this.connectionString = connectionString;
            this.dataSource = dataSource;

            //return the database of transactions
            return this.dataBase;

        }
</pre>
<p>Here is the class in my console application:</p>
<pre class="brush: jscript;">
using System;
using System.Data;
using VISUAL_BASIC_DATA_MINING_NET;
using VISUAL_BASIC_DATA_MINING_NET.CustomEvents;

namespace APr2.classes
{
    internal class testrun
    {
        private Data _dataAnalysis;
        public event ProgressMonitorEventHandler ProgressMonitorEvent;

        /// &lt;summary&gt;
        /// Runs the Apriori.
        /// &lt;/summary&gt;
        public void RunApriori()
        {
            // Create Data Mining Object
            var myDM = new DataMining();

            // Register Event
            myDM.ProgressMonitorEvent += OnProgressMonitorEvent;

            // Connect To Data Base &amp; Process Items
            _dataAnalysis = myDM.MarketBasedAnalysis(2,             // Minimum support, as a percentage of all transactions
                                                     2,             // Minimum confidence, as a percentage
                                                     @&quot;Data Source=(local);Initial Catalog=Apriori;Integrated Security=True;&quot;, // Connection String
                                                     &quot;Example&quot;);    // Table in db

            // Copy to Data View
            var dataView = new ViewData();
            _dataAnalysis.Tables.Add(dataView.CreateViewRulesTable(2, _dataAnalysis).Copy());
            _dataAnalysis.Tables.Add(dataView.CreateViewSubsetTable(_dataAnalysis).Copy());

            // Spacer Line
            Console.WriteLine();

            // Print Items
            foreach (DataRow row in dataView.ViewDataSet.Tables[1].Rows)
            {
                double per = Convert.ToDouble(row.ItemArray[2].ToString().Substring(0, (row.ItemArray[2].ToString().Length -1)));
                Console.WriteLine(row.ItemArray[0] + &quot;\t&quot; + row.ItemArray[1] + &quot;\t&quot; + String.Format(&quot;{0:###.##%}&quot;, (per/100)));
            }
        }

        /// &lt;summary&gt;
        /// Called when [progress monitor event].
        /// &lt;/summary&gt;
        /// &lt;param name=&quot;sender&quot;&gt;The sender.&lt;/param&gt;
        /// &lt;param name=&quot;e&quot;&gt;The &lt;see cref=&quot;VISUAL_BASIC_DATA_MINING_NET.CustomEvents.ProgressMonitorEventArgs&quot;/&gt; instance containing the event data.&lt;/param&gt;
        public void OnProgressMonitorEvent(object sender, ProgressMonitorEventArgs e)
        {
            // Prints Event Messages
            Console.Write(&quot;\r&quot; + e.EventMessage);
        }
    }
}
</pre>
<p>The SQL Server script for the table is:</p>
<pre class="brush: jscript;">
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[Example](
	[TransactionID] [int] IDENTITY(1,1) NOT NULL,
	[Transactions] [nvarchar](50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
 CONSTRAINT [PK_Example] PRIMARY KEY CLUSTERED
(
	[TransactionID] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]
</pre>
<p>And these records:</p>
<pre class="brush: jscript;">
1	Books, CD, Video
2	CD, Games
3	CD, DVD
4	Books, CD, Games
5	Books, DVD
6	CD, DVD
7	Books, DVD
8	Books, CD, DVD, Video
9	Books, CD, DVD
10	Books, Games
11	Games, Lasers
</pre>
<p>Run the RunApriori() method in my class and it will produce the results shown below. Have fun.</p>
<p><a href="http://eric.ness.net/wp-content/uploads/2010/03/ap_full.jpg"><img class="alignnone size-full wp-image-448" title="ap_full" src="http://eric.ness.net/wp-content/uploads/2010/03/ap_full.jpg" alt="" width="577" height="369" /></a></p>
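<p>To make the two thresholds passed to the analysis concrete, here is a minimal sketch in Python (not part of the SDK) that computes support and confidence by brute force over the eleven example records above. The function names <code>support_count</code>, <code>support</code> and <code>confidence</code> are made up for this illustration, and the sketch does not attempt the candidate generation and pruning that the real Apriori algorithm performs.</p>

```python
# The eleven example transactions listed above.
TRANSACTIONS = [
    {"Books", "CD", "Video"},
    {"CD", "Games"},
    {"CD", "DVD"},
    {"Books", "CD", "Games"},
    {"Books", "DVD"},
    {"CD", "DVD"},
    {"Books", "DVD"},
    {"Books", "CD", "DVD", "Video"},
    {"Books", "CD", "DVD"},
    {"Books", "Games"},
    {"Games", "Lasers"},
]


def support_count(items):
    """Number of transactions that contain every item in items."""
    return sum(1 for t in TRANSACTIONS if items.issubset(t))


def support(items):
    """Fraction of all transactions that contain items."""
    return support_count(items) / len(TRANSACTIONS)


def confidence(antecedent, consequent):
    """Share of the transactions containing antecedent that also contain consequent."""
    return support_count(antecedent | consequent) / support_count(antecedent)


print(support_count({"Books", "CD"}))              # 4 of the 11 transactions
print(round(confidence({"Books"}, {"CD"}), 3))     # 4 / 7
print(confidence({"Video"}, {"Books"}))            # 2 / 2
```

<p>In this data every transaction that contains Video also contains Books, so that rule has a confidence of 1.0, while only four of the seven Books transactions also contain a CD.</p>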
]]></content:encoded>
			<wfw:commentRss>http://eric.ness.net/archives/apriori-algorithm/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>K-Means Document Clustering</title>
		<link>http://eric.ness.net/archives/k-means-document-clustering/</link>
		<comments>http://eric.ness.net/archives/k-means-document-clustering/#comments</comments>
		<pubDate>Fri, 06 Nov 2009 17:35:48 +0000</pubDate>
		<dc:creator>Eric</dc:creator>
				<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Visualization]]></category>
		<category><![CDATA[C#]]></category>
		<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://eric.ness.net/?p=357</guid>
		<description><![CDATA[K-Means Document Clustering in C#]]></description>
			<content:encoded><![CDATA[


<p>Using our <a href="http://eric.ness.net/archives/plotting-documents-words-using-latent-semantic-indexing/">previous example</a> as a basis, let&#8217;s move to the next step and use the <a href="http://en.wikipedia.org/wiki/K-means_clustering">K-Means</a> clustering algorithm to group the documents into their appropriate categories.</p>
<p>In the paper &#8220;<a href="http://lsa.colorado.edu/papers/JASIS.lsi.90.pdf">Indexing by Latent Semantic Analysis</a>&#8221; (Deerwester et al.), the authors give an example of 9 paper titles grouped into two categories: &#8220;human computer interaction&#8221; and &#8220;graphs &amp; trees&#8221;. So far, we&#8217;ve used <a href="http://eric.ness.net/archives/singular-value-decomposition/">Singular Value Decomposition</a> (SVD) and <a href="http://eric.ness.net/archives/latent-semantic-indexing/">Latent Semantic Indexing</a> (LSI) to better understand the relationship between words and documents. In the <a href="http://eric.ness.net/archives/plotting-documents-words-using-latent-semantic-indexing/">last blog post</a> we then took the LSI results and plotted words and documents on a two-dimensional Cartesian plane.</p>
<p>All of this is pretty interesting in and of itself; however, the next step is to see which documents belong in each group. One way to do this is by using K-Means clustering.</p>
<blockquote><p>Simply speaking k-means clustering is an algorithm to classify or to group your objects based on attributes/features into K number of group. K is positive integer number. The grouping is done by minimizing the sum of squares of distances between data and the corresponding cluster centroid. Thus the purpose of K-mean clustering is to classify the data. [<a href="http://people.revoledu.com/kardi/tutorial/kMean/WhatIs.htm">Kardi Teknomo</a>]</p></blockquote>
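<p>The loop described in that definition can be sketched in a few lines. The following is a minimal illustration in Python, not the implementation used in this project: it assumes squared Euclidean distance, takes the first k points as the starting centroids, and the names <code>squared_distance</code>, <code>mean</code> and <code>k_means</code> as well as the sample points are made up for the example.</p>

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Coordinate-wise mean (centroid) of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, max_iterations=100):
    """Return (labels, centroids) after alternating assignment and update steps."""
    centroids = list(points[:k])  # naive initialisation: the first k points
    assignment = None
    for _ in range(max_iterations):
        # Assignment step: attach every point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the result is stable
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = mean(members)
    return assignment, centroids


# Made-up sample data: two well separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
labels, centers = k_means(points, 2)
print(labels)
print(centers)
```

<p>Real implementations usually pick the starting centroids more carefully, because a poor start can leave the algorithm in a weak local minimum.</p>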
<p>A big chunk of the code is built on the same project we have been working on. I am using <a href="http://sites.google.com/site/docaresh/">Aresh Saharkhiz</a>&#8217;s K-Means implementation in the project, with some minor changes and refactoring done by me.</p>
<p>Let&#8217;s take a look at the code!</p>
<p>This first part is the display (an ASP.NET app):</p>
<pre class="brush: jscript;">
&lt;%@ Page Language=&quot;C#&quot; AutoEventWireup=&quot;true&quot; CodeBehind=&quot;Default.aspx.cs&quot; Inherits=&quot;LSITest._Default&quot; %&gt;
&lt;%@ Register Assembly=&quot;DundasWebChart&quot; Namespace=&quot;Dundas.Charting.WebControl&quot; TagPrefix=&quot;DCWC&quot; %&gt;
&lt;!DOCTYPE html PUBLIC &quot;-//W3C//DTD XHTML 1.0 Transitional//EN&quot; &quot;http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd&quot;&gt;

&lt;html xmlns=&quot;http://www.w3.org/1999/xhtml&quot; &gt;
&lt;head runat=&quot;server&quot;&gt;
    &lt;title&gt;LSI Test&lt;/title&gt;
&lt;/head&gt;
&lt;body&gt;
    &lt;form id=&quot;form1&quot; runat=&quot;server&quot;&gt;
    &lt;div&gt;
        &lt;DCWC:Chart ID=&quot;Chart1&quot; runat=&quot;server&quot; Height=&quot;400px&quot; Width=&quot;400px&quot;
            ImageType=&quot;Jpeg&quot;&gt;
            &lt;Legends&gt;
                &lt;DCWC:Legend Name=&quot;Default&quot; Alignment=&quot;Center&quot; Docking=&quot;Bottom&quot;&gt;&lt;/DCWC:Legend&gt;
            &lt;/Legends&gt;
            &lt;Titles&gt;
                &lt;DCWC:Title Name=&quot;Title1&quot;&gt;
                &lt;/DCWC:Title&gt;
            &lt;/Titles&gt;
            &lt;Series&gt;
                &lt;DCWC:Series Name=&quot;Series1&quot; ChartType=&quot;Point&quot; MarkerBorderColor=&quot;64, 64, 64&quot;
                    ShadowOffset=&quot;1&quot;&gt;
                &lt;/DCWC:Series&gt;
                &lt;DCWC:Series Name=&quot;Series2&quot; ChartType=&quot;Point&quot; MarkerBorderColor=&quot;64, 64, 64&quot;
                    ShadowOffset=&quot;1&quot;&gt;
                &lt;/DCWC:Series&gt;
                &lt;DCWC:Series Name=&quot;Series3&quot; ChartType=&quot;Point&quot; MarkerBorderColor=&quot;64, 64, 64&quot;
                    ShadowOffset=&quot;1&quot;&gt;
                &lt;/DCWC:Series&gt;
            &lt;/Series&gt;
            &lt;ChartAreas&gt;
                &lt;DCWC:ChartArea Name=&quot;Series2&quot;&gt;
                    &lt;axisy interval=&quot;0.5&quot; maximum=&quot;2&quot; minimum=&quot;-1&quot;&gt;
                        &lt;majorgrid linecolor=&quot;Gray&quot; linestyle=&quot;Dash&quot; /&gt;
                    &lt;/axisy&gt;
                    &lt;axisx interval=&quot;0.5&quot; maximum=&quot;2.5&quot; minimum=&quot;-0.5&quot;&gt;
                        &lt;majorgrid linecolor=&quot;Gray&quot; linestyle=&quot;Dash&quot; /&gt;
                    &lt;/axisx&gt;
                &lt;/DCWC:ChartArea&gt;
            &lt;/ChartAreas&gt;
        &lt;/DCWC:Chart&gt;
    &lt;/div&gt;
    &lt;/form&gt;
&lt;/body&gt;
&lt;/html&gt;
</pre>
<p>This is the code-behind for the ASP.NET page. Because we are only dealing with two known categories, K-Means plots just those two clusters; if you wanted more, you would have to rewrite the ColorCodeDocuments function.</p>
<pre class="brush: jscript;">
using System;
using System.Data;
using System.Drawing;
using System.Web.UI;
using Dundas.Charting.WebControl;

namespace LSITest
{
    public partial class _Default : Page
    {
        /// &lt;summary&gt;
        /// Handles the Load event of the Page control.
        /// &lt;/summary&gt;
        /// &lt;param name=&quot;sender&quot;&gt;The source of the event.&lt;/param&gt;
        /// &lt;param name=&quot;e&quot;&gt;The &lt;see cref=&quot;System.EventArgs&quot;/&gt; instance containing the event data.&lt;/param&gt;
        protected void Page_Load(object sender, EventArgs e)
        {
            // Perform LSI
            var mylsi = new lsi();
            mylsi.LSITest();
            double[,] myDocs = mylsi.MyDocs;

            // Plot Documents and the k-means
            const string distanceType = &quot;manhattan&quot;;
            PlotDocuments(myDocs, mylsi.MyDocRowCount);
            PlotKMeansPoints(myDocs, 2, distanceType);
            ColorCodeDocuments(distanceType);

            // If you want to plot the words just un-comment the next two lines
            //double[,] myWords = mylsi.MyWords;
            //PlotWords(myWords, mylsi.MyWordsRowCount);

            // comment this line out to show words in legend
            Chart1.Series[&quot;Series2&quot;].ShowInLegend = false;
        }

        /// &lt;summary&gt;
        /// Plots the words.
        /// &lt;/summary&gt;
        /// &lt;param name=&quot;myWords&quot;&gt;My words.&lt;/param&gt;
        /// &lt;param name=&quot;myWordsRowCount&quot;&gt;My words row count.&lt;/param&gt;
        private void PlotWords(double[,] myWords, int myWordsRowCount)
        {
            for (int i = 0; i &lt; myWordsRowCount; i++)
            {
                Chart1.Series[&quot;Series2&quot;].Points.AddXY(myWords[i, 0], myWords[i, 1]);
            }

            // Set point colors and shapes
            Chart1.Series[&quot;Series2&quot;].LegendText = &quot;Words&quot;;
            Chart1.Series[&quot;Series2&quot;].Color = Color.Gray;
            Chart1.Series[&quot;Series2&quot;].MarkerStyle = MarkerStyle.Circle;
            Chart1.Series[&quot;Series2&quot;].MarkerSize = 6;
        }

        /// &lt;summary&gt;
        /// Plots the documents.
        /// &lt;/summary&gt;
        /// &lt;param name=&quot;myDocs&quot;&gt;My docs.&lt;/param&gt;
        /// &lt;param name=&quot;myDocRowCount&quot;&gt;My doc row count.&lt;/param&gt;
        private void PlotDocuments(double[,] myDocs, int myDocRowCount)
        {
            // Load documents
            for (int i = 0; i &lt; myDocRowCount; i++)
            {
                Chart1.Series[&quot;Series1&quot;].Points.AddXY(myDocs[i, 0], myDocs[i, 1]);
            }

            // Set point colors and shapes
            Chart1.Series[&quot;Series1&quot;].LegendText = &quot;Documents&quot;;
            Chart1.Series[&quot;Series1&quot;].Color = Color.Red;
            Chart1.Series[&quot;Series1&quot;].MarkerStyle = MarkerStyle.Diamond;
            Chart1.Series[&quot;Series1&quot;].MarkerSize = 12;
        }

        /// &lt;summary&gt;
        /// Plots the K means points.
        /// &lt;/summary&gt;
        /// &lt;param name=&quot;items&quot;&gt;The items.&lt;/param&gt;
        /// &lt;param name=&quot;k&quot;&gt;The k.&lt;/param&gt;
        /// &lt;param name=&quot;distanceType&quot;&gt;The distance measure to use, for example manhattan.&lt;/param&gt;
        private void PlotKMeansPoints(double[,] items, int k, string distanceType)
        {
            ClusterCollection clusters = kmeans.ClusterDataSet(k, items, distanceType);

            for (int i = 0; i &lt; clusters.Count; i++)
            {
                Chart1.Series[&quot;Series3&quot;].Points.AddXY(clusters[i].ClusterMean[0], clusters[i].ClusterMean[1]);
            }

            // Set point colors and shapes
            Chart1.Series[&quot;Series3&quot;].LegendText = &quot;Cluster&quot;;
            Chart1.Series[&quot;Series3&quot;].Color = Color.Gold;
            Chart1.Series[&quot;Series3&quot;].MarkerStyle = MarkerStyle.Star6;
            Chart1.Series[&quot;Series3&quot;].MarkerSize = 18;
        }

        /// &lt;summary&gt;
        /// Color-codes each document by its nearest cluster center.
        /// &lt;/summary&gt;
        /// &lt;param name=&quot;distanceType&quot;&gt;Type of the distance.&lt;/param&gt;
        private void ColorCodeDocuments(string distanceType)
        {
            var myDist = new similarity();

            // Extract data
            DataSet myDocs = Chart1.DataManipulator.ExportSeriesValues(&quot;Series1&quot;);
            DataSet myKMeansPoints = Chart1.DataManipulator.ExportSeriesValues(&quot;Series3&quot;);

            // Document counter
            int count = 0;

            // Get co-ordinates for k-means points
            double firstKMeansX = Convert.ToDouble(myKMeansPoints.Tables[0].Rows[0][&quot;X&quot;]);
            double firstKMeansY = Convert.ToDouble(myKMeansPoints.Tables[0].Rows[0][&quot;Y&quot;]);
            double secondKMeansX = Convert.ToDouble(myKMeansPoints.Tables[0].Rows[1][&quot;X&quot;]);
            double secondKMeansY = Convert.ToDouble(myKMeansPoints.Tables[0].Rows[1][&quot;Y&quot;]);

            foreach (DataRow docRow in myDocs.Tables[0].Rows)
            {
                // get co-ordinates for current doc
                double currentDocX = Convert.ToDouble(docRow[&quot;X&quot;]);
                double currentDocY = Convert.ToDouble(docRow[&quot;Y&quot;]);

                // load in to arrays
                double[] firstX = {currentDocX, currentDocY};
                double[] firstY = {firstKMeansX, firstKMeansY};
                double[] secondX = {currentDocX, currentDocY};
                double[] secondY = {secondKMeansX, secondKMeansY};

                // find the distance
                double firstDist = myDist.FindDistance(firstX, firstY, distanceType);
                double secondDist = myDist.FindDistance(secondX, secondY, distanceType);

                // Color accordingly
                Chart1.Series[&quot;Series1&quot;].Points[count].Color = firstDist &lt; secondDist ? Color.Blue : Color.Gray;
                count++;
            }
        }
    }
}
</pre>
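<p>The two-cluster comparison in ColorCodeDocuments generalizes to any number of clusters: for each document, pick the index of the nearest cluster center. The following is only a minimal sketch, assuming Manhattan distance; the class NearestCentroidSketch and its method names are hypothetical and are not part of the project above.</p>
<pre class="brush: jscript;">
using System;

// Hypothetical helper, shown for illustration only
public static class NearestCentroidSketch
{
    // Manhattan distance between two points of equal length
    private static double Manhattan(double[] a, double[] b)
    {
        double sum = 0.0;

        for (int i = 0; i &lt; a.Length; i++)
        {
            sum += Math.Abs(a[i] - b[i]);
        }

        return sum;
    }

    // Returns the index of the centroid closest to the point
    public static int NearestCentroid(double[] point, double[][] centroids)
    {
        int best = 0;
        double bestDistance = double.MaxValue;

        for (int c = 0; c &lt; centroids.Length; c++)
        {
            double distance = Manhattan(point, centroids[c]);

            if (distance &lt; bestDistance)
            {
                bestDistance = distance;
                best = c;
            }
        }

        return best;
    }
}
</pre>
<p>The returned index could then select a color from an array with one entry per cluster, for example Color.Blue, Color.Gray and Color.Green for three clusters.</p>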
<p>This is the K-Means class written by Aresh Saharkhiz, with my changes.</p>
<pre class="brush: jscript;">
/// Most of this code was written by Aresh Saharkhiz
/// Re-organized by me
/// See Code Project: http://www.codeproject.com/KB/recipes/K-Mean_Clustering.aspx
using System;
using System.Collections;
using System.Data;
using System.Diagnostics;

namespace LSITest
{
    public class kmeans
    {
        /// &lt;summary&gt;
        /// Calculates The Mean Of A Cluster OR The Cluster Center
        /// &lt;/summary&gt;
        /// &lt;param name=&quot;cluster&quot;&gt;
        /// A two-dimensional array containing a dataset of numeric values
        /// &lt;/param&gt;
        /// &lt;returns&gt;
        /// Returns an Array Defining A Data Point Representing The Cluster Mean or Centroid
        /// &lt;/returns&gt;
        public static double[] ClusterMean(double[,] cluster)
        {
            int rowCount = cluster.GetUpperBound(0) + 1;
            int fieldCount = cluster.GetUpperBound(1) + 1;
            var dataSum = new double[1,fieldCount];
            var centroid = new double[fieldCount];

            for (int j = 0; j &lt; fieldCount; j++)
            {
                for (int i = 0; i &lt; rowCount; i++)
                {
                    dataSum[0, j] = dataSum[0, j] + cluster[i, j];
                }

                centroid[j] = (dataSum[0, j]/rowCount);
            }

            return centroid;
        }

        /// &lt;summary&gt;
        /// Separates a dataset into clusters or groups with similar characteristics
        /// &lt;/summary&gt;
        /// &lt;param name=&quot;clusterCount&quot;&gt;The number of clusters or groups to form&lt;/param&gt;
        /// &lt;param name=&quot;data&quot;&gt;An array containing data that will be clustered&lt;/param&gt;
        /// &lt;param name=&quot;type&quot;&gt;The distance measure passed to similarity.FindDistance&lt;/param&gt;
        /// &lt;returns&gt;A collection of clusters of data&lt;/returns&gt;
        public static ClusterCollection ClusterDataSet(int clusterCount, double[,] data, string type)
        {
            int rowCount = data.GetUpperBound(0) + 1;
            int fieldCount = data.GetUpperBound(1) + 1;
            int stableClustersCount = 0;
            double[] dataPoint;
            var random = new Random();
            Cluster cluster;
            var clusters = new ClusterCollection();
            var clusterNumbers = new ArrayList(clusterCount);
            var myDist = new similarity();

            while (clusterNumbers.Count &lt; clusterCount)
            {
                // The upper bound of Random.Next is exclusive, so rowCount keeps the last row eligible
                int clusterNumber = random.Next(0, rowCount);

                if (!clusterNumbers.Contains(clusterNumber))
                {
                    cluster = new Cluster();
                    clusterNumbers.Add(clusterNumber);
                    dataPoint = new double[fieldCount];

                    for (int field = 0; field &lt; fieldCount; field++)
                    {
                        dataPoint.SetValue((data[clusterNumber, field]), field);
                    }

                    cluster.Add(dataPoint);
                    clusters.Add(cluster);
                }
            }

            while (stableClustersCount != clusters.Count)
            {
                stableClustersCount = 0;
                ClusterCollection newClusters = ClusterDataSet(clusters, data, type);

                for (int clusterIndex = 0; clusterIndex &lt; clusters.Count; clusterIndex++)
                {
                    if ((myDist.FindDistance(newClusters[clusterIndex].ClusterMean, clusters[clusterIndex].ClusterMean, type)) == 0)
                    {
                        stableClustersCount++;
                    }
                }

                clusters = newClusters;
            }

            return clusters;
        }

        /// &lt;summary&gt;
        /// Separates a dataset into clusters or groups with similar characteristics
        /// &lt;/summary&gt;
        /// &lt;param name=&quot;clusters&quot;&gt;A collection of data clusters&lt;/param&gt;
        /// &lt;param name=&quot;data&quot;&gt;An array containing data to be clustered&lt;/param&gt;
        /// &lt;param name=&quot;type&quot;&gt;The distance measure passed to similarity.FindDistance&lt;/param&gt;
        /// &lt;returns&gt;A collection of clusters of data&lt;/returns&gt;
        public static ClusterCollection ClusterDataSet(ClusterCollection clusters, double[,] data, string type)
        {
            double[] dataPoint;
            double firstClusterDistance = 0.0;
            int rowCount = data.GetUpperBound(0) + 1;
            int fieldCount = data.GetUpperBound(1) + 1;
            int position = 0;
            var myDist = new similarity();

            // create a new collection of clusters
            var newClusters = new ClusterCollection();

            for (int count = 0; count &lt; clusters.Count; count++)
            {
                var newCluster = new Cluster();
                newClusters.Add(newCluster);
            }

            if (clusters.Count &lt;= 0)
            {
                throw new SystemException(&quot;Cluster Count Cannot Be Zero!&quot;);
            }

            for (int row = 0; row &lt; rowCount; row++)
            {
                dataPoint = new double[fieldCount];

                for (int field = 0; field &lt; fieldCount; field++)
                {
                    dataPoint.SetValue((data[row, field]), field);
                }

                for (int cluster = 0; cluster &lt; clusters.Count; cluster++)
                {
                    double[] clusterMean = clusters[cluster].ClusterMean;

                    if (cluster == 0)
                    {
                        firstClusterDistance = myDist.FindDistance(dataPoint, clusterMean, type);
                        position = cluster;
                    }
                    else
                    {
                        double secondClusterDistance = myDist.FindDistance(dataPoint, clusterMean, type);

                        if (firstClusterDistance &gt; secondClusterDistance)
                        {
                            firstClusterDistance = secondClusterDistance;
                            position = cluster;
                        }
                    }
                }

                newClusters[position].Add(dataPoint);
            }

            return newClusters;
        }

        /// &lt;summary&gt;
        /// Converts the data table to array.
        /// &lt;/summary&gt;
        /// &lt;param name=&quot;table&quot;&gt;The table.&lt;/param&gt;
        /// &lt;returns&gt;&lt;/returns&gt;
        public static double[,] ConvertDataTableToArray(DataTable table)
        {
            int rowCount = table.Rows.Count;
            int fieldCount = table.Columns.Count;

            var dataPoints = new double[rowCount,fieldCount];

            for (int rowPosition = 0; rowPosition &lt; rowCount; rowPosition++)
            {
                DataRow row = table.Rows[rowPosition];

                for (int fieldPosition = 0; fieldPosition &lt; fieldCount; fieldPosition++)
                {
                    double fieldValue;
                    try
                    {
                        fieldValue = double.Parse(row[fieldPosition].ToString());
                    }
                    catch (Exception ex)
                    {
                        Debug.WriteLine(ex.ToString());
                        throw new InvalidCastException(&quot;Invalid row at &quot; + rowPosition + &quot; and field &quot; + fieldPosition,
                                                       ex);
                    }

                    dataPoints[rowPosition, fieldPosition] = fieldValue;
                }
            }

            return dataPoints;
        }
    }

    /// &lt;summary&gt;
    /// A class containing a group of data with similar characteristics (cluster)
    /// &lt;/summary&gt;
    [Serializable]
    public class Cluster : CollectionBase
    {
        private double[] _clusterMean;
        private double[] _clusterSum;

        /// &lt;summary&gt;
        /// The sum of all the data in the cluster
        /// &lt;/summary&gt;
        public double[] ClusterSum
        {
            get { return _clusterSum; }
        }

        /// &lt;summary&gt;
        /// The mean of all the data in the cluster
        /// &lt;/summary&gt;
        public double[] ClusterMean
        {
            get
            {
                for (int count = 0; count &lt; this[0].Length; count++)
                {
                    _clusterMean[count] = (_clusterSum[count]/List.Count);
                }

                return _clusterMean;
            }
        }

        /// &lt;summary&gt;
        /// Returns the one dimensional array data located at the index
        /// &lt;/summary&gt;
        public virtual double[] this[int index]
        {
            get
            {
                // return the data point at List[index]
                return (double[]) List[index];
            }
        }

        /// &lt;summary&gt;
        /// Adds a single dimension array data to the cluster
        /// &lt;/summary&gt;
        /// &lt;param name=&quot;data&quot;&gt;A 1-dimensional array containing data that will be added to the cluster&lt;/param&gt;
        public virtual void Add(double[] data)
        {
            List.Add(data);

            if (List.Count == 1)
            {
                _clusterSum = new double[data.Length];

                _clusterMean = new double[data.Length];
            }

            for (int count = 0; count &lt; data.Length; count++)
            {
                _clusterSum[count] = _clusterSum[count] + data[count];
            }
        }
    }

    /// &lt;summary&gt;
    /// A collection of Cluster objects or Clusters
    /// &lt;/summary&gt;
    [Serializable]
    public class ClusterCollection : CollectionBase
    {
        /// &lt;summary&gt;
        /// Returns the Cluster at this index
        /// &lt;/summary&gt;
        public virtual Cluster this[int index]
        {
            get
            {
                // return the Cluster at List[index]
                return (Cluster) List[index];
            }
        }

        /// &lt;summary&gt;
        /// Adds a Cluster to the collection of Clusters
        /// &lt;/summary&gt;
        /// &lt;param name=&quot;cluster&quot;&gt;A Cluster to be added to the collection of clusters&lt;/param&gt;
        public virtual void Add(Cluster cluster)
        {
            List.Add(cluster);
        }
    }
}
</pre>
<p>Here is the similarity class, which can calculate Euclidean, Manhattan, Chebyshev, and Minkowski distances.</p>
<pre class="brush: jscript;">
/// Most of this code was written by Aresh Saharkhiz
/// Re-organized by me
/// See Code Project: http://www.codeproject.com/KB/recipes/Quantitative_Distances.aspx
using System;

namespace LSITest
{
    public class similarity
    {
        /// &lt;summary&gt;
        /// Finds the distance.
        /// &lt;/summary&gt;
        /// &lt;param name=&quot;x&quot;&gt;The x.&lt;/param&gt;
        /// &lt;param name=&quot;y&quot;&gt;The y.&lt;/param&gt;
        /// &lt;param name=&quot;distanceType&quot;&gt;The distance measure: euclidean, manhattan, minkowski or chebyshev (case-insensitive).&lt;/param&gt;
        /// &lt;returns&gt;&lt;/returns&gt;
        public double FindDistance(double[] x, double[] y, string distanceType)
        {
            double distance;

            switch (distanceType.ToLower())
            {
                case &quot;euclidean&quot;:
                    distance = EuclideanDistance(x, y);
                    break;
                case &quot;manhattan&quot;:
                    distance = ManhattanDistance(x, y);
                    break;
                case &quot;minkowski&quot;:
                    distance = MinkowskiDistance(x, y, 1);
                    break;
                case &quot;chebyshev&quot;:
                    distance = ChebyshevDistance(x, y);
                    break;
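                default:
                    // Fail loudly: returning 0.0 here would make every point look equidistant
                    throw new ArgumentException(&quot;Unknown distance type: &quot; + distanceType);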
            }

            return distance;
        }

        /// &lt;summary&gt;
        /// Finds the Euclidean distance.
        /// &lt;/summary&gt;
        /// &lt;param name=&quot;x&quot;&gt;The x.&lt;/param&gt;
        /// &lt;param name=&quot;y&quot;&gt;The y.&lt;/param&gt;
        /// &lt;returns&gt;&lt;/returns&gt;
        public double EuclideanDistance(double[] x, double[] y)
        {
            double sum = 0.0;

            if (x.GetUpperBound(0) != y.GetUpperBound(0))
            {
                throw new ArgumentException(&quot;the number of elements in x must match the number of elements in y&quot;);
            }

            int count = x.Length;

            for (int i = 0; i &lt; count; i++)
            {
                sum += Math.Pow(Math.Abs(x[i] - y[i]), 2);
            }

            double distance = Math.Sqrt(sum);
            return distance;
        }

        /// &lt;summary&gt;
        /// Finds Manhattan distance.
        /// &lt;/summary&gt;
        /// &lt;param name=&quot;x&quot;&gt;The x.&lt;/param&gt;
        /// &lt;param name=&quot;y&quot;&gt;The y.&lt;/param&gt;
        /// &lt;returns&gt;&lt;/returns&gt;
        public double ManhattanDistance(double[] x, double[] y)
        {
            double sum = 0.0;

            if (x.GetUpperBound(0) != y.GetUpperBound(0))
            {
                throw new ArgumentException(&quot;the number of elements in x must match the number of elements in y&quot;);
            }

            int count = x.Length;

            for (int i = 0; i &lt; count; i++)
            {
                sum += Math.Abs(x[i] - y[i]);
            }

            double distance = sum;
            return distance;
        }

        /// &lt;summary&gt;
        /// Finds the Chebyshev distance.
        /// &lt;/summary&gt;
        /// &lt;param name=&quot;x&quot;&gt;The x.&lt;/param&gt;
        /// &lt;param name=&quot;y&quot;&gt;The y.&lt;/param&gt;
        /// &lt;returns&gt;&lt;/returns&gt;
        public static double ChebyshevDistance(double[] x, double[] y)
        {
            if (x.GetUpperBound(0) != y.GetUpperBound(0))
            {
                throw new ArgumentException(&quot;the number of elements in x must match the number of elements in y&quot;);
            }
            int count = x.Length;
            var newData = new double[count];

            for (int i = 0; i &lt; count; i++)
            {
                newData[i] = Math.Abs(x[i] - y[i]);
            }
            double max = double.MinValue;

            foreach (double num in newData)
            {
                if (num &gt; max)
                {
                    max = num;
                }
            }
            return max;
        }

        /// &lt;summary&gt;
        /// Finds the Minkowski distance.
        /// &lt;/summary&gt;
        /// &lt;param name=&quot;x&quot;&gt;The x.&lt;/param&gt;
        /// &lt;param name=&quot;y&quot;&gt;The y.&lt;/param&gt;
        /// &lt;param name=&quot;order&quot;&gt;The order.&lt;/param&gt;
        /// &lt;returns&gt;&lt;/returns&gt;
        public double MinkowskiDistance(double[] x, double[] y, double order)
        {
            double sum = 0.0;

            if (x.GetUpperBound(0) != y.GetUpperBound(0))
            {
                throw new ArgumentException(&quot;the number of elements in x must match the number of elements in y&quot;);
            }
            int count = x.Length;

            for (int i = 0; i &lt; count; i++)
            {
                sum = sum + Math.Pow(Math.Abs(x[i] - y[i]), order);
            }

            double distance = Math.Pow(sum, (1 / order));
            return distance;
        }
    }
}
</pre>
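<p>To get a feel for how these measures differ, it helps to run them on one pair of points. The sketch below is a standalone illustration, not the class above: DistanceExample and its methods are hypothetical names. It also shows why calling Minkowski with order 1, as FindDistance does above, gives the same value as Manhattan.</p>
<pre class="brush: jscript;">
using System;

// Hypothetical standalone example, shown for illustration only
public static class DistanceExample
{
    // Minkowski distance: order 1 is Manhattan, order 2 is Euclidean
    public static double Minkowski(double[] x, double[] y, double order)
    {
        double sum = 0.0;

        for (int i = 0; i &lt; x.Length; i++)
        {
            sum += Math.Pow(Math.Abs(x[i] - y[i]), order);
        }

        return Math.Pow(sum, 1.0 / order);
    }

    // Chebyshev distance: the largest absolute difference along any axis
    public static double Chebyshev(double[] x, double[] y)
    {
        double max = 0.0;

        for (int i = 0; i &lt; x.Length; i++)
        {
            max = Math.Max(max, Math.Abs(x[i] - y[i]));
        }

        return max;
    }

    public static void Main()
    {
        double[] a = { 0.0, 0.0 };
        double[] b = { 3.0, 4.0 };

        Console.WriteLine(Minkowski(a, b, 1)); // Manhattan: 3 + 4 = 7
        Console.WriteLine(Minkowski(a, b, 2)); // Euclidean: sqrt(9 + 16) = 5
        Console.WriteLine(Chebyshev(a, b));    // Chebyshev: max(3, 4) = 4
    }
}
</pre>
<p>For the same pair of points the Chebyshev distance is never larger than the Euclidean distance, which in turn is never larger than the Manhattan distance, so the choice of measure can change which cluster center counts as nearest.</p>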
<p>And finally the same LSI class used in the previous examples.</p>
<pre class="brush: jscript;">
using System;
using SmartMathLibrary;

namespace LSITest
{
    public class lsi
    {
        // Results exposed to the page; ToPrint holds the formatted HTML output
        public int MyDocColumnCount;
        public int MyDocRowCount;
        public double[,] MyDocs;
        public double[,] MyWords;
        public int MyWordsColumnCount;
        public int MyWordsRowCount;
        public string ToPrint;

        /// &lt;summary&gt;
        /// Runs the LSI test.
        /// &lt;/summary&gt;
        public void LSITest()
        {
            //Create Matrix
            var testArray = new double[,]
                                {
                                    {1, 0, 0, 1, 0, 0, 0, 0, 0},
                                    {1, 0, 1, 0, 0, 0, 0, 0, 0},
                                    {1, 1, 0, 0, 0, 0, 0, 0, 0},
                                    {0, 1, 1, 0, 1, 0, 0, 0, 0},
                                    {0, 1, 1, 2, 0, 0, 0, 0, 0},
                                    {0, 1, 0, 0, 1, 0, 0, 0, 0},
                                    {0, 1, 0, 0, 1, 0, 0, 0, 0},
                                    {0, 0, 1, 1, 0, 0, 0, 0, 0},
                                    {0, 1, 0, 0, 0, 0, 0, 0, 1},
                                    {0, 0, 0, 0, 0, 1, 1, 1, 0},
                                    {0, 0, 0, 0, 0, 0, 1, 1, 1},
                                    {0, 0, 0, 0, 0, 0, 0, 1, 1}
                                };

            // Load array in to Matrix
            var a = new Matrix(testArray);

            // print original matrix
            PrintMatrix(a);

            // perform Latent Semantic Indexing
            GetDocumentWordPlots(a);
        }

        /// &lt;summary&gt;
        /// Prints the matrix.
        /// &lt;/summary&gt;
        /// &lt;param name=&quot;myMatrix&quot;&gt;My matrix.&lt;/param&gt;
        private void PrintMatrix(IMatrix myMatrix)
        {
            ToPrint += &quot;&lt;br /&gt;&lt;br /&gt;&quot;;

            for (int i = 0; i &lt; myMatrix.Rows; i++)
            {
                for (int j = 0; j &lt; myMatrix.Columns; j++)
                {
                    ToPrint += String.Format(&quot;{0:0.##}&quot;, myMatrix.MatrixData[i, j]) + &quot;\t&quot;;
                }
                ToPrint += &quot;&lt;br /&gt;&quot;;
            }
        }

        /// &lt;summary&gt;
        /// Gets the document word plots.
        /// &lt;/summary&gt;
        /// &lt;param name=&quot;myMatrix&quot;&gt;My matrix.&lt;/param&gt;
        private void GetDocumentWordPlots(Matrix myMatrix)
        {
            // Run singular value decomposition
            var svd = new SingularValueDecomposition(myMatrix);
            svd.ExecuteDecomposition();

            // Put components into individual matrices
            Matrix wordVector = svd.U.Copy();
            Matrix sigma = svd.S.ToMatrix();
            Matrix documentVector = svd.V.Copy();

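            // get value of k; with the 9 columns above this gives k = 3,
            // and k - 1 = 2 dimensions are kept below for plotting
            // you can also manually set the value of k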
            var k = (int) Math.Floor(Math.Sqrt(myMatrix.Columns));

            // reduce the vectors
            Matrix reducedWordVector = CopyMatrix(wordVector, wordVector.Rows, k - 1);
            Matrix reducedSigma = CreateSigmaMatrix(sigma, k - 1, k - 1);
            Matrix reducedDocumentVector = CopyMatrix(documentVector, documentVector.Rows, k - 1);

            // Recalculate the matrix
            Matrix docs = reducedDocumentVector*reducedSigma;
            Matrix words = reducedWordVector*reducedSigma;

            // Fill doc plot locations
            MyDocs = new double[docs.Rows,docs.Columns];
            for (int i = 0; i &lt; docs.Rows; i++)
            {
                for (int j = 0; j &lt; docs.Columns; j++)
                {
                    MyDocs[i, j] = docs.MatrixData[i, j];
                }
            }

            // Fill word plot locations
            MyWords = new double[words.Rows,words.Columns];
            for (int i = 0; i &lt; words.Rows; i++)
            {
                for (int j = 0; j &lt; words.Columns; j++)
                {
                    MyWords[i, j] = words.MatrixData[i, j];
                }
            }

            // Set counts for charts
            MyDocRowCount = docs.Rows;
            MyWordsRowCount = words.Rows;

            PrintMatrix(docs);
            PrintMatrix(words);
        }

        /// &lt;summary&gt;
        /// Creates the sigma matrix.
        /// &lt;/summary&gt;
        /// &lt;param name=&quot;matrix&quot;&gt;The matrix.&lt;/param&gt;
        /// &lt;param name=&quot;rowEnd&quot;&gt;The row end.&lt;/param&gt;
        /// &lt;param name=&quot;columnEnd&quot;&gt;The column end.&lt;/param&gt;
        /// &lt;returns&gt;&lt;/returns&gt;
        private static Matrix CreateSigmaMatrix(IMatrix matrix, int rowEnd, int columnEnd)
        {
            var copyMatrix = new Matrix(rowEnd, columnEnd);

            for (int i = 0; i &lt; columnEnd; i++)
            {
                copyMatrix.MatrixData[i, i] = matrix.MatrixData[i, 0];
            }

            return copyMatrix;
        }

        /// &lt;summary&gt;
        /// Copies the matrix.
        /// &lt;/summary&gt;
        /// &lt;param name=&quot;myMatrix&quot;&gt;My matrix.&lt;/param&gt;
        /// &lt;param name=&quot;rowEnd&quot;&gt;The row end.&lt;/param&gt;
        /// &lt;param name=&quot;columnEnd&quot;&gt;The column end.&lt;/param&gt;
        /// &lt;returns&gt;&lt;/returns&gt;
        private static Matrix CopyMatrix(IMatrix myMatrix, int rowEnd, int columnEnd)
        {
            var copyMatrix = new Matrix(rowEnd, columnEnd);

            for (int i = 0; i &lt; rowEnd; i++)
            {
                for (int j = 0; j &lt; columnEnd; j++)
                {
                    copyMatrix.MatrixData[i, j] = myMatrix.MatrixData[i, j];
                }
            }

            return copyMatrix;
        }
    }
}
</pre>
<p>And what do the results look like?</p>
<p><a href="http://eric.ness.net/wp-content/uploads/2009/11/kmeansresults.jpg"><img class="alignnone size-full wp-image-361" style="margin-left: 100px; margin-right: 100px;" title="kmeansresults" src="http://eric.ness.net/wp-content/uploads/2009/11/kmeansresults.jpg" alt="kmeansresults" width="400" height="400" /></a></p>
<p>As you can see, the K-Means clustering algorithm correctly grouped the documents into the appropriate categories.</p>
<p>Recommended reading, and thanks go to <a href="http://www.codeproject.com/KB/recipes/K-Mean_Clustering.aspx">Aresh Saharkhiz</a> for sharing his implementation of K-Means Clustering.</p>
]]></content:encoded>
			<wfw:commentRss>http://eric.ness.net/archives/k-means-document-clustering/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Plotting Documents &amp; Words: Using Latent Semantic Indexing</title>
		<link>http://eric.ness.net/archives/plotting-documents-words-using-latent-semantic-indexing/</link>
		<comments>http://eric.ness.net/archives/plotting-documents-words-using-latent-semantic-indexing/#comments</comments>
		<pubDate>Fri, 06 Nov 2009 00:32:22 +0000</pubDate>
		<dc:creator>Eric</dc:creator>
				<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Visualization]]></category>

		<guid isPermaLink="false">http://eric.ness.net/?p=338</guid>
		<description><![CDATA[Plotting Documents &#038; Words from LSI results]]></description>
			<content:encoded><![CDATA[


<p>In the <a href="http://eric.ness.net/archives/latent-semantic-indexing/">last blog post</a> we looked over a couple of great papers on using <a href="http://eric.ness.net/archives/singular-value-decomposition/">Singular Value Decomposition</a> (SVD) to do <a href="http://eric.ness.net/archives/latent-semantic-indexing/">Latent Semantic Indexing</a> (LSI), implemented with the <a href="http://smartmathlibrary.codeplex.com/">SmartMathLibrary</a>. Now that we have the results, we should plot them to get a sense of where these words and documents lie on a two-dimensional Cartesian plane.</p>
<p style="text-align: left;">Jennifer Flynn&#8217;s presentation &#8220;<a href="http://www.soe.ucsc.edu/classes/cmps290c/Spring07/proj/Flynn_talk.pdf">Latent Semantic Indexing Using SVD and Riemannian SVD</a>&#8221; goes on to tell us how to do this. Essentially the process is the same as before, except that k must equal 2. We happened to end up with k = 2 in our previous example, but in larger examples k will more than likely be a different number. Regardless, here we want to end up with a matrix with two columns giving us our (x,y) coordinates. If you wanted to plot these items in a three-dimensional space you would use k = 3, and if you find an awesome way to plot where k = 5, e-mail me. After we have performed SVD, the formulas we use are as follows.</p>
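<blockquote>
<p style="text-align: left;"><strong>Documents = V*&#931;</strong></p>
<p style="text-align: left;"><strong>Words = U*&#931;</strong></p>
</blockquote>
<p style="text-align: left;">In our matrix the rows are words and the columns are documents, so U holds the word vectors and V holds the document vectors. This matches the code below, where wordVector is taken from U and documentVector from V.</p>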
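<p style="text-align: left;">To make the multiplication concrete, here is a minimal sketch in plain C# (no SmartMathLibrary) of what multiplying a matrix of vectors by the diagonal matrix of singular values does: column j is simply scaled by the j-th singular value. The PlotCoordinates class, the Scale helper and the numbers are made up for illustration; they are not output from the LSI example.</p>
<pre class="brush: jscript;">
using System;

public static class PlotCoordinates
{
    // Multiplies column j of vectors by sigma[j],
    // which is the same as vectors * diag(sigma).
    public static double[,] Scale(double[,] vectors, double[] sigma)
    {
        int rows = vectors.GetLength(0);
        int cols = vectors.GetLength(1);
        var result = new double[rows, cols];

        for (int i = 0; i &lt; rows; i++)
        {
            for (int j = 0; j &lt; cols; j++)
            {
                result[i, j] = vectors[i, j] * sigma[j];
            }
        }

        return result;
    }

    public static void Main()
    {
        // hypothetical reduced vectors (two rows, k = 2) and singular values
        var vectors = new double[,] { {0.5, -0.5}, {0.25, 0.5} };
        var sigma = new double[] { 4, 2 };

        double[,] points = Scale(vectors, sigma);

        // each row is one (x, y) point
        Console.WriteLine(&quot;({0}, {1})&quot;, points[0, 0], points[0, 1]); // (2, -1)
        Console.WriteLine(&quot;({0}, {1})&quot;, points[1, 0], points[1, 1]); // (1, 1)
    }
}
</pre>
<p style="text-align: left;">Each row of the result can then be handed to a chart as one point.</p>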
<p style="text-align: left;">The resulting matrices give us our (x,y) coordinates that we can then plot. I have been using the Dundas charting library for over two years now, but the library is expensive, so you should go and get the free library <a href="http://www.microsoft.com/downloads/details.aspx?FamilyID=130f7986-bf49-4fe5-9ca8-910ae6ea442c&amp;DisplayLang=en">here</a>: Microsoft acquired it and now offers it as a free download. And again, for simplicity&#8217;s sake, this project is just a simple ASP.NET application.</p>
<p style="text-align: left;">The LSI Class:</p>
<p style="text-align: left;">Please note that this is almost exactly the same as in the previous blog post. However, here at the end of the GetDocumentWordPlots function we use the formulas mentioned above to load the coordinates of the words and documents into a double array that we will ultimately pass to the chart.</p>
<pre class="brush: jscript;">
using System;
using SmartMathLibrary;

namespace LSITest
{
    public class lsi
    {
        // plot coordinates and their dimensions
        public int MyDocColumnCount;
        public int MyDocRowCount;
        public double[,] MyDocs;
        public double[,] MyWords;
        public int MyWordsColumnCount;
        public int MyWordsRowCount;

        // the formatted html output of the matrices
        public string ToPrint;

        /// &lt;summary&gt;
        /// Runs the LSI test on the example matrix.
        /// &lt;/summary&gt;
        public void LSITest()
        {
            //Create Matrix
            var testArray = new double[,]
                                {
                                    {1, 0, 0, 1, 0, 0, 0, 0, 0},
                                    {1, 0, 1, 0, 0, 0, 0, 0, 0},
                                    {1, 1, 0, 0, 0, 0, 0, 0, 0},
                                    {0, 1, 1, 0, 1, 0, 0, 0, 0},
                                    {0, 1, 1, 2, 0, 0, 0, 0, 0},
                                    {0, 1, 0, 0, 1, 0, 0, 0, 0},
                                    {0, 1, 0, 0, 1, 0, 0, 0, 0},
                                    {0, 0, 1, 1, 0, 0, 0, 0, 0},
                                    {0, 1, 0, 0, 0, 0, 0, 0, 1},
                                    {0, 0, 0, 0, 0, 1, 1, 1, 0},
                                    {0, 0, 0, 0, 0, 0, 1, 1, 1},
                                    {0, 0, 0, 0, 0, 0, 0, 1, 1}
                                };

            // Load array in to Matrix
            var a = new Matrix(testArray);

            // print original matrix
            PrintMatrix(a);

            // perform Latent Semantic Indexing
            GetDocumentWordPlots(a);
        }

        /// &lt;summary&gt;
        /// Prints the matrix.
        /// &lt;/summary&gt;
        /// &lt;param name=&quot;myMatrix&quot;&gt;My matrix.&lt;/param&gt;
        private void PrintMatrix(IMatrix myMatrix)
        {
            ToPrint += &quot;&lt;br /&gt;&lt;br /&gt;&quot;;

            for (int i = 0; i &lt; myMatrix.Rows; i++)
            {
                for (int j = 0; j &lt; myMatrix.Columns; j++)
                {
                    ToPrint += String.Format(&quot;{0:0.##}&quot;, myMatrix.MatrixData[i, j]) + &quot;\t&quot;;
                }
                ToPrint += &quot;&lt;br /&gt;&quot;;
            }
        }

        /// &lt;summary&gt;
        /// Gets the document word plots.
        /// &lt;/summary&gt;
        /// &lt;param name=&quot;myMatrix&quot;&gt;My matrix.&lt;/param&gt;
        private void GetDocumentWordPlots(Matrix myMatrix)
        {
            // Run singular value decomposition
            var svd = new SingularValueDecomposition(myMatrix);
            svd.ExecuteDecomposition();

            // Put components into individual matrices
            Matrix wordVector = svd.U.Copy();
            Matrix sigma = svd.S.ToMatrix();
            Matrix documentVector = svd.V.Copy();

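            // number of dimensions to keep: 2, so each row becomes an (x, y) point
            var k = 2;

            // reduce the vectors to their first k columns
            // (k rather than k - 1 here, because the chart reads column 0 and column 1)
            Matrix reducedWordVector = CopyMatrix(wordVector, wordVector.Rows, k);
            Matrix reducedSigma = CreateSigmaMatrix(sigma, k, k);
            Matrix reducedDocumentVector = CopyMatrix(documentVector, documentVector.Rows, k);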

            // Recalculate the matrix
            Matrix docs = reducedDocumentVector*reducedSigma;
            Matrix words = reducedWordVector*reducedSigma;

            // Fill doc plot locations
            MyDocs = new double[docs.Rows,docs.Columns];
            for (int i = 0; i &lt; docs.Rows; i++)
            {
                for (int j = 0; j &lt; docs.Columns; j++)
                {
                    MyDocs[i, j] = docs.MatrixData[i, j];
                }
            }

            // Fill word plot locations
            MyWords = new double[words.Rows,words.Columns];
            for (int i = 0; i &lt; words.Rows; i++)
            {
                for (int j = 0; j &lt; words.Columns; j++)
                {
                    MyWords[i, j] = words.MatrixData[i, j];
                }
            }

            // Set counts for charts
            MyDocRowCount = docs.Rows;
            MyDocColumnCount = docs.Columns;
            MyWordsRowCount = words.Rows;
            MyWordsColumnCount = words.Columns;

            PrintMatrix(docs);
            PrintMatrix(words);
        }

        /// &lt;summary&gt;
        /// Creates the sigma matrix.
        /// &lt;/summary&gt;
        /// &lt;param name=&quot;matrix&quot;&gt;The matrix.&lt;/param&gt;
        /// &lt;param name=&quot;rowEnd&quot;&gt;The row end.&lt;/param&gt;
        /// &lt;param name=&quot;columnEnd&quot;&gt;The column end.&lt;/param&gt;
        /// &lt;returns&gt;&lt;/returns&gt;
        private static Matrix CreateSigmaMatrix(IMatrix matrix, int rowEnd, int columnEnd)
        {
            var copyMatrix = new Matrix(rowEnd, columnEnd);

            for (int i = 0; i &lt; columnEnd; i++)
            {
                copyMatrix.MatrixData[i, i] = matrix.MatrixData[i, 0];
            }

            return copyMatrix;
        }

        /// &lt;summary&gt;
        /// Copies the matrix.
        /// &lt;/summary&gt;
        /// &lt;param name=&quot;myMatrix&quot;&gt;My matrix.&lt;/param&gt;
        /// &lt;param name=&quot;rowEnd&quot;&gt;The row end.&lt;/param&gt;
        /// &lt;param name=&quot;columnEnd&quot;&gt;The column end.&lt;/param&gt;
        /// &lt;returns&gt;&lt;/returns&gt;
        private static Matrix CopyMatrix(IMatrix myMatrix, int rowEnd, int columnEnd)
        {
            var copyMatrix = new Matrix(rowEnd, columnEnd);

            for (int i = 0; i &lt; rowEnd; i++)
            {
                for (int j = 0; j &lt; columnEnd; j++)
                {
                    copyMatrix.MatrixData[i, j] = myMatrix.MatrixData[i, j];
                }
            }

            return copyMatrix;
        }
    }
}
</pre>
<p>Here is the code-behind for the web page that displays the chart and the matrices. Essentially we iterate through the double arrays pulled from the LSI class and load them into chart series.</p>
<pre class="brush: jscript;">
using System;
using System.Drawing;
using System.Web.UI;
using Dundas.Charting.WebControl;

namespace LSITest
{
    public partial class _Default : Page
    {
        /// &lt;summary&gt;
        /// Handles the Load event of the Page control.
        /// &lt;/summary&gt;
        /// &lt;param name=&quot;sender&quot;&gt;The source of the event.&lt;/param&gt;
        /// &lt;param name=&quot;e&quot;&gt;The &lt;see cref=&quot;System.EventArgs&quot;/&gt; instance containing the event data.&lt;/param&gt;
        protected void Page_Load(object sender, EventArgs e)
        {
            var mylsi = new lsi();
            mylsi.LSITest();
            Label1.Text = mylsi.ToPrint;
            double[,] myDocs = mylsi.MyDocs;
            double[,] myWords = mylsi.MyWords;

            // Load documents
            for (int i = 0; i &lt; mylsi.MyDocRowCount; i++)
            {
                Chart1.Series[&quot;Series1&quot;].Points.AddXY(myDocs[i, 0], myDocs[i, 1]);
            }

            // Load words
            for (int i = 0; i &lt; mylsi.MyWordsRowCount; i++)
            {
                Chart1.Series[&quot;Series2&quot;].Points.AddXY(myWords[i, 0], myWords[i, 1]);
            }

            // Set title
            Chart1.Series[&quot;Series1&quot;].LegendText = &quot;Documents&quot;;
            Chart1.Series[&quot;Series2&quot;].LegendText = &quot;Words&quot;;

            // Set point colors and shapes
            Chart1.Series[&quot;Series1&quot;].Color = Color.Red;
            Chart1.Series[&quot;Series1&quot;].MarkerStyle = MarkerStyle.Diamond;
            Chart1.Series[&quot;Series1&quot;].MarkerSize = 12;
            Chart1.Series[&quot;Series2&quot;].Color = Color.Gray;
            Chart1.Series[&quot;Series2&quot;].MarkerStyle = MarkerStyle.Circle;
            Chart1.Series[&quot;Series2&quot;].MarkerSize = 6;
        }
    }
}
</pre>
<p>And finally here is the ASP.NET web page.</p>
<pre class="brush: jscript;">
&lt;%@ Page Language=&quot;C#&quot; AutoEventWireup=&quot;true&quot; CodeBehind=&quot;Default.aspx.cs&quot; Inherits=&quot;LSITest._Default&quot; %&gt;
&lt;%@ Register Assembly=&quot;DundasWebChart&quot; Namespace=&quot;Dundas.Charting.WebControl&quot; TagPrefix=&quot;DCWC&quot; %&gt;
&lt;!DOCTYPE html PUBLIC &quot;-//W3C//DTD XHTML 1.0 Transitional//EN&quot; &quot;http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd&quot;&gt;

&lt;html xmlns=&quot;http://www.w3.org/1999/xhtml&quot; &gt;
&lt;head runat=&quot;server&quot;&gt;
    &lt;title&gt;LSI Test&lt;/title&gt;
&lt;/head&gt;
&lt;body&gt;
    &lt;form id=&quot;form1&quot; runat=&quot;server&quot;&gt;
    &lt;div&gt;
        &lt;DCWC:Chart ID=&quot;Chart1&quot; runat=&quot;server&quot; Height=&quot;400px&quot; Width=&quot;400px&quot;
            ImageType=&quot;Jpeg&quot;&gt;
            &lt;Legends&gt;
                &lt;DCWC:Legend Name=&quot;Default&quot; Alignment=&quot;Center&quot; Docking=&quot;Bottom&quot;&gt;&lt;/DCWC:Legend&gt;
            &lt;/Legends&gt;
            &lt;Titles&gt;
                &lt;DCWC:Title Name=&quot;Title1&quot;&gt;
                &lt;/DCWC:Title&gt;
            &lt;/Titles&gt;
            &lt;Series&gt;
                &lt;DCWC:Series Name=&quot;Series1&quot; ChartType=&quot;Point&quot; MarkerBorderColor=&quot;64, 64, 64&quot;
                    ShadowOffset=&quot;1&quot;&gt;
                &lt;/DCWC:Series&gt;
                &lt;DCWC:Series Name=&quot;Series2&quot; ChartType=&quot;Point&quot; MarkerBorderColor=&quot;64, 64, 64&quot;
                    ShadowOffset=&quot;1&quot;&gt;
                &lt;/DCWC:Series&gt;
            &lt;/Series&gt;
            &lt;ChartAreas&gt;
                &lt;DCWC:ChartArea Name=&quot;Series2&quot;&gt;
                    &lt;axisy interval=&quot;0.5&quot; maximum=&quot;2&quot; minimum=&quot;-1&quot;&gt;
                        &lt;majorgrid linecolor=&quot;Gray&quot; linestyle=&quot;Dash&quot; /&gt;
                    &lt;/axisy&gt;
                    &lt;axisx interval=&quot;0.5&quot; maximum=&quot;2.5&quot; minimum=&quot;-0.5&quot;&gt;
                        &lt;majorgrid linecolor=&quot;Gray&quot; linestyle=&quot;Dash&quot; /&gt;
                    &lt;/axisx&gt;
                &lt;/DCWC:ChartArea&gt;
            &lt;/ChartAreas&gt;
        &lt;/DCWC:Chart&gt;
        &lt;br /&gt;
        &lt;asp:Label ID=&quot;Label1&quot; runat=&quot;server&quot; Text=&quot;&quot;&gt;&lt;/asp:Label&gt;
    &lt;/div&gt;
    &lt;/form&gt;
&lt;/body&gt;
&lt;/html&gt;
</pre>
<p>So let&#8217;s see the result!</p>
<p><a href="http://eric.ness.net/wp-content/uploads/2009/11/documents.jpg"><img class="alignnone size-full wp-image-341" style="margin-left: 100px; margin-right: 100px;" title="documents" src="http://eric.ness.net/wp-content/uploads/2009/11/documents.jpg" alt="documents" width="400" height="400" /></a></p>
<p>Now, obviously you could (and probably should) write this a different way, but it gets you to where you need to be.</p>
<p>I would also recommend reading the part of Flynn&#8217;s presentation on how to compare two words/documents by using the dot product of two row vectors. Alternatively, one could use the <a href="http://eric.ness.net/archives/euclidean-distance-score/">Euclidean Distance Score</a>. And if you are interested in more, I would recommend Sujit Pal&#8217;s blog post &#8220;<a href="http://sujitpal.blogspot.com/2008/10/ir-math-in-java-cluster-visualization.html">IR Math in Java : Cluster Visualization</a>&#8221; for additional reading.</p>
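<p>As a rough sketch of what such a comparison could look like, the following plain C# class compares two row vectors with a dot product and with a straight-line (Euclidean) distance. The RowCompare class and the sample rows are hypothetical, and the Cosine helper is my own addition (a dot product divided by both vector lengths), not something taken from Flynn&#8217;s presentation.</p>
<pre class="brush: jscript;">
using System;

public static class RowCompare
{
    // Dot product of two equal-length row vectors.
    public static double Dot(double[] a, double[] b)
    {
        double sum = 0;
        for (int i = 0; i &lt; a.Length; i++)
        {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // Dot product divided by both lengths, so long rows do not dominate.
    public static double Cosine(double[] a, double[] b)
    {
        return Dot(a, b) / (Math.Sqrt(Dot(a, a)) * Math.Sqrt(Dot(b, b)));
    }

    // Straight-line distance between the two points.
    public static double EuclideanDistance(double[] a, double[] b)
    {
        double sum = 0;
        for (int i = 0; i &lt; a.Length; i++)
        {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.Sqrt(sum);
    }

    public static void Main()
    {
        // hypothetical (x, y) rows, not values read from the chart
        var first = new double[] { 3, 4 };
        var second = new double[] { 4, 3 };

        Console.WriteLine(Dot(first, second)); // 24
        Console.WriteLine(Cosine(first, second));
        Console.WriteLine(EuclideanDistance(first, second));
    }
}
</pre>
<p>A larger dot product or cosine suggests the two rows point in a similar direction, while a smaller distance suggests the two points sit close together on the plot.</p>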
]]></content:encoded>
			<wfw:commentRss>http://eric.ness.net/archives/plotting-documents-words-using-latent-semantic-indexing/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Latent Semantic Indexing</title>
		<link>http://eric.ness.net/archives/latent-semantic-indexing/</link>
		<comments>http://eric.ness.net/archives/latent-semantic-indexing/#comments</comments>
		<pubDate>Sun, 01 Nov 2009 15:48:26 +0000</pubDate>
		<dc:creator>Eric</dc:creator>
				<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Programming]]></category>

		<guid isPermaLink="false">http://eric.ness.net/?p=309</guid>
		<description><![CDATA[Latent Semantic Indexing in C#]]></description>
			<content:encoded><![CDATA[


<p>Latent Semantic Indexing (LSI) is commonly described as an &#8220;indexing and retrieval method that uses a mathematical technique called <a href="http://eric.ness.net/archives/singular-value-decomposition/">Singular Value Decomposition</a> (SVD) to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text.&#8221; To be a bit clearer, Sujit Pal has one of the best descriptions of what LSI is and how it works:</p>
<blockquote><p>Latent Semantic Indexing attempts to uncover latent relationships among documents based on word co-occurence. So if document A contains (w1,w2) and document B contains (w2,w3), we can conclude that there is something common between documents A and B. LSI does this by decomposing the input raw term frequency matrix (A, see below) into three different matrices (U, S and V) using Singular Value Decomposition (SVD). Once that is done, the three vectors are &#8220;reduced&#8221; and the original vector rebuilt from the reduced vectors. Because of the reduction, noisy relationships are suppressed and relations become very clearly visible.</p></blockquote>
<p><strong>So how is this done?</strong></p>
<p>To start with, let&#8217;s use the example in &#8220;<a href="http://lsa.colorado.edu/papers/JASIS.lsi.90.pdf">Indexing by Latent Semantic Analysis</a>&#8221; (Deerwester et al.), because you see this example repeated in a number of places on the web. The example in the paper looks at 9 titles of papers that fall into two categories: &#8220;human computer interaction&#8221; and &#8220;graphs &amp; trees&#8221;.</p>
<p><a href="http://eric.ness.net/wp-content/uploads/2009/11/listofwords.jpg"><img class="alignnone size-full wp-image-310" style="margin-left: 100px; margin-right: 100px;" title="listofwords" src="http://eric.ness.net/wp-content/uploads/2009/11/listofwords.jpg" alt="listofwords" width="400" height="505" /></a></p>
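<p>The matrix in the picture above stores, for each word, how many times it appears in each title. A minimal sketch of how such a matrix could be built in plain C# is shown here. The TermCounts class, the two terms and the two titles are made up for illustration, and splitting on spaces is a deliberate simplification of real tokenizing.</p>
<pre class="brush: jscript;">
using System;

public static class TermCounts
{
    // Rows are terms, columns are documents; each cell counts occurrences.
    public static double[,] Build(string[] terms, string[] documents)
    {
        var counts = new double[terms.Length, documents.Length];

        for (int d = 0; d &lt; documents.Length; d++)
        {
            string[] words = documents[d].ToLowerInvariant().Split(' ');
            foreach (string word in words)
            {
                int t = Array.IndexOf(terms, word);
                if (t &gt;= 0)
                {
                    counts[t, d]++;
                }
            }
        }

        return counts;
    }

    public static void Main()
    {
        // hypothetical terms and titles, not the ones from the paper
        var terms = new[] { &quot;graph&quot;, &quot;trees&quot; };
        var documents = new[] { &quot;graph of trees&quot;, &quot;trees and more trees&quot; };

        double[,] counts = Build(terms, documents);

        Console.WriteLine(&quot;{0} {1}&quot;, counts[0, 0], counts[0, 1]); // 1 0
        Console.WriteLine(&quot;{0} {1}&quot;, counts[1, 0], counts[1, 1]); // 1 2
    }
}
</pre>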
<p>In this example the matrix is made up of the counts of each word (rows) in each document (columns). The next step is to take this matrix and break it down into its different parts using SVD. The result looks like this:</p>
<p><a href="http://eric.ness.net/wp-content/uploads/2009/11/svd.jpg"><img class="alignnone size-full wp-image-320" title="svd" src="http://eric.ness.net/wp-content/uploads/2009/11/svd.jpg" alt="svd" width="600" /></a></p>
<p>After you have performed SVD on the original matrix you then reduce the individual vectors. Reducing the vectors gets rid of some of the &#8220;noise&#8221;, exposing the relationship between words and documents.</p>
<p>One question that arises is how much you want to reduce the vectors (the number of dimensions kept is often called k). There seems to be no hard and fast rule for this, as different papers report different approaches and results with different values. In Sujit Pal&#8217;s <a href="http://sujitpal.blogspot.com/2008/09/ir-math-with-java-tf-idf-and-lsi.html">post</a> he takes the square root of the number of columns of the original matrix (m), rounds it down, and subtracts 1, which I think is a good method to use. For our nine-column example that gives k = 2, which also happens to be the value used in the Deerwester paper. The following picture shows what this looks like (please see Jennifer Flynn&#8217;s presentation <a href="http://www.soe.ucsc.edu/classes/cmps290c/Spring07/proj/Flynn_talk.pdf">Latent Semantic Indexing Using SVD and Riemannian SVD</a> for a more elaborate example):</p>
<p><a href="http://eric.ness.net/wp-content/uploads/2009/11/reduce.jpg"><img class="alignnone size-full wp-image-323" title="reduce" src="http://eric.ness.net/wp-content/uploads/2009/11/reduce.jpg" alt="reduce" width="600" /></a></p>
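<p>The rule of thumb for k described above is simple enough to write down as a one-line helper. This is only a sketch: the RankChoice class and ChooseK name are hypothetical, and the lower bound of 1 is my own guard for very small matrices rather than part of the original rule.</p>
<pre class="brush: jscript;">
using System;

public static class RankChoice
{
    // floor(sqrt(columnCount)) - 1, but never less than 1 (the guard is an assumption)
    public static int ChooseK(int columnCount)
    {
        int k = (int) Math.Floor(Math.Sqrt(columnCount)) - 1;
        return Math.Max(1, k);
    }

    public static void Main()
    {
        // the example matrix has 9 columns (documents)
        Console.WriteLine(ChooseK(9)); // 2
    }
}
</pre>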
<p>After the vectors have been reduced, all that is left to do is multiply them back together again, and that is it. See the result:</p>
<p><a href="http://eric.ness.net/wp-content/uploads/2009/11/lsi1.jpg"><img class="alignnone size-full wp-image-326" title="lsi" src="http://eric.ness.net/wp-content/uploads/2009/11/lsi1.jpg" alt="lsi" width="600" /></a></p>
<p>[<strong>Update</strong>: as one of the readers (Jorge) noted, V in the pictures above is not exactly correct; it should be V transposed (V.Transpose). Please check out the “Indexing by Latent Semantic Analysis” (Deerwester et al.) paper, starting on page 26, for the correct values of the matrices: <a rel="nofollow" href="http://lsa.colorado.edu/papers/JASIS.lsi.90.pdf">http://lsa.colorado.edu/papers/JASIS.lsi.90.pdf</a>. I will try to update this here shortly.]</p>
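<p>To show where the transpose enters when the reduced pieces are multiplied back together, here is a small sketch in plain C# that does not depend on SmartMathLibrary. The RankKSketch class and Reconstruct helper are hypothetical names, and the input is a hand-worked SVD of the 2x2 matrix {{0, 2}, {3, 0}} chosen because it is easy to check on paper; it is not the matrix from the paper.</p>
<pre class="brush: jscript;">
using System;

public static class RankKSketch
{
    // Rebuilds U_k * S_k * V_k^T using only the first k singular values.
    // u is (words x r), s holds the singular values in descending order,
    // v is (documents x r).
    public static double[,] Reconstruct(double[,] u, double[] s, double[,] v, int k)
    {
        int rows = u.GetLength(0);
        int cols = v.GetLength(0);
        var result = new double[rows, cols];

        for (int i = 0; i &lt; rows; i++)
        {
            for (int j = 0; j &lt; cols; j++)
            {
                for (int n = 0; n &lt; k; n++)
                {
                    // v[j, n] rather than v[n, j]: this is where V is transposed
                    result[i, j] += u[i, n] * s[n] * v[j, n];
                }
            }
        }

        return result;
    }

    public static void Main()
    {
        var u = new double[,] { {0, 1}, {1, 0} };
        var s = new double[] { 3, 2 };
        var v = new double[,] { {1, 0}, {0, 1} };

        double[,] full = Reconstruct(u, s, v, 2);  // {{0, 2}, {3, 0}}, the original
        double[,] rank1 = Reconstruct(u, s, v, 1); // {{0, 0}, {3, 0}}, the 2 is dropped

        Console.WriteLine(&quot;{0} {1}&quot;, full[0, 1], full[1, 0]);   // 2 3
        Console.WriteLine(&quot;{0} {1}&quot;, rank1[0, 1], rank1[1, 0]); // 0 3
    }
}
</pre>
<p>With k = 1 only the largest singular value survives, which is the same kind of noise reduction the LSI result relies on, just at a much smaller scale.</p>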
<p>So let&#8217;s take a look at the code. It follows the example outlined in the Deerwester paper. Please keep in mind that this is just a little class I put together in an ASP.NET test app that shows an HTML-formatted matrix of the original (m) and the LSI result:</p>
<pre class="brush: jscript;">
using System;
using SmartMathLibrary;

namespace LSITest
{
    public class lsi
    {
        // the formatted html output of the matrices
        public string ToPrint;

        /// &lt;summary&gt;
        /// Runs the LSI test on the example matrix.
        /// &lt;/summary&gt;
        public void LSITest()
        {
            //Create Matrix
            var testArray = new double[,]
                                {
                                    {1, 0, 0, 1, 0, 0, 0, 0, 0},
                                    {1, 0, 1, 0, 0, 0, 0, 0, 0},
                                    {1, 1, 0, 0, 0, 0, 0, 0, 0},
                                    {0, 1, 1, 0, 1, 0, 0, 0, 0},
                                    {0, 1, 1, 2, 0, 0, 0, 0, 0},
                                    {0, 1, 0, 0, 1, 0, 0, 0, 0},
                                    {0, 1, 0, 0, 1, 0, 0, 0, 0},
                                    {0, 0, 1, 1, 0, 0, 0, 0, 0},
                                    {0, 1, 0, 0, 0, 0, 0, 0, 1},
                                    {0, 0, 0, 0, 0, 1, 1, 1, 0},
                                    {0, 0, 0, 0, 0, 0, 1, 1, 1},
                                    {0, 0, 0, 0, 0, 0, 0, 1, 1}
                                };

            // Load array in to Matrix
            var a = new Matrix(testArray);

            // print original matrix
            PrintMatrix(a);

            // perform Latent Semantic Indexing
            Transform(a);
        }

        /// &lt;summary&gt;
        /// Prints the matrix.
        /// &lt;/summary&gt;
        /// &lt;param name=&quot;myMatrix&quot;&gt;My matrix.&lt;/param&gt;
        private void PrintMatrix(IMatrix myMatrix)
        {
            ToPrint += &quot;&lt;br /&gt;&lt;br /&gt;&quot;;

            for (int i = 0; i &lt; myMatrix.Rows; i++)
            {
                for (int j = 0; j &lt; myMatrix.Columns; j++)
                {
                    ToPrint += String.Format(&quot;{0:0.##}&quot;, myMatrix.MatrixData[i, j]) + &quot;\t&quot;;
                }
                ToPrint += &quot;&lt;br /&gt;&quot;;
            }
        }

        /// &lt;summary&gt;
        /// Transforms the specified my matrix.
        /// &lt;/summary&gt;
        /// &lt;param name=&quot;myMatrix&quot;&gt;My matrix.&lt;/param&gt;
        private void Transform(Matrix myMatrix)
        {
            // Run singular value decomposition
            var svd = new SingularValueDecomposition(myMatrix);
            svd.ExecuteDecomposition();

            // Put components into individual matrices
            Matrix wordVector = svd.U.Copy();
            Matrix sigma = svd.S.ToMatrix();
            Matrix documentVector = svd.V.Copy();

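            // get value of k: the rounded-down square root of the column count
            // (3 for this 9-column example); the reduction below keeps k - 1 = 2 columns
            // you can also manually set the value of k
            var k = (int) Math.Floor(Math.Sqrt(myMatrix.Columns));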

            // reduce the vectors
            Matrix reducedWordVector = CopyMatrix(wordVector, wordVector.Rows, k - 1);
            Matrix reducedSigma = CreateSigmaMatrix(sigma, k - 1, k - 1);
            Matrix reducedDocumentVector = CopyMatrix(documentVector, documentVector.Rows, k - 1);

            // re-compute matrix
            Matrix a = reducedWordVector*reducedSigma*reducedDocumentVector.Transpose();

            // print result
            PrintMatrix(a);
        }

        /// &lt;summary&gt;
        /// Creates the sigma matrix.
        /// &lt;/summary&gt;
        /// &lt;param name=&quot;matrix&quot;&gt;The matrix.&lt;/param&gt;
        /// &lt;param name=&quot;rowEnd&quot;&gt;The row end.&lt;/param&gt;
        /// &lt;param name=&quot;columnEnd&quot;&gt;The column end.&lt;/param&gt;
        /// &lt;returns&gt;&lt;/returns&gt;
        private static Matrix CreateSigmaMatrix(IMatrix matrix, int rowEnd, int columnEnd)
        {
            var copyMatrix = new Matrix(rowEnd, columnEnd);

            for (int i = 0; i &lt; columnEnd; i++)
            {
                copyMatrix.MatrixData[i, i] = matrix.MatrixData[i, 0];
            }

            return copyMatrix;
        }

        /// &lt;summary&gt;
        /// Copies the matrix.
        /// &lt;/summary&gt;
        /// &lt;param name=&quot;myMatrix&quot;&gt;My matrix.&lt;/param&gt;
        /// &lt;param name=&quot;rowEnd&quot;&gt;The row end.&lt;/param&gt;
        /// &lt;param name=&quot;columnEnd&quot;&gt;The column end.&lt;/param&gt;
        /// &lt;returns&gt;&lt;/returns&gt;
        private static Matrix CopyMatrix(IMatrix myMatrix, int rowEnd, int columnEnd)
        {
            var copyMatrix = new Matrix(rowEnd, columnEnd);

            for (int i = 0; i &lt; rowEnd; i++)
            {
                for (int j = 0; j &lt; columnEnd; j++)
                {
                    copyMatrix.MatrixData[i, j] = myMatrix.MatrixData[i, j];
                }
            }

            return copyMatrix;
        }
    }
}
</pre>
<p>Many thanks to Sujit Pal, Jennifer Flynn and Deerwester et al. for their excellent explanations.</p>
<p><strong>Recommended Reading</strong></p>
<p><a href="http://sujitpal.blogspot.com/2008/09/ir-math-with-java-tf-idf-and-lsi.html">IR Math with Java : TF, IDF and LSI</a> &#8211; Sujit Pal</p>
<p><a href="http://lsa.colorado.edu/papers/JASIS.lsi.90.pdf">Indexing by Latent Semantic Analysis</a> &#8211; Deerwester et al</p>
<p><a href="http://www.soe.ucsc.edu/classes/cmps290c/Spring07/proj/Flynn_talk.pdf">Latent Semantic Indexing Using SVD and Riemannian SVD</a> &#8211; Jennifer Flynn</p>
]]></content:encoded>
			<wfw:commentRss>http://eric.ness.net/archives/latent-semantic-indexing/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Singular Value Decomposition</title>
		<link>http://eric.ness.net/archives/singular-value-decomposition/</link>
		<comments>http://eric.ness.net/archives/singular-value-decomposition/#comments</comments>
		<pubDate>Mon, 26 Oct 2009 16:00:06 +0000</pubDate>
		<dc:creator>Eric</dc:creator>
				<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Programming]]></category>

		<guid isPermaLink="false">http://eric.ness.net/?p=299</guid>
		<description><![CDATA[Singular Value Decomposition using the SmartMathLibrary in C#]]></description>
			<content:encoded><![CDATA[


<p>Singular Value Decomposition is something I&#8217;ve been wanting to wrap my head around for a while, now that I am getting really into Machine Learning. Unfortunately, a lot of the material out there is hard to understand and, believe it or not, there are few libraries available for .NET.</p>
<p>So what is singular value decomposition (SVD)? Probably the best description I&#8217;ve run across is:</p>
<blockquote><p>Singular Value Decomposition is a way of factoring matrices into a series of linear approximations that expose the underlying structure of the matrix. SVD is extraordinarily useful and has many applications such as data analysis, signal processing, pattern recognition, image compression, weather prediction, and Latent Semantic Analysis. [<a href="http://puffinwarellc.com/index.php/news-and-articles/articles/30-singular-value-decomposition-tutorial.html">iMetaSearch</a>]</p></blockquote>
<p>The SVD formula is:</p>
<p style="text-align: center;"><strong>M = U&#931;V*</strong></p>
<p style="text-align: left;">M is simply an m-by-n matrix. The columns of U form a set of orthonormal &#8220;output&#8221; basis vector directions for M. The diagonal entries of &#931; are the singular values, which can be thought of as scalar &#8220;gain controls&#8221; by which each corresponding input is multiplied to give a corresponding output. The rows of V* (the conjugate transpose of V) form a set of orthonormal &#8220;input&#8221; or &#8220;analysing&#8221; basis vector directions for M. The best walk through I&#8217;ve come across is over at iMetaSearch <a href="http://puffinwarellc.com/index.php/news-and-articles/articles/30-singular-value-decomposition-tutorial.html">here</a>.</p>
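<p style="text-align: left;">As a rough hand check (a sketch, assuming the 2-by-4 matrix used in the code further down), the singular values of M are the square roots of the non-zero eigenvalues of MM<sup>T</sup>. The arithmetic below is worked out by hand, not taken from any library output:</p>

```latex
M = \begin{pmatrix} 2 & 1 & 0 & 0 \\ 4 & 3 & 0 & 0 \end{pmatrix},
\qquad
M M^{T} = \begin{pmatrix} 5 & 11 \\ 11 & 25 \end{pmatrix}

\det\left(M M^{T} - \lambda I\right) = \lambda^{2} - 30\lambda + 4 = 0
\quad\Longrightarrow\quad
\lambda_{1,2} = 15 \pm \sqrt{221}

\sigma_{1} = \sqrt{15 + \sqrt{221}} \approx 5.465,
\qquad
\sigma_{2} = \sqrt{15 - \sqrt{221}} \approx 0.366
```

<p style="text-align: left;">A quick sanity check: the product of the two singular values should equal the square root of det(MM<sup>T</sup>) = 4, and 5.465 &#215; 0.366 is indeed roughly 2.</p>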
<p style="text-align: left;"><strong>A lack of good .NET Libraries.</strong></p>
<p style="text-align: left;">I tried out four different libraries: <a href="http://smartmathlibrary.codeplex.com/">SmartMathLibrary</a>, <a href="http://latoolnet.codeplex.com/">LatoolNet</a>, <a href="http://www.alglib.net/">ALGLIB</a> and <a href="http://www.codeproject.com/KB/recipes/psdotnetmatrix.aspx?msg=2345970">DotNetMatrix</a>. Out of these four I could only get two completely working, and I ultimately concluded that <a href="http://smartmathlibrary.codeplex.com/">SmartMathLibrary</a> was the best for doing SVD.</p>
<p style="text-align: left;"><strong>The Code</strong></p>
<p style="text-align: left;">Here is the code to replicate this tutorial over at <a href="http://web.mit.edu/be.400/www/SVD/Singular_Value_Decomposition.htm">MIT</a>.</p>
<pre class="brush: jscript;">
using System;
using SmartMathLibrary;

namespace MatrixTest2
{
    internal class Program
    {
        private static void Main(string[] args)
        {
            SVDTest();
            Console.ReadLine();
        }

        private static void SVDTest()
        {
            // Create/load the 2-by-4 input array
            var input = new double[,]
                                     {
                                         {2, 1, 0, 0},
                                         {4, 3, 0, 0}
                                     };

            // Load into a Matrix
            var a = new Matrix(input);

            // Singular Value Decomposition
            var SVD = new SingularValueDecomposition(a);
            SVD.ExecuteDecomposition();

            // Get the vector of singular values
            GeneralVector s = SVD.S;

            // Display results
            Console.WriteLine(a.Transpose().ToString());
            Console.WriteLine();
            Console.WriteLine(s.ToString());
            Console.WriteLine();
            Console.WriteLine(SVD.U.ToString());
            Console.WriteLine();
            Console.WriteLine(SVD.V.ToString());
        }
    }
}
</pre>
<p><strong>Additional Resources</strong></p>
<p><a href="http://puffinwarellc.com/index.php/news-and-articles/articles/30-singular-value-decomposition-tutorial.html">iMetaSearch</a></p>
<p><a href="http://sujitpal.blogspot.com/2008/09/ir-math-with-java-tf-idf-and-lsi.html">IR Math with Java : TF, IDF and LSI</a></p>
<p><a href="http://alias-i.com/lingpipe/demos/tutorial/svd/read-me.html">SVD Tutorial</a></p>
]]></content:encoded>
			<wfw:commentRss>http://eric.ness.net/archives/singular-value-decomposition/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Pearson&#8217;s Correlation Coefficient</title>
		<link>http://eric.ness.net/archives/pearsons-correlation-coefficient/</link>
		<comments>http://eric.ness.net/archives/pearsons-correlation-coefficient/#comments</comments>
		<pubDate>Sun, 25 Oct 2009 19:33:25 +0000</pubDate>
		<dc:creator>Eric</dc:creator>
				<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://eric.ness.net/?p=272</guid>
		<description><![CDATA[Pearson's Correlation Coefficient walk through]]></description>
			<content:encoded><![CDATA[


<p>In Toby Segaran&#8217;s book &#8220;Programming Collective Intelligence&#8221;, an additional method used to determine the similarity between people&#8217;s interests is Pearson&#8217;s correlation coefficient. In statistics, Pearson&#8217;s correlation coefficient is often symbolized simply as r. I also covered Toby&#8217;s Euclidean Distance Score <a title="http://eric.ness.net/archives/euclidean-distance-score/" href="http://eric.ness.net/archives/euclidean-distance-score/">here</a>.</p>
<p><img src="http://eric.ness.net/wp-content/uploads/2009/10/hl_correl_frm_r.png" alt="hl_correl_frm_r" width="197" height="102" /></p>
<p>The formula above is how r is calculated.</p>
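<p>For reference, here is the mean-centered form of Pearson&#8217;s r, which is the form the source code below computes. The image above may show an algebraically equivalent form, so treat this as a sketch of the same quantity, not a transcription of the image:</p>

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^{2} \; \sum_{i=1}^{n} (y_i - \bar{y})^{2}}},
\qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\quad
\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i
```

<p>The value of r always falls between -1 and 1. It is undefined when either array is constant, because the denominator is then zero; the code below returns 0 in that case.</p>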
<p>And here is some sloppy source code to get you going:</p>
<pre class="brush: jscript;">
using System;
using System.Linq;

namespace PearsonTest
{
    internal class Program
    {
        private static void Main(string[] args)
        {
            var myP = new Correlation();

            var lisaRose = new double[] {0, 2, 4, 6, 8, 10, 12};
            var jackMatthews = new[] {2.1, 5, 9, 12.6, 17.3, 21, 24.7};

            double score = myP.PearsonCorrelation(lisaRose, jackMatthews);

            Console.WriteLine(score);
            Console.ReadLine();

            // The answer is 0.99887956534852
        }
    }

    internal class Correlation
    {
        public double PearsonCorrelation(double[] x, double[] y)
        {
            double result;
            double xMean = 0;
            double yMean = 0;
            double xDenom = 0;
            double yDenom = 0;
            double denominator;
            double numerator = 0;
            double n;

            // Make sure arrays are the same size and hold at least 2 values
            if ((x.Count() == y.Count()) &amp;&amp; (x.Count() &gt;= 2))
            {
                n = x.Count();
            }
            else
            {
                result = 0;
                return result;
            }

            // Find Means
            for (int i = 0; i &lt;= n - 1; i++)
            {
                xMean += x[i];
                yMean += y[i];
            }
            xMean = xMean/n;
            yMean = yMean/n;

            // Calculate numerator and denominator
            for (int i = 0; i &lt;= n - 1; i++)
            {
                // Calculate numerator
                double numX = x[i] - xMean;
                double numY = y[i] - yMean;
                numerator += numX*numY;

                // Calculate denominator parts
                xDenom += Math.Pow(numX, 2);
                yDenom += Math.Pow(numY, 2);
            }

            // Calculate denominator
            denominator = Math.Sqrt(xDenom*yDenom);

            // Check for division by zero
            if (denominator == 0)
            {
                result = 0;
            }
            else
            {
                result = numerator/denominator;
            }

            return result;
        }
    }
}
</pre>
]]></content:encoded>
			<wfw:commentRss>http://eric.ness.net/archives/pearsons-correlation-coefficient/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>The Elements of Statistical Learning: Data Mining, Inference, and Prediction. [Free Book]</title>
		<link>http://eric.ness.net/archives/the-elements-of-statistical-learning-data-mining-inference-and-prediction-free-book/</link>
		<comments>http://eric.ness.net/archives/the-elements-of-statistical-learning-data-mining-inference-and-prediction-free-book/#comments</comments>
		<pubDate>Thu, 15 Oct 2009 17:41:21 +0000</pubDate>
		<dc:creator>Eric</dc:creator>
				<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://eric.ness.net/?p=233</guid>
		<description><![CDATA[Free e-book - The Elements of Statistical Learning: Data Mining, Inference, and Prediction]]></description>
			<content:encoded><![CDATA[


<p>I came across this on Our Signal today. It&#8217;s a free e-book: <strong>The Elements of Statistical Learning: Data Mining, Inference, and Prediction</strong>, written by Trevor Hastie, Robert Tibshirani and Jerome Friedman. The book and accompanying site are located <a href="http://www-stat.stanford.edu/~tibs/ElemStatLearn//">here</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://eric.ness.net/archives/the-elements-of-statistical-learning-data-mining-inference-and-prediction-free-book/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Empirical Methods in Natural Language Processing Course</title>
		<link>http://eric.ness.net/archives/empirical-methods-in-natural-language-processing-course/</link>
		<comments>http://eric.ness.net/archives/empirical-methods-in-natural-language-processing-course/#comments</comments>
		<pubDate>Sun, 12 Apr 2009 19:05:11 +0000</pubDate>
		<dc:creator>Eric</dc:creator>
				<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Natural Language Processing]]></category>

		<guid isPermaLink="false">http://eric.ness.net/?p=170</guid>
		<description><![CDATA[I came across this course in Natural Language Processing today while doing some research.]]></description>
			<content:encoded><![CDATA[


<p>I came across this course in Natural Language Processing today while doing some research. It is currently being taught at The University of Edinburgh School of Informatics.</p>
<blockquote><p>This course is an introduction to data-driven methods applied to natural language processing. The emphasis is on methods, but we will survey applications such as syntactic parsing, text classification, information extraction, tagging, summarization. The final lectures will deal with statistical machine translation. </p></blockquote>
<p>See the lecture notes <a href="http://www.inf.ed.ac.uk/teaching/courses/emnlp/">here</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://eric.ness.net/archives/empirical-methods-in-natural-language-processing-course/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>Contextual Valence Shifting</title>
		<link>http://eric.ness.net/archives/contextual-valence-shifting/</link>
		<comments>http://eric.ness.net/archives/contextual-valence-shifting/#comments</comments>
		<pubDate>Tue, 24 Mar 2009 20:45:41 +0000</pubDate>
		<dc:creator>Eric</dc:creator>
				<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Natural Language Processing]]></category>

		<guid isPermaLink="false">http://eric.ness.net/?p=154</guid>
		<description><![CDATA[I have been reading up on Contextual Valence Shifting and I came across two interesting papers I thought I would share.]]></description>
			<content:encoded><![CDATA[


<p>I have been reading up on Contextual Valence Shifting and I came across two interesting papers I thought I would share.</p>
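<p>In a nutshell, contextual valence shifters are words and constructions (negations, intensifiers and the like) that change the positive or negative tone of the words around them, so accounting for them helps determine whether a given sentence has a positive or negative tone overall.</p>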
<blockquote><p>In addition to describing facts and events, texts often communicate information about the attitude of the writer or various participants towards an event being described. Salient clues about attitude are provided by the lexical choice of the writer but, as discussed below, the organization of the text also contributes critical information for attitude assessment.</p></blockquote>
<p><a href="http://www.aaai.org/Papers/Symposia/Spring/2004/SS-04-07/SS04-07-020.pdf">Contextual Valence Shifters</a> by Livia Polanyi and Annie Zaenen [pdf]<br />
<a href="http://www.tacoma.washington.edu/tech/docs/research/gradresearch/ldillard.pdf">&#8220;I Can&#8217;t Recommend This Paper Highly Enough&#8221;: Valence-Shifted Sentences in Sentiment Classification</a> by Logan Dillard</p>
]]></content:encoded>
			<wfw:commentRss>http://eric.ness.net/archives/contextual-valence-shifting/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
