<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>eric.ness.net</title>
	<atom:link href="http://eric.ness.net/feed/" rel="self" type="application/rss+xml" />
	<link>http://eric.ness.net</link>
	<description>...I never learned to read.</description>
	<lastBuildDate>Fri, 05 Mar 2010 04:54:01 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Software Engineering Masters</title>
		<link>http://eric.ness.net/archives/software-engineering-masters/</link>
		<comments>http://eric.ness.net/archives/software-engineering-masters/#comments</comments>
		<pubDate>Fri, 05 Mar 2010 04:54:01 +0000</pubDate>
		<dc:creator>Eric</dc:creator>
				<category><![CDATA[Software Engineering]]></category>

		<guid isPermaLink="false">http://eric.ness.net/?p=455</guid>
		<description><![CDATA[I am starting a masters in Software Engineering at the University of Maryland University College]]></description>
			<content:encoded><![CDATA[<p>So I am starting a master&#8217;s in Software Engineering at the University of Maryland University College. I ended up checking out numerous programs, and it was kind of interesting to see which items became important to me.</p>
<ol>
<li><strong>Online:</strong> I kind of knew that I would probably need a program that is at least partly online. This is mainly because my work often requires me to travel for up to two weeks at a time.</li>
<li><strong>Program Type:</strong> I am in kind of a weird place academically. I graduated with a Bachelor of Science in Information Technology even though I considered myself to be in the Computer Science program. I actually took all the required classes for the Comp Sci degree with the exception of all the math. This is now a bit of a detriment because I could have used about two more math classes that most masters programs require. It really leaves me looking at Software Engineering or Business Intelligence DB programs.</li>
<li><strong>Computer Science Programs:</strong> Interestingly enough, there are many aspects of Computer Science that I am not interested in; namely, many programs have you focus on a particular area (e.g. graphics). The only focus area that really interests me is Machine Learning, and often this gets put under AI &#8211; which often is not really the same.</li>
<li><strong>DC &#8211; Area:</strong> The DC area is kind of a strange place to look for schools. There are only a couple of programs that are actually convenient for me to travel to: George Washington, American, Georgetown, Howard and the University of the District of Columbia. But, truth be told I am actually not really interested in any of these schools for various reasons.</li>
</ol>
]]></content:encoded>
			<wfw:commentRss>http://eric.ness.net/archives/software-engineering-masters/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Apriori Algorithm</title>
		<link>http://eric.ness.net/archives/apriori-algorithm/</link>
		<comments>http://eric.ness.net/archives/apriori-algorithm/#comments</comments>
		<pubDate>Tue, 02 Mar 2010 00:43:31 +0000</pubDate>
		<dc:creator>Eric</dc:creator>
				<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Programming]]></category>

		<guid isPermaLink="false">http://eric.ness.net/?p=445</guid>
		<description><![CDATA[Review of Apriori algorithm and changes.]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve been meaning to get into the <a href="http://datamining.codeplex.com/">Data Mining SDK</a> at CodePlex for a while as it has a couple of good items in it. The one item I was really interested in was the <a href="http://en.wikipedia.org/wiki/Apriori_algorithm">Apriori algorithm</a>.</p>
<p>Wikipedia describes Apriori:</p>
<blockquote><p>In computer science and data mining, Apriori is a classic algorithm for learning association rules. Apriori is designed to operate on databases containing transactions (for example, collections of items bought by customers, or details of a website frequentation). Other algorithms are designed for finding association rules in data having no transactions (Winepi and Minepi), or having no timestamps (DNA sequencing).</p></blockquote>
<p>The classic example: if you own a store and someone buys milk, what is the probability that they will also buy bread and eggs? Or, if voters in one state voted for one issue, what is the chance they also voted for something else? The applications for this approach are pretty much limitless.</p>
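<p>To make that idea concrete, here is a minimal sketch (not code from the SDK) that computes the two standard association-rule measures, support and confidence, for the rule Books =&gt; CD over the eleven sample records listed later in this post. The class and method names are hypothetical, and the SDK may well compute its figures differently.</p>
<pre class="brush: jscript;">
using System;
using System.Linq;

public static class AssociationSketch
{
    // Sample baskets, copied from the records listed later in this post.
    public static readonly string[][] Baskets =
    {
        new[] { &quot;Books&quot;, &quot;CD&quot;, &quot;Video&quot; },
        new[] { &quot;CD&quot;, &quot;Games&quot; },
        new[] { &quot;CD&quot;, &quot;DVD&quot; },
        new[] { &quot;Books&quot;, &quot;CD&quot;, &quot;Games&quot; },
        new[] { &quot;Books&quot;, &quot;DVD&quot; },
        new[] { &quot;CD&quot;, &quot;DVD&quot; },
        new[] { &quot;Books&quot;, &quot;DVD&quot; },
        new[] { &quot;Books&quot;, &quot;CD&quot;, &quot;DVD&quot;, &quot;Video&quot; },
        new[] { &quot;Books&quot;, &quot;CD&quot;, &quot;DVD&quot; },
        new[] { &quot;Books&quot;, &quot;Games&quot; },
        new[] { &quot;Games&quot;, &quot;Lasers&quot; }
    };

    // Number of baskets that contain every item in the given set.
    public static int CountContaining(params string[] items)
    {
        return Baskets.Count(basket =&gt; items.All(item =&gt; basket.Contains(item)));
    }

    public static void Main()
    {
        int total = Baskets.Length;                                // 11 baskets
        int books = CountContaining(&quot;Books&quot;);                      // 7 baskets
        int booksAndCd = CountContaining(&quot;Books&quot;, &quot;CD&quot;);           // 4 baskets

        // Support: share of all baskets that contain both items.
        double support = (double)booksAndCd / total;

        // Confidence: share of the Books baskets that also contain CD.
        double confidence = (double)booksAndCd / books;

        Console.WriteLine(&quot;Support(Books, CD) = {0:P1}&quot;, support);
        Console.WriteLine(&quot;Confidence(Books =&gt; CD) = {0:P1}&quot;, confidence);
    }
}
</pre>
<p>With these records, Books and CD appear together in 4 of 11 baskets (support of roughly 36%), and 4 of the 7 baskets containing Books also contain CD (confidence of roughly 57%).</p>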
<p>The code in the SDK is pretty good with a couple of exceptions: there is little documentation and it only supports XML files and OleDb data connections. I have reworked it so it will also connect to a MSSQL database.</p>
<p>For this test application I created a simple C# Console Application and imported the &#8220;APriori&#8221; project into the solution. You will need to add these two bits of code to classes in the APriori project:</p>
<p>Add this method to DataAccessLayer.cs</p>
<pre class="brush: jscript;">
	public Data GetTransactionsData(string rdbmsConnectionString, string dataSource)
        {
            myDatabase = new Data();
            string query = &quot;SELECT * FROM &quot; + dataSource;
            var myConn = new SqlConnection(rdbmsConnectionString);
            var myDBAdapter = new SqlDataAdapter(query, myConn);

            myConn.Open();
            try
            {
                myDBAdapter.Fill(myDatabase, &quot;TransactionTable&quot;);
            }
            finally
            {
                myConn.Close();
            }
            return myDatabase;
        }
</pre>
<p>Add this method to DataMining.cs</p>
<pre class="brush: jscript;">
public Data MarketBasedAnalysis(double supportCount, double minimumConfidence, string connectionString, string dataSource)
        {

            Database database = new Database();
            ItemsetCandidate Item = new ItemsetCandidate();

            this.AP = new APriori.Apriori();
            this.AP.ProgressMonitorEvent += new ProgressMonitorEventHandler(this.OnProgressMonitoringCompletedEvent);
            this.dataBase = database.GetTransactionsData(connectionString, dataSource);
            database.Transactions = this.dataBase;
            this.transactionsCount = this.dataBase.TransactionTable.Count;

            supportCount = ((supportCount / 100) * this.transactionsCount);

            minimumConfidence = (minimumConfidence / 100);

            string support = &quot;SupportCount &gt;= &quot; + supportCount + &quot; AND Level &gt; 1&quot;;

            string sort = &quot;SupportCount, Level&quot;;
            ItemsetCandidate uniqueItems = AP.CreateOneItemsets(database);
            AP.AprioriGenerator(uniqueItems, database, Convert.ToInt32(supportCount));
            ItemsetArrayList[] keys = database.GetItemset(support, sort);
            string msg = &quot;Creating Frequent Subsets for Items&quot;;
            ProgressMonitorEventArgs e = new ProgressMonitorEventArgs(1, 100, 95, &quot;DataMining.MarketBasedAnalysis(3)&quot;, msg);
            this.OnProgressMonitorEvent(e);

            for (int counter = 0; counter &lt; keys.Length; counter++)
            {
                AP.CreateItemsetSubsets(0, keys[counter], null, database);
            }

            msg = &quot;Completed C#.NET Data Mining Market Based Analysis&quot;;
            e = new ProgressMonitorEventArgs(1, 100, 100, &quot;DataMining.MarketBasedAnalysis(3)&quot;, msg);
            this.OnProgressMonitorEvent(e);

            //Set the public properties of the class
            this.minimumSupportCount = supportCount;
            this.minimumConfidence = minimumConfidence;
            this.connectionString = connectionString;
            this.dataSource = dataSource;
            this.dataSourceCommand = dataSourceCommand;

            //return the database of transactions
            return this.dataBase;

        }
</pre>
<p>Here is my class in my console application</p>
<pre class="brush: jscript;">
using System;
using System.Data;
using VISUAL_BASIC_DATA_MINING_NET;
using VISUAL_BASIC_DATA_MINING_NET.CustomEvents;

namespace APr2.classes
{
    internal class testrun
    {
        private Data _dataAnalysis;
        public event ProgressMonitorEventHandler ProgressMonitorEvent;

        /// &lt;summary&gt;
        /// Runs the Apriori.
        /// &lt;/summary&gt;
        public void RunApriori()
        {
            // Create Data Mining Object
            var myDM = new DataMining();

            // Register Event
            myDM.ProgressMonitorEvent += OnProgressMonitorEvent;

            // Connect To Data Base &amp; Process Items
            _dataAnalysis = myDM.MarketBasedAnalysis(2,             // Support Count
                                                     2,             // Minimum Confidence
                                                     @&quot;Data Source=(local);Initial Catalog=Apriori;Integrated Security=True;&quot;, // Connection String
                                                     &quot;Example&quot;);    // Table in db

            // Copy to Data View
            var dataView = new ViewData();
            _dataAnalysis.Tables.Add(dataView.CreateViewRulesTable(2, _dataAnalysis).Copy());
            _dataAnalysis.Tables.Add(dataView.CreateViewSubsetTable(_dataAnalysis).Copy());

            // Spacer Line
            Console.WriteLine();

            // Print Items
            foreach (DataRow row in dataView.ViewDataSet.Tables[1].Rows)
            {
                double per = Convert.ToDouble(row.ItemArray[2].ToString().Substring(0, (row.ItemArray[2].ToString().Length -1)));
                Console.WriteLine(row.ItemArray[0] + &quot;\t&quot; + row.ItemArray[1] + &quot;\t&quot; + String.Format(&quot;{0:###.##%}&quot;, (per/100)));
            }
        }

        /// &lt;summary&gt;
        /// Called when [progress monitor event].
        /// &lt;/summary&gt;
        /// &lt;param name=&quot;sender&quot;&gt;The sender.&lt;/param&gt;
        /// &lt;param name=&quot;e&quot;&gt;The &lt;see cref=&quot;VISUAL_BASIC_DATA_MINING_NET.CustomEvents.ProgressMonitorEventArgs&quot;/&gt; instance containing the event data.&lt;/param&gt;
        public void OnProgressMonitorEvent(object sender, ProgressMonitorEventArgs e)
        {
            // Prints Event Messages
            Console.Write(&quot;\r&quot; + e.EventMessage);
        }
    }
}
</pre>
<p>Your MSSQL Code will be this</p>
<pre class="brush: jscript;">
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[Example](
	[TransactionID] [int] IDENTITY(1,1) NOT NULL,
	[Transactions] [nvarchar](50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
 CONSTRAINT [PK_Example] PRIMARY KEY CLUSTERED
(
	[TransactionID] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]
</pre>
<p>And these records:</p>
<pre class="brush: jscript;">
1	Books, CD, Video
2	CD, Games
3	CD, DVD
4	Books, CD, Games
5	Books, DVD
6	CD, DVD
7	Books, DVD
8	Books, CD, DVD, Video
9	Books, CD, DVD
10	Books, Games
11	Games, Lasers
</pre>
<p>Run the RunApriori() method in my class and it will yield you the correct results. Have fun.</p>
<p><a href="http://eric.ness.net/wp-content/uploads/2010/03/ap_full.jpg"><img class="alignnone size-full wp-image-448" title="ap_full" src="http://eric.ness.net/wp-content/uploads/2010/03/ap_full.jpg" alt="" width="577" height="369" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://eric.ness.net/archives/apriori-algorithm/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>UUorld Map Visualization</title>
		<link>http://eric.ness.net/archives/uuorld-map-visualization/</link>
		<comments>http://eric.ness.net/archives/uuorld-map-visualization/#comments</comments>
		<pubDate>Mon, 01 Feb 2010 20:50:28 +0000</pubDate>
		<dc:creator>Eric</dc:creator>
				<category><![CDATA[Visualization]]></category>

		<guid isPermaLink="false">http://eric.ness.net/?p=423</guid>
		<description><![CDATA[UUorld has quickly become one of my new favorite toys as of late.]]></description>
			<content:encoded><![CDATA[<p><a title="UUorld" href="http://www.uuorld.com/">UUorld</a> has quickly become one of my new favorite toys as of late. What is it? As the site explains it &#8220;provides an immersive mapping environment, high-quality data, and critical analysis tools.&#8221;</p>
<p>Here is just a simple result of what the map looks like:</p>
<p><a href="http://eric.ness.net/wp-content/uploads/2010/02/example1.jpg"><img class="alignnone size-full wp-image-424" title="example1" src="http://eric.ness.net/wp-content/uploads/2010/02/example1.jpg" alt="" width="577" height="351" /></a></p>
<p>Here are some of the highlights:</p>
<ol>
<li>Supports time series data</li>
<li>Fairly extensive database to pull data from online (apparently 10,000 different datasets)</li>
<li>Create your own datasets via csv files</li>
<li>Export to video</li>
<li>Export to KML file for use in Google Maps/Earth.</li>
<li>Has the following border sets: Country, US States/Counties, US Zip Codes</li>
</ol>
<p>There is one caveat that I feel I must add to not give the impression that all is rosy &#8211; I bought the application over a week ago and it took several e-mails for them to reply to me and finally get my full download.</p>
<p><a href="http://eric.ness.net/wp-content/uploads/2010/02/example2.jpg"><img class="alignnone size-full wp-image-425" title="example2" src="http://eric.ness.net/wp-content/uploads/2010/02/example2.jpg" alt="" width="577" height="351" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://eric.ness.net/archives/uuorld-map-visualization/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Scatterplots Using R and MSSQL</title>
		<link>http://eric.ness.net/archives/scatterplots-using-r-and-mssql/</link>
		<comments>http://eric.ness.net/archives/scatterplots-using-r-and-mssql/#comments</comments>
		<pubDate>Fri, 27 Nov 2009 15:00:57 +0000</pubDate>
		<dc:creator>Eric</dc:creator>
				<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Visualization]]></category>
		<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://eric.ness.net/?p=412</guid>
		<description><![CDATA[Scatterplots Using R and MSSQL]]></description>
			<content:encoded><![CDATA[<p>As an extension of <a href="http://eric.ness.net/archives/histogram-lattices-using-r-and-mssql/">yesterday&#8217;s post</a>, here is another fairly cool chart you can do in <a href="http://www.r-project.org/">R</a>. For this little sample we are using the same data as before, but for my SQL query I have to do a crosstab query. So let&#8217;s take a look at the code:</p>
<pre class="brush: jscript;">
# includes
library(RODBC)

# create connection
channel &lt;- odbcConnect(&quot;HealthDB&quot;)

# query database
myData &lt;- sqlQuery(channel, &quot;SELECT
Country AS 'Country',
Year AS 'Year',
[96741] AS 'GDP growth (annual %)--WDI-2009',
[96841] AS 'GDP per capita (constant 2000 US$)--WDI-2009',
[99941] AS 'Population growth (annual %)--WDI-2009',
[100041] AS 'Population, total--WDI-2009'
FROM
(
SELECT DISTINCT CountryID, Country, Year, IndicatorID, IndValue
FROM [Time Series Data]
WHERE (
((IndicatorID) = 96741) OR
((IndicatorID) = 96841) OR
((IndicatorID) = 99941) OR
((IndicatorID) = 100041))
AND
(((CountryID) = 4118) OR
((CountryID) = 4125) OR
((CountryID) = 4129) OR
((CountryID) = 4134) OR
((CountryID) = 4141) OR
((CountryID) = 4145) OR
((CountryID) = 4164) OR
((CountryID) = 4186) OR
((CountryID) = 4213) OR
((CountryID) = 4327) OR
((CountryID) = 4219) OR
((CountryID) = 4221) OR
((CountryID) = 4227) OR
((CountryID) = 4230) OR
((CountryID) = 4243) OR
((CountryID) = 4326) OR
((CountryID) = 4268) OR
((CountryID) = 4272) OR
((CountryID) = 4273) OR
((CountryID) = 4325) OR
((CountryID) = 4300) OR
((CountryID) = 4308) OR
((CountryID) = 4309) OR
((CountryID) = 4311) OR
((CountryID) = 4316))
AND
(NOT (Year IS NULL)) AND (Year &gt;= 1960) AND
(Year &lt;= 2007))
ps
PIVOT (
MAX(IndValue)
FOR IndicatorID IN ([96741], [96841], [99941], [100041] ) )
AS
pvt
order by Country, Year&quot;)

#close connection
odbcClose(channel)

#Plot charts
plot(myData[3:6], col=&quot;orange&quot;, main=&quot;Select Indicators for Europe and Central Asia&quot;)
</pre>
<p>Here is the result</p>
<p><a href="http://eric.ness.net/wp-content/uploads/2009/11/Scatterplot.jpg"><img class="alignnone size-full wp-image-414" title="Scatterplot" src="http://eric.ness.net/wp-content/uploads/2009/11/Scatterplot.jpg" alt="Scatterplot" width="577" height="400" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://eric.ness.net/archives/scatterplots-using-r-and-mssql/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Histogram Lattices Using R and MSSQL</title>
		<link>http://eric.ness.net/archives/histogram-lattices-using-r-and-mssql/</link>
		<comments>http://eric.ness.net/archives/histogram-lattices-using-r-and-mssql/#comments</comments>
		<pubDate>Thu, 26 Nov 2009 19:22:41 +0000</pubDate>
		<dc:creator>Eric</dc:creator>
				<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Visualization]]></category>
		<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://eric.ness.net/?p=402</guid>
		<description><![CDATA[Creating Histogram Lattices Using R and MSSQL]]></description>
			<content:encoded><![CDATA[<p>After getting Joseph Adler&#8217;s book &#8220;<a href="http://oreilly.com/catalog/9780596009427">Baseball Hacks</a>&#8221; I&#8217;ve been wanting to get into <a href="http://www.r-project.org/">R</a>. R is simply an amazing open source statistics/graphing application. For this example we are going to pull data from a MSSQL database and make a histogram lattice of a couple of countries.</p>
<p>First, I pulled the data from the <a href="http://healthsystems2020.healthsystemsdatabase.org/datasets/timeseriesdataset.aspx">HealthSystems2020</a> time series database and imported the data into MSSQL. I did some minor touch-ups to the database, giving the indicator an id etc. The second thing you need to do is create an ODBC connection for your database; here is a fairly good <a href="http://www.devasp.com/samples/dsn_sql.asp">tutorial</a>. In this example I called my ODBC DSN &#8220;HealthDB&#8221;. Also make sure you adjust your SQL query so that it pulls the correct names/values.</p>
<p>Finally, here is the code:</p>
<pre class="brush: jscript;">

# includes
library(RODBC)
library(lattice)

# create connection
channel &lt;- odbcConnect(&quot;HealthDB&quot;)

# query database
myData &lt;- sqlQuery(channel, &quot;SELECT Country, IndValue
FROM         [YOURTABLE]
WHERE     (id = 96841) AND (
(Country = 'Afghanistan') OR
(Country = 'Bangladesh') OR
(Country = 'Bhutan') OR
(Country = 'India') OR
(Country = 'Maldives') OR
(Country = 'Nepal') OR
(Country = 'Pakistan') OR
(Country = 'China') OR
(Country = 'Indonesia') OR
(Country = 'Sri Lanka'))&quot;)

#close connection
odbcClose(channel)

#create histogram: column 2 is IndValue, column 1 is Country (one panel per country)
histogram(~ myData[,2] | myData[,1], type=&quot;count&quot;, col=&quot;red&quot;, main = &quot;GDP per capita (constant 2000 US$)&quot;, xlab=&quot;Country&quot;)
</pre>
<p>And here is the result:<br />
<a href="http://eric.ness.net/wp-content/uploads/2009/11/histogram.jpg"><img class="alignnone size-full wp-image-403" title="histogram" src="http://eric.ness.net/wp-content/uploads/2009/11/histogram.jpg" alt="histogram" width="577" height="376" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://eric.ness.net/archives/histogram-lattices-using-r-and-mssql/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Benford&#8217;s Law and Trailing Digit Tests</title>
		<link>http://eric.ness.net/archives/befords-law-and-trailing-digit-tests/</link>
		<comments>http://eric.ness.net/archives/befords-law-and-trailing-digit-tests/#comments</comments>
		<pubDate>Wed, 11 Nov 2009 14:23:50 +0000</pubDate>
		<dc:creator>Eric</dc:creator>
				<category><![CDATA[Programming]]></category>

		<guid isPermaLink="false">http://eric.ness.net/?p=371</guid>
		<description><![CDATA[Looking at Benford's Law and Trailing Digit Test]]></description>
			<content:encoded><![CDATA[<p>As of late I&#8217;ve been coming across <a href="http://eric.ness.net/archives/benfords-law/">Benford&#8217;s Law</a> all over the place and so I thought I would revisit the topic. Benford&#8217;s Law essentially states &#8220;that in lists of numbers from many (but not all) real-life sources of data, the leading digit is distributed in a specific, non-uniform way&#8221;. More specifically, you should expect the leading digit &#8216;1&#8217; to appear about 30.1% of the time, &#8216;2&#8217; about 17.6% and so on [<a href="http://en.wikipedia.org/wiki/Benford%27s_law#Mathematical_statement">see table</a>]. One classic example for the use of Benford&#8217;s Law is in fraud detection.</p>
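<p>Those expected percentages follow from the formula P(d) = log10(1 + 1/d) for a leading digit d from 1 to 9. Here is a small sketch that prints the expected share for each digit; the class and method names are made up for illustration.</p>
<pre class="brush: jscript;">
using System;

public static class BenfordExpected
{
    // Expected share of values whose leading digit is d (1 to 9): log10(1 + 1/d).
    public static double ExpectedShare(int digit)
    {
        if (digit &lt; 1 || digit &gt; 9)
        {
            throw new ArgumentOutOfRangeException(&quot;digit&quot;, &quot;Leading digit must be 1 to 9.&quot;);
        }
        return Math.Log10(1.0 + 1.0 / digit);
    }

    public static void Main()
    {
        // Print one line per leading digit: the digit, then its expected share.
        for (int d = 1; d &lt;= 9; d++)
        {
            Console.WriteLine(&quot;{0}\t{1:F3}&quot;, d, ExpectedShare(d));
        }
    }
}
</pre>
<p>ExpectedShare(1) evaluates to about 0.301 and ExpectedShare(2) to about 0.176, and the nine shares add up to 1.</p>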
<p>One of the reasons I wanted to revisit this topic is that after having a conversation with a friend about this very topic, I came across a series of articles written by Nate Silver regarding the polling firm Strategic Vision. Strategic Vision released a poll focusing on <a href="http://www.fivethirtyeight.com/2009/09/are-oklahoma-students-really-this-dumb.html">Oklahoma students</a> and it <a href="http://www.fivethirtyeight.com/2009/09/strategic-vision-polls-exhibit-unusual.html">motivated Silver to ask a series of questions about the firm</a>.</p>
<p>Now I don&#8217;t really want to focus on all the Strategic Vision stuff, but I do want to talk about a method Silver used to detect anomalies in their polling. As Silver correctly suggests, polling would not be a good candidate for Benford&#8217;s Law (e.g. take the last Presidential race where for the most part the two candidates were going back and forth in the 40~50% range &#8211; Benford&#8217;s wouldn&#8217;t work). However, there is another method that might give you a little insight, as Silver explains:</p>
<blockquote><p><span id="fullpost">For each question, I recorded the <span style="font-style: italic;">trailing digit</span> for each candidate or line item. For instance, if Strategic Vision had Barack Obama beating John McCain 48-43 in a particular state, I&#8217;d record a tally in the 8 column and another in the 3 column. Or if they had voters opposing a particular policy 50-45, I&#8217;d record a tally in the 0 column (for 50) and another in the 5 column (for 45). </span></p></blockquote>
<p><span>What Silver essentially says is that if you look at the last digit, you should see a roughly </span>uniform distribution. Put another way, if I have roughly 200 4&#8217;s, I would expect roughly 200 8&#8217;s too. Silver also notes that, in some cases, deviations from this distribution might be due to rounding errors or a specific mathematical method. The trailing digit test is clearly not a surefire way to detect fraud, but it is another useful tool to see if the data passes the smell test.</p>
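<p>As a rough illustration of the tallying Silver describes, here is a sketch that bins the last digit of whole-number poll results and shows what a perfectly uniform spread would put in each bin. The class and method names are hypothetical, and the four values are just the example numbers from the quote above.</p>
<pre class="brush: jscript;">
using System;

public static class TrailingDigitSketch
{
    // Tally the last digit of each whole-number result into bins 0 to 9.
    public static int[] Tally(int[] values)
    {
        var bins = new int[10];
        foreach (int value in values)
        {
            // value % 10 keeps the sign in C#, so take the absolute value of the remainder.
            bins[Math.Abs(value % 10)] += 1;
        }
        return bins;
    }

    public static void Main()
    {
        // The example results from the quote: 48-43 and 50-45.
        int[] results = { 48, 43, 50, 45 };
        int[] bins = Tally(results);

        // Under a uniform distribution each bin would hold total / 10 values.
        double expectedPerBin = results.Length / 10.0;

        for (int digit = 0; digit &lt;= 9; digit++)
        {
            Console.WriteLine(&quot;{0}\t{1}\t(uniform: {2})&quot;, digit, bins[digit], expectedPerBin);
        }
    }
}
</pre>
<p>With only four values this proves nothing, of course; the comparison only becomes meaningful over hundreds of results.</p>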
<p>Silver&#8217;s insight prompted me to write some code and play around with these two methods just to see what comes up. Because I am fairly familiar with the United Nations&#8217; World Health Database, I thought I would run some tests using these methods. Here are the results:</p>
<p><strong>Gross Domestic Product (GDP)</strong></p>
<p>As you can see, GDP generally follows Benford&#8217;s Law and the trailing digit test. Something to note on the trailing digit results &#8211; the average for each number (bin) is 23.7 so the number of 1&#8217;s and the number of 5&#8217;s are roughly equidistant from the average, even though it seems a little odd finding a GDP with a value ending in 1 is almost twice as likely as finding a value ending in 5.</p>
<p><a href="http://eric.ness.net/wp-content/uploads/2009/11/gdp.jpg"><img class="alignnone size-full wp-image-374" title="gdp" src="http://eric.ness.net/wp-content/uploads/2009/11/gdp.jpg" alt="gdp" width="600" height="250" /></a></p>
<p><a href="http://eric.ness.net/wp-content/uploads/2009/11/gdp-trail.jpg"><img class="alignnone size-full wp-image-375" title="gdp trail" src="http://eric.ness.net/wp-content/uploads/2009/11/gdp-trail.jpg" alt="gdp trail" width="600" height="250" /></a></p>
<p><strong>Life expectancy at birth (years)</strong></p>
<p>Now here is an example where Benford&#8217;s Law will not work. The reason is that the range of life spans for all countries is from 40-83 years of age, so the leading digit can only be 4 through 8 and we are going to have to focus on the trailing digit test.</p>
<p><a href="http://eric.ness.net/wp-content/uploads/2009/11/life-exp.jpg"><img class="alignnone size-full wp-image-378" title="life exp" src="http://eric.ness.net/wp-content/uploads/2009/11/life-exp.jpg" alt="life exp" width="600" height="250" /></a></p>
<p><a href="http://eric.ness.net/wp-content/uploads/2009/11/life-exp-trail.jpg"><img class="alignnone size-full wp-image-379" title="life exp trail" src="http://eric.ness.net/wp-content/uploads/2009/11/life-exp-trail.jpg" alt="life exp trail" width="600" height="250" /></a></p>
<p>You will notice that there is something weird here with the trailing digits. First, the average value is 19.3 and we have 0&#8217;s and 2&#8217;s appearing almost 3 times more than 7&#8217;s. One thing that might account for this disparity is the fact that we have a little over 20 countries that fall in the range of 80-83, which would tilt the values of 0,1,2,3 a little higher than normal. And the number of countries in the 40&#8217;s is also a little sparse. So what I did was remove these from the set and re-run the test. Here are the results.</p>
<p><a href="http://eric.ness.net/wp-content/uploads/2009/11/life-exp-norm-benford.jpg"><img class="alignnone size-full wp-image-382" title="life exp norm benford" src="http://eric.ness.net/wp-content/uploads/2009/11/life-exp-norm-benford.jpg" alt="life exp norm benford" width="600" height="250" /></a></p>
<p><a href="http://eric.ness.net/wp-content/uploads/2009/11/life-exp-norm-trail.jpg"><img class="alignnone size-full wp-image-383" title="life exp norm trail" src="http://eric.ness.net/wp-content/uploads/2009/11/life-exp-norm-trail.jpg" alt="life exp norm trail" width="600" height="250" /></a></p>
<p>You will notice that after cleaning up the data a 3 is still almost 3 times as likely to appear as a 7 in what should be some fairly naturally occurring numbers. The average for this set is <span id="Label1">14.8.</span></p>
<p><span>Now I think it would be inappropriate for me to draw any hard conclusions about the Life Expectancy data other than to say that some form of rounding has likely occurred due to the fact that the data are all whole numbers.<br />
</span></p>
<p><span>That said though, you can see how these two simple/practical tests can assist in determining whether there has been some human manipulation of the data. Here is the code that I used:</span></p>
<pre class="brush: jscript;">
using System;
using System.Linq;

namespace BenfordsLaw
{
    /// &lt;summary&gt;
    /// Benfords Law Class
    /// &lt;/summary&gt;
    public class Benfords
    {
        /// &lt;summary&gt;
        /// Counts how often each leading digit (1-9) occurs in the data.
        /// &lt;/summary&gt;
        /// &lt;param name=&quot;data&quot;&gt;The data.&lt;/param&gt;
        /// &lt;returns&gt;&lt;/returns&gt;
        public double[] CalculateBenfordsDistribution(double[] data)
        {
            if (data.Count() == 0)
            {
                throw new ArgumentException(&quot;Error: There are no items in your data array.&quot;);
            }

            //Create benford bins to hold counts
            var benfordsContainer = new double[9];

            // Loop through array
            foreach (double number in data)
            {
                // Get absolute value of number
                double currentNumber = Math.Abs(number);

                // Zero, NaN and infinity have no usable leading digit, so skip them
                if (currentNumber == 0 || double.IsNaN(currentNumber) || double.IsInfinity(currentNumber))
                {
                    continue;
                }

                // Scale the number into the range [1, 10) so that
                // its integer part is the leading digit
                double num = currentNumber;
                while (num &lt; 1)
                    num *= 10;
                while (num &gt;= 10)
                    num /= 10;
                PackageNumberInBenfordBin(benfordsContainer, num);
            }

            return benfordsContainer;
        }

        /// &lt;summary&gt;
        /// Counts how often each trailing digit (0-9) occurs in the data.
        /// &lt;/summary&gt;
        /// &lt;param name=&quot;data&quot;&gt;The data.&lt;/param&gt;
        /// &lt;returns&gt;&lt;/returns&gt;
        public double[] TrailingDigitCheck(double[] data)
        {
            if (data.Count() == 0)
            {
                throw new ArgumentException(&quot;Error: There are no items in your data array.&quot;);
            }

            //Create trailing digit bins to hold counts
            var trailingContainer = new double[10];

            // Loop through array
            foreach (double number in data)
            {
                // Get absolute value of number
                double currentNumber = Math.Abs(number);
                string numTemp = currentNumber.ToString();
                string numTemp2 = numTemp.Substring(numTemp.Length - 1);
                PackageNumberInTrailingBin(trailingContainer, Convert.ToDouble(numTemp2));
            }

            return trailingContainer;
        }

        /// &lt;summary&gt;
        /// Packages the number in benford bin.
        /// &lt;/summary&gt;
        /// &lt;param name=&quot;myContainer&quot;&gt;My container.&lt;/param&gt;
        /// &lt;param name=&quot;num&quot;&gt;The num.&lt;/param&gt;
        private static void PackageNumberInBenfordBin(double[] myContainer, double num)
        {
            // Update container totals
            int myNum = Convert.ToInt32(Math.Floor(num));
            myContainer[myNum-1] += 1;
        }

        /// &lt;summary&gt;
        /// Packages the number in trailing bin.
        /// &lt;/summary&gt;
        /// &lt;param name=&quot;myContainer&quot;&gt;My container.&lt;/param&gt;
        /// &lt;param name=&quot;num&quot;&gt;The num.&lt;/param&gt;
        private static void PackageNumberInTrailingBin(double[] myContainer, double num)
        {
            // Update container totals
            int myNum = Convert.ToInt32(Math.Floor(num));
            myContainer[myNum] += 1;
        }
    }
}
</pre>
]]></content:encoded>
			<wfw:commentRss>http://eric.ness.net/archives/befords-law-and-trailing-digit-tests/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>K-Means Document Clustering</title>
		<link>http://eric.ness.net/archives/k-means-document-clustering/</link>
		<comments>http://eric.ness.net/archives/k-means-document-clustering/#comments</comments>
		<pubDate>Fri, 06 Nov 2009 17:35:48 +0000</pubDate>
		<dc:creator>Eric</dc:creator>
				<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Visualization]]></category>
		<category><![CDATA[C#]]></category>
		<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://eric.ness.net/?p=357</guid>
		<description><![CDATA[K-Means Document Clustering in C#]]></description>
			<content:encoded><![CDATA[<p>Building on our <a href="http://eric.ness.net/archives/plotting-documents-words-using-latent-semantic-indexing/">previous example</a>, let&#8217;s take the next step and use the <a href="http://en.wikipedia.org/wiki/K-means_clustering">K-Means</a> clustering algorithm to group the documents into their appropriate categories.</p>
<p>In the paper &#8220;<a href="http://lsa.colorado.edu/papers/JASIS.lsi.90.pdf">Indexing by Latent Semantic Analysis</a>&#8221; (Deerwester et al.), the authors give an example of nine paper titles grouped into two categories: &#8220;human computer interaction&#8221; and &#8220;graphs &amp; trees&#8221;. So far, we&#8217;ve used <a href="http://eric.ness.net/archives/singular-value-decomposition/">Singular Value Decomposition</a> (SVD) and <a href="http://eric.ness.net/archives/latent-semantic-indexing/">Latent Semantic Indexing</a> (LSI) to better understand the relationship between words and documents. In the <a href="http://eric.ness.net/archives/plotting-documents-words-using-latent-semantic-indexing/">last blog post</a> we took the LSI results and plotted the words and documents on a two-dimensional Cartesian plane.</p>
<p>All of this is interesting in its own right; however, the next step is to see which documents belong in each group. One way to do this is with K-Means clustering.</p>
<blockquote><p>Simply speaking k-means clustering is an algorithm to classify or to group your objects based on attributes/features into K number of group. K is positive integer number. The grouping is done by minimizing the sum of squares of distances between data and the corresponding cluster centroid. Thus the purpose of K-mean clustering is to classify the data. [<a href="http://people.revoledu.com/kardi/tutorial/kMean/WhatIs.htm">Kardi Teknomo</a>]</p></blockquote>
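<p>To make the quoted objective concrete, here is a minimal sketch (not part of the project code) of the quantity K-Means tries to minimize: the sum of squared distances between each point and the centroid of the cluster it is assigned to. The names KMeansObjective and SumOfSquares and the sample points are made up for illustration. Note that the project below is run with the Manhattan distance, so this squared Euclidean version illustrates the quote rather than the exact measure used later.</p>
<pre class="brush: jscript;">
using System;

public static class KMeansObjective
{
    // Sum of squared Euclidean distances from each point to its assigned centroid.
    // assignment[i] is the index of the centroid that point i belongs to.
    public static double SumOfSquares(double[][] points, double[][] centroids, int[] assignment)
    {
        double total = 0.0;

        for (int i = 0; i &lt; points.Length; i++)
        {
            double[] centroid = centroids[assignment[i]];

            for (int d = 0; d &lt; points[i].Length; d++)
            {
                double diff = points[i][d] - centroid[d];
                total += diff * diff;
            }
        }

        return total;
    }

    public static void Main()
    {
        // Hypothetical sample data: two tight groups of two points each
        double[][] points =
        {
            new[] {0.0, 0.0}, new[] {0.0, 2.0},
            new[] {10.0, 0.0}, new[] {10.0, 2.0}
        };
        double[][] centroids = { new[] {0.0, 1.0}, new[] {10.0, 1.0} };

        int[] good = {0, 0, 1, 1}; // each point assigned to its nearby centroid
        int[] bad = {0, 1, 0, 1};  // two points assigned to the far centroid

        Console.WriteLine(SumOfSquares(points, centroids, good)); // 4
        Console.WriteLine(SumOfSquares(points, centroids, bad));  // 204
    }
}
</pre>
<p>The sensible grouping scores 4 and the mixed-up grouping scores 204, which is why the algorithm keeps reassigning points until the total stops improving.</p>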
<p>A big chunk of the code is built on the same project we have been working on. I am using <a href="http://sites.google.com/site/docaresh/">Aresh Saharkhiz</a>&#8217;s K-Means implementation in the project, with some minor changes and refactoring of my own.</p>
<p>Let&#8217;s take a look at the code!</p>
<p>This first part is the display (an ASP.NET page).</p>
<pre class="brush: jscript;">
&lt;%@ Page Language=&quot;C#&quot; AutoEventWireup=&quot;true&quot; CodeBehind=&quot;Default.aspx.cs&quot; Inherits=&quot;LSITest._Default&quot; %&gt;
&lt;%@ Register Assembly=&quot;DundasWebChart&quot; Namespace=&quot;Dundas.Charting.WebControl&quot; TagPrefix=&quot;DCWC&quot; %&gt;
&lt;!DOCTYPE html PUBLIC &quot;-//W3C//DTD XHTML 1.0 Transitional//EN&quot; &quot;http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd&quot;&gt;

&lt;html xmlns=&quot;http://www.w3.org/1999/xhtml&quot; &gt;
&lt;head runat=&quot;server&quot;&gt;
    &lt;title&gt;LSI Test&lt;/title&gt;
&lt;/head&gt;
&lt;body&gt;
    &lt;form id=&quot;form1&quot; runat=&quot;server&quot;&gt;
    &lt;div&gt;
        &lt;DCWC:Chart ID=&quot;Chart1&quot; runat=&quot;server&quot; Height=&quot;400px&quot; Width=&quot;400px&quot;
            ImageType=&quot;Jpeg&quot;&gt;
            &lt;Legends&gt;
                &lt;DCWC:Legend Name=&quot;Default&quot; Alignment=&quot;Center&quot; Docking=&quot;Bottom&quot;&gt;&lt;/DCWC:Legend&gt;
            &lt;/Legends&gt;
            &lt;Titles&gt;
                &lt;DCWC:Title Name=&quot;Title1&quot;&gt;
                &lt;/DCWC:Title&gt;
            &lt;/Titles&gt;
            &lt;Series&gt;
                &lt;DCWC:Series Name=&quot;Series1&quot; ChartType=&quot;Point&quot; MarkerBorderColor=&quot;64, 64, 64&quot;
                    ShadowOffset=&quot;1&quot;&gt;
                &lt;/DCWC:Series&gt;
                &lt;DCWC:Series Name=&quot;Series2&quot; ChartType=&quot;Point&quot; MarkerBorderColor=&quot;64, 64, 64&quot;
                    ShadowOffset=&quot;1&quot;&gt;
                &lt;/DCWC:Series&gt;
                &lt;DCWC:Series Name=&quot;Series3&quot; ChartType=&quot;Point&quot; MarkerBorderColor=&quot;64, 64, 64&quot;
                    ShadowOffset=&quot;1&quot;&gt;
                &lt;/DCWC:Series&gt;
            &lt;/Series&gt;
            &lt;ChartAreas&gt;
                &lt;DCWC:ChartArea Name=&quot;Series2&quot;&gt;
                    &lt;axisy interval=&quot;0.5&quot; maximum=&quot;2&quot; minimum=&quot;-1&quot;&gt;
                        &lt;majorgrid linecolor=&quot;Gray&quot; linestyle=&quot;Dash&quot; /&gt;
                    &lt;/axisy&gt;
                    &lt;axisx interval=&quot;0.5&quot; maximum=&quot;2.5&quot; minimum=&quot;-0.5&quot;&gt;
                        &lt;majorgrid linecolor=&quot;Gray&quot; linestyle=&quot;Dash&quot; /&gt;
                    &lt;/axisx&gt;
                &lt;/DCWC:ChartArea&gt;
            &lt;/ChartAreas&gt;
        &lt;/DCWC:Chart&gt;
    &lt;/div&gt;
    &lt;/form&gt;
&lt;/body&gt;
&lt;/html&gt;
</pre>
<p>This is the code-behind for the ASP.NET page. Because we are only dealing with two known categories, K-Means is asked for two clusters and ColorCodeDocuments only compares each document against two centroids. If you wanted more categories, you would have to rewrite the ColorCodeDocuments function.</p>
<pre class="brush: jscript;">
using System;
using System.Data;
using System.Drawing;
using System.Web.UI;
using Dundas.Charting.WebControl;

namespace LSITest
{
    public partial class _Default : Page
    {
        /// &lt;summary&gt;
        /// Handles the Load event of the Page control.
        /// &lt;/summary&gt;
        /// &lt;param name=&quot;sender&quot;&gt;The source of the event.&lt;/param&gt;
        /// &lt;param name=&quot;e&quot;&gt;The &lt;see cref=&quot;System.EventArgs&quot;/&gt; instance containing the event data.&lt;/param&gt;
        protected void Page_Load(object sender, EventArgs e)
        {
            // Perform LSI
            var mylsi = new lsi();
            mylsi.LSITest();
            double[,] myDocs = mylsi.MyDocs;

            // Plot Documents and the k-means
            const string distanceType = &quot;manhattan&quot;;
            PlotDocuments(myDocs, mylsi.MyDocRowCount);
            PlotKMeansPoints(myDocs, 2, distanceType);
            ColorCodeDocuments(distanceType);

            // To plot the words as well, uncomment the next two lines
            //double[,] myWords = mylsi.MyWords;
            //PlotWords(myWords, mylsi.MyWordsRowCount);

            // Comment this line out to show the words series in the legend
            Chart1.Series[&quot;Series2&quot;].ShowInLegend = false;
        }

        /// &lt;summary&gt;
        /// Plots the words.
        /// &lt;/summary&gt;
        /// &lt;param name=&quot;myWords&quot;&gt;My words.&lt;/param&gt;
        /// &lt;param name=&quot;myWordsRowCount&quot;&gt;My words row count.&lt;/param&gt;
        private void PlotWords(double[,] myWords, int myWordsRowCount)
        {
            for (int i = 0; i &lt; myWordsRowCount; i++)
            {
                Chart1.Series[&quot;Series2&quot;].Points.AddXY(myWords[i, 0], myWords[i, 1]);
            }

            // Set point colors and shapes
            Chart1.Series[&quot;Series2&quot;].LegendText = &quot;Words&quot;;
            Chart1.Series[&quot;Series2&quot;].Color = Color.Gray;
            Chart1.Series[&quot;Series2&quot;].MarkerStyle = MarkerStyle.Circle;
            Chart1.Series[&quot;Series2&quot;].MarkerSize = 6;
        }

        /// &lt;summary&gt;
        /// Plots the documents.
        /// &lt;/summary&gt;
        /// &lt;param name=&quot;myDocs&quot;&gt;My docs.&lt;/param&gt;
        /// &lt;param name=&quot;myDocRowCount&quot;&gt;My doc row count.&lt;/param&gt;
        private void PlotDocuments(double[,] myDocs, int myDocRowCount)
        {
            // Load documents
            for (int i = 0; i &lt; myDocRowCount; i++)
            {
                Chart1.Series[&quot;Series1&quot;].Points.AddXY(myDocs[i, 0], myDocs[i, 1]);
            }

            // Set point colors and shapes
            Chart1.Series[&quot;Series1&quot;].LegendText = &quot;Documents&quot;;
            Chart1.Series[&quot;Series1&quot;].Color = Color.Red;
            Chart1.Series[&quot;Series1&quot;].MarkerStyle = MarkerStyle.Diamond;
            Chart1.Series[&quot;Series1&quot;].MarkerSize = 12;
        }

        /// &lt;summary&gt;
        /// Plots the K means points.
        /// &lt;/summary&gt;
        /// &lt;param name=&quot;items&quot;&gt;The items.&lt;/param&gt;
        /// &lt;param name=&quot;k&quot;&gt;The k.&lt;/param&gt;
        /// &lt;param name=&quot;distanceType&quot;&gt;Name of the distance metric, e.g. &quot;manhattan&quot;.&lt;/param&gt;
        private void PlotKMeansPoints(double[,] items, int k, string distanceType)
        {
            ClusterCollection clusters = kmeans.ClusterDataSet(k, items, distanceType);

            for (int i = 0; i &lt; clusters.Count; i++)
            {
                Chart1.Series[&quot;Series3&quot;].Points.AddXY(clusters[i].ClusterMean[0], clusters[i].ClusterMean[1]);
            }

            // Set point colors and shapes
            Chart1.Series[&quot;Series3&quot;].LegendText = &quot;Cluster&quot;;
            Chart1.Series[&quot;Series3&quot;].Color = Color.Gold;
            Chart1.Series[&quot;Series3&quot;].MarkerStyle = MarkerStyle.Star6;
            Chart1.Series[&quot;Series3&quot;].MarkerSize = 18;
        }

        /// &lt;summary&gt;
        /// Color-codes each document according to its nearest K-Means centroid.
        /// &lt;/summary&gt;
        /// &lt;param name=&quot;distanceType&quot;&gt;Type of the distance.&lt;/param&gt;
        private void ColorCodeDocuments(string distanceType)
        {
            var myDist = new similarity();

            // Extract data
            DataSet myDocs = Chart1.DataManipulator.ExportSeriesValues(&quot;Series1&quot;);
            DataSet myKMeansPoints = Chart1.DataManipulator.ExportSeriesValues(&quot;Series3&quot;);

            // Document counter
            int count = 0;

            // Get co-ordinates for k-means points
            double firstKMeansX = Convert.ToDouble(myKMeansPoints.Tables[0].Rows[0][&quot;X&quot;]);
            double firstKMeansY = Convert.ToDouble(myKMeansPoints.Tables[0].Rows[0][&quot;Y&quot;]);
            double secondKMeansX = Convert.ToDouble(myKMeansPoints.Tables[0].Rows[1][&quot;X&quot;]);
            double secondKMeansY = Convert.ToDouble(myKMeansPoints.Tables[0].Rows[1][&quot;Y&quot;]);

            foreach (DataRow docRow in myDocs.Tables[0].Rows)
            {
                // get co-ordinates for current doc
                double currentDocX = Convert.ToDouble(docRow[&quot;X&quot;]);
                double currentDocY = Convert.ToDouble(docRow[&quot;Y&quot;]);

                // load in to arrays
                double[] firstX = {currentDocX, currentDocY};
                double[] firstY = {firstKMeansX, firstKMeansY};
                double[] secondX = {currentDocX, currentDocY};
                double[] secondY = {secondKMeansX, secondKMeansY};

                // find the distance
                double firstDist = myDist.FindDistance(firstX, firstY, distanceType);
                double secondDist = myDist.FindDistance(secondX, secondY, distanceType);

                // Color accordingly
                Chart1.Series[&quot;Series1&quot;].Points[count].Color = firstDist &lt; secondDist ? Color.Blue : Color.Gray;
                count++;
            }
        }
    }
}
</pre>
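<p>ColorCodeDocuments above is hard-wired to two centroids. As a rough sketch of how it might be generalized to any number of clusters, the helper below returns the index of the nearest centroid, which can then be used to look up a color. The names NearestCentroid and IndexOfNearest, the sample centroids and the color names are hypothetical, and the distance is a local Manhattan implementation rather than the project&#8217;s similarity class.</p>
<pre class="brush: jscript;">
using System;

public static class NearestCentroid
{
    // Manhattan distance, matching the distanceType used in Page_Load
    private static double Manhattan(double[] a, double[] b)
    {
        double sum = 0.0;

        for (int i = 0; i &lt; a.Length; i++)
        {
            sum += Math.Abs(a[i] - b[i]);
        }

        return sum;
    }

    // Returns the index of the centroid closest to the point
    public static int IndexOfNearest(double[] point, double[][] centroids)
    {
        int best = 0;
        double bestDistance = Manhattan(point, centroids[0]);

        for (int c = 1; c &lt; centroids.Length; c++)
        {
            double distance = Manhattan(point, centroids[c]);

            if (distance &lt; bestDistance)
            {
                bestDistance = distance;
                best = c;
            }
        }

        return best;
    }

    public static void Main()
    {
        // Hypothetical centroids and color names, for illustration only
        double[][] centroids = { new[] {0.0, 0.0}, new[] {2.0, 2.0}, new[] {0.0, 4.0} };
        string[] palette = { &quot;Blue&quot;, &quot;Gray&quot;, &quot;Green&quot; };

        double[] doc = {0.5, 3.5};
        Console.WriteLine(palette[IndexOfNearest(doc, centroids)]); // Green
    }
}
</pre>
<p>In the code-behind, the returned index could select a Color from an array and be assigned to Chart1.Series[&quot;Series1&quot;].Points[count].Color, replacing the fixed Blue/Gray choice.</p>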
<p>This is the K-Means class written by Aresh Saharkhiz, with my changes.</p>
<pre class="brush: jscript;">
/// Most of this code was written by Aresh Saharkhiz
/// Re-organized by me
/// See Code Project: http://www.codeproject.com/KB/recipes/K-Mean_Clustering.aspx
using System;
using System.Collections;
using System.Data;
using System.Diagnostics;

namespace LSITest
{
    public class kmeans
    {
        /// &lt;summary&gt;
        /// Calculates The Mean Of A Cluster OR The Cluster Center
        /// &lt;/summary&gt;
        /// &lt;param name=&quot;cluster&quot;&gt;
        /// A two-dimensional array containing a dataset of numeric values
        /// &lt;/param&gt;
        /// &lt;returns&gt;
        /// Returns an Array Defining A Data Point Representing The Cluster Mean or Centroid
        /// &lt;/returns&gt;
        public static double[] ClusterMean(double[,] cluster)
        {
            int rowCount = cluster.GetUpperBound(0) + 1;
            int fieldCount = cluster.GetUpperBound(1) + 1;
            var dataSum = new double[1,fieldCount];
            var centroid = new double[fieldCount];

            for (int j = 0; j &lt; fieldCount; j++)
            {
                for (int i = 0; i &lt; rowCount; i++)
                {
                    dataSum[0, j] = dataSum[0, j] + cluster[i, j];
                }

                centroid[j] = (dataSum[0, j]/rowCount);
            }

            return centroid;
        }

        /// &lt;summary&gt;
        /// Separates a dataset into clusters or groups with similar characteristics
        /// &lt;/summary&gt;
        /// &lt;param name=&quot;clusterCount&quot;&gt;The number of clusters or groups to form&lt;/param&gt;
        /// &lt;param name=&quot;data&quot;&gt;An array containing data that will be clustered&lt;/param&gt;
        /// &lt;param name=&quot;type&quot;&gt;Name of the distance metric passed to FindDistance&lt;/param&gt;
        /// &lt;returns&gt;A collection of clusters of data&lt;/returns&gt;
        public static ClusterCollection ClusterDataSet(int clusterCount, double[,] data, string type)
        {
            int rowCount = data.GetUpperBound(0) + 1;
            int fieldCount = data.GetUpperBound(1) + 1;
            int stableClustersCount = 0;
            double[] dataPoint;
            var random = new Random();
            Cluster cluster;
            var clusters = new ClusterCollection();
            var clusterNumbers = new ArrayList(clusterCount);
            var myDist = new similarity();

            while (clusterNumbers.Count &lt; clusterCount)
            {
                // Random.Next excludes its upper bound, so pass rowCount to make every row eligible
                int clusterNumber = random.Next(0, rowCount);

                if (!clusterNumbers.Contains(clusterNumber))
                {
                    cluster = new Cluster();
                    clusterNumbers.Add(clusterNumber);
                    dataPoint = new double[fieldCount];

                    for (int field = 0; field &lt; fieldCount; field++)
                    {
                        dataPoint.SetValue((data[clusterNumber, field]), field);
                    }

                    cluster.Add(dataPoint);
                    clusters.Add(cluster);
                }
            }

            while (stableClustersCount != clusters.Count)
            {
                stableClustersCount = 0;
                ClusterCollection newClusters = ClusterDataSet(clusters, data, type);

                for (int clusterIndex = 0; clusterIndex &lt; clusters.Count; clusterIndex++)
                {
                    if ((myDist.FindDistance(newClusters[clusterIndex].ClusterMean, clusters[clusterIndex].ClusterMean, type)) == 0)
                    {
                        stableClustersCount++;
                    }
                }

                clusters = newClusters;
            }

            return clusters;
        }

        /// &lt;summary&gt;
        /// Separates a dataset into clusters by assigning each data point to its nearest existing cluster mean
        /// &lt;/summary&gt;
        /// &lt;param name=&quot;clusters&quot;&gt;A collection of data clusters&lt;/param&gt;
        /// &lt;param name=&quot;data&quot;&gt;An array containing data to be clustered&lt;/param&gt;
        /// &lt;param name=&quot;type&quot;&gt;Name of the distance metric passed to FindDistance&lt;/param&gt;
        /// &lt;returns&gt;A collection of clusters of data&lt;/returns&gt;
        public static ClusterCollection ClusterDataSet(ClusterCollection clusters, double[,] data, string type)
        {
            double[] dataPoint;
            double firstClusterDistance = 0.0;
            int rowCount = data.GetUpperBound(0) + 1;
            int fieldCount = data.GetUpperBound(1) + 1;
            int position = 0;
            var myDist = new similarity();

            // create a new collection of clusters
            var newClusters = new ClusterCollection();

            for (int count = 0; count &lt; clusters.Count; count++)
            {
                var newCluster = new Cluster();
                newClusters.Add(newCluster);
            }

            if (clusters.Count &lt;= 0)
            {
                throw new SystemException(&quot;Cluster Count Cannot Be Zero!&quot;);
            }

            for (int row = 0; row &lt; rowCount; row++)
            {
                dataPoint = new double[fieldCount];

                for (int field = 0; field &lt; fieldCount; field++)
                {
                    dataPoint.SetValue((data[row, field]), field);
                }

                for (int cluster = 0; cluster &lt; clusters.Count; cluster++)
                {
                    double[] clusterMean = clusters[cluster].ClusterMean;

                    if (cluster == 0)
                    {
                        firstClusterDistance = myDist.FindDistance(dataPoint, clusterMean, type);
                        position = cluster;
                    }
                    else
                    {
                        double secondClusterDistance = myDist.FindDistance(dataPoint, clusterMean, type);

                        if (firstClusterDistance &gt; secondClusterDistance)
                        {
                            firstClusterDistance = secondClusterDistance;
                            position = cluster;
                        }
                    }
                }

                newClusters[position].Add(dataPoint);
            }

            return newClusters;
        }

        /// &lt;summary&gt;
        /// Converts the data table to array.
        /// &lt;/summary&gt;
        /// &lt;param name=&quot;table&quot;&gt;The table.&lt;/param&gt;
        /// &lt;returns&gt;&lt;/returns&gt;
        public static double[,] ConvertDataTableToArray(DataTable table)
        {
            int rowCount = table.Rows.Count;
            int fieldCount = table.Columns.Count;

            var dataPoints = new double[rowCount,fieldCount];

            for (int rowPosition = 0; rowPosition &lt; rowCount; rowPosition++)
            {
                DataRow row = table.Rows[rowPosition];

                for (int fieldPosition = 0; fieldPosition &lt; fieldCount; fieldPosition++)
                {
                    double fieldValue;
                    try
                    {
                        fieldValue = double.Parse(row[fieldPosition].ToString());
                    }
                    catch (Exception ex)
                    {
                        Debug.WriteLine(ex.ToString());
                        throw new InvalidCastException(&quot;Invalid row at &quot; + rowPosition + &quot; and field &quot; + fieldPosition,
                                                       ex);
                    }

                    dataPoints[rowPosition, fieldPosition] = fieldValue;
                }
            }

            return dataPoints;
        }
    }

    /// &lt;summary&gt;
    /// A class containing a group of data with similar characteristics (cluster)
    /// &lt;/summary&gt;
    [Serializable]
    public class Cluster : CollectionBase
    {
        private double[] _clusterMean;
        private double[] _clusterSum;

        /// &lt;summary&gt;
        /// The sum of all the data in the cluster
        /// &lt;/summary&gt;
        public double[] ClusterSum
        {
            get { return _clusterSum; }
        }

        /// &lt;summary&gt;
        /// The mean of all the data in the cluster
        /// &lt;/summary&gt;
        public double[] ClusterMean
        {
            get
            {
                for (int count = 0; count &lt; this[0].Length; count++)
                {
                    _clusterMean[count] = (_clusterSum[count]/List.Count);
                }

                return _clusterMean;
            }
        }

        /// &lt;summary&gt;
        /// Returns the one dimensional array data located at the index
        /// &lt;/summary&gt;
        public virtual double[] this[int index]
        {
            get
            {
                // Return the data point stored at List[index]
                return (double[]) List[index];
            }
        }

        /// &lt;summary&gt;
        /// Adds a single dimension array data to the cluster
        /// &lt;/summary&gt;
        /// &lt;param name=&quot;data&quot;&gt;A 1-dimensional array containing data that will be added to the cluster&lt;/param&gt;
        public virtual void Add(double[] data)
        {
            List.Add(data);

            if (List.Count == 1)
            {
                _clusterSum = new double[data.Length];

                _clusterMean = new double[data.Length];
            }

            for (int count = 0; count &lt; data.Length; count++)
            {
                _clusterSum[count] = _clusterSum[count] + data[count];
            }
        }
    }

    /// &lt;summary&gt;
    /// A collection of Cluster objects or Clusters
    /// &lt;/summary&gt;
    [Serializable]
    public class ClusterCollection : CollectionBase
    {
        /// &lt;summary&gt;
        /// Returns the Cluster at this index
        /// &lt;/summary&gt;
        public virtual Cluster this[int index]
        {
            get
            {
                // Return the Cluster stored at List[index]
                return (Cluster) List[index];
            }
        }

        /// &lt;summary&gt;
        /// Adds a Cluster to the collection of Clusters
        /// &lt;/summary&gt;
        /// &lt;param name=&quot;cluster&quot;&gt;A Cluster to be added to the collection of clusters&lt;/param&gt;
        public virtual void Add(Cluster cluster)
        {
            List.Add(cluster);
        }
    }
}
</pre>
<p>Here is the similarity class that can calculate Euclidean, Manhattan, Chebyshev and Minkowski distances.</p>
<pre class="brush: jscript;">
/// Most of this code was written by Aresh Saharkhiz
/// Re-organized by me
/// See Code Project: http://www.codeproject.com/KB/recipes/Quantitative_Distances.aspx
using System;

namespace LSITest
{
    public class similarity
    {
        /// &lt;summary&gt;
        /// Finds the distance.
        /// &lt;/summary&gt;
        /// &lt;param name=&quot;x&quot;&gt;The first point.&lt;/param&gt;
        /// &lt;param name=&quot;y&quot;&gt;The second point.&lt;/param&gt;
        /// &lt;param name=&quot;distanceType&quot;&gt;The distance metric: euclidean, manhattan, minkowski or chebyshev (case-insensitive).&lt;/param&gt;
        /// &lt;returns&gt;The distance between x and y.&lt;/returns&gt;
        public double FindDistance(double[] x, double[] y, string distanceType)
        {
            double distance;

            switch (distanceType.ToLower())
            {
                case &quot;euclidean&quot;:
                    distance = EuclideanDistance(x, y);
                    break;
                case &quot;manhattan&quot;:
                    distance = ManhattanDistance(x, y);
                    break;
                case &quot;minkowski&quot;:
                    // With order 1 this is equivalent to the Manhattan distance
                    distance = MinkowskiDistance(x, y, 1);
                    break;
                case &quot;chebyshev&quot;:
                    distance = ChebyshevDistance(x, y);
                    break;
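                default:
                    // Fail loudly rather than silently reporting a distance of 0 for an unknown metric
                    throw new ArgumentException(&quot;Unknown distance type: &quot; + distanceType);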
            }

            return distance;
        }

        /// &lt;summary&gt;
        /// Finds the Euclidean distance.
        /// &lt;/summary&gt;
        /// &lt;param name=&quot;x&quot;&gt;The x.&lt;/param&gt;
        /// &lt;param name=&quot;y&quot;&gt;The y.&lt;/param&gt;
        /// &lt;returns&gt;&lt;/returns&gt;
        public double EuclideanDistance(double[] x, double[] y)
        {
            double sum = 0.0;

            if (x.GetUpperBound(0) != y.GetUpperBound(0))
            {
                throw new ArgumentException(&quot;the number of elements in x must match the number of elements in y&quot;);
            }

            int count = x.Length;

            for (int i = 0; i &lt; count; i++)
            {
                sum += Math.Pow(Math.Abs(x[i] - y[i]), 2);
            }

            double distance = Math.Sqrt(sum);
            return distance;
        }

        /// &lt;summary&gt;
        /// Finds Manhattan distance.
        /// &lt;/summary&gt;
        /// &lt;param name=&quot;x&quot;&gt;The x.&lt;/param&gt;
        /// &lt;param name=&quot;y&quot;&gt;The y.&lt;/param&gt;
        /// &lt;returns&gt;&lt;/returns&gt;
        public double ManhattanDistance(double[] x, double[] y)
        {
            double sum = 0.0;

            if (x.GetUpperBound(0) != y.GetUpperBound(0))
            {
                throw new ArgumentException(&quot;the number of elements in x must match the number of elements in y&quot;);
            }

            int count = x.Length;

            for (int i = 0; i &lt; count; i++)
            {
                sum += Math.Abs(x[i] - y[i]);
            }

            double distance = sum;
            return distance;
        }

        /// &lt;summary&gt;
        /// Finds the Chebyshev distance.
        /// &lt;/summary&gt;
        /// &lt;param name=&quot;x&quot;&gt;The x.&lt;/param&gt;
        /// &lt;param name=&quot;y&quot;&gt;The y.&lt;/param&gt;
        /// &lt;returns&gt;&lt;/returns&gt;
        public static double ChebyshevDistance(double[] x, double[] y)
        {
            if (x.GetUpperBound(0) != y.GetUpperBound(0))
            {
                throw new ArgumentException(&quot;the number of elements in x must match the number of elements in y&quot;);
            }
            int count = x.Length;
            var newData = new double[count];

            for (int i = 0; i &lt; count; i++)
            {
                newData[i] = Math.Abs(x[i] - y[i]);
            }
            double max = double.MinValue;

            foreach (double num in newData)
            {
                if (num &gt; max)
                {
                    max = num;
                }
            }
            return max;
        }

        /// &lt;summary&gt;
        /// Finds the Minkowski distance.
        /// &lt;/summary&gt;
        /// &lt;param name=&quot;x&quot;&gt;The x.&lt;/param&gt;
        /// &lt;param name=&quot;y&quot;&gt;The y.&lt;/param&gt;
        /// &lt;param name=&quot;order&quot;&gt;The order.&lt;/param&gt;
        /// &lt;returns&gt;&lt;/returns&gt;
        public double MinkowskiDistance(double[] x, double[] y, double order)
        {
            double sum = 0.0;

            if (x.GetUpperBound(0) != y.GetUpperBound(0))
            {
                throw new ArgumentException(&quot;the number of elements in x must match the number of elements in y&quot;);
            }
            int count = x.Length;

            for (int i = 0; i &lt; count; i++)
            {
                sum = sum + Math.Pow(Math.Abs(x[i] - y[i]), order);
            }

            double distance = Math.Pow(sum, (1 / order));
            return distance;
        }
    }
}
</pre>
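<p>To get a feel for how the metrics differ, here is a small standalone sketch that compares them on a single pair of points. DistanceDemo and its methods are illustrative names and compact re-implementations, not the similarity class itself. Because FindDistance calls MinkowskiDistance with an order of 1, asking it for &#8220;minkowski&#8221; gives the same result as &#8220;manhattan&#8221;.</p>
<pre class="brush: jscript;">
using System;

public static class DistanceDemo
{
    // Minkowski distance of the given order; order 1 is Manhattan, order 2 is Euclidean
    public static double Minkowski(double[] x, double[] y, double order)
    {
        double sum = 0.0;

        for (int i = 0; i &lt; x.Length; i++)
        {
            sum += Math.Pow(Math.Abs(x[i] - y[i]), order);
        }

        return Math.Pow(sum, 1.0 / order);
    }

    // Chebyshev distance: the largest difference along any single axis
    public static double Chebyshev(double[] x, double[] y)
    {
        double max = 0.0;

        for (int i = 0; i &lt; x.Length; i++)
        {
            max = Math.Max(max, Math.Abs(x[i] - y[i]));
        }

        return max;
    }

    public static void Main()
    {
        double[] x = {0.0, 0.0};
        double[] y = {3.0, 4.0};

        Console.WriteLine(Minkowski(x, y, 1)); // Manhattan: 3 + 4 = 7
        Console.WriteLine(Minkowski(x, y, 2)); // Euclidean: sqrt(9 + 16) = 5
        Console.WriteLine(Chebyshev(x, y));    // Chebyshev: max(3, 4) = 4
    }
}
</pre>
<p>The same two points are 7, 5 or 4 units apart depending on the metric, so the choice of distanceType can change which centroid a borderline document ends up closest to.</p>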
<p>And finally the same LSI class used in the previous examples.</p>
<pre class="brush: jscript;">
using System;
using SmartMathLibrary;

namespace LSITest
{
    public class lsi
    {
        // Public results of the LSI run; ToPrint holds the formatted HTML output
        public int MyDocColumnCount;
        public int MyDocRowCount;
        public double[,] MyDocs;
        public double[,] MyWords;
        public int MyWordsColumnCount;
        public int MyWordsRowCount;
        public string ToPrint;

        /// &lt;summary&gt;
        /// Runs the LSI test.
        /// &lt;/summary&gt;
        public void LSITest()
        {
            //Create Matrix
            var testArray = new double[,]
                                {
                                    {1, 0, 0, 1, 0, 0, 0, 0, 0},
                                    {1, 0, 1, 0, 0, 0, 0, 0, 0},
                                    {1, 1, 0, 0, 0, 0, 0, 0, 0},
                                    {0, 1, 1, 0, 1, 0, 0, 0, 0},
                                    {0, 1, 1, 2, 0, 0, 0, 0, 0},
                                    {0, 1, 0, 0, 1, 0, 0, 0, 0},
                                    {0, 1, 0, 0, 1, 0, 0, 0, 0},
                                    {0, 0, 1, 1, 0, 0, 0, 0, 0},
                                    {0, 1, 0, 0, 0, 0, 0, 0, 1},
                                    {0, 0, 0, 0, 0, 1, 1, 1, 0},
                                    {0, 0, 0, 0, 0, 0, 1, 1, 1},
                                    {0, 0, 0, 0, 0, 0, 0, 1, 1}
                                };

            // Load array in to Matrix
            var a = new Matrix(testArray);

            // print original matrix
            PrintMatrix(a);

            // perform Latent Semantic Indexing
            GetDocumentWordPlots(a);
        }

        /// &lt;summary&gt;
        /// Prints the matrix.
        /// &lt;/summary&gt;
        /// &lt;param name=&quot;myMatrix&quot;&gt;My matrix.&lt;/param&gt;
        private void PrintMatrix(IMatrix myMatrix)
        {
            ToPrint += &quot;&lt;br /&gt;&lt;br /&gt;&quot;;

            for (int i = 0; i &lt; myMatrix.Rows; i++)
            {
                for (int j = 0; j &lt; myMatrix.Columns; j++)
                {
                    ToPrint += String.Format(&quot;{0:0.##}&quot;, myMatrix.MatrixData[i, j]) + &quot;\t&quot;;
                }
                ToPrint += &quot;&lt;br /&gt;&quot;;
            }
        }

        /// &lt;summary&gt;
        /// Gets the document word plots.
        /// &lt;/summary&gt;
        /// &lt;param name=&quot;myMatrix&quot;&gt;My matrix.&lt;/param&gt;
        private void GetDocumentWordPlots(Matrix myMatrix)
        {
            // Run singular value decomposition
            var svd = new SingularValueDecomposition(myMatrix);
            svd.ExecuteDecomposition();

            // Put components into individual matrices
            Matrix wordVector = svd.U.Copy();
            Matrix sigma = svd.S.ToMatrix();
            Matrix documentVector = svd.V.Copy();

            // get value of k
            // you can also manually set the value of k
            var k = (int) Math.Floor(Math.Sqrt(myMatrix.Columns));

            // reduce the vectors
            Matrix reducedWordVector = CopyMatrix(wordVector, wordVector.Rows, k - 1);
            Matrix reducedSigma = CreateSigmaMatrix(sigma, k - 1, k - 1);
            Matrix reducedDocumentVector = CopyMatrix(documentVector, documentVector.Rows, k - 1);

            // Recalculate the matrix
            Matrix docs = reducedDocumentVector*reducedSigma;
            Matrix words = reducedWordVector*reducedSigma;

            // Fill doc plot locations
            MyDocs = new double[docs.Rows,docs.Columns];
            for (int i = 0; i &lt; docs.Rows; i++)
            {
                for (int j = 0; j &lt; docs.Columns; j++)
                {
                    MyDocs[i, j] = docs.MatrixData[i, j];
                }
            }

            // Fill word plot locations
            MyWords = new double[words.Rows,words.Columns];
            for (int i = 0; i &lt; words.Rows; i++)
            {
                for (int j = 0; j &lt; words.Columns; j++)
                {
                    MyWords[i, j] = words.MatrixData[i, j];
                }
            }

            // Set counts for charts
            MyDocRowCount = docs.Rows;
            MyWordsRowCount = words.Rows;

            PrintMatrix(docs);
            PrintMatrix(words);
        }

        /// &lt;summary&gt;
        /// Creates the sigma matrix.
        /// &lt;/summary&gt;
        /// &lt;param name=&quot;matrix&quot;&gt;The matrix.&lt;/param&gt;
        /// &lt;param name=&quot;rowEnd&quot;&gt;The row end.&lt;/param&gt;
        /// &lt;param name=&quot;columnEnd&quot;&gt;The column end.&lt;/param&gt;
        /// &lt;returns&gt;&lt;/returns&gt;
        private static Matrix CreateSigmaMatrix(IMatrix matrix, int rowEnd, int columnEnd)
        {
            var copyMatrix = new Matrix(rowEnd, columnEnd);

            for (int i = 0; i &lt; columnEnd; i++)
            {
                copyMatrix.MatrixData[i, i] = matrix.MatrixData[i, 0];
            }

            return copyMatrix;
        }

        /// &lt;summary&gt;
        /// Copies the matrix.
        /// &lt;/summary&gt;
        /// &lt;param name=&quot;myMatrix&quot;&gt;My matrix.&lt;/param&gt;
        /// &lt;param name=&quot;rowEnd&quot;&gt;The row end.&lt;/param&gt;
        /// &lt;param name=&quot;columnEnd&quot;&gt;The column end.&lt;/param&gt;
        /// &lt;returns&gt;&lt;/returns&gt;
        private static Matrix CopyMatrix(IMatrix myMatrix, int rowEnd, int columnEnd)
        {
            var copyMatrix = new Matrix(rowEnd, columnEnd);

            for (int i = 0; i &lt; rowEnd; i++)
            {
                for (int j = 0; j &lt; columnEnd; j++)
                {
                    copyMatrix.MatrixData[i, j] = myMatrix.MatrixData[i, j];
                }
            }

            return copyMatrix;
        }
    }
}
</pre>
<p>And what do the results look like?</p>
<p><a href="http://eric.ness.net/wp-content/uploads/2009/11/kmeansresults.jpg"><img class="alignnone size-full wp-image-361" style="margin-left: 100px; margin-right: 100px;" title="kmeansresults" src="http://eric.ness.net/wp-content/uploads/2009/11/kmeansresults.jpg" alt="kmeansresults" width="400" height="400" /></a></p>
<p>As you can see, the K-Means clustering algorithm correctly grouped the documents into the appropriate categories.</p>
<p>Thanks go to <a href="http://www.codeproject.com/KB/recipes/K-Mean_Clustering.aspx">Aresh Saharkhiz</a> for sharing his implementation of K-Means clustering, which is also recommended reading.</p>
]]></content:encoded>
			<wfw:commentRss>http://eric.ness.net/archives/k-means-document-clustering/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Plotting Documents &amp; Words: Using Latent Semantic Indexing</title>
		<link>http://eric.ness.net/archives/plotting-documents-words-using-latent-semantic-indexing/</link>
		<comments>http://eric.ness.net/archives/plotting-documents-words-using-latent-semantic-indexing/#comments</comments>
		<pubDate>Fri, 06 Nov 2009 00:32:22 +0000</pubDate>
		<dc:creator>Eric</dc:creator>
				<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Visualization]]></category>

		<guid isPermaLink="false">http://eric.ness.net/?p=338</guid>
		<description><![CDATA[Plotting Documents &#038; Words from LSI results]]></description>
			<content:encoded><![CDATA[<p>In the <a href="http://eric.ness.net/archives/latent-semantic-indexing/">last blog post</a> we looked over a couple of great papers on using <a href="http://eric.ness.net/archives/singular-value-decomposition/">Singular Value Decomposition</a> (SVD) to do <a href="http://eric.ness.net/archives/latent-semantic-indexing/">Latent Semantic Indexing</a> (LSI), with the <a href="http://smartmathlibrary.codeplex.com/">SmartMathLibrary</a> doing the math. Now that we have the results, we should plot them to get a sense of where these words and documents lie on a two-dimensional Cartesian plane.</p>
<p style="text-align: left;">Jennifer Flynn&#8217;s presentation &#8220;<a href="http://www.soe.ucsc.edu/classes/cmps290c/Spring07/proj/Flynn_talk.pdf">Latent Semantic Indexing Using SVD and Riemannian SVD</a>&#8221; goes on to tell us how to do this. The process is essentially the same as before, except that k must equal 2. We happened to end up with k = 2 in our previous example, but in larger examples k will more than likely be a different number. Regardless, here we want to end up with matrices that have two columns, giving us our (x,y) co-ordinates. If you wanted to plot these items in a three-dimensional space you would use k = 3, and if you find an awesome way to plot where k = 5, e-mail me. After we have performed SVD, the formulas we use are as follows.</p>
<blockquote>
<p style="text-align: left;"><strong>Documents = V*&#931;</strong></p>
<p style="text-align: left;"><strong>Words = U*&#931;</strong></p>
</blockquote>
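<p style="text-align: left;">Because the second factor in each formula is a diagonal matrix of singular values, the multiplication simply scales column j of the reduced matrix by the j-th singular value. Here is a minimal sketch of that step on plain arrays rather than SmartMathLibrary matrices; the class name, the method name and the numbers in the note after it are made up for illustration.</p>
<pre class="brush: jscript;">
using System;

public static class PlotCoordinates
{
    // Multiplies a (rows x k) matrix by the diagonal matrix of the first
    // k singular values, which scales column j by singularValues[j].
    public static double[,] ScaleColumns(double[,] vectors, double[] singularValues)
    {
        int rows = vectors.GetLength(0);
        int k = vectors.GetLength(1);
        if (singularValues.Length &lt; k)
        {
            throw new ArgumentException(&quot;Need one singular value per column.&quot;);
        }

        var scaled = new double[rows, k];
        for (int i = 0; i &lt; rows; i++)
        {
            for (int j = 0; j &lt; k; j++)
            {
                scaled[i, j] = vectors[i, j] * singularValues[j];
            }
        }

        return scaled;
    }
}
</pre>
<p style="text-align: left;">Called with the rows (0.5, -0.5) and (0.25, 1) and the singular values 4 and 2, the method returns the rows (2, -1) and (1, 2); with k = 2 each row is one (x,y) point.</p>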
<p style="text-align: left;">The resulting matrices give us the (x,y) co-ordinates that we can then plot. I have been using the Dundas charting library for over two years now, but the library is expensive. Since Microsoft acquired them and offers the library as a free download, you should go and get the free version <a href="http://www.microsoft.com/downloads/details.aspx?FamilyID=130f7986-bf49-4fe5-9ca8-910ae6ea442c&amp;DisplayLang=en">here</a>. And again, for simplicity&#8217;s sake, this project is just a simple ASP.NET application.</p>
<p style="text-align: left;">The LSI Class:</p>
<p style="text-align: left;">Please note that this is almost exactly the same as in the previous blog post. The difference is that at the end of the GetDocumentWordPlots function we use the formulas mentioned above to load the co-ordinates of the words and documents into double arrays that we will ultimately pass to the chart.</p>
<pre class="brush: jscript;">
using System;
using SmartMathLibrary;

namespace LSITest
{
    public class lsi
    {
        // plot co-ordinates, their counts, and the formatted HTML results (ToPrint)
        public int MyDocColumnCount;
        public int MyDocRowCount;
        public double[,] MyDocs;
        public double[,] MyWords;
        public int MyWordsColumnCount;
        public int MyWordsRowCount;
        public string ToPrint;

        /// &lt;summary&gt;
        /// Runs the LSI test.
        /// &lt;/summary&gt;
        public void LSITest()
        {
            //Create Matrix
            var testArray = new double[,]
                                {
                                    {1, 0, 0, 1, 0, 0, 0, 0, 0},
                                    {1, 0, 1, 0, 0, 0, 0, 0, 0},
                                    {1, 1, 0, 0, 0, 0, 0, 0, 0},
                                    {0, 1, 1, 0, 1, 0, 0, 0, 0},
                                    {0, 1, 1, 2, 0, 0, 0, 0, 0},
                                    {0, 1, 0, 0, 1, 0, 0, 0, 0},
                                    {0, 1, 0, 0, 1, 0, 0, 0, 0},
                                    {0, 0, 1, 1, 0, 0, 0, 0, 0},
                                    {0, 1, 0, 0, 0, 0, 0, 0, 1},
                                    {0, 0, 0, 0, 0, 1, 1, 1, 0},
                                    {0, 0, 0, 0, 0, 0, 1, 1, 1},
                                    {0, 0, 0, 0, 0, 0, 0, 1, 1}
                                };

            // Load array in to Matrix
            var a = new Matrix(testArray);

            // print original matrix
            PrintMatrix(a);

            // perform Latent Semantic Indexing
            GetDocumentWordPlots(a);
        }

        /// &lt;summary&gt;
        /// Prints the matrix.
        /// &lt;/summary&gt;
        /// &lt;param name=&quot;myMatrix&quot;&gt;My matrix.&lt;/param&gt;
        private void PrintMatrix(IMatrix myMatrix)
        {
            ToPrint += &quot;&lt;br /&gt;&lt;br /&gt;&quot;;

            for (int i = 0; i &lt; myMatrix.Rows; i++)
            {
                for (int j = 0; j &lt; myMatrix.Columns; j++)
                {
                    ToPrint += String.Format(&quot;{0:0.##}&quot;, myMatrix.MatrixData[i, j]) + &quot;\t&quot;;
                }
                ToPrint += &quot;&lt;br /&gt;&quot;;
            }
        }

        /// &lt;summary&gt;
        /// Gets the document word plots.
        /// &lt;/summary&gt;
        /// &lt;param name=&quot;myMatrix&quot;&gt;My matrix.&lt;/param&gt;
        private void GetDocumentWordPlots(Matrix myMatrix)
        {
            // Run singular value decomposition
            var svd = new SingularValueDecomposition(myMatrix);
            svd.ExecuteDecomposition();

            // Put components into individual matrices
            Matrix wordVector = svd.U.Copy();
            Matrix sigma = svd.S.ToMatrix();
            Matrix documentVector = svd.V.Copy();

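            // k = 2 gives every word and document an (x, y) pair
            var k = 2;

            // reduce the vectors to their first k columns; k is used directly
            // here because k - 1 would leave a single column, and the page
            // reads both column 0 and column 1
            Matrix reducedWordVector = CopyMatrix(wordVector, wordVector.Rows, k);
            Matrix reducedSigma = CreateSigmaMatrix(sigma, k, k);
            Matrix reducedDocumentVector = CopyMatrix(documentVector, documentVector.Rows, k);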

            // Recalculate the matrix
            Matrix docs = reducedDocumentVector*reducedSigma;
            Matrix words = reducedWordVector*reducedSigma;

            // Fill doc plot locations
            MyDocs = new double[docs.Rows,docs.Columns];
            for (int i = 0; i &lt; docs.Rows; i++)
            {
                for (int j = 0; j &lt; docs.Columns; j++)
                {
                    MyDocs[i, j] = docs.MatrixData[i, j];
                }
            }

            // Fill word plot locations
            MyWords = new double[words.Rows,words.Columns];
            for (int i = 0; i &lt; words.Rows; i++)
            {
                for (int j = 0; j &lt; words.Columns; j++)
                {
                    MyWords[i, j] = words.MatrixData[i, j];
                }
            }

            // Set counts for charts
            MyDocRowCount = docs.Rows;
            MyWordsRowCount = words.Rows;

            PrintMatrix(docs);
            PrintMatrix(words);
        }

        /// &lt;summary&gt;
        /// Creates the sigma matrix.
        /// &lt;/summary&gt;
        /// &lt;param name=&quot;matrix&quot;&gt;The matrix.&lt;/param&gt;
        /// &lt;param name=&quot;rowEnd&quot;&gt;The row end.&lt;/param&gt;
        /// &lt;param name=&quot;columnEnd&quot;&gt;The column end.&lt;/param&gt;
        /// &lt;returns&gt;&lt;/returns&gt;
        private static Matrix CreateSigmaMatrix(IMatrix matrix, int rowEnd, int columnEnd)
        {
            var copyMatrix = new Matrix(rowEnd, columnEnd);

            for (int i = 0; i &lt; columnEnd; i++)
            {
                copyMatrix.MatrixData[i, i] = matrix.MatrixData[i, 0];
            }

            return copyMatrix;
        }

        /// &lt;summary&gt;
        /// Copies the matrix.
        /// &lt;/summary&gt;
        /// &lt;param name=&quot;myMatrix&quot;&gt;My matrix.&lt;/param&gt;
        /// &lt;param name=&quot;rowEnd&quot;&gt;The row end.&lt;/param&gt;
        /// &lt;param name=&quot;columnEnd&quot;&gt;The column end.&lt;/param&gt;
        /// &lt;returns&gt;&lt;/returns&gt;
        private static Matrix CopyMatrix(IMatrix myMatrix, int rowEnd, int columnEnd)
        {
            var copyMatrix = new Matrix(rowEnd, columnEnd);

            for (int i = 0; i &lt; rowEnd; i++)
            {
                for (int j = 0; j &lt; columnEnd; j++)
                {
                    copyMatrix.MatrixData[i, j] = myMatrix.MatrixData[i, j];
                }
            }

            return copyMatrix;
        }
    }
}
</pre>
<p>Here is the code-behind for the web page that displays the chart and the matrices. Essentially, we iterate through the double arrays pulled from the LSI class and load them into the chart series.</p>
<pre class="brush: jscript;">
using System;
using System.Drawing;
using System.Web.UI;
using Dundas.Charting.WebControl;

namespace LSITest
{
    public partial class _Default : Page
    {
        /// &lt;summary&gt;
        /// Handles the Load event of the Page control.
        /// &lt;/summary&gt;
        /// &lt;param name=&quot;sender&quot;&gt;The source of the event.&lt;/param&gt;
        /// &lt;param name=&quot;e&quot;&gt;The &lt;see cref=&quot;System.EventArgs&quot;/&gt; instance containing the event data.&lt;/param&gt;
        protected void Page_Load(object sender, EventArgs e)
        {
            var mylsi = new lsi();
            mylsi.LSITest();
            Label1.Text = mylsi.ToPrint;
            double[,] myDocs = mylsi.MyDocs;
            double[,] myWords = mylsi.MyWords;

            // Load documents
            for (int i = 0; i &lt; mylsi.MyDocRowCount; i++)
            {
                Chart1.Series[&quot;Series1&quot;].Points.AddXY(myDocs[i, 0], myDocs[i, 1]);
            }

            // Load words
            for (int i = 0; i &lt; mylsi.MyWordsRowCount; i++)
            {
                Chart1.Series[&quot;Series2&quot;].Points.AddXY(myWords[i, 0], myWords[i, 1]);
            }

            // Set title
            Chart1.Series[&quot;Series1&quot;].LegendText = &quot;Documents&quot;;
            Chart1.Series[&quot;Series2&quot;].LegendText = &quot;Words&quot;;

            // Set point colors and shapes
            Chart1.Series[&quot;Series1&quot;].Color = Color.Red;
            Chart1.Series[&quot;Series1&quot;].MarkerStyle = MarkerStyle.Diamond;
            Chart1.Series[&quot;Series1&quot;].MarkerSize = 12;
            Chart1.Series[&quot;Series2&quot;].Color = Color.Gray;
            Chart1.Series[&quot;Series2&quot;].MarkerStyle = MarkerStyle.Circle;
            Chart1.Series[&quot;Series2&quot;].MarkerSize = 6;
        }
    }
}
</pre>
<p>And finally here is the ASP.NET web page.</p>
<pre class="brush: jscript;">
&lt;%@ Page Language=&quot;C#&quot; AutoEventWireup=&quot;true&quot; CodeBehind=&quot;Default.aspx.cs&quot; Inherits=&quot;LSITest._Default&quot; %&gt;
&lt;%@ Register Assembly=&quot;DundasWebChart&quot; Namespace=&quot;Dundas.Charting.WebControl&quot; TagPrefix=&quot;DCWC&quot; %&gt;
&lt;!DOCTYPE html PUBLIC &quot;-//W3C//DTD XHTML 1.0 Transitional//EN&quot; &quot;http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd&quot;&gt;

&lt;html xmlns=&quot;http://www.w3.org/1999/xhtml&quot; &gt;
&lt;head runat=&quot;server&quot;&gt;
    &lt;title&gt;LSI Test&lt;/title&gt;
&lt;/head&gt;
&lt;body&gt;
    &lt;form id=&quot;form1&quot; runat=&quot;server&quot;&gt;
    &lt;div&gt;
        &lt;DCWC:Chart ID=&quot;Chart1&quot; runat=&quot;server&quot; Height=&quot;400px&quot; Width=&quot;400px&quot;
            ImageType=&quot;Jpeg&quot;&gt;
            &lt;Legends&gt;
                &lt;DCWC:Legend Name=&quot;Default&quot; Alignment=&quot;Center&quot; Docking=&quot;Bottom&quot;&gt;&lt;/DCWC:Legend&gt;
            &lt;/Legends&gt;
            &lt;Titles&gt;
                &lt;DCWC:Title Name=&quot;Title1&quot;&gt;
                &lt;/DCWC:Title&gt;
            &lt;/Titles&gt;
            &lt;Series&gt;
                &lt;DCWC:Series Name=&quot;Series1&quot; ChartType=&quot;Point&quot; MarkerBorderColor=&quot;64, 64, 64&quot;
                    ShadowOffset=&quot;1&quot;&gt;
                &lt;/DCWC:Series&gt;
                &lt;DCWC:Series Name=&quot;Series2&quot; ChartType=&quot;Point&quot; MarkerBorderColor=&quot;64, 64, 64&quot;
                    ShadowOffset=&quot;1&quot;&gt;
                &lt;/DCWC:Series&gt;
            &lt;/Series&gt;
            &lt;ChartAreas&gt;
                &lt;DCWC:ChartArea Name=&quot;Series2&quot;&gt;
                    &lt;axisy interval=&quot;0.5&quot; maximum=&quot;2&quot; minimum=&quot;-1&quot;&gt;
                        &lt;majorgrid linecolor=&quot;Gray&quot; linestyle=&quot;Dash&quot; /&gt;
                    &lt;/axisy&gt;
                    &lt;axisx interval=&quot;0.5&quot; maximum=&quot;2.5&quot; minimum=&quot;-0.5&quot;&gt;
                        &lt;majorgrid linecolor=&quot;Gray&quot; linestyle=&quot;Dash&quot; /&gt;
                    &lt;/axisx&gt;
                &lt;/DCWC:ChartArea&gt;
            &lt;/ChartAreas&gt;
        &lt;/DCWC:Chart&gt;
        &lt;br /&gt;
        &lt;asp:Label ID=&quot;Label1&quot; runat=&quot;server&quot; Text=&quot;&quot;&gt;&lt;/asp:Label&gt;
    &lt;/div&gt;
    &lt;/form&gt;
&lt;/body&gt;
&lt;/html&gt;
</pre>
<p>So let&#8217;s see the result!</p>
<p><a href="http://eric.ness.net/wp-content/uploads/2009/11/documents.jpg"><img class="alignnone size-full wp-image-341" style="margin-left: 100px; margin-right: 100px;" title="documents" src="http://eric.ness.net/wp-content/uploads/2009/11/documents.jpg" alt="documents" width="400" height="400" /></a></p>
<p>Now obviously you could/should probably write this a different way, but it gets you to where you need to be.</p>
<p>I would also recommend reading the part of Flynn&#8217;s presentation on how to compare two words/documents by using the dot product of two row vectors. One could also use the <a href="http://eric.ness.net/archives/euclidean-distance-score/">Euclidean Distance Score</a>. And if you are interested, I would recommend Sujit Pal&#8217;s blog post &#8220;<a href="http://sujitpal.blogspot.com/2008/10/ir-math-in-java-cluster-visualization.html">IR Math in Java : Cluster Visualization</a>&#8221; for additional reading.</p>
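<p>As a rough sketch of both comparisons, the helper below works on a plain double array with one row per word or document, such as the MyDocs or MyWords arrays from the class above. The class and method names are hypothetical, and the second method returns the plain straight-line distance rather than any particular score built on top of it.</p>
<pre class="brush: jscript;">
using System;

public static class RowComparison
{
    // Dot product of row a and row b of the same co-ordinate matrix.
    public static double DotProduct(double[,] m, int a, int b)
    {
        double sum = 0;
        for (int j = 0; j &lt; m.GetLength(1); j++)
        {
            sum += m[a, j] * m[b, j];
        }

        return sum;
    }

    // Straight-line (Euclidean) distance between row a and row b.
    public static double EuclideanDistance(double[,] m, int a, int b)
    {
        double sum = 0;
        for (int j = 0; j &lt; m.GetLength(1); j++)
        {
            double diff = m[a, j] - m[b, j];
            sum += diff * diff;
        }

        return Math.Sqrt(sum);
    }
}
</pre>
<p>For the made-up rows (1, 2), (3, 4) and (0, 0), the dot product of the first two rows is 11 and the distance between the last two is 5. A larger dot product or a smaller distance suggests the two items are more closely related.</p>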
]]></content:encoded>
			<wfw:commentRss>http://eric.ness.net/archives/plotting-documents-words-using-latent-semantic-indexing/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Latent Semantic Indexing</title>
		<link>http://eric.ness.net/archives/latent-semantic-indexing/</link>
		<comments>http://eric.ness.net/archives/latent-semantic-indexing/#comments</comments>
		<pubDate>Sun, 01 Nov 2009 15:48:26 +0000</pubDate>
		<dc:creator>Eric</dc:creator>
				<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Programming]]></category>

		<guid isPermaLink="false">http://eric.ness.net/?p=309</guid>
		<description><![CDATA[Latent Semantic Indexing in C#]]></description>
			<content:encoded><![CDATA[<p>Latent Semantic Indexing (LSI) is commonly described as an &#8220;indexing and retrieval method that uses a mathematical technique called <a href="http://eric.ness.net/archives/singular-value-decomposition/">Singular Value Decomposition</a> (SVD) to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text.&#8221; To be a bit more clear, Sujit Pal has one of the best descriptions of what LSI is and how it works:</p>
<blockquote><p>Latent Semantic Indexing attempts to uncover latent relationships among documents based on word co-occurence. So if document A contains (w1,w2) and document B contains (w2,w3), we can conclude that there is something common between documents A and B. LSI does this by decomposing the input raw term frequency matrix (A, see below) into three different matrices (U, S and V) using Singular Value Decomposition (SVD). Once that is done, the three vectors are &#8220;reduced&#8221; and the original vector rebuilt from the reduced vectors. Because of the reduction, noisy relationships are suppressed and relations become very clearly visible.</p></blockquote>
<p><strong>So how is this done?</strong></p>
<p>To start with, let&#8217;s use the example in &#8220;<a href="http://lsa.colorado.edu/papers/JASIS.lsi.90.pdf">Indexing by Latent Semantic Analysis</a>&#8221; (Deerwester et al.), because you see this example repeated in a number of places on the web. The paper looks at 9 titles of papers that fall into two categories: &#8220;human computer interaction&#8221; and &#8220;graphs &amp; trees&#8221;.</p>
<p><a href="http://eric.ness.net/wp-content/uploads/2009/11/listofwords.jpg"><img class="alignnone size-full wp-image-310" style="margin-left: 100px; margin-right: 100px;" title="listofwords" src="http://eric.ness.net/wp-content/uploads/2009/11/listofwords.jpg" alt="listofwords" width="400" height="505" /></a></p>
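<p>As a rough sketch of how such titles can be turned into numbers, the helper below counts how often each word occurs in each title, with one row per word and one column per title. The class name, the simple split-on-spaces tokenizer and the sample strings in the note after it are assumptions of mine for illustration, not the method used in the Deerwester paper.</p>
<pre class="brush: jscript;">
using System;
using System.Collections.Generic;

public static class TermDocumentMatrix
{
    // Rows are terms, columns are documents; each cell counts how often
    // the term occurs in the document (case-insensitive, split on spaces).
    public static double[,] Build(string[] terms, string[] documents)
    {
        var counts = new double[terms.Length, documents.Length];

        // Map each term to its row index
        var rowOf = new Dictionary&lt;string, int&gt;(StringComparer.OrdinalIgnoreCase);
        for (int i = 0; i &lt; terms.Length; i++)
        {
            rowOf[terms[i]] = i;
        }

        for (int d = 0; d &lt; documents.Length; d++)
        {
            string[] words = documents[d].Split(new[] {' '}, StringSplitOptions.RemoveEmptyEntries);
            foreach (string word in words)
            {
                int row;
                if (rowOf.TryGetValue(word, out row))
                {
                    counts[row, d] += 1;
                }
            }
        }

        return counts;
    }
}
</pre>
<p>With the terms &#8220;human&#8221; and &#8220;graph&#8221; and the made-up titles &#8220;human machine interface&#8221;, &#8220;graph minors survey&#8221; and &#8220;human graph of human paths&#8221;, the rows come out as (1, 0, 2) and (0, 1, 1).</p>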
<p>In this example the matrix is made up of the word counts in the different documents. The next step is to take this matrix and break it down into its different parts using SVD. The result looks like this:</p>
<p><a href="http://eric.ness.net/wp-content/uploads/2009/11/svd.jpg"><img class="alignnone size-full wp-image-320" title="svd" src="http://eric.ness.net/wp-content/uploads/2009/11/svd.jpg" alt="svd" width="600" /></a></p>
<p>After you have performed SVD on the original matrix, you then reduce the individual vectors. Reducing the vectors gets rid of some of the &#8220;noise&#8221; &#8211; exposing the relationship between words and documents.</p>
<p>One question that arises is how much you want to reduce the vectors (the number of dimensions kept is often called k). There seems to be no hard and fast rule for this, as different papers have different approaches and results with different values. In Sujit Pal&#8217;s <a href="http://sujitpal.blogspot.com/2008/09/ir-math-with-java-tf-idf-and-lsi.html">post</a> he uses the square root of the number of columns of the original matrix (m), rounded down, minus 1, which I think is a good method to use. For the 9 columns here that works out to 3 - 1 = 2, which also happens to be the value used in the Deerwester paper (k = 2). The following picture shows what this looks like (please see Jennifer Flynn&#8217;s presentation <a href="http://www.soe.ucsc.edu/classes/cmps290c/Spring07/proj/Flynn_talk.pdf">Latent Semantic Indexing Using SVD and Riemannian SVD</a> for a more elaborate example):</p>
<p><a href="http://eric.ness.net/wp-content/uploads/2009/11/reduce.jpg"><img class="alignnone size-full wp-image-323" title="reduce" src="http://eric.ness.net/wp-content/uploads/2009/11/reduce.jpg" alt="reduce" width="600" /></a></p>
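<p>The square-root rule described above can be sketched as a small helper. The class name, the method name and the clamp to a minimum of 1 (so that very small matrices still keep one dimension) are my own assumptions for illustration.</p>
<pre class="brush: jscript;">
using System;

public static class RankSelection
{
    // floor(sqrt(number of columns)) - 1, but never less than 1.
    public static int ChooseK(int columnCount)
    {
        if (columnCount &lt; 1)
        {
            throw new ArgumentOutOfRangeException(&quot;columnCount&quot;);
        }

        int k = (int) Math.Floor(Math.Sqrt(columnCount)) - 1;
        return Math.Max(1, k);
    }
}
</pre>
<p>For the 9 columns in this example ChooseK returns 2, and for a matrix with 100 columns it returns 9.</p>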
<p>After the vectors have been reduced, all that is left to do is multiply them back together again, and that is it. See the result:</p>
<p><a href="http://eric.ness.net/wp-content/uploads/2009/11/lsi1.jpg"><img class="alignnone size-full wp-image-326" title="lsi" src="http://eric.ness.net/wp-content/uploads/2009/11/lsi1.jpg" alt="lsi" width="600" /></a></p>
<p>[<strong>Update</strong>: as one of the readers (Jorge) noted, v is not exactly correct; it should be V.Transpose. Please check out the “Indexing by Latent Semantic Analysis” (Deerwester et al.) paper, starting on page 26, for the correct values of the matrices: <a rel="nofollow" href="http://lsa.colorado.edu/papers/JASIS.lsi.90.pdf">http://lsa.colorado.edu/papers/JASIS.lsi.90.pdf</a>. I will try to update this here shortly.]</p>
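<p>To make the role of the transpose concrete, here is a rough sketch of the multiplication on plain arrays rather than SmartMathLibrary matrices (the class and method names are hypothetical). With u holding one row per word, s holding the k singular values that were kept and v holding one row per document, the product U*S*V.Transpose has one row per word and one column per document, the same shape as the original matrix.</p>
<pre class="brush: jscript;">
public static class Reconstruction
{
    // Computes U_k * S_k * V_k^T, where u is (words x k), s holds the
    // k singular values and v is (documents x k).
    public static double[,] Rebuild(double[,] u, double[] s, double[,] v)
    {
        int words = u.GetLength(0);
        int docs = v.GetLength(0);
        int k = s.Length;

        var a = new double[words, docs];
        for (int i = 0; i &lt; words; i++)
        {
            for (int j = 0; j &lt; docs; j++)
            {
                double sum = 0;
                for (int r = 0; r &lt; k; r++)
                {
                    // v[j, r] is element (r, j) of the transposed V
                    sum += u[i, r] * s[r] * v[j, r];
                }

                a[i, j] = sum;
            }
        }

        return a;
    }
}
</pre>
<p>With the made-up values u = (1), (2), s = 3 and v = (1), (0), (2), the result has two rows and three columns: (3, 0, 6) and (6, 0, 12).</p>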
<p>So let&#8217;s take a look at the code &#8211; it follows the example outlined in the Deerwester paper. Please keep in mind that this is just a little class I put together in an ASP.NET test app that shows an HTML-formatted matrix of the original (m) and the LSI result:</p>
<pre class="brush: jscript;">
using System;
using SmartMathLibrary;

namespace LSITest
{
    public class lsi
    {
        // this returns the formatted HTML results
        public string ToPrint;

        /// &lt;summary&gt;
        /// Runs the LSI test.
        /// &lt;/summary&gt;
        public void LSITest()
        {
            //Create Matrix
            var testArray = new double[,]
                                {
                                    {1, 0, 0, 1, 0, 0, 0, 0, 0},
                                    {1, 0, 1, 0, 0, 0, 0, 0, 0},
                                    {1, 1, 0, 0, 0, 0, 0, 0, 0},
                                    {0, 1, 1, 0, 1, 0, 0, 0, 0},
                                    {0, 1, 1, 2, 0, 0, 0, 0, 0},
                                    {0, 1, 0, 0, 1, 0, 0, 0, 0},
                                    {0, 1, 0, 0, 1, 0, 0, 0, 0},
                                    {0, 0, 1, 1, 0, 0, 0, 0, 0},
                                    {0, 1, 0, 0, 0, 0, 0, 0, 1},
                                    {0, 0, 0, 0, 0, 1, 1, 1, 0},
                                    {0, 0, 0, 0, 0, 0, 1, 1, 1},
                                    {0, 0, 0, 0, 0, 0, 0, 1, 1}
                                };

            // Load array in to Matrix
            var a = new Matrix(testArray);

            // print original matrix
            PrintMatrix(a);

            // perform Latent Semantic Indexing
            Transform(a);
        }

        /// &lt;summary&gt;
        /// Prints the matrix.
        /// &lt;/summary&gt;
        /// &lt;param name=&quot;myMatrix&quot;&gt;My matrix.&lt;/param&gt;
        private void PrintMatrix(IMatrix myMatrix)
        {
            ToPrint += &quot;&lt;br /&gt;&lt;br /&gt;&quot;;

            for (int i = 0; i &lt; myMatrix.Rows; i++)
            {
                for (int j = 0; j &lt; myMatrix.Columns; j++)
                {
                    ToPrint += String.Format(&quot;{0:0.##}&quot;, myMatrix.MatrixData[i, j]) + &quot;\t&quot;;
                }
                ToPrint += &quot;&lt;br /&gt;&quot;;
            }
        }

        /// &lt;summary&gt;
        /// Applies LSI to the specified matrix and prints the result.
        /// &lt;/summary&gt;
        /// &lt;param name=&quot;myMatrix&quot;&gt;My matrix.&lt;/param&gt;
        private void Transform(Matrix myMatrix)
        {
            // Run singular value decomposition
            var svd = new SingularValueDecomposition(myMatrix);
            svd.ExecuteDecomposition();

            // Put components into individual matrices
            Matrix wordVector = svd.U.Copy();
            Matrix sigma = svd.S.ToMatrix();
            Matrix documentVector = svd.V.Copy();

            // get value of k
            // you can also manually set the value of k
            var k = (int) Math.Floor(Math.Sqrt(myMatrix.Columns));

            // reduce the vectors
            Matrix reducedWordVector = CopyMatrix(wordVector, wordVector.Rows, k - 1);
            Matrix reducedSigma = CreateSigmaMatrix(sigma, k - 1, k - 1);
            Matrix reducedDocumentVector = CopyMatrix(documentVector, documentVector.Rows, k - 1);

            // re-compute matrix
            Matrix a = reducedWordVector*reducedSigma*reducedDocumentVector.Transpose();

            // print result
            PrintMatrix(a);
        }

        /// &lt;summary&gt;
        /// Creates the sigma matrix.
        /// &lt;/summary&gt;
        /// &lt;param name=&quot;matrix&quot;&gt;The matrix.&lt;/param&gt;
        /// &lt;param name=&quot;rowEnd&quot;&gt;The row end.&lt;/param&gt;
        /// &lt;param name=&quot;columnEnd&quot;&gt;The column end.&lt;/param&gt;
        /// &lt;returns&gt;&lt;/returns&gt;
        private static Matrix CreateSigmaMatrix(IMatrix matrix, int rowEnd, int columnEnd)
        {
            var copyMatrix = new Matrix(rowEnd, columnEnd);

            for (int i = 0; i &lt; columnEnd; i++)
            {
                copyMatrix.MatrixData[i, i] = matrix.MatrixData[i, 0];
            }

            return copyMatrix;
        }

        /// &lt;summary&gt;
        /// Copies the matrix.
        /// &lt;/summary&gt;
        /// &lt;param name=&quot;myMatrix&quot;&gt;My matrix.&lt;/param&gt;
        /// &lt;param name=&quot;rowEnd&quot;&gt;The row end.&lt;/param&gt;
        /// &lt;param name=&quot;columnEnd&quot;&gt;The column end.&lt;/param&gt;
        /// &lt;returns&gt;&lt;/returns&gt;
        private static Matrix CopyMatrix(IMatrix myMatrix, int rowEnd, int columnEnd)
        {
            var copyMatrix = new Matrix(rowEnd, columnEnd);

            for (int i = 0; i &lt; rowEnd; i++)
            {
                for (int j = 0; j &lt; columnEnd; j++)
                {
                    copyMatrix.MatrixData[i, j] = myMatrix.MatrixData[i, j];
                }
            }

            return copyMatrix;
        }
    }
}
</pre>
<p>Many thanks to Sujit Pal, Jennifer Flynn and Deerwester et al. for their excellent explanations.</p>
<p><strong>Recommended Reading</strong></p>
<p><a href="http://sujitpal.blogspot.com/2008/09/ir-math-with-java-tf-idf-and-lsi.html">IR Math with Java : TF, IDF and LSI</a> &#8211; Sujit Pal</p>
<p><a href="http://lsa.colorado.edu/papers/JASIS.lsi.90.pdf">Indexing by Latent Semantic Analysis</a> &#8211; Deerwester et al</p>
<p><a href="http://www.soe.ucsc.edu/classes/cmps290c/Spring07/proj/Flynn_talk.pdf">Latent Semantic Indexing Using SVD and Riemannian SVD</a> &#8211; Jennifer Flynn</p>
]]></content:encoded>
			<wfw:commentRss>http://eric.ness.net/archives/latent-semantic-indexing/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Waves: The Sculptures of Reuben Margolin</title>
		<link>http://eric.ness.net/archives/waves-the-sulptures-of-reuben-margolin/</link>
		<comments>http://eric.ness.net/archives/waves-the-sulptures-of-reuben-margolin/#comments</comments>
		<pubDate>Sun, 01 Nov 2009 00:03:39 +0000</pubDate>
		<dc:creator>Eric</dc:creator>
				<category><![CDATA[Art]]></category>
		<category><![CDATA[Video]]></category>

		<guid isPermaLink="false">http://eric.ness.net/?p=304</guid>
		<description><![CDATA[The Sculptures of Reuben Margolin]]></description>
			<content:encoded><![CDATA[<p>I came across the artist Reuben Margolin today after reading this article at <a href="http://www.good.is/post/poptech-%E2%80%9909-kinetic-sculptures/">GOOD</a>. His sculptures are truly amazing.</p>
<p><object classid="clsid:d27cdb6e-ae6d-11cf-96b8-444553540000" width="425" height="344" codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=6,0,40,0"><param name="allowFullScreen" value="true" /><param name="allowScriptAccess" value="always" /><param name="src" value="http://www.youtube.com/v/U0D3QSJJsCo&amp;rel=0&amp;color1=0xb1b1b1&amp;color2=0xcfcfcf&amp;feature=player_embedded&amp;fs=1" /><param name="allowfullscreen" value="true" /><embed type="application/x-shockwave-flash" width="425" height="344" src="http://www.youtube.com/v/U0D3QSJJsCo&amp;rel=0&amp;color1=0xb1b1b1&amp;color2=0xcfcfcf&amp;feature=player_embedded&amp;fs=1" allowscriptaccess="always" allowfullscreen="true"></embed></object></p>
<p><object classid="clsid:d27cdb6e-ae6d-11cf-96b8-444553540000" width="400" height="225" codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=6,0,40,0"><param name="allowfullscreen" value="true" /><param name="allowscriptaccess" value="always" /><param name="src" value="http://vimeo.com/moogaloop.swf?clip_id=3001833&amp;server=vimeo.com&amp;show_title=1&amp;show_byline=1&amp;show_portrait=0&amp;color=&amp;fullscreen=1" /><embed type="application/x-shockwave-flash" width="400" height="225" src="http://vimeo.com/moogaloop.swf?clip_id=3001833&amp;server=vimeo.com&amp;show_title=1&amp;show_byline=1&amp;show_portrait=0&amp;color=&amp;fullscreen=1" allowscriptaccess="always" allowfullscreen="true"></embed></object></p>
<p>Check out his site <a href="http://www.reubenmargolin.com">Reuben Margolin</a></p>
]]></content:encoded>
			<wfw:commentRss>http://eric.ness.net/archives/waves-the-sulptures-of-reuben-margolin/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

