## Cryptanalysis Using n-Gram Probabilities

May 1st, 2010  |  Published in Machine Learning, Programming

One of my favorite programmers is Peter Norvig who is currently Director of Research at Google. This summer I picked up a book called Beautiful Data in which Norvig contributed a chapter called “Natural Language Corpus Data” in which he outlined a number of very cool things you can do with n-grams in the google corpus. It covers some of the things you’d imagine that it would cover: spelling correction, word segmentation, etc. The one item covered that I had never considered was in the area of cryptanalysis.

The cool thing is that Google will give you their corpus to download. The only problem is that it contains “1,024,908,267,229 words of running text” and is 24 GB compressed in size. This is a bit impractical to run on your dev box. Enter Microsoft – the Microsoft Web N-gram Service just went Beta and is now available to Professors and Students so I immediately signed up and I have to say that it pretty cool!

So I wanted to try out the new service using one of Norvig’s examples in his book – specifically using n-gram probabilities and character shifting. This is a very simple example and fairly basic type of encryption where the if the user types an ‘a’ it gets shifted to ‘n’ or whatever. So you simply run through all 26 possibilities and use the individual words combined probabilities to determine the answer to the encoded message.

This project has a Service Refrence connected to Microsoft’s n-Gram Service. The service requires an n-gram model and a user id which you get when you sign up (see their quickstart tutorial). So let’s take a look at some code:

```using System;
using System.Collections.Generic;
using System.Configuration;
using System.Linq;
using MicrosoftNGramTest.NGramService;

namespace MicrosoftNGramTest.classes
{
internal class Shift
{
#region Variables

private readonly string _alphabet = "abcdefghijklmnopqrstuvwxyz";
private readonly string _ngramModel = ConfigurationManager.AppSettings.Get("ngramModel");
private readonly string _userToken = ConfigurationManager.AppSettings.Get("userToken");

#endregion

#region Run The Test

/// <summary>
/// Runs the test
/// </summary>
public void Test()
{
// Print title
Console.WriteLine("Character Shift Cryptanalysis");
Console.WriteLine("#############################");

// Local Variables
const string phrase = "Yvfgra, qb lbh jnag gb xabj n frperg?";
string[] words = phrase.ToLower().Split(' ');
var newPhrase = new string[26];
var client = new LookupServiceClient();
var result = new Dictionary<string, int>();

try
{
// Loop the word variations
foreach (string s in words)
{
char[] currentWord = s.ToCharArray();

foreach (char c in currentWord)
{
for (int i = 0; i < 26; i++)
{
newPhrase[i] += CharShift(c, i);
}
}

for (int i = 0; i < newPhrase.Count(); i++)
{
newPhrase[i] += " ";
}
}

// Print phrases with probabilities
foreach (string s in newPhrase)
{
string[] newWords = s.Split(' ');
double prob = 0;
foreach (string word in newWords)
{
prob += client.GetProbability(_userToken, _ngramModel, word);
}
Console.WriteLine(s + " " + Convert.ToInt32(prob));
}

Console.WriteLine();
KeyValuePair<string, int> q = (from t in result
orderby t.Value descending
select t).FirstOrDefault();
Console.WriteLine(q.Key + " " + q.Value);
}
finally
{
client.Close();
}
}

#endregion

#region Shifting

/// <summary>
/// Gets the alphabet array.
/// </summary>
/// <returns></returns>
private char[] GetAlphabetArray()
{
return _alphabet.ToCharArray();
}

/// <summary>
/// Gets the current char array position.
/// </summary>
/// <param name="c">The c.</param>
/// <returns></returns>
private int GetCurrentCharArrayPosition(char c)
{
int position = 0;
int count = 0;

foreach (char letter in GetAlphabetArray())
{
if (letter == c)
{
position = count;
}
count++;
}
return position;
}

/// <summary>
/// Shifts the character.
/// </summary>
/// <param name="c">The c.</param>
/// <param name="increase">The increase.</param>
/// <returns></returns>
private char CharShift(char c, int increase)
{
const int numOfLetters = 26;
char[] alphabet = GetAlphabetArray();
int currentArrayPosition = GetCurrentCharArrayPosition(c);
char letter = c;

if (IsCharInArray(c))
{
if ((currentArrayPosition + increase) < numOfLetters)
{
letter = alphabet[currentArrayPosition + increase];
}
else
{
int newPosition = (currentArrayPosition + increase) - numOfLetters;
letter = alphabet[newPosition];
}
}
return letter;
}

/// <summary>
/// Determines whether the char is in the array.
/// </summary>
/// <param name="c">The c.</param>
/// <returns>
/// 	<c>true</c> if [is char in array] [the specified c]; otherwise, <c>false</c>.
/// </returns>
private bool IsCharInArray(char c)
{
bool isCharInArray = false;
IEnumerable<char> q = (from t in GetAlphabetArray()
where t == c
select t);
if (q.Count() > 0)
{
isCharInArray = true;
}
return isCharInArray;
}

#endregion
}
}
```

And here is the result!

### Related Posts

Apriori Algorithm
K-Means Document Clustering
Latent Semantic Indexing
Singular Value Decomposition
Pearson’s Correlation Coefficient