Given an encrypted message without any knowledge about the used cipher you nevertheless might be able to crack it (given that the cipher used has some vulnerabilities).
This post provides codesnippets for analysing encrypted messages for letter frequencies and coincidences.
Index of Coincidence (IC)What you can find out by the IC, is whether a message is encoded by a monoalphabetic (like Caesar cipher) or a polyalphabetic cipher (like Vigenère).
The index of coincidence is at 1.73 for plaintext in English. If you get an index close to 1 (which would be the value for a randomly created text) you have probably a polyalphabetic cipher and need to do further examinations like Friedman Test. If the index is higher, you have a monoalphabetic one and are able to crack it just by counting the letters (frequency analysis).
A sample output looks like this (the longer the message the better the result):
IC: 2.0548892842111965
kappa text: 0.07610701052634061
kappa random: 0.037037037037037035
If kappa text is close to kappa random, you probably have a polyalphabethic cipher, otherwise a monoalphabetic one
You see, IC is pretty high, so you have for sure a plaintext or a monoalphabetic cipher. It is the same with kappa text and kappa random, just that they are not normalized. The IC is the same as (kappa text * letters). kappa text for English and with 26 letters is about 0.067. Of course you get other values for kappa if you use more characters than just the letters of the alphabet whereas the IC will stay the same.
import java.util.Map;
public class IndexOfCoincidence {
private Language language;
private FrequencyAnalysis ana;
private final String newline = System.getProperty("line.separator");
public String analyse(String code) {
ana = new FrequencyAnalysis();
ana.analyse(code);
double kappaText = computeKappaText();
double kappaRand = computeKappaRandom();
double ic = kappaText * amountOfCharacters();
return "IC: " + ic + newline + newline
+ "kappa text: " + kappaText + newline
+ "kappa random: " + kappaRand + newline + newline
+ "If kappa text is close to kappa random, you probably have a "
+ "polyalphabethic cipher, otherwise a monoalphabetic one";
}
public double computeKappaRandom() {
return 1 / (double) ana.getFrequency().size();
}
public int amountOfCharacters(){
return ana.getFrequency().size();
}
public double computeKappaText() {
Map<Character, Integer> frequency = ana.getFrequency();
long n = ana.getSumCount();
int sumNi = 0;
for (Integer i : frequency.values()) {
sumNi += (i * (i - 1));
}
return (sumNi / (double) (n * (n - 1)));
}
}
Frequency AnalysisAs you can see above frequency analysis is used to determine the IC. But that is not the only purpose. Every language has typical letter frequencies, e.g. the letter 'e' is the most common letter in the english language, so you assume that the most counted letter in an encrypted message will be 'e'. But this works only with monoalphabethic ciphers. It wouldn't work with Vigenère, because this cipher uses more alphabeths than one, which means the letter 'e' is not always translated with the same character.
But given a shift cipher (Caesar) it is pretty easy to crack with frequency analysis. You just need to determine the most common character in there. If this is 'h' you know the cipher has shifted every letter of the alphabeth three times.
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
public class FrequencyAnalysis {
private Map<Character, Integer> frequency;
private Language language;
public String analyse(String code) {
frequency = new HashMap<Character, Integer>();
count(code);
return getAnalysisString();
}
public Map<Character, Integer> getFrequency(){
return new HashMap<Character, Integer>(frequency);
}
private String getAnalysisString() {
StringBuilder b = new StringBuilder();
long sumcount = getSumCount();
b.append("Char\tAbs\tRel\n");
for (Character c : getSortedKeyList()) {
b.append(c + "\t" + frequency.get(c) + "\t"
+ (frequency.get(c) / (double) sumcount) + "\n");
}
b.append("\nletters: " + sumcount);
b.append("\ncharacters: " + frequency.size());
return b.toString();
}
//insertion sort is used here
private List<Character> getSortedKeyList() {
List<Character> list = new LinkedList<Character>();
for(Entry<Character, Integer> entry : frequency.entrySet()){
for(int i = 1; i < list.size(); i++){
if(frequency.get(list.get(i)) > entry.getValue()){
list.add(i, entry.getKey());
break;
}
}
if(!list.contains(entry.getKey())){
list.add(entry.getKey());
}
}
return list;
}
public long getSumCount() {
long sum = 0;
for (Integer i : frequency.values()) {
sum += i;
}
return sum;
}
private void count(String code) {
char[] text = code.toCharArray();
for (char c : text) {
if (frequency.containsKey(c)) {
frequency.put(c, frequency.get(c) + 1);
} else {
frequency.put(c, 1);
}
}
}
}
A sample output for a german plaintext looks like this:
Char Abs Rel
2 0.0018450184501845018
2 1 9.225092250922509E-4
5 1 9.225092250922509E-4
; 1 9.225092250922509E-4
N 1 9.225092250922509E-4
H 1 9.225092250922509E-4
I 1 9.225092250922509E-4
K 1 9.225092250922509E-4
U 1 9.225092250922509E-4
— 1 9.225092250922509E-4
„ 1 9.225092250922509E-4
Z 1 9.225092250922509E-4
“ 1 9.225092250922509E-4
€ 1 9.225092250922509E-4
y 1 9.225092250922509E-4
x 1 9.225092250922509E-4
( 2 0.0018450184501845018
) 2 0.0018450184501845018
F 2 0.0018450184501845018
B 2 0.0018450184501845018
M 2 0.0018450184501845018
V 2 0.0018450184501845018
j 2 0.0018450184501845018
G 3 0.0027675276752767526
A 3 0.0027675276752767526
W 3 0.0027675276752767526
ß 3 0.0027675276752767526
P 3 0.0027675276752767526
ö 3 0.0027675276752767526
E 4 0.0036900369003690036
L 4 0.0036900369003690036
T 4 0.0036900369003690036
S 4 0.0036900369003690036
ü 4 0.0036900369003690036
p 5 0.004612546125461255
D 6 0.005535055350553505
k 6 0.005535055350553505
, 7 0.006457564575645757
ä 9 0.008302583025830259
. 10 0.00922509225092251
v 11 0.01014760147601476
b 13 0.011992619926199263
w 14 0.012915129151291513
f 15 0.013837638376383764
m 16 0.014760147601476014
o 17 0.015682656826568265
z 18 0.016605166051660517
g 23 0.021217712177121772
c 26 0.023985239852398525
l 28 0.025830258302583026
u 36 0.033210332103321034
s 39 0.035977859778597784
d 40 0.03690036900369004
h 41 0.03782287822878229
a 48 0.04428044280442804
t 51 0.0470479704797048
i 65 0.05996309963099631
n 85 0.07841328413284133
r 87 0.08025830258302583
e 143 0.13191881918819187
156 0.14391143911439114
letters: 1084
characters: 61
In case you want to try this code, just add this method:
public static void main(String[] args) {
Scanner scan = new Scanner(System.in);
System.out.println("Your encoded message:\n");
String code = scan.nextLine();
IndexOfCoincidence ic = new IndexOfCoincidence();
System.out.println(ic.analyse(code));
}
Deque