CS 456 Assignment 1
Summary
You are given 10,000 characters of encrypted text, encrypted using the Playfair cipher discussed in class. You are also given 2,000,000 characters of plaintext from the same source. You must decrypt the text, and find the key used to encode it.
Details
The Playfair cipher uses a 5 by 7 grid filled with characters to encode a message written with those 35 characters. I am using a 35 character alphabet containing the 26 lowercase letters, the space character, period, comma, exclamation point, question mark, semicolon, hyphen, quotation mark, and apostrophe.
The Playfair cipher encrypts two characters at a time, substituting a different pair of characters for them. Thus it is a substitution cipher on pairs of characters. A pair of characters that forms two corners of a rectangle in the 5 by 7 grid is replaced by the pair of characters forming the other two corners of the rectangle. A pair of characters in the same row and/or column is replaced using different rules, that you can see in the program playfair.cpp.
The plaintext that you are given will give you statistics about the frequency of different letter pairs. Because we are using a substitution cipher, the histogram of the frequencies of letter pairs occuring in the encrypted text should be similar to the histogram of the frequencies of letter pairs in the plaintext. This will let you form guesses as to which letter pair in the encrypted text corresponds to which plaintext letter pair. You can then reconstruct the grid used to encrypt this message.
You should submit to me three copies of the grid you find: one copy when you have guessed the position of only 10 letters, one copy of the grid when you have guessed 25 letters, and the final, completely filled in, grid. The grid is not uniquely determined; you can tell which row follows which row, but not which row is first. Therefore, give me the final grid rotated such that the space character is in the top left corner (row 0, column 0).
Data files and program files
I am giving you the programs I wrote to encrypt this text. You should be able to use these, and modify them, to help you solve this problem. For example, you should be able to create a program that finds the frequency with which letter pairs occur by combining the two programs encrypt.cpp and histogram.cpp.
I will be working on decrypting the cipher this weekend, and will add an additional program decrypt.cpp that will work like encrypt.cpp except that it will take a partial list (a guess) of the letter pair substitutions, and output a file containing rows of enrypted text and rows of partially decrypted text, so that you can look at the partial decryption so far and guess further letter pair substitutions.
Here are the text files and program files:
- coded.txt, the encrypted message
- bleak.txt, a source of unencrypted text, similar to the message
- clean35.cpp, the program used to reduce the original file to a 35-character alphabet, and eliminate excess whitespace.
- histogram.cpp, the program used to analyze the original text, and pick the 35 most common characters to use as our alphabet.
- encrypt.cpp, a program that takes source text, a substitution table, and produces the encrypted version of the source text using those substitutions.
- playfair.cpp, a program that takes a 35-character alphabet, and an integer random seed, and produces the substitution table of all letter pair substitutions given by a Playfair cipher based on a random grid determined by the random seed.
- playfairdecrypt.cpp, a minor variation of playfair.cpp that produces the decryption substitutions.
- sub1.sub, an example substitution table created by playfair.cpp with random seed 1
- sub1inverse.sub, an example substitution table created by playfairdecrypt.cpp with random seed 1
- alphabet.txt, The 35 character alphabet.
Decryption Hints
Here is an additional program to help you decrypt: decrypt.cpp. Additional information:
The 4 most frequent letter pairs in the sample text are also the 4 most frequent in the plaintext, and thus encode as the 4 most frequent letter pairs in the ciphertext. The order of the 4 may not be the same in the sample text and the plaintext, though. The same holds for letter pairs 5 through 8.
Also, in the Playfair cipher grid, the letters t and space are in the same row.
Alternate, easier version
I have created a different, longer, ciphertext, with a different code: coded2.txt. The statistics of this file match the statistics of the sample text (bleak.txt) even better than the original problem.
Thus, the top letter pairs in this ciphertext match the top letter pairs in the sample text, in the same order. I won't tell you how many match in order, but it is a bunch. You may decrypt this ciphertext and give me the grid of its playfair cipher, for a maximum score of 90% on assignment 1.