Tuesday, August 17, 2010

Solving Hangman?

Over on the Wolfram blog a Mathematica user wrote up a hangman game, and found that "jazz" was the hardest word for his AI to guess.

You can read through his post, but he figures that there are so many easy words that guessing some rare letters near the beginning won't cost you that much.

I took that as a challenge and investigated on my own. In turns out to be pretty hard, even if you get 13 wrong guesses to die, to find all possible words.

Instead of his 90,000 word dictionary, I was using the 230,000 word dictionary in /usr/share/dict/words on OpenBSD. To get a good look at the sub-problem, I restricted myself to the 825 words that could be "_ A _ _". I've included those 825 words in the first comment of this post (since I don't know how to get Blogspot to do attachments and am too tired to try it after working on this for a few hours in evening).

The general strategy is to guess the letters that are most likely to appear. Wrong guesses bring you closer to your doom, right guesses give you a lot of information.

If you could perfectly bisect the remaining words with each guess it would only take 10 guesses, but we can't eliminate that many words at each time.

You could go through the letters in their normal order in the English language, but that gets you to J or Z only after about 20 guesses.

Okay, so let's review. We have "_ A _ _". I'm pretending that we got lucky and guessed "A" first, so we have 25 possible letters left to guess, and 13 wrong answers kill us. There are 825 words and here are how many of those words contain each letter:

 1. r 172
 2. e 172
 3. n 170
 4. t 163
 5. l 148
 6. s 136
 7. i 131
 8. m 117
 9. k 116
10. d 107
11. p 102
12. h 101
13. g 91
14. c 89
15. w 88
16. b 88
17. y 84
18. u 82
19. f 61
20. o 54
21. v 41
22. j 29
23. z 28
24. x 9
(q does not appear at all)
Guessing those in order won't work either, as you can see. But we don't need to guess in that order.

The most straightforward way, and the one used at the Wolfram post if I understood it correctly, is to iteratively find the most common letter in the remaining words and guess that one. I think they called this A* back when I was in school. Or was that a greedy algorithm? I can't remember.

Anyway, if you do that, you get this

Guess NumberLetterRemaining Words
0(A)825
1R653
2N506
3L383
4E284
5S208
6T155
7K114
8M80
9Y55
10U38
11I26
12O16
13F8
So after our 13th wrong answer, we are hanged and still have 8 words left: {bach bawd caph dabb hadj jazz wapp zach}.

However, this isn't necessarily the best order in which to guess letters. There are 24! different orders in which you could guess the letters that aren't A or Q. That would take 1.5*1021 tries to figure out.

Well, what only matters is the first 13 that we guess, and 24C13 is about 2.5 million. That's still a lot, and it may not generalize for other words.

We can do something like minmax algorithm, where we assume that we get the worst possible result from our best possible guesses, looking forward several steps. Looking forward 2, 3, or even 6 steps the very start leaves us off no better. In retrospect this isn't that surprising, since we still have letters pretty well distributed.

However, if we take the 400 best combinations from that 6th step (with word sets ranging in size from 155 to 204) and look forward 5 steps from there, we get some better results. What I called A* above only had us down to 26 word choices, but the best discovery here is down to 24 words. (We guessed D, E, K, L, M, N, P, R, S, T and Y.)

If we take the 32 best solutions we had there (ranging in size from 24 words up to 30 words), and look forward 2 steps, we can get down to 6 possible words. With the additional letters B and F, those words are {gazi hagi jacu jazz waco zach}.

So I can't always win. I didn't test every single combination, but I'm pretty sure that I got close. If I were to spend more than an evening on this, I would try starting with the common results around position 4 or 5 and building trees from there.

I also think that I might be more successful with Mathematica's smaller word list of 90,000 entries. I can't find a good place to download that or a similar corpus (this gets me close, but it looks like a hell of a lot of work to pull out that data). When I find a dictionary of that size I'll try again.