Tuesday, August 17, 2010

Solving Hangman?

Over on the Wolfram blog a Mathematica user wrote up a hangman game, and found that "jazz" was the hardest word for his AI to guess.

You can read through his post, but he figures that there are so many easy words that guessing some rare letters near the beginning won't cost you that much.

I took that as a challenge and investigated on my own. In turns out to be pretty hard, even if you get 13 wrong guesses to die, to find all possible words.

Instead of his 90,000 word dictionary, I was using the 230,000 word dictionary in /usr/share/dict/words on OpenBSD. To get a good look at the sub-problem, I restricted myself to the 825 words that could be "_ A _ _". I've included those 825 words in the first comment of this post (since I don't know how to get Blogspot to do attachments and am too tired to try it after working on this for a few hours in evening).

The general strategy is to guess the letters that are most likely to appear. Wrong guesses bring you closer to your doom, right guesses give you a lot of information.

If you could perfectly bisect the remaining words with each guess it would only take 10 guesses, but we can't eliminate that many words at each time.

You could go through the letters in their normal order in the English language, but that gets you to J or Z only after about 20 guesses.

Okay, so let's review. We have "_ A _ _". I'm pretending that we got lucky and guessed "A" first, so we have 25 possible letters left to guess, and 13 wrong answers kill us. There are 825 words and here are how many of those words contain each letter:

 1. r 172
 2. e 172
 3. n 170
 4. t 163
 5. l 148
 6. s 136
 7. i 131
 8. m 117
 9. k 116
10. d 107
11. p 102
12. h 101
13. g 91
14. c 89
15. w 88
16. b 88
17. y 84
18. u 82
19. f 61
20. o 54
21. v 41
22. j 29
23. z 28
24. x 9
(q does not appear at all)
Guessing those in order won't work either, as you can see. But we don't need to guess in that order.

The most straightforward way, and the one used at the Wolfram post if I understood it correctly, is to iteratively find the most common letter in the remaining words and guess that one. I think they called this A* back when I was in school. Or was that a greedy algorithm? I can't remember.

Anyway, if you do that, you get this

Guess NumberLetterRemaining Words
0(A)825
1R653
2N506
3L383
4E284
5S208
6T155
7K114
8M80
9Y55
10U38
11I26
12O16
13F8
So after our 13th wrong answer, we are hanged and still have 8 words left: {bach bawd caph dabb hadj jazz wapp zach}.

However, this isn't necessarily the best order in which to guess letters. There are 24! different orders in which you could guess the letters that aren't A or Q. That would take 1.5*1021 tries to figure out.

Well, what only matters is the first 13 that we guess, and 24C13 is about 2.5 million. That's still a lot, and it may not generalize for other words.

We can do something like minmax algorithm, where we assume that we get the worst possible result from our best possible guesses, looking forward several steps. Looking forward 2, 3, or even 6 steps the very start leaves us off no better. In retrospect this isn't that surprising, since we still have letters pretty well distributed.

However, if we take the 400 best combinations from that 6th step (with word sets ranging in size from 155 to 204) and look forward 5 steps from there, we get some better results. What I called A* above only had us down to 26 word choices, but the best discovery here is down to 24 words. (We guessed D, E, K, L, M, N, P, R, S, T and Y.)

If we take the 32 best solutions we had there (ranging in size from 24 words up to 30 words), and look forward 2 steps, we can get down to 6 possible words. With the additional letters B and F, those words are {gazi hagi jacu jazz waco zach}.

So I can't always win. I didn't test every single combination, but I'm pretty sure that I got close. If I were to spend more than an evening on this, I would try starting with the common results around position 4 or 5 and building trees from there.

I also think that I might be more successful with Mathematica's smaller word list of 90,000 entries. I can't find a good place to download that or a similar corpus (this gets me close, but it looks like a hell of a lot of work to pull out that data). When I find a dictionary of that size I'll try again.

4 Comments:

Blogger Dan Weber said...

babe babi babs babu baby bach back bade baff baft bago baho baht bail
bain bais bait bake baku bald bale bali balk ball balm balt balu banc
band bane bang bani bank bant barb bard bare bari bark barm barn bart
baru base bash bask bass bast bate bath bats batt batz baud baul baun
bawd bawl bawn baze cack cade cadi cafh cage cagn caid cain cake caky
calf calk call calm calp calx camb came camp cand cane cank cant cany
cape caph card care cark carl caro carp carr cart case cash cask cass
cast cate cauk caul caum caup cave cavy cawk dabb dace dade dado dadu
daer daff daft dago dags dail dain dais dale dali dalk dalt dame damn
damp dand dane dang dank dard dare darg dari dark darn darr dart dash
dasi date daub daud daur daut dauw dave davy dawn days daze dazy each
earl earn ease east easy eats eave face fack fact facy fade fady faff
fage fail fain fair fake faky fall falx fame fand fang fant fany faon
fare farl farm faro fash fass fast fate faun favn fawn faze gabe gabi
gaby gade gael gaen gaet gaff gage gail gain gair gait gale gali gall
galp galt gamb game gamp gamy gane gang gant gaol gaon gape gapo gapy
garb gare garn garo gash gasp gast gate gaub gaud gaul gaum gaun gaup
gaur gaus gaut gave gawk gawm gawn gaze gazi gazy habe habu hack hade
hadj haec haem haet haff haft hagi haik hail hain hair haje hake hako
haku hale half hall halo hals halt hame hami hand hank hano hant hapi
hapu harb hard hare hark harl harm harn harp harr hart hash hask hasp
hate hath hati hatt haul have hawk hawm hayz haze hazy iamb jack jacu
jade jady jail jain jake jako jamb jami jane jank jann jaob jape jarg
jarl jass jati jato jauk jaun jaup jawy jazz kadu kago kagu kahu kaid
kaik kail kaki kale kali kalo kame kang kans kapp karo kasm kate kath
katy kavi kayo kazi lace lack lacy lade lady laet laic laid lain lair
lake laky lall lalo lamb lame lamp land lane lank lant lanx lapp lard
lari lark lars lash lasi lask lass last late lath laud laun laur lave
lawk lawn laze lazy mabi mace mack maco made madi mado mage magh magi
mahi maid mail maim main majo make maki mako maku male mali mall malm
malo malt mamo mand mane mang mani mank mano mans mant manx many mapo
marc mare mari mark marl marm maro mars mart maru mary mash mask mass
mast masu mate math maty maud maul maun maux mawk mawp mayo maze mazy
nabk nabs nabu nace nach nael naid naif naig naik nail nain naio nair
nais nake nako name nane nant naos nape napu nard nark narr nary nash
nasi nast natr natt naut nave navy nawt naze nazi oaky oary oast oath
oaty pace pack paco pact page pahi paho paik pail pain paip pair pais
pale pali pall palm palp palt paly pand pane pang pani pank pant paon
pape pard pare pari park parr part pash pasi pass past pate path pato
patu paty paul paup paut pave pavo pavy pawk pawl pawn rabi race rach
rack racy rafe raff raft rage raid rail rain rais rake rakh raki raku
rame rami ramp rand rane rang rani rank rann rant rape rapt rare rase
rash rasp rate rath rauk raun rave ravi raze razz sabe sack saco sade
sadh sado sadr safe safi saft sage sago sagy sahh saho saic said sail
saim sain saip sair sake saki sale salm salp salt same samh samp sand
sane sang sank sans sant sapo sard sare sari sark sart sash sate sauf
saul saum saur saut save sawn sawt saxe tabu tach tack tact tade tael
taen taft tahr tail tain tait take takt taku taky talc tald tale tali
talk tall tame tamp tane tang tanh tank tano taos tape taps tapu tare
tari tarn taro tarp tarr tars tart tash task tass tasu tate tath tatu
taum taun taur taut tave tavy tawn taws taxi taxy uang vade vady vage
vail vain vair vale vali vall vamp vane vang vare vari vary vase vast
vasu vayu wabe wabi wace wack waco wade wadi waeg waer wafd waff waft
wage waif waik wail wain wait wake wakf waky wale wali walk wall walt
wame wamp wand wane wang want wany wapp ward ware warf wark warl warm
warn warp wart wary wase wash wasp wast wath watt wauf waul waup waur
wave wavy waxy ways yabu yade yaff yagi yair yaje yalb yale yali yamp
yang yank yapp yarb yard yare yark

9:23 PM  
Blogger Dan Weber said...

yarl yarm yarn yarr yaru yate yati yaud yawl yawn yawp yaws yawy zach zain zant zany zarf zarp zati


Blogger only allows 4096 characters in a post. Who knew?

9:24 PM  
Blogger tenax_technologies said...

This comment has been removed by a blog administrator.

7:47 AM  
Blogger tenax_technologies said...

Great blog!
Thanks for sharing.

--------
Tenax Technologies is a Belarussian software company delivering complex web solutions. We provide comprehensive software development for startups based on Java J2EE Spring Hibernate web2.0 technologies.

8:51 AM  

Post a Comment

<< Home