Tuesday, August 17, 2010

Solving Hangman?

Over on the Wolfram blog a Mathematica user wrote up a hangman game, and found that "jazz" was the hardest word for his AI to guess.

You can read through his post, but he figures that there are so many easy words that guessing some rare letters near the beginning won't cost you that much.

I took that as a challenge and investigated on my own. In turns out to be pretty hard, even if you get 13 wrong guesses to die, to find all possible words.

Instead of his 90,000 word dictionary, I was using the 230,000 word dictionary in /usr/share/dict/words on OpenBSD. To get a good look at the sub-problem, I restricted myself to the 825 words that could be "_ A _ _". I've included those 825 words in the first comment of this post (since I don't know how to get Blogspot to do attachments and am too tired to try it after working on this for a few hours in evening).

The general strategy is to guess the letters that are most likely to appear. Wrong guesses bring you closer to your doom, right guesses give you a lot of information.

If you could perfectly bisect the remaining words with each guess it would only take 10 guesses, but we can't eliminate that many words at each time.

You could go through the letters in their normal order in the English language, but that gets you to J or Z only after about 20 guesses.

Okay, so let's review. We have "_ A _ _". I'm pretending that we got lucky and guessed "A" first, so we have 25 possible letters left to guess, and 13 wrong answers kill us. There are 825 words and here are how many of those words contain each letter:

 1. r 172
 2. e 172
 3. n 170
 4. t 163
 5. l 148
 6. s 136
 7. i 131
 8. m 117
 9. k 116
10. d 107
11. p 102
12. h 101
13. g 91
14. c 89
15. w 88
16. b 88
17. y 84
18. u 82
19. f 61
20. o 54
21. v 41
22. j 29
23. z 28
24. x 9
(q does not appear at all)
Guessing those in order won't work either, as you can see. But we don't need to guess in that order.

The most straightforward way, and the one used at the Wolfram post if I understood it correctly, is to iteratively find the most common letter in the remaining words and guess that one. I think they called this A* back when I was in school. Or was that a greedy algorithm? I can't remember.

Anyway, if you do that, you get this

Guess NumberLetterRemaining Words
0(A)825
1R653
2N506
3L383
4E284
5S208
6T155
7K114
8M80
9Y55
10U38
11I26
12O16
13F8
So after our 13th wrong answer, we are hanged and still have 8 words left: {bach bawd caph dabb hadj jazz wapp zach}.

However, this isn't necessarily the best order in which to guess letters. There are 24! different orders in which you could guess the letters that aren't A or Q. That would take 1.5*1021 tries to figure out.

Well, what only matters is the first 13 that we guess, and 24C13 is about 2.5 million. That's still a lot, and it may not generalize for other words.

We can do something like minmax algorithm, where we assume that we get the worst possible result from our best possible guesses, looking forward several steps. Looking forward 2, 3, or even 6 steps the very start leaves us off no better. In retrospect this isn't that surprising, since we still have letters pretty well distributed.

However, if we take the 400 best combinations from that 6th step (with word sets ranging in size from 155 to 204) and look forward 5 steps from there, we get some better results. What I called A* above only had us down to 26 word choices, but the best discovery here is down to 24 words. (We guessed D, E, K, L, M, N, P, R, S, T and Y.)

If we take the 32 best solutions we had there (ranging in size from 24 words up to 30 words), and look forward 2 steps, we can get down to 6 possible words. With the additional letters B and F, those words are {gazi hagi jacu jazz waco zach}.

So I can't always win. I didn't test every single combination, but I'm pretty sure that I got close. If I were to spend more than an evening on this, I would try starting with the common results around position 4 or 5 and building trees from there.

I also think that I might be more successful with Mathematica's smaller word list of 90,000 entries. I can't find a good place to download that or a similar corpus (this gets me close, but it looks like a hell of a lot of work to pull out that data). When I find a dictionary of that size I'll try again.

4 comments:

  1. babe babi babs babu baby bach back bade baff baft bago baho baht bail
    bain bais bait bake baku bald bale bali balk ball balm balt balu banc
    band bane bang bani bank bant barb bard bare bari bark barm barn bart
    baru base bash bask bass bast bate bath bats batt batz baud baul baun
    bawd bawl bawn baze cack cade cadi cafh cage cagn caid cain cake caky
    calf calk call calm calp calx camb came camp cand cane cank cant cany
    cape caph card care cark carl caro carp carr cart case cash cask cass
    cast cate cauk caul caum caup cave cavy cawk dabb dace dade dado dadu
    daer daff daft dago dags dail dain dais dale dali dalk dalt dame damn
    damp dand dane dang dank dard dare darg dari dark darn darr dart dash
    dasi date daub daud daur daut dauw dave davy dawn days daze dazy each
    earl earn ease east easy eats eave face fack fact facy fade fady faff
    fage fail fain fair fake faky fall falx fame fand fang fant fany faon
    fare farl farm faro fash fass fast fate faun favn fawn faze gabe gabi
    gaby gade gael gaen gaet gaff gage gail gain gair gait gale gali gall
    galp galt gamb game gamp gamy gane gang gant gaol gaon gape gapo gapy
    garb gare garn garo gash gasp gast gate gaub gaud gaul gaum gaun gaup
    gaur gaus gaut gave gawk gawm gawn gaze gazi gazy habe habu hack hade
    hadj haec haem haet haff haft hagi haik hail hain hair haje hake hako
    haku hale half hall halo hals halt hame hami hand hank hano hant hapi
    hapu harb hard hare hark harl harm harn harp harr hart hash hask hasp
    hate hath hati hatt haul have hawk hawm hayz haze hazy iamb jack jacu
    jade jady jail jain jake jako jamb jami jane jank jann jaob jape jarg
    jarl jass jati jato jauk jaun jaup jawy jazz kadu kago kagu kahu kaid
    kaik kail kaki kale kali kalo kame kang kans kapp karo kasm kate kath
    katy kavi kayo kazi lace lack lacy lade lady laet laic laid lain lair
    lake laky lall lalo lamb lame lamp land lane lank lant lanx lapp lard
    lari lark lars lash lasi lask lass last late lath laud laun laur lave
    lawk lawn laze lazy mabi mace mack maco made madi mado mage magh magi
    mahi maid mail maim main majo make maki mako maku male mali mall malm
    malo malt mamo mand mane mang mani mank mano mans mant manx many mapo
    marc mare mari mark marl marm maro mars mart maru mary mash mask mass
    mast masu mate math maty maud maul maun maux mawk mawp mayo maze mazy
    nabk nabs nabu nace nach nael naid naif naig naik nail nain naio nair
    nais nake nako name nane nant naos nape napu nard nark narr nary nash
    nasi nast natr natt naut nave navy nawt naze nazi oaky oary oast oath
    oaty pace pack paco pact page pahi paho paik pail pain paip pair pais
    pale pali pall palm palp palt paly pand pane pang pani pank pant paon
    pape pard pare pari park parr part pash pasi pass past pate path pato
    patu paty paul paup paut pave pavo pavy pawk pawl pawn rabi race rach
    rack racy rafe raff raft rage raid rail rain rais rake rakh raki raku
    rame rami ramp rand rane rang rani rank rann rant rape rapt rare rase
    rash rasp rate rath rauk raun rave ravi raze razz sabe sack saco sade
    sadh sado sadr safe safi saft sage sago sagy sahh saho saic said sail
    saim sain saip sair sake saki sale salm salp salt same samh samp sand
    sane sang sank sans sant sapo sard sare sari sark sart sash sate sauf
    saul saum saur saut save sawn sawt saxe tabu tach tack tact tade tael
    taen taft tahr tail tain tait take takt taku taky talc tald tale tali
    talk tall tame tamp tane tang tanh tank tano taos tape taps tapu tare
    tari tarn taro tarp tarr tars tart tash task tass tasu tate tath tatu
    taum taun taur taut tave tavy tawn taws taxi taxy uang vade vady vage
    vail vain vair vale vali vall vamp vane vang vare vari vary vase vast
    vasu vayu wabe wabi wace wack waco wade wadi waeg waer wafd waff waft
    wage waif waik wail wain wait wake wakf waky wale wali walk wall walt
    wame wamp wand wane wang want wany wapp ward ware warf wark warl warm
    warn warp wart wary wase wash wasp wast wath watt wauf waul waup waur
    wave wavy waxy ways yabu yade yaff yagi yair yaje yalb yale yali yamp
    yang yank yapp yarb yard yare yark

    ReplyDelete
  2. yarl yarm yarn yarr yaru yate yati yaud yawl yawn yawp yaws yawy zach zain zant zany zarf zarp zati


    Blogger only allows 4096 characters in a post. Who knew?

    ReplyDelete
  3. This comment has been removed by a blog administrator.

    ReplyDelete
  4. Anonymous5:12 AM

    The people who play online rummy do it for various reasons. For an online rummy player, the biggest thrill they can get by playing the game is the rewards. Going by rummy similarities when a person takes birth it could be compared to downloading rummy. The next step is how to live or lead your life is on similar lines as how to play rummy.

    ReplyDelete