Ivan Peev blog

Friday, November 05, 2010

Wordhacking in Python

Reddit is hiring again and number 5 of the bonus points here: http://blog.reddit.com/2010/08/reddit-is-hiring.html is:

5. ...compose us a piece of customized wordhacking, e.g.:

pithier
legible
emerald
alleged
spumoni
excerpt

That sounds like an interesting task, and here we will try to solve it using Python, more specifically the intersection operation of the set type. For example, one solution to the text "reddit is now wordhacking" is:

thrown
daemon
coders
iodide
weighs
catnap
stitch
lusaka
bandit
scorns
sewage

And another:

runways
echoing
doormat
deadsea
itching
tetanus
icecold
soakers
namings
oddness
wriggle

The source follows. The function that does all the work is make_wordhacking. When run for the first time, the program downloads a word list from the Internet. It prints only one (the first) solution for each combination of columns and word size.

from __future__ import division
from urllib2 import urlopen
from collections import defaultdict
from itertools import combinations
from zipfile import ZipFile
import os


def iterate_in_group_by(num, cx):
    "iterate_in_group_by(3, 'ABCDEFG') --> ABC DEF G"
    c = 0
    while c < len(cx):
        yield(cx[c:c+num])
        c += num

def get_words():
    fn = "corncob_lowercase.zip"
    if not os.path.isfile(fn):
        data = urlopen("http://www.mieliestronk.com/" + fn).read()
        # Write in binary mode: the download is a zip archive.
        with open(fn, "wb") as fw:
            fw.write(data)
    fz = ZipFile(fn)
    return fz.open(fz.namelist()[0]).read().split()

def make_sets(W):
    S=defaultdict(lambda: defaultdict(set))
    for word in W:
        for i,c in enumerate(word):
            S[c][i].add(word)
    return S

def factors(n):
    for k in range(2,n//2+1):
        if n%k == 0:
            yield k


def print_reddit_comment_solution(R,P):
    print "---------------------------------------------------"
    print "We have a solution"
    print P
    for v in R.values():
        for w in [list(a) for a in v]:
            for p in range(len(w)):
                if p in P:
                    w[p] = "`[%s](http://)`"%w[p]
            print "".join(("\n`%s`" % "".join(w)).split("``"))
            break

def print_html_solution(R,P):
    print """\n<br/><div style="font-family: monospace; text-align: center; font-size: x-large; line-height: 1em;">"""
    for v in R.values():
        for w in [list(a) for a in v]:
            for p in range(len(w)):
                if p in P:
                    w[p] = """<span style="font-family:monospace; font-weight: bold; background-color: rgb(255, 221, 221);">%s</span>"""%w[p]
                else:
                    w[p] = """<span style="font-family:monospace; font-weight: bold; background-color: rgb(221, 255, 255);">%s</span>""" % w[p]
            print "<br/>%s" % "".join(w)
            break
    print """</div>"""

def make_wordhacking(text,W):
    text = [x for x in text if x.isalpha()]
    for colnum in factors(len(text)):
        print colnum
        for w_len in range(colnum,15):
            S = make_sets( (word for word in W if len(word)==w_len) )
            T = list(iterate_in_group_by(len(text)//colnum,text))
            for P in combinations(range(w_len),colnum):
                R={}
                for i in range(len(T[0])):
                    C = [col[i] for col in T]
                    R[i] = S[C[0]][P[0]]
                    for ind,pos in enumerate(P[1:]):
                        R[i] = R[i] & S[C[ind+1]][pos]
                    if not R[i]:
                        break
                else :
                    print_reddit_comment_solution(R,P)


if __name__ == "__main__":
    W =get_words()
    text = "reddit is now wordhacking"
    make_wordhacking(text,W)

Sometimes it generates a lot of solutions, sometimes none at all. Try to choose an input text whose length has many factors.
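The core idea, indexing words by letter and position and then intersecting the sets, can be shown in isolation. The following is only a minimal sketch in Python 3 (the program above is Python 2); the names build_index and words_matching and the tiny word list are made up for illustration and are not part of the original program.

```python
from collections import defaultdict


def build_index(words):
    """Map letter -> position -> set of words with that letter at that position."""
    index = defaultdict(lambda: defaultdict(set))
    for word in words:
        for pos, letter in enumerate(word):
            index[letter][pos].add(word)
    return index


def words_matching(index, constraints):
    """Return the words that satisfy every (position, letter) constraint."""
    result = None
    for pos, letter in constraints:
        candidates = index[letter][pos]
        # Intersect with the candidates for this constraint.
        result = set(candidates) if result is None else result & candidates
        if not result:
            return set()
    return result if result is not None else set()


# Hypothetical tiny word list, only for illustration.
sample_words = ["thrown", "daemon", "coders", "iodide", "throne"]
index = build_index(sample_words)

# First row of the first solution above: "r" in column 2 and "w" in column 4.
print(sorted(words_matching(index, [(2, "r"), (4, "w")])))
# Second row: "e" in column 2 and "o" in column 4.
print(sorted(words_matching(index, [(2, "e"), (4, "o")])))
```

Each row of the hidden text becomes one such lookup, and a full solution exists only when every row has a non-empty intersection for the same set of column positions.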

Wednesday, April 09, 2008

Solving the Monty Hall problem with simulation in Python

On at least two occasions I've met the Monty Hall problem. The first time was in the book "The Curious Incident of the Dog in the Night-time" by Mark Haddon. And today I found it in an article called "Monty Hall strikes again - reveals fatal flaw in some of the most famous psychology experiments".

It is so counterintuitive that it is amazing. The problem is this:

Suppose you're on a game show, and you're given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what's behind the doors, opens another door, say No. 3, which has a goat. He then says to you, "Do you want to pick door No. 2?" Is it to your advantage to switch your choice?

http://en.wikipedia.org/wiki/Monty_Hall_problem

My intuition says there is no need to change doors: the probability of winning the car is obviously 1/3. But the mathematical solution says that the probability of winning the car if you change doors is 2/3. Amazing if it is true! So let's find out who is right. How? We will write a simulation, run it many, many times, and take the average. This is the Python program that solves the problem:


from __future__ import division
from random import shuffle, choice


def get_random_doors():
    doors = ["goat","goat","car"]
    shuffle(doors)
    return doors
    
    
def get_host_open(doors, guest_pick1):
    host_open = 0
    remaining_doors = [ k for k in [0,1,2] 
                                if k != guest_pick1]
    if doors[guest_pick1] == "car":
        host_open = choice(remaining_doors)
    else:
        [host_open] = [d for d in remaining_doors 
                                if doors[d] != "car"]
    return host_open
    
    
def simulation(strategy):
    N = 100000
    cars_won = 0
    for _ in range(N):
        doors = get_random_doors()
        guest_pick1 = choice([0,1,2])
        host_open = get_host_open(doors, guest_pick1)
        if strategy == "stay":
            guest_pick_final = guest_pick1
        elif strategy == "change":
            [guest_pick_final] = [ k for k in [0,1,2] 
                                    if k != guest_pick1 and k != host_open]
        else:
            raise ValueError("Unknown strategy")
            
        if doors[guest_pick_final] == "car":
            cars_won += 1
        
    print strategy, cars_won/N
    
    
simulation("stay")
simulation("change")

This is Python 2.4 and 2.5 code. The program is written to be easy to read and verify, and is not optimized for performance. Let's run it. The result is:

stay 0.33478
change 0.66407

So change doors, definitely change doors if you can!
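The simulated numbers can also be cross-checked exactly. The sketch below (Python 3, not part of the original program; the function name exact_win_probability is made up) enumerates the nine equally likely combinations of car position and first pick. It relies on one observation: the host always opens a goat door, so switching wins exactly when the first pick was a goat.

```python
from fractions import Fraction
from itertools import product


def exact_win_probability(strategy):
    """Enumerate car position and first pick; each of the 9 cases has weight 1/9."""
    wins = Fraction(0)
    for car, pick in product(range(3), repeat=2):
        if strategy == "stay":
            won = (pick == car)
        elif strategy == "change":
            # The host removes a goat, so the remaining door hides the car
            # whenever the first pick was a goat.
            won = (pick != car)
        else:
            raise ValueError("Unknown strategy")
        if won:
            wins += Fraction(1, 9)
    return wins


print(exact_win_probability("stay"))    # 1/3
print(exact_win_probability("change"))  # 2/3
```

The exact values 1/3 and 2/3 agree with the simulated 0.33478 and 0.66407.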

Saturday, April 14, 2007

Peter Norvig's spell checker

I think that Peter Norvig slightly changed the implementation of his spell checker because of me: http://www.norvig.com/spell-correct.html

That happened during a discussion in the reddit comments thread about his spell checker: http://programming.reddit.com/info/1gb59/comments

ipeev 2 points 4 days ago*

I see a small problem in the Norvig's code. This part is supposed to make 26n alterations:

[word[0:i]+c+word[i+1:] for i in range(n) for c in string.lowercase] + ## alteration

But not really. This is what string.lowercase returns on my computer:

import string

string.lowercase

'abcdefghijklmnopqrstuvwxyz\x83\x9a\x9c\x9e\xaa\xb5\xba\xdf\xe0 \xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0 \xf1\xf2\xf3\xf4\xf5\xf6\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'

Total of 65 characters. Same for insertion.

Is this a real problem? It could be! One has to be aware of the Python's battery included philosophy.

Then someone responds:

vetinari 1 point 2 days ago

Check your locale, mine is fine:

>>> import string
>>> [c for c in string.lowercase]
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
>>>

He is right too, but I am not going to change my locale, so I respond:

ipeev 1 point 2 days ago

My point is that Norvig's code is not aware of the localization features of Python. If he wants his program to run on any computer, the simplest thing to do is to use the first 26 letters only:

string.lowercase[:26]

But Norvig didn't exactly listen to me and implemented a different solution:

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    n = len(word)
    return set([word[0:i]+word[i+1:] for i in range(n)] +                     # deletion
               [word[0:i]+word[i+1]+word[i]+word[i+2:] for i in range(n-1)] + # transposition
               [word[0:i]+c+word[i+1:] for i in range(n) for c in alphabet] + # alteration
               [word[0:i]+c+word[i:] for i in range(n+1) for c in alphabet])  # insertion

So instead of choosing my cryptic syntax, he implemented it with an explicit alphabet. Apparently he also thinks that explicit is better than implicit.

At the end of his page he writes:

Originally my program was 20 lines, but a reader pointed out that I had used string.lowercase, which in some locales in some versions of Python, has more characters than just the a-z I intended. So I added the variable alphabet to make sure.

I think I might have been that reader. For the record, the locale in question is Bulgarian. Sorry for causing trouble with all those regional alphabets; I know it is a pain in the ... . I like his site very much and enjoyed the "Tutorial on Good Lisp Programming Style"; I even have it printed on paper at home!

UPDATE: I sent an email to Norvig and asked him about this, and he responded that he did indeed read my comment on reddit and that is why he changed the implementation. Now my name is there too, which is cool!

Originally my program was 20 lines, but Ivan Peev pointed out that I had used string.lowercase, which in some locales in some versions of Python, has more characters than just the a-z I intended. So I added the variable alphabet to make sure. I could have used string.ascii_lowercase.
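The string.ascii_lowercase constant mentioned in the quote is defined to be the 26 letters a-z regardless of locale. A minimal sketch in Python 3 (where string.lowercase no longer exists) shows that the alteration step then produces exactly 26n candidates; the helper name alterations is made up here and is not taken from Norvig's code.

```python
import string

# Always 'abcdefghijklmnopqrstuvwxyz', independent of the locale.
alphabet = string.ascii_lowercase


def alterations(word):
    """All strings made by replacing one character of word with a letter a-z."""
    n = len(word)
    return [word[:i] + c + word[i + 1:] for i in range(n) for c in alphabet]


print(len(alphabet))              # 26
print(len(alterations("spell")))  # 26 * 5 = 130
```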

Tuesday, March 20, 2007

Solving the "Mr.S and Mr.P" puzzle by John McCarthy in Python

I recently found an interesting article on reddit about solving a logic puzzle in Haskell:
http://okmij.org/ftp/Haskell/Mr-S-P.lhs
While the technique presented is interesting in itself, I find that it transfers to Python with even better readability.
And here is the original statement:

Formalization of two Puzzles Involving Knowledge
 McCarthy, John (1987).
 http://www-formal.stanford.edu/jmc/puzzles.html

We pick two numbers a and b, so that a>=b and both numbers are within
the range [2,99]. We give Mr.P the product a*b and give Mr.S the sum
a+b.

The following dialog takes place:

 Mr.P: I don't know the numbers
 Mr.S: I knew you didn't know. I don't know either.
 Mr.P: Now I know the numbers
 Mr.S: Now I know them too

Can we find the numbers a and b?

The solution, written in Python 2.5, follows.

# all pairs of a,b
pairs = [ (a,b) for a in range(2,99+1) for b in range(2,99+1) if a>=b ]

# calculates map of solutions
def calc_map(oper):
    M={}
    for a,b in pairs:
        m = oper(a,b)
        if not m in M:
            M[m] = []
        M[m].append( (a,b) )
    return M

# function that tests for single solution
single = lambda lx: len(lx) == 1

# maps that hold the sum and the product solutions, 
# dictionaries with list values
S = calc_map(lambda a,b: a+b)
P = calc_map(lambda a,b: a*b)

# Rules list
rule_MrP_dont_know      = lambda p: not single (P[p])

rule_MrS_dont_know      = lambda s: not single (S[s])

rule_MrS_knew_MrP_doesnt_know = lambda s: all( [rule_MrP_dont_know( a*b ) 
                                                        for a,b in S[s] ] )
        
rule_MrP_now_knows      = lambda p: single( [ (a,b) for a,b in P[p] 
                                        if rule_MrS_knew_MrP_doesnt_know(a+b) ])

rule_MrS_knows_MrP_now_know = lambda s: single([ (a,b) for a,b in S[s] 
                                            if rule_MrP_now_knows(a*b) ])
# Solve it
for a, b in pairs:
    s,p = a+b, a*b
    if rule_MrP_dont_know(p) \
            and rule_MrS_dont_know(s) \
            and rule_MrS_knew_MrP_doesnt_know(s)\
            and rule_MrP_now_knows(p) \
            and rule_MrS_knows_MrP_now_know(s):
        print "Answer is:" , a,b

And if we run the program the answer we get is:


Answer is: 13 4

I used the "all" function to increase readability, and this is the only reason Python 2.5 is needed. Supply your own "all" function for a 2.4 version.

Now what we have here is almost a short implementation of a rule engine with forward chaining. It certainly lacks a lot of the functionality of a real rule engine, but the actual process of rule execution is probably the same. Is Python a good language for this kind of task? I think it is.

The Python solution follows exactly neither the Haskell route nor the paper by McCarthy. I find McCarthy rather difficult to read, and Haskell has its own oddities that make me scream sometimes in the middle of the night.
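For reference, such a replacement can be very small. This is only a sketch; it is named all_true here (a made-up name) so that it does not shadow the built-in on newer versions, and on Python 2.4 it could simply be renamed to all.

```python
def all_true(iterable):
    """Return True if every element is truthy (and True for an empty iterable)."""
    for element in iterable:
        if not element:
            return False
    return True


print(all_true([True, 1, "x"]))  # True
print(all_true([True, 0]))       # False
```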

The Python rules try to be more directly mapped to the conditions in the task. Readability counts and of course beautiful is better than ugly.

Tuesday, September 12, 2006

Solving the Google Code Jam "countPaths" problem in Python

Wednesday, 16. August 2006, 13:54:26

Google Code Jam, Python

I found this article about Haskell:
Solving the Google Code Jam "countPaths" problem in Haskell

Recently Guido van Rossum announced that Python is one of the supported languages in the next Google Code Jam. I decided to write a solution to the puzzle in Python. I haven't looked at the Haskell code. It is too strange anyway.(*) Nor did I look at the C code from the other site: too long. Here it is in Python in 25 lines. And it is very fast.


class WordPath:
   def howMany(self, (x,y), word):
       if x < 0 or x >= self.N or y < 0 or y >= self.M or self.grid[x][y] != word[0]:
           return 0
       if len(word) == 1:
           return 1
       s = 0
       for a in (x-1, x, x+1):
           for b in (y-1, y, y+1):
               if not (a == x and b ==y):
                   if (a, b, word) not in self.cache:
                       self.cache[ (a,b,word) ] = self.howMany( (a,b), word[1:])
                   s += self.cache[ (a,b,word) ]
       return s

   def countPaths( self, grid, word):
       self.grid = grid
       self.N = len(grid)
       self.M = len(grid[0])
       self.cache = {}
       s = sum ( [ self.howMany( (x,y), word) for x in range (self.N)
                                                   for y in range(self.M)] )
       if s > 1000000000:
           s = -1
       return s


#

test = WordPath()
assert 1 == test.countPaths( ("ABC","FED","GHI"), "ABCDEFGHI")
assert 108 == test.countPaths( ("AA","AA"), "AAAA")
assert 2 == test.countPaths( ("ABC","FED","GAI"), "ABCDEA")
assert 0 == test.countPaths( ("ABC","DEF","GHI"), "ABCD")
assert 56448 == test.countPaths( ("ABABA","BABAB","ABABA","BABAB","ABABA"), "ABABABBA")
assert -1 == test.countPaths( ("AAAAA","AAAAA","AAAAA","AAAAA","AAAAA"), "AAAAAAAAAAA")

Original Problem Statement here

The author of the Haskell article Tom Moertel investigates this extreme case:

"to find a word composed of 50 “A” letters within a 50×50 grid of “A” cells.".

Let us see what happens if we remove the check for exceeding 1 billion.

print test.countPaths( [ "A"*50 for a in range(50)], "A"*50)

The result is 303835410591851117616135618108340196903254429200 and this is the same
value that Tom Moertel found.

The calculation took a whole 8 seconds on an oooold Athlon 1000 MHz with Win XP.

Nice.

Well, not really. The Google Code Jam rules are, as Tom Moertel pointed out, "All submissions have a maximum of 2 seconds of runtime per test case". If Google is running the tests on a 4 GHz Athlon we will be just about within the limit. But let's not take chances. We have to stop the run earlier. That unfortunately will increase our code size. One very straightforward solution is with exceptions:


import timeit

class WordPath:
   def howMany(self, (x,y), word):
       if x < 0 or x >= self.N or y < 0 or y >= self.M or self.grid[x][y] != word[0]:
           return 0
       if len(word) == 1:
           return 1
       s = 0
       for a in (x-1, x, x+1):
           for b in (y-1, y, y+1):
               if not (a == x and b ==y):
                   if (a, b, word) not in self.cache:
                       self.cache[ (a,b,word) ] = self.howMany( (a,b), word[1:])
                   s += self.cache[ (a,b,word) ]
                   if s > 1000000000:
                       raise OverflowError ('spam', 'eggs')
       return s

   def countPaths( self, grid, word):
       self.grid = grid
       self.N = len(grid)
       self.M = len(grid[0])
       self.cache = {}
       try:
           s = sum ( [ self.howMany( (x,y), word) for x in range (self.N)
                                                   for y in range(self.M)] )
           if s > 1000000000:
               s = -1
           return s
       except OverflowError :
           return -1

#


t = timeit.Timer(stmt='WordPath().countPaths( [ "A"*50 for a in range(50)], "A"*50)',
                   setup = 'from __main__ import WordPath')

print "%.2f sec/pass" %  (t.timeit(number=100)/100)

And the result on 1Ghz is the blazing speed of:

0.03 sec/pass

So the conclusion is that Python can be used in Google's Code Jam, but one must be careful with the time limits!

(*) Update. Some people on reddit are commenting on my ignorance of the Haskell code. I actually learned most of Haskell some time ago, until I got to Monads. Then I found 267 articles explaining what Monads are, of which 27 were just tutorials. At that moment I thought "Enough! Maybe I will read it later.". The so-called "Maybe" monad. That was 1.5 years ago.
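The listings above are Python 2; the tuple parameter in howMany(self, (x,y), word) is not valid syntax in Python 3. As a rough sketch of the same early-exit idea for Python 3, the version below uses functools.lru_cache instead of a hand-written dictionary. The names count_paths, how_many, LIMIT and TooManyPaths are made up for this sketch and are not part of the Code Jam interface.

```python
from functools import lru_cache

LIMIT = 1000000000


class TooManyPaths(Exception):
    """Raised as soon as a running total exceeds LIMIT (hypothetical helper)."""


def count_paths(grid, word):
    rows, cols = len(grid), len(grid[0])

    @lru_cache(maxsize=None)
    def how_many(x, y, k):
        # Number of paths spelling word[k:] that start at cell (x, y).
        if x < 0 or x >= rows or y < 0 or y >= cols or grid[x][y] != word[k]:
            return 0
        if k == len(word) - 1:
            return 1
        total = 0
        for a in (x - 1, x, x + 1):
            for b in (y - 1, y, y + 1):
                if (a, b) != (x, y):
                    total += how_many(a, b, k + 1)
                    if total > LIMIT:
                        # Stop early: the final answer can only be -1.
                        raise TooManyPaths()
        return total

    try:
        total = 0
        for x in range(rows):
            for y in range(cols):
                total += how_many(x, y, 0)
                if total > LIMIT:
                    return -1
        return total
    except TooManyPaths:
        return -1


print(count_paths(("ABC", "FED", "GHI"), "ABCDEFGHI"))
print(count_paths(("AA", "AA"), "AAAA"))
```

The cache key here is the word index k rather than the remaining word, which avoids building and hashing string slices; whether that matters for speed was not measured.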

Comments

This is a neat solution to a problem that's combinatorially large. I think the code is clean and easy to understand.

I do have one question though - it appears to me that the WordPath class is used only for running the two methods. Wouldn't it be cleaner (two fewer lines and no "selfs") to just have the two methods in the global name space? What does the class structure bring to the party?

Don't get me wrong; I'm not trying to trash the code. I'm sure I couldn't do as well. I'm just trying to improve my understanding of good python style.

Thanks.

By steve_g, # 17. August 2006, 01:35:01

Hi Steve,

The solution is implemented with a class because of the Google Code Jam rules. They give the class structure and the competitor has to implement it. For this task the Java interface was:


Class:
    WordPath
    Method:
    countPaths
    Parameters:
    String[], String
    Returns:
    int
    Method signature:
    int countPaths(String[] grid, String find)
    (be sure your method is public)

By ipeev, # 17. August 2006, 04:41:58

Here's a version which doesn't use a Dictionary to cache stuff :


class WordPath:
 def countPaths(self, grid, word):
       last_round = [[((x == word[0] and 1) or 0) for x in y] for y in grid]
       len_grid = len(grid)
       len_grid_0 = len(grid[0])
       for letter in word[1:]:
           this_round  = [[0 for x in range(len_grid_0)] for y in range(len_grid)]
           for x in range(len_grid):
               for y in range(len_grid_0):
                   if grid[x][y] == letter:
                       for a in (x-1, x, x+1):
                           for b in (y-1, y, y+1):                           
                               if a >= 0 and a < len_grid and b >= 0 and b < len_grid_0 and not (a == x and b == y):
                                   this_round[x][y] += last_round[a][ b]
           last_round = this_round
       total = sum(sum(x) for x in last_round)
       if total > 1000000000:
           total = -1
       return total

Runs a little faster for the tests (on my machine anyway) but starts running a lot faster if you remove the "> 1000000000" tests and use large test values.

EDIT: Made it a bit more efficient

By almostobsolete, # 17. August 2006, 14:48:40

Very interesting!

After looking for several minutes at the code I think I understand now why and how this solution works.

But I am not sure why it is faster. It apparently doesn't use a cache. Maybe it is because it doesn't use recursion. I wish Python would support recursive functions better some day.

Anyway. Measured 2.9 seconds on the same computer. Removing the check for "> 1000000000" doesn't improve the time from what I see.

Probably the trick with the early interruption can be used here too, because it is far behind the fastest 0.03 seconds solution with the exceptions.

By ipeev, # 17. August 2006, 17:04:33

Sorry, I wasn't very clear in my last message. What I meant was that it's quicker in getting the correct result (303835410591851117616135618108340196903254429200) for big all A's test (print test.countPaths( [ "A"*50 for a in range(50)], "A"*50)) and it gets relatively faster as the size of the grid or the word is increased.

By almostobsolete, # 17. August 2006, 19:05:31

Thomas writes:

I was coding this in the statistical language R (which is not fast enough, but comes surprisingly close with sparse transition matrix operations) and this got me thinking about the overflow exception case.

I don't think that all A's is the hardest case by a long way. Suppose you have a 50x50 grid of As with a B in the last entry and that the test word is 49 As and a B. You won't know until you look at the last entry whether there are $BIGNUM solutions or none.

By anonymous user, # 21. August 2006, 14:49:54

Interesting observation Thomas. I checked to see how the execution time will change in the new extreme case for the solution with the exceptions.


t = timeit.Timer(stmt='WordPath().countPaths( [ "A"*50 for a in range(50)], "A"*50)',
                   setup = 'from __main__ import WordPath')
print "%.4f sec/pass" %  (t.timeit(number=10)/10)

A = [ "A"*50 for a in range(49)] + ["A"*49 + "B"]
W = "A"*49 + "B"
t = timeit.Timer(stmt='WordPath().countPaths( A,W)',
                   setup = 'from __main__ import WordPath, A,W')
print "%.4f sec/pass" %  (t.timeit(number=10)/10)

The first test is with all "A"s and the second is with "A"s and only 1 "B" at the end.

Here are the measured results:

0.0114 sec/pass
0.9282 sec/pass

We see that the time indeed increased about 100 times. But it is still under 1 second and much better than the 2.4545 sec/pass of my first solution without the exceptions.

But let's see how the solution provided by almostobsolete handles this case. Running the same 2 tests gives:

1.3069 sec/pass
1.2378 sec/pass

His algorithm handles the new extreme case a little better. Apparently it spends its time summing all solutions in the all-"A"s case.

By ipeev, # 22. August 2006, 06:53:36

Saturday, December 17, 2005

On Functional Programming

I have been interested in functional programming for some time. Very interesting.

Sunday, May 15, 2005

Initial Start

This is my new blog.
