Ivan Peev blog

Saturday, April 14, 2007

Peter Norvig's spell checker

I think Peter Norvig slightly changed the implementation of his spell checker because of me: http://www.norvig.com/spell-correct.html

It happened during the discussion in the reddit comments thread about his spell checker: http://programming.reddit.com/info/1gb59/comments

ipeev 2 points 4 days ago*

I see a small problem in Norvig's code. This part is supposed to make 26n alterations:

[word[0:i]+c+word[i+1:] for i in range(n) for c in string.lowercase] + ## alteration

But not really. This is what string.lowercase returns on my computer:

import string

string.lowercase

'abcdefghijklmnopqrstuvwxyz\x83\x9a\x9c\x9e\xaa\xb5\xba\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'

Total of 65 characters. Same for insertion.

Is this a real problem? It could be! One has to be aware of Python's "batteries included" philosophy.

Then someone responds:

vetinari 1 point 2 days ago

Check your locale, mine is fine:

>>> import string
>>> [c for c in string.lowercase]
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
>>>

He is right too, but I am not going to change my locale, so I respond:

ipeev 1 point 2 days ago

My point is that Norvig's code is not aware of the localization features of Python. If he wants his program to run on any computer, the simplest thing to do is to use the first 26 letters only:

string.lowercase[:26]

But Norvig didn't exactly listen to me and implemented a different solution:

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    n = len(word)
    return set([word[0:i]+word[i+1:] for i in range(n)] +                     # deletion
               [word[0:i]+word[i+1]+word[i]+word[i+2:] for i in range(n-1)] + # transposition
               [word[0:i]+c+word[i+1:] for i in range(n) for c in alphabet] + # alteration
               [word[0:i]+c+word[i:] for i in range(n+1) for c in alphabet])  # insertion

So instead of adopting my cryptic syntax, he spelled out the alphabet explicitly. Apparently he also thinks that explicit is better than implicit.

At the end of his page he writes:

Originally my program was 20 lines, but a reader pointed out that I had used string.lowercase, which in some locales in some versions of Python, has more characters than just the a-z I intended. So I added the variable alphabet to make sure.

I think I might have been that reader. For the record, the locale in question is Bulgarian. Sorry for causing trouble with all those regional alphabets, I know it is a pain in the ... . I like his site very much and enjoyed the "Tutorial on Good Lisp Programming Style"; I even have it printed on paper at home!

UPDATE: I sent an email to Norvig asking about this, and he replied that he did indeed read my comment on reddit and that's why he changed the implementation. Now my name is there too, which is cool!

Originally my program was 20 lines, but Ivan Peev pointed out that I had used string.lowercase, which in some locales in some versions of Python, has more characters than just the a-z I intended. So I added the variable alphabet to make sure. I could have used string.ascii_lowercase.
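
To make the arithmetic concrete, here is a minimal sketch of the same four edit lists built on string.ascii_lowercase. It assumes Python 3 (not the Python 2 of the original code, and string.lowercase no longer exists there), and the helper name edit_candidates is hypothetical, introduced only to count the candidates before set() removes duplicates. For a word of length n the lists hold n deletions, n-1 transpositions, 26n alterations and 26(n+1) insertions, which is 54n+25 candidates in total.

```python
import string

# Assumption: Python 3. string.ascii_lowercase is always exactly 'a'..'z',
# independent of the locale, unlike Python 2's string.lowercase.
alphabet = string.ascii_lowercase


def edit_candidates(word):
    """Hypothetical helper: all edit-distance-1 candidates, duplicates kept."""
    n = len(word)
    deletions = [word[:i] + word[i + 1:] for i in range(n)]
    transpositions = [word[:i] + word[i + 1] + word[i] + word[i + 2:]
                      for i in range(n - 1)]
    alterations = [word[:i] + c + word[i + 1:]
                   for i in range(n) for c in alphabet]
    insertions = [word[:i] + c + word[i:]
                  for i in range(n + 1) for c in alphabet]
    return deletions + transpositions + alterations + insertions


def edits1(word):
    """Same result shape as the edits1 above: a set of candidate strings."""
    return set(edit_candidates(word))


if __name__ == "__main__":
    word = "spell"
    # Number of letters, raw candidates (54n+25), and distinct candidates.
    print(len(alphabet), len(edit_candidates(word)), len(edits1(word)))
```

With a 65-character alphabet the same lists would grow to 65n alterations and 65(n+1) insertions, which is the problem described in the reddit comment.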

Tuesday, March 20, 2007

Solving the "Mr.S and Mr.P" puzzle by John McCarthy in Python

I recently found an interesting article on reddit about solving a logic puzzle in Haskell:
http://okmij.org/ftp/Haskell/Mr-S-P.lhs
While the technique presented is interesting in itself, I find that it transfers to Python with even better readability.
Here is the original statement:

Formalization of two Puzzles Involving Knowledge
 McCarthy, John (1987).
 http://www-formal.stanford.edu/jmc/puzzles.html

We pick two numbers a and b, so that a>=b and both numbers are within
the range [2,99]. We give Mr.P the product a*b and give Mr.S the sum
a+b.

The following dialog takes place:

 Mr.P: I don't know the numbers
 Mr.S: I knew you didn't know. I don't know either.
 Mr.P: Now I know the numbers
 Mr.S: Now I know them too

Can we find the numbers a and b?

The solution, written in Python 2.5, follows.

# all pairs of a,b
pairs = [ (a,b) for a in range(2,99+1) for b in range(2,99+1) if a>=b ]

# calculates map of solutions
def calc_map(oper):
    M={}
    for a,b in pairs:
        m = oper(a,b)
        if m not in M:
            M[m] = []
        M[m].append( (a,b) )
    return M

# function that tests for single solution
single = lambda lx: len(lx) == 1

# maps that hold the sum and the product solutions, 
# dictionaries with list values
S = calc_map(lambda a,b: a+b)
P = calc_map(lambda a,b: a*b)

# Rules list
rule_MrP_dont_know      = lambda p: not single (P[p])

rule_MrS_dont_know      = lambda s: not single (S[s])

rule_MrS_knew_MrP_doesnt_know = lambda s: all( [rule_MrP_dont_know( a*b ) 
                                                        for a,b in S[s] ] )
        
rule_MrP_now_knows      = lambda p: single( [ (a,b) for a,b in P[p] 
                                        if rule_MrS_knew_MrP_doesnt_know(a+b) ])

rule_MrS_knows_MrP_now_know = lambda s: single([ (a,b) for a,b in S[s] 
                                            if rule_MrP_now_knows(a*b) ])
# Solve it
for a, b in pairs:
    s,p = a+b, a*b
    if rule_MrP_dont_know(p) \
            and rule_MrS_dont_know(s) \
            and rule_MrS_knew_MrP_doesnt_know(s)\
            and rule_MrP_now_knows(p) \
            and rule_MrS_knows_MrP_now_know(s):
        print "Answer is:" , a,b

And if we run the program the answer we get is:


Answer is: 13 4

I used the built-in "all" function to improve readability, and this is the only reason Python 2.5 is needed. For Python 2.4, supply your own "all" function.

What we have here is almost a short implementation of a rule engine with forward chaining. It certainly lacks much of the functionality of a real rule engine, but the actual process of rule execution is probably the same. Is Python a good language for this kind of task? I think it is.

The Python solution follows neither the Haskell route nor McCarthy's paper exactly. I find McCarthy rather difficult to read, and Haskell has its own oddities that sometimes make me scream in the middle of the night.
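
A replacement for the built-in could be as small as the sketch below. The name my_all is hypothetical (on Python 2.4 it could simply be called all), and the code uses nothing newer than a for loop, so it should behave the same on old and new interpreters.

```python
def my_all(iterable):
    """Return True if every element is truthy (True for an empty iterable)."""
    for item in iterable:
        if not item:
            return False  # stop at the first falsy element
    return True


if __name__ == "__main__":
    print(my_all([True, 1, "x"]), my_all([True, 0]), my_all([]))
```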

The Python rules try to map more directly onto the conditions in the task. Readability counts and of course beautiful is better than ugly.
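
To see how one rule maps onto the dialog, it may help to trace it by hand for the product 52 of the answer. The sketch below assumes Python 3 and uses hypothetical names (by_sum, by_product, s_knew_p_did_not_know) rather than the S, P and rule_... names of the program above. It lists the pairs Mr.P has to consider for 52 and keeps only those whose sum lets Mr.S say "I knew you didn't know".

```python
from collections import defaultdict

# Assumption: Python 3. Same search space as the puzzle: 2 <= b <= a <= 99.
pairs = [(a, b) for a in range(2, 100) for b in range(2, a + 1)]

# Hypothetical names: group the pairs by their sum and by their product.
by_sum = defaultdict(list)
by_product = defaultdict(list)
for a, b in pairs:
    by_sum[a + b].append((a, b))
    by_product[a * b].append((a, b))


def s_knew_p_did_not_know(s):
    """Mr.S can be sure Mr.P is stuck only if every split of s has an ambiguous product."""
    return all(len(by_product[a * b]) > 1 for a, b in by_sum[s])


# Pairs Mr.P must consider when told the product 52.
product_candidates = sorted(by_product[52])

# After Mr.S speaks, Mr.P keeps only pairs whose sum passes the rule.
survivors = [(a, b) for a, b in product_candidates
             if s_knew_p_did_not_know(a + b)]

if __name__ == "__main__":
    print(product_candidates, survivors)
```

The pair (26, 2) drops out because its sum 28 can also be split as 23+5, and the product 115 has only that one factorization in range, so Mr.S could not have been sure. That leaves (13, 4), which is why Mr.P can say "Now I know the numbers".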

Tuesday, September 12, 2006

Solving the Google Code Jam "countPaths" problem in Python

Wednesday, 16. August 2006, 13:54:26

Google Code Jam, Python

I found this article about Haskell:
Solving the Google Code Jam "countPaths" problem in Haskell

Recently Guido van Rossum announced that Python is one of the supported languages in the next Google Code Jam. I decided to write a solution to the puzzle in Python. I haven't looked at the Haskell code. It is too strange anyway.(*) Nor have I looked at the C code from the other site: too long. Here it is in Python in 25 lines. And it is very fast.


class WordPath:
   def howMany(self, (x,y), word):
       if x<0 or x>=self.N or y<0 or y>=self.M or self.grid[x][y] != word[0]:
           return 0
       if len(word) == 1:
           return 1
       s = 0
       for a in (x-1, x, x+1):
           for b in (y-1, y, y+1):
               if not (a == x and b ==y):
                   if (a, b, word) not in self.cache:
                       self.cache[ (a,b,word) ] = self.howMany( (a,b), word[1:])
                   s += self.cache[ (a,b,word) ]
       return s

   def countPaths( self, grid, word):
       self.grid = grid
       self.N = len(grid)
       self.M = len(grid[0])
       self.cache = {}
       s = sum ( [ self.howMany( (x,y), word) for x in range (self.N)
                                                   for y in range(self.M)] )
       if s > 1000000000:
           s = -1
       return s


#

test = WordPath()
assert 1 == test.countPaths( ("ABC","FED","GHI"), "ABCDEFGHI")
assert 108 == test.countPaths( ("AA","AA"), "AAAA")
assert 2 == test.countPaths( ("ABC","FED","GAI"), "ABCDEA")
assert 0 == test.countPaths( ("ABC","DEF","GHI"), "ABCD")
assert 56448 == test.countPaths( ("ABABA","BABAB","ABABA","BABAB","ABABA"), "ABABABBA")
assert -1 == test.countPaths( ("AAAAA","AAAAA","AAAAA","AAAAA","AAAAA"), "AAAAAAAAAAA")
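
The class above is Python 2 (tuple parameters such as (x,y) in a def are not accepted by Python 3). As a rough sketch only, the same memoized recursion could look like this on Python 3. The function names count_paths and how_many and the constant LIMIT are assumptions of this sketch, and functools.lru_cache stands in for the hand-written cache dictionary.

```python
from functools import lru_cache

LIMIT = 1000000000  # the problem's cut-off: report -1 above one billion


def count_paths(grid, word):
    """Count paths spelling word through 8-connected cells; cells may repeat."""
    rows, cols = len(grid), len(grid[0])

    @lru_cache(maxsize=None)
    def how_many(x, y, k):
        # Number of paths that start at (x, y) and spell word[k:].
        if not (0 <= x < rows and 0 <= y < cols) or grid[x][y] != word[k]:
            return 0
        if k == len(word) - 1:
            return 1
        return sum(how_many(a, b, k + 1)
                   for a in (x - 1, x, x + 1)
                   for b in (y - 1, y, y + 1)
                   if (a, b) != (x, y))

    total = sum(how_many(x, y, 0) for x in range(rows) for y in range(cols))
    return -1 if total > LIMIT else total


if __name__ == "__main__":
    print(count_paths(("AA", "AA"), "AAAA"))
```

Caching on the index k instead of the remaining word avoids building a new string slice for every cache key, but the recursion is otherwise the same idea.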

Original Problem Statement here

The author of the Haskell article, Tom Moertel, investigates this extreme case:

"to find a word composed of 50 “A” letters within a 50×50 grid of “A” cells.".

Let us see what happens if we remove the check for exceeding 1 billion:

print test.countPaths( [ "A"*50 for a in range(50)], "A"*50)

The result is 303835410591851117616135618108340196903254429200 and this is the same
value that Tom Moertel found.

The calculation took a whole 8 seconds on an oooold Athlon 1000 MHz with Win XP.

Nice.

Well, not really. As Tom Moertel pointed out, the Google Code Jam rules say "All submissions have a maximum of 2 seconds of runtime per test case". If Google is running the tests on a 4 GHz Athlon we will be just about within the limit. But let's not take chances. We have to stop the run earlier. Unfortunately, that will increase our code size. One very straightforward solution uses exceptions:


import timeit

class WordPath:
   def howMany(self, (x,y), word):
       if x<0 or x>=self.N or y<0 or y>=self.M or self.grid[x][y] != word[0]:
           return 0
       if len(word) == 1:
           return 1
       s = 0
       for a in (x-1, x, x+1):
           for b in (y-1, y, y+1):
               if not (a == x and b ==y):
                   if (a, b, word) not in self.cache:
                       self.cache[ (a,b,word) ] = self.howMany( (a,b), word[1:])
                   s += self.cache[ (a,b,word) ]
                   if s > 1000000000:
                       raise OverflowError ('spam', 'eggs')
       return s

   def countPaths( self, grid, word):
       self.grid = grid
       self.N = len(grid)
       self.M = len(grid[0])
       self.cache = {}
       try:
           s = sum ( [ self.howMany( (x,y), word) for x in range (self.N)
                                                   for y in range(self.M)] )
           if s > 1000000000:
               s = -1
           return s
       except OverflowError :
           return -1

#


t = timeit.Timer(stmt='WordPath().countPaths( [ "A"*50 for a in range(50)], "A"*50)',
                   setup = 'from __main__ import WordPath')

print "%.2f sec/pass" %  (t.timeit(number=100)/100)

And the result on the 1 GHz machine is the blazing speed of:

0.03 sec/pass

So the conclusion is that Python can be used in Google's Code Jam, but one must be careful with the time limits!

(*) Update. Some people on reddit are commenting on my ignorance of the Haskell code. I actually learned most of Haskell some time ago, until I got to monads. Then I found 267 articles explaining what monads are, 27 of which were just tutorials. At that moment I thought "Enough! Maybe I will read it later.". The so-called "Maybe" monad. That was 1.5 years ago.

Comments

This is a neat solution to a problem that's combinatorially large. I think the code is clean and easy to understand.

I do have one question though - it appears to me that the WordPath class is used only for running the two methods. Wouldn't it be cleaner (two fewer lines and no "selfs") to just have the two methods in the global name space? What does the class structure bring to the party?

Don't get me wrong; I'm not trying to trash the code. I'm sure I couldn't do as well. I'm just trying to improve my understanding of good python style.

Thanks.

By steve_g, # 17. August 2006, 01:35:01

Hi Steve,

The solution is implemented with a class because of the Google Code Jam rules. They give the class structure and the competitor has to implement it. For this task the Java interface was:


Class:
    WordPath
    Method:
    countPaths
    Parameters:
    String[], String
    Returns:
    int
    Method signature:
    int countPaths(String[] grid, String find)
    (be sure your method is public)

By ipeev, # 17. August 2006, 04:41:58

Here's a version which doesn't use a Dictionary to cache stuff :


class WordPath:
   def countPaths(self, grid, word):
       last_round = [[((x == word[0] and 1) or 0) for x in y] for y in grid]
       len_grid = len(grid)
       len_grid_0 = len(grid[0])
       for letter in word[1:]:
           this_round  = [[0 for x in range(len_grid_0)] for y in range(len_grid)]
           for x in range(len_grid):
               for y in range(len_grid_0):
                   if grid[x][y] == letter:
                       for a in (x-1, x, x+1):
                           for b in (y-1, y, y+1):                           
                               if a >= 0 and a < len_grid and b >= 0 and b < len_grid_0 and not (a == x and b == y):
                                   this_round[x][y] += last_round[a][ b]
           last_round = this_round
       total = sum(sum(x) for x in last_round)
       if total > 1000000000:
           total = -1
       return total

Runs a little faster for the tests (on my machine anyway) but starts running a lot faster if you remove the "> 1000000000" tests and use large test values.

EDIT: Made it a bit more efficient

By almostobsolete, # 17. August 2006, 14:48:40

Very interesting!

After looking for several minutes at the code I think I understand now why and how this solution works.

But I am not sure why it is faster. It apparently doesn't use a cache. Maybe it is because it doesn't use recursion. I wish Python would support recursive functions better some day.

Anyway, I measured 2.9 seconds on the same computer. Removing the check for "> 1000000000" doesn't improve the time, from what I see.

The trick with the early interruption can probably be used here too, because this is far behind the fastest solution, the 0.03-second one with the exceptions.

By ipeev, # 17. August 2006, 17:04:33

Sorry, I wasn't very clear in my last message. What I meant was that it's quicker in getting the correct result (303835410591851117616135618108340196903254429200) for big all A's test (print test.countPaths( [ "A"*50 for a in range(50)], "A"*50)) and it gets relatively faster as the size of the grid or the word is increased.

By almostobsolete, # 17. August 2006, 19:05:31

Thomas writes:

I was coding this in the statistical language R (which is not fast enough, but comes surprisingly close with sparse transition matrix operations) and this got me thinking about the overflow exception case.

I don't think that all A's is the hardest case by a long way. Suppose you have a 50x50 grid of As with a B in the last entry and that the test word is 49 As and a B. You won't know until you look at the last entry whether there are $BIGNUM solutions or none.

By anonymous user, # 21. August 2006, 14:49:54

Interesting observation Thomas. I checked to see how the execution time will change in the new extreme case for the solution with the exceptions.


t = timeit.Timer(stmt='WordPath().countPaths( [ "A"*50 for a in range(50)], "A"*50)',
                   setup = 'from __main__ import WordPath')
print "%.4f sec/pass" %  (t.timeit(number=10)/10)

A = [ "A"*50 for a in range(49)] + ["A"*49 + "B"]
W = "A"*49 + "B"
t = timeit.Timer(stmt='WordPath().countPaths( A,W)',
                   setup = 'from __main__ import WordPath, A,W')
print "%.4f sec/pass" %  (t.timeit(number=10)/10)

The first test is with all "A"s and the second is with "A"s and only 1 "B" at the end.

Here are the measured results:

0.0114 sec/pass
0.9282 sec/pass

We see that the time indeed increased roughly 80 times. But it is still under 1 second and much better than the 2.4545 sec/pass of my first solution without the exceptions.

But let us see how the solution provided by almostobsolete handles this case. Running the same 2 tests gives:

1.3069 sec/pass
1.2378 sec/pass

His algorithm handles the new extreme case a little better. Apparently it spends its time summing all the solutions in the all-"A"s case.

By ipeev, # 22. August 2006, 06:53:36

Saturday, December 17, 2005

On Functional Programming

I have been interested in functional programming for some time now. Very interesting.

Sunday, May 15, 2005

Getting started

This is my new blog.
