Saturday, April 14, 2007

Peter Norvig's spell checker

I think Peter Norvig slightly changed the implementation of his spell checker because of me: http://www.norvig.com/spell-correct.html

That happened during the discussion in the reddit comment thread about his spell checker: http://programming.reddit.com/info/1gb59/comments



ipeev 2 points 4 days ago*

I see a small problem in Norvig's code. This part is supposed to make 26n alterations:

[word[0:i]+c+word[i+1:] for i in range(n) for c in string.lowercase] + ## alteration

But not really. This is what string.lowercase returns on my computer:

import string

string.lowercase

'abcdefghijklmnopqrstuvwxyz\x83\x9a\x9c\x9e\xaa\xb5\xba\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'

Total of 65 characters. Same for insertion.

Is this a real problem? It could be! One has to be aware of Python's batteries-included philosophy.


Then someone responds:



vetinari 1 point 2 days ago

Check your locale, mine is fine:

>>> import string
>>> [c for c in string.lowercase]
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
>>>


He is right too, but I am not going to change my locale, so I respond:


ipeev 1 point 2 days ago

My point is that Norvig's code is not aware of the localization features of Python. If he wants his program to run on any computer, the simplest thing to do is use the first 26 letters only:

string.lowercase[:26]



But Norvig didn't exactly listen to me and implemented a different solution:


alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    n = len(word)
    return set([word[0:i]+word[i+1:] for i in range(n)] +                     # deletion
               [word[0:i]+word[i+1]+word[i]+word[i+2:] for i in range(n-1)] + # transposition
               [word[0:i]+c+word[i+1:] for i in range(n) for c in alphabet] + # alteration
               [word[0:i]+c+word[i:] for i in range(n+1) for c in alphabet])  # insertion






So instead of adopting my cryptic slice syntax, he used an explicit alphabet. Apparently he also thinks that explicit is better than implicit.
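To get a feel for how much the extra characters matter, here is a rough Python 3 sketch (not Norvig's code). It counts the raw candidates produced per edit type for a 26-letter alphabet and for a 65-character one. The names raw_edit_counts, EXTRA and locale_alphabet are made up for this illustration, and the 65-character string is only a stand-in for what string.lowercase returned on my machine under Python 2, rebuilt from the output quoted above.

```python
import string

# The 39 extra characters from the quoted string.lowercase output, written as
# Python 3 str code points. This is an illustrative stand-in, not a real
# locale query.
EXTRA = ('\x83\x9a\x9c\x9e\xaa\xb5\xba\xdf'
         '\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef'
         '\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff')

ascii_alphabet = string.ascii_lowercase      # always exactly a-z
locale_alphabet = ascii_alphabet + EXTRA     # hypothetical 65-character alphabet


def raw_edit_counts(word, alphabet):
    """Count candidates per edit type, before set() removes duplicates."""
    n = len(word)
    return {
        'deletion': len([word[:i] + word[i+1:] for i in range(n)]),
        'transposition': len([word[:i] + word[i+1] + word[i] + word[i+2:]
                              for i in range(n - 1)]),
        'alteration': len([word[:i] + c + word[i+1:]
                           for i in range(n) for c in alphabet]),
        'insertion': len([word[:i] + c + word[i:]
                          for i in range(n + 1) for c in alphabet]),
    }


print(raw_edit_counts('hello', ascii_alphabet))
print(raw_edit_counts('hello', locale_alphabet))

# The slice I suggested recovers a-z only because the ASCII letters come first.
print(locale_alphabet[:26] == ascii_alphabet)
```

For a five-letter word the alterations grow from 26n = 130 to 65n = 325, and the insertions from 156 to 390. The slice works here, but only because a-z happens to come first in the string, which is a good reason to prefer the explicit alphabet.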

At the end of his page he writes:
Originally my program was 20 lines, but a reader pointed out that I had used string.lowercase, which in some locales in some versions of Python, has more characters than just the a-z I intended. So I added the variable alphabet to make sure.


I think I might have been that reader. For the record, the locale in question is Bulgarian. Sorry for causing trouble with all those regional alphabets; I know it's a pain in the ... I like his site very much, enjoyed the "Tutorial on Good Lisp Programming Style", and even have it printed on paper at home!

UPDATE: I sent an email to Norvig asking about this, and he responded that he had indeed read my comment on reddit and that's why he changed the implementation. Now my name is there too, which is cool!

Originally my program was 20 lines, but Ivan Peev pointed out that I had used string.lowercase, which in some locales in some versions of Python, has more characters than just the a-z I intended. So I added the variable alphabet to make sure. I could have used string.ascii_lowercase.
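The string.ascii_lowercase constant he mentions can be checked with a small sketch like the one below. It assumes Python 3, where as far as I know the locale-dependent string.lowercase no longer exists at all, so the hasattr line is only there to probe that.

```python
import string

# The explicit alphabet from the revised edits1 code.
alphabet = 'abcdefghijklmnopqrstuvwxyz'

# string.ascii_lowercase is a fixed constant and does not depend on the locale.
print(string.ascii_lowercase == alphabet)
print(len(string.ascii_lowercase))

# Probe for the old locale-dependent name on this interpreter.
print(hasattr(string, 'lowercase'))
```

Either spelling gives the same 26 letters, so choosing between the literal and string.ascii_lowercase is a matter of taste.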
