That happened during discussion on the reddit's comments thread about his spell checker: http://programming.reddit.com/info/1gb59/comments
ipeev 2 points 4 days ago*
I see a small problem in the Norvig's code. This part is supposed to make 26n alterations:
[word[0:i]+c+word[i+1:] for i in range(n) for c in string.lowercase] + ## alteration
But not really. This is what string.lowercase returns on my computer:
import string
string.lowercase
'abcdefghijklmnopqrstuvwxyz\x83\x9a\x9c\x9e\xaa\xb5\xba\xdf\xe0 \xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0 \xf1\xf2\xf3\xf4\xf5\xf6\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'
Total of 65 characters. Same for insertion.
Is this a real problem? It could be! One has to be aware of the Python's battery included philosophy.
Then someone responds:
vetinari 1 point 2 days ago
Check your locale, mine is fine:
>>> import string
>>> [c for c in string.lowercase]
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
>>>
He is right too, but I am not going to change my locale, so I respond:
ipeev 1 point 2 days ago
My point is that the Norvig's code is not aware of the localization features of Python. If he wants his program to run on any computer, the simplest think to do is use the first 26 letters only:
string.lowercase[:26]
But Norvig didn't exactly listened to me and implemented different solution:
alphabet = 'abcdefghijklmnopqrstuvwxyz'
def edits1(word):
n = len(word)
return set([word[0:i]+word[i+1:] for i in range(n)] + # deletion
[word[0:i]+word[i+1]+word[i]+word[i+2:] for i in range(n-1)] + # transposition
[word[0:i]+c+word[i+1:] for i in range(n) for c in alphabet] + # alteration
[word[0:i]+c+word[i:] for i in range(n+1) for c in alphabet]) # insertion
So instead of choosing my cryptic syntax he implemented it with explicit alphabet instead. Apparently he also thinks that explicit is better than implicit.
At the end of his page he writes:
Originally my program was 20 lines, but a reader pointed out that I had used string.lowercase, which in some locales in some versions of Python, has more characters than just the a-z I intended. So I added the variable alphabet to make sure.
I think I might have been that reader. Just for information the locale in question is Bulgarian. Sorry for causing the trouble with all those regional alphabets, I know it a pain in the ... . I like his site very much and enjoyed the "Tutorial on Good Lisp Programming Style" and even I have it printed on paper at home!
UPDATE: I've sent an email to Norvig and asked him about this and he responded that indeed, he did read my comment on reddit and thats why he changed the implementation. Now my name is there too which is cool!
Originally my program was 20 lines, but Ivan Peev pointed out that I had used string.lowercase, which in some locales in some versions of Python, has more characters than just the a-z I intended. So I added the variable alphabet to make sure. I could have used string.ascii_lowercase.
I do have one question though - it appears to me that the WordPath class is used only for running the two methods. Wouldn't it be cleaner (two fewer lines and no "selfs") to just have the two methods in the global name space? What does the class structure bring to the party?
Don't get me wrong; I'm not trying to trash the code. I'm sure I couldn't do as well. I'm just trying to improve my understanding of good python style.
Thanks.
By steve_g, # 17. August 2006, 01:35:01
The solution is implemented with a class because of the Google Code Jam rules. They give the class structure and the competitor have to implement it. For this task the Java interface was:
By ipeev, # 17. August 2006, 04:41:58
Runs a little faster for the tests (on my machine anyway) but starts running a lot faster if you remove the "> 1000000000" tests and use large test values.
EDIT: Made it a bit more efficient
By almostobsolete, # 17. August 2006, 14:48:40
After looking for several minutes at the code I think I understand now why and how this solution works.
But not sure why it is faster. It aparently doesn't use cache. Maybe because it doesn't use recursion. I wish Python supports better recursive functions some day.
Anyway. Measured 2.9 seconds on the same computer. Removing the check for "> 1000000000" doesn't improve the time from what I see.
Probably the trick with the early interruption can be used here too, because it is far behind the fastest 0.03 seconds solution with the exceptions.
By ipeev, # 17. August 2006, 17:04:33
By almostobsolete, # 17. August 2006, 19:05:31
I was coding this in the statistical language R (which is not fast enough, but comes surprisingly close with sparse transition matrix operations) and this got me thinking about the overflow exception case.
I don't think that all A's is the hardest case by a long way. Suppose you have a 50x50 grid of As with a B in the last entry and that the test word is 49 As and a B. You won't know until you look at the last entry whether there are $BIGNUM solutions or none.
By anonymous user, # 21. August 2006, 14:49:54
The first test is with all "A"s and the second is with "A"s and only 1 "B" at the end.
Here are the measured results:
We see that indeed the time increased about 100 times. But it is still under 1 second and much better than 2.4545 sec/pass for my first solution without the exceptions.
But let see how the solution provided by almostobsolete will handle this case. Running the same 2 tests gives:
His algorythm handles a little better the new extreme case. Apparently it is using time to sum all solutions in the "A"s only case.
By ipeev, # 22. August 2006, 06:53:36