Saturday, September 27, 2008

Python: Removing items from a list

While sitting on a plane from Santa Clara to Austin (more on that in another post), I finally sat down and wrote a simple test program to answer the question "what's the fastest way to remove a known item from a list in Python?"

Python has some surprising performance behaviors, and removing items from a list is no exception. I came up with five methods that I thought someone might implement for removing stuff from a list, along with a bit of glue code to time everything (I used the timeit module for that; hooray for batteries included).
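
In broad strokes, the five approaches look something like this. Treat this as a minimal sketch of each strategy rather than the exact code from the file, so the timed versions may differ in detail:

def ghetto_remove(lst, value):
    # Build a new list by hand, appending everything we want to keep.
    result = []
    for item in lst:
        if item != value:
            result.append(item)
    return result

def repeated_remove(lst, value):
    # Keep calling list.remove() until the value is gone; every pass
    # rescans the list from the front.
    while value in lst:
        lst.remove(value)
    return lst

def comp_remove(lst, value):
    # One pass with a list comprehension, producing a new list.
    return [item for item in lst if item != value]

def old_remove(lst, value):
    # The pre-comprehension idiom: filter() with a lambda
    # (filter() returns a list on Python 2).
    return filter(lambda item: item != value, lst)

def hard_remove(lst, value):
    # Delete in place by index, walking backwards so deletions don't
    # shift elements we haven't examined yet.
    for i in range(len(lst) - 1, -1, -1):
        if lst[i] == value:
            del lst[i]
    return lst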

I've posted the code in a file (you can get it here), including the removal functions. If you keep reading, I'll assume you've looked at the source code; we have to have some basis for communication, after all.
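
The glue code is a thin wrapper around timeit. A minimal sketch, assuming the two command-line arguments are the number of distinct values and the list length (you can infer that from the invocations below) and that the removal functions above sit in the same listremove.py module, might look like this:

import random
import sys
import timeit

# Sketch of the timing harness; the real glue code in the file may differ.
# Assumes the five removal functions above are defined in this same module.

NAMES = ("ghetto_remove", "repeated_remove", "comp_remove",
         "old_remove", "hard_remove")

def time_remover(name, distinct, length, trials=10):
    setup_template = ("from __main__ import %s\n"
                      "import random\n"
                      "data = [random.randrange(%d) for _ in range(%d)]\n"
                      "target = random.choice(data)")
    setup = setup_template % (name, distinct, length)
    # Pass a fresh copy on every call so the in-place removers don't see
    # an already-cleaned list.
    timer = timeit.Timer("%s(list(data), target)" % name, setup)
    return timer.repeat(repeat=trials, number=10)

if __name__ == "__main__":
    distinct, length = int(sys.argv[1]), int(sys.argv[2])
    print("%-16s %-9s %-9s %-9s %-9s" % ("name", "min", "max", "mean", "median"))
    for name in NAMES:
        times = sorted(time_remover(name, distinct, length))
        mean = sum(times) / len(times)
        median = times[len(times) // 2]
        print("%-16s %-9.6f %-9.6f %-9.6f %-9.6f"
              % (name, min(times), max(times), mean, median))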

Here are the results of running the module on my MacBook Air, while sitting on the airplane. Your mileage may vary.

McJohn$ python listremove.py 1024 4096
name             min       max       mean      median
ghetto_remove    0.011734  0.024060  0.015157  0.014703
repeated_remove  0.004583  0.025112  0.012059  0.010892
comp_remove      0.009502  0.020047  0.012323  0.011900
old_remove       0.017689  0.035935  0.022772  0.022382
hard_remove      0.005304  0.014384  0.007783  0.007205

What's surprising to me about this isn't that the 'hard remove' is the fastest on average, but that its worst case beats the average runtime of most of the other algorithms. The one standout exception is the 'repeated remove' best case, which is staggeringly good. That's quite shocking in and of itself, but I suspect it would get worse if there were more collisions in the list itself. We can test this by reducing the number of distinct random values allowed in the list while keeping the list length the same. Here are the results for that trial:

McJohn$ python listremove.py 16 4096
name             min       max       mean      median
ghetto_remove    0.139576  0.214912  0.175577  0.177581
repeated_remove  2.597167  4.307188  3.578634  3.592269
comp_remove      0.090875  0.144086  0.116044  0.116585
old_remove       0.168973  0.239456  0.206943  0.207778
hard_remove      0.399565  0.668813  0.534192  0.542308

It turns out that this makes the repeated_remove case staggeringly bad, causing a massive increase in runtime. I wasn't actually prepared for it to get this bad (the performance tanked so badly that I was fairly certain there was a hang in my code somewhere). Unfortunately, that wasn't the case; repeated_remove really does get that slow in the face of massive numbers of collisions. That shouldn't be too surprising, though. When an element appears many times in the list, the repeated_remove code has O(N^2) worst-case performance: it can end up traversing the list up to N times (where N is also the length of the list). With 4096 elements drawn from only 16 distinct values, the target shows up roughly 256 times, so the loop rescans the list hundreds of times. Of course, each of those scans should be very short, finding a match after only a few elements on average, so something else is clearly eating the time; my suspicion is remove() itself, since every successful remove() also has to shift everything after the match down by one slot.
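
To see the collision effect in isolation, here's a quick standalone sanity check (separate from the benchmark above) that pits the rescan-and-remove loop against a single-pass comprehension on a list where the target value dominates:

import timeit

# Toy demonstration of why the rescan-and-remove pattern collapses when
# the target appears many times: every 'in' test rescans from the front,
# and every successful remove() shifts the tail of the list down a slot.

setup = "data = [7, 3] * 1024"

rescan_and_remove = """
lst = list(data)
while 7 in lst:
    lst.remove(7)
"""

single_pass = """
lst = [x for x in data if x != 7]
"""

for label, stmt in (("while/remove", rescan_and_remove),
                    ("comprehension", single_pass)):
    best = min(timeit.Timer(stmt, setup).repeat(repeat=3, number=10))
    print("%-14s %.6f" % (label, best))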

Also a bit surprising in this case is that the hard_remove performance gets quite bad; in fact, only the 'comp_remove' approach seems to hold up well here.
