bpo-38938: Added special case to heapq.merge for small numbers of iterables #17422


Closed
sweeneyde wants to merge 18 commits

Conversation

@sweeneyde (Member) commented Dec 1, 2019

@rhettinger (Contributor)

Please don't reformat the whole module. Make sure the diff touches a minimum number of lines.

Also, I had in mind just the special case for n<=2 (just a single variable for each iterable, no binary tree).
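
For context, a minimal sketch of what such a two-iterable fast path could look like, ignoring key= and reverse= and using a hypothetical helper name _merge2 (illustrative only, not the actual patch):

def _merge2(a, b):
    # Illustrative only: stable two-way merge of two sorted iterables,
    # buffering a single value per iterable and comparing with __lt__ only.
    a, b = iter(a), iter(b)
    try:
        a_val = next(a)
    except StopIteration:
        yield from b
        return
    for b_val in b:
        # Emit everything from a that sorts at or before b_val, then b_val.
        while not (b_val < a_val):      # ties go to a, keeping the merge stable
            yield a_val
            try:
                a_val = next(a)
            except StopIteration:
                yield b_val
                yield from b
                return
        yield b_val
    # b is exhausted; flush the buffered value and the rest of a.
    yield a_val
    yield from a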

Lib/heapq.py Outdated
@@ -329,6 +329,116 @@ def merge(*iterables, key=None, reverse=False):

'''

n = len(iterables)
if n == 0:
@rhettinger (Contributor)

Let's drop the n==0 and n==1 cases. They were already fast.

@bedevere-bot

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

@sweeneyde (Member, Author)

I have made the requested changes; please review again.

@bedevere-bot

Thanks for making the requested changes!

@rhettinger: please review the changes made to this pull request.

@rhettinger (Contributor)

Stand by for a bit. Am still reviewing the code. The binary tree variant is looking better than expected.

Am trying out starting the function with:

def merge(*seq):
    if len(seq) > 2:
        n = (len(seq) + 1) // 2
        seq = merge(*seq[:n]), merge(*seq[n:])
    ...

So far in testing, the binary tree variant is making fewer comparisons than the heap variant when n > 3.

@rhettinger (Contributor)

Also, for the comparisons, we're going to need to flip them around and/or invert them so that only __lt__ is used, the same as we do for bisect(), the other heapq functions, and sorted().
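
Concretely, one way to arrange that (a sketch, not the patch; a_key, b_key, and take_b are illustrative names, while reverse is merge()'s keyword argument):

# Sketch: choosing the comparison for one step of a two-way merge so that
# only __lt__ is ever invoked, as bisect(), the heapq functions, and sorted() do.
if reverse:
    take_b = a_key < b_key    # descending: take b first only when it is strictly larger
else:
    take_b = b_key < a_key    # ascending: take b first only when it is strictly smaller
# Either way, ties go to a, which keeps the merge stable.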

@sweeneyde (Member, Author)

Using the binary-tree-of-iterables approach, is there a good way to prevent the re-computation of keys? The only way I've thought of is by taking iterables = [((key(x), i, x) for i, x in enumerate(it)) for it in iterables], combining using the same approach, then yielding only the last part of each tuple in the final step.

@tim-one (Member) commented Dec 3, 2019

is there a good way to prevent the re-computation of keys? The only way I've thought of is by taking iterables = [((key(x), i, x) for i, x in enumerate(it))

I don't think there's a need for enumerate(), because this other approach really doesn't need to use tuple comparisons at all. Comparing tuples is another layer of expense.

Instead each input iterable should be wrapped in a generator that yields (element, key(element)) pairs (or in the other order - doesn't matter). The keys should be compared directly; the elements should never be compared; and an index from enumerate() would be useless then.

@sweeneyde (Member, Author)

So then for the case where n > 2, to account for the key recursively, I think we could do something like the following:

# outside the function:
import operator
_SUB0 = operator.itemgetter(0)
_SUB1 = operator.itemgetter(1)

...

# after ensuring len(iterables) > 2:
n2 = len(iterables)//2
if key is None:
    result_1 = merge(*iterables[:n2], key=None, reverse=reverse)
    result_2 = merge(*iterables[n2:], key=None, reverse=reverse)
    yield from merge(result_1, result_2, key=None, reverse=reverse)
elif key is _SUB1:
    # recursive call, keys already computed
    result_1 = merge(*iterables[:n2], key=_SUB1, reverse=reverse)
    result_2 = merge(*iterables[n2:], key=_SUB1, reverse=reverse)
    yield from merge(result_1, result_2, key=_SUB1, reverse=reverse)
else:
    # Top-level call: zip all of the iterators up with their keys once and for all.
    key_iterables = [zip(it, map(key, it)) for it in iterables]
    # merge the zips, keyed by the second attribute (where we put the key)
    result_1 = merge(*key_iterables[:n2], key=_SUB1, reverse=reverse)
    result_2 = merge(*key_iterables[n2:], key=_SUB1, reverse=reverse)
    result = merge(result_1, result_2, key=_SUB1, reverse=reverse)
    # Now unpack the original values
    yield from map(_SUB0, result)

Are inter-module dependencies like this operator call okay? Or would we be better off doing
def _SUB0(x): return x[0], etc.?

@tim-one (Member) commented Dec 3, 2019

Using functions from other modules generally isn't a problem. I'm not exactly clear on what you have in mind, though. I had this in mind, just showing a typical chunk of what would be the path in the case where there is a key function and it's not reversed:

if a_key < b_key:
    yield a_tuple
    try:
        a_value, a_key = a_tuple = next_a()
    except StopIteration:
        yield b_tuple
        yield from b_iter
        return

I can't guess whether this would be slower or faster than using an itemgetter, and really don't care - I don't stress over micro-optimizations one way or the other, as they usually end up depending a whole lot on the specific compiler+options being used, and "the answer" can change over time anyway. If timing shows a "big" difference, then it may be worth giving up a bit of clarity.
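
Filling that fragment out into a complete keyed, non-reversed two-way merge might look like the following (all names are illustrative; the comparison is arranged so that a wins ties, which keeps the merge stable):

def _merge2_keyed(a_iter, b_iter):
    # a_iter and b_iter yield (value, key) pairs, as produced by the
    # wrapping generators discussed above; only the keys are ever compared.
    a_iter, b_iter = iter(a_iter), iter(b_iter)
    next_a, next_b = a_iter.__next__, b_iter.__next__
    try:
        a_value, a_key = a_tuple = next_a()
    except StopIteration:
        yield from b_iter
        return
    try:
        b_value, b_key = b_tuple = next_b()
    except StopIteration:
        yield a_tuple
        yield from a_iter
        return
    while True:
        if b_key < a_key:
            yield b_tuple
            try:
                b_value, b_key = b_tuple = next_b()
            except StopIteration:
                yield a_tuple
                yield from a_iter
                return
        else:
            yield a_tuple
            try:
                a_value, a_key = a_tuple = next_a()
            except StopIteration:
                yield b_tuple
                yield from b_iter
                return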

@sweeneyde (Member, Author)

I was suggesting leaving the n==2 case as I had implemented, and then for more iterables, reducing to this 2-case in a way that prevents the re-computation of keys. The current diff contains this alternative algorithm.

I did some rough benchmarking on this version using the code here (https://pastebin.com/qhKzcLzZ), and it showed better performance for the tree-of-iterables version in almost all situations, especially when no key is used, and most of all in the n==2 case.

I don't think there should be too much of a concern about a very unbalanced tree either, because each comparison advances some item toward the root of the tree, and each item must be advanced treedepth times, which is bounded above by ceil(log2(len(iterables))). The total number of comparisons needed is bounded above by (# of total items) * ceil(log2(len(iterables))), fewer than is typically used by repeated applications of heapreplace.
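
To make that bound concrete: with 8 iterables, the tree is ceil(log2(8)) = 3 levels deep, so m total items cost at most 3*m comparisons. A heap over 8 iterables is also 3 levels deep, but CPython's heapreplace compares the two children at each level on its way down and then sifts back up, so its per-item comparison count is typically somewhat above 3.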

@tim-one (Member) commented Dec 4, 2019

I don't believe that viewing this as a binary tree was helpful to begin with 😉. At the extreme, we're merging N 1-element iterables, and then the new approach is just an ordinary top-down recursive merge sort, but with lazy merging. As such, it "should" do fewer compares than a heap-based sort.

@tim-one (Member) commented Dec 4, 2019

Without trying it, my guess is that this kind of thing would be a relatively poor case for the newer approach:

>>> import heapq
>>> N = 2**20
>>> raw = [[0]] * N
>>> raw.append([1] * N)
>>> xs = heapq.merge(*raw)
>>> xs = list(xs)
>>> xs == [0] * N + [1] * N
True

There are over a million iterables here, so the heap grows to about 20 levels deep, and a recursive mergesort goes down about 20 levels. After about half the results are delivered, the heap is reduced to a single element, which pushes and pops 1 N times. The mergesort is similarly reduced to being fed by only the original [1] * N iterable, but that remains buried under about 20 levels of recursive yielding. Unless I'm picturing something wrong.

If I'm not, the layers of recursive yielding could be considerably more expensive than pushing and popping from a 1-element heap N times.

But this is all highly contrived.
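
To isolate just the cost of stacked generator layers (a toy illustration of the overhead described above, not a benchmark of either merge implementation; the real merge would use yield from, which is somewhat cheaper per hop than the genexp layers here):

import time

def chain_layers(it, depth):
    # Wrap `it` in `depth` nested pass-through generators, so every item
    # is re-yielded `depth` times before it reaches the consumer.
    for _ in range(depth):
        it = (x for x in it)
    return it

N = 2**20
t0 = time.perf_counter()
sum(chain_layers(range(N), 20))
t1 = time.perf_counter()
sum(range(N))
t2 = time.perf_counter()
print(f"20 layers: {t1 - t0:.3f}s    direct: {t2 - t1:.3f}s")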

@sweeneyde (Member, Author)

Oops--it looks like you're right: I didn't benchmark diverse enough inputs. I did a different benchmark (https://pastebin.com/88jkw5Yx, with results visualized at https://imgur.com/a/mtEO0uA) testing different levels of "balance" in the lengths of the iterables, and it looks like the 2-iterable case was the only one that showed a nontrivial improvement across the board. The rest of the time, I believe the automatic shrinking of the heap, and the heap's indifference to where its items come from, beat out any small constant-factor decrease in the number of comparisons from the recursive mergesort approach.

My apologies for the repeated changes, and thank you both for your guidance.

@tim-one (Member) commented Dec 5, 2019

I'm not making a judgment 😉. Comparisons can be arbitrarily expensive in Python, and when they are, an approach that cuts the number of compares can beat anything else.

Raymond needs to chime in too. What I guess is most common: no more than a few dozen iterable inputs, all of approximately the same length, with perhaps one unusually short oddball. That's what you naturally get when, e.g., you use merge() to sort a sequence too large to fit in RAM, and break it into sub-RAM sized pieces, each of which is .sort()ed independently, and written to disk. In such cases the heap never gets very deep, but neither does the recursion depth, and the heap doesn't get materially smaller until very near the end.
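
For reference, that pattern looks roughly like the following sketch (simplified: values are assumed to be newline-free strings, and the external_sort name, chunk size, and temp-file handling are all illustrative):

import heapq
import itertools
import tempfile

def external_sort(items, chunk_size=100_000):
    # Sort a large iterable of strings with bounded memory: sort fixed-size
    # chunks, spill each sorted run to its own temporary file, then lazily
    # merge the runs with heapq.merge.
    run_files = []
    it = iter(items)
    while True:
        chunk = sorted(itertools.islice(it, chunk_size))
        if not chunk:
            break
        f = tempfile.TemporaryFile("w+t")
        f.writelines(value + "\n" for value in chunk)
        f.seek(0)
        run_files.append(f)
    runs = [(line.rstrip("\n") for line in f) for f in run_files]
    return heapq.merge(*runs)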

Optimizing for a million iterables (as my "bad case" example did) would be plain insane 😉.

@rhettinger (Contributor)

Thanks for all the comments. I'll have more time for this one over the holidays and will bring it to fruition.

@sweeneyde (Member, Author)

https://pastebin.com/t048iBeL

Here's an interesting non-recursive pure-Python version of the mergesort-style algorithm. It seems to regularly beat the current heapq module when that module is forced to use its pure-Python implementation, and it was beating the heapq module across the board on my machine running PyPy, so maybe that is an indication that it would outperform the existing heapq.merge function if implemented in C.
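
The pastebin code itself isn't reproduced here; purely to illustrate the general shape of a non-recursive mergesort-style merge (not the linked implementation), one can repeatedly merge adjacent pairs of iterables until a single lazily merged stream remains:

def merge_tree(merge2, *iterables):
    # Illustration only: repeatedly merge adjacent pairs of iterables until
    # a single lazily merged stream remains.  `merge2` is any lazy two-way
    # merge generator, e.g. one of the sketches earlier in this thread.
    iterables = list(iterables)
    if not iterables:
        return iter(())
    while len(iterables) > 1:
        merged = [merge2(iterables[i], iterables[i + 1])
                  for i in range(0, len(iterables) - 1, 2)]
        if len(iterables) % 2:
            merged.append(iterables[-1])
        iterables = merged
    return iterables[0]

Called as merge_tree(_merge2, [1, 4], [2, 3], [0, 5]) with a two-way merge like the earlier _merge2 sketch, it would lazily yield 0, 1, 2, 3, 4, 5.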

@sweeneyde (Member, Author)

I was curious and wanted to get to know some cpython internals, so I took a crack at a C implementation: #17729.
