
bpo-38938: Adjusting the heapq.merge algorithm for fewer comparisons #17729

Open: wants to merge 7 commits into base: master

Conversation

sweeneyde (Contributor) commented Dec 28, 2019

This pull request would:

  • Change heapq.merge to be a class whose instances are iterable objects, rather than a generator function.
  • Change the algorithm for heapq.merge:
    • Before this PR, we repeatedly applied heapreplace on (index, iterator, item) tuples
      • Starting at the root, this would repeatedly replace parents with their smaller child. Upon reaching a leaf, it would replace that leaf using the __next__ method of the iterator corresponding to the item we are replacing. It would then sift that new leaf down into the appropriate part of the heap.
    • After this PR, we would apply a "tree sift" method:
      • General algorithm: the items compete in a tournament, as we maintain a binary "tree of losers" (see Knuth Volume 3, Chapter 5.4.1. on "Multiway Merging").
      • Here, we only need to store individual items, or key-item pairs.
      • After popping off the root, we replace it with its smaller child, the child with its smaller child, etc., until reaching a leaf. So far, this is identical to the standard siftup method.
      • However, once reaching a leaf, we refill the leaf with the next item from the iterator corresponding to the index of that leaf. That is, the leftmost leaf is always refilled from the first iterable passed, the second-leftmost from the second, ..., and the rightmost leaf from the last iterable passed.
      • This eliminates the need to sift the new leaf down afterward, because the next item from a sorted iterator is no smaller than the item it replaces.
      • For tree sizes that aren't a power of two, permuting the iterators so that the first one is leftmost is simply a matter of shifting all iterators around the list by a few spots.
      • Heuristically, dropping the requirement that the heap hold exactly one item from each iterator at a time makes it more likely that the tree holds the items nearest the front of the merged output.
      • Storing the tree flatly (for the C implementation, even when using a key), rather than having a tuple at each node helps with cache locality.
      • This is a non-recursive version of recursively applying a function that merges two iterables.
  • Add the corresponding C and Python implementations.
  • Add a test case for error handling in heapq.merge.
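
The tree-sift algorithm described above can be sketched in pure Python. This is a simplified illustration, not the PR's implementation: it pads the number of iterators up to a power of two with empty iterators and marks exhausted slots with a sentinel, whereas the PR handles non-power-of-two sizes directly by permuting the iterators. The names (`tree_merge`, `_tree_sift`) are made up for this sketch.

```python
class _Sentinel:
    """Placeholder for an exhausted iterator; compares greater than any item."""
    def __lt__(self, other):
        return False
    def __gt__(self, other):
        return True

_INF = _Sentinel()

def _tree_sift(tree, pos, k, iterators):
    """Fill the hole at tree[pos]: promote the smaller child into each hole
    down to a leaf, then refill that leaf from its own iterator."""
    while pos < k:                        # internal node: promote smaller child
        child = 2 * pos
        if tree[child + 1] < tree[child]:
            child += 1
        tree[pos] = tree[child]
        pos = child
    # Leaf k+j is always fed by iterators[j]; no sift-down is needed because
    # the next item from a sorted iterator is no smaller than the one removed.
    tree[pos] = next(iterators[pos - k], _INF)

def tree_merge(*iterables):
    """Merge sorted iterables tournament-style (cf. Knuth Vol. 3, 5.4.1)."""
    iterators = [iter(it) for it in iterables]
    k = 1
    while k < max(len(iterators), 1):     # pad to a power of two
        k *= 2
    iterators += [iter(())] * (k - len(iterators))
    # Flat tree: root at index 1, children of node i at 2*i and 2*i + 1,
    # leaves at indices k .. 2*k - 1.
    tree = [_INF] * (2 * k)
    for i in range(2 * k - 1, 0, -1):     # build bottom-up
        _tree_sift(tree, i, k, iterators)
    while tree[1] is not _INF:
        yield tree[1]                     # the root is the overall minimum
        _tree_sift(tree, 1, k, iterators)
```

For example, `list(tree_merge([1, 4], [2, 3]))` produces `[1, 2, 3, 4]`. Each `_tree_sift` does one comparison per level and, as the bullet above notes, never needs a sift-down after refilling the leaf.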

According to my benchmarking, this is at least 2.5 times as fast as the existing implementation in almost all cases, and especially fast (about 5x as fast) with small numbers of iterables. I tested up to 12,000 iterables, with varying degrees of "balance" among the numbers of items in each. Also noteworthy: the proposed method needed fewer comparisons on average.

https://bugs.python.org/issue38938

pablogsal (Member) left a comment

I made a quick review of general aspects of the PR and left some comments. When I have more time I can do a more in-depth review of the PR.

Py_TPFLAGS_BASETYPE, /* tp_flags */
merge_doc, /* tp_doc */
(traverseproc)merge_traverse, /* tp_traverse */
0, /* tp_clear */

pablogsal (Member) commented Dec 29, 2019

Please, implement tp_clear (we could perhaps prove that this object cannot participate in reference cycles, but that is risky; implementing it is straightforward and ensures correctness).

Modules/_heapqmodule.c (resolved)
Py_DECREF(tmpargs);
}

if (keyfunc == Py_None)

pablogsal (Member) commented Dec 29, 2019

Please, follow PEP 7 (add braces).

}

iterators = PyList_New(num_iters);
if (iterators == NULL)

pablogsal (Member) commented Dec 29, 2019

Please, follow PEP 7 (add braces).

}

mo = (mergeobject *)type->tp_alloc(type, 0);
if (mo == NULL)

pablogsal (Member) commented Dec 29, 2019

Please, follow PEP 7 (add braces).

cmp = PyObject_RichCompareBool(
childkey, otherchildkey, Py_LT);
if (cmp < 0) {
arr[pos] = NULL;

pablogsal (Member) commented Dec 29, 2019

Why are we setting this to NULL? If there is a PyObject* we would be leaking its reference, no? (I may be missing something).

sweeneyde (Author, Contributor) commented Dec 29, 2019

I intended the tree to never be accessible from Python code.

This method repeatedly replaces a parent with its smaller child, that child with its smaller child, and so on. Before the sift completes, the list transiently contains two references to the item just promoted. If a comparison raises an error during this process, this line ensures that the duplicated element won't be DECREFed twice when the list is cleaned up. In other words, it keeps the list in a "consistent" state, except for the fact that it contains a bunch of NULLs. I ran the test case I added with -R 3:3 and it did not find any refleaks.

If tree is never externally accessible, is this a valid use for lists? Is there a better way to store a private C array of PyObject pointers when some of them may be NULL?
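
The transient-duplicate situation described above can be shown with a pure-Python analogue (illustrative only, not the PR's C code; `sift_hole` and `refill` are made-up names). Between promoting a child and moving on, the array holds two references to that child, which is exactly the state the C code must keep safe if a comparison raises.

```python
def sift_hole(arr, pos, refill):
    """Sift a hole at arr[pos] down a binary heap laid out in a list
    (children of node i at 2*i + 1 and 2*i + 2), then plug the final
    leaf with refill(). After each assignment below, arr briefly holds
    two references to the promoted child, mirroring the borrowed
    duplicate that the C code NULLs out on a comparison error."""
    n = len(arr)
    while 2 * pos + 1 < n:
        child = 2 * pos + 1
        if child + 1 < n and arr[child + 1] < arr[child]:  # may raise
            child += 1
        arr[pos] = arr[child]   # transient duplicate of arr[child]
        pos = child
    arr[pos] = refill()         # in the C code: next item from the iterator

arr = [None, 2, 5, 4, 3]        # hole at the root
sift_hole(arr, 0, lambda: 9)    # promotes 2, then 3; refills the leaf with 9
```

After the call, `arr` is `[2, 3, 5, 4, 9]`: each promoted item ends up stored exactly once again, which is the invariant the refleak discussion below is about.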

pablogsal (Member) commented Dec 29, 2019

I intended the tree to never be accessible from Python code.

Bear in mind that the tree is tracked by the GC. In general the GC checks for NULL before visiting the elements of a list, but this is tricky business because, by contract, the list is exposed to the C-API via the collector (and you need to track it, since it is a container, and a list is tracked once it has at least one element).

My point is that when you do arr[pos] = child;, for example, that is technically a violation of the list API: the reference count of the object should be increased (the list then owns the object twice). Your code can be correct, but extra effort is needed to validate its correctness once it violates the API (and it may be correct now but break if the list implementation changes). For example, consider:

>>> import sys
>>> x = object()
>>> sys.getrefcount(x)
2
>>> y = [x,x,x]
>>> sys.getrefcount(x)
5

Here you can see that the list increases the reference count each time it contains the object.

I would advise sticking to the API and using PyList_SET_ITEM, PyList_GET_ITEM, and Py_INCREF instead of manipulating the internal array directly and relying on the fact that half of the list holds borrowed references, because that is very confusing and very difficult to read and maintain IMHO, even if it is correct.

If tree is never externally accessible, is this a valid use for lists?

Even if tree is never externally accessible, parts of the C-API may expect Python objects in a valid state (no NULL elements, and reference counts correctly computed).

sweeneyde (Author, Contributor) commented Dec 29, 2019

Would it be okay to use a raw C array? Like

typedef struct mergeobject{
    PyObject_HEAD
    Py_ssize_t num_iterators;
    PyObject **iterators;        /* C-Array of iterator objects */
    PyObject *keyfunc;           
    PyObject **tree; /* C-Array of the elements being compared. Length determined by num_iterators */
    int reverse;
    int started;
} mergeobject;

I suppose the deallocation would then have to be done in a for-loop, but is that an issue?

pablogsal (Member) commented Dec 29, 2019

I suppose the deallocation would then have to be done in a for-loop, but is that an issue?

It may not be an issue, but you still need to take ownership of the elements. If you don't, the array may point to garbage once the object loses all its other references. Maintaining the references will basically lead you to do something similar to using a list with PyList_SET_ITEM, PyList_GET_ITEM, and Py_DECREF, but it may be a bit better (at least it won't be tracked by the collector, though then you risk leaking reference cycles).

pablogsal (Member) commented Dec 29, 2019

Notice that even if we solve the reference count issue, you still need to deal with the NULLs. For example, this snippet will crash the interpreter with your code:

❯ ./python
Python 3.9.0a2+ (heads/play:4d893ea9dd, Dec 29 2019, 03:33:13) 
[GCC 9.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import heapq
>>> import gc
>>> gc.collect()
10
>>> x = heapq.merge(iter([1,2,3]), iter([4,5,6]))
>>> next(x)
1
>>> next(x)
2
>>> all_lists = [x for x in gc.get_objects(generation=0) if isinstance(x, list)]
>>> len(all_lists)
5
>>> all_lists
[[1, 2, 3], [4, 5, 6], [<list_iterator object at 0x7f19f1d66550>, <list_iterator object at 0x7f19f1d66be0>], [<NULL>, 3, 4], ['all_lists', 'x', 'x', 'gc', 'get_objects', 'generation', 0, 'isinstance', 'x', 'list']]
>>> all_lists[3]
[<NULL>, 3, 4]
>>> all_lists[3][0]
[1]    725 segmentation fault (core dumped)  ./python

sweeneyde (Author, Contributor) commented Dec 29, 2019

the reference count issue

What is this issue? Each item in the array is acquired from a call to PyIter_Next, and thus should be a new reference (or NULL). Each item is eventually returned exactly once from merge_next, at which point ownership of that reference passes to the caller.

I had been thinking about the list as "a bunch of C local variables that can be modified at will, as long as we return each exactly once from merge_next." It also happens that at the end of each call to _tree_sift, each of the references that the mergeobject has acquired is stored exactly once in its tree.

But seeing your gc.get_objects() example, I realize that we can't quite use lists in that way. But what is the problem with plain C arrays (like PyObject *tree[], which we might initialize with PyMem_New)?

pablogsal (Member) commented Dec 29, 2019

What is this issue? Each item in the array is acquired from a call of PyIter_Next, and thus should be a new reference (or NULL). Each item is eventually returned once from merge_next, at which point ownership of that reference is yielded to the caller.

The issue is that when you add the elements to the list after calling PyIter_Next, the refcount is correct (say, 1, because the object is in the list once), but when you do arr[pos] = child; the refcount is still 1 when it should be 2, because the list contains the element twice. I may be missing something, though.

Edit: I think what I was missing was that:

It also happens that at the end of each call to _tree_sift, each of the references that the mergeobject has acquired is stored exactly once in its tree.


But what is the problem with plain C arrays (like PyObject *tree[], which we might initialize with PyMem_New)?

They are not tracked by the garbage collector, and they should be, because they are containers. Also, note that you would still need to account correctly for the fact that you own the references, so you would need to call Py_INCREF and Py_DECREF correctly.

pablogsal (Member) commented Dec 29, 2019

They are not tracked by the garbage collector, and they should because they are containers.

Although, on second thought, this may be fine: the worst that can happen is that the elements are not accounted for until they are returned, so a PyObject** should be fine.

if (childvalue == NULL) {
Py_DECREF(it);
PyList_SET_ITEM(mo->iterators, pos/2 - (n - 1), NULL);
if (PyErr_Occurred())

pablogsal (Member) commented Dec 29, 2019

PEP 7: Add braces

if (mo->tree == NULL)
return NULL;
if (mo->keyfunc == NULL) {
if (_tree_sift(mo, 0) < 0)
return NULL;
result = PyList_GET_ITEM(mo->tree, 0);
PyList_SET_ITEM(mo->tree, 0, NULL);
}
else {
if (_tree_sift_key(mo, 0) < 0)
return NULL;

pablogsal (Member) commented Dec 29, 2019

Please, add braces per PEP7

@@ -697,6 +1151,12 @@ PyInit__heapq(void)
m = PyModule_Create(&_heapqmodule);
if (m == NULL)
return NULL;

if (PyType_Ready(&merge_type) < 0)

pablogsal (Member) commented Dec 29, 2019

PEP 7: Add braces

>>> list(merge(['dog', 'horse'], ['cat', 'fish', 'kangaroo'], key=len))\n\
['dog', 'cat', 'fish', 'horse', 'kangaroo']");

static PyTypeObject merge_type = {

pablogsal (Member) commented Dec 29, 2019

Please, use C99 designated initializers for this type, as it is new (we use C99 now :) ).

bedevere-bot commented Dec 29, 2019

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.
