bpo-38938: Adjusting the heapq.merge algorithm for fewer comparisons #17729
Conversation
I made a quick review of the general aspects of the PR and left some comments. When I have more time I can do a more in-depth review.
```c
    Py_TPFLAGS_BASETYPE,                /* tp_flags */
    merge_doc,                          /* tp_doc */
    (traverseproc)merge_traverse,       /* tp_traverse */
    0,                                  /* tp_clear */
```
pablogsal
Dec 29, 2019
Member
Please implement tp_clear (maybe we could prove that this object cannot participate in reference cycles, but that is risky, while the implementation is straightforward and assures correctness).
```c
    Py_DECREF(tmpargs);
}

if (keyfunc == Py_None)
```
```c
}

iterators = PyList_New(num_iters);
if (iterators == NULL)
```
```c
}

mo = (mergeobject *)type->tp_alloc(type, 0);
if (mo == NULL)
```
```c
cmp = PyObject_RichCompareBool(
        childkey, otherchildkey, Py_LT);
if (cmp < 0) {
    arr[pos] = NULL;
```
pablogsal
Dec 29, 2019
Member
Why are we setting this to `NULL`? If there is a `PyObject*` here, we would be leaking its reference, no? (I may be missing something.)
sweeneyde
Dec 29, 2019
Author
Contributor
I intended the tree to never be accessible from Python code.
This method repeatedly replaces a parent with its larger child, replaces that larger child with its larger child, etc. Before this sift method is complete, the list contains two copies of the item it just copied down. If, during this process, we encounter an error in the comparison, then this line ensures that the duplicate element won't be DECREFed twice when the list is cleaned up. In other words, this line ensures that the list is in a "consistent" state, except for the fact that it contains a bunch of NULLs. I ran the test case I added with `-R 3:3` and it did not find any refleaks.
If `tree` is never externally accessible, is this a valid use for lists? Is there a better way to store a private C array of PyObject pointers when some of them may be NULL?
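The transient duplication described above can be seen in a pure-Python sketch of a sift. This is a hypothetical simplification for illustration (the function name `sift_down` and its index arithmetic are not from the PR's C `_tree_sift`):

```python
def sift_down(tree, pos):
    """Sift the item at `pos` down a list-based binary tree.

    After each `tree[pos] = tree[child]` assignment, the same object
    briefly occupies two slots at once; in the C version, erroring out
    mid-sift is what requires NULLing a slot so the list never ends up
    owning (and later DECREFing) the same reference twice.
    """
    item = tree[pos]
    n = len(tree)
    while True:
        child = 2 * pos + 1
        if child >= n:
            break
        other = child + 1
        if other < n and tree[other] < tree[child]:
            child = other
        if tree[child] < item:
            tree[pos] = tree[child]  # tree[child] now appears twice
            pos = child
        else:
            break
    tree[pos] = item  # final write removes the last duplicate
```

For example, sifting 5 down from the root of `[5, 1, 2, 3, 4, 6, 7]` yields `[1, 3, 2, 5, 4, 6, 7]`; between the copy-down steps the list transiently holds two references to the same object, which is exactly the state the error path must clean up.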
pablogsal
Dec 29, 2019
Member
I intended the tree to never be accessible from Python code.
Bear in mind that the tree is tracked by the gc. In general the gc checks for NULL before visiting elements of the list, but this is tricky business because, by contract, the list is exposed to the C-API via the collector (and you need to track it, since it is a container, and the list itself is tracked once it has at least one element).
My question is that when you do `arr[pos] = child;`, for example, technically that is a violation of the list API, as the reference count of the object should be increased (the list owns the object twice). Your code can be correct, but there is extra effort needed to validate the correctness once it is in violation of the API (also, this can be correct now but not if the list API changes). For example, consider:
```python
>>> import sys
>>> x = object()
>>> sys.getrefcount(x)
2
>>> y = [x, x, x]
>>> sys.getrefcount(x)
5
```
Here you can see that the list increases the reference count each time it contains the object.
I would advise sticking to the API and using `PyList_SET_ITEM`, `PyList_GET_ITEM` and `Py_INCREF` instead of manipulating the internal array directly and relying on the fact that half of the list is using borrowed references, because this is very confusing and very difficult to read and maintain IMHO, even if it is correct.
If tree is never externally accessible, is this a valid use for lists?
Even if `tree` is never externally accessible, there may be parts of the C-API that expect Python objects in valid states (no NULLs in the elements and the reference count correctly computed).
sweeneyde
Dec 29, 2019
Author
Contributor
Would it be okay to use a raw C array? Like
```c
typedef struct mergeobject {
    PyObject_HEAD
    Py_ssize_t num_iterators;
    PyObject **iterators;  /* C array of iterator objects */
    PyObject *keyfunc;
    PyObject **tree;       /* C array of the elements being compared.
                              Length determined by num_iterators. */
    int reverse;
    int started;
} mergeobject;
```
I suppose the deallocation would then have to be done in a for-loop, but is that an issue?
pablogsal
Dec 29, 2019
Member
I suppose the deallocation would then have to be done in a for-loop, but is that an issue?
It may not be an issue, but you still need to take ownership of the elements. If you don't, the array may point to garbage if the object loses all its other references. Maintaining the references will basically lead you to do a similar thing to just using a list with `PyList_SET_ITEM`, `PyList_GET_ITEM` and `Py_DECREF`, but it may be a bit better (at least it won't be tracked by the collector, though then you risk leaking reference cycles).
pablogsal
Dec 29, 2019
Member
Notice that even if we solve the reference count issue, you still need to deal with the NULLs. For example, this snippet will crash the interpreter with your code:
```
❯ ./python
Python 3.9.0a2+ (heads/play:4d893ea9dd, Dec 29 2019, 03:33:13)
[GCC 9.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import heapq
>>> import gc
>>> gc.collect()
10
>>> x = heapq.merge(iter([1,2,3]), iter([4,5,6]))
>>> next(x)
1
>>> next(x)
2
>>> all_lists = [x for x in gc.get_objects(generation=0) if isinstance(x, list)]
>>> len(all_lists)
5
>>> all_lists
[[1, 2, 3], [4, 5, 6], [<list_iterator object at 0x7f19f1d66550>, <list_iterator object at 0x7f19f1d66be0>], [<NULL>, 3, 4], ['all_lists', 'x', 'x', 'gc', 'get_objects', 'generation', 0, 'isinstance', 'x', 'list']]
>>> all_lists[3]
[<NULL>, 3, 4]
>>> all_lists[3][0]
[1]    725 segmentation fault (core dumped)  ./python
```
sweeneyde
Dec 29, 2019
Author
Contributor
the reference count issue
What is this issue? Each item in the array is acquired from a call of PyIter_Next, and thus should be a new reference (or NULL). Each item is eventually returned once from merge_next, at which point ownership of that reference is yielded to the caller.
I had been thinking about the list as "a bunch of C local variables that can be modified at will, as long as we return each exactly once from merge_next." It also happens that at the end of each call to _tree_sift, each of the references that the mergeobject has acquired is stored exactly once in its tree.
But seeing your gc.get_objects() example, I realize that we can't quite use lists in that way. But what is the problem with plain C arrays (like `PyObject *tree[]`, which we might initialize with `PyMem_New`)?
pablogsal
Dec 29, 2019
Member
What is this issue? Each item in the array is acquired from a call of PyIter_Next, and thus should be a new reference (or NULL). Each item is eventually returned once from merge_next, at which point ownership of that reference is yielded to the caller.
The issue is that when you add the elements to the list after calling `PyIter_Next`, the refcount is correct (let's say 1, because the element is in the list once), but when you do `arr[pos] = child;` the refcount is still 1 when it should be 2, because the list contains the element two times. I may be missing something, though.
Edit: I think what I was missing was that:
It also happens that at the end of each call to _tree_sift, each of the references that the mergeobject has acquired is stored exactly once in its tree.
But what is the problem with plain C arrays (like PyObject *tree[], which we might initialize with PyMem_New)?
They are not tracked by the garbage collector, and they should be, because they are containers. Also, note that you would still need to account correctly for the fact that you own the references, so you would need to correctly call `Py_INCREF` and `Py_DECREF`.
pablogsal
Dec 29, 2019
Member
They are not tracked by the garbage collector, and they should because they are containers.
Although on second thought this may be fine, as the worst that can happen is that they are not accounted for until they are returned, so a `PyObject**` should be fine.
```c
if (childvalue == NULL) {
    Py_DECREF(it);
    PyList_SET_ITEM(mo->iterators, pos/2 - (n - 1), NULL);
    if (PyErr_Occurred())
```
```c
if (mo->tree == NULL)
    return NULL;
if (mo->keyfunc == NULL) {
    if (_tree_sift(mo, 0) < 0)
        return NULL;
    result = PyList_GET_ITEM(mo->tree, 0);
    PyList_SET_ITEM(mo->tree, 0, NULL);
}
else {
    if (_tree_sift_key(mo, 0) < 0)
        return NULL;
```
```c
@@ -697,6 +1151,12 @@ PyInit__heapq(void)
    m = PyModule_Create(&_heapqmodule);
    if (m == NULL)
        return NULL;

    if (PyType_Ready(&merge_type) < 0)
```
```c
>>> list(merge(['dog', 'horse'], ['cat', 'fish', 'kangaroo'], key=len))\n\
['dog', 'cat', 'fish', 'horse', 'kangaroo']");

static PyTypeObject merge_type = {
```
pablogsal
Dec 29, 2019
Member
Please use C99 designated initializers for this class, as it is new (we use C99 now :) ).
bedevere-bot
commented
Dec 29, 2019
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers, that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase
sweeneyde commented Dec 28, 2019 (edited by bedevere-bot)
This pull request would:
- Change `heapq.merge` to be a class whose instances are iterable objects, rather than a generator function.
- Change the algorithm used by `heapq.merge`:
  - Old: `heapreplace` on (index, iterator, item) tuples.
  - New: maintain a tree of the items themselves; when an item is removed, call the `__next__` method of the iterator corresponding to the item we are replacing. It would then sift that new leaf down into the appropriate part of the heap.
- Add a C implementation of `heapq.merge`.
According to my benchmarking, this is at least 2.5 times as fast as the existing implementation in almost all cases, and is especially fast (about 5x as fast as the existing way) with small numbers of iterables. I tested up to 12,000 iterables, with varying degrees of "balance" among the number of items in each. Also noteworthy is that fewer comparisons on average were needed with the proposed new method.
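The "fewer comparisons" claim can be checked with a small counting harness against the stdlib's current `heapq.merge`. This is an illustrative measurement sketch, not from the PR; the `Counted` wrapper is a hypothetical helper:

```python
import heapq

class Counted:
    """Wrapper that counts __lt__ calls, to measure how many
    comparisons a merge performs (illustrative harness only)."""
    calls = 0

    def __init__(self, value):
        self.value = value

    def __lt__(self, other):
        Counted.calls += 1
        return self.value < other.value

# Four sorted runs that interleave into range(100).
runs = [[Counted(v) for v in range(i, 100, 4)] for i in range(4)]
merged = [c.value for c in heapq.merge(*runs)]
assert merged == list(range(100))

# Run the same harness against a candidate implementation and
# compare the two totals to verify the reduction in comparisons.
print(Counted.calls)
```

The same runs can then be fed to a patched interpreter; a lower `Counted.calls` there would corroborate the benchmark numbers above.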
https://bugs.python.org/issue38938