Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PySequence_GetItem adjusting negative indexes make it difficult to handle them properly #96436

Open
davidhewitt opened this issue Aug 30, 2022 · 6 comments
Labels
docs Documentation in the Doc dir

Comments

@davidhewitt
Copy link
Contributor

davidhewitt commented Aug 30, 2022

Bug report

Sequence types which attempt to handle negative indexes in __getitem__ experience double-corrected indexes when native code accesses them via PySequence_GetItem.

The same behaviour affects PySequence_SetItem and PySequence_DelItem, although not discussed further here.

See the ERROR comments in the snippet below:

from typing import Any, Iterable

# This is just a compiled wrapper to expose the PySequence_GetItem call 
# to Python. The point is to demonstrate that native code calling PySequence_GetItem
# interacts poorly with the sequence type Seq in this sample.
from native_module import pysequence_getitem

class Seq:
    """An immutable sequence type. This is implemented in pure python but it could be implemented
    in native code.
    """
    def __init__(self, values: Iterable[Any]):
        # store the sequence contents as {index -> value} mapping, because if we
        # used a list or tuple, that container would also handle negative indexes, confusing the example.
        self._items = dict(enumerate(values))

    def __len__(self) -> int:
        return len(self._items)

    def __getitem__(self, index: Any) -> Any:
        if isinstance(index, int):
            try:
                # Need to correct negative indexes
                corrected_index = index if index >= 0 else index + len(self._items)
                return self._items[corrected_index]
            except KeyError:
                # Raise the original index passed to __getitem__, ideally for better error message for users
                raise IndexError(index) from None
        else:
            # A real container would probably want to also implement slicing
            return NotImplemented()


seq = Seq("abcde")

print(seq[3])
print(seq[-3])
# raises IndexError(-10) as expected
# print(seq[-10])

print(pysequence_getitem(seq, 3))
print(pysequence_getitem(seq, -3))

# ERROR: should IndexError, but instead the -10 gets double-corrected
print(pysequence_getitem(seq, -10))

# ERROR: raises IndexError(-15), however the index requested was -20
print(pysequence_getitem(seq, -20))

(If requested I can push this as a sample repository)

Your environment

macOS 12.4, Python 3.12 main branch
I believe this negative index correction has been a feature of every Python 3 build on all OS variants.

Further background

PyO3 is a project to write native Python modules in Rust. To make the experience as straightforward as possible for Python developers, we aim to offer an API to create classes which matches Python as closely as possible.

PyO3 provides a way to implement __getitem__ which matches how user-defined Python classes work, like in the sample above. However, PySequence_GetItem's use of negative indexing has been a source of confusion: PyO3/pyo3#2601 (comment)
PyO3/pyo3#2392 (comment)

I am happy to be instructed that PyO3 should adopt a different design (ideas welcome). I would still defend it is reasonable to call the interaction demonstrated above for the Python class a bug.

I couldn't find any other discussion on this issue other than this comment from Guido in 2001, which implied that this behaviour was probably perceived as tricky even then: #34788 (comment)

Possible solutions

I believe it should be possible to add a new type flag, bikeshed it Py_TPFLAGS_NO_SEQ_INDEX_ADJUST, which disables this adjustment for PySequence_{Get,Set,Del}Item methods. Python-defined classes can have this flag set, avoiding the problematic behaviour displayed above. Native types can also use it to opt-out so that they can handle the negative index themselves. The main upside for native classes is that (if desired) the indexing operator would be able to raise IndexError containing the out-of-range index. E.g. (1, 2, 3)[-10] could raise IndexError(-10) instead of IndexError("tuple index out of range").

@rhettinger
Copy link
Contributor

rhettinger commented Aug 30, 2022

I would use PyObject_GetItem() for this purpose. That is the intended way to emulate the pure python call s[i] and that is what is used by ceval.c to implement 'BINARY_SUBSCR`.

Ideally, we should add a recommendation to not use the PySequence API for calling into pure Python code. The code in that API assumes that upstream checks have been made to distinguish between mappings, numbers, and sequences. A C programmer controls it by choosing between the mp_subscript and sq_item slots. That option isn't available to someone writing pure python.

@rhettinger rhettinger added type-feature A feature request or enhancement and removed type-bug An unexpected behavior, bug, or error labels Aug 31, 2022
@davidhewitt
Copy link
Contributor Author

davidhewitt commented Aug 31, 2022

Agreed it's fair to say that PyObject_GetItem is the correct API for general usage (and that's what PyO3 uses).

Ideally, we should add a recommendation to not use the PySequence API for calling into pure Python code.

I think this recommendation isn't quite correct, there are many cases in CPython where PySequence_GetItem is used to index into what could be a pure-python object, such as _bisectmodule.c.

What I notice is that that every usage in CPython where I can find PySequence_GetItem and friends used, the index is guaranteed by construction to be positive. So it's impossible to hit the case demonstrated here from the standard library.

As you point out, users of the C/API are would already be aware of the mp_subscript and sq_item slots. Perhaps a potential option is to deprecate PySequence_GetItem and friends, redirecting C/API users to call sq_item directly (possibly via PyType_GetSlot)?

@rhettinger rhettinger added docs Documentation in the Doc dir and removed type-feature A feature request or enhancement labels Aug 31, 2022
@rhettinger
Copy link
Contributor

rhettinger commented Aug 31, 2022

Perhaps a potential option is to deprecate PySequence_GetItem and friends

These are part of the stable ABI and have been around for very long time. Deprecating them would be highly disruptive. I think it is sufficient to document that caller should use PyObject_GetItem unless their code either 1) is known to call only a class that doesn't adjust negative indices or 2) always has a non-negative input.

@davidhewitt
Copy link
Contributor Author

davidhewitt commented Sep 1, 2022

The point I am trying to get across is that in my eyes the issue here makes it impossible for me as a sequence implementor to handle negative indices correctly. Maybe documenting what a user of PySequence_GetItem should not do to avoid this edge case is fine, and then I can implement handling for negative indices in __getitem__ and blame the caller if they hit this behaviour.

The status quo in the standard library sequences such as tuple is that mp_subscript also adjusts negative indices in the same way as PySequence_GetItem, before forwarding the index to sq_item in the same way. This is correct in all cases, albeit with duplication of the index correction code, and Python classes which just have the one __getitem__ method obviously cannot implement this strategy.

I would still hope we can explore the solution space to see if there's a long-term resolution we can implement. Hence my proposal #96437 to add a flag to opt out of this behaviour.

Deprecating them would be highly disruptive. I think it is sufficient to document that caller should use PyObject_GetItem unless their code either 1) is known to call only a class that doesn't adjust negative indices or 2) always has a non-negative input.

I fully agree that deprecating them is highly disruptive, hence my proposal #96437 was my first suggestion. I would argue that the documentation you propose is equivalent to deprecating the use of these APIs with negative indices?

@rhettinger
Copy link
Contributor

rhettinger commented Sep 1, 2022

ISTM that a flag doesn't really do anything for us. In any case where a user would set the flag, they should instead be calling PyObject_GetItem() instead.

@davidhewitt
Copy link
Contributor Author

davidhewitt commented Sep 1, 2022

With respect, I think focusing on users still misses my point.

As a Python sequence implementor authoring __getitem__, I cannot handle negative indices correctly when PySequence_GetItem adjusts negative indices but PyObject_GetItem does not.

A sequence user should never need to care about whether this proposed flag is set on the type.

The flag's intended purpose is to allow sequence implementors to opt-out of this problematic part of PySequence_GetItem so that users can freely interact with the sequence using either that or PyObject_GetItem and not shoot themselves in the foot in the process.

(same for __setitem__ etc.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
docs Documentation in the Doc dir
Projects
None yet
Development

No branches or pull requests

2 participants