Skip to content

TYP: how to annotate DataFrame.__getitem__ #46616

Open
@twoertwein

Description

@twoertwein

I was about to create a PR but I realized that annotating DataFrame.__getitem__ might be impossible without making some simplifications.

There are two problems with __getitem__:

  • We allow any Hashable key (one would expect that this always returns a Series) but slice (is Hashable) returns a DataFrame
  • Columns can be a multiindex, df["a"] can return a DataFrame.

The MS stubs seems to make two assumptions: 1) columns can only be of type str (and maybe a few more types - but not Hashable) and 2) multiindex doesn't exist. In practice, this will cover almost all cases.

I don't think there is a solution for the multiindex issue. Even if we make DataFrame generic to carry the type of the column index, there is no Not[Multiindex] type, so we will always end up with incompatible & overlapping overloads.

The Hashable issue can partly be addressed:

# cover most common cases that return a Series
@overloads
def __getitem__(self, key :Scalar) -> Series:
    ...

# cover most common cases that return a DataFrame
@overloads
def __getitem__(self, key : list[HashableT] | np.ndarray | slice | Index | Series) -> DataFrame:
    ...

# everything else
@overloads
def __getitem__(self, key : Hashable) -> Any:  #  or Series | DataFrame (but might create many errors, typshed also uses Any in some cases to avoid unions)
    ...

Do you see a way to cover all cases of __getitem__ and if not which assumptions are you willing to make? @simonjayhawkins @Dr-Irv

Metadata

Metadata

Assignees

No one assigned

    Labels

    Needs DiscussionRequires discussion from core team before further actionTypingtype annotations, mypy/pyright type checking

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions