Description
I was about to create a PR but I realized that annotating DataFrame.__getitem__
might be impossible without making some simplifications.
There are two problems with __getitem__:
- We allow any Hashable key (one would expect that this always returns a Series), but a slice (which is Hashable) returns a DataFrame.
- Columns can be a MultiIndex, in which case df["a"] can return a DataFrame.
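Both cases can be reproduced at runtime. The snippet below is only a minimal illustration; the frames and variable names are made up for the example, and it assumes pandas is installed.

```python
import pandas as pd

# Flat columns: a scalar key selects one column, a slice selects rows.
flat = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
col = flat["a"]     # Series
rows = flat[0:1]    # DataFrame (row slice)

# MultiIndex columns: a partial key selects every column under that level.
multi = pd.DataFrame(
    [[1, 2, 3]],
    columns=pd.MultiIndex.from_tuples([("a", "x"), ("a", "y"), ("b", "x")]),
)
sub = multi["a"]            # DataFrame with columns "x" and "y"
leaf = multi[("a", "x")]    # Series (full tuple key)

for name, value in [("col", col), ("rows", rows), ("sub", sub), ("leaf", leaf)]:
    print(name, type(value).__name__)
```

The same key type (str) yields a Series in the first frame and a DataFrame in the second, which is exactly what a static signature cannot express.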
The MS stubs seem to make two assumptions: 1) columns can only be of type str (and maybe a few more types, but not Hashable) and 2) MultiIndex doesn't exist. In practice, this will cover almost all cases.
I don't think there is a solution for the MultiIndex issue. Even if we make DataFrame generic to carry the type of the column index, there is no Not[MultiIndex] type, so we will always end up with incompatible and overlapping overloads.
The Hashable issue can partly be addressed:
# cover most common cases that return a Series
@overload
def __getitem__(self, key: Scalar) -> Series:
    ...
# cover most common cases that return a DataFrame
@overload
def __getitem__(self, key: list[HashableT] | np.ndarray | slice | Index | Series) -> DataFrame:
    ...
# everything else
@overload
def __getitem__(self, key: Hashable) -> Any:  # or Series | DataFrame (but that might create many errors; typeshed also uses Any in some cases to avoid unions)
    ...
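To experiment with this overload ordering without touching pandas, here is a self-contained sketch. The Series and Frame classes are hypothetical stand-ins, not the pandas classes, and the key types are narrowed to builtins for brevity. It only shows the shape of the overloads and a runtime dispatch that matches them; how mypy or pyright judge the overlap is not claimed here.

```python
from __future__ import annotations

from collections.abc import Hashable
from typing import Any, Union, overload


class Series:
    """Stand-in for pandas.Series (hypothetical, illustration only)."""


class Frame:
    """Stand-in for pandas.DataFrame (hypothetical, illustration only)."""

    def __init__(self, columns: list[str]) -> None:
        self.columns = columns

    # Most common keys that return a Series.
    @overload
    def __getitem__(self, key: Union[str, int, float]) -> Series: ...
    # Most common keys that return a frame.
    @overload
    def __getitem__(self, key: Union[list, slice]) -> Frame: ...
    # Everything else: the return type is left open.
    @overload
    def __getitem__(self, key: Hashable) -> Any: ...

    def __getitem__(self, key):
        if isinstance(key, slice):
            return Frame(self.columns)
        if isinstance(key, list):
            return Frame([c for c in self.columns if c in key])
        return Series()


frame = Frame(["a", "b"])
single = frame["a"]      # first overload
subset = frame[["a"]]    # second overload
window = frame[0:1]      # second overload
```

The order matters: the specific overloads come first so that the catch-all Hashable overload is only reached for keys the first two do not accept.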
Do you see a way to cover all cases of __getitem__ and, if not, which assumptions are you willing to make? @simonjayhawkins @Dr-Irv