Description
Currently, DataFrame.reindex has three overlapping keywords:
- labels
- index
- columns
I (naively) expected it to work to pass different values to labels/index (motivating example below), but this does not work. I'm going to make a proposal of how this could be incorporated, but independently of that, in the current state, at least an error should be raised on conflicting values for labels/index (or even just on using both kwargs).
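A minimal sketch of the requested check, written as a user-side wrapper. The name strict_reindex is hypothetical and not part of pandas; it only illustrates refusing the ambiguous call instead of silently dropping one argument.

```python
import pandas as pd


def strict_reindex(df, labels=None, index=None, **kwargs):
    """Hypothetical wrapper: reject two competing row targets."""
    if labels is not None and index is not None:
        raise TypeError("pass either 'labels' or 'index', not both")
    target = labels if labels is not None else index
    # Only one row target is left, so the call below is unambiguous.
    return df.reindex(index=target, **kwargs)


df = pd.DataFrame({'A': [1, 2, 0]}, index=[2, 3, 4])

try:
    strict_reindex(df, labels=[4, 2], index=[0, 1])
except TypeError as err:
    print(err)
```

With a single target, the wrapper behaves like a plain row reindex.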
2018-10-08 EDIT: This is as far as necessary for the purpose of raising errors.
Alternatively (or maybe complementarily), one could consider the following use case for allowing different values for labels/index, as .reindex (at least by name) has two interpretations:
- selecting an index
- assigning an index
[end of EDIT]
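To make the two interpretations concrete, here is a small sketch using only existing pandas operations. The frame and the labels 10, 11, 12 are made up for illustration.

```python
import pandas as pd

unique = pd.DataFrame({'A': [1, 2, 0], 'B': ['b', 'c', 'a']}, index=[2, 3, 4])

# Interpretation 1: select rows by label; the requested labels become the index.
selected = unique.reindex([4, 2, 2])

# Interpretation 2: keep the rows as they are and assign new labels to them.
relabelled = unique.copy()
relabelled.index = [10, 11, 12]

print(selected)
print(relabelled)
```

Today .reindex only covers the first interpretation; the second needs a separate step such as assigning to .index or calling .set_index.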
The example is related to what I'm working on in #21645, where I want to construct an inverse to .duplicated, making it possible to reconstruct the original object from the deduplicated one.
As a toy example:
df = pd.DataFrame({'A': [0, 1, 1, 2, 0], 'B': ['a', 'b', 'b', 'c', 'a']})
df
# A B
# 0 0 a
# 1 1 b
# 2 1 b
# 3 2 c
# 4 0 a
isdup, inv = df.duplicated(keep='last', return_inverse=True)
isdup
# 0 True
# 1 True
# 2 False
# 3 False
# 4 False
# dtype: bool
inv
# 0 4
# 1 2
# 2 2
# 3 3
# 4 4
# dtype: int64
unique = df.loc[~isdup]
unique
# A B
# 2 1 b
# 3 2 c
# 4 0 a
unique.reindex(inv)
# A B
# 4 0 a
# 2 1 b
# 2 1 b
# 3 2 c
# 4 0 a
This is obviously not identical to the original object yet: while we have read the correct indexes from unique, we haven't assigned them to the correct output indexes yet.
I had long been working with .loc[] until v0.23 started telling me to use .reindex, and consequently, I wasn't very acquainted with it. I started by trying the following, which would conceptually make sense to me (as opposed to interpreting .reindex(inv) directly, which would break heaps of code):
unique.reindex(labels=inv.values, index=inv.index)
# A B
# 0 NaN NaN
# 1 NaN NaN
# 2 1.0 b
# 3 2.0 c
# 4 0.0 a
This was surprising, because labels is completely ignored (even though it is the first argument in the call signature), and no warning is raised for swallowing contradictory arguments.
In any case, this is not very high priority, as a more-or-less simple work-around exists, but it is still something to consider, IMO.
## the workaround
unique.reindex(inv.values).set_index(inv.index).equals(df)
# True
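For anyone who wants to reproduce the workaround without the return_inverse keyword (which is only the proposal from #21645), here is a self-contained sketch. The inv Series is hard-coded from the output shown above, as a stand-in for what the proposed keyword would return.

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 1, 2, 0], 'B': ['a', 'b', 'b', 'c', 'a']})

# duplicated() without return_inverse is existing pandas API.
isdup = df.duplicated(keep='last')

# Stand-in for the proposed inverse: for each row, the label of its last occurrence.
inv = pd.Series([4, 2, 2, 3, 4], index=df.index)

unique = df.loc[~isdup]

# Step 1: select rows by the inverse values.
# Step 2: assign the original labels to the selected rows.
restored = unique.reindex(inv.values).set_index(inv.index)
print(restored)
```

The two chained calls correspond to the "selecting" and "assigning" readings of .reindex discussed in the EDIT above.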