
ENH/API: allow different values for labels/index in DF.reindex __OR__ raise error #21685

Open
@h-vetinari


Currently, DataFrame.reindex has three overlapping keywords:

  • labels
  • index
  • columns

I (naively) expected that passing different values to labels and index would work (motivating example below), but it does not. I'm going to propose how this could be incorporated, but independently of that, in the current state, an error should at least be raised when conflicting values are passed to labels/index (or even whenever both kwargs are used).

2018-10-08 EDIT: The above is all that is necessary for the purpose of raising errors.

Alternatively (or maybe complementarily), one could consider the following use case for allowing different values for labels/index, since .reindex (at least by name) has two interpretations:

  • selecting an index
  • assigning an index

[end of EDIT]

The example is related to what I'm working on in #21645, where I want to construct an inverse to .duplicated that allows the original object to be reconstructed from the deduplicated one.

As a toy example:

df = pd.DataFrame({'A': [0, 1, 1, 2, 0], 'B': ['a', 'b', 'b', 'c', 'a']})
df
#    A  B
# 0  0  a
# 1  1  b
# 2  1  b
# 3  2  c
# 4  0  a

isdup, inv = df.duplicated(keep='last', return_inverse=True)
isdup
# 0     True
# 1     True
# 2    False
# 3    False
# 4    False
# dtype: bool

inv
# 0    4
# 1    2
# 2    2
# 3    3
# 4    4
# dtype: int64

unique = df.loc[~isdup]
unique
#    A  B
# 2  1  b
# 3  2  c
# 4  0  a

unique.reindex(inv)
#    A  B
# 4  0  a
# 2  1  b
# 2  1  b
# 3  2  c
# 4  0  a

This is obviously not yet identical to the original object: while we have read the correct indexes from unique, we haven't yet assigned them to the correct output indexes.
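The `return_inverse` keyword used in the toy example comes from the work in #21645, so it may not be available in an installed pandas. As a rough sketch, assuming it is unavailable and the columns contain no missing values, `isdup`, `inv` and `unique` can be approximated with existing calls. The `groupby`/`transform` approach is a stand-in chosen for illustration, not the implementation from #21645.

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 1, 2, 0], 'B': ['a', 'b', 'b', 'c', 'a']})

# Flag every row that has an identical row later in the frame.
isdup = df.duplicated(keep='last')

# For each row, look up the index label of the last identical row.
# Grouping by all columns puts identical rows into the same group.
# Assumption: no NaN in the columns (groupby drops NaN keys by default).
labels = df.index.to_series()
inv = labels.groupby([df[col] for col in df.columns]).transform('last')

# The deduplicated frame that the inverse refers to.
unique = df.loc[~isdup]
```

Here `inv` holds, for each original row, the label in `unique` from which that row can be read back.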

I had long been working with .loc[] until v0.23 started telling me to use .reindex, and consequently, I wasn't very familiar with it. I started by trying the following, which would conceptually make sense to me (as opposed to interpreting .reindex(inv) directly, which would break heaps of code):

unique.reindex(labels=inv.values, index=inv.index)
#      A    B
# 0  NaN  NaN
# 1  NaN  NaN
# 2  1.0    b
# 3  2.0    c
# 4  0.0    a

This was surprising, because labels is completely ignored (even though it is the first argument in the call signature), and no warning is raised about silently swallowing contradictory arguments.
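To illustrate the minimal request (raise when both keywords are used), here is a small user-side sketch. `reindex_rows_strict` is a hypothetical helper name, not a pandas function, and it only covers row reindexing; how pandas itself resolves `labels` together with `index` may differ between versions.

```python
import pandas as pd


def reindex_rows_strict(frame, labels=None, index=None, **kwargs):
    """Hypothetical helper: reindex rows, refusing ambiguous input."""
    if labels is not None and index is not None:
        raise ValueError("pass either 'labels' or 'index', not both")
    target = labels if labels is not None else index
    if target is None:
        raise ValueError("one of 'labels' or 'index' is required")
    # Forward to the real method with a single, unambiguous row target.
    return frame.reindex(index=target, **kwargs)


unique = pd.DataFrame({'A': [1, 2, 0], 'B': ['b', 'c', 'a']}, index=[2, 3, 4])

try:
    reindex_rows_strict(unique, labels=[4, 2, 2, 3, 4], index=range(5))
except ValueError as err:
    message = str(err)  # the conflict is reported instead of being swallowed
```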

In any case, this is not very high priority, as a more-or-less simple work-around exists, but it is still something to consider, IMO.

## the workaround
unique.reindex(inv.values).set_index(inv.index).equals(df)
# True
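The two interpretations (selecting an index, then assigning an index) can also be wrapped in one function. This is only a sketch of the semantics proposed here, with the hypothetical name `select_then_assign`. It uses `set_axis` rather than `set_index` so that a plain list of new labels is not mistaken for column names.

```python
import pandas as pd


def select_then_assign(frame, select, assign):
    """Hypothetical helper: pick rows by `select`, then relabel them with `assign`."""
    selected = frame.reindex(select)          # "selecting an index"
    return selected.set_axis(assign, axis=0)  # "assigning an index"


unique = pd.DataFrame({'A': [1, 2, 0], 'B': ['b', 'c', 'a']}, index=[2, 3, 4])
restored = select_then_assign(unique, select=[4, 2, 2, 3, 4], assign=[0, 1, 2, 3, 4])
```

With the toy data, `select` plays the role of `inv.values` and `assign` the role of `inv.index`.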
