Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal for future copy / view semantics in indexing operations #36195

Open
TomAugspurger opened this issue Sep 7, 2020 · 79 comments
Open

Proposal for future copy / view semantics in indexing operations #36195

TomAugspurger opened this issue Sep 7, 2020 · 79 comments
Labels

Comments

@TomAugspurger
Copy link
Contributor

@TomAugspurger TomAugspurger commented Sep 7, 2020

Update: concrete proposal for Copy-on-Write: https://docs.google.com/document/d/1ZCQ9mx3LBMy-nhwRl33_jgcvWo9IWdEfxDNQ2thyTb0/edit?usp=sharing


[original post]

Hi all,

During the recent sprint we discussed copy / view semantics in indexing operations. I've tried to summarize a proposal that came out of it.

The tldr is

  1. Indexing operations always return a view where possible (always when indexing columns, if possible when indexing the rows).
  2. For consistent behavior, we implement either Copy on Write or Error on Write (explained in the proposal)
  3. For readability / understandability, we give users the tools to control in code whether mutations to a "child" dataframe (result of an indexing operation) propagate to the parent, while avoiding unnecessary "defensive" copies.

The full proposal is available at https://docs.google.com/document/d/1csGE4qigPR2vzmU2--jwURn3sK5k5eVewinxd8OUPk0/edit. I'd like to collect comments there and then propose properly by adding it to the roadmap.

cc @pandas-dev/pandas-core

@jbrockmendel
Copy link
Member

@jbrockmendel jbrockmendel commented Sep 7, 2020

asv from a branch that changes column-based indexing to always return a view:

       before           after         ratio
     [42289d0d]       [f99abd57]
     <master>         <myway>   
-     9.24±0.06ms      8.77±0.02ms     0.95  algorithms.Hashing.time_series_string
-        57.3±2ms       54.3±0.2ms     0.95  join_merge.MergeAsof.time_by_int('backward', 5)
-      76.1±0.3ms       72.0±0.3ms     0.95  frame_methods.Describe.time_dataframe_describe
-        57.3±1ms       54.0±0.2ms     0.94  join_merge.MergeAsof.time_by_int('backward', None)
-     14.0±0.07ms      13.2±0.08ms     0.94  groupby.MultiColumn.time_cython_sum
-        1.29±0ms      1.20±0.01ms     0.94  timeseries.AsOf.time_asof_nan_single('DataFrame')
-     1.50±0.01ms      1.38±0.01ms     0.92  timeseries.AsOf.time_asof_single('DataFrame')
-     3.62±0.01ms      3.31±0.01ms     0.91  groupby.CountMultiDtype.time_multi_count
-      67.0±0.4ms       60.7±0.4ms     0.91  reshape.Crosstab.time_crosstab_normalize_margins
-      14.8±0.4ms      13.2±0.08ms     0.89  join_merge.MergeAsof.time_on_int32('nearest', 5)
-     14.4±0.09ms      12.8±0.04ms     0.89  join_merge.MergeAsof.time_on_int32('nearest', None)
-      14.7±0.1ms      13.0±0.06ms     0.89  join_merge.MergeAsof.time_on_uint64('nearest', 5)
-     14.3±0.05ms      12.6±0.03ms     0.88  join_merge.MergeAsof.time_on_uint64('nearest', None)
-      69.8±0.5ms       61.4±0.7ms     0.88  reshape.PivotTable.time_pivot_table_margins
-     14.3±0.06ms       12.6±0.1ms     0.88  join_merge.MergeAsof.time_on_int('nearest', 5)
-     12.7±0.05ms      11.1±0.04ms     0.87  join_merge.MergeAsof.time_on_int32('forward', 5)
-     14.0±0.06ms      12.3±0.05ms     0.87  join_merge.MergeAsof.time_on_int('nearest', None)
-     12.5±0.05ms      10.9±0.05ms     0.87  join_merge.MergeAsof.time_on_int32('forward', None)
-     12.1±0.05ms      10.5±0.02ms     0.87  join_merge.MergeAsof.time_on_int32('backward', 5)
-     12.0±0.05ms      10.3±0.04ms     0.86  join_merge.MergeAsof.time_on_int32('backward', None)
-     12.7±0.09ms      10.9±0.04ms     0.86  join_merge.MergeAsof.time_on_uint64('forward', 5)
-     12.2±0.06ms      10.5±0.07ms     0.85  join_merge.MergeAsof.time_on_int('forward', 5)
-     12.4±0.03ms      10.6±0.05ms     0.85  join_merge.MergeAsof.time_on_uint64('forward', None)
-     12.0±0.06ms      10.2±0.04ms     0.85  join_merge.MergeAsof.time_on_int('forward', None)
-     11.9±0.05ms      10.1±0.05ms     0.85  join_merge.MergeAsof.time_on_uint64('backward', 5)
-     11.8±0.04ms      9.98±0.05ms     0.85  join_merge.MergeAsof.time_on_uint64('backward', None)
-      11.4±0.4ms      9.61±0.02ms     0.84  join_merge.MergeAsof.time_on_int('backward', None)
-     11.5±0.06ms      9.71±0.04ms     0.84  join_merge.MergeAsof.time_on_int('backward', 5)
-         407±3μs          291±2μs     0.72  reindex.Reindex.time_reindex_columns
-      3.99±0.1ms       2.72±0.2ms     0.68  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis1('ge')
-     3.56±0.02ms      2.36±0.03ms     0.66  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis1('ne')
-      3.60±0.2ms      2.35±0.02ms     0.65  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis1('lt')
-     47.8±0.08ms       28.4±0.1ms     0.59  frame_methods.Dropna.time_dropna_axis_mixed_dtypes('all', 1)
-      4.81±0.9ms      2.81±0.03ms     0.58  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis1('truediv')
-     4.13±0.04ms       2.41±0.2ms     0.58  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis1('eq')
-        11.9±1ms       6.88±0.5ms     0.58  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis1('pow')
-     4.12±0.04ms      2.34±0.05ms     0.57  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis1('le')
-        4.67±1ms       2.55±0.2ms     0.55  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis1('mul')
-        4.90±1ms       2.55±0.2ms     0.52  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis1('sub')
-      40.3±0.2ms       20.5±0.2ms     0.51  frame_methods.Dropna.time_dropna('all', 1)

Methodological note: i did a full asv run, then for everything that came back as changed re-ran those and changes that were not reproducible (e.g. tslibs benchmarks which have to be noise)

@TomAugspurger
Copy link
Contributor Author

@TomAugspurger TomAugspurger commented Sep 29, 2020

I'd like to move forward with this, including a version of the proposal on https://pandas.pydata.org/pandas-docs/dev/development/roadmap.html.

There hasn't been much activity on the google doc, https://docs.google.com/document/d/1csGE4qigPR2vzmU2--jwURn3sK5k5eVewinxd8OUPk0/edit. The biggest outstanding item is a decision / discussion of "Copy on Write" vs. "Error on Write". It'd be nice if the proposal was explicit on this point, but I worry we won't be able to decide without actually trying it out.

@jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented Oct 1, 2020

Some assorted thoughts I gathered the last weeks (not necessarily all needing an answer in an initial discussion / proposal, though):

  • The copy/view in indexing discussion also impacts methods that rely on indexing internally (eg drop, dropna). I think that is also shown from the fact that dropna performance improves in the benchmark output above.
    So the indexing vs methods (what you put in the "extended proposal" section) are certainly somewhat intertwined, although it is of course possible to first only target indexing by ensuring we do additional copies internally in methods like drop that expect to return a copy.

  • What to do with duplicate column selection? (df['a', 'a']]) Should that also still be a view? (if we would go for views) Mutating one column would then also mutate the other.

  • What kind of mutation operations do we consider in this discussion? We clearly talked about "changing values in-place (setitem)", but there are also other kinds of mutation like changing metadata, flags, ..

  • What to do with the constructors? (eg DataFrame(df)) Also copy-on-write? Or if we keep this as non-copy, what with constructors that have potentially a "reindexing" behaviour? (eg DataFrame(df, columns=[...]))

  • What do we do with accessing a single column as a Series? Do we keep this as a pure view, or do we also want to (eventually) introduce copy-on-write/error for this? (after deprecation cycle)

  • I think the crucial question that deserves more discussion is: are we fine with ditching the SettingWithCopyWarning for those cases it was introduced for?
    Because there is a reason that it was introduced: to warn uses that were doing a setitem operation that might have no effect silently, and thus be confusing / easy to miss for users.

    I think the classical example is something like df[df['B'] > 4]['B'] = 10 - a chained indexing assignment that doesn't work. The warning was introduced to prevent people doing that and silently having no effect.
    Are we fine with it that this will indeed have silently no effect? (which would be the consequence of copy-on-write)

@Dr-Irv
Copy link
Contributor

@Dr-Irv Dr-Irv commented Oct 1, 2020

@TomAugspurger
Copy link
Contributor Author

@TomAugspurger TomAugspurger commented Oct 1, 2020

Agreed that the classic chained indexing example is a compelling argument for Error on Write rather than Copy on Write.

What to do with duplicate column selection? (df[['a', 'a']]) Should that also still be a view? (if we would go for views) Mutating one column would then also mutate the other.

IMO, yes that should create two views to the same data.

What kind of mutation operations do we consider in this discussion? We clearly talked about "changing values in-place (setitem)", but there are also other kinds of mutation like changing metadata, flags, ..

I'd like to keep this focused just on setting data / values. I'm hopeful that the metadata / flags issue can be solved by doing shallow copies by default and providing an option to copy the data if desired, like we did for set_flags.

What do we do with accessing a single column as a Series?

If possible, treating it the same as a DataFrame seems best to me.

@jreback
Copy link
Contributor

@jreback jreback commented Oct 1, 2020

is there any system that does error-on-write? I do prefer this, but am a little concerned that we are treading new group vs. copy-on-write (which I think has a large body of systems / literature).

@jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented Oct 12, 2020

Putting it here because most of the discussion on this topic is still here (but can also move to #36988).

While the classic chained indexing example is indeed an argument for Error-on-Write (for this case, it's basically what we have now but with an error instead of warning), I think there are also arguments/examples to give in favor of Copy-on-Write:

  • I would personally love to get the (IMO more common case) of subset = df[mask] or subset[list_of_columns] and then continuing with subset without worrying about the original df working out of the box, without the user needing to do .copy()/.copy_if_needed() (as is the case now if you don't want to risk getting the warning, even though those are always already copies)

  • There are some cases where we currently don't raise a warning (eg adding a new column to a subset created like above), that woud start raising an error.

  • We basically need Copy-on-Write anyway, because I think this still is the behaviour we want for methods, even if we do Error-on-Write for indexing? (drop, reindex, ...)

Also, I've seen the following behavior with people I've
worked with: ...

In this specific case, having consistent Copy-on-Write could also "solve" it? If is provides consistent behaviour (always copy, never mutate the passed data), not depending on the actual data passed to the function.
(of course, it depends on a lot on the actual case, and you were only giving a generic flow. But making things consistent in itself (always copy-on-write or always error-on-write) could also help avoiding such issues, I think? (while now you can actually mutate or not, when you get (and ignore) warning)

@jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented Oct 12, 2020

Assume that we could actually detect chained indexing? In this hypothetical situation, would people then be more in favor of Copy-on-Write over Error-on-Write?

The assumption would be that we can detect the difference in __setitem__ between

df[df['B'] > 4]['B'] = 10

vs

temp = df[df['B'] > 4]
temp['B'] = 10

I am wondering if we shouldn't be able to actually detect this based on reference counting. In the first case, there is no additional reference to df[df['B'] > 4] than the calling dataframe (self) of the second __setitem__ operation.
(I am not sure how robust such a detection would be though)

@Dr-Irv
Copy link
Contributor

@Dr-Irv Dr-Irv commented Oct 12, 2020

In this specific case, having consistent Copy-on-Write could also "solve" it? If is provides consistent behaviour (always copy, never mutate the passed data), not depending on the actual data passed to the function.
(of course, it depends on a lot on the actual case, and you were only giving a generic flow. But making things consistent in itself (always copy-on-write or always error-on-write) could also help avoiding such issues, I think? (while now you can actually mutate or not, when you get (and ignore) warning)

It's possible a consistent copy-on-write would solve this issue, but we have to make sure we get all of the cases right, and I'm just not convinced we'll be able to think all of those cases through. The other issue for me is that copy-on-write silently does something, which might have side effects that a user would like to be made aware of, whereas error-on-write makes the user very aware of what he is doing.

@jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented Oct 12, 2020

It's possible a consistent copy-on-write would solve this issue, but we have to make sure we get all of the cases right, and I'm just not convinced we'll be able to think all of those cases through.

Isn't the same true for Error-on-Write? We also need to detect all cases where we need to error instead of allow the mutation.

@jreback
Copy link
Contributor

@jreback jreback commented Oct 12, 2020

@jorisvandenbossche you have just described the current implementation of SettingCopyWarning and it's entire purpose

it is tracking refs

but i don't actually think this is that robust - and thats the same reason i am not enthusiastic about copy on write

yes if we can guarantee that we can always do it great

but that's the problem i am not sure we really can

though maybe now that we have full control over PandasArrays it might be fully possible (way back when when i did SettingWithCopy) some ops just indexes directly to the numpy arrays and we had to be very defensive about this

today we have much more robust handling

@Dr-Irv
Copy link
Contributor

@Dr-Irv Dr-Irv commented Oct 12, 2020

It's possible a consistent copy-on-write would solve this issue, but we have to make sure we get all of the cases right, and I'm just not convinced we'll be able to think all of those cases through.

Isn't the same true for Error-on-Write? We also need to detect all cases where we need to error instead of allow the mutation.

Yes, I agree. But if we miss a case, we fix it by raising an exception. With copy-on-write, we fix it by doing the copy, but I'm guessing that there might be cases where we say "Given this case, we should have chosen error-on-write".

My sense is that it is easier to go from the current state of SettingWithCopyWarning to error-on-write and then to copy-on-write, than to go from SettingWithCopyWarning to copy-on-write and then discover we have to do error-on-write because we simply missed something. At least as a first step, we can change all the SettingWithCopyWarning stuff to just raise an exception and we have a first implementation. (That's my sense - I could be totally wrong)

@TomAugspurger
Copy link
Contributor Author

@TomAugspurger TomAugspurger commented Oct 12, 2020

Just to clarify the choice, is it fair to say that the only real choice is between

  1. Error on Write
  2. Copy on Write with SettingWithCopyWarning for chained indexing?

The initial proposal (and my PR) left out the warning specifically for chained indexing part. If we can reliably detect chained indexing (a big if?) then do people prefer option 2?

@jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented Oct 12, 2020

@jorisvandenbossche you have just described the current implementation of SettingCopyWarning and it's entire purpose

@jreback It might be achieved using the same mechanism (I would need to look more into the code of this, it's not a part I am very familiar with), but at least the goal I described is different: the current SettingWithCopyWarning warns in both example cases, while I want only the first (actual chained indexing) to raise, and not the second (assigning the subset to a variable).

Yes, I agree. But if we miss a case, we fix it by raising an exception. With copy-on-write, we fix it by doing the copy, but I'm guessing that there might be cases where we say "Given this case, we should have chosen error-on-write".

@Dr-Irv I don't fully understand your last sentence.
It's true that, in case of a bug in the detection code when to copy/error, the fix is more explicit with error-on-write ("loud", since people will start seeing an error if they relied on wrong behaviour, instead of simply no longer getting the wrong behaviour).
So do you mean that even if we choose copy-on-write, we need to fix cases by raising an error instead of copying because it would otherwise be a "silent" fix?

@jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented Oct 12, 2020

Just to clarify the choice, is it fair to say that the only real choice is between

Not sure if there is anybody in favor of it, but the original write-up also still has the option of "clear rules" (eg indexing columns could always be actual views)

@jbrockmendel
Copy link
Member

@jbrockmendel jbrockmendel commented Oct 12, 2020

Not sure if there is anybody in favor of it, but the original write-up also still has the option of "clear rules" (eg indexing columns could always be actual views)

I prefer this over the status quo.

@Dr-Irv
Copy link
Contributor

@Dr-Irv Dr-Irv commented Oct 12, 2020

@Dr-Irv I don't fully understand your last sentence.
It's true that, in case of a bug in the detection code when to copy/error, the fix is more explicit with error-on-write ("loud", since people will start seeing an error if they relied on wrong behaviour, instead of simply no longer getting the wrong behaviour).
So do you mean that even if we choose copy-on-write, we need to fix cases by raising an error instead of copying because it would otherwise be a "silent" fix?

No, I mean let's say that we choose copy-on-write. We think we get it right. We release the code. Someone discovers a place where we didn't get it right, and then we discuss it, and realize we should have chosen "error-on-write" in the first place. I'm just erring on the side of caution by preferring error-on-write.

@jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented Oct 13, 2020

let's say that we choose copy-on-write. We think we get it right. We release the code. Someone discovers a place where we didn't get it right, and then we discuss it, and realize we should have chosen "error-on-write" in the first place. I'm just erring on the side of caution by preferring error-on-write.

OK, understand it now ;)
But, I personally think we should discuss / choose what we actually prefer. Let's assume that after more discussions we think we prefer copy-on-write over error-on-write, then I would say we should go with that, even if that makes it more difficult to still change mind afterwards.
This also touches upon implementation / adoption questions which we didn't really discuss yet. I think we will need some way to opt-in to this new behaviour initially / have a flag to enable it / .. as an experimental stage in which we can still make breaking changes (like going from copy-on-write to error-on-write, which in principle could also still be done with deprecations first). This might be quite complex to first have both old and new behaviour side by side, but not sure there is a good alternative ..

(and to be clear: while I am mainly arguing here for Copy-on-Write to give some counter-arguments, I am not yet sure myself which of the two is "best". I mainly think it's something to futher explore/experiment/discuss)

Not sure if there is anybody in favor of it, but the original write-up also still has the option of "clear rules" (eg indexing columns could always be actual views)

I prefer this over the status quo.

I also prefer that over status-quo. But, @jbrockmendel, do you also prefer that over the copy-on-write/error-on-write alternatives discussed? (personally, I think that I am currently in favor of one of those two other options)
Note that also in this case of "clear rules", the discussion around what to do with the "typical chained indexing example" is still needed (eg if we have the rule that subsetting rows is always a copy, do we stop raising SettingWithCopyWarning for cases like subset = df[mask] with subsequent changes ?)

@Dr-Irv
Copy link
Contributor

@Dr-Irv Dr-Irv commented Oct 14, 2020

Here's something that I brought up in the October, 2020, dev meeting. Consider:

>>> df = pd.DataFrame({"x":[1,2,3], "y":[4,5,6]})
>>> df
   x  y
0  1  4
1  2  5
2  3  6
>>> sx=df['x']
>>> sx.iloc[1] = 10
>>> sx
0     1
1    10
2     3
Name: x, dtype: int64
>>> df
    x  y
0   1  4
1  10  5
2   3  6
>>> sdf = df[['x']]
>>> type(sdf)
<class 'pandas.core.frame.DataFrame'>
>>> sdf
    x
0   1
1  10
2   3
>>> sdf.loc[1,'x'] = 20
C:\Anaconda3\lib\site-packages\pandas\core\indexing.py:670: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)
__main__:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
>>> sdf
    x
0   1
1  20
2   3
>>> df
    x  y
0   1  4
1  10  5
2   3  6

Right now, the behavior of picking a single column using the syntax df['x'], which creates a Series versus df[['x']] which creates a DataFrame is different, in that when you get a Series, modification of that series silently modifies the underlying DataFrame while when you get a DataFrame, you get the SettingWithCopyWarning. If we choose copy-on-write, then someone who used the df['x'] syntax with the dependence that the underlying DataFrame was modified will no longer have working code, and will have a hard time figuring out that a new version of pandas is what caused the issue.

On the other hand, if we use error-on-write, then in BOTH cases, we raise an error, which makes it clear to the user that they need to make their intent clear.

@jbrockmendel
Copy link
Member

@jbrockmendel jbrockmendel commented Oct 14, 2020

Do either the CoW or EoW allow for the possibility that when i edit a view I am doing so intentionally?

@Dr-Irv
Copy link
Contributor

@Dr-Irv Dr-Irv commented Oct 14, 2020

Do either the CoW or EoW allow for the possibility that when i edit a view I am doing so intentionally?

By "doing so intentionally", you mean modifying the underlying DataFrame as a result of modifying the view?

@jbrockmendel
Copy link
Member

@jbrockmendel jbrockmendel commented Oct 14, 2020

By "doing so intentionally", you mean modifying the underlying DataFrame as a result of modifying the view?

Yes. A case where I want to modify both.

@Dr-Irv
Copy link
Contributor

@Dr-Irv Dr-Irv commented Oct 14, 2020

By "doing so intentionally", you mean modifying the underlying DataFrame as a result of modifying the view?

Yes. A case where I want to modify both.

Yes, the proposal has a as_mutable_view() method to allow the user to indicate they want to allow mutation. Works whether we do COW or EOW.

@toobaz
Copy link
Member

@toobaz toobaz commented Oct 14, 2020

Not sure if there is anybody in favor of it, but the original write-up also still has the option of "clear rules" (eg indexing columns could always be actual views)

Sorry, I'm missing the difference between these "clear rules" and the "whenever possible", could you clarify?

I'm also missing one thing in the proposal: how can something like df[["A"]].as_mutable_view() be implemented? Either df[["A"]] already provides a mutable view - hence the method call is noop - or it provides a copy of data - and it will be painfully hard to keep track and edit the original data (right?).

A syntax I would understand would be something like df.v[["A"]] - where .v is lazy and only operates when indexed. But even leaving aside the syntax, are we thinking this should work for all indexing operations?! For instance df.loc[['a', 'b'], slice(1, 10, 4)]?! Not that it's impossible, but it does looks difficult to me.

@TomAugspurger
Copy link
Contributor Author

@TomAugspurger TomAugspurger commented Oct 14, 2020

I'm also missing one thing in the proposal: how can something like df[["A"]].as_mutable_view() be implemented?

Indexing columns would always give views. The difference comes when you try to mutate the view

>>> df2 = df1[["A']]  # df2 is a view on df1
>>> df2.iloc[:, :] = 0  # Copies (with Copy on Write) or raises (with Error on Write).

And for the case of "I want to mutate df1, you'd use df2 = df1[["A"].as_mutable_view().

are we thinking this should work for all indexing operations?

I think this discussion is limited to

  • Indexing just the columns
  • (maybe) Indexing the rows in a way where NumPy can return a view when slicing a 1D array. So df.loc[:5, ["A"]] can be a view. But your example of ``df.loc[['a', 'b'], ["A"]]` cannot be a view, and would copy immediately.

If that sounds right, I'll clarify it in the proposal document.

@jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented Jul 11, 2021

@nickeubank yes, I went through your PRs and the discussion from back when before starting my proof-of-concept implementation (I am also using weakrefs, although with a slightly different approach: at the array level inside the manager instead of at the DataFrame level). Thanks for your work from back then!

Giving the PR a test drive is certainly welcome.

@jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented Jul 11, 2021

I created a modified version of Tom's google doc with a longer version of the proposal: https://docs.google.com/document/d/1ZCQ9mx3LBMy-nhwRl33_jgcvWo9IWdEfxDNQ2thyTb0/edit?usp=sharing, and also notified the mailing list about this proposal.

@nickeubank
Copy link
Contributor

@nickeubank nickeubank commented Jul 12, 2021

@nickeubank yes, I went through your PRs and the discussion from back when before starting my proof-of-concept implementation (I am also using weakrefs, although with a slightly different approach: at the array level inside the manager instead of at the DataFrame level). Thanks for your work from back then!

Haha -- glad you found something useful in there! As you can still I got quite stuck, so I'm glad you found a workaround by embedding the functionality at the array manager level.

@jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented Jul 16, 2021

One aspect that has come up on the mailing list, is the risk that users can incorrectly mutate views in places where we expose internal data.

Since this is Python, a user will always find some (private) way to incorrectly mutate data without triggering the copy-on-write paths, but we should indeed look into ways to protect against such improper mutation of views.

Listing the possible ways to get the internal "array" data from DataFrame/Series objects:

  • Series.values / Series.array -> returning a numpy array or pandas ExtensionArray, which currently return the stored data as mutable arrays as is (or as views). Mutating such an array wouldn't trigger Copy-on-Write which is managed on the DataFrame/Series level. To prevent users from doing this, we could return those arrays as "read-only"? (to avoid always doing a defensive copy here)

    • (numpy arrays already have a concept of "read-only", for EAs we would need to add this)
  • Series.to_numpy() -> returning a numpy array. This method has a copy keyword with currently a default of False. We could either make this copy=True by default, or similarly to the above make it read-only by default, leaving the copy=True/False options to choose from explicitly.

  • DataFrame.to_numpy() / DataFrame.values -> returning a 2D numpy array, which is by definition always a copy (by concatting multiple 1D arrays). Except for the 1-column case, this could still be a view. For simplicity, I would make this case return a copy as well (if you want a view the user can get the Series). Or alternatively this case could follow the logic of Series.to_numpy above.

@nickeubank
Copy link
Contributor

@nickeubank nickeubank commented Jul 16, 2021

As memory serves, numpy arrays moving in and out of pandas was definitely where I got stuck in my own implementation attempt. I think I struggled in particular with people passing numpy arrays into pd.DataFrame() without a defensive copy, but I think the situation here is analogous.

Do we think it'd be better for Series.values to return a read-only copy, or just return a copy? The former is faster, but the latter seems far more user-friendly. Personally, given anyone who really needs performance would be able to get a view using Series.to_numpy(), and I think for most users that performance hit isn't that big of a deal, my inclination would be just to do a defensive copy for Series.values.

Moreover, that would prevent a breaking change, since I imagine many people have Series.values objects that they manipulate in their current code, and that would break if we moved to returning "read-only" arrays.

(I recognize this may sometimes be a minority view, but my own preference is almost always for the default pandas behavior to emphasize (a) ease-of-use and (b) correctness when there's a tension with those and performance, provided there are ways to get performance for motivated parties)

@jreback
Copy link
Contributor

@jreback jreback commented Jul 16, 2021

I would like to see what happens in our test suite if we change both .to_numpy() and .values to return read-only view as per #36195 (comment) . I view this as a weird contract that we have with the outside world, e.g. a private api essentially. Ideally would like to remove this access point. Yes we could copy as per @nickeubank but ok with doing the api change i think. Of course this is basically impossible to deprecate so would have to make a decision on this and release in a major update.

@jbrockmendel
Copy link
Member

@jbrockmendel jbrockmendel commented Jul 16, 2021

I'm glad there is a proof of concept to help clarify what this looks like.

I do not like the fact that nothing can ever be "just a view" with these semantics, including series[::-1], frame[col], frame[:]. Users reasonably expect numpy semantics for these.

We should revisit the alternative "clear/simple rules" approach that is "indexing on columns always gives a view" (#33597). This is simpler to explain/grok, simpler to implement, and not dependent on BlockManager vs ArrayManager.

@matthew-brett
Copy link
Contributor

@matthew-brett matthew-brett commented Aug 19, 2021

Please forgive my lack of understanding, but how does the Error-on-Write proposal here differ from the current behavior after:

pd.set_option('mode.chained_assignment', 'raise')

? I mean, apart from the proposal to add the extra copy_if_needed and as_mutable_view methods?

@matthew-brett
Copy link
Contributor

@matthew-brett matthew-brett commented Aug 19, 2021

And - rreflecting on my fairly heavy use of Numpy for about 15 years - I rarely depend on having a view onto the array, outside the direct indexing caes like a[:10] = 1. I'm sure I would be happy to have the option to do that in Pandas, and sometimes using as_mutable_view, but, speaking as a heavy Numpy user, Copy-on-Write in Pandas would make my teaching and working life a lot easier.

@jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented Aug 19, 2021

@matthew-brett thanks for the feedback!

how does the Error-on-Write proposal here differ from the current behavior after: pd.set_option('mode.chained_assignment', 'raise') ?

For writing to subsets, it might not differ that much in general (there are some cases where we currently don't warn when writing to a view or copy, but that could be changed). But for the current mode to raise on chained assignment, it doesn't work the other way around. Say you have a DataFrame df and create a subset of it subset = df[:5]. Modifying subset will raise a warning (or error with the option), but you can currently happily modify df which could be reflected in subset without getting a warning/error.
With the Copy-on-Write proposal, the idea is to ensure that both objects don't modify each other in both directions (both parent -> child, as child -> parent).

@matthew-brett
Copy link
Contributor

@matthew-brett matthew-brett commented Aug 20, 2021

@jorisvandenbossche - thanks for the clarification. I was asking because I have used and taught pd.set_option('mode.chained_assignment', 'raise') - but I found that it was not uncommon for my students to run into errors that took some significant investigation to fix. I bet you would get a fairly constant level of question and complaint with that option.

@matthew-brett
Copy link
Contributor

@matthew-brett matthew-brett commented Aug 20, 2021

I guess one major worry here would be that, with Copy-on-Write, Pandas 2.0 will silently return different answers from Pandas < 2.0.

Is there some way to harvest notebooks from Github, to see how often that would happen, in practice? And in what proportion of cases the use of the < 2.0 behavior was actually a bug, where the user had not realized that:

  • The assignment was going to a copy, and thus being silently discarded, or
  • The assignment was going to a view, and unexpectedly modifying another series or data frame.

@matthew-brett
Copy link
Contributor

@matthew-brett matthew-brett commented Aug 23, 2021

Can I ask - what would it take - to get this one over the line? And how can I help? For example, if I can find some time, maybe funded, to do the Github notebook exploration above, and write it up, how useful would that be? If not very useful, is there something else I can do?

@jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented Aug 25, 2021

I guess one major worry here would be that, with Copy-on-Write, Pandas 2.0 will silently return different answers from Pandas < 2.0.

Yes, this is indeed a worry, and it will be a breaking change for sure. The problematic cases is where people rely on the fact that mutating a subset will mutate the parent or vice versa (both in chained indexing as in other situations). If setitem operations would silently be ignored or no longer propagate (i.e. have no effect because of the copy semantics), you will end up with different values.

The goal is that we would have a version of pandas (eg the last 1.x release before 2.0, or 2.x before 3.0) that raises a deprecation warning for all cases where we would currently mutate a parent/child but would no longer do with CoW. I am relatively confident that this should be possible to do. But of course, people ignore warnings / update versions without first checking etc. So many people would run into such problems anyway, I think.

It might be an option to keep warning for a longer time also after the default behaviour has changed (with the option to turn off those warnings).

@jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented Aug 25, 2021

Is there some way to harvest notebooks from Github, to see how often that would happen, in practice? And in what proportion of cases the use of the < 2.0 behavior was actually a bug, where the user had not realized that:

If you would have the means / time to do something like that, I think it would certainly be useful and potentially insightful. I am not sure how easy it would be, though. Typically notebooks don't have tests so to understand if the code would be impacted by those changes, it might require understanding/inspecting code manually.
(running test suites of projects depending on pandas might be another option? Although notebooks give more an idea of end-user code)

In general being able to better estimate impact / whether users rely much on mutation propagation or not would be very useful (but hard ..)

Can I ask - what would it take - to get this one over the line? And how can I help?

In addition to the aspect above, I think there are two main things to move forward on the short term: 1) I would like to see more (explicit) support / discussion of the proposal (both by core devs and users), and 2) the code implementation itself also needs to get some actual (technical) review (#41878)

@matthew-brett
Copy link
Contributor

@matthew-brett matthew-brett commented Aug 25, 2021

Thanks. My thinking with the Github / notebook harvesting, was to identify a smallish random sample of - say - 100 notebooks, then run them:

  • With the usual default settings, and then
  • With pd.set_option('mode.chained_assignment', 'raise') or some more useful superset if that is practical, and then
  • With the new copy-on-write branch.

and look for

  1. Any new errors with pd.set_option('mode.chained_assignment', 'raise').
  2. Any differences in output between the versions with and without copy-on-write.

After that, yes, whoever was doing that work would need to go through and work out why the output changed / why the error was raised, and whether the change / error was as a result of the user intentionally modifying the data frame / series in-place, or accidental / unintentional / incidental. I can't think of a way to do that automatically, so it would have to be a judgement call by reading the code.

If you thought that would make a significant difference to the discussion, I can try and work out whether and how I can carve out some time to do that.

@justmarkham
Copy link
Contributor

@justmarkham justmarkham commented Oct 11, 2021

Hi all. I'm a long-time pandas user, and I've been teaching pandas to beginner/intermediate data science students since 2015. Much of that teaching is online, and as a result I've answered hundreds (or perhaps thousands) of pandas questions from newer users. That is all to say that I have somewhat of a pulse on what newer pandas users ask about.

I read through the copy-on-write proposal, and I wanted to share a few reactions to the proposal:

  1. As a pandas user, I don't want to have to memorize complicated rules in order to properly use the library, regardless of whether the rules are well-documented or not. The proposal gives me a simple rule to memorize, and so I'm in favor of that.

  2. I don't know what most pandas users expect, but personally, I never intend to mutate the "original" object when changing a "derived" object. Thus the proposal would make things actually work like I would hope they would work.

  3. I learned pandas before NumPy, and in fact I still don't know NumPy that well. As such, I'm not of the belief that "pandas always needs to behave consistently with NumPy". I still teach pandas with the assumption that the student knows nothing about NumPy, and I don't believe that every pandas user needs to also learn NumPy at some point. However, I do understand that many pandas users do know NumPy, and those users might be thrown off by differences in behavior between pandas and NumPy.

  4. If I understood correctly, chained assignment would stop working under this proposal. I would be in favor of chained assignment no longer being allowed, even for cases in which it currently works, since you can always just use loc instead (right?). In that case, I would want an error (or at least a warning) to be raised if you try to do a chained assignment.

I hope some of this is helpful to the discussion!

@chendaniely
Copy link
Contributor

@chendaniely chendaniely commented Oct 14, 2021

Hello maintainers. Thank you for putting out a thoughtful proposal and getting more community feedback in the process.

I echo all of justmarkham's comments above, and left my comments in the google doc.

the tl;dr:

  • Proposal is great for an instructor who teaches novices. It seems to be most aligned with the learner's mental model when I see their pandas code.
  • I'm do not know about any of the performance implications of this change. And what use cases people have implemented that rely on the copy/view behaviors in pandas.
  • Documentation-wise I think all of the examples in the google document would fit well in the indexing and selecting data chapter in the user guide, and let me know how I can help on that end.

Here's an example of what things look like from the R world of copy-on-modify: https://gist.github.com/chendaniely/53c2ea975a61bd2f05ef2aca46f66062

@chendaniely
Copy link
Contributor

@chendaniely chendaniely commented Oct 15, 2021

Sorry for the double post, but I updated the above gist for what "chained indexing" looks like from the R world (if that helps any)

The direct file is here: https://gist.github.com/chendaniely/53c2ea975a61bd2f05ef2aca46f66062#file-chained_indexing-base_tidyverse-r

@jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented Oct 29, 2021

@justmarkham @chendaniely Thanks a lot for the feedback, that's really valuable.

I am personally happy to hear that you both think the current proposal would be more intuitive for (novice) users than the current situation (since this was one of my goals with the proposal). I also think that we should not assume our users to know (numpy) concepts of copy vs view in indexing.

[@justmarkham] If I understood correctly, chained assignment would stop working under this proposal. I would be in favor of chained assignment no longer being allowed, even for cases in which it currently works, since you can always just use loc instead (right?).
[@chendaniely] I updated the above gist for what "chained indexing" looks like from the R world

So chained assignment actually works in R!
With my current proposal, chained assignment would indeed never work.

There is one aspect related to this brought up by @shoyer on the mailing list: we could make one exception on the copy-on-write rule, i.e. accessing a single column of a DataFrame as a series would still be a view, any other indexing operation would follow copy-on-write as proposed. I recently commented about this on the mailing list at https://mail.python.org/pipermail/pandas-dev/2021-October/001395.html

This would ensure that one type of chained assignment (and possibly the most common one) keeps working: df[col][...] = .... This would make the transition smoother.
But at the same time, exactly because it is only one type of chained assignment that would work, the user needs to understand that selecting the column first works, but something like df[row_selection][col] = ... does not work. Although it seems a simple exception to the general rule, my feeling is that it will give rise to several potentially confusing corner cases.

[@chendaniely] I'm do not know about any of the performance implications of this change. And what use cases people have implemented that rely on the copy/view behaviors in pandas.

Yes, that's a good point and something I should explore in more detail (and add a section about in the document). I think in general for the majority of workflows, the performance will either stay the same or improve (because of avoiding unnecessary copies).

There will be specific cases that might be impacted though:

  • The relatively frequent case of mutating a dataframe that has a reference to another object: this can potentially become slower (if it's a case where we don't copy right now), or see a "delayed" impact (overall performance might be the same but the copy is now delayed at the moment when actually mutating)
  • A potential corner case that is heavily impacted is where you mutate a dataframe repeatedly in a for loop, and where you do some operation in the for loop that triggers the copy-on-write mechanism repeatedly (I should try to come up with some example code, but can't think of a sensible example right now. For example, @shoyer showed one example in an old discussion at #10973 (comment), but that would actually not trigger copy-on-write in my proof-of-concept implementation since I keep track of views per column instead of per full dataframe).

@jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented May 10, 2022

As a further update on this thread: I have an initial implementation at #46958 (now for the default BlockManager, not for the opt-in ArrayManager). I also opened a separate PR that tries to add the examples from the proposal (and more) as unit tests: #46979

@nickeubank
Copy link
Contributor

@nickeubank nickeubank commented May 17, 2022

A little cross promotion: https://youtu.be/aBeEN2klZQE

@mattharrison
Copy link

@mattharrison mattharrison commented May 20, 2022

This can't come soon enough. (Background: I teach 50-100+ folks per month on Pandas and wrote Learning the Pandas Library, Pandas Cookbook 2nd Ed, and Effective Pandas. I also consult on Pandas.)

Happy to give a prototype a run through on some of my Pandas material. However, I never hit our friend SettingWithCopyWarning with the code I write. (I am interested in seeing speed/memory improvements though.)

As far as some code breaking, IMO you should release the next minor version before this gets pushed updating the deprecation warning to state that certain behavior will be removed, and then push a minor (major?) version with the fix.

WRT "the breaking away from NumPy behavior" argument preventing this. No one I teach (or have consulted with) cares about that. Most of them don't use NumPy (and don't need to). Seems like a weird implementation detail to prevent such a useful update.

Again, this can't come soon enough for those who haven't embraced chaining.

@nickeubank
Copy link
Contributor

@nickeubank nickeubank commented May 23, 2022

@jorisvandenbossche Out of curiosity, while I recognize that moving to copy on view is technically a breaking change, are you aware of any real-world examples where people are relying on pandas returning views in their code?

The rules around when pandas returns views are so idiosyncratic and sensitive to apparently-unrelated code changes that I've never felt remotely safe using views, and am curious if anyone else ever does...

@matthew-brett
Copy link
Contributor

@matthew-brett matthew-brett commented May 23, 2022

I just wanted to add a huge - yes please - how can I help - when can we have this! And @nickeubank - I agree - I never feel safe deliberately using views. I guess the problem is that a lot of code doesn't deliberately use them, but uses them by accident. I did offer to do a survey to find out if that happened in real life, in #36195 (comment) - but I would in general love to know what it would take to get this in.

@jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented May 23, 2022

Out of curiosity, while I recognize that moving to copy on view is technically a breaking change, are you aware of any real-world examples where people are relying on pandas returning views in their code?

I think that any case of chained assignment that currently works (intentionally, or somewhat by accident) is an example using a view that for sure people do rely on.

Apart from that, I fully agree that it's probably not common that people create an actual subset dataframe that is a view, and mutate that with the intention to also mutate the original dataframe (which is for me a good reason to not do this, because this is not what people expect, and in a large majority of cases also not what people want).

And to be clear, the chained assignment comprises both the typical case of doing something like df["C"][df["A"] > 1] = 10 (example from my slides), as the non-typical cases like df[1:3]["C"] = 10.
But I think the first one (i.e. selecting a column, and then directly mutating that column in a chained setitem) is certainly a common pattern (although we will typically recommend to use .loc instead), and the fact that this will no longer work will certainly affect quite some people. (I do think that we can (first warn and then) error for this, to make this very explicit that it doesn't work, and to avoid silent failures because of this change)

@nickeubank
Copy link
Contributor

@nickeubank nickeubank commented May 23, 2022

But I think the first one (i.e. selecting a column, and then directly mutating that column in a chained setitem) is certainly a common pattern (although we will typically recommend to use .loc instead), and the fact that this will no longer work will certainly affect quite some people

Yeah, ok, I can see that! The compactness of the pattern means all the weird ways a view can "break" don't come into play, and so I can see how it would be frequently used naively. Thanks!

@jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented May 25, 2022

About the "what would it take to get this in / how can I help":

First, it would actually be good to get some more explicit support from other pandas core devs (it's hard to estimate to what extent there is consensus around the latest proposal).

But apart from that:

  • We need an implementation:
    • #46958 implements the core of the proposal, and mainly needs people with time to review this.

    • Apart from that core implementation, there will be other work needed to fully implement the proposal. Some additional work items:

      • Implement the new behaviour for constructors (eg constructing a DataFrame/Series from an existing DataFrame/Series should follow the same rules as indexing -> behaves as a copy through CoW)
      • Use the lazy copy (with CoW) mechanism in more methods where it can be used (currently I only explicitly did reset_index and rename as POC)
      • Explore / update the APIs that return numpy arrays (.values, to_numpy()). Potential idea is to make the returned array read-only by default
      • ...

      (I should create a tracker issues with a task list for those items)

    • For now I only implemented the new behaviour (toggled with an option), but in addition we will also need to implement all the deprecation warnings for cases that will change in the current behaviour (initially also behind an option).

  • Once the (core) implementation lands in master / a release, then it will be super useful to have people give this a try, run their code or test suites with it, etc, so we can iron out bugs / missing warnings / or discover unexpected consequences that need to be addressed/discussed.

So on the short term helping with with reviewing or some extra implementation tasks is most needed, but certainly also not super approachable to help with.
Once there is a basic implementation it will be easier to help with testing / giving feedback (although it's certainly already welcome to give the PR branch a try as well).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests