Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: New copy / view semantics using Copy-on-Write #46958

Draft
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented May 6, 2022

This is a port of the proof of concept using the ArrayManager in #41878 to the default BlockManager.

This PR is a start to implement the proposal described in more detail in https://docs.google.com/document/d/1ZCQ9mx3LBMy-nhwRl33_jgcvWo9IWdEfxDNQ2thyTb0/edit / discussed in #36195
A very brief summary of the behaviour you get:

  • Any subset (so also a slice, single column access, etc) behaves as a copy (using CoW, or is already a copy)
  • DataFrame methods that return a new DataFrame return shallow copies (using CoW) if applicable (for now, this is only implemented / tested for reset_index and rename, needs to be expanded to other methods)

Implementation approach
This PR adds Copy-on-Write (CoW) functionality to the DataFrame/Series at the BlockManager level. It does this by adding a new .refs attribute to the BlockManager that, if populated, keeps a list of weakref references to the blocks it shares data with (so for the BlockManager, this reference tracking is done per block, so len(mgr.blocks) == len(mgr.refs)).
This ensures that if we are modifying a block of a child manager, we can check if it is referencing (viewing) another block, and if needed do a copy on write. And also if we are modifying a block of a parent manager, we can check if that block is being referenced by another manager and if needed do a copy on write in this parent frame. (of course, a manager can both be parent and child at the same time, so those two checks always happen both)


How to enable this new behaviour?
Currently this PR simply enabled the new behaviour with CoW, but of course that will need to be turned off before merging (which also means that some of the changes will need to put behind a feature flag. I only did that now in some places).

I think that ideally, (on the short term) users have a way to enable the future behaviour (eg using an option), but also have a way to enable additional warnings.
I already started adding an option, currently the boolean flag options.mode.copy_on_write=True|False:

  • Do we have a better name? I personally don't like that it uses "copy_on_write", because this is the internal implementation detail, and not what most end users really have to care about. But something like "new_copy_view_behaviour" is also not super ..
  • In addition to True/False, we can probably add "warn" as a third option, which gives warnings in cases where behaviour would change.

Some notes:

  • Not everything is already implemented (there are a couple of TODO(CoW) in the code), although the majority for indexing / setitem is done.
  • This PR does not yet try to tackle copy/view behaviour for the constructors, or for numpy array access (.values). Given the size of this PR already, those can probably be done in separate PRs?
  • Most tests are already passing (with changes), but still need to fix a few tests outside of /indexing
  • We will also need to think about a way to test this (in a similar way as the ArrayManager with an environment variable?)

I will also pull out some of the changes in separate PRs (eg the new test file could already be discussed/reviewed separately (-> #46979), and the column_setitem is maybe also something that could be done as pre-cursor)

@pandas-dev pandas-dev deleted a comment from jreback May 7, 2022
@twoertwein

This comment was marked as outdated.

@jreback

This comment was marked as outdated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants