API: New copy / view semantics using Copy-on-Write #46958
Draft
+1,503
−235
This is a port of the proof of concept using the ArrayManager in #41878 to the default BlockManager.
This PR is a first step toward implementing the proposal described in more detail in https://docs.google.com/document/d/1ZCQ9mx3LBMy-nhwRl33_jgcvWo9IWdEfxDNQ2thyTb0/edit and discussed in #36195.
A very brief summary of the behaviour you get:
[...] `reset_index` and `rename`, needs to be expanded to other methods)
Implementation approach
This PR adds Copy-on-Write (CoW) functionality to the DataFrame/Series at the BlockManager level. It does this by adding a new `.refs` attribute to the `BlockManager` that, if populated, keeps a list of `weakref` references to the blocks it shares data with (for the BlockManager, this reference tracking is done per block, so `len(mgr.blocks) == len(mgr.refs)`).
This ensures that if we are modifying a block of a child manager, we can check whether it is referencing (viewing) another block and, if needed, do a copy on write. Likewise, if we are modifying a block of a parent manager, we can check whether that block is being referenced by another manager and, if needed, do a copy on write in this parent frame. (Of course, a manager can be both parent and child at the same time, so both checks always happen.)
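To make the two checks concrete, here is a minimal, self-contained sketch of per-block reference tracking with `weakref`. It is a toy model, not the code in this PR: the `Block` and `Manager` classes, the `view` and `setitem` methods, and the use of `weakref.getweakrefcount` for the parent-side check are all assumptions made for illustration.

```python
import weakref


class Block:
    """Toy stand-in for a block: holds a list of values."""

    def __init__(self, values):
        self.values = values


class Manager:
    """Toy manager with one `refs` entry per block (hypothetical API)."""

    def __init__(self, blocks, refs=None):
        self.blocks = list(blocks)
        # refs[i] is a weakref to the parent block that blocks[i] views, or None
        self.refs = list(refs) if refs is not None else [None] * len(self.blocks)

    def view(self):
        """Return a child manager whose blocks share data with this one."""
        child_blocks = [Block(blk.values) for blk in self.blocks]  # same list objects
        child_refs = [weakref.ref(blk) for blk in self.blocks]
        return Manager(child_blocks, child_refs)

    def _is_shared(self, i):
        ref = self.refs[i]
        # child check: do we still view a living parent block?
        views_parent = ref is not None and ref() is not None
        # parent check: does any other manager hold a weakref to our block?
        is_viewed = weakref.getweakrefcount(self.blocks[i]) > 0
        return views_parent or is_viewed

    def setitem(self, i, pos, value):
        """Write into block i, copying it first if its data is shared."""
        if self._is_shared(i):
            self.blocks[i] = Block(list(self.blocks[i].values))  # copy on write
            self.refs[i] = None
        self.blocks[i].values[pos] = value


parent = Manager([Block([1, 2, 3])])
child = parent.view()
child.setitem(0, 0, 100)   # child copies its block; parent is not affected
parent.setitem(0, 1, 50)   # no live references remain, so this writes in place
```

The sketch relies on CPython dropping the weak reference as soon as the child lets go of it; a real implementation would need to be more careful about object lifetimes.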
How to enable this new behaviour?
Currently this PR simply enables the new behaviour with CoW, but of course that will need to be turned off before merging (which also means that some of the changes will need to be put behind a feature flag; I have only done that in some places so far).
I think that ideally, in the short term, users have a way to enable the future behaviour (e.g. using an option), but also a way to enable additional warnings.
I already started adding an option, currently the boolean flag `options.mode.copy_on_write=True|False`.
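As a rough illustration of what the flag means for users, the sketch below assumes a pandas build in which the `mode.copy_on_write` option is registered (older releases do not have it); the variable names are made up for the example. It scopes the option with `pd.option_context` and then modifies a column selected from a DataFrame:

```python
import pandas as pd

# Assumption: this pandas version registers the "mode.copy_on_write" option.
with pd.option_context("mode.copy_on_write", True):
    df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
    subset = df["a"]       # behaves as a copy under CoW
    subset.iloc[0] = 100   # the write goes to the subset only
    parent_value = int(df.loc[0, "a"])
    subset_value = int(subset.iloc[0])
```

With the option enabled, the write to `subset` should leave `df` unchanged, which is the "any subset behaves as a copy" rule in practice.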
Some notes:
[...] `TODO(CoW)` in the code), although the majority for indexing / setitem is done.
[...] `.values`). Given the size of this PR already, those can probably be done in separate PRs?
I will also pull out some of the changes into separate PRs (e.g. the new test file could already be discussed/reviewed separately (-> #46979), and the `column_setitem` is maybe also something that could be done as a precursor).