I am working with a ~70 GB DataFrame of around 7 million rows and 56 columns. I want to subset it to a smaller one by taking 100,000 random rows from the original DataFrame.
While doing so, I observed very strange behavior:
`df` is my 7-million-row DataFrame, which I read into Python from a .parquet file.
I first tried the following:
```python
import pandas as pd

df = df.sample(100000)
```
However, executing this chunk takes forever. I always interrupted the command after ten minutes, because I am sure drawing random rows from a DataFrame can't take that long.
Now if I execute the following chunk, the code runs through in just a few seconds:
```python
import pandas as pd

df2 = df
df = df2.sample(100000)
```
What is happening here? Why does `.sample()` take forever on the first try, but run in just a few seconds on the second? How could copying the DataFrame affect the speed of the computation? `df` and `df2` should be exactly the same object, right? I could of course just continue working with `df2`, but I don't want two 70 GB DataFrames stored in memory.
`df2 = df` creates an additional name pointing to the same DataFrame; no data is copied. Pandas does not know that you have two references to it, so the performance difference you are seeing cannot be caused by the assignment. Something else must be different between the two runs. I advise hunting that down and adding it to your question.
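
To see that the assignment is an alias rather than a copy, here is a minimal sketch (using a tiny stand-in DataFrame, not your 70 GB one):

```python
import pandas as pd

df = pd.DataFrame({"a": range(5)})
df2 = df                # no copy is made: df2 is a second name for the same object
print(df2 is df)        # True -- both names point to one DataFrame in memory

df2.loc[0, "a"] = 99
print(df.loc[0, "a"])   # 99 -- a change made through df2 is visible through df
```

So there is no second 70 GB copy created by the assignment, and nothing about it that could make `.sample()` faster or slower. To pin down what actually differs between your two runs, you could time the call explicitly, for example (assuming `df` here is your original 7-million-row DataFrame):

```python
import time

start = time.perf_counter()
sampled = df.sample(100000)
print(f"df.sample took {time.perf_counter() - start:.2f} s")
```

If the call is slow only right after reading the parquet file, the time may be going into memory pressure or swapping rather than into `.sample()` itself.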