I am working with a ~70 GB DataFrame of around 7 million rows and 56 columns. I want to subset it to a smaller one by taking 100,000 random rows from the original DataFrame.
While doing so, I observed very strange behavior:
`df` is my 7-million-row DataFrame, which I read into Python from a .parquet file.
I first tried the following:
```python
import pandas as pd

df = df.sample(100000)
```
However, executing this chunk takes forever. I always interrupted the command after ten minutes, because I am sure drawing random rows from a DataFrame can't take that long.
Now if I execute the following chunk, the code runs through in just a few seconds:
```python
import pandas as pd

df2 = df
df = df2.sample(100000)
```
What is happening here? Why does `.sample()` take forever on the first try, but run in just a few seconds on the second? How could copying the DataFrame affect the speed of the computation? `df` and `df2` should be exactly the same object, right? I could of course just continue working with `df2`, but I don't want two 70 GB DataFrames stored in memory.
`df2 = df` creates an additional name pointing to the same DataFrame; no data is copied. Pandas does not know that you have two references to it, so the performance difference you are seeing cannot be caused by the assignment. Something else must be different between the two runs. I advise hunting that down and adding it to your question.
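
To see that the assignment is an alias rather than a copy, here is a minimal sketch (using a tiny stand-in DataFrame, not your 70 GB one):

```python
import pandas as pd

df = pd.DataFrame({"a": range(5)})
df2 = df                # no copy is made: df2 is a second name for the same object
print(df2 is df)        # True -- both names point to one DataFrame in memory

df2.loc[0, "a"] = 99
print(df.loc[0, "a"])   # 99 -- a change made through df2 is visible through df
```

So there is no second 70 GB copy created by the assignment, and nothing about it that could make `.sample()` faster or slower. To pin down what actually differs between your two runs, you could time the call explicitly, for example (assuming `df` here is your original 7-million-row DataFrame):

```python
import time

start = time.perf_counter()
sampled = df.sample(100000)
print(f"df.sample took {time.perf_counter() - start:.2f} s")
```

If the call is slow only right after reading the parquet file, the time may be going into memory pressure or swapping rather than into `.sample()` itself.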