Description
Is your feature request related to a problem?
I recently posted a question on Stack Overflow asking for help with pandas and down-sampling. I had some forward-binned data: particle counts logged at various diameters:
296.54,303.14,311.88,320.87,330.12,339.63,349.42,359.49,369.86,380.52,391.49,402.77 nm
The counts in the first bin were logged between 296.54 and 303.14 nm, and so on.
I wanted to re-bin those counts to a new set of diameters:
300,325,350,375,400 nm
where the counts would be logged between 300 and 325 nm and so on.
I had to, column by column, brute-force add and interpolate matching diameters. For instance, the 300-325 nm bin contains part of the 296.54-303.14 nm bin, the bins up to 320.87 nm in full, and part of the 320.87-330.12 nm bin, and so on.
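To make the arithmetic concrete, here is how the edge fractions for the 300-325 nm bin work out (my own worked example, using the diameters above):

```python
import numpy as np

# high-res bin edges around the new 300-325 nm bin
edges = np.array([296.54, 303.14, 311.88, 320.87, 330.12])

# fraction of the 296.54-303.14 nm bin that falls inside [300, 325]
first_fraction = (edges[1] - 300.0) / (edges[1] - edges[0])   # ~0.476

# the 303.14-311.88 and 311.88-320.87 nm bins lie entirely inside the new bin

# fraction of the 320.87-330.12 nm bin that falls inside [300, 325]
last_fraction = (325.0 - edges[3]) / (edges[4] - edges[3])    # ~0.446
```

The new bin's count is then the full middle bins plus these fractions of the two partially overlapped edge bins.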
Describe the solution you'd like
I would like a way to interpolate a pandas DataFrame not to a higher resolution but to a lower one, defined by a second set of values, either along the columns (my specific case) or along the index. Currently the interpolate function only fills missing data; to my knowledge it cannot rebuild a DataFrame at an entirely new resolution, one that may not, in fact, be uniformly spaced.
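One way such a feature could be implemented (a sketch only, not an existing pandas API; `rebin_columns` is my own hypothetical name) is to interpolate the cumulative counts at the new edges and then difference, which is equivalent to summing fractional bin overlaps and conserves totals:

```python
import numpy as np
import pandas as pd

def rebin_columns(df, new_edges):
    # Hypothetical sketch of the requested behaviour: conservatively rebin
    # forward-binned counts whose column labels are the left bin edges.
    old_edges = df.columns.to_numpy(dtype=float)
    values = df.to_numpy(dtype=float)
    # cumulative counts at each old edge; the last column has no right edge
    # inside old_edges, so it is excluded from the running sum
    cum = np.hstack([np.zeros((len(df), 1)),
                     np.cumsum(values[:, :-1], axis=1)])
    # linearly interpolate the cumulative curve at the new edges, then
    # difference to recover per-bin counts -- this conserves total counts
    new_cum = np.vstack([np.interp(new_edges, old_edges, row) for row in cum])
    return pd.DataFrame(np.diff(new_cum, axis=1),
                        index=df.index, columns=new_edges[:-1])
```

Because `np.interp` handles arbitrary monotonic grids, the new edges need not be uniformly spaced.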
API breaking implications
None anticipated: this would be a new method (or a new mode of an existing one) rather than a change to existing behaviour.
Describe alternatives you've considered
The challenge is that the size spectrum cannot be linearly interpolated between the smallest and largest diameters, so a simple sum, re-index, and interpolate would not conserve the spectral data.
Nor does the resample function do the trick, since it applies only to datetime-like indices.
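As a small illustration of why the naive approach fails (my own example): a union-reindex followed by interpolate only estimates point values at the new diameters, it does not redistribute counts, so the total changes:

```python
import pandas as pd

counts = pd.Series([5.0, 7.0, 3.0], index=[296.54, 303.14, 311.88])
# union-reindex and interpolate: fills point estimates at the new
# diameters instead of redistributing the binned counts
naive = (counts.reindex(counts.index.union([300.0, 310.0]))
               .interpolate(method="index"))
# the "total counts" have changed, so the spectrum is not conserved
print(naive.sum(), counts.sum())
```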
Additional context
This particular example is for forward-binned data. Some thought would have to be given to handling backward-binned or center-binned data.
My code example
I apologize for the crudity, but I am not a Pandas developer. Here is what my implementation ended up being:
import string

import numpy as np
import pandas as pd

# set up a 10x12 test array
test_array = np.arange(120).reshape(10, 12)
# set a simple index
index_list = list(string.ascii_lowercase)[:10]
# initialize the high-resolution dataframe
df = pd.DataFrame(test_array, index=index_list)
# set the high-resolution column headers
df.columns = [296.54, 303.14, 311.88, 320.87, 330.12, 339.63,
              349.42, 359.49, 369.86, 380.52, 391.49, 402.77]
# set a column diameter array to searchsort through
df_column_array = np.array(df.columns)
# set the new low-resolution diameters
new_columns = np.array([300, 325, 350, 375, 400])
# initialize a low-res dataframe
new_df = pd.DataFrame(columns=new_columns, index=index_list)
# loop through the low-res columns up to the second last
# (since these bins are forward counted)
for i, column in enumerate(new_columns[:-1]):
    # locate the closest high-res bin edge before the current low-res bin
    df_columns_backward_index = np.searchsorted(df.columns, column) - 1
    # locate the closest high-res bin edge after the current low-res bin
    df_columns_forward_index = np.searchsorted(df.columns, new_columns[i + 1])
    # calculate the first high-res bin size
    backward_bin_size = (df.columns[df_columns_backward_index + 1]
                         - df.columns[df_columns_backward_index])
    # calculate the offset and fraction for interpolation
    backward_offset = column - df.columns[df_columns_backward_index]
    backward_fraction = 1 - backward_offset / backward_bin_size
    # repeat for the last high-res bin closest to the current low-res bin
    forward_bin_size = (df.columns[df_columns_forward_index]
                        - df.columns[df_columns_forward_index - 1])
    forward_offset = df.columns[df_columns_forward_index] - new_columns[i + 1]
    forward_fraction = 1 - forward_offset / forward_bin_size
    # sum the fractions of the high-res bins partially in the low-res bin
    # along with the high-res bins fully within the low-res bin
    new_df[column] = (
        df.iloc[:, df_columns_backward_index] * backward_fraction
        + df.iloc[:, df_columns_backward_index + 1:
                     df_columns_forward_index - 1].sum(axis=1)
        + df.iloc[:, df_columns_forward_index - 1] * forward_fraction
    )
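As a sanity check (my own addition, not part of the original script), the first low-res bin for row "a" can be recomputed by hand from the edge fractions; the result should match `new_df.loc["a", 300]` produced by the loop above:

```python
import numpy as np

# high-res edges around the 300-325 nm bin, and row "a"'s first four counts
edges = np.array([296.54, 303.14, 311.88, 320.87, 330.12])
row = np.arange(4, dtype=float)                     # counts 0, 1, 2, 3
first = (edges[1] - 300.0) / (edges[1] - edges[0])  # partial leading bin
last = (325.0 - edges[3]) / (edges[4] - edges[3])   # partial trailing bin
expected = row[0] * first + row[1] + row[2] + row[3] * last
print(expected)  # should match new_df.loc["a", 300] from the script above
```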