Description
Add more explicit docs / work-around for dealing with groupby and NA groups
(see comments)
Changelog: 07.Nov.2013: Add line to example below to preprocess table content.
I expect the following behavior: DataFrame.groupby splits the dataframe/table into subtables according to the grouping condition. A column name as the grouping condition gives me one subtable for each distinct value in that column. Similarly, grouping by multiple columns (a list of column names) gives me one group for each occurring combination of values in these columns (put differently, the unique "values" of a multi-column grouping are tuples).
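To make the expectation concrete, here is a minimal sketch on a small hand-made frame without missing values. The column names are borrowed from the test example; the `value` column and all the data are made up for illustration. It assumes a current pandas running on Python 3, not the 0.12 development build this report was filed against.

```python
import pandas as pd

# Made-up data; only the column names come from the report.
df = pd.DataFrame({
    "algorithm": ["A", "A", "B", "B", "A"],
    "customalpha": ["exp", "r100", "exp", "exp", "exp"],
    "value": [1, 2, 3, 4, 5],
})
grp_cols = ["algorithm", "customalpha"]

grouped = df.groupby(grp_cols)

# With several grouping columns, each group key is a tuple of values.
keys = [key for key, _ in grouped]

# One group per occurring combination, so this matches drop_duplicates.
n_unique = len(df[grp_cols].drop_duplicates())
n_groups = len(grouped.size())

print(keys)
print(n_unique, n_groups)
```

As long as no grouping column contains None/NaN, the number of groups and the number of distinct rows from drop_duplicates agree.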
If my expectations are wrong, then the documentation lacks clarification: I could not read any other meaning or expected behavior from it (e.g. pandas.DataFrame.groupby.__doc__).
Otherwise I have found a bug and need a fix: some existing combinations do not get a group or split-off subtable (I checked this with drop_duplicates). Finally, grouped.__iter__ skips more/other combinations than grouped.groups.keys() does. I would expect both to follow the same implementation...
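The mismatch can be reproduced without the pickled frame. The sketch below uses made-up data with shortened algorithm names and assumes a current pandas on Python 3. It counts groups with `size()` instead of `groups.keys()`, because how `groups` handles missing keys may differ between pandas versions.

```python
import pandas as pd

# Made-up data: two rows have None in one of the grouping columns.
df = pd.DataFrame({
    "algorithm": ["M", "M", "MC", "CMTFLP", "MC"],
    "customalpha": ["exp", "r100", None, None, "linear"],
})
grp_cols = ["algorithm", "customalpha"]

# drop_duplicates keeps rows whose key contains None ...
n_unique = len(df[grp_cols].drop_duplicates())

# ... while groupby, by default, drops every row whose key has a missing value.
grouped = df.groupby(grp_cols)
n_groups = len(grouped.size())
iterated_keys = [key for key, _ in grouped]

print(n_unique, n_groups)
print(iterated_keys)
```

Here drop_duplicates sees more combinations than groupby yields, which is the kind of discrepancy described above.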
I tracked it down in the pandas internals to pandas.core.Grouper._get_group_keys, or rather _KeyMapper.get_key: self.levels looks good, but the list comprehension with the get method and zip goes wrong, or possibly pandas.core.Grouper.group_info provides an ngroups value that is too small, or something else.
pandas.__version__: 0.12.0-1062-g3c57949 (from 6.11.2013)
numpy.__version__: 1.7.2
MacOSX 10.9
Test Example:
import pickle
import sys
import os
import pandas as pd
grp_cols = ['algorithm', 'customalpha']
df = "ccopy_reg\n_reconstructor\np0\n(cpandas.core.frame\nDataFrame\np1\nc__builtin__\nobject\np2\nNtp3\nRp4\ng0\n(cpandas.core.internals\nBlockManager\np5\ng2\nNtp6\nRp7\n((lp8\ncnumpy.core.multiarray\n_reconstruct\np9\n(cpandas.core.index\nIndex\np10\n(I0\ntp11\nS'b'\np12\ntp13\nRp14\n((I1\n(I2\ntp15\ncnumpy\ndtype\np16\n(S'O8'\np17\nI0\nI1\ntp18\nRp19\n(I3\nS'|'\np20\nNNNI-1\nI-1\nI63\ntp21\nbI00\n(lp22\nS'algorithm'\np23\naS'customalpha'\np24\natp25\n(Ntp26\ntp27\nbag9\n(cpandas.core.index\nInt64Index\np28\n(I0\ntp29\ng12\ntp30\nRp31\n((I1\n(I13\ntp32\ng16\n(S'i8'\np33\nI0\nI1\ntp34\nRp35\n(I3\nS'<'\np36\nNNNI-1\nI-1\nI0\ntp37\nbI00\nS'\\x01\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x04\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x05\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x08\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x0c\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\r\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x11\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x15\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x17\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x18\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x1a\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x1e\\x00\\x00\\x00\\x00\\x00\\x00\\x00 
\\x00\\x00\\x00\\x00\\x00\\x00\\x00'\np38\ntp39\n(Ntp40\ntp41\nba(lp42\ng9\n(cnumpy\nndarray\np43\n(I0\ntp44\ng12\ntp45\nRp46\n(I1\n(I2\nI13\ntp47\ng16\n(S'O8'\np48\nI0\nI1\ntp49\nRp50\n(I3\nS'|'\np51\nNNNI-1\nI-1\nI63\ntp52\nbI00\n(lp53\nS'ScenarioAlgoLocalHeuristicM'\np54\naS'ScenarioAlgoLocalHeuristicM'\np55\naS'ScenarioAlgoLocalHeuristicM'\np56\naS'ScenarioAlgoLocalHeuristicM'\np57\naS'ScenarioAlgoLocalHeuristicMC'\np58\naS'ScenarioAlgoCMTFLP'\np59\naS'ScenarioAlgoLocalHeuristicMC'\np60\naS'ScenarioAlgoLocalHeuristicMC'\np61\naS'ScenarioAlgoLocalHeuristicMC'\np62\naS'ScenarioAlgoLocalHeuristicMC'\np63\naS'ScenarioAlgoLocalHeuristicM'\np64\naS'ScenarioAlgoLocalHeuristicM'\np65\naS'ScenarioAlgoLocalHeuristicMC'\np66\naS'exp'\np67\naS'r100'\np68\naNaS'r333'\np69\naNaNaS'r333'\np70\naS'r100'\np71\naS'linear'\np72\naS'exp'\np73\naS'r10'\np74\naS'linear'\np75\naS'r10'\np76\natp77\nba(lp78\ng9\n(g10\n(I0\ntp79\ng12\ntp80\nRp81\n((I1\n(I2\ntp82\ng19\nI00\n(lp83\ng23\nag24\natp84\n(Ntp85\ntp86\nbatp87\nbb."
df = pickle.loads(df)
# Unexpected behavior was caused by None values (which are treated as NaN values), thanks jreback
df.fillna("default", inplace=True) # replaces None/NaN values
print "raw data: (", len(df), ")\n", df
print
print
df_grps1 = df[grp_cols].drop_duplicates()
df_grps2 = df.groupby(grp_cols)
df_grps3 = [grp for grp, _ in df.groupby(grp_cols)]
print "df_grps1 (#", len(df_grps1), "): \n", df_grps1
print
print "df_grps2 (#", len(df_grps2), "): "
for tpl in df_grps2.groups.keys():
    print tpl
print
print "df_grps3 (#", len(df_grps3), "): "
for tpl in df_grps3:
    print tpl
assert len(df_grps1) == len(df_grps2), "baad bug !!!"
assert len(df_grps2) == len(df_grps3), "baad bug !!!"
assert len(df_grps1) == len(df_grps3), "baad bug!!!"
print "passed without error"
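For readers on a current pandas, here is a sketch of two ways to keep the NA combinations as groups, again on made-up data. The first is the fillna work-around from the script above, using the same "default" sentinel. The second assumes pandas 1.1 or later, where groupby accepts a `dropna` argument; that option did not exist in the 0.12 build this report was filed against.

```python
import pandas as pd

# Made-up data with None in one grouping column.
df = pd.DataFrame({
    "algorithm": ["M", "M", "MC", "CMTFLP", "MC"],
    "customalpha": ["exp", "r100", None, None, "linear"],
})
grp_cols = ["algorithm", "customalpha"]
n_unique = len(df[grp_cols].drop_duplicates())

# Work-around 1: replace missing values with a sentinel before grouping.
filled = df.fillna("default")
n_groups_filled = len(filled.groupby(grp_cols).size())

# Work-around 2 (assumes pandas >= 1.1): keep NA keys as their own groups.
n_groups_keepna = len(df.groupby(grp_cols, dropna=False).size())

print(n_unique, n_groups_filled, n_groups_keepna)
```

The sentinel approach changes the data, so pick a value that cannot clash with a real entry in the column.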