DOC: update groupby NA group handling / workaround #5456

Add more explicit docs / work-around for dealing with groupby and NA groups

(see comments)

Changelog: 07.Nov.2013: Add line to example below to preprocess table content.

I expect the following behavior: `DataFrame.groupby` splits the dataframe (table) into subtables according to the grouping condition. A column name as the grouping condition gives me one subtable for each distinct value in that column. Similarly, grouping by multiple columns (a list of column names) gives me one group for each occurring combination of values in those columns; put differently, the unique "values" of a multi-column grouping are tuples.
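To make this expectation concrete, here is a minimal sketch on made-up toy data (not the table from this report; only the column names are borrowed from it). It compares the group keys with the combinations found independently through `drop_duplicates`:

```python
import pandas as pd

# Made-up toy data without missing values; only the column names
# are taken from the report.
toy = pd.DataFrame({
    "algorithm":   ["A", "A", "B", "B", "B"],
    "customalpha": ["exp", "r100", "exp", "exp", "linear"],
    "score":       [1, 2, 3, 4, 5],
})
cols = ["algorithm", "customalpha"]

# Iterating a multi-column groupby yields (key, subtable) pairs,
# where each key is a tuple with one entry per grouping column.
group_keys = sorted(key for key, _ in toy.groupby(cols))

# The same combinations, obtained independently via drop_duplicates.
unique_rows = toy[cols].drop_duplicates()
combos = sorted(unique_rows.itertuples(index=False, name=None))

print(group_keys)
print(group_keys == combos)
```

As long as no grouping column contains missing values, both lists should agree.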

If my expectations are wrong, then the documentation lacks clarification: I could not read a different meaning or expected behavior from it (e.g. `pandas.DataFrame.groupby.__doc__`).

Otherwise I have found a bug and need a fix: some existing combinations do not get a group or split subtable (I checked this with `drop_duplicates`). And, finally, `grouped.__iter__` skips more/other combinations than `grouped.groups.keys()`. Here I would also expect both to follow the same implementation.
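The `drop_duplicates` check can be sketched on a small made-up table (hypothetical values, not the data from this report). It assumes that the missing combinations are the ones where a grouping column holds `None`:

```python
import pandas as pd

# Made-up toy data; two rows have None in a grouping column.
toy = pd.DataFrame({
    "algorithm":   ["A", "A", "B", "B", "B"],
    "customalpha": ["exp", None, "exp", None, "linear"],
})
cols = ["algorithm", "customalpha"]

# drop_duplicates keeps the rows that contain None ...
n_combos = len(toy[cols].drop_duplicates())

# ... while groupby, by default, leaves out rows whose key contains None/NaN.
grouped = toy.groupby(cols)
n_groups = grouped.ngroups
n_rows_grouped = int(grouped.size().sum())

print(n_combos, n_groups, n_rows_grouped)
```

In this sketch the two counts differ by exactly the number of combinations that contain `None`.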

I tracked it down into pandas to `pandas.core.Grouper._get_group_keys`, or rather `_KeyMapper.get_key`: `self.levels` looks good, but the list-comprehension/get-method/zip step goes wrong, or possibly `pandas.core.Grouper.group_info` provides too small an `ngroups` value, or something else.

`pandas.__version__` : 0.12.0-1062-g3c57949  (from 6.11.2013)
`numpy.__version__` : 1.7.2
MacOSX 10.9 

Test Example:

``` python

import pickle
import sys
import os

import pandas as pd

grp_cols = ['algorithm', 'customalpha']
df = "ccopy_reg\n_reconstructor\np0\n(cpandas.core.frame\nDataFrame\np1\nc__builtin__\nobject\np2\nNtp3\nRp4\ng0\n(cpandas.core.internals\nBlockManager\np5\ng2\nNtp6\nRp7\n((lp8\ncnumpy.core.multiarray\n_reconstruct\np9\n(cpandas.core.index\nIndex\np10\n(I0\ntp11\nS'b'\np12\ntp13\nRp14\n((I1\n(I2\ntp15\ncnumpy\ndtype\np16\n(S'O8'\np17\nI0\nI1\ntp18\nRp19\n(I3\nS'|'\np20\nNNNI-1\nI-1\nI63\ntp21\nbI00\n(lp22\nS'algorithm'\np23\naS'customalpha'\np24\natp25\n(Ntp26\ntp27\nbag9\n(cpandas.core.index\nInt64Index\np28\n(I0\ntp29\ng12\ntp30\nRp31\n((I1\n(I13\ntp32\ng16\n(S'i8'\np33\nI0\nI1\ntp34\nRp35\n(I3\nS'<'\np36\nNNNI-1\nI-1\nI0\ntp37\nbI00\nS'\\x01\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x04\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x05\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x08\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x0c\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\r\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x11\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x15\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x17\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x18\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x1a\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x1e\\x00\\x00\\x00\\x00\\x00\\x00\\x00 
\\x00\\x00\\x00\\x00\\x00\\x00\\x00'\np38\ntp39\n(Ntp40\ntp41\nba(lp42\ng9\n(cnumpy\nndarray\np43\n(I0\ntp44\ng12\ntp45\nRp46\n(I1\n(I2\nI13\ntp47\ng16\n(S'O8'\np48\nI0\nI1\ntp49\nRp50\n(I3\nS'|'\np51\nNNNI-1\nI-1\nI63\ntp52\nbI00\n(lp53\nS'ScenarioAlgoLocalHeuristicM'\np54\naS'ScenarioAlgoLocalHeuristicM'\np55\naS'ScenarioAlgoLocalHeuristicM'\np56\naS'ScenarioAlgoLocalHeuristicM'\np57\naS'ScenarioAlgoLocalHeuristicMC'\np58\naS'ScenarioAlgoCMTFLP'\np59\naS'ScenarioAlgoLocalHeuristicMC'\np60\naS'ScenarioAlgoLocalHeuristicMC'\np61\naS'ScenarioAlgoLocalHeuristicMC'\np62\naS'ScenarioAlgoLocalHeuristicMC'\np63\naS'ScenarioAlgoLocalHeuristicM'\np64\naS'ScenarioAlgoLocalHeuristicM'\np65\naS'ScenarioAlgoLocalHeuristicMC'\np66\naS'exp'\np67\naS'r100'\np68\naNaS'r333'\np69\naNaNaS'r333'\np70\naS'r100'\np71\naS'linear'\np72\naS'exp'\np73\naS'r10'\np74\naS'linear'\np75\naS'r10'\np76\natp77\nba(lp78\ng9\n(g10\n(I0\ntp79\ng12\ntp80\nRp81\n((I1\n(I2\ntp82\ng19\nI00\n(lp83\ng23\nag24\natp84\n(Ntp85\ntp86\nbatp87\nbb."
df = pickle.loads(df)

# Unexpected behavior was caused by None values (which are treated as NaN values), thanks jreback
df.fillna("default", inplace=True) # replaces None/NaN values

print "raw data: (", len(df), ")\n", df
print
print

df_grps1 = df[grp_cols].drop_duplicates()
df_grps2 = df.groupby(grp_cols)
df_grps3 = [grp for grp, _ in df.groupby(grp_cols)]

print "df_grps1 (#", len(df_grps1), "): \n", df_grps1
print
print "df_grps2 (#", len(df_grps2), "): "
for tpl in df_grps2.groups.keys():
    print tpl
print
print "df_grps3 (#", len(df_grps3), "): "
for tpl in df_grps3:
    print tpl

assert len(df_grps1) == len(df_grps2), "baad bug !!!"
assert len(df_grps2) == len(df_grps3), "baad bug !!!"
assert len(df_grps1) == len(df_grps3), "baad bug!!!"

print "passed without error"

```
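The `fillna` workaround used above can also be shown in isolation. This is a minimal sketch on made-up data, where the placeholder string `"default"` is the same arbitrary choice as in the example and the column names are borrowed from it:

```python
import pandas as pd

# Made-up toy data; two rows have None in a grouping column.
toy = pd.DataFrame({
    "algorithm":   ["A", "A", "B", "B", "B"],
    "customalpha": ["exp", None, "exp", None, "linear"],
})
cols = ["algorithm", "customalpha"]

# Replace missing values in the grouping column with a placeholder
# so that those rows get a group of their own.
filled = toy.fillna({"customalpha": "default"})

sizes = filled.groupby(cols).size()
group_keys = sizes.index.tolist()
n_combos = len(toy[cols].drop_duplicates())

print(group_keys)
print(len(group_keys) == n_combos)
```

The placeholder must not collide with a real value in the column. Later pandas releases (1.1 onward) also offer `groupby(..., dropna=False)`, which keeps NA keys as groups without a placeholder; that option is not available in the 0.12 development version used in this report.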

