5.19. DataFrame GroupBy

  • .size()

  • .mean()

  • .nunique()

  • .sum()

  • .count()

  • .max()

  • .first()

5.19.1. SetUp

>>> import pandas as pd
>>>
>>>
>>> DATA = 'https://python3.info/_static/phones-en.csv'
>>>
>>> df = pd.read_csv(DATA, parse_dates=['date'])
>>> df.drop(columns='index', inplace=True)
date

The date and time of the entry

duration

The duration (in seconds) for each call, the amount of data (in MB) for each data entry, and the number of texts sent (usually 1) for each sms entry

item

A description of the event occurring – can be one of call, sms, or data

month

The billing month that each entry belongs to – of form YYYY-MM

network

The mobile network that was called/texted for each entry

network_type

Whether the number being called was a mobile, international ('world'), voicemail, landline, or other ('special') number.

Source [1]

5.19.2. Grouping

Group by one item:

>>> df.groupby('item')  
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x...>

Group by multiple items:

>>> df.groupby(['month', 'item'])  
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x...>

5.19.3. Groupby Methods

  • Group series using mapper (dict or key function, apply given function to group, return result as series) or by a series of columns

  • .size()

  • .mean()

  • .nunique()

  • .sum()

  • .count()

  • .max()

  • .first()

5.19.4. Size

>>> df.groupby('item').size()
item
call    388
data    150
sms     292
dtype: int64
>>> df.groupby(['month', 'item']).size()
month    item
2014-11  call    107
         data     29
         sms      94
2014-12  call     79
         data     30
         sms      48
2015-01  call     88
         data     31
         sms      86
2015-02  call     67
         data     31
         sms      39
2015-03  call     47
         data     29
         sms      25
dtype: int64

5.19.5. Mean

>>> df.groupby('item')['duration'].mean()
item
call    237.940722
data     34.429000
sms       1.000000
Name: duration, dtype: float64
>>> df.groupby(['month', 'item'])['duration'].mean()
month    item
2014-11  call    238.757009
         data     34.429000
         sms       1.000000
2014-12  call    171.658228
         data     34.429000
         sms       1.000000
2015-01  call    193.977273
         data     34.429000
         sms       1.000000
2015-02  call    215.164179
         data     34.429000
         sms       1.000000
2015-03  call    462.276596
         data     34.429000
         sms       1.000000
Name: duration, dtype: float64

5.19.6. Number of Uniques

>>> df.groupby('item').nunique()  
      date  duration  month  network  network_type
item
call   378       220      5        6             3
data   150         1      5        1             1
sms    222         1      5        6             3
>>> df.groupby(['month', 'item']).nunique()  
              date  duration  network  network_type
month   item
2014-11 call   104        76        6             3
        data    29         1        1             1
        sms     79         1        5             2
2014-12 call    76        61        6             3
        data    30         1        1             1
        sms     41         1        5             2
2015-01 call    84        70        6             3
        data    31         1        1             1
        sms     58         1        4             1
2015-02 call    67        63        6             3
        data    31         1        1             1
        sms     27         1        5             2
2015-03 call    47        46        6             3
        data    29         1        1             1
        sms     17         1        4             2

5.19.7. Sum

>>> df.groupby('item')['duration'].sum()  
item
call    92321.00
data     5164.35
sms       292.00
Name: duration, dtype: float64
>>> df.groupby(['month', 'item'])['duration'].sum()  
month    item
2014-11  call    25547.000
         data      998.441
         sms        94.000
2014-12  call    13561.000
         data     1032.870
         sms        48.000
2015-01  call    17070.000
         data     1067.299
         sms        86.000
2015-02  call    14416.000
         data     1067.299
         sms        39.000
2015-03  call    21727.000
         data      998.441
         sms        25.000
Name: duration, dtype: float64

5.19.8. Count

>>> df.groupby('item').count()  
      date  duration  month  network  network_type
item
call   388       388    388      388           388
data   150       150    150      150           150
sms    292       292    292      292           292
>>> df.groupby(['month', 'item']).count()  
              date  duration  network  network_type
month   item
2014-11 call   107       107      107           107
        data    29        29       29            29
        sms     94        94       94            94
2014-12 call    79        79       79            79
        data    30        30       30            30
        sms     48        48       48            48
2015-01 call    88        88       88            88
        data    31        31       31            31
        sms     86        86       86            86
2015-02 call    67        67       67            67
        data    31        31       31            31
        sms     39        39       39            39
2015-03 call    47        47       47            47
        data    29        29       29            29
        sms     25        25       25            25

5.19.9. Minimum

>>> df.groupby('item').min()  
                    date  duration    month network network_type
item
call 2014-01-11 15:13:00     1.000  2014-11  Meteor     landline
data 2014-01-11 06:58:00    34.429  2014-11    data         data
sms  2014-01-12 12:51:00     1.000  2014-11  Meteor       mobile
>>> df.groupby(['month', 'item']).min()  
                            date  duration network network_type
month   item
2014-11 call 2014-01-11 15:13:00     1.000  Meteor     landline
        data 2014-01-11 06:58:00    34.429    data         data
        sms  2014-03-11 08:40:00     1.000  Meteor       mobile
2014-12 call 2014-02-12 11:40:00     2.000  Meteor     landline
        data 2014-01-12 06:58:00    34.429    data         data
        sms  2014-01-12 12:51:00     1.000  Meteor       mobile
2015-01 call 2014-12-15 20:03:00     2.000  Meteor     landline
        data 2014-12-13 06:58:00    34.429    data         data
        sms  2014-12-15 19:56:00     1.000  Meteor       mobile
2015-02 call 2015-01-02 13:33:00     1.000  Meteor     landline
        data 2015-01-02 06:58:00    34.429    data         data
        sms  2015-01-15 12:23:00     1.000  Meteor       mobile
2015-03 call 2015-01-03 12:19:00     2.000  Meteor     landline
        data 2015-01-03 06:58:00    34.429    data         data
        sms  2015-02-03 09:19:00     1.000   Tesco       mobile

5.19.10. Maximum

>>> df.groupby('item').max()  
                    date   duration    month    network network_type
item
call 2015-12-02 20:51:00  10528.000  2015-03  voicemail    voicemail
data 2015-12-03 06:58:00     34.429  2015-03       data         data
sms  2015-12-01 18:26:00      1.000  2015-03      world        world
>>> df.groupby(['month', 'item']).max()  
                            date   duration    network network_type
month   item
2014-11 call 2014-12-11 19:01:00   1940.000  voicemail    voicemail
        data 2014-12-11 06:58:00     34.429       data         data
        sms  2014-12-11 19:20:00      1.000    special      special
2014-12 call 2014-12-14 19:54:00   2120.000  voicemail    voicemail
        data 2014-12-12 06:58:00     34.429       data         data
        sms  2014-11-30 14:44:00      1.000      world        world
2015-01 call 2015-12-01 18:23:00   1859.000  voicemail    voicemail
        data 2015-12-01 06:58:00     34.429       data         data
        sms  2015-12-01 18:26:00      1.000   Vodafone       mobile
2015-02 call 2015-09-02 17:54:00   1863.000  voicemail    voicemail
        data 2015-12-02 06:58:00     34.429       data         data
        sms  2015-10-02 21:40:00      1.000    special      special
2015-03 call 2015-12-02 20:51:00  10528.000  voicemail    voicemail
        data 2015-12-03 06:58:00     34.429       data         data
        sms  2015-04-03 10:30:00      1.000      world        world

5.19.11. First

>>> df.groupby('item').first()  
                    date  duration    month   network network_type
item
call 2014-10-15 06:58:00    13.000  2014-11  Vodafone       mobile
data 2014-10-15 06:58:00    34.429  2014-11      data         data
sms  2014-10-16 22:18:00     1.000  2014-11    Meteor       mobile
>>> df.groupby(['month', 'item']).first()  
                            date  duration    network network_type
month   item
2014-11 call 2014-10-15 06:58:00    13.000   Vodafone       mobile
        data 2014-10-15 06:58:00    34.429       data         data
        sms  2014-10-16 22:18:00     1.000     Meteor       mobile
2014-12 call 2014-11-14 17:24:00   124.000  voicemail    voicemail
        data 2014-11-13 06:58:00    34.429       data         data
        sms  2014-11-14 17:28:00     1.000   Vodafone       mobile
2015-01 call 2014-12-15 20:03:00     4.000      Three       mobile
        data 2014-12-13 06:58:00    34.429       data         data
        sms  2014-12-15 19:56:00     1.000      Three       mobile
2015-02 call 2015-01-15 10:36:00    28.000      Three       mobile
        data 2015-01-13 06:58:00    34.429       data         data
        sms  2015-01-15 12:23:00     1.000    special      special
2015-03 call 2015-12-02 20:15:00    69.000   landline     landline
        data 2015-02-13 06:58:00    34.429       data         data
        sms  2015-02-19 18:46:00     1.000   Vodafone       mobile

5.19.12. Last

>>> df.groupby('item').last()  
                    date   duration    month   network network_type
item
call 2015-04-03 12:29:00  10528.000  2015-03  landline     landline
data 2015-03-13 06:58:00     34.429  2015-03      data         data
sms  2015-03-14 00:16:00      1.000  2015-03     world        world
>>> df.groupby(['month', 'item']).last()  
                            date   duration   network network_type
month   item
2014-11 call 2014-12-11 19:01:00      7.000  Vodafone       mobile
        data 2014-12-11 06:58:00     34.429      data         data
        sms  2014-11-13 22:31:00      1.000  Vodafone       mobile
2014-12 call 2014-12-14 19:54:00     25.000     Three       mobile
        data 2014-12-12 06:58:00     34.429      data         data
        sms  2014-07-12 23:22:00      1.000     world        world
2015-01 call 2015-01-14 20:47:00     36.000     Three       mobile
        data 2015-12-01 06:58:00     34.429      data         data
        sms  2015-01-14 23:36:00      1.000     Three       mobile
2015-02 call 2015-09-02 17:54:00     89.000     Three       mobile
        data 2015-12-02 06:58:00     34.429      data         data
        sms  2015-10-02 21:40:00      1.000  Vodafone       mobile
2015-03 call 2015-04-03 12:29:00  10528.000  landline     landline
        data 2015-03-13 06:58:00     34.429      data         data
        sms  2015-03-14 00:16:00      1.000     world        world

5.19.13. Output format

  • Series or DataFrame?

Produces Pandas Series:

>>> df.groupby('month')['duration'].sum()  
month
2014-11    26639.441
2014-12    14641.870
2015-01    18223.299
2015-02    15522.299
2015-03    22750.441
Name: duration, dtype: float64

Produces Pandas DataFrame:

>>> df.groupby('month')[['duration']].sum()  
          duration
month
2014-11  26639.441
2014-12  14641.870
2015-01  18223.299
2015-02  15522.299
2015-03  22750.441

5.19.14. Use Case - 0x01

>>> list(df.groupby(['month']).groups.keys())
['2014-11', '2014-12', '2015-01', '2015-02', '2015-03']
>>> len(df.groupby(['month']).groups['2014-11'])
230

5.19.15. Use Case - 0x02

Get the first entry for each month:

>>> df.groupby('month').first()  
                       date  duration  item   network network_type
month
2014-11 2014-10-15 06:58:00    34.429  data      data         data
2014-12 2014-11-13 06:58:00    34.429  data      data         data
2015-01 2014-12-13 06:58:00    34.429  data      data         data
2015-02 2015-01-13 06:58:00    34.429  data      data         data
2015-03 2015-12-02 20:15:00    69.000  call  landline     landline

5.19.16. Use Case - 0x03

Get the sum of the durations per month:

>>> df.groupby('month')['duration'].sum()  
month
2014-11    26639.441
2014-12    14641.870
2015-01    18223.299
2015-02    15522.299
2015-03    22750.441
Name: duration, dtype: float64

5.19.17. Use Case - 0x04

Get the number of dates / entries in each month:

>>> df.groupby('month')['date'].count()  
month
2014-11    230
2014-12    157
2015-01    205
2015-02    137
2015-03    101
Name: date, dtype: int64

5.19.18. Use Case - 0x05

What is the sum of durations, for calls only, to each network:

>>> df.loc[df['item'] == 'call'].groupby('network')['duration'].sum()  
network
Meteor        7200.0
Tesco        13828.0
Three        36464.0
Vodafone     14621.0
landline     18433.0
voicemail     1775.0
Name: duration, dtype: float64

5.19.19. Use Case - 0x06

How many calls, sms, and data entries are in each month?:

>>> df.groupby(['month', 'item'])['date'].count()  
month    item
2014-11  call    107
         data     29
         sms      94
2014-12  call     79
         data     30
         sms      48
2015-01  call     88
         data     31
         sms      86
2015-02  call     67
         data     31
         sms      39
2015-03  call     47
         data     29
         sms      25
Name: date, dtype: int64

5.19.20. Use Case - 0x07

How many calls, texts, and data are sent per month, split by network_type?:

>>> df.groupby(['month', 'network_type'])['date'].count()  
month    network_type
2014-11  data             29
         landline          5
         mobile          189
         special           1
         voicemail         6
2014-12  data             30
         landline          7
         mobile          108
         voicemail         8
         world             4
2015-01  data             31
         landline         11
         mobile          160
         voicemail         3
2015-02  data             31
         landline          8
         mobile           90
         special           2
         voicemail         6
2015-03  data             29
         landline         11
         mobile           54
         voicemail         4
         world             3
Name: date, dtype: int64

5.19.21. Datasets

5.19.22. References

5.19.23. Assignments

Code 5.104. Solution
"""
* Assignment: DataFrame Groupby Phones
* Complexity: easy
* Lines of code: 5 lines
* Time: 8 min

English:
    1. Read data from `DATA` as `df: pd.DataFrame`
    2. Give information about total number of all phone calls for each calendar month
    3. Run doctests - all must succeed

Polish:
    1. Wczytaj dane z `DATA` jako `df: pd.DataFrame`
    2. Podaj informacje o łącznej liczbie wszystkich połączeń telefonicznych dla każdego miesiąca kalendarzowego
    3. Uruchom doctesty - wszystkie muszą się powieść

Tests:
    >>> import sys; sys.tracebacklimit = 0

    >>> pd.set_option('display.width', 500)
    >>> pd.set_option('display.max_columns', 10)
    >>> pd.set_option('display.max_rows', 10)

    >>> assert result is not Ellipsis, \
    'Assign result to variable: `result`'
    >>> assert type(result) is pd.Series, \
    'Variable `result` must be a `pd.Series` type'

    >>> result  # doctest: +NORMALIZE_WHITESPACE
    year  month
    1999  10       16309.0
          11       16780.0
          12       14861.0
    2000  1        18705.0
          2        11019.0
          3        14647.0
    Name: duration, dtype: float64
"""

import pandas as pd


DATA = 'https://python3.info/_static/phones-pl.csv'


# type: pd.DataFrame
result = ...


Code 5.105. Solution
"""
* Assignment: DataFrame Groupby Astro Female
* Complexity: medium
* Lines of code: 5 lines
* Time: 8 min

English:
    1. Read data from `DATA` as `df: pd.DataFrame`
    2. Which nationality has the most flight time of a female in space?
    3. Run doctests - all must succeed

Polish:
    1. Wczytaj dane z `DATA` jako `df: pd.DataFrame`
    2. Który kraj ma największy nalot kobiet w kosmosie?
    3. Uruchom doctesty - wszystkie muszą się powieść

Tests:
    >>> import sys; sys.tracebacklimit = 0

    >>> pd.set_option('display.width', 500)
    >>> pd.set_option('display.max_columns', 10)
    >>> pd.set_option('display.max_rows', 10)

    # >>> assert result is not Ellipsis, \
    # 'Assign result to variable: `result`'
    # >>> assert type(result) is pd.DataFrame, \
    # 'Variable `result` must be a `pd.DataFrame` type'

    >>> result  # doctest: +NORMALIZE_WHITESPACE
    Ellipsis

TODO: Doctests
"""

import pandas as pd


DATA = 'https://python3.info/_static/astro-gender.csv'

# type: pd.DataFrame
result = ...


Code 5.106. Solution
"""
* Assignment: DataFrame Groupby Astro Flights
* Complexity: medium
* Lines of code: 5 lines
* Time: 13 min

English:
    1. Read data from `DATA`
    2. Create ranking of the most experienced astronauts (number of flights)
    3. Drop duplicates
    4. Reset index
    5. Sort values:
        a. `total_number_of_missions` descending
        b. `name` ascending
    6. Define `result: pd.Dataframe` with top 10
    7. Run doctests - all must succeed

Polish:
    1. Wczytaj dane z `DATA`
    2. Stwórz ranking najbardziej doświadczonych astronautów (liczba lotów)
    3. Usuń duplikaty
    4. Zresetuj indeks
    5. Posortuj wartości:
        a. `total_number_of_missions` malejąco
        b. `name` rosnąco
    6. Zdefiniuj `result: pd.Dataframe` z top 10
    7. Uruchom doctesty - wszystkie muszą się powieść

Tests:
    >>> import sys; sys.tracebacklimit = 0

    >>> pd.set_option('display.width', 500)
    >>> pd.set_option('display.max_columns', 10)
    >>> pd.set_option('display.max_rows', 10)

    >>> assert result is not Ellipsis, \
    'Assign result to variable: `result`'
    >>> assert type(result) is pd.DataFrame, \
    'Variable `result` must be a `pd.DataFrame` type'

    >>> result  # doctest: +NORMALIZE_WHITESPACE
                           name  total_number_of_missions
    0   Chang-Diaz, Franklin R.                         7
    1            Ross, Jerry L.                         7
    2     Brown, Curtis L., Jr.                         6
    3         Foale, C. Michael                         6
    4          Krikalev, Sergei                         6
    5         Malenchenko, Yuri                         6
    6  Musgrave, Franklin Story                         6
    7       Wetherbee, James D.                         6
    8            Young, John W.                         6
    9            Blaha, John E.                         5
"""

import pandas as pd

DATA = 'https://python3.info/_static/astro-selection.csv'

# type: pd.DataFrame
result = ...


Code 5.107. Solution
"""
* Assignment: DataFrame Groupby Astro EVA
* Complexity: medium
* Lines of code: 13 lines
* Time: 21 min

English:
    1. Read data from `DATA` as `df: pd.DataFrame`
    2. Create ranking of astronauts with the most time spent on EVA (ExtraVehicular Activity)
    3. Run doctests - all must succeed

Polish:
    1. Wczytaj dane z `DATA` jako `df: pd.DataFrame`
    2. Stwórz ranking astronautów z największym czasem EVA (Spacerów kosmicznych)
    3. Uruchom doctesty - wszystkie muszą się powieść

Hints:
    * Note, that file delimiter is semicolon ";" (not comma)
    * Parse CSV and replace newlines inside fields with `","`
    * Split names into separate columns for each spacewalker (first, second, third)
    * Split names into separate rows for each spacewalker (use ffill)
    * Split times into separate columns (hours, minutes)

Tests:
    >>> import sys; sys.tracebacklimit = 0

    >>> pd.set_option('display.width', 500)
    >>> pd.set_option('display.max_columns', 10)
    >>> pd.set_option('display.max_rows', 10)

    >>> assert result is not Ellipsis, \
    'Assign result to variable: `result`'
    >>> assert type(result) is pd.DataFrame, \
    'Variable `result` must be a `pd.DataFrame` type'

    >>> result  # doctest: +NORMALIZE_WHITESPACE
                                 Duration
    Astronaut
    Anatoliy Solovyov     3 days 06:48:00
    Michael Lopez-Alegria 2 days 19:40:00
    Peggy Whitson         2 days 12:21:00
    Fyodor Yurchikhin     2 days 11:29:00
    Jerry Ross            2 days 10:38:00
    ...                               ...
    Aleksei Yeliseyev     0 days 00:53:00
    Edward White          0 days 00:46:00
    Alfred Worden         0 days 00:39:00
    Alexi Leonov          0 days 00:23:00
    Zhai Zhi-gang         0 days 00:14:00
    <BLANKLINE>
    [226 rows x 1 columns]
"""

import pandas as pd


DATA = 'https://python3.info/_static/astro-eva.csv'

# type: pd.DataFrame
result = ...