5. Pandas DataFrame

  • 2-dimensional object

  • Each column is a Series and has a name

  • All columns share a common index

  • Operations can be executed on columns or rows

Warning

The following values are generated with np.random.seed(0)
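The examples below assume this setup (a sketch; seeding the generator makes the np.random.randn values reproducible):

```python
import numpy as np
import pandas as pd

# Fix the generator state so every run produces the same "random" values
np.random.seed(0)

values = np.random.randn(6, 4)   # 6 rows, 4 columns of normal-distributed floats
```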

5.1. Creating

5.1.1. Simple pd.DataFrame

values = np.arange(16).reshape(4, 4)
indexes = range(0, 4)
columns = range(0, 4)

df = pd.DataFrame(values, index=indexes, columns=columns)
#     0   1   2   3
# 0   0   1   2   3
# 1   4   5   6   7
# 2   8   9  10  11
# 3  12  13  14  15

5.1.2. With date indexes

values = np.random.randn(6, 4)
columns = ['Morning', 'Noon', 'Evening', 'Midnight']
indexes = pd.date_range('1970-01-01', periods=6)

df = pd.DataFrame(values, index=indexes, columns=columns)
#                 Morning        Noon    Evening   Midnight
# 1970-01-01     2.269755   -1.454366   0.045759  -0.187184
# 1970-01-02     1.532779    1.469359   0.154947   0.378163
# 1970-01-03    -0.887786   -1.980796  -0.347912   0.156349
# 1970-01-04     1.230291    1.202380  -0.387327  -0.302303
# 1970-01-05    -1.048553   -1.420018  -1.706270   1.950775
# 1970-01-06    -0.509652   -0.438074  -1.252795   0.777490

5.1.3. With custom values in columns

pd.DataFrame({'A' : 1.,
              'B' : pd.Timestamp('1961-04-12'),
              'C' : pd.Series(1, index=list(range(4)), dtype='float32'),
              'D' : np.array([3] * 4, dtype='int32'),
              'E' : pd.Categorical(["test", "train", "test", "train"]),
              'F' : 'foo' })
#      A           B    C  D      E    F
# 0  1.0  1961-04-12  1.0  3   test  foo
# 1  1.0  1961-04-12  1.0  3  train  foo
# 2  1.0  1961-04-12  1.0  3   test  foo
# 3  1.0  1961-04-12  1.0  3  train  foo

5.1.4. With multiple rows

pd.DataFrame([{'A': 1, 'B': 2}, {'C': 3}])
#      A    B    C
# 0  1.0  2.0  NaN
# 1  NaN  NaN  3.0

5.2. Properties

values = np.random.randn(6, 4)
columns = ['Morning', 'Noon', 'Evening', 'Midnight']
indexes = pd.date_range('1970-01-01', periods=6)

df = pd.DataFrame(values, index=indexes, columns=columns)
#                 Morning        Noon    Evening   Midnight
# 1970-01-01     2.269755   -1.454366   0.045759  -0.187184
# 1970-01-02     1.532779    1.469359   0.154947   0.378163
# 1970-01-03    -0.887786   -1.980796  -0.347912   0.156349
# 1970-01-04     1.230291    1.202380  -0.387327  -0.302303
# 1970-01-05    -1.048553   -1.420018  -1.706270   1.950775
# 1970-01-06    -0.509652   -0.438074  -1.252795   0.777490

5.2.1. Indexes

df.index
# DatetimeIndex(['1970-01-01', '1970-01-02', '1970-01-03', '1970-01-04', '1970-01-05', '1970-01-06'],
#               dtype='datetime64[ns]', freq='D')

5.2.2. Columns

df.columns
# Index(['Morning', 'Noon', 'Evening', 'Midnight'], dtype='object')

Todo

regenerate all values below in this chapter with np.random.seed(0)

5.3. Slicing

5.3.1. Slicing by index

values = np.random.randn(6, 4)
columns = ['Morning', 'Noon', 'Evening', 'Midnight']
indexes = pd.date_range('1970-01-01', periods=6)

df = pd.DataFrame(values, index=indexes, columns=columns)
#               Morning       Noon    Evening   Midnight
# 1970-01-01   0.486726  -0.291364  -1.105248  -0.333574
# 1970-01-02   0.301838  -0.603001   0.069894   0.309209
# 1970-01-03  -0.424429   0.845898  -1.460294   0.109749
# 1970-01-04   0.909958  -0.986246   0.122176   1.205697
# 1970-01-05  -0.172540  -0.974159  -0.848519   1.691875
# 1970-01-06   0.047059   0.359687   0.531386  -0.587663
df[1:3]
# 1970-01-02   0.301838  -0.603001   0.069894   0.309209
# 1970-01-03  -0.424429   0.845898  -1.460294   0.109749

5.3.2. Slicing by columns

values = np.random.randn(6, 4)
columns = ['Morning', 'Noon', 'Evening', 'Midnight']
indexes = pd.date_range('1970-01-01', periods=6)

df = pd.DataFrame(values, index=indexes, columns=columns)
#               Morning       Noon    Evening   Midnight
# 1970-01-01   0.486726  -0.291364  -1.105248  -0.333574
# 1970-01-02   0.301838  -0.603001   0.069894   0.309209
# 1970-01-03  -0.424429   0.845898  -1.460294   0.109749
# 1970-01-04   0.909958  -0.986246   0.122176   1.205697
# 1970-01-05  -0.172540  -0.974159  -0.848519   1.691875
# 1970-01-06   0.047059   0.359687   0.531386  -0.587663
df.Morning
# 1970-01-01   -0.438232
# 1970-01-02   -1.798254
# 1970-01-03   -0.802938
# 1970-01-04    0.820863
# 1970-01-05    1.800466
# 1970-01-06    0.141029
# Freq: D, Name: Morning, dtype: float64
df['Morning']
# 1970-01-01   -0.438232
# 1970-01-02   -1.798254
# 1970-01-03   -0.802938
# 1970-01-04    0.820863
# 1970-01-05    1.800466
# 1970-01-06    0.141029
# Freq: D, Name: Morning, dtype: float64
df[['Morning', 'Evening']]
#               Morning    Evening
# 1970-01-01  -0.438232  -1.113116
# 1970-01-02  -1.798254  -0.946041
# 1970-01-03  -0.802938  -0.258279
# 1970-01-04   0.820863  -0.901532
# 1970-01-05   1.800466   0.611194
# 1970-01-06   0.141029  -0.046938
df.loc[:, 'Morning':'Evening']
#                     Morning            Noon         Evening
# 1970-01-01        -1.185919        0.929399        0.546952
# 1970-01-02         1.223428       -0.132430       -0.504896
# 1970-01-03         0.377136       -0.637106       -0.104753
# 1970-01-04         0.844626        0.908642        0.982422
# 1970-01-05         0.089944       -0.706245        0.052225
# 1970-01-06         1.382942        0.386913       -1.332453

5.4. Filtering

values = np.random.randn(6, 4)
columns = ['Morning', 'Noon', 'Evening', 'Midnight']
indexes = pd.date_range('1970-01-01', periods=6)

df = pd.DataFrame(values, index=indexes, columns=columns)
#               Morning       Noon    Evening   Midnight
# 1970-01-01   0.486726  -0.291364  -1.105248  -0.333574
# 1970-01-02   0.301838  -0.603001   0.069894   0.309209
# 1970-01-03  -0.424429   0.845898  -1.460294   0.109749
# 1970-01-04   0.909958  -0.986246   0.122176   1.205697
# 1970-01-05  -0.172540  -0.974159  -0.848519   1.691875
# 1970-01-06   0.047059   0.359687   0.531386  -0.587663
df.loc[df['Morning'] < 0]
#               Morning       Noon    Evening   Midnight
# 1970-01-03  -0.424429   0.845898  -1.460294   0.109749
# 1970-01-05  -0.172540  -0.974159  -0.848519   1.691875

5.5. Selecting Rows

  • loc - advanced label-based lookup

  • iloc - integer locate, lookup by index position only

Warning

  • df.loc - start and stop are included!!

  • df.iloc - behaves like Python slices
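The difference is easiest to see on a small frame (a self-contained sketch with a simple label index):

```python
import pandas as pd

df = pd.DataFrame({'x': [10, 20, 30, 40]}, index=['a', 'b', 'c', 'd'])

df.loc['a':'c']   # label slice: both 'a' and 'c' included -> 3 rows
df.iloc[0:2]      # positional slice: stop excluded       -> 2 rows
```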

values = np.random.randn(6, 4)
columns = ['Morning', 'Noon', 'Evening', 'Midnight']
indexes = pd.date_range('1970-01-01', periods=6)

df = pd.DataFrame(values, index=indexes, columns=columns)
#               Morning       Noon    Evening   Midnight
# 1970-01-01   0.486726  -0.291364  -1.105248  -0.333574
# 1970-01-02   0.301838  -0.603001   0.069894   0.309209
# 1970-01-03  -0.424429   0.845898  -1.460294   0.109749
# 1970-01-04   0.909958  -0.986246   0.122176   1.205697
# 1970-01-05  -0.172540  -0.974159  -0.848519   1.691875
# 1970-01-06   0.047059   0.359687   0.531386  -0.587663

5.5.1. Single row

  • Returns the row as a Series

df.loc['1970-01-01']
# Morning    -0.438232
# Noon        1.493865
# Evening    -1.113116
# Midnight   -0.042712
# Name: 1970-01-01 00:00:00, dtype: float64

5.5.2. Range of rows

df.loc['1970-01-02': '1970-01-04']
#               Morning       Noon    Evening   Midnight
# 1970-01-02   0.301838  -0.603001   0.069894   0.309209
# 1970-01-03  -0.424429   0.845898  -1.460294   0.109749
# 1970-01-04   0.909958  -0.986246   0.122176   1.205697

5.5.3. Range of dates

df.loc['1970-01']
#                 Morning        Noon    Evening   Midnight
# 1970-01-01     2.269755   -1.454366   0.045759  -0.187184
# 1970-01-02     1.532779    1.469359   0.154947   0.378163
# 1970-01-03    -0.887786   -1.980796  -0.347912   0.156349
# 1970-01-04     1.230291    1.202380  -0.387327  -0.302303
# 1970-01-05    -1.048553   -1.420018  -1.706270   1.950775
# 1970-01-06    -0.509652   -0.438074  -1.252795   0.777490

5.5.4. Single row and single column

df.loc['1970-01-02', 'Morning']
# -1.7982538699804334

5.5.5. Range of rows and single column

  • Note that both the start and stop of the slice are included

df.loc['1970-01-02':'1970-01-04', 'Noon']
# 1970-01-02   -1.440613
# 1970-01-03    0.301141
# 1970-01-04   -0.574301
# Freq: D, Name: Noon, dtype: float64

5.5.6. List of rows and single column

Todo

fix this

df.loc[['1970-01-02','1970-01-04'], 'Noon']
# KeyError: "None of [['1970-01-02', '1970-01-04']] are in the [index]"
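A working variant: with a DatetimeIndex, a list of plain strings is not matched against the index, so convert the labels to Timestamps first (a sketch using the same frame shape as above):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randn(6, 4),
                  index=pd.date_range('1970-01-01', periods=6),
                  columns=['Morning', 'Noon', 'Evening', 'Midnight'])

# pd.to_datetime converts the strings to Timestamps, which ARE in the index
result = df.loc[pd.to_datetime(['1970-01-02', '1970-01-04']), 'Noon']
```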

5.5.7. Single row and selected columns

df.loc['1970-01-02', ['Noon', 'Midnight']]
# Noon       -0.132430
# Midnight   -0.444758
# Name: 1970-01-02 00:00:00, dtype: float64

5.5.8. Single row and column range

df.loc['1970-01-02', 'Noon':'Midnight']
# Noon       -0.132430
# Evening    -0.504896
# Midnight   -0.444758
# Name: 1970-01-02 00:00:00, dtype: float64

5.5.9. Boolean list with the same length as the row axis

  • Returns rows for which the corresponding entry is True

  • The list must have the same length as the row axis

df.loc[[True, False, True, False, False, False]]
#               Morning      Noon    Evening   Midnight
# 1970-01-01  -0.438232  1.493865  -1.113116  -0.042712
# 1970-01-03  -0.802938  0.301141  -0.258279  -1.492688

5.5.10. Conditional that returns a boolean Series

df.loc[df['Morning'] < 0]
#               Morning       Noon    Evening   Midnight
# 1970-01-01  -0.438232   1.493865  -1.113116  -0.042712
# 1970-01-02  -1.798254  -1.440613  -0.946041  -2.732719
# 1970-01-03  -0.802938   0.301141  -0.258279  -1.492688

5.5.11. Conditional that returns a boolean Series with column labels specified

df.loc[df['Morning'] < 0, 'Evening']
# 1970-01-01   -1.113116
# 1970-01-02   -0.946041
# 1970-01-03   -0.258279
# Freq: D, Name: Evening, dtype: float64
df.loc[df['Morning'] < 0, ['Morning', 'Evening']]
#               Morning    Evening
# 1970-01-01  -0.438232  -1.113116
# 1970-01-02  -1.798254  -0.946041
# 1970-01-03  -0.802938  -0.258279

5.5.12. Filtering with callable

def morning_below_zero(df):
    return df['Morning'] < 0

df.loc[morning_below_zero]
#               Morning       Noon    Evening   Midnight
# 1970-01-01  -0.438232   1.493865  -1.113116  -0.042712
# 1970-01-02  -1.798254  -1.440613  -0.946041  -2.732719
# 1970-01-03  -0.802938   0.301141  -0.258279  -1.492688
df.loc[lambda df: df['Morning'] < 0]
#               Morning       Noon    Evening   Midnight
# 1970-01-01  -0.438232   1.493865  -1.113116  -0.042712
# 1970-01-02  -1.798254  -1.440613  -0.946041  -2.732719
# 1970-01-03  -0.802938   0.301141  -0.258279  -1.492688

5.5.13. Set value for all items matching the list of labels

df.loc[df['Morning'] < 0, 'Evening'] = 0.0
#               Morning       Noon    Evening   Midnight
# 1970-01-01  -0.438232   1.493865   0.000000  -0.042712
# 1970-01-02  -1.798254  -1.440613   0.000000  -2.732719
# 1970-01-03  -0.802938   0.301141   0.000000  -1.492688
# 1970-01-04   0.820863  -0.574301  -0.901532  -0.191122
# 1970-01-05   1.800466  -0.777165   0.611194   1.345492
# 1970-01-06   0.141029  -0.134463  -0.046938   0.401554

5.5.14. Set value for an entire row

df.loc['1970-01-01'] = 0.0
#               Morning       Noon    Evening   Midnight
# 1970-01-01   0.000000   0.000000   0.000000   0.000000
# 1970-01-02   0.391381  -0.034658  -0.026441  -0.528525
# 1970-01-03   0.292151   1.328559   1.510460   0.973299
# 1970-01-04   0.985247   1.033980  -0.688412   1.171957
# 1970-01-05  -0.210144   0.112805  -0.691808   0.339706
# 1970-01-06   1.259968  -0.283706  -1.333459  -0.962464

5.5.15. Set value for an entire column

df.loc[:, 'Evening'] = 0.0
#               Morning       Noon  Evening   Midnight
# 1970-01-01   0.000000   0.000000      0.0   0.000000
# 1970-01-02   0.391381  -0.034658      0.0  -0.528525
# 1970-01-03   0.292151   1.328559      0.0   0.973299
# 1970-01-04   0.985247   1.033980      0.0   1.171957
# 1970-01-05  -0.210144   0.112805      0.0   0.339706
# 1970-01-06   1.259968  -0.283706      0.0  -0.962464

5.5.16. Set value for rows matching callable condition

  • Important!

df.loc[df['Morning'] < 0] = 0.0
#              Morning       Noon  Evening   Midnight
# 1970-01-01  0.000000   0.000000      0.0   0.000000
# 1970-01-02  0.391381  -0.034658      0.0  -0.528525
# 1970-01-03  0.292151   1.328559      0.0   0.973299
# 1970-01-04  0.985247   1.033980      0.0   1.171957
# 1970-01-05  0.000000   0.000000      0.0   0.000000
# 1970-01-06  1.259968  -0.283706      0.0  -0.962464

5.6. Accessing values

  • Access a single value for a row/column pair by integer position

  • Use iat if you only need to get or set a single value in a DataFrame or Series

  • iat - integer at (no where or other extras)

values = np.random.randn(6, 4)
columns = ['Morning', 'Noon', 'Evening', 'Midnight']
indexes = pd.date_range('1970-01-01', periods=6)

df = pd.DataFrame(values, index=indexes, columns=columns)
#               Morning       Noon    Evening   Midnight
# 1970-01-01   0.486726  -0.291364  -1.105248  -0.333574
# 1970-01-02   0.301838  -0.603001   0.069894   0.309209
# 1970-01-03  -0.424429   0.845898  -1.460294   0.109749
# 1970-01-04   0.909958  -0.986246   0.122176   1.205697
# 1970-01-05  -0.172540  -0.974159  -0.848519   1.691875
# 1970-01-06   0.047059   0.359687   0.531386  -0.587663

5.6.1. Get value at specified row/column pair

  • First argument is the row position

  • Second argument is the column position

df.iat[0, 0]
# -0.728881431659923

df.iat[1, 0]
# 1.2427906060319527

df.iat[0, 1]
# 2.4525672341751084

5.6.2. Set value at specified row/column pair

df.iat[0, 0] = 0.0
df.iat[0, 0]
# 0.0

5.6.3. Get value within a series

  • loc returns Series

df.loc['1970-01-01'].iat[1]
# 2.4525672341751084

5.7. Modifying values

df = pd.DataFrame([ {'A': 1, 'B': 2},
                    {'C': 3}])
#       A     B     C
# 0   1.0   2.0   NaN
# 1   NaN   NaN   3.0

5.7.1. Adding column

df['Z'] = ['aa', 'bb']
#       A     B     C   Z
# 0   1.0   2.0   NaN  aa
# 1   NaN   NaN   3.0  bb

5.7.2. Drop row if all values are NaN

  • axis=0: rows

df = pd.DataFrame([ {'A': 1, 'B': 2},
                    {'B': 2, 'C': 3}])
#       A     B     C
# 0   1.0   2.0   NaN
# 1   NaN   2.0   3.0
df.dropna(how='all')
#       A     B     C
# 0   1.0   2.0   NaN
# 1   NaN   2.0   3.0

5.7.3. Drop column if all values are NaN

  • axis=1: columns

df = pd.DataFrame([ {'A': 1, 'B': 2},
                    {'B': 2, 'C': 3}])
#       A     B     C
# 0   1.0   2.0   NaN
# 1   NaN   2.0   3.0
df.dropna(how='all', axis=1)
#       A     B     C
# 0   1.0   2.0   NaN
# 1   NaN   2.0   3.0

5.7.4. Drop row if any value is NaN

  • axis=0: rows

df = pd.DataFrame([ {'A': 1, 'B': 2},
                    {'B': 2, 'C': 3}])
#       A     B     C
# 0   1.0   2.0   NaN
# 1   NaN   2.0   3.0
df.dropna(how='any')
# Empty DataFrame
# Columns: [A, B, C]
# Index: []

5.7.5. Drop column if any value is NaN

  • axis=1: columns

df = pd.DataFrame([ {'A': 1, 'B': 2},
                    {'B': 2, 'C': 3}])
#       A     B     C
# 0   1.0   2.0   NaN
# 1   NaN   2.0   3.0
df.dropna(how='any', axis=1)
#       B
# 0   2.0
# 1   2.0

5.7.6. Fill NA/NaN with specified values

df = pd.DataFrame([ {'A': 1, 'B': 2},
                    {'B': 2, 'C': 3}])
#       A     B     C
# 0   1.0   2.0   NaN
# 1   NaN   2.0   3.0
df.fillna(0.0)
#       A     B     C
# 0   1.0   2.0   0.0
# 1   0.0   2.0   3.0

5.7.7. Fill NA/NaN with values from dict with column names

df = pd.DataFrame([ {'A': 1, 'B': 2},
                    {'B': 2, 'C': 3}])
#       A     B     C
# 0   1.0   2.0   NaN
# 1   NaN   2.0   3.0
values = {'A': 5, 'B': 7, 'C': 9}

df.fillna(values)
#       A     B     C
# 0   1.0   2.0   9.0
# 1   5.0   2.0   3.0

5.7.8. Fill NA/NaN values from previous row

  • ffill: propagate last valid observation forward

df = pd.DataFrame([ {'A': 1, 'B': 2},
                    {'B': 2, 'C': 3}])
#       A     B     C
# 0   1.0   2.0   NaN
# 1   NaN   2.0   3.0
df.fillna(method='ffill')  # newer pandas: df.ffill()
#       A     B     C
# 0   1.0   2.0   NaN
# 1   1.0   2.0   3.0

5.7.9. Fill NA/NaN values from next row

  • bfill: use NEXT valid observation to fill gap

df = pd.DataFrame([ {'A': 1, 'B': 2},
                    {'B': 2, 'C': 3}])
#       A     B     C
# 0   1.0   2.0   NaN
# 1   NaN   2.0   3.0
df.fillna(method='bfill')  # newer pandas: df.bfill()
#       A     B     C
# 0   1.0   2.0   3.0
# 1   NaN   2.0   3.0

5.7.10. Transpose

values = np.random.randn(6, 4)
columns = ['Morning', 'Noon', 'Evening', 'Midnight']
indexes = pd.date_range('1970-01-01', periods=6)

df = pd.DataFrame(values, index=indexes, columns=columns)
#               Morning       Noon    Evening   Midnight
# 1970-01-01   0.486726  -0.291364  -1.105248  -0.333574
# 1970-01-02   0.301838  -0.603001   0.069894   0.309209
# 1970-01-03  -0.424429   0.845898  -1.460294   0.109749
# 1970-01-04   0.909958  -0.986246   0.122176   1.205697
# 1970-01-05  -0.172540  -0.974159  -0.848519   1.691875
# 1970-01-06   0.047059   0.359687   0.531386  -0.587663
df.T
df.transpose()
#          1970-01-01  1970-01-02  1970-01-03  1970-01-04  1970-01-05  1970-01-06
# Morning   -0.728881    1.242791   -0.300652    0.973488    0.527855    0.805407
# Noon       2.452567    0.595302   -0.272770   -2.083819   -0.911698   -0.931830
# Evening    0.911723    0.176457   -0.471503    0.402725   -0.842518   -0.063189
# Midnight  -0.849580   -0.560606   -0.852577   -0.331235    1.653468   -0.792088

5.7.11. Substitute values in columns

df.loc[df['Species'] == 0, 'Species'] = 'Setosa'
df.loc[df['Species'] == 1, 'Species'] = 'Versicolor'
df.loc[df['Species'] == 2, 'Species'] = 'Virginica'
df['Species'].replace(to_replace={
    0: 'setosa',
    1: 'versicolor',
    2: 'virginica'
}, inplace=True)
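Both approaches assume a numeric Species column (as in the Iris dataset). A self-contained sketch of the replace() variant:

```python
import pandas as pd

df = pd.DataFrame({'Species': [0, 1, 2, 0]})

# replace() maps all old values to new ones in a single call
df['Species'] = df['Species'].replace({0: 'setosa',
                                       1: 'versicolor',
                                       2: 'virginica'})
# 0        setosa
# 1    versicolor
# 2     virginica
# 3        setosa
```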

5.8. Selecting values

values = np.random.randn(6, 4)
columns = ['A', 'B', 'C', 'D']
indexes = pd.date_range('1970-01-01', periods=6)
# DatetimeIndex(['1970-01-01',
#                '1970-01-02',
#                '1970-01-03',
#                '1970-01-04',
#                '1970-01-05',
#                '1970-01-06'], dtype='datetime64[ns]', freq='D')

df = pd.DataFrame(values, index=indexes, columns=columns)

5.8.1. First n records

df.head(2)
#                    A          B          C          D
# 1970-01-01  0.131926  -1.825204  -1.909562   1.274718
# 1970-01-02  0.084471  -0.932586   0.160637  -0.275183

5.8.2. Last n records

df.tail(3)
#                     A          B          C         D
# 1970-01-04  -0.974425   1.327082  -0.435516  1.328745
# 1970-01-05   0.589973   0.748417  -1.680741  0.510512
# 1970-01-06   1.361922  -0.827940   0.400024  0.047176

5.8.3. Sample n elements

df.sample()
#                     A          B          C         D
# 1970-01-05   0.589973   0.748417  -1.680741  0.510512
df.sample(2)
#                     A          B          C         D
# 1970-01-04  -0.974425   1.327082  -0.435516  1.328745
# 1970-01-01  0.131926  -1.825204  -1.909562   1.274718
df.sample(n=2, replace=True)
#                     A          B          C         D
# 1970-01-05   0.589973   0.748417  -1.680741  0.510512
# 1970-01-05   0.589973   0.748417  -1.680741  0.510512

5.8.4. Sample n percent of elements

  • 0.05 is 5%

  • 1.0 is 100%

df.sample(frac=0.05)
#      Sepal length  Sepal width  Petal length  Petal width     Species
# 146           5.9          3.0           4.2          1.5  Versicolor
# 135           4.7          3.2           1.3          0.2      Setosa
# 15            6.6          3.0           4.4          1.4  Versicolor
# 68            5.0          3.6           1.4          0.2      Setosa
# 42            6.2          2.8           4.8          1.8   Virginica
# 10            6.5          3.0           5.2          2.0   Virginica
# 17            5.8          2.7           5.1          1.9   Virginica
# 66            5.4          3.4           1.7          0.2      Setosa
df.sample(frac=0.05).reset_index(drop=True)
#      Sepal length  Sepal width  Petal length  Petal width     Species
# 0             5.9          3.0           4.2          1.5  Versicolor
# 1             4.7          3.2           1.3          0.2      Setosa
# 2             6.6          3.0           4.4          1.4  Versicolor
# 3             5.0          3.6           1.4          0.2      Setosa
# 4             6.2          2.8           4.8          1.8   Virginica
# 5             6.5          3.0           5.2          2.0   Virginica
# 6             5.8          2.7           5.1          1.9   Virginica
# 7             5.4          3.4           1.7          0.2      Setosa

5.9. Sorting

values = np.random.randn(6, 4)
columns = ['A', 'B', 'C', 'D']
indexes = pd.date_range('1970-01-01', periods=6)
# DatetimeIndex(['1970-01-01',
#                '1970-01-02',
#                '1970-01-03',
#                '1970-01-04',
#                '1970-01-05',
#                '1970-01-06'], dtype='datetime64[ns]', freq='D')

df = pd.DataFrame(values, index=indexes, columns=columns)

5.9.1. Sort by index

df.sort_index(ascending=False) # default axis=0
df.sort_index(ascending=False, inplace=True)

#                     A          B          C          D
# 1970-01-06   1.361922  -0.827940   0.400024   0.047176
# 1970-01-05   0.589973   0.748417  -1.680741   0.510512
# 1970-01-04  -0.974425   1.327082  -0.435516   1.328745
# 1970-01-03  -1.308835  -0.285436  -0.757591  -0.042493
# 1970-01-02   0.084471  -0.932586   0.160637  -0.275183
# 1970-01-01   0.131926  -1.825204  -1.909562   1.274718

5.9.2. Sort by columns

df.sort_index(axis=1, ascending=False)

#                     D          C          B          A
# 1970-01-01   1.274718  -1.909562  -1.825204   0.131926
# 1970-01-02  -0.275183   0.160637  -0.932586   0.084471
# 1970-01-03  -0.042493  -0.757591  -0.285436  -1.308835
# 1970-01-04   1.328745  -0.435516   1.327082  -0.974425
# 1970-01-05   0.510512  -1.680741   0.748417   0.589973
# 1970-01-06   0.047176   0.400024  -0.827940   1.361922

5.9.3. Sort by values

df.sort_values('B')
df.sort_values('B', inplace=True)

# sorting by multiple columns: the second column breaks ties in the first
df.sort_values(['B', 'C'])

#                     A          B          C          D
# 1970-01-01   0.131926  -1.825204  -1.909562   1.274718
# 1970-01-02   0.084471  -0.932586   0.160637  -0.275183
# 1970-01-06   1.361922  -0.827940   0.400024   0.047176
# 1970-01-03  -1.308835  -0.285436  -0.757591  -0.042493
# 1970-01-05   0.589973   0.748417  -1.680741   0.510512
# 1970-01-04  -0.974425   1.327082  -0.435516   1.328745
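A deterministic sketch of sort_values on a small hypothetical frame, showing how the second column breaks ties:

```python
import pandas as pd

df = pd.DataFrame({'B': [3, 1, 1], 'C': [9, 8, 7]})

by_b = df.sort_values('B')          # rows with the smallest B come first
by_bc = df.sort_values(['B', 'C'])  # the two B == 1 rows are ordered by C
```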

5.10. Statistics

values = np.random.randn(6, 4)
columns = ['A', 'B', 'C', 'D']
indexes = pd.date_range('1970-01-01', periods=6)
# DatetimeIndex(['1970-01-01',
#                '1970-01-02',
#                '1970-01-03',
#                '1970-01-04',
#                '1970-01-05',
#                '1970-01-06'], dtype='datetime64[ns]', freq='D')

df = pd.DataFrame(values, index=indexes, columns=columns)

5.10.1. Arithmetic mean

df.mean()
# A   -0.078742
# B    0.241929
# C    0.110231
# D   -0.092946
# dtype: float64

5.10.2. Descriptive stats

df.describe()
#               A          B          C          D
# count  6.000000   6.000000   6.000000   6.000000
# mean  -0.078742   0.241929   0.110231  -0.092946
# std    0.690269   0.845521   0.746167   1.207483
# min   -0.928127  -0.931601  -0.812575  -1.789321
# 25%   -0.442016  -0.275899  -0.359650  -0.638282
# 50%   -0.202288   0.287667  -0.045933  -0.332729
# 75%    0.189195   0.882916   0.733453   0.902115
# max    1.062487   1.190259   1.036800   1.323504

5.10.3. Percentiles

values = np.array([[1, 1], [2, 10], [3, 100], [4, 100]])
columns = ['a', 'b']

df = pd.DataFrame(values, columns=columns)
#    a    b
# 0  1    1
# 1  2   10
# 2  3  100
# 3  4  100
df.quantile(.1)
# a    1.3
# b    3.7
# dtype: float64
df.quantile([.1, .5])
#        a     b
# 0.1  1.3   3.7
# 0.5  2.5  55.0

5.10.4. Other methods

Table 5.4. Descriptive statistics

Function    Description
--------    -----------
count       Number of non-null observations
sum         Sum of values
mean        Mean of values
mad         Mean absolute deviation
median      Arithmetic median of values
min         Minimum
max         Maximum
mode        Mode
abs         Absolute value
prod        Product of values
std         Unbiased standard deviation
var         Unbiased variance
sem         Unbiased standard error of the mean
skew        Unbiased skewness (3rd moment)
kurt        Unbiased kurtosis (4th moment)
quantile    Sample quantile (value at %)
cumsum      Cumulative sum
cumprod     Cumulative product
cummax      Cumulative maximum
cummin      Cumulative minimum
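A few of these methods on a small deterministic frame:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4]})

total = df['a'].sum()        # 10
middle = df['a'].median()    # 2.5
running = df['a'].cumsum()   # 1, 3, 6, 10
```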

5.11. Grouping

  • Groups rows using a mapper (a dict or key function; applies the given function to each group and returns the result as a Series) or by a list of columns

  • Check:

    • .value_counts()

    • .nunique()

    • .sum()

    • .count()

    • .max()

    • .first()

values = np.random.randn(6, 4)
columns = ['A', 'B', 'C', 'D']
indexes = pd.date_range('1970-01-01', periods=6)
# DatetimeIndex(['1970-01-01',
#                '1970-01-02',
#                '1970-01-03',
#                '1970-01-04',
#                '1970-01-05',
#                '1970-01-06'], dtype='datetime64[ns]', freq='D')

df = pd.DataFrame(values, index=indexes, columns=columns)

5.11.1. By count of elements

df.groupby('D').size()
#         D
# -1.789321    1
# -0.709686    1
# -0.424071    1
# -0.241387    1
#  1.283282    1
#  1.323504    1
# dtype: int64

5.11.2. By mean of elements

df.groupby('D').mean()
#         D          A          B          C
# -1.789321   0.257330   1.190259   0.074459
# -0.709686  -0.459565   0.827296   0.953118
# -0.424071   1.062487  -0.251961  -0.424092
# -0.241387  -0.928127  -0.931601   1.036800
# 1.283282   -0.015208   0.901456  -0.812575
# 1.323504   -0.389369  -0.283878  -0.166324

5.11.3. Example

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})

#      A      B          C          D
# 0  foo    one   0.239653  -1.505271
# 1  bar    one   0.567327  -0.109503
# 2  foo    two   1.726200  -0.401514
# 3  bar  three  -1.145225   1.379611
# 4  foo    two  -0.808037   1.148953
# 5  bar    two   0.883013  -0.347327
# 6  foo    one   0.225142  -0.995694
# 7  foo  three  -0.484968  -0.547152

df.groupby('A').mean(numeric_only=True)  # average only the numeric columns
#             C          D
# A
# bar  0.101705   0.307594
# foo  0.179598  -0.460136
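The helper methods listed at the top of this section, sketched on the same kind of frame (deterministic values instead of randn):

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                   'C': [1.0, 2.0, 3.0, 4.0]})

counts = df['A'].value_counts()     # foo: 2, bar: 2
unique = df['A'].nunique()          # 2 distinct values in column A
sums = df.groupby('A')['C'].sum()   # bar: 6.0, foo: 4.0
maxes = df.groupby('A')['C'].max()  # bar: 4.0, foo: 3.0
```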

5.12. Aggregations

  • df.groupby('month', as_index=False).agg({"duration": "sum"})

aggregations = {
    'duration':'sum',
    'date': lambda x: max(x) - 1
}
data.groupby('month').agg(aggregations)
aggregations = {
    'duration': [min, max, sum],        # find the min, max, and sum of the duration column
    'network_type': 'count',            # find the number of network type entries
    'date': [min, 'first', 'nunique']   # get the min, first, and number of unique dates per group
}

data.groupby(['month', 'item']).agg(aggregations)
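A runnable sketch of dict-based aggregation; the month/duration data here is hypothetical, standing in for the `data` frame in the snippets above:

```python
import pandas as pd

data = pd.DataFrame({'month': ['jan', 'jan', 'feb'],
                     'duration': [10, 20, 5]})

# One input column, several aggregate functions -> MultiIndex columns
result = data.groupby('month').agg({'duration': ['min', 'max', 'sum']})
#       duration
#            min max sum
# month
# feb            5   5    5
# jan           10  20   30
```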

5.13. Joins

[Image: ../_images/sql-joins1.png]

Figure 5.7. Joins

values = np.random.randn(6, 4)
columns = ['A', 'B', 'C', 'D']
indexes = pd.date_range('1970-01-01', periods=6)
# DatetimeIndex(['1970-01-01',
#                '1970-01-02',
#                '1970-01-03',
#                '1970-01-04',
#                '1970-01-05',
#                '1970-01-06'], dtype='datetime64[ns]', freq='D')

df1 = pd.DataFrame(values, index=indexes, columns=columns)
df2 = pd.DataFrame([ {'A': 1, 'B': 2},
                     {'C': 3}])

5.13.1. Left Join

df1.join(df2, how='left', rsuffix='_2')  # if column names collide, append suffix '_2'
df1.merge(df2, right_index=True, left_index=True, how='left', suffixes=('', '_2'))

5.13.2. Outer Join

df1.merge(df2)               # default: how='inner'
df1.merge(df2, how='outer')

5.13.3. Append

  • append does not change the indexes of the appended rows (beware of duplicated index values)

  • with ignore_index=True the new DataFrame gets consecutive indexes

  • note: DataFrame.append was removed in pandas 2.0; use pd.concat instead

df1.append(df2)
df1.append(df2, ignore_index=True)

5.13.4. Concat

  • Useful when combining DataFrames read from multiple files

pd.concat([df1, df2])
pd.concat([df1, df2], ignore_index=True)
pd.concat([df1, df2], join='inner')
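A deterministic sketch of concat index handling, using two small hypothetical frames:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'A': [3, 4]})

stacked = pd.concat([df1, df2])                        # keeps indexes: 0, 1, 0, 1
renumbered = pd.concat([df1, df2], ignore_index=True)  # consecutive:  0, 1, 2, 3
```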

5.14. Practical Example

import pandas as pd


df = pd.read_excel(
    io='filename.xls',
    sheet_name='Sheet 1',
    parse_dates=['from', 'to'],  # columns to parse as dates
    skiprows=1,                  # skip the first (metadata) row
)

# Rename columns to match database columns
df.rename(columns={
    'from': 'date_start',
    'to': 'date_end',
}, inplace=True)

# Drop all records where "name" is empty (NaN)
df.dropna(subset=['name'], how='all', inplace=True)

# Add column ``blacklist`` with data
df['blacklist'] = [True, False, True, False]

# Change NaN to None (fillna(None) raises an error; use where instead)
df = df.where(df.notna(), None)

# Choose columns and convert to a list of dicts
columns = ['name', 'date_start', 'date_end', 'blacklist']
data = df[columns].to_dict('records')

5.15. Assignments

5.15.1. Iris Dirty

  1. Given the Iris data, convert it to a DataFrame

  2. Skip the first line, which contains metadata

  3. Rename the columns to:

    • Sepal length

    • Sepal width

    • Petal length

    • Petal width

    • Species

  4. Substitute the values in the Species column:

    • 0 -> 'setosa'

    • 1 -> 'versicolor'

    • 2 -> 'virginica'

  5. Put all rows in random order and reset the index

  6. Print the first 5 and the last 3 rows

  7. Print the basic descriptive statistics

5.15.2. Iris Clean

  1. Given the Iris data, convert it to a DataFrame

  2. Specify the encoding explicitly

  3. The first line contains metadata (do not display it)

  4. Column names:

    • Sepal length

    • Sepal width

    • Petal length

    • Petal width

    • Species

  5. Filter the 'Petal length' column inplace, keeping values above 2.0

  6. Add a datetime column and fill it with today's date (UTC)

  7. Add a big_enough column: True where 'Petal width' is above 1.0, False otherwise

  8. Keep only the 'Sepal length', 'Sepal width' and 'Species' columns

  9. Print the basic descriptive statistics

5.15.3. Cars

  • Filename: pandas_cars.py

  • Lines of code to write: 15 lines

  • Estimated time of completion: 45 min

  1. Create a samochody (cars) DataFrame with:

    • a random integer column przebieg (mileage) in the range [0, 200 000]

    • a random column spalanie (fuel consumption) in the range [2, 20]

  2. Add a marka (brand) column:

    • if spalanie is in [0, 5], marka is VW

    • if spalanie is in [6, 10], marka is Ford

    • if spalanie is 11 or more, marka is UAZ

  3. Add a pochodzenie (origin) column:

    • if przebieg is below 100 km, pochodzenie is nowy (new)

    • if przebieg is above 100 km, pochodzenie is uzywany (used)

    • if przebieg is above 100 000 km, pochodzenie is z niemiec (from Germany)

  4. Analyze the data statistically

    • check the group sizes

    • run descriptive statistics

  5. Group the data by marka and by pochodzenie

5.15.4. EVA

  • Filename: pandas_eva.py

  • Lines of code to write: 25 lines

  • Estimated time of completion: 30 min

  1. Using the given URLs:

  2. Scrape the page using pandas.read_html()

  3. Combine the data using pd.concat

  4. Prepare a CSV file with the spacewalk data