4.20. DataFrame Modification

import pandas as pd

df = pd.DataFrame([ {'A': 1, 'B': 2},
                    {'B': 2, 'C': 3}])
#       A     B     C
# 0   1.0   2.0   NaN
# 1   NaN   2.0   3.0

4.20.1. Adding a column

df['Z'] = ['aa', 'bb']
#       A     B     C   Z
# 0   1.0   2.0   NaN  aa
# 1   NaN   2.0   3.0  bb
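
Besides a list of values, a new column can be built from a scalar or from existing columns. The following is a minimal sketch; the column names `Y`, `Sum` and `Double` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame([{'A': 1, 'B': 2},
                   {'B': 2, 'C': 3}])

# A scalar is broadcast to every row
df['Y'] = 0

# A column computed from existing columns; NaN in 'A' propagates to the result
df['Sum'] = df['A'] + df['B']

# assign() returns a new DataFrame and leaves df itself unchanged
df2 = df.assign(Double=df['B'] * 2)
```

Plain assignment with `df[...] = ...` modifies `df` in place, while `assign()` is convenient when the original frame should stay untouched.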

4.20.2. Drop row if all values are NaN

  • axis=0: rows

df = pd.DataFrame([ {'A': 1, 'B': 2},
                    {'B': 2, 'C': 3}])
#       A     B     C
# 0   1.0   2.0   NaN
# 1   NaN   2.0   3.0
df.dropna(how='all')
#       A     B     C
# 0   1.0   2.0   NaN
# 1   NaN   2.0   3.0

4.20.3. Drop column if all values are NaN

  • axis=1: columns

df = pd.DataFrame([ {'A': 1, 'B': 2},
                    {'B': 2, 'C': 3}])
#       A     B     C
# 0   1.0   2.0   NaN
# 1   NaN   2.0   3.0
df.dropna(how='all', axis=1)
#       A     B     C
# 0   1.0   2.0   NaN
# 1   NaN   2.0   3.0

4.20.4. Drop row if any value is NaN

  • axis=0: rows

df = pd.DataFrame([ {'A': 1, 'B': 2},
                    {'B': 2, 'C': 3}])
#       A     B     C
# 0   1.0   2.0   NaN
# 1   NaN   2.0   3.0
df.dropna(how='any')
#       A     B     C
# (no rows left: every row contains at least one NaN)

4.20.5. Drop column if any value is NaN

  • axis=1: columns

df = pd.DataFrame([ {'A': 1, 'B': 2},
                    {'B': 2, 'C': 3}])
#       A     B     C
# 0   1.0   2.0   NaN
# 1   NaN   2.0   3.0
df.dropna(how='any', axis=1)
#       B
# 0   2.0
# 1   2.0
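
The `how` argument is not the only way to control `dropna()`. As a rough sketch on the same data, `subset` restricts the check to selected columns and `thresh` sets the minimum number of non-NaN values a row must have to be kept. The variable names below are illustrative only.

```python
import pandas as pd

df = pd.DataFrame([{'A': 1, 'B': 2},
                   {'B': 2, 'C': 3}])

# Drop a row only when column 'A' is missing
only_a = df.dropna(subset=['A'])

# Keep rows that have at least 2 non-NaN values (both rows qualify here)
at_least_two = df.dropna(thresh=2)

# Requiring 3 non-NaN values removes every row
at_least_three = df.dropna(thresh=3)
```

In every case `dropna()` returns a new DataFrame, so `df` still holds both rows afterwards.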

4.20.6. Fill NA/NaN with specified values

df = pd.DataFrame([ {'A': 1, 'B': 2},
                    {'B': 2, 'C': 3}])
#       A     B     C
# 0   1.0   2.0   NaN
# 1   NaN   2.0   3.0
df.fillna(0.0)
#       A     B     C
# 0   1.0   2.0   0.0
# 1   0.0   2.0   3.0

4.20.7. Fill NA/NaN with values from dict with column names

df = pd.DataFrame([ {'A': 1, 'B': 2},
                    {'B': 2, 'C': 3}])
#       A     B     C
# 0   1.0   2.0   NaN
# 1   NaN   2.0   3.0
values = {'A': 5, 'B': 7, 'C': 9}

df.fillna(values)
#       A     B     C
# 0   1.0   2.0   9.0
# 1   5.0   2.0   3.0
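
The per-column fill values do not have to be typed by hand. A minimal sketch, assuming that the column mean is an acceptable replacement for missing data: `df.mean()` returns a Series indexed by column name, which `fillna()` accepts in the same way as a dict.

```python
import pandas as pd

df = pd.DataFrame([{'A': 1, 'B': 2},
                   {'B': 2, 'C': 3}])

# Series with one mean per column; NaN values are skipped by default
means = df.mean()

# Each NaN is replaced with the mean of its own column
filled = df.fillna(means)
```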

4.20.8. Fill NA/NaN values from previous row

  • ffill: propagate the last valid observation forward to fill the gap

df = pd.DataFrame([ {'A': 1, 'B': 2},
                    {'B': 2, 'C': 3}])
#       A     B     C
# 0   1.0   2.0   NaN
# 1   NaN   2.0   3.0
df.ffill()  # older spelling: df.fillna(method='ffill'), deprecated in newer pandas
#       A     B     C
# 0   1.0   2.0   NaN
# 1   1.0   2.0   3.0

4.20.9. Fill NA/NaN values from next row

  • bfill: use the next valid observation to fill the gap

df = pd.DataFrame([ {'A': 1, 'B': 2},
                    {'B': 2, 'C': 3}])
#       A     B     C
# 0   1.0   2.0   NaN
# 1   NaN   2.0   3.0
df.bfill()  # older spelling: df.fillna(method='bfill'), deprecated in newer pandas
#       A     B     C
# 0   1.0   2.0   3.0
# 1   NaN   2.0   3.0
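
With only two rows the difference between the two directions is easy to miss. The sketch below uses a longer, made-up Series and also shows the `limit` argument, which caps how many consecutive NaN values are filled.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

# Forward fill: the gap takes the previous valid value
forward = s.ffill()

# Backward fill: the gap takes the next valid value
backward = s.bfill()

# Fill at most one consecutive NaN; the second one stays missing
limited = s.ffill(limit=1)
```

A leading NaN cannot be forward-filled and a trailing NaN cannot be backward-filled, because there is no valid observation on that side.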

4.20.10. Transpose

import numpy as np
import pandas as pd

np.random.seed(0)


values = np.random.randn(6, 4)
columns = ['Morning', 'Noon', 'Evening', 'Midnight']
indexes = pd.date_range('1970-01-01', periods=6)

df = pd.DataFrame(values, index=indexes, columns=columns)
#               Morning       Noon    Evening   Midnight
# 1970-01-01   0.486726  -0.291364  -1.105248  -0.333574
# 1970-01-02   0.301838  -0.603001   0.069894   0.309209
# 1970-01-03  -0.424429   0.845898  -1.460294   0.109749
# 1970-01-04   0.909958  -0.986246   0.122176   1.205697
# 1970-01-05  -0.172540  -0.974159  -0.848519   1.691875
# 1970-01-06   0.047059   0.359687   0.531386  -0.587663
df.T             # shorthand for df.transpose()
df.transpose()
#          1970-01-01  1970-01-02  1970-01-03  1970-01-04  1970-01-05  1970-01-06
# Morning    0.486726    0.301838   -0.424429    0.909958   -0.172540    0.047059
# Noon      -0.291364   -0.603001    0.845898   -0.986246   -0.974159    0.359687
# Evening   -1.105248    0.069894   -1.460294    0.122176   -0.848519    0.531386
# Midnight  -0.333574    0.309209    0.109749    1.205697    1.691875   -0.587663
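
What transposition does to labels and shape is easier to verify on a small frame with fixed values. A minimal sketch, with illustrative labels `x`, `y` and `a`, `b`, `c`:

```python
import numpy as np
import pandas as pd

values = np.arange(6).reshape(2, 3)
df = pd.DataFrame(values, index=['x', 'y'], columns=['a', 'b', 'c'])

# Rows become columns and columns become rows
t = df.T

# The shape flips from (2, 3) to (3, 2)
shape_before = df.shape
shape_after = t.shape

# The value at row 'y', column 'b' moves to row 'b', column 'y'
moved = t.loc['b', 'y']
```

Transposing twice gives back a frame equal to the original when all columns share one dtype, as they do here.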

4.20.11. Substitute values in columns

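# Assumes df holds the Iris data with numeric codes in the 'Species' column.

# Option 1: boolean indexing with .loc, one assignment per value
# (on an integer column, newer pandas versions may warn about the dtype change)
df.loc[df['Species'] == 0, 'Species'] = 'setosa'
df.loc[df['Species'] == 1, 'Species'] = 'versicolor'
df.loc[df['Species'] == 2, 'Species'] = 'virginica'

# Option 2 (use instead of Option 1): a single replace() with a mapping.
# Assign the result back; inplace=True on df['Species'] may not update df.
df['Species'] = df['Species'].replace({
    0: 'setosa',
    1: 'versicolor',
    2: 'virginica',
})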
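
Another way to express the same substitution is `Series.map()` with a dict. The sketch below runs on a small made-up frame rather than the real Iris data; `names` is an illustrative variable.

```python
import pandas as pd

# Hypothetical sample with numeric species codes
df = pd.DataFrame({'Species': [0, 1, 2, 1]})

names = {0: 'setosa', 1: 'versicolor', 2: 'virginica'}

# map() looks up every value in the dict and returns a new Series
df['Species'] = df['Species'].map(names)
```

One difference worth keeping in mind: `map()` turns values missing from the dict into NaN, whereas `replace()` leaves them unchanged.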

4.20.12. Assignments

4.20.12.1. Iris Dirty

  1. Given the Iris data, convert it to a DataFrame

  2. Skip the first line, which contains metadata

  3. Rename the columns to:

    • Sepal length

    • Sepal width

    • Petal length

    • Petal width

    • Species

  4. Replace the values in the Species column:

    • 0 -> 'setosa',

    • 1 -> 'versicolor',

    • 2 -> 'virginica'

  5. Shuffle all rows into random order and reset the index

  6. Display the first 5 and the last 3 rows

  7. Plot the basic descriptive statistics
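
One possible outline of these steps is sketched below. The file itself is not shown in this section, so the `RAW` string is a hypothetical stand-in (one metadata line followed by comma-separated rows with a numeric species code), and the last step prints a text summary with `describe()` instead of drawing a plot.

```python
import io
import pandas as pd

# Hypothetical stand-in for the Iris file: metadata line, then data rows
RAW = """150,4,setosa,versicolor,virginica
5.1,3.5,1.4,0.2,0
7.0,3.2,4.7,1.4,1
6.3,3.3,6.0,2.5,2
4.9,3.0,1.4,0.2,0
"""

columns = ['Sepal length', 'Sepal width', 'Petal length', 'Petal width', 'Species']

# Steps 1-3: skip the metadata line and set the column names while reading
df = pd.read_csv(io.StringIO(RAW), skiprows=1, header=None, names=columns)

# Step 4: translate numeric codes into species names
df['Species'] = df['Species'].replace({0: 'setosa', 1: 'versicolor', 2: 'virginica'})

# Step 5: shuffle all rows, then rebuild a clean 0..n-1 index
df = df.sample(frac=1.0, random_state=0).reset_index(drop=True)

# Step 6: first 5 and last 3 rows
print(df.head(5))
print(df.tail(3))

# Step 7: descriptive statistics of the numeric columns
print(df.describe())
```

With a real file, `io.StringIO(RAW)` would be replaced by the path to the data; `random_state=0` only makes the shuffle repeatable and can be dropped.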