5.25. DataFrame NaN

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1, 2, np.nan, np.nan, 3, np.nan, 4],
    'B': [1.1, 2.2, np.nan, np.nan, 3.3, np.nan, 4.4],
    'C': ['a', 'b', np.nan, np.nan, 'c', np.nan, 'd'],
    'D': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
})

df
#      A    B    C    D
# 0  1.0  1.1    a  NaN
# 1  2.0  2.2    b  NaN
# 2  NaN  NaN  NaN  NaN
# 3  NaN  NaN  NaN  NaN
# 4  3.0  3.3    c  NaN
# 5  NaN  NaN  NaN  NaN
# 6  4.0  4.4    d  NaN

5.25.1. Find NaN

5.25.1.1. Any

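  • df.any() checks whether a column holds at least one truthy value, skipping NaN by default; it does not detect NaN on its own

df.any()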
# A     True
# B     True
# C     True
# D    False
# dtype: bool

5.25.1.2. All

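  • df.all() checks whether every value in a column is truthy, skipping NaN by default, which is why the all-NaN column D yields True

df.all()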
# A    True
# B    True
# C    True
# D    True
# dtype: bool

5.25.1.3. Is Null

  • df.isnull() is an alias for df.isna(); both return the same boolean mask

df.isnull()
#        A      B      C     D
# 0  False  False  False  True
# 1  False  False  False  True
# 2   True   True   True  True
# 3   True   True   True  True
# 4  False  False  False  True
# 5   True   True   True  True
# 6  False  False  False  True

5.25.1.4. Is NA

df.isna()
#        A      B      C     D
# 0  False  False  False  True
# 1  False  False  False  True
# 2   True   True   True  True
# 3   True   True   True  True
# 4  False  False  False  True
# 5   True   True   True  True
# 6  False  False  False  True
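
The boolean mask returned by isna() can be reduced to answer "where are the NaN values?" directly. The snippet below is a minimal sketch that rebuilds the same df as above; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, np.nan, np.nan, 3, np.nan, 4],
    'B': [1.1, 2.2, np.nan, np.nan, 3.3, np.nan, 4.4],
    'C': ['a', 'b', np.nan, np.nan, 'c', np.nan, 'd'],
    'D': [np.nan] * 7,
})

mask = df.isna()

nan_per_column = mask.sum()              # number of NaN values in each column
has_any_nan = mask.any()                 # True for columns with at least one NaN
is_all_nan = mask.all()                  # True for columns that contain only NaN
rows_all_nan = mask.all(axis='columns')  # True for rows where every value is NaN
```

Here only column D is entirely NaN, while rows 2, 3 and 5 are NaN in every column.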

5.25.2. Fill NaN

5.25.2.1. With scalar value

  • axis=0 - rows

  • axis=1 - columns

df.fillna(0.0)
#      A    B  C    D
# 0  1.0  1.1  a  0.0
# 1  2.0  2.2  b  0.0
# 2  0.0  0.0  0  0.0
# 3  0.0  0.0  0  0.0
# 4  3.0  3.3  c  0.0
# 5  0.0  0.0  0  0.0
# 6  4.0  4.4  d  0.0

5.25.2.2. With dict values

  • axis=0 - rows

  • axis=1 - columns

df.fillna({
    'A': 99,
    'B': 88,
    'C': 77
})
#       A     B   C    D
# 0   1.0   1.1   a  NaN
# 1   2.0   2.2   b  NaN
# 2  99.0  88.0  77  NaN
# 3  99.0  88.0  77  NaN
# 4   3.0   3.3   c  NaN
# 5  99.0  88.0  77  NaN
# 6   4.0   4.4   d  NaN
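
The dict form also accepts computed values, so each column can get its own fill strategy. A minimal sketch, assuming the column mean is a sensible replacement for A and that 'unknown' is an acceptable placeholder for C (both choices are illustrative, not part of the original example):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, np.nan, np.nan, 3, np.nan, 4],
    'C': ['a', 'b', np.nan, np.nan, 'c', np.nan, 'd'],
    'D': [np.nan] * 7,
})

# Columns missing from the dict (here: D) are left untouched
filled = df.fillna({
    'A': df['A'].mean(),   # mean() skips NaN by default
    'C': 'unknown',        # hypothetical placeholder label
})
```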

5.25.2.3. Forward Fill

  • Values from previous row

  • ffill: propagate last valid observation forward

  • df.fillna(method='ffill') is deprecated since pandas 2.1; use df.ffill() instead

df.ffill()
#      A    B  C    D
# 0  1.0  1.1  a  NaN
# 1  2.0  2.2  b  NaN
# 2  2.0  2.2  b  NaN
# 3  2.0  2.2  b  NaN
# 4  3.0  3.3  c  NaN
# 5  3.0  3.3  c  NaN
# 6  4.0  4.4  d  NaN

5.25.2.4. Backward Fill

  • Values from next row

  • bfill: use the next valid observation to fill the gap

  • df.fillna(method='bfill') is deprecated since pandas 2.1; use df.bfill() instead

df.bfill()
#      A    B  C    D
# 0  1.0  1.1  a  NaN
# 1  2.0  2.2  b  NaN
# 2  3.0  3.3  c  NaN
# 3  3.0  3.3  c  NaN
# 4  3.0  3.3  c  NaN
# 5  4.0  4.4  d  NaN
# 6  4.0  4.4  d  NaN
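
Both ffill() and bfill() accept a limit argument that caps how many consecutive NaN values are filled. A minimal sketch on column A alone:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan, np.nan, 3, np.nan, 4]})

# Fill at most one consecutive NaN after each valid observation;
# the second NaN of the two-row gap (row 3) stays missing
limited = df.ffill(limit=1)
```

This is useful when a long gap should stay visible as missing instead of being filled with a stale value.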

5.25.2.5. Interpolate

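  • Default method is linear: NaN values are replaced with evenly spaced values between the neighbouring valid observations

  • Only numeric columns are interpolated; recent pandas versions warn when the frame contains object columns such as C

df.interpolate()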
#           A         B    C    D
# 0  1.000000  1.100000    a  NaN
# 1  2.000000  2.200000    b  NaN
# 2  2.333333  2.566667  NaN  NaN
# 3  2.666667  2.933333  NaN  NaN
# 4  3.000000  3.300000    c  NaN
# 5  3.500000  3.850000  NaN  NaN
# 6  4.000000  4.400000    d  NaN
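
One way to keep text columns out of the interpolation is to select the numeric columns first and write the result back. This is a sketch of that approach, not the only option:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, np.nan, np.nan, 3, np.nan, 4],
    'B': [1.1, 2.2, np.nan, np.nan, 3.3, np.nan, 4.4],
    'C': ['a', 'b', np.nan, np.nan, 'c', np.nan, 'd'],
})

# Interpolate numeric columns only; column C is left as it was
numeric = df.select_dtypes(include='number')
result = df.copy()
result[numeric.columns] = numeric.interpolate()
```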

5.25.3. Drop NaN

5.25.3.1. Drop Rows

  • axis=0 - rows

df.dropna(how='all')
#      A    B  C    D
# 0  1.0  1.1  a  NaN
# 1  2.0  2.2  b  NaN
# 4  3.0  3.3  c  NaN
# 6  4.0  4.4  d  NaN

df.dropna(how='all', axis='rows')
#      A    B  C    D
# 0  1.0  1.1  a  NaN
# 1  2.0  2.2  b  NaN
# 4  3.0  3.3  c  NaN
# 6  4.0  4.4  d  NaN

df.dropna(how='all', axis=0)
#      A    B  C    D
# 0  1.0  1.1  a  NaN
# 1  2.0  2.2  b  NaN
# 4  3.0  3.3  c  NaN
# 6  4.0  4.4  d  NaN

df.dropna(how='any')
# Empty DataFrame
# Columns: [A, B, C, D]
# Index: []

df.dropna(how='any', axis=0)
# Empty DataFrame
# Columns: [A, B, C, D]
# Index: []

df.dropna(how='any', axis='rows')
# Empty DataFrame
# Columns: [A, B, C, D]
# Index: []
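
Because column D is entirely NaN, how='any' drops every row. dropna() also takes subset and thresh arguments for narrowing the condition. The following is a minimal sketch on the same data; the choice of columns and threshold is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, np.nan, np.nan, 3, np.nan, 4],
    'B': [1.1, 2.2, np.nan, np.nan, 3.3, np.nan, 4.4],
    'C': ['a', 'b', np.nan, np.nan, 'c', np.nan, 'd'],
    'D': [np.nan] * 7,
})

# Consider only columns A and B when deciding whether to drop a row
by_subset = df.dropna(subset=['A', 'B'])

# Keep rows that have at least 3 non-NaN values
by_thresh = df.dropna(thresh=3)
```

Both calls keep rows 0, 1, 4 and 6, even though D is NaN in each of them.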

5.25.3.2. Drop Column

  • axis=1 - columns

df.dropna(how='all', axis='columns')
#      A    B    C
# 0  1.0  1.1    a
# 1  2.0  2.2    b
# 2  NaN  NaN  NaN
# 3  NaN  NaN  NaN
# 4  3.0  3.3    c
# 5  NaN  NaN  NaN
# 6  4.0  4.4    d

df.dropna(how='all', axis=1)
#      A    B    C
# 0  1.0  1.1    a
# 1  2.0  2.2    b
# 2  NaN  NaN  NaN
# 3  NaN  NaN  NaN
# 4  3.0  3.3    c
# 5  NaN  NaN  NaN
# 6  4.0  4.4    d

df.dropna(how='all', axis=-1)
# ValueError: No axis named -1 for object type <class 'pandas.core.frame.DataFrame'>

df.dropna(how='any', axis='columns')
# Empty DataFrame
# Columns: []
# Index: [0, 1, 2, 3, 4, 5, 6]

df.dropna(how='any', axis=1)
# Empty DataFrame
# Columns: []
# Index: [0, 1, 2, 3, 4, 5, 6]

df.dropna(how='any', axis=-1)
# ValueError: No axis named -1 for object type <class 'pandas.core.frame.DataFrame'>

5.25.4. Assignments

5.25.4.1. Iris Dirty

  • Complexity level: easy

  • Lines of code to write: 10 lines

  • Estimated time of completion: 20 min

  • Solution: solution/df_update.py

  1. Download the Iris data: data/iris-dirty.csv

  2. Load the Iris data into a DataFrame

  3. Skip the first line, which contains metadata

  4. Rename the columns to:

    • Sepal length

    • Sepal width

    • Petal length

    • Petal width

    • Species

  5. Replace the values in the Species column:

    • 0 -> 'setosa',

    • 1 -> 'versicolor',

    • 2 -> 'virginica'

  6. Set to NaN all values in the 'Petal length' column that are less than 4

  7. Linearly interpolate all NaN values

  8. Drop the rows that still contain NaN values

  9. Print the first 2 rows and the last row

  10. Plot basic descriptive statistics