4.13. DataFrame NA

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1, 2, np.nan, np.nan, 3, np.nan, 4],
    'B': [1.1, 2.2, np.nan, np.nan, 3.3, np.nan, 4.4],
    'C': ['a', 'b', np.nan, np.nan, 'c', np.nan, 'd'],
    'D': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
})

df
#      A    B    C    D
# 0  1.0  1.1    a  NaN
# 1  2.0  2.2    b  NaN
# 2  NaN  NaN  NaN  NaN
# 3  NaN  NaN  NaN  NaN
# 4  3.0  3.3    c  NaN
# 5  NaN  NaN  NaN  NaN
# 6  4.0  4.4    d  NaN

4.13.1. Select

Any:

df.any()
# A     True
# B     True
# C     True
# D    False
# dtype: bool

All:

df.all()
# A    True
# B    True
# C    True
# D    True
# dtype: bool

Is Null:

df.isnull()
#        A      B      C     D
# 0  False  False  False  True
# 1  False  False  False  True
# 2   True   True   True  True
# 3   True   True   True  True
# 4  False  False  False  True
# 5   True   True   True  True
# 6  False  False  False  True

Is NA:

df.isna()
#        A      B      C     D
# 0  False  False  False  True
# 1  False  False  False  True
# 2   True   True   True  True
# 3   True   True   True  True
# 4  False  False  False  True
# 5   True   True   True  True
# 6  False  False  False  True

4.13.2. Update

  • axis=0 - rows

  • axis=1 - columns

Fill NA with scalar value:

df.fillna(0.0)
#      A    B  C    D
# 0  1.0  1.1  a  0.0
# 1  2.0  2.2  b  0.0
# 2  0.0  0.0  0  0.0
# 3  0.0  0.0  0  0.0
# 4  3.0  3.3  c  0.0
# 5  0.0  0.0  0  0.0
# 6  4.0  4.4  d  0.0

Fill NA with dict values:

df.fillna({
    'A': 99,
    'B': 88,
    'C': 77
})
#       A     B   C    D
# 0   1.0   1.1   a  NaN
# 1   2.0   2.2   b  NaN
# 2  99.0  88.0  77  NaN
# 3  99.0  88.0  77  NaN
# 4   3.0   3.3   c  NaN
# 5  99.0  88.0  77  NaN
# 6   4.0   4.4   d  NaN

Forward Fill. ffill: propagate last valid observation forward:

df.fillna(method='ffill')
#      A    B  C    D
# 0  1.0  1.1  a  NaN
# 1  2.0  2.2  b  NaN
# 2  2.0  2.2  b  NaN
# 3  2.0  2.2  b  NaN
# 4  3.0  3.3  c  NaN
# 5  3.0  3.3  c  NaN
# 6  4.0  4.4  d  NaN

Backward Fill. bfill: use NEXT valid observation to fill gap:

df.fillna(method='bfill')
#      A    B  C    D
# 0  1.0  1.1  a  NaN
# 1  2.0  2.2  b  NaN
# 2  3.0  3.3  c  NaN
# 3  3.0  3.3  c  NaN
# 4  3.0  3.3  c  NaN
# 5  4.0  4.4  d  NaN
# 6  4.0  4.4  d  NaN

Interpolate:

df.interpolate()
#           A         B    C    D
# 0  1.000000  1.100000    a  NaN
# 1  2.000000  2.200000    b  NaN
# 2  2.333333  2.566667  NaN  NaN
# 3  2.666667  2.933333  NaN  NaN
# 4  3.000000  3.300000    c  NaN
# 5  3.500000  3.850000  NaN  NaN
# 6  4.000000  4.400000    d  NaN
Table 4.3. Interpolation techniques

Method

Description

linear

Ignore the index and treat the values as equally spaced. This is the only method supported on MultiIndexes

time

Works on daily and higher resolution data to interpolate given length of interval

index, values

use the actual numerical values of the index.

pad

Fill in NA using existing values

nearest, zero, slinear, quadratic, cubic, spline, barycentric, polynomial

Passed to scipy.interpolate.interp1d. These methods use the numerical values of the index. Both polynomial and spline require that you also specify an order (int), e.g. df.interpolate(method='polynomial', order=5)

krogh, piecewise_polynomial, spline, pchip, akima

Wrappers around the SciPy interpolation methods of similar names

from_derivatives

Refers to scipy.interpolate.BPoly.from_derivatives which replaces piecewise_polynomial interpolation method in scipy 0.18.

4.13.3. Drop

  • axis=0 - rows

  • axis=1 - columns

Drop Rows:

df.dropna(how='all')
#      A    B  C    D
# 0  1.0  1.1  a  NaN
# 1  2.0  2.2  b  NaN
# 4  3.0  3.3  c  NaN
# 6  4.0  4.4  d  NaN

df.dropna(how='all', axis='rows')
#      A    B  C    D
# 0  1.0  1.1  a  NaN
# 1  2.0  2.2  b  NaN
# 4  3.0  3.3  c  NaN
# 6  4.0  4.4  d  NaN

df.dropna(how='all', axis=0)
#      A    B  C    D
# 0  1.0  1.1  a  NaN
# 1  2.0  2.2  b  NaN
# 4  3.0  3.3  c  NaN
# 6  4.0  4.4  d  NaN

df.dropna(how='any')
# Empty DataFrame
# Columns: [A, B, C, D]
# Index: []

df.dropna(how='any', axis=0)
# Empty DataFrame
# Columns: [A, B, C, D]
# Index: []

df.dropna(how='any', axis='rows')
# Empty DataFrame
# Columns: [A, B, C, D]
# Index: []

Drop Columns:

df.dropna(how='all', axis='columns')
#      A    B    C
# 0  1.0  1.1    a
# 1  2.0  2.2    b
# 2  NaN  NaN  NaN
# 3  NaN  NaN  NaN
# 4  3.0  3.3    c
# 5  NaN  NaN  NaN
# 6  4.0  4.4    d

df.dropna(how='all', axis=1)
#      A    B    C
# 0  1.0  1.1    a
# 1  2.0  2.2    b
# 2  NaN  NaN  NaN
# 3  NaN  NaN  NaN
# 4  3.0  3.3    c
# 5  NaN  NaN  NaN
# 6  4.0  4.4    d

df.dropna(how='all', axis=-1)
# Traceback (most recent call last):
# ValueError: No axis named -1 for object type <class 'pandas.core.frame.DataFrame'>

df.dropna(how='any', axis='columns')
# Empty DataFrame
# Columns: []
# Index: [0, 1, 2, 3, 4, 5, 6]

df.dropna(how='any', axis=1)
# Empty DataFrame
# Columns: []
# Index: [0, 1, 2, 3, 4, 5, 6]

df.dropna(how='any', axis=-1)
# Traceback (most recent call last):
# ValueError: No axis named -1 for object type <class 'pandas.core.frame.DataFrame'>

4.13.4. Assignments

Code 4.33. Solution
"""
* Assignment: DataFrame NaN
* Complexity: easy
* Lines of code: 10 lines
* Time: 8 min

English:
    TODO: English Translation
    X. Run doctests - all must succeed

Polish:
    1. Użyj danych z sekcji "Given" (patrz poniżej)
    2. Wczytaj dane z `DATA` jako `df: pd.DataFrame`
    3. Pomiń pierwszą linię z metadanymi
    4. Zmień nazwy kolumn na:
        a. Sepal length
        b. Sepal width
        c. Petal length
        d. Petal width
        e. Species
    5. Podmień wartości w kolumnie species
        a. 0 -> 'setosa',
        b. 1 -> 'versicolor',
        c. 2 -> 'virginica'
    6. Wybierz wartości w kolumnie 'Petal length' mniejsze od 4
    7. Wybrane wartości ustaw na `NaN`
    8. Interpoluj liniowo wszystkie wartości `NaN`
    9. Usuń wiersze z pozostałymi wartościami `NaN`
    10. Zdefiniuj `result` jako dwa pierwsze wiersze
    X. Uruchom doctesty - wszystkie muszą się powieść

Tests:
    >>> import sys; sys.tracebacklimit = 0

    >>> type(result) is pd.DataFrame
    True
    >>> pd.set_option('display.width', 500)
    >>> pd.set_option('display.max_columns', 10)
    >>> pd.set_option('display.max_rows', 10)
    >>> result  # doctest: +NORMALIZE_WHITESPACE
       Sepal length  Sepal width  Petal length  Petal width     Species
    1           5.9          3.0           5.1          1.8   virginica
    2           6.0          3.4           4.5          1.6  versicolor
"""


# Given
import pandas as pd
import numpy as np


DATA = 'https://raw.githubusercontent.com/AstroMatt/book-python/master/_data/csv/iris-dirty.csv'
COLUMNS = [
    'Sepal length',
    'Sepal width',
    'Petal length',
    'Petal width',
    'Species']

result = ...