5.13. DataFrame NA

  • pd.NA and np.nan Represents missing values

  • pd.NA will be used in future, but for now, there are function which does not support it yet

  • .isna()

  • .dropna(how='any|all', axis='rows|columns')

  • .any()

  • .all()

  • .fillna(value|dict)

  • .ffill()

  • .bfill()

  • .interpolate() - works only with np.nan (not pd.NA)

A floating-point 'not a number' (NaN) value. Equivalent to the output of float('nan'). Due to the requirements of the IEEE-754 standard, math.nan and float('nan') are not considered to equal to any other numeric value, including themselves. To check whether a number is a NaN, use the isnan() function to test for NaNs instead of is or ==. Example [1]:

Python Standard Library:

>>> import math
>>>
>>> math.nan == math.nan
False
>>> float('nan') == float('nan')
False
>>> math.isnan(math.nan)
True
>>> math.isnan(float('nan'))
True

5.13.1. SetUp

>>> import pandas as pd
>>> import numpy as np
>>>
>>>
>>> df = pd.DataFrame({
...     'A': [1, 2, np.nan, np.nan, 3, np.nan, 4],
...     'B': [1.1, 2.2, np.nan, np.nan, 3.3, np.nan, 4.4],
...     'C': ['a', 'b', np.nan, np.nan, 'c', np.nan, 'd'],
...     'D': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
... })
>>>
>>> df
     A    B    C   D
0  1.0  1.1    a NaN
1  2.0  2.2    b NaN
2  NaN  NaN  NaN NaN
3  NaN  NaN  NaN NaN
4  3.0  3.3    c NaN
5  NaN  NaN  NaN NaN
6  4.0  4.4    d NaN

5.13.2. Check if Any

>>> df.any()
A     True
B     True
C     True
D    False
dtype: bool

5.13.3. Check if All

>>> df.all()
A    True
B    True
C    True
D    True
dtype: bool

5.13.4. Check if Null

>>> df.isnull()
       A      B      C     D
0  False  False  False  True
1  False  False  False  True
2   True   True   True  True
3   True   True   True  True
4  False  False  False  True
5   True   True   True  True
6  False  False  False  True

5.13.5. Check if NA

>>> df.isna()
       A      B      C     D
0  False  False  False  True
1  False  False  False  True
2   True   True   True  True
3   True   True   True  True
4  False  False  False  True
5   True   True   True  True
6  False  False  False  True

5.13.6. Fill With Scalar Value

>>> df.fillna(0.0)
     A    B    C    D
0  1.0  1.1    a  0.0
1  2.0  2.2    b  0.0
2  0.0  0.0  0.0  0.0
3  0.0  0.0  0.0  0.0
4  3.0  3.3    c  0.0
5  0.0  0.0  0.0  0.0
6  4.0  4.4    d  0.0

5.13.7. Fill With Dict Values

>>> df.fillna({
...     'A': 99,
...     'B': 88,
...     'C': 77
... })
      A     B   C   D
0   1.0   1.1   a NaN
1   2.0   2.2   b NaN
2  99.0  88.0  77 NaN
3  99.0  88.0  77 NaN
4   3.0   3.3   c NaN
5  99.0  88.0  77 NaN
6   4.0   4.4   d NaN

5.13.8. Fill Forwards

ffill: propagate last valid observation forward:

>>> df.fillna(method='ffill')
     A    B  C   D
0  1.0  1.1  a NaN
1  2.0  2.2  b NaN
2  2.0  2.2  b NaN
3  2.0  2.2  b NaN
4  3.0  3.3  c NaN
5  3.0  3.3  c NaN
6  4.0  4.4  d NaN

5.13.9. Fill Backwards

bfill: use NEXT valid observation to fill gap:

>>> df.fillna(method='bfill')
     A    B  C   D
0  1.0  1.1  a NaN
1  2.0  2.2  b NaN
2  3.0  3.3  c NaN
3  3.0  3.3  c NaN
4  3.0  3.3  c NaN
5  4.0  4.4  d NaN
6  4.0  4.4  d NaN

5.13.10. Interpolate

>>> df.interpolate()
          A         B    C   D
0  1.000000  1.100000    a NaN
1  2.000000  2.200000    b NaN
2  2.333333  2.566667  NaN NaN
3  2.666667  2.933333  NaN NaN
4  3.000000  3.300000    c NaN
5  3.500000  3.850000  NaN NaN
6  4.000000  4.400000    d NaN
Table 5.12. Interpolation techniques

Method

Description

linear

Ignore the index and treat the values as equally spaced. This is the only method supported on MultiIndexes

time

Works on daily and higher resolution data to interpolate given length of interval

index, values

use the actual numerical values of the index.

pad

Fill in NA using existing values

nearest, zero, slinear, quadratic, cubic, spline, barycentric, polynomial

Passed to scipy.interpolate.interp1d. These methods use the numerical values of the index. Both polynomial and spline require that you also specify an order (int), e.g. df.interpolate(method='polynomial', order=5)

krogh, piecewise_polynomial, spline, pchip, akima

Wrappers around the SciPy interpolation methods of similar names

from_derivatives

Refers to scipy.interpolate.BPoly.from_derivatives which replaces piecewise_polynomial interpolation method in scipy 0.18.

5.13.11. Drop Rows with NA

>>> df.dropna(how='all')
     A    B  C   D
0  1.0  1.1  a NaN
1  2.0  2.2  b NaN
4  3.0  3.3  c NaN
6  4.0  4.4  d NaN
>>> df.dropna(how='all', axis='rows')
     A    B  C   D
0  1.0  1.1  a NaN
1  2.0  2.2  b NaN
4  3.0  3.3  c NaN
6  4.0  4.4  d NaN
>>> df.dropna(how='all', axis=0)
     A    B  C   D
0  1.0  1.1  a NaN
1  2.0  2.2  b NaN
4  3.0  3.3  c NaN
6  4.0  4.4  d NaN
>>> df.dropna(how='any')
Empty DataFrame
Columns: [A, B, C, D]
Index: []
>>> df.dropna(how='any', axis=0)
Empty DataFrame
Columns: [A, B, C, D]
Index: []
>>> df.dropna(how='any', axis='rows')
Empty DataFrame
Columns: [A, B, C, D]
Index: []

5.13.12. Drop Columns with NA

>>> df.dropna(how='all', axis='columns')
     A    B    C
0  1.0  1.1    a
1  2.0  2.2    b
2  NaN  NaN  NaN
3  NaN  NaN  NaN
4  3.0  3.3    c
5  NaN  NaN  NaN
6  4.0  4.4    d
>>> df.dropna(how='all', axis=1)
     A    B    C
0  1.0  1.1    a
1  2.0  2.2    b
2  NaN  NaN  NaN
3  NaN  NaN  NaN
4  3.0  3.3    c
5  NaN  NaN  NaN
6  4.0  4.4    d
>>> df.dropna(how='all', axis=-1)
Traceback (most recent call last):
ValueError: No axis named -1 for object type DataFrame
>>> df.dropna(how='any', axis='columns')
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4, 5, 6]
>>> df.dropna(how='any', axis=1)
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4, 5, 6]
>>> df.dropna(how='any', axis=-1)
Traceback (most recent call last):
ValueError: No axis named -1 for object type DataFrame

5.13.13. Recap

>>> data = pd.DataFrame({
...     'A': [1,2,3,4,5,6,7,8,9],
...     'B': [1,2,np.nan,np.nan,5,6,7,8,9]
... })
>>>
>>> a = data['A'].isnull()
>>> b = data['B'].isnull()
>>> c = data['B'].isnull().any()
>>> d = data['B'].isnull().all()
>>>
>>> e = data.fillna(0)
>>>
>>> f = data.dropna()
>>> g = data.dropna(how='any')
>>> h = data.dropna(how='any', axis='rows')
>>> i = data.dropna(how='all', axis='columns')
>>>
>>> j = data.ffill()
>>> k = data.bfill()
>>> l = data.interpolate('linear')
>>> m = data.interpolate('quadratic')
>>> n = data.interpolate('polynomial', order=3)

5.13.14. References

5.13.15. Assignments

Code 5.97. Solution
"""
* Assignment: DataFrame NaN
* Complexity: easy
* Lines of code: 10 lines
* Time: 8 min

English:
    1. Read data from `DATA` as `df: pd.DataFrame`
    2. Skip first line with metadata
    3. Rename columns to:
        a. sepal_length
        b. sepal_width
        c. petal_length
        d. petal_width
        e. species
    4. Replace values in column species
        a. 0 -> 'setosa',
        b. 1 -> 'versicolor',
        c. 2 -> 'virginica'
    5. Select values in column 'petal_length' less than 4
    6. Set selected values to `NaN`
    7. Interpolate linearly all `NaN` values
    8. Drop rows with remaining `NaN` values
    9. Define `result` as first two rows
    10. Run doctests - all must succeed

Polish:
    1. Wczytaj dane z `DATA` jako `df: pd.DataFrame`
    2. Pomiń pierwszą linię z metadanymi
    3. Zmień nazwy kolumn na:
        a. sepal_length
        b. sepal_width
        c. petal_length
        d. petal_width
        e. species
    4. Podmień wartości w kolumnie species
        a. 0 -> 'setosa',
        b. 1 -> 'versicolor',
        c. 2 -> 'virginica'
    5. Wybierz wartości w kolumnie 'petal_length' mniejsze od 4
    6. Wybrane wartości ustaw na `NaN`
    7. Interpoluj liniowo wszystkie wartości `NaN`
    8. Usuń wiersze z pozostałymi wartościami `NaN`
    9. Zdefiniuj `result` jako dwa pierwsze wiersze
    10. Uruchom doctesty - wszystkie muszą się powieść

Tests:
    >>> import sys; sys.tracebacklimit = 0

    >>> pd.set_option('display.width', 500)
    >>> pd.set_option('display.max_columns', 10)
    >>> pd.set_option('display.max_rows', 10)

    >>> assert result is not Ellipsis, \
    'Assign result to variable: `result`'
    >>> assert type(result) is pd.DataFrame, \
    'Variable `result` must be a `pd.DataFrame` type'

    >>> result  # doctest: +NORMALIZE_WHITESPACE
       sepal_length  sepal_width  petal_length  petal_width     species
    1           5.9          3.0           5.1          1.8   virginica
    2           6.0          3.4           4.5          1.6  versicolor
"""

import pandas as pd
import numpy as np


DATA = 'https://python3.info/_static/iris-dirty.csv'
COLUMNS = [
    'sepal_length',
    'sepal_width',
    'petal_length',
    'petal_width',
    'species']

# type: pd.DataFrame
result = ...