5.13. DataFrame NA¶

pd.NA and np.nan Represents missing values
pd.NA will be used in future, but for now, there are function which does not support it yet
.isna()
.dropna(how='any|all', axis='rows|columns')
.any()
.all()
.fillna(value|dict)
.ffill()
.bfill()
.interpolate() - works only with np.nan (not pd.NA)

A floating-point 'not a number' (NaN) value. Equivalent to the output of float('nan'). Due to the requirements of the IEEE-754 standard, math.nan and float('nan') are not considered to equal to any other numeric value, including themselves. To check whether a number is a NaN, use the isnan() function to test for NaNs instead of is or ==. Example [1]:

Python Standard Library:

>>> import math
>>>
>>> math.nan == math.nan
False
>>> float('nan') == float('nan')
False
>>> math.isnan(math.nan)
True
>>> math.isnan(float('nan'))
True

5.13.1. SetUp¶

>>> import pandas as pd
>>> import numpy as np
>>>
>>>
>>> df = pd.DataFrame({
...     'A': [1, 2, np.nan, np.nan, 3, np.nan, 4],
...     'B': [1.1, 2.2, np.nan, np.nan, 3.3, np.nan, 4.4],
...     'C': ['a', 'b', np.nan, np.nan, 'c', np.nan, 'd'],
...     'D': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
... })
>>>
>>> df
     A    B    C   D
0  1.0  1.1    a NaN
1  2.0  2.2    b NaN
2  NaN  NaN  NaN NaN
3  NaN  NaN  NaN NaN
4  3.0  3.3    c NaN
5  NaN  NaN  NaN NaN
6  4.0  4.4    d NaN

5.13.2. Check if Any¶

>>> df.any()
A     True
B     True
C     True
D    False
dtype: bool

5.13.3. Check if All¶

>>> df.all()
A    True
B    True
C    True
D    True
dtype: bool

5.13.4. Check if Null¶

>>> df.isnull()
       A      B      C     D
False  False  False  True
False  False  False  True
 True   True   True  True
 True   True   True  True
False  False  False  True
 True   True   True  True
False  False  False  True

5.13.5. Check if NA¶

>>> df.isna()
       A      B      C     D
False  False  False  True
False  False  False  True
 True   True   True  True
 True   True   True  True
False  False  False  True
 True   True   True  True
False  False  False  True

5.13.6. Fill With Scalar Value¶

>>> df.fillna(0.0)
     A    B    C    D
1.0  1.1    a  0.0
2.0  2.2    b  0.0
0.0  0.0  0.0  0.0
0.0  0.0  0.0  0.0
3.0  3.3    c  0.0
0.0  0.0  0.0  0.0
4.0  4.4    d  0.0

5.13.7. Fill With Dict Values¶

>>> df.fillna({
...     'A': 99,
...     'B': 88,
...     'C': 77
... })
      A     B   C   D
0   1.0   1.1   a NaN
1   2.0   2.2   b NaN
2  99.0  88.0  77 NaN
3  99.0  88.0  77 NaN
4   3.0   3.3   c NaN
5  99.0  88.0  77 NaN
6   4.0   4.4   d NaN

5.13.8. Fill Forwards¶

ffill: propagate last valid observation forward:

>>> df.fillna(method='ffill')
     A    B  C   D
1.0  1.1  a NaN
2.0  2.2  b NaN
2.0  2.2  b NaN
2.0  2.2  b NaN
3.0  3.3  c NaN
3.0  3.3  c NaN
4.0  4.4  d NaN

5.13.9. Fill Backwards¶

bfill: use NEXT valid observation to fill gap:

>>> df.fillna(method='bfill')
     A    B  C   D
1.0  1.1  a NaN
2.0  2.2  b NaN
3.0  3.3  c NaN
3.0  3.3  c NaN
3.0  3.3  c NaN
4.0  4.4  d NaN
4.0  4.4  d NaN

5.13.10. Interpolate¶

>>> df.interpolate()
          A         B    C   D
1.000000  1.100000    a NaN
2.000000  2.200000    b NaN
2.333333  2.566667  NaN NaN
2.666667  2.933333  NaN NaN
3.000000  3.300000    c NaN
3.500000  3.850000  NaN NaN
4.000000  4.400000    d NaN

Table 5.12. Interpolation techniques¶
Method	Description
`linear`	Ignore the index and treat the values as equally spaced. This is the only method supported on MultiIndexes
`time`	Works on daily and higher resolution data to interpolate given length of interval
`index`, `values`	use the actual numerical values of the index.
`pad`	Fill in NA using existing values
`nearest`, `zero`, `slinear`, `quadratic`, `cubic`, `spline`, `barycentric`, `polynomial`	Passed to `scipy.interpolate.interp1d`. These methods use the numerical values of the index. Both `polynomial` and `spline` require that you also specify an `order` (int), e.g. `df.interpolate(method='polynomial', order=5)`
`krogh`, `piecewise_polynomial`, `spline`, `pchip`, `akima`	Wrappers around the SciPy interpolation methods of similar names
`from_derivatives`	Refers to `scipy.interpolate.BPoly.from_derivatives` which replaces `piecewise_polynomial` interpolation method in scipy 0.18.

5.13.11. Drop Rows with NA¶

>>> df.dropna(how='all')
     A    B  C   D
0  1.0  1.1  a NaN
1  2.0  2.2  b NaN
4  3.0  3.3  c NaN
6  4.0  4.4  d NaN

>>> df.dropna(how='all', axis='rows')
     A    B  C   D
0  1.0  1.1  a NaN
1  2.0  2.2  b NaN
4  3.0  3.3  c NaN
6  4.0  4.4  d NaN

>>> df.dropna(how='all', axis=0)
     A    B  C   D
0  1.0  1.1  a NaN
1  2.0  2.2  b NaN
4  3.0  3.3  c NaN
6  4.0  4.4  d NaN

>>> df.dropna(how='any')
Empty DataFrame
Columns: [A, B, C, D]
Index: []

>>> df.dropna(how='any', axis=0)
Empty DataFrame
Columns: [A, B, C, D]
Index: []

>>> df.dropna(how='any', axis='rows')
Empty DataFrame
Columns: [A, B, C, D]
Index: []

5.13.12. Drop Columns with NA¶

>>> df.dropna(how='all', axis='columns')
     A    B    C
1.0  1.1    a
2.0  2.2    b
NaN  NaN  NaN
NaN  NaN  NaN
3.0  3.3    c
NaN  NaN  NaN
4.0  4.4    d

>>> df.dropna(how='all', axis=1)
     A    B    C
1.0  1.1    a
2.0  2.2    b
NaN  NaN  NaN
NaN  NaN  NaN
3.0  3.3    c
NaN  NaN  NaN
4.0  4.4    d

>>> df.dropna(how='all', axis=-1)
Traceback (most recent call last):
ValueError: No axis named -1 for object type DataFrame

>>> df.dropna(how='any', axis='columns')
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4, 5, 6]

>>> df.dropna(how='any', axis=1)
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4, 5, 6]

>>> df.dropna(how='any', axis=-1)
Traceback (most recent call last):
ValueError: No axis named -1 for object type DataFrame

5.13.13. Recap¶

>>> data = pd.DataFrame({
...     'A': [1,2,3,4,5,6,7,8,9],
...     'B': [1,2,np.nan,np.nan,5,6,7,8,9]
... })
>>>
>>> a = data['A'].isnull()
>>> b = data['B'].isnull()
>>> c = data['B'].isnull().any()
>>> d = data['B'].isnull().all()
>>>
>>> e = data.fillna(0)
>>>
>>> f = data.dropna()
>>> g = data.dropna(how='any')
>>> h = data.dropna(how='any', axis='rows')
>>> i = data.dropna(how='all', axis='columns')
>>>
>>> j = data.ffill()
>>> k = data.bfill()
>>> l = data.interpolate('linear')
>>> m = data.interpolate('quadratic')
>>> n = data.interpolate('polynomial', order=3)

5.13.14. References¶

5.13.15. Assignments¶

Code 5.97. Solution¶

"""
* Assignment: DataFrame NaN
* Complexity: easy
* Lines of code: 10 lines
* Time: 8 min

English:
    1. Read data from `DATA` as `df: pd.DataFrame`
    2. Skip first line with metadata
    3. Rename columns to:
        a. sepal_length
        b. sepal_width
        c. petal_length
        d. petal_width
        e. species
    4. Replace values in column species
        a. 0 -> 'setosa',
        b. 1 -> 'versicolor',
        c. 2 -> 'virginica'
    5. Select values in column 'petal_length' less than 4
    6. Set selected values to `NaN`
    7. Interpolate linearly all `NaN` values
    8. Drop rows with remaining `NaN` values
    9. Define `result` as first two rows
    10. Run doctests - all must succeed

Polish:
    1. Wczytaj dane z `DATA` jako `df: pd.DataFrame`
    2. Pomiń pierwszą linię z metadanymi
    3. Zmień nazwy kolumn na:
        a. sepal_length
        b. sepal_width
        c. petal_length
        d. petal_width
        e. species
    4. Podmień wartości w kolumnie species
        a. 0 -> 'setosa',
        b. 1 -> 'versicolor',
        c. 2 -> 'virginica'
    5. Wybierz wartości w kolumnie 'petal_length' mniejsze od 4
    6. Wybrane wartości ustaw na `NaN`
    7. Interpoluj liniowo wszystkie wartości `NaN`
    8. Usuń wiersze z pozostałymi wartościami `NaN`
    9. Zdefiniuj `result` jako dwa pierwsze wiersze
    10. Uruchom doctesty - wszystkie muszą się powieść

Tests:
    >>> import sys; sys.tracebacklimit = 0

    >>> pd.set_option('display.width', 500)
    >>> pd.set_option('display.max_columns', 10)
    >>> pd.set_option('display.max_rows', 10)

    >>> assert result is not Ellipsis, \
    'Assign result to variable: `result`'
    >>> assert type(result) is pd.DataFrame, \
    'Variable `result` must be a `pd.DataFrame` type'

    >>> result  # doctest: +NORMALIZE_WHITESPACE
       sepal_length  sepal_width  petal_length  petal_width     species
    1           5.9          3.0           5.1          1.8   virginica
    2           6.0          3.4           4.5          1.6  versicolor
"""

import pandas as pd
import numpy as np


DATA = 'https://python3.info/_static/iris-dirty.csv'
COLUMNS = [
    'sepal_length',
    'sepal_width',
    'petal_length',
    'petal_width',
    'species']

# type: pd.DataFrame
result = ...