4.13. DataFrame NA¶
pd.NA
andnp.nan
Represents missing valuespd.NA
will be used in future, but for now, there are function which does not support it yet.isna()
.dropna(how='any|all', axis='rows|columns')
.any()
.all()
.fillna(value|dict)
.ffill()
.bfill()
.interpolate()
- works only withnp.nan
(notpd.NA
)
A floating-point 'not a number' (NaN) value. Equivalent to the output of
float('nan')
. Due to the requirements of the IEEE-754 standard,
math.nan
and float('nan')
are not considered to equal to any other
numeric value, including themselves. To check whether a number is a NaN
,
use the isnan()
function to test for NaNs
instead of is
or
==
. Example 1:
Python Standard Library:
>>> import math
>>>
>>> math.nan == math.nan
False
>>> float('nan') == float('nan')
False
>>> math.isnan(math.nan)
True
>>> math.isnan(float('nan'))
True
4.13.1. SetUp¶
>>> import pandas as pd
>>> import numpy as np
>>>
>>>
>>> df = pd.DataFrame({
... 'A': [1, 2, np.nan, np.nan, 3, np.nan, 4],
... 'B': [1.1, 2.2, np.nan, np.nan, 3.3, np.nan, 4.4],
... 'C': ['a', 'b', np.nan, np.nan, 'c', np.nan, 'd'],
... 'D': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
... })
>>>
>>> df
A B C D
0 1.0 1.1 a NaN
1 2.0 2.2 b NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 3.0 3.3 c NaN
5 NaN NaN NaN NaN
6 4.0 4.4 d NaN
4.13.2. Check if Any¶
>>> df.any()
A True
B True
C True
D False
dtype: bool
4.13.3. Check if All¶
>>> df.all()
A True
B True
C True
D True
dtype: bool
4.13.4. Check if Null¶
>>> df.isnull()
A B C D
0 False False False True
1 False False False True
2 True True True True
3 True True True True
4 False False False True
5 True True True True
6 False False False True
4.13.5. Check if NA¶
>>> df.isna()
A B C D
0 False False False True
1 False False False True
2 True True True True
3 True True True True
4 False False False True
5 True True True True
6 False False False True
4.13.6. Fill With Scalar Value¶
>>> df.fillna(0.0)
A B C D
0 1.0 1.1 a 0.0
1 2.0 2.2 b 0.0
2 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0
4 3.0 3.3 c 0.0
5 0.0 0.0 0.0 0.0
6 4.0 4.4 d 0.0
4.13.7. Fill With Dict Values¶
>>> df.fillna({
... 'A': 99,
... 'B': 88,
... 'C': 77
... })
A B C D
0 1.0 1.1 a NaN
1 2.0 2.2 b NaN
2 99.0 88.0 77 NaN
3 99.0 88.0 77 NaN
4 3.0 3.3 c NaN
5 99.0 88.0 77 NaN
6 4.0 4.4 d NaN
4.13.8. Fill Forwards¶
ffill
: propagate last valid observation forward:
>>> df.fillna(method='ffill')
A B C D
0 1.0 1.1 a NaN
1 2.0 2.2 b NaN
2 2.0 2.2 b NaN
3 2.0 2.2 b NaN
4 3.0 3.3 c NaN
5 3.0 3.3 c NaN
6 4.0 4.4 d NaN
4.13.9. Fill Backwards¶
bfill
: use NEXT valid observation to fill gap:
>>> df.fillna(method='bfill')
A B C D
0 1.0 1.1 a NaN
1 2.0 2.2 b NaN
2 3.0 3.3 c NaN
3 3.0 3.3 c NaN
4 3.0 3.3 c NaN
5 4.0 4.4 d NaN
6 4.0 4.4 d NaN
4.13.10. Interpolate¶
>>> df.interpolate()
A B C D
0 1.000000 1.100000 a NaN
1 2.000000 2.200000 b NaN
2 2.333333 2.566667 NaN NaN
3 2.666667 2.933333 NaN NaN
4 3.000000 3.300000 c NaN
5 3.500000 3.850000 NaN NaN
6 4.000000 4.400000 d NaN
Method |
Description |
---|---|
|
Ignore the index and treat the values as equally spaced. This is the only method supported on MultiIndexes |
|
Works on daily and higher resolution data to interpolate given length of interval |
|
use the actual numerical values of the index. |
|
Fill in NA using existing values |
|
Passed to |
|
Wrappers around the SciPy interpolation methods of similar names |
|
Refers to |
4.13.11. Drop Rows with NA¶
>>> df.dropna(how='all')
A B C D
0 1.0 1.1 a NaN
1 2.0 2.2 b NaN
4 3.0 3.3 c NaN
6 4.0 4.4 d NaN
>>> df.dropna(how='all', axis='rows')
A B C D
0 1.0 1.1 a NaN
1 2.0 2.2 b NaN
4 3.0 3.3 c NaN
6 4.0 4.4 d NaN
>>> df.dropna(how='all', axis=0)
A B C D
0 1.0 1.1 a NaN
1 2.0 2.2 b NaN
4 3.0 3.3 c NaN
6 4.0 4.4 d NaN
>>> df.dropna(how='any')
Empty DataFrame
Columns: [A, B, C, D]
Index: []
>>> df.dropna(how='any', axis=0)
Empty DataFrame
Columns: [A, B, C, D]
Index: []
>>> df.dropna(how='any', axis='rows')
Empty DataFrame
Columns: [A, B, C, D]
Index: []
4.13.12. Drop Columns with NA¶
>>> df.dropna(how='all', axis='columns')
A B C
0 1.0 1.1 a
1 2.0 2.2 b
2 NaN NaN NaN
3 NaN NaN NaN
4 3.0 3.3 c
5 NaN NaN NaN
6 4.0 4.4 d
>>> df.dropna(how='all', axis=1)
A B C
0 1.0 1.1 a
1 2.0 2.2 b
2 NaN NaN NaN
3 NaN NaN NaN
4 3.0 3.3 c
5 NaN NaN NaN
6 4.0 4.4 d
>>> df.dropna(how='all', axis=-1)
Traceback (most recent call last):
ValueError: No axis named -1 for object type DataFrame
>>> df.dropna(how='any', axis='columns')
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4, 5, 6]
>>> df.dropna(how='any', axis=1)
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4, 5, 6]
>>> df.dropna(how='any', axis=-1)
Traceback (most recent call last):
ValueError: No axis named -1 for object type DataFrame
4.13.13. Recap¶
>>> data = pd.DataFrame({
... 'A': [1,2,3,4,5,6,7,8,9],
... 'B': [1,2,np.nan,np.nan,5,6,7,8,9]
... })
>>>
>>> a = data['A'].isnull()
>>> b = data['B'].isnull()
>>> c = data['B'].isnull().any()
>>> d = data['B'].isnull().all()
>>>
>>> e = data.fillna(0)
>>>
>>> f = data.dropna()
>>> g = data.dropna(how='any')
>>> h = data.dropna(how='any', axis='rows')
>>> i = data.dropna(how='all', axis='columns')
>>>
>>> j = data.ffill()
>>> k = data.bfill()
>>> l = data.interpolate('linear')
>>> m = data.interpolate('quadratic')
>>> n = data.interpolate('polynomial', order=3)
4.13.14. References¶
4.13.15. Assignments¶
"""
* Assignment: DataFrame NaN
* Complexity: easy
* Lines of code: 10 lines
* Time: 8 min
English:
TODO: English Translation
X. Run doctests - all must succeed
Polish:
1. Wczytaj dane z `DATA` jako `df: pd.DataFrame`
2. Pomiń pierwszą linię z metadanymi
3. Zmień nazwy kolumn na:
a. Sepal length
b. Sepal width
c. Petal length
d. Petal width
e. Species
4. Podmień wartości w kolumnie species
a. 0 -> 'setosa',
b. 1 -> 'versicolor',
c. 2 -> 'virginica'
5. Wybierz wartości w kolumnie 'Petal length' mniejsze od 4
6. Wybrane wartości ustaw na `NaN`
7. Interpoluj liniowo wszystkie wartości `NaN`
8. Usuń wiersze z pozostałymi wartościami `NaN`
9. Zdefiniuj `result` jako dwa pierwsze wiersze
10. Uruchom doctesty - wszystkie muszą się powieść
Tests:
>>> import sys; sys.tracebacklimit = 0
>>> pd.set_option('display.width', 500)
>>> pd.set_option('display.max_columns', 10)
>>> pd.set_option('display.max_rows', 10)
>>> assert result is not Ellipsis, \
'Assign result to variable: `result`'
>>> assert type(result) is pd.DataFrame, \
'Variable `result` must be a `pd.DataFrame` type'
>>> result # doctest: +NORMALIZE_WHITESPACE
Sepal length Sepal width Petal length Petal width Species
1 5.9 3.0 5.1 1.8 virginica
2 6.0 3.4 4.5 1.6 versicolor
"""
import pandas as pd
import numpy as np
DATA = 'https://python.astrotech.io/_static/iris-dirty.csv'
COLUMNS = [
'Sepal length',
'Sepal width',
'Petal length',
'Petal width',
'Species']
# type: pd.DataFrame
result = ...