2.2. Pandas Read CSV¶
File paths works also with URLs
2.2.1. SetUp¶
>>> import pandas as pd
>>>
>>> pd.set_option('display.width', 500)
>>> pd.set_option('display.max_columns', 10)
>>> pd.set_option('display.max_rows', 10)
2.2.2. Header¶
>>> DATA = 'https://python3.info/_static/iris-clean.csv'
>>>
>>> header = pd.read_csv(DATA, nrows=0).columns
Label Encoder:
>>> DATA = 'https://python3.info/_static/iris-clean.csv'
>>>
>>> header = pd.read_csv(DATA, nrows=0)
>>> nrows, ncols, *class_labels = header.columns
>>> label_encoder = dict(enumerate(class_labels))
2.2.3. Content¶
>>> DATA = 'https://python3.info/_static/iris-clean.csv'
>>>
>>> df = pd.read_csv(DATA)
>>> df.head(3)
sepal_length sepal_width petal_length petal_width species
0 5.4 3.9 1.3 0.4 setosa
1 5.9 3.0 5.1 1.8 virginica
2 6.0 3.4 4.5 1.6 versicolor
2.2.4. Rename Columns¶
>>> DATA = 'https://python3.info/_static/iris-dirty.csv'
>>>
>>> COLUMNS = ['sepal_length', 'sepal_width',
... 'petal_length', 'petal_width', 'species']
>>>
>>> df = pd.read_csv(DATA)
>>> df.head(3)
150 4 setosa versicolor virginica
0 5.4 3.9 1.3 0.4 0
1 5.9 3.0 5.1 1.8 2
2 6.0 3.4 4.5 1.6 1
>>>
>>> df = pd.read_csv(DATA, skiprows=1, names=COLUMNS)
>>> df.head(3)
sepal_length sepal_width petal_length petal_width species
0 5.4 3.9 1.3 0.4 0
1 5.9 3.0 5.1 1.8 2
2 6.0 3.4 4.5 1.6 1
>>>
>>> df['species'].replace({
... 0: 'setosa',
... 1: 'versicolor',
... 2: 'virginica',
... }, inplace=True)
>>>
>>> df.head(n=3)
sepal_length sepal_width petal_length petal_width species
0 5.4 3.9 1.3 0.4 setosa
1 5.9 3.0 5.1 1.8 virginica
2 6.0 3.4 4.5 1.6 versicolor
2.2.5. Compressed¶
If the extension is
.gz
,.bz2
,.zip
, and.xz
, the corresponding compression method is automatically selected
>>> df = pd.read_csv('sample_file.zip', compression='zip')
>>> df = pd.read_csv('sample_file.gz', compression='infer')
2.2.6. Use Case - 0x01¶
>>> DATA = 'https://python3.info/_static/iris-dirty.csv'
>>>
>>> COLUMNS = ['sepal_length', 'sepal_width',
... 'petal_length', 'petal_width', 'species']
>>> header = pd.read_csv(DATA, nrows=0)
>>> nrows, ncols, *class_labels = header.columns
>>> label_encoder = dict(enumerate(class_labels))
>>>
>>> label_encoder
{0: 'setosa', 1: 'versicolor', 2: 'virginica'}
>>> df = pd.read_csv(DATA, names=COLUMNS, skiprows=1)
>>> df['species'].replace(label_encoder, inplace=True)
>>> df.head(n=5)
sepal_length sepal_width petal_length petal_width species
0 5.4 3.9 1.3 0.4 setosa
1 5.9 3.0 5.1 1.8 virginica
2 6.0 3.4 4.5 1.6 versicolor
3 7.3 2.9 6.3 1.8 virginica
4 5.6 2.5 3.9 1.1 versicolor
2.2.7. Use Case - 0x02¶
>>> DATA = 'https://python3.info/_static/martian-en.csv'
>>> pd.read_csv(DATA)
firstname lastname born gender ssn email phone
0 Mark Watney October 12 1994 male 94101212345 mwatney@nasa.gov +1 (234) 555-0000
1 Melissa Lewis July 15 1995 female 95071512345 mlewis@nasa.gov +1 (234) 555-0001
2 Rick Martinez January 21 1996 male 96012112345 rmartinez@nasa.gov +1 (234) 555-0010
3 Alex Vogel November 15 1994 male 94111512345 avogel@esa.int +49 (234) 555-0011
4 Beth Johanssen May 9 2006 female 6250912345 bjohanssen@nasa.gov +1 (234) 555-0100
5 Chris Beck August 2 1999 male 99080212345 cbeck@nasa.gov +1 (234) 555-0101
>>> pd.read_csv(DATA, parse_dates=['born'])
firstname lastname born gender ssn email phone
0 Mark Watney 1994-10-12 male 94101212345 mwatney@nasa.gov +1 (234) 555-0000
1 Melissa Lewis 1995-07-15 female 95071512345 mlewis@nasa.gov +1 (234) 555-0001
2 Rick Martinez 1996-01-21 male 96012112345 rmartinez@nasa.gov +1 (234) 555-0010
3 Alex Vogel 1994-11-15 male 94111512345 avogel@esa.int +49 (234) 555-0011
4 Beth Johanssen 2006-05-09 female 6250912345 bjohanssen@nasa.gov +1 (234) 555-0100
5 Chris Beck 1999-08-02 male 99080212345 cbeck@nasa.gov +1 (234) 555-0101
2.2.8. Assignments¶
"""
* Assignment: Pandas Read CSV Dates
* Complexity: easy
* Lines of code: 1 lines
* Time: 3 min
English:
1. Read data from `DATA` to `result: pd.DataFrame`
2. Parse dates in "born" column
3. Run doctests - all must succeed
Polish:
1. Wczytaj dane z `DATA` do `result: pd.DataFrame`
2. Sparsuj daty w kolumnie "born"
3. Uruchom doctesty - wszystkie muszą się powieść
Hints:
* `parse_dates`
Tests:
>>> import sys; sys.tracebacklimit = 0
>>> assert result is not Ellipsis, \
'Assign result to variable: `result`'
>>> assert type(result) is pd.DataFrame, \
'Variable `result` must be a `pd.DataFrame` type'
>>> result[['firstname', 'lastname', 'born']]
firstname lastname born
0 Mark Watney 1994-10-12
1 Melissa Lewis 1995-07-15
2 Rick Martinez 1996-01-21
3 Alex Vogel 1994-11-15
4 Beth Johanssen 2006-05-09
5 Chris Beck 1999-08-02
"""
import pandas as pd
DATA = 'https://python3.info/_static/martian-en.csv'
# Read DATA and parse dates in "born" column
# type: pd.DataFrame
result = ...
"""
* Assignment: Pandas Read CSV Replace
* Complexity: easy
* Lines of code: 5 lines
* Time: 8 min
English:
1. Read data from `DATA` to `result: pd.DataFrame`
2. Use provided column names in `COLUMNS`
3. Read labels from the first row
4. Replace data in `label` column with values extracted above
5. Define `result: pd.DataFrame` with 25 first rows
6. Run doctests - all must succeed
Polish:
1. Wczytaj dane z `DATA` do `result: pd.DataFrame`
2. Użyj podanych w `COLUMNS` nazw kolumn
3. Wczytaj nazwy labeli z pierwszego wiersza
4. Podmień dane w kolumnie `label` na wartości wyciągnięte powyżej
5. Zdefiniuj `result: pd.DataFrame` z 25 pierwszymi wierszami
6. Uruchom doctesty - wszystkie muszą się powieść
Hints:
* `class_labels = pd.read_csv(DATA, nrows=0).columns[2:]`
* `label_encoder = dict(enumerate(class_labels))`
* `pd.Series.replace()`
Tests:
>>> import sys; sys.tracebacklimit = 0
>>> assert result is not Ellipsis, \
'Assign result to variable: `result`'
>>> assert type(result) is pd.DataFrame, \
'Variable `result` must be a `pd.DataFrame` type'
>>> assert len(result) == 25, \
'Select only 25 first rows'
>>> result.loc[[0,1,2,3,4,5], ['mean radius', 'mean texture', 'label']]
mean radius mean texture label
0 17.99 10.38 malignant
1 20.57 17.77 malignant
2 19.69 21.25 malignant
3 11.42 20.38 malignant
4 20.29 14.34 malignant
5 12.45 15.70 malignant
>>> result['label'].value_counts()
label
malignant 22
benign 3
Name: count, dtype: int64
"""
import pandas as pd
DATA = 'https://python3.info/_static/breast-cancer.csv'
COLUMNS = ['mean radius', 'mean texture', 'mean perimeter', 'mean area',
'mean smoothness', 'mean compactness', 'mean concavity',
'mean concave points', 'mean symmetry', 'mean fractal dimension',
'radius error', 'texture error', 'perimeter error', 'area error',
'smoothness error', 'compactness error', 'concavity error',
'concave points error', 'symmetry error',
'fractal dimension error', 'worst radius', 'worst texture',
'worst perimeter', 'worst area', 'worst smoothness',
'worst compactness', 'worst concavity', 'worst concave points',
'worst symmetry', 'worst fractal dimension', 'label']
# Read DATA, substitute column names, and labels, select 25 rows
# type: pd.DataFrame
result = ...