4.16. DataFrame Mapping¶
import pandas as pd
import numpy as np
np.random.seed(0)
df = pd.DataFrame(
    columns=['Morning', 'Noon', 'Evening', 'Midnight'],
    index=pd.date_range('1999-12-30', periods=7),
    data=np.random.randn(7, 4))
df
#              Morning      Noon   Evening  Midnight
# 1999-12-30  1.764052  0.400157  0.978738  2.240893
# 1999-12-31  1.867558 -0.977278  0.950088 -0.151357
# 2000-01-01 -0.103219  0.410599  0.144044  1.454274
# 2000-01-02  0.761038  0.121675  0.443863  0.333674
# 2000-01-03  1.494079 -0.205158  0.313068 -0.854096
# 2000-01-04 -2.552990  0.653619  0.864436 -0.742165
# 2000-01-05  2.269755 -1.454366  0.045759 -0.187184
4.16.1. Map¶
.map()
works element-wise on a Series
df['Morning'].map(lambda value: round(value, 2))
# 1999-12-30 1.76
# 1999-12-31 1.87
# 2000-01-01 -0.10
# 2000-01-02 0.76
# 2000-01-03 1.49
# 2000-01-04 -2.55
# 2000-01-05 2.27
# Freq: D, Name: Morning, dtype: float64
df['Morning'].map(int)
# 1999-12-30 1
# 1999-12-31 1
# 2000-01-01 0
# 2000-01-02 0
# 2000-01-03 1
# 2000-01-04 -2
# 2000-01-05 2
# Freq: D, Name: Morning, dtype: int64
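.map() also accepts a dict or a Series as the mapping: elements are looked up by key, and anything not found in the mapping becomes NaN (see the footnotes in the summary below). A minimal sketch with a throwaway Series:
s = pd.Series([1, 2, 3, 1, 99])
s.map({1: 'a', 2: 'b', 3: 'c'})
# 0      a
# 1      b
# 2      c
# 3      a
# 4    NaN
# dtype: object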
4.16.2. Apply¶
.apply()
works element-wise on a Series and on a row / column basis of a DataFrame
df['Morning'].apply(int)
# 1999-12-30 1
# 1999-12-31 1
# 2000-01-01 0
# 2000-01-02 0
# 2000-01-03 1
# 2000-01-04 -2
# 2000-01-05 2
# Freq: D, Name: Morning, dtype: int64
df['Morning'].apply(lambda value: round(value, 2))
# 1999-12-30 1.76
# 1999-12-31 1.87
# 2000-01-01 -0.10
# 2000-01-02 0.76
# 2000-01-03 1.49
# 2000-01-04 -2.55
# 2000-01-05 2.27
# Freq: D, Name: Morning, dtype: float64
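The calls above use .apply() on a single column, i.e. a Series. A short sketch of the row / column behaviour, continuing with df and np from the top of this section (results are described in comments rather than printed, to keep the sketch independent of the random data):
df.apply(np.mean)
# aggregates each column to one value -> Series indexed by column name
df.apply(np.mean, axis=1)
# aggregates each row to one value -> Series indexed by date
df.apply(lambda column: column.max() - column.min())
# the function receives a whole column (a Series) at a time; here: the value range of each column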
4.16.3. Applymap¶
.applymap()
works element-wise on a DataFrame
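A minimal sketch, continuing with the df defined above (note that recent pandas releases, 2.1 and later, deprecate .applymap() in favour of DataFrame.map()):
df.applymap(lambda value: round(value, 2))
# rounds every cell of the DataFrame to two decimal places
df.applymap(int)
# truncates every cell to an integer (same effect as the per-column .map(int) above, for all columns at once)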
4.16.4. Summary¶
Series.map
works element-wise on a Series
operates on one element at a time
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.map.html
Series.apply
works element-wise on a Series
operates on one element at a time
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.apply.html
DataFrame.apply
works on a row / column basis of a DataFrame
operates on entire rows or columns at a time
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
DataFrame.applymap
works element-wise on a DataFrame
operates on one element at a time
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.applymap.html
First major difference: DEFINITION
map
is defined on Series ONLY
applymap
is defined on DataFrames ONLY
apply
is defined on BOTH
Second major difference: ARGUMENT TYPE
map
accepts dicts, Series, or callables
applymap and apply
accept callables only
Third major difference: BEHAVIOR
map
is elementwise for Series
applymap
is elementwise for DataFrames
apply
also works elementwise but is suited to more complex operations and aggregation. The behaviour and return value depend on the function.
Fourth major difference (the most important one): USE CASE
map
is meant for mapping values from one domain to another, so is optimised for performance (e.g., df['A'].map({1: 'a', 2: 'b', 3: 'c'}))
applymap
is good for elementwise transformations across multiple rows/columns (e.g., df[['A', 'B', 'C']].applymap(str.strip))
apply
is for applying any function that cannot be vectorised (e.g., df['sentences'].apply(nltk.sent_tokenize))
Footnotes:
map, when passed a dictionary or a Series, will map elements based on the keys in that dictionary/Series. Missing values will be recorded as NaN in the output.
applymap, in more recent versions, has been optimised for some operations. You will find applymap slightly faster than apply in some cases. My suggestion is to test them both and use whatever works better.
map is optimised for elementwise mappings and transformations. Operations that involve dictionaries or Series will enable pandas to use faster code paths for better performance.
Series.apply returns a scalar for aggregating operations and a Series otherwise. Similarly for DataFrame.apply. Note that apply also has fastpaths when called with certain NumPy functions such as mean, sum, etc.

4.16.5. Cleaning User Input¶
It is often said that 80% of the work in machine learning and data science is cleaning data
4.16.6. Is This the Same Address?¶
This is a dump of distinct records of a single address
Which of the records below is the true address?
'ul. Jana III Sobieskiego'
'ul Jana III Sobieskiego'
'ul.Jana III Sobieskiego'
'ulicaJana III Sobieskiego'
'Ul. Jana III Sobieskiego'
'UL. Jana III Sobieskiego'
'ulica Jana III Sobieskiego'
'Ulica. Jana III Sobieskiego'
'os. Jana III Sobieskiego'
'Jana 3 Sobieskiego'
'Jana 3ego Sobieskiego'
'Jana III Sobieskiego'
'Jana Iii Sobieskiego'
'Jana IIi Sobieskiego'
'Jana lll Sobieskiego' # three small letters 'L'
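A minimal sketch of one normalisation step; the addresses Series below is made up, and the regex covers only the Roman-numeral spellings listed above:
import pandas as pd
addresses = pd.Series([
    'Jana III Sobieskiego',
    'Jana Iii Sobieskiego',
    'Jana 3 Sobieskiego',
    'Jana lll Sobieskiego',   # three small letters 'L'
])
addresses.str.replace(r'\b(3ego|III|IIi|Iii|lll|3)\b', 'III', regex=True)
# all four rows become 'Jana III Sobieskiego'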
4.16.7. Spelling and Abbreviations¶
'ul'
'ul.'
'Ul.'
'UL.'
'ulica'
'Ulica'
'os'
'os.'
'Os.'
'osiedle'
'oś'
'oś.'
'Oś.'
'ośedle'
'pl'
'pl.'
'Pl.'
'plac'
'al'
'al.'
'Al.'
'aleja'
'aleia'
'alei'
'aleii'
'aleji'
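One possible approach (a sketch, not the only solution) is to map every spelling variant above to one canonical prefix with Series.replace(regex=True); the sample Series below is made up:
import pandas as pd
STREET_PREFIX = {
    r'(?i)^(ulica|ul)\.?\s*': 'ul. ',
    r'(?i)^(osiedle|ośedle|oś|os)\.?\s*': 'os. ',
    r'(?i)^(plac|pl)\.?\s*': 'pl. ',
    r'(?i)^(aleja|aleia|aleii|aleji|alei|al)\.?\s*': 'al. ',
}
s = pd.Series(['Ulica. Jana III Sobieskiego',
               'os Jana III Sobieskiego',
               'aleja Jana III Sobieskiego'])
s.replace(STREET_PREFIX, regex=True)
# 0    ul. Jana III Sobieskiego
# 1    os. Jana III Sobieskiego
# 2    al. Jana III Sobieskiego
# dtype: object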
4.16.8. House and Apartment Number¶
'1/2'
'1 / 2'
'1/ 2'
'1 /2'
'3/5/7'
'1 m. 2'
'1 m 2'
'1 apt 2'
'1 apt. 2'
'180f/8f'
'180f/8'
'180/8f'
'13d bud. A'
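A small sketch for one of the problems above: normalising whitespace around the slash (the other formats, e.g. 'm.', 'apt.' or 'bud. A', would need additional rules):
import pandas as pd
numbers = pd.Series(['1/2', '1 / 2', '1/ 2', '1 /2'])
numbers.str.replace(r'\s*/\s*', '/', regex=True)
# 0    1/2
# 1    1/2
# 2    1/2
# 3    1/2
# dtype: object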
4.16.9. Phone Numbers¶
+48 (12) 355 5678
+48 123 555 678
123 555 678
+48 12 355 5678
+48 123-555-678
+48 123 555 6789
+1 (123) 555-6789
+1 (123).555.6789
+1 800-python
+48123555678
+48 123 555 678 wew. 1337
+48 123555678,1
+48 123555678,1,2,3
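A minimal sketch of one cleaning step: keep only digits and the leading plus sign (extensions such as 'wew. 1337' and vanity numbers such as '800-python' are deliberately out of scope here):
import pandas as pd
phones = pd.Series(['+48 (12) 355 5678',
                    '+48 123-555-678',
                    '+48123555678'])
phones.str.replace(r'[^0-9+]', '', regex=True)
# 0    +48123555678
# 1    +48123555678
# 2    +48123555678
# dtype: object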
4.16.10. Conversion¶
LETTERS_EN = 'abcdefghijklmnopqrstuvwxyz'
LETTERS_PL = 'aąbcćdeęfghijklłmnńoóprsśtuwyzżź'
LETTERS_PLEN = {'ą': 'a', 'ć': 'c', 'ę': 'e',
'ł': 'l', 'ń': 'n', 'ó': 'o',
'ś': 's', 'ż': 'z', 'ź': 'z'}
MONTHS_EN = ['January', 'February', 'March', 'April',
'May', 'June', 'July', 'August', 'September',
'October', 'November', 'December']
MONTHS_PL = ['styczeń', 'luty', 'marzec', 'kwiecień',
'maj', 'czerwiec', 'lipiec', 'sierpień',
'wrzesień', 'październik', 'listopad', 'grudzień']
MONTHS_PLEN = {'styczeń': 'January',
'luty': 'February',
'marzec': 'March',
'kwiecień': 'April',
'maj': 'May',
'czerwiec': 'June',
'lipiec': 'July',
'sierpień': 'August',
'wrzesień': 'September',
'październik': 'October',
'listopad': 'November',
'grudzień': 'December'}
MONTHS_ENPL = {'January': 'styczeń',
'February': 'luty',
'March': 'marzec',
'April': 'kwiecień',
'May': 'maj',
'June': 'czerwiec',
'July': 'lipiec',
'August': 'sierpień',
'September': 'wrzesień',
'October': 'październik',
'November': 'listopad',
'December': 'grudzień'}
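Two short sketches of how the tables above can be used; the sample Series are made up, and note that LETTERS_PLEN covers lowercase letters only:
import pandas as pd
names = pd.Series(['październik', 'wrzesień', 'łódź'])
names.str.translate(str.maketrans(LETTERS_PLEN))
# 0    pazdziernik
# 1       wrzesien
# 2           lodz
# dtype: object
dates = pd.Series(['30 grudzień 1999', '1 styczeń 2000'])
dates.replace(MONTHS_PLEN, regex=True)
# 0    30 December 1999
# 1      1 January 2000
# dtype: object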
4.16.11. Assignments¶
"""
* Assignment: DataFrame Mapping Split
* Complexity: easy
* Lines of code: 5 lines
* Time: 13 min
English:
1. Use data from "Given" section (see below)
2. Read data from `DATA` as `df: pd.DataFrame`
3. Parse data in `date` column as `datetime` object
4. Split column `date` into two separate columns: date and time
Polish:
1. Użyj danych z sekcji "Given" (patrz poniżej)
2. Wczytaj dane z `DATA` jako `df: pd.DataFrame`
3. Sparsuj dane w kolumnie `date` jako obiekty `datetime`
4. Podziel kolumnę z `date` na dwie osobne: datę i czas
Hints:
* `pd.Series.dt.date`
* `pd.Series.dt.time`
Tests:
>>> type(result) is pd.DataFrame
True
>>> pd.set_option('display.width', 500)
>>> pd.set_option('display.max_columns', 10)
>>> pd.set_option('display.max_rows', 10)
>>> result # doctest: +NORMALIZE_WHITESPACE
id period datetime network item type duration date time
0 0 1999-11 1999-10-15 06:58:00 T-Mobile data data 34.5 1999-10-15 06:58:00
1 1 1999-11 1999-10-15 06:58:00 Orange call mobile 13.0 1999-10-15 06:58:00
2 2 1999-11 1999-10-15 14:46:00 Play call mobile 23.0 1999-10-15 14:46:00
3 3 1999-11 1999-10-15 14:48:00 Plus call mobile 4.0 1999-10-15 14:48:00
4 4 1999-11 1999-10-15 17:27:00 T-Mobile call mobile 4.0 1999-10-15 17:27:00
.. ... ... ... ... ... ... ... ... ...
825 825 2000-03 2000-03-13 00:38:00 AT&T sms international 1.0 2000-03-13 00:38:00
826 826 2000-03 2000-03-13 00:39:00 Orange sms mobile 1.0 2000-03-13 00:39:00
827 827 2000-03 2000-03-13 06:58:00 Orange data data 34.5 2000-03-13 06:58:00
828 828 2000-03 2000-03-14 00:13:00 AT&T sms international 1.0 2000-03-14 00:13:00
829 829 2000-03 2000-03-14 00:16:00 AT&T sms international 1.0 2000-03-14 00:16:00
<BLANKLINE>
[830 rows x 9 columns]
"""
# Given
import pandas as pd
DATA = 'https://raw.githubusercontent.com/AstroMatt/book-python/master/_data/csv/phones-pl.csv'
result = ...
"""
* Assignment: DataFrame Mapping Translate
* Complexity: easy
* Lines of code: 5 lines
* Time: 13 min
English:
1. Use data from "Given" section (see below)
2. Read data from `DATA` as `df: pd.DataFrame`
3. Set header and index to data from file
4. Convert Polish month names to English
5. Parse dates to `datetime` objects
Polish:
1. Użyj danych z sekcji "Given" (patrz poniżej)
2. Wczytaj dane z `DATA` jako `df: pd.DataFrame`
3. Ustaw nagłówek i index na dane zaczytane z pliku
4. Przekonwertuj polskie nazwy miesięcy na angielskie
5. Sparsuj daty do obiektów `datetime`
Hints:
* `pd.Series.replace(regex=True)`
* `pd.to_datetime()`
Tests:
>>> type(result) is pd.DataFrame
True
>>> pd.set_option('display.width', 500)
>>> pd.set_option('display.max_columns', 10)
>>> pd.set_option('display.max_rows', 10)
>>> result # doctest: +NORMALIZE_WHITESPACE
id First Name Last Name Mission Date
0 1 Jan Twardowski 1988-01-05
1 2 Mark Watney 1969-07-21
2 3 Ivan Ivanovich 1961-04-12
3 4 Melissa Lewis 1970-01-01
4 5 Alex Vogel 1968-12-25
"""
# Given
import pandas as pd
DATA = 'https://raw.githubusercontent.com/AstroMatt/book-python/master/_data/csv/martian-pl.csv'
MONTHS_PLEN = {'styczeń': 'January',
'luty': 'February',
'marzec': 'March',
'kwiecień': 'April',
'maj': 'May',
'czerwiec': 'June',
'lipiec': 'July',
'sierpień': 'August',
'wrzesień': 'September',
'październik': 'October',
'listopad': 'November',
'grudzień': 'December'}
result = ...
"""
* Assignment: DataFrame Mapping Month
* Complexity: easy
* Lines of code: 10 lines
* Time: 13 min
English:
1. Use data from "Given" section (see below)
2. Read data from `DATA` as `df: pd.DataFrame`
3. Add columns `year` and `month` by parsing the `period` column
4. Month name must be a string month name, not a number (e.g. 'January', 'May')
Polish:
1. Użyj danych z sekcji "Given" (patrz poniżej)
2. Wczytaj dane z `DATA` jako `df: pd.DataFrame`
3. Dodaj kolumnę `year` i `month` poprzez sparsowanie kolumny `period`
4. Nazwa miesiąca musi być ciągiem znaków, a nie liczbą (i.e. 'January', 'May')
:Example:
* if `period` column is "2015-01"
* `year`: 2015
* `month`: January
Hints:
* `Series.str.split(expand=True)`
* `df[ ['A', 'B'] ] = ...`
Tests:
>>> type(result) is pd.DataFrame
True
>>> pd.set_option('display.width', 500)
>>> pd.set_option('display.max_columns', 10)
>>> pd.set_option('display.max_rows', 10)
>>> result # doctest: +NORMALIZE_WHITESPACE
period datetime network item type duration year month
id
0 1999-11 1999-10-15 06:58:00 T-Mobile data data 34.5 1999 November
1 1999-11 1999-10-15 06:58:00 Orange call mobile 13.0 1999 November
2 1999-11 1999-10-15 14:46:00 Play call mobile 23.0 1999 November
3 1999-11 1999-10-15 14:48:00 Plus call mobile 4.0 1999 November
4 1999-11 1999-10-15 17:27:00 T-Mobile call mobile 4.0 1999 November
.. ... ... ... ... ... ... ... ...
825 2000-03 2000-03-13 00:38:00 AT&T sms international 1.0 2000 March
826 2000-03 2000-03-13 00:39:00 Orange sms mobile 1.0 2000 March
827 2000-03 2000-03-13 06:58:00 Orange data data 34.5 2000 March
828 2000-03 2000-03-14 00:13:00 AT&T sms international 1.0 2000 March
829 2000-03 2000-03-14 00:16:00 AT&T sms international 1.0 2000 March
<BLANKLINE>
[830 rows x 8 columns]
"""
# Given
import pandas as pd
DATA = 'https://raw.githubusercontent.com/AstroMatt/book-python/master/_data/csv/phones-pl.csv'
MONTHS_EN = ['January', 'February', 'March', 'April',
'May', 'June', 'July', 'August', 'September',
'October', 'November', 'December']
MONTHS = dict(enumerate(MONTHS_EN, start=1))
result = ...
"""
* Assignment: DataFrame Mapping Substitute
* Complexity: medium
* Lines of code: 10 lines
* Time: 13 min
English:
1. Use data from "Given" section (see below)
2. Read data from `DATA` as `df: pd.DataFrame`
3. Select the `Polish` sheet
4. Set header and index to data from file
5. Mind the encoding
6. Substitute Polish Diacritics to English alphabet letters
7. Compare `df.replace(regex=True)` with `df.applymap()`
Polish:
1. Użyj danych z sekcji "Given" (patrz poniżej)
2. Wczytaj dane z `DATA` jako `df: pd.DataFrame`
3. Wybierz arkusz `Polish`
4. Ustaw nagłówek i index na dane zaczytane z pliku
5. Zwróć uwagę na encoding
6. Podmień polskie znaki diakrytyczne na litery z alfabetu angielskiego
7. Porównaj `df.replace(regex=True)` z `df.applymap()`
Tests:
>>> type(result) is pd.DataFrame
True
>>> pd.set_option('display.width', 500)
>>> pd.set_option('display.max_columns', 3)
>>> pd.set_option('display.max_rows', 10)
>>> result # doctest: +NORMALIZE_WHITESPACE
Definicja ... Kryteria wyjsciowe
TRL ...
1 Zaobserwowanie i opisanie podstawowych zasad d... ... Zweryfikowane publikacja badania lezacych u po...
2 Sformulowanie koncepcji technologicznej lub pr... ... Udokumentowany opis aplikacji / koncepcji, kto...
3 Przeprowadzanie eksperymentalnie i analityczni... ... Udokumentowane wyniki analityczne / eksperymen...
4 Przeprowadzenie weryfikacji komponentow techno... ... Udokumentowane wyniki testow potwierdzajace zg...
5 Przeprowadzenie weryfikacji komponentow techno... ... Udokumentowane wyniki testow potwierdzajace zg...
6 Dokonanie demonstracji technologii w srodowisk... ... Udokumentowane wyniki testow potwierdzajace zg...
7 Dokonanie demonstracji prototypu systemu w oto... ... Udokumentowane wyniki testow potwierdzajace zg...
8 Zakonczenie badan i demonstracja ostatecznej f... ... Udokumentowane wyniki testow weryfikujacych pr...
9 Weryfikacja technologii w srodowisku operacyjn... ... Udokumentowane wyniki operacyjne misji.
<BLANKLINE>
[9 rows x 4 columns]
"""
# Given
import pandas as pd
DATA = 'https://raw.githubusercontent.com/AstroMatt/book-python/master/_data/xlsx/astro-trl.xlsx'
LETTERS_PLEN = {'ą': 'a', 'ć': 'c', 'ę': 'e',
'ł': 'l', 'ń': 'n', 'ó': 'o',
'ś': 's', 'ż': 'z', 'ź': 'z'}
result = ...
"""
* Assignment: Pandas Read JSON OpenAPI
* Complexity: easy
* Lines of code: 5 lines
* Time: 5 min
English:
1. Use data from "Given" section (see below)
2. Read data from `DATA` as `df: pd.DataFrame`
3. Use `requests` library
4. Transpose data
5. If a cell is a `dict`, extract the value for `summary`
6. If a cell is empty, leave `pd.NA`
Polish:
1. Użyj danych z sekcji "Given" (patrz poniżej)
2. Wczytaj dane z `DATA` jako `df: pd.DataFrame`
3. Użyj biblioteki `requests`
4. Transponuj dane
5. Jeżeli komórka jest `dict`, to wyciągnij wartość dla `summary`
6. Jeżeli komórka jest pusta, pozostaw `pd.NA`
Tests:
>>> type(result) is pd.DataFrame
True
>>> pd.set_option('display.width', 500)
>>> pd.set_option('display.max_columns', 10)
>>> pd.set_option('display.max_rows', 10)
>>> len(result) > 0
True
>>> list(result.columns)
['put', 'post', 'get', 'delete']
>>> list(result.index) # doctest: +NORMALIZE_WHITESPACE
['/pet', '/pet/findByStatus', '/pet/findByTags', '/pet/{petId}', '/pet/{petId}/uploadImage',
'/store/inventory', '/store/order', '/store/order/{orderId}',
'/user', '/user/createWithList', '/user/login', '/user/logout', '/user/{username}']
"""
# Given
import pandas as pd
import requests
DATA = 'https://raw.githubusercontent.com/AstroMatt/book-python/master/_data/json/openapi.json'
result = ...