4.6. Str Clean

4.6.1. Rationale

  • Using str methods for cleaning user input

  • 80% of machine learning and data science is cleaning data

  • Is This the Same Address?

  • This is a dump of distinct records of a single address

  • Which one of the below is a true address?

4.6.2. Numbers

When comparing age, height, temperature etc, the following numbers has the same meaning. Therefore after converting to float() it will be exactly the same.

>>> age = 21
>>> age = 21.0
>>> age = 21.00
>>> age = '21'
>>> age = '21.0'
>>> age = '21.00'

However, when those values indicates for example a pattern to find in text their meaning will be different. Pattern 21 and '21.00' will be a completely different object, so it should not be treated exactly the same.

>>> pattern = 21
>>> pattern = 21.0
>>> pattern = 21.00
>>> pattern = '21'
>>> pattern = '21.0'
>>> pattern = '21.00'

4.6.3. Addresses

The following code is an output from real customer relationship management (CRM) system, that I wrote in 2000s for a swimming pool in Poznan, Poland. The output is a result of a SELECT DISTINCT(address) result in SQL.

Note to english speaking users:

  • os. - stands for osiedle, a projects of blocks of flats

  • ul. - stands for ulica, street

Is this the same address?

>>> street = 'os. Jana III Sobieskiego'
>>> street = 'ul. Jana III Sobieskiego'
>>> street = 'ul Jana III Sobieskiego'
>>> street = 'ul.Jana III Sobieskiego'
>>> street = 'ulicaJana III Sobieskiego'
>>> street = 'Ul. Jana III Sobieskiego'
>>> street = 'UL. Jana III Sobieskiego'
>>> street = 'ulica Jana III Sobieskiego'
>>> street = 'Ulica. Jana III Sobieskiego'
>>> street = 'Jana 3 Sobieskiego'
>>> street = 'Jana 3ego Sobieskiego'
>>> street = 'Jana III Sobieskiego'
>>> street = 'Jana Iii Sobieskiego'
>>> street = 'Jana IIi Sobieskiego'
>>> street = 'Jana lll Sobieskiego'  # three small letters 'L'

Yes, this is the same address. Despite having information about two different geographical entities (osiedle and ulica), this is the same address. Why? It is just a simple mistake from people who entered data.

SELECT DISTINCT(address) won't show you the number of occurrences for each result. What seems to be a high error rate at the first glance, in further analysis happens to be a superbly few mistakes. How come? Number of results for os. Jana III Sobieskiego was around 50 thousands. The other results was one or two at most. So, few mistakes from 50k results. That's really good result.

Why we had those errors? Browser autocomplete. User error while imputing data. And simple shortcuts during conversation: 'Where do you live?', 'at Sobieskiego the IIIrd'. There is only one place in Poznan, Poland with that name, so it was precise during the conversation. But, receiving party put that incorrectly to the database assuming that it was ulica which is far more common then osiedle addresses.

Address prefix (street, road, court, place, etc.):

>>> prefix = 'ul'
>>> prefix = 'ul.'
>>> prefix = 'Ul.'
>>> prefix = 'UL.'
>>> prefix = 'ulica'
>>> prefix = 'Ulica'
>>>
>>> prefix = 'os'
>>> prefix = 'os.'
>>> prefix = 'Os.'
>>> prefix = 'osiedle'
>>> prefix = 'oś'
>>> prefix = 'oś.'
>>> prefix = 'Oś.'
>>> prefix = 'ośedle'
>>>
>>> prefix = 'pl'
>>> prefix = 'pl.'
>>> prefix = 'Pl.'
>>> prefix = 'plac'
>>>
>>> prefix = 'al'
>>> prefix = 'al.'
>>> prefix = 'Al.'
>>> prefix = 'aleja'
>>> prefix = 'aleia'
>>> prefix = 'alei'
>>> prefix = 'aleii'
>>> prefix = 'aleji'

House and apartment number:

>>> address = 'Ćwiartki 3/4'
>>> address = 'Ćwiartki 3 / 4'
>>> address = 'Ćwiartki 3 m. 4'
>>> address = 'Ćwiartki 3 m 4'
>>> address = 'Brighton Beach 1st apt 2'
>>> address = 'Brighton Beach 1st apt. 2'
>>> address = 'Myśliwiecka 3/5/7'
>>>
>>> address = 'Jana Twardowskiego 180f/8f'
>>> address = 'Jana Twardowskiego 180f/8'
>>> address = 'Jana Twardowskiego 180/8f'
>>>
>>> address = 'Jana Twardowskiego III 3 m. 3'
>>> address = 'Jana Twardowskiego 13d bud. A piętro II sala 3'

4.6.4. Phone Numbers:

Which one is mobile, and which is landline?

>>> phone = '+48 (12) 355 5678'
>>> phone = '+48 123 555 678'

Note, the numbers. They are completely the same. Your

>>> phone = '123 555 678'
>>> phone = '123555678'
>>> phone = '+48123555678'
>>> phone = '+48 12 355 5678'
>>> phone = '+48 123-555-678'
>>> phone = '+48 123 555 6789'
>>> phone = '+1 (123) 555-6789'
>>> phone = '+1 (123).555.6789'
>>>
>>> phone = '+1 800-python'
>>> phone = '+1 800-798466'
>>>
>>> phone = '+48 123 555 678 wew. 1337'
>>> phone = '+48 123555678,1'
>>> phone = '+48 123555678,1,,2'

4.6.5. Date and Time

>>> date = '1961-04-12'
>>> date = '12.4.1961'
>>> date = '12.04.1961'
>>> date = '12-04-1961'
>>> date = '12/04/1961'
>>> date = '4/12/61'
>>> date = '4.12.1961'
>>> date = 'Apr 12, 1961'
>>> date = 'Apr 12th, 1961'
>>> time = '12:00:00'
>>> time = '12:00'
>>> time = '12:00 pm'
>>> duration = '04:30:00'
>>> duration = '4h 30m'
>>> duration = '4 hours 30 minutes'

4.6.6. Assignments

Code 4.5. Solution
"""
* Assignment: Type String Clean
* Required: yes
* Complexity: easy
* Lines of code: 8 lines
* Time: 13 min

English:
    1. Expected value is `Jana III Sobieskiego`
    2. Use only `str` methods to clean each variable
    3. Discuss how to create generic solution which fit all cases
    4. Implementation of such generic function will be in `Function Arguments Clean` chapter
    5. Run doctests - all must succeed

Polish:
    1. Oczekiwana wartość `Jana III Sobieskiego`
    2. Wykorzystaj tylko metody `str` do oczyszczenia każdej zmiennej
    3. Przeprowadź dyskusję jak zrobić rozwiązanie generyczne pasujące do wszystkich przypadków
    4. Implementacja takiej generycznej funkcji będzie w rozdziale `Function Arguments Clean`
    5. Uruchom doctesty - wszystkie muszą się powieść

Tests:
    >>> import sys; sys.tracebacklimit = 0

    >>> assert example is not Ellipsis, 'Assignment solution must be in `example` instead of ... (Ellipsis)'
    >>> assert a is not Ellipsis, 'Assignment solution must be in `a` instead of ... (Ellipsis)'
    >>> assert b is not Ellipsis, 'Assignment solution must be in `b` instead of ... (Ellipsis)'
    >>> assert c is not Ellipsis, 'Assignment solution must be in `c` instead of ... (Ellipsis)'
    >>> assert d is not Ellipsis, 'Assignment solution must be in `d` instead of ... (Ellipsis)'
    >>> assert e is not Ellipsis, 'Assignment solution must be in `e` instead of ... (Ellipsis)'
    >>> assert f is not Ellipsis, 'Assignment solution must be in `f` instead of ... (Ellipsis)'
    >>> assert g is not Ellipsis, 'Assignment solution must be in `g` instead of ... (Ellipsis)'
    >>> assert h is not Ellipsis, 'Assignment solution must be in `h` instead of ... (Ellipsis)'
    >>> assert i is not Ellipsis, 'Assignment solution must be in `i` instead of ... (Ellipsis)'
    >>> assert type(example) is str, 'Variable `example` has invalid type, should be str'
    >>> assert type(a) is str, 'Variable `a` has invalid type, should be str'
    >>> assert type(b) is str, 'Variable `b` has invalid type, should be str'
    >>> assert type(c) is str, 'Variable `c` has invalid type, should be str'
    >>> assert type(d) is str, 'Variable `d` has invalid type, should be str'
    >>> assert type(e) is str, 'Variable `e` has invalid type, should be str'
    >>> assert type(f) is str, 'Variable `f` has invalid type, should be str'
    >>> assert type(g) is str, 'Variable `g` has invalid type, should be str'
    >>> assert type(h) is str, 'Variable `h` has invalid type, should be str'
    >>> assert type(i) is str, 'Variable `i` has invalid type, should be str'

    >>> example
    'Jana Twardowskiego III'
    >>> a
    'Jana III Sobieskiego'
    >>> b
    'Jana III Sobieskiego'
    >>> c
    'Jana III Sobieskiego'
    >>> d
    'Jana III Sobieskiego'
    >>> e
    'Jana III Sobieskiego'
    >>> f
    'Jana III Sobieskiego'
    >>> g
    'Jana III Sobieskiego'
    >>> h
    'Jana III Sobieskiego'
    >>> i
    'Jana III Sobieskiego'
"""

example = 'UL. jana \tTWArdoWskIEGO 3'
a = 'ul Jana III SobIESkiego'
b = '\tul. Jana trzeciego Sobieskiego'
c = 'ulicaJana III Sobieskiego'
d = 'UL. JANA 3 \nSOBIESKIEGO'
e = 'UL. jana III SOBiesKIEGO'
f = 'ULICA JANA III SOBIESKIEGO  '
g = 'ULICA. JANA III SOBIeskieGO'
h = ' Jana 3 Sobieskiego  '
i = 'Jana III\tSobieskiego '

example = example.upper().replace('UL. ', '').replace('\t', '').strip().title().replace('3', 'III')
a = a  # str: Jana III Sobieskiego
b = b  # str: Jana III Sobieskiego
c = c  # str: Jana III Sobieskiego
d = d  # str: Jana III Sobieskiego
e = e  # str: Jana III Sobieskiego
f = f  # str: Jana III Sobieskiego
g = g  # str: Jana III Sobieskiego
h = h  # str: Jana III Sobieskiego
i = i  # str: Jana III Sobieskiego