8.1. CSV About

CSV - Comma/Character Separated Values
No formal CSV standard, just common practice (RFC 4180 describes it, but is informational only)
Flat (2D) file without relations
Relations have to be flattened (serialization, additional columns, etc.)
Typically the first line (header) contains column names
Rarely, the first line instead describes the structure (nrows, ncols)
Internationalization: encoding
Localization: decimal separator, thousands separator, date format
Parameters: delimiter, quotechar, quoting, lineterminator, dialect

Example CSV file:

sepal_length,sepal_width,petal_length,petal_width,species
5.8,2.7,5.1,1.9,virginica
5.1,3.5,1.4,0.2,setosa
5.7,2.8,4.1,1.3,versicolor
6.3,2.9,5.6,1.8,virginica
6.4,3.2,4.5,1.5,versicolor
4.7,3.2,1.3,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
7.6,3.0,6.6,2.1,virginica
4.9,3.0,1.4,0.2,setosa
4.9,2.5,4.5,1.7,virginica
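
A minimal sketch of reading such a file with the standard library csv module. The rows are modeled on the example file and embedded in an io.StringIO so the snippet is self-contained; the names DATA, rows and sepal_length are illustrative only.

```python
import csv
import io

# A few rows modeled on the example file, embedded for a self-contained demo
DATA = """sepal_length,sepal_width,petal_length,petal_width,species
5.8,2.7,5.1,1.9,virginica
5.1,3.5,1.4,0.2,setosa
5.7,2.8,4.1,1.3,versicolor
"""

# DictReader uses the first line (header) as dictionary keys
reader = csv.DictReader(io.StringIO(DATA), delimiter=',', quotechar='"')
rows = list(reader)

# Every value is read as str; converting to numbers is up to the caller
first = rows[0]
sepal_length = float(first['sepal_length'])
```

Because CSV carries no type information, the conversion to float has to be done explicitly after reading.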

8.1.1. Header

File without header:

5.8,2.7,5.1,1.9,virginica
5.1,3.5,1.4,0.2,setosa
5.7,2.8,4.1,1.3,versicolor
6.3,2.9,5.6,1.8,virginica
6.4,3.2,4.5,1.5,versicolor
4.7,3.2,1.3,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
7.6,3.0,6.6,2.1,virginica
4.9,3.0,1.4,0.2,setosa
4.9,2.5,4.5,1.7,virginica

First line is a header:

sepal_length,sepal_width,petal_length,petal_width,species
5.8,2.7,5.1,1.9,virginica
5.1,3.5,1.4,0.2,setosa
5.7,2.8,4.1,1.3,versicolor
6.3,2.9,5.6,1.8,virginica
6.4,3.2,4.5,1.5,versicolor
4.7,3.2,1.3,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
7.6,3.0,6.6,2.1,virginica
4.9,3.0,1.4,0.2,setosa
4.9,2.5,4.5,1.7,virginica

First line is a structure: number of rows (nrows) and columns (ncols):

10,5
5.8,2.7,5.1,1.9,virginica
5.1,3.5,1.4,0.2,setosa
5.7,2.8,4.1,1.3,versicolor
6.3,2.9,5.6,1.8,virginica
6.4,3.2,4.5,1.5,versicolor
4.7,3.2,1.3,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
7.6,3.0,6.6,2.1,virginica
4.9,3.0,1.4,0.2,setosa
4.9,2.5,4.5,1.7,virginica

First line is a structure: number of rows (nrows) and features (nfeatures), followed by label_encoder values for label column:

10,4,virginica,setosa,versicolor
5.8,2.7,5.1,1.9,0
5.1,3.5,1.4,0.2,1
5.7,2.8,4.1,1.3,2
6.3,2.9,5.6,1.8,0
6.4,3.2,4.5,1.5,2
4.7,3.2,1.3,0.2,1
7.0,3.2,4.7,1.4,2
7.6,3.0,6.6,2.1,0
4.9,3.0,1.4,0.2,1
4.9,2.5,4.5,1.7,0
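
One possible way to read the last variant, where the first line carries nrows, nfeatures and the label names. This is only a sketch: the sample is shortened to three rows (so nrows is 3 here), and the names features and species are assumptions made for illustration.

```python
import csv
import io

# Shortened sample: 3 rows, 4 features, then the label names
DATA = """3,4,virginica,setosa,versicolor
5.8,2.7,5.1,1.9,0
5.1,3.5,1.4,0.2,1
5.7,2.8,4.1,1.3,2
"""

reader = csv.reader(io.StringIO(DATA), delimiter=',')

# First line: nrows, nfeatures, then label names indexed by position
nrows, nfeatures, *labels = next(reader)
nrows, nfeatures = int(nrows), int(nfeatures)

features = []
species = []
for *values, label in reader:
    # All columns except the last are numeric features
    features.append([float(x) for x in values])
    # The last column is an index into the label names
    species.append(labels[int(label)])
```

The structure line can then be used as a sanity check, for example by comparing len(features) with nrows.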

8.1.2. Delimiter

The csv module expects the delimiter to be exactly one character long

delimiter=',':

sepal_length,sepal_width,petal_length,petal_width,species
5.8,2.7,5.1,1.9,virginica
5.1,3.5,1.4,0.2,setosa
5.7,2.8,4.1,1.3,versicolor

delimiter=';':

sepal_length;sepal_width;petal_length;petal_width;species
5.8;2.7;5.1;1.9;virginica
5.1;3.5;1.4;0.2;setosa
5.7;2.8;4.1;1.3;versicolor

delimiter=':':

root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
adm:x:3:4:adm:/var/adm:/sbin/nologin
shutdown:x:6:0:shutdown:/sbin:/sbin/shutdown
halt:x:7:0:halt:/sbin:/sbin/halt
nobody:x:99:99:Nobody:/:/sbin/nologin
sshd:x:74:74:Privilege-separated SSH:/var/empty/sshd:/sbin/nologin
watney:x:1000:1000:Mark Watney:/home/watney:/bin/bash
lewis:x:1001:1001:Melissa Lewis:/home/lewis:/bin/bash
martinez:x:1002:1002:Rick Martinez:/home/martinez:/bin/bash

delimiter='|':

| Firstname | Lastname | Role      |
|-----------|----------|-----------|
| Mark      | Watney   | Botanist  |
| Melissa   | Lewis    | Commander |
| Rick      | Martinez | Pilot     |

delimiter='\t':

sepal_length        sepal_width     petal_length    petal_width     species
5.8     2.7     5.1     1.9     virginica
5.1     3.5     1.4     0.2     setosa
5.7     2.8     4.1     1.3     versicolor
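
A short sketch of parsing colon-delimited data such as the passwd-style sample. The column names in FIELDNAMES are hypothetical labels chosen for this example, since the file itself has no header. The second part illustrates the one-character restriction mentioned at the top of this section.

```python
import csv
import io

DATA = """root:x:0:0:root:/root:/bin/bash
watney:x:1000:1000:Mark Watney:/home/watney:/bin/bash
lewis:x:1001:1001:Melissa Lewis:/home/lewis:/bin/bash
"""

# Hypothetical column names; the file has no header, so they are supplied explicitly
FIELDNAMES = ['login', 'password', 'uid', 'gid', 'comment', 'home', 'shell']

reader = csv.DictReader(io.StringIO(DATA), fieldnames=FIELDNAMES, delimiter=':')
users = {row['login']: int(row['uid']) for row in reader}

# A delimiter longer than one character is rejected by the csv module
try:
    csv.reader(io.StringIO(DATA), delimiter='::')
    multichar_ok = True
except TypeError:
    multichar_ok = False
```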

8.1.3. Quotechar

" - quotation mark (best)
' - apostrophe

quotechar='"':

"sepal_length","sepal_width","petal_length","petal_width","species"
"5.8","2.7","5.1","1.9","virginica"
"5.1","3.5","1.4","0.2","setosa"
"5.7","2.8","4.1","1.3","versicolor"

quotechar="'":

'sepal_length','sepal_width','petal_length','petal_width','species'
'5.8','2.7','5.1','1.9','virginica'
'5.1','3.5','1.4','0.2','setosa'
'5.7','2.8','4.1','1.3','versicolor'

quotechar='|':

|sepal_length|,|sepal_width|,|petal_length|,|petal_width|,|species|
|5.8|,|2.7|,|5.1|,|1.9|,|virginica|
|5.1|,|3.5|,|1.4|,|0.2|,|setosa|
|5.7|,|2.8|,|4.1|,|1.3|,|versicolor|

quotechar='/':

/sepal_length/,/sepal_width/,/petal_length/,/petal_width/,/species/
/5.8/,/2.7/,/5.1/,/1.9/,/virginica/
/5.1/,/3.5/,/1.4/,/0.2/,/setosa/
/5.7/,/2.8/,/4.1/,/1.3/,/versicolor/
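
The quote character matters most when a field contains the delimiter. The sketch below, using made-up sample data, reads the same line twice: once with the matching quotechar and once with a mismatched one.

```python
import csv
import io

# The last field contains a comma, so it must be quoted
DATA = "'Melissa','Lewis','Some, comment'\n"

# Matching quotechar: quotes are stripped and the comma stays inside the field
good = next(csv.reader(io.StringIO(DATA), delimiter=',', quotechar="'"))

# Mismatched quotechar: apostrophes are treated as ordinary characters,
# so the field is split on the embedded comma
bad = next(csv.reader(io.StringIO(DATA), delimiter=',', quotechar='"'))
```

With the matching quotechar the row has three fields; with the mismatched one it has four, and the apostrophes remain in the values.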

8.1.4. Quoting

csv.QUOTE_ALL (safest)
csv.QUOTE_MINIMAL
csv.QUOTE_NONE
csv.QUOTE_NONNUMERIC

quoting=csv.QUOTE_ALL:

"sepal_length","sepal_width","petal_length","petal_width","species"
"5.8","2.7","5.1","1.9","virginica"
"5.1","3.5","1.4","0.2","setosa"
"5.7","2.8","4.1","1.3","versicolor"

quoting=csv.QUOTE_MINIMAL:

sepal_length,sepal_width,petal_length,petal_width,species
5.8,2.7,5.1,1.9,virginica
5.1,3.5,1.4,0.2,setosa
5.7,2.8,4.1,1.3,versicolor

quoting=csv.QUOTE_NONE:

sepal_length,sepal_width,petal_length,petal_width,species
5.8,2.7,5.1,1.9,virginica
5.1,3.5,1.4,0.2,setosa
5.7,2.8,4.1,1.3,versicolor

quoting=csv.QUOTE_NONNUMERIC:

"sepal_length","sepal_width","petal_length","petal_width","species"
5.8,2.7,5.1,1.9,"virginica"
5.1,3.5,1.4,0.2,"setosa"
5.7,2.8,4.1,1.3,"versicolor"
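
The four modes can be compared by writing the same row with each of them. This is a small sketch; the helper dump() and the ROW constant are names invented for this example. Note that the numeric fields are passed as float objects, which is what lets csv.QUOTE_NONNUMERIC tell them apart from text.

```python
import csv
import io

ROW = [5.8, 2.7, 5.1, 1.9, 'virginica']

def dump(quoting):
    """Write ROW to an in-memory buffer using the given quoting mode."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=',', quotechar='"',
                        quoting=quoting, lineterminator='\n')
    writer.writerow(ROW)
    return buffer.getvalue()

quote_all = dump(csv.QUOTE_ALL)
quote_minimal = dump(csv.QUOTE_MINIMAL)
quote_none = dump(csv.QUOTE_NONE)
quote_nonnumeric = dump(csv.QUOTE_NONNUMERIC)

# When reading, QUOTE_NONNUMERIC converts unquoted fields to float
restored = next(csv.reader(io.StringIO(quote_nonnumeric),
                           quoting=csv.QUOTE_NONNUMERIC))
```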

8.1.5. Lineterminator

\r\n - New line on Windows
\n - New line on *nix
*nix operating systems: Linux, macOS, BSD and other POSIX compliant OSes (excluding Windows)
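
A brief sketch of how the lineterminator parameter affects what a csv.writer produces. The helper dump() is a made-up name, and the output goes to an in-memory buffer. When writing to a real file, the csv documentation recommends opening it with newline='' so that Python does not translate the line endings again.

```python
import csv
import io

def dump(lineterminator):
    """Write two rows using the given line terminator and return the raw text."""
    buffer = io.StringIO(newline='')
    writer = csv.writer(buffer, lineterminator=lineterminator)
    writer.writerow(['Mark', 'Watney'])
    writer.writerow(['Melissa', 'Lewis'])
    return buffer.getvalue()

windows = dump('\r\n')
unix = dump('\n')
```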

8.1.6. Decimal Separator

0.1 - Decimal point
0,1 - Decimal comma

[Figure: l10n-decimal-separator.png]

sepal_length;sepal_width;petal_length;petal_width;species
5.8;2.7;5.1;1.9;virginica
5.1;3.5;1.4;0.2;setosa
5.7;2.8;4.1;1.3;versicolor

sepal_length;sepal_width;petal_length;petal_width;species
5,8;2,7;5,1;1,9;virginica
5,1;3,5;1,4;0,2;setosa
5,7;2,8;4,1;1,3;versicolor
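
The csv module does not interpret numbers, so a decimal comma has to be handled by the caller. A minimal sketch, assuming the semicolon-delimited layout shown here and a hypothetical NUMERIC list naming the columns to convert:

```python
import csv
import io

DATA = """sepal_length;sepal_width;petal_length;petal_width;species
5,8;2,7;5,1;1,9;virginica
5,1;3,5;1,4;0,2;setosa
"""

# Columns that hold numbers written with a decimal comma
NUMERIC = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

result = []
for row in csv.DictReader(io.StringIO(DATA), delimiter=';'):
    for name in NUMERIC:
        # Replace the decimal comma with a point before converting
        row[name] = float(row[name].replace(',', '.'))
    result.append(row)
```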

8.1.7. Thousands Separator

1000000 - None
1'000'000 - Apostrophe
1 000 000 - Space, the internationally recommended thousands separator
1.000.000 - Period, used in many non-English speaking countries
1,000,000 - Comma, used in most English-speaking countries
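
A sketch of normalizing these variants to int. The function parse_int() is a hypothetical helper, not part of the standard library, and it assumes the separator used in the file is known in advance, because a value such as 1.000 is ambiguous otherwise.

```python
def parse_int(text, thousands=''):
    """Hypothetical helper: strip a known thousands separator, then convert."""
    if thousands:
        text = text.replace(thousands, '')
    return int(text)

values = [
    parse_int('1000000'),
    parse_int("1'000'000", thousands="'"),
    parse_int('1 000 000', thousands=' '),
    parse_int('1.000.000', thousands='.'),
    parse_int('1,000,000', thousands=','),
]
```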

8.1.8. Date and Time

>>> date = '1961-04-12'
>>> date = '12.4.1961'
>>> date = '12.04.1961'
>>> date = '12-04-1961'
>>> date = '12/04/1961'
>>> date = '4/12/61'
>>> date = '4.12.1961'
>>> date = 'Apr 12, 1961'
>>> date = 'Apr 12th, 1961'

>>> time = '12:00:00'
>>> time = '12:00'
>>> time = '12:00 pm'

>>> duration = '04:30:00'
>>> duration = '4h 30m'
>>> duration = '4 hours 30 minutes'
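
Most of these notations can be parsed with datetime.strptime() once the format is known. The sketch below rests on assumptions that the file alone cannot confirm: 12/04/1961 is read day-first, 4.12.1961 is read month-first, and month names are in English (the default C locale). A value such as 'Apr 12th, 1961' would need the ordinal suffix removed first.

```python
from datetime import datetime, timedelta

# Assumed format for each notation; ambiguous ones are a guess
FORMATS = {
    '1961-04-12': '%Y-%m-%d',
    '12.4.1961': '%d.%m.%Y',
    '12.04.1961': '%d.%m.%Y',
    '12-04-1961': '%d-%m-%Y',
    '12/04/1961': '%d/%m/%Y',
    '4.12.1961': '%m.%d.%Y',
    'Apr 12, 1961': '%b %d, %Y',
}

parsed = {text: datetime.strptime(text, fmt).date()
          for text, fmt in FORMATS.items()}

# Two-digit years are risky: %y maps 00-68 to 20xx and 69-99 to 19xx
two_digit = datetime.strptime('4/12/61', '%m/%d/%y').date()

# Durations are not dates; build a timedelta from the parts
hours, minutes, seconds = map(int, '04:30:00'.split(':'))
duration = timedelta(hours=hours, minutes=minutes, seconds=seconds)
```

The two-digit year case shows why the ISO 8601 form (1961-04-12) is the safest choice for CSV files: '4/12/61' comes back as the year 2061.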

8.1.9. Encoding

utf-8 - international standard (should always be used!)
iso-8859-1 - ISO standard for Western Europe and USA
iso-8859-2 - ISO standard for Central Europe (including Poland)
cp1250 or windows-1250 - Central European encoding on Windows
cp1251 or windows-1251 - Cyrillic (Eastern European) encoding on Windows
cp1252 or windows-1252 - Western European encoding on Windows
ASCII - ASCII characters only

with open(FILE, encoding='utf-8') as file:
    ...
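
A small sketch of what happens when the encoding used for reading does not match the one used for writing. The file name utf8.csv and the temporary directory are just for illustration.

```python
import tempfile
from pathlib import Path

TEXT = 'zażółć gęślą jaźń'

with tempfile.TemporaryDirectory() as directory:
    path = Path(directory) / 'utf8.csv'

    with open(path, mode='w', encoding='utf-8') as file:
        file.write(TEXT)

    # Same encoding for reading and writing: text survives intact
    with open(path, encoding='utf-8') as file:
        same = file.read()

    # Mismatched encoding: no error is raised, but the text is garbled
    with open(path, encoding='iso-8859-2') as file:
        garbled = file.read()

# Characters outside ASCII cannot be represented and are replaced with '?'
ascii_only = TEXT.encode('ascii', errors='replace').decode('ascii')
```

The silent garbling in the second read is the main reason to pass encoding explicitly instead of relying on the platform default.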

8.1.10. Dialects

import csv

csv.list_dialects()
# ['excel', 'excel-tab', 'unix']
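
Each registered dialect is a bundle of the parameters described earlier. A short sketch that inspects the three built-in ones with csv.get_dialect(); the summary dictionary is only an illustrative way to collect the values.

```python
import csv

names = ['excel', 'excel-tab', 'unix']

# Map each dialect name to (delimiter, quotechar, quoting, lineterminator)
summary = {}
for name in names:
    dialect = csv.get_dialect(name)
    summary[name] = (dialect.delimiter, dialect.quotechar,
                     dialect.quoting, dialect.lineterminator)
```

A dialect name can then be passed to a reader or writer, for example csv.reader(file, dialect='unix'), instead of listing the parameters one by one.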

Microsoft Excel 2016-2020:
- quoting=csv.QUOTE_MINIMAL
- quotechar='"'
- delimiter=',' or delimiter=';', depending on the decimal separator of the Windows locale
- lineterminator='\r\n'
- encoding='...' - depends on the Windows locale, typically windows-*
Microsoft Excel macOS:
- quoting=csv.QUOTE_MINIMAL
- quotechar='"'
- delimiter=','
- lineterminator='\r\n'
- encoding='utf-8'
Microsoft Excel export options:

$ file utf8.csv
utf8.csv: CSV text

$ cat utf8.csv
Firstname,Lastname,Age,Comment
Mark,Watney,21,zażółć gęślą jaźń
Melissa,Lewis,21.5,"Some, comment"
,,"21,5",Some; Comment

$ file standard.csv
standard.csv: CSV text

$ cat standard.csv
Firstname,Lastname,Age,Comment
Mark,Watney,21,za_?__ g__l_ ja__
Melissa,Lewis,21.5,"Some, comment"
,,"21,5",Some; Comment

$ file dos.csv
dos.csv: CSV text

$ cat dos.csv
Firstname,Lastname,Age,Comment
Mark,Watney,21,za_?__ g__l_ ja__
Melissa,Lewis,21.5,"Some, comment"
,,"21,5",Some; Comment

$ file macintosh.csv
macintosh.csv: Non-ISO extended-ASCII text, with CR line terminators

$ cat macintosh.csv
,,"21,5",Some; Comment
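
The macintosh.csv output shows only one line because its records end with a bare carriage return, which the terminal does not treat as a new line. The csv reader does accept it. A sketch under the assumption that the data resembles the other exports; the row with Polish characters is left out because the file's byte encoding is not shown here.

```python
import csv
import io

# Records terminated by a bare carriage return (CR), as reported by `file`
RAW = ('Firstname,Lastname,Age,Comment\r'
       'Melissa,Lewis,21.5,"Some, comment"\r'
       ',,"21,5",Some; Comment\r')

# newline='' keeps line endings untranslated, as the csv documentation recommends
rows = list(csv.reader(io.StringIO(RAW, newline='')))
```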

8.1.11. Good Practices

Always specify:

delimiter=',' when creating a csv.DictReader() object
quotechar='"' when creating a csv.DictReader() object
quoting=csv.QUOTE_ALL when creating a csv.DictReader() object
lineterminator='\n' when creating a csv.DictReader() object (readers ignore it, but it matters for csv.DictWriter())
encoding='utf-8' when calling the open() function (especially when working with Microsoft Excel)
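
Put together, the recommendations could look like the sketch below, which writes a file and reads it back with the same explicit settings. The file name astronauts.csv, the OPTIONS dictionary and the sample records are assumptions made for this example.

```python
import csv
import tempfile
from pathlib import Path

DATA = [
    {'firstname': 'Mark', 'lastname': 'Watney', 'comment': 'zażółć gęślą jaźń'},
    {'firstname': 'Melissa', 'lastname': 'Lewis', 'comment': 'Some, comment'},
]

# Explicit settings shared by the writer and the reader
OPTIONS = dict(delimiter=',', quotechar='"',
               quoting=csv.QUOTE_ALL, lineterminator='\n')

with tempfile.TemporaryDirectory() as directory:
    path = Path(directory) / 'astronauts.csv'

    # newline='' leaves line-ending handling to the csv module
    with open(path, mode='w', encoding='utf-8', newline='') as file:
        writer = csv.DictWriter(file, fieldnames=['firstname', 'lastname', 'comment'], **OPTIONS)
        writer.writeheader()
        writer.writerows(DATA)

    with open(path, encoding='utf-8', newline='') as file:
        result = list(csv.DictReader(file, **OPTIONS))

    content = path.read_text(encoding='utf-8')
```

Keeping the options in one place makes it harder for the reading and writing sides to drift apart.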