8.1. CSV About¶
CSV - Comma/Character Separated Values
No CSV formal standard, just a practice
Flat file (2D) without relations
Relations has to be flatten (serialization, additional columns, etc...)
Typically first line (header) represents column names
Rarely first line can also have a structure (nrows, ncols)
Internationalization: encoding
Localization: decimal separator, thousands separator, date format
Parameters: delimiter, quotechar, quoting, lineterminator, dialect
Example CSV file:
sepal_length,sepal_width,petal_length,petal_width,species
5.8,2.7,5.1,1.9,virginica
5.1,3.5,1.4,0.2,setosa
5.7,2.8,4.1,1.3,versicolor
6.3,2.9,5.6,1.8,virginica
6.4,3.2,4.5,1.5,versicolor
4.7,3.2,1.3,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
7.6,3.0,6.6,2.1,virginica
4.9,3.0,1.4,0.2,setosa
4.9,2.5,4.5,1.7,virginica
8.1.1. Header¶
File without header:
5.8,2.7,5.1,1.9,virginica
5.1,3.5,1.4,0.2,setosa
5.7,2.8,4.1,1.3,versicolor
6.3,2.9,5.6,1.8,virginica
6.4,3.2,4.5,1.5,versicolor
4.7,3.2,1.3,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
7.6,3.0,6.6,2.1,virginica
4.9,3.0,1.4,0.2,setosa
4.9,2.5,4.5,1.7,virginica
First line is a header:
sepal_length,sepal_width,petal_length,petal_width,species
5.8,2.7,5.1,1.9,virginica
5.1,3.5,1.4,0.2,setosa
5.7,2.8,4.1,1.3,versicolor
6.3,2.9,5.6,1.8,virginica
6.4,3.2,4.5,1.5,versicolor
4.7,3.2,1.3,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
7.6,3.0,6.6,2.1,virginica
4.9,3.0,1.4,0.2,setosa
4.9,2.5,4.5,1.7,virginica
First line is a structure: number of rows (nrows) and columns (ncols):
10,5
5.8,2.7,5.1,1.9,virginica
5.1,3.5,1.4,0.2,setosa
5.7,2.8,4.1,1.3,versicolor
6.3,2.9,5.6,1.8,virginica
6.4,3.2,4.5,1.5,versicolor
4.7,3.2,1.3,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
7.6,3.0,6.6,2.1,virginica
4.9,3.0,1.4,0.2,setosa
4.9,2.5,4.5,1.7,virginica
First line is a structure: number of rows (nrows) and features (nfeatures), followed by label_encoder values for label column:
10,4,virginica,setosa,versicolor
5.8,2.7,5.1,1.9,0
5.1,3.5,1.4,0.2,1
5.7,2.8,4.1,1.3,2
6.3,2.9,5.6,1.8,0
6.4,3.2,4.5,1.5,2
4.7,3.2,1.3,0.2,1
7.0,3.2,4.7,1.4,2
7.6,3.0,6.6,2.1,0
4.9,3.0,1.4,0.2,1
4.9,2.5,4.5,1.7,0
8.1.2. Delimiter¶
csv
module expects delimeter to be 1-character in length
delimiter=','
:
sepal_length,sepal_width,petal_length,petal_width,species
5.8,2.7,5.1,1.9,virginica
5.1,3.5,1.4,0.2,setosa
5.7,2.8,4.1,1.3,versicolor
delimiter=';'
:
sepal_length;sepal_width;petal_length;petal_width;species
5.8;2.7;5.1;1.9;virginica
5.1;3.5;1.4;0.2;setosa
5.7;2.8;4.1;1.3;versicolor
delimiter=':'
:
root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
adm:x:3:4:adm:/var/adm:/sbin/nologin
shutdown:x:6:0:shutdown:/sbin:/sbin/shutdown
halt:x:7:0:halt:/sbin:/sbin/halt
nobody:x:99:99:Nobody:/:/sbin/nologin
sshd:x:74:74:Privilege-separated SSH:/var/empty/sshd:/sbin/nologin
watney:x:1000:1000:Mark Watney:/home/watney:/bin/bash
lewis:x:1001:1001:Melissa Lewis:/home/lewis:/bin/bash
martinez:x:1002:1002:Rick Martinez:/home/martinez:/bin/bash
delimiter='|'
:
| Firstname | Lastname | Role |
|-----------|----------|-----------|
| Mark | Watney | Botanist |
| Melissa | Lewis | Commander |
| Rick | Martinez | Pilot |
delimiter='\t'
:
sepal_length sepal_width petal_length petal_width species
5.8 2.7 5.1 1.9 virginica
5.1 3.5 1.4 0.2 setosa
5.7 2.8 4.1 1.3 versicolor
8.1.3. Quotechar¶
"
- quote char (best)'
- apostrophe
quotechar='"'
:
"sepal_length","sepal_width","petal_length","petal_width","species"
"5.8","2.7","5.1","1.9","virginica"
"5.1","3.5","1.4","0.2","setosa"
"5.7","2.8","4.1","1.3","versicolor"
quotechar="'"
:
'sepal_length','sepal_width','petal_length','petal_width','species'
'5.8','2.7','5.1','1.9','virginica'
'5.1','3.5','1.4','0.2','setosa'
'5.7','2.8','4.1','1.3','versicolor'
quotechar='|'
:
|sepal_length|,|sepal_width|,|petal_length|,|petal_width|,|species|
|5.8|,|2.7|,|5.1|,|1.9|,|virginica|
|5.1|,|3.5|,|1.4|,|0.2|,|setosa|
|5.7|,|2.8|,|4.1|,|1.3|,|versicolor|
quotechar='/'
:
/sepal_length/,/sepal_width/,/petal_length/,/petal_width/,/species/
/5.8/,/2.7/,/5.1/,/1.9/,/virginica/
/5.1/,/3.5/,/1.4/,/0.2/,/setosa/
/5.7/,/2.8/,/4.1/,/1.3/,/versicolor/
8.1.4. Quoting¶
csv.QUOTE_ALL
(safest)csv.QUOTE_MINIMAL
csv.QUOTE_NONE
csv.QUOTE_NONNUMERIC
quoting=csv.QUOTE_ALL
:
"sepal_length","sepal_width","petal_length","petal_width","species"
"5.8","2.7","5.1","1.9","virginica"
"5.1","3.5","1.4","0.2","setosa"
"5.7","2.8","4.1","1.3","versicolor"
quoting=csv.QUOTE_MINIMAL
:
sepal_length,sepal_width,petal_length,petal_width,species
5.8,2.7,5.1,1.9,virginica
5.1,3.5,1.4,0.2,setosa
5.7,2.8,4.1,1.3,versicolor
quoting=csv.QUOTE_NONE
:
sepal_length,sepal_width,petal_length,petal_width,species
5.8,2.7,5.1,1.9,virginica
5.1,3.5,1.4,0.2,setosa
5.7,2.8,4.1,1.3,versicolor
quoting=csv.QUOTE_NONNUMERIC
:
"sepal_length","sepal_width","petal_length","petal_width","species"
5.8,2.7,5.1,1.9,"virginica"
5.1,3.5,1.4,0.2,"setosa"
5.7,2.8,4.1,1.3,"versicolor"
8.1.5. Lineterminator¶
\r\n
- New line on Windows\n
- New line on*nix
*nix
operating systems: Linux, macOS, BSD and other POSIX compliant OSes (excluding Windows)
8.1.6. Decimal Separator¶
0.1
- Decimal point0,1
- Decimal comma
sepal_length,sepal_width,petal_length,petal_width,species
5.8;2.7;5.1;1.9;virginica
5.1;3.5;1.4;0.2;setosa
5.7;2.8;4.1;1.3;versicolor
sepal_length,sepal_width,petal_length,petal_width,species
5,8;2,7;5,1;1,9;virginica
5,1;3,5;1,4;0,2;setosa
5,7;2,8;4,1;1,3;versicolor
8.1.7. Thousands Separator¶
1000000
- None1'000'000
- Apostrophe1 000 000
- Space, the internationally recommended thousands separator1.000.000
- Period, used in many non-English speaking countries1,000,000
- Comma, used in most English-speaking countries
8.1.8. Date and Time¶
>>> date = '1961-04-12'
>>> date = '12.4.1961'
>>> date = '12.04.1961'
>>> date = '12-04-1961'
>>> date = '12/04/1961'
>>> date = '4/12/61'
>>> date = '4.12.1961'
>>> date = 'Apr 12, 1961'
>>> date = 'Apr 12th, 1961'
>>> time = '12:00:00'
>>> time = '12:00'
>>> time = '12:00 pm'
>>> duration = '04:30:00'
>>> duration = '4h 30m'
>>> duration = '4 hours 30 minutes'
8.1.9. Encoding¶
utf-8
- international standard (should be always used!)iso-8859-1
- ISO standard for Western Europe and USAiso-8859-2
- ISO standard for Central Europe (including Poland)cp1250
orwindows-1250
- Central European encoding on Windowscp1251
orwindows-1251
- Eastern European encoding on Windowscp1252
orwindows-1252
- Western European encoding on WindowsASCII
- ASCII characters only
with open(FILE, encoding='utf-8') as file:
...
8.1.10. Dialects¶
import csv
csv.list_dialects()
# ['excel', 'excel-tab', 'unix']
Microsoft Excel 2016-2020:
quoting=csv.QUOTE_MINIMAL
quotechar='"'
delimiter=','
ordelimiter=';'
depending on Windows locale decimal separatorlineterminator='\r\n'
encoding='...'
- depends on Windows locale typicallywindows-*
Microsoft Excel macOS:
quoting=csv.QUOTE_MINIMAL
quotechar='"'
delimiter=','
lineterminator='\r\n'
encoding='utf-8'
Microsoft export options:
$ file utf8.csv
utf8.csv: CSV text
$ cat utf8.csv
Firstname,Lastname,Age,Comment
Mark,Watney,21,zażółć gęślą jaźń
Melissa,Lewis,21.5,"Some, comment"
,,"21,5",Some; Comment
$ file standard.csv
standard.csv: CSV text
$ cat standard.csv
Firstname,Lastname,Age,Comment
Mark,Watney,21,za_?__ g__l_ ja__
Melissa,Lewis,21.5,"Some, comment"
,,"21,5",Some; Comment
$ file dos.csv
dos.csv: CSV text
$ cat dos.csv
Firstname,Lastname,Age,Comment
Mark,Watney,21,za_?__ g__l_ ja__
Melissa,Lewis,21.5,"Some, comment"
,,"21,5",Some; Comment
$ file macintosh.csv
macintosh.csv: Non-ISO extended-ASCII text, with CR line terminators
$ cat macintosh.csv
,,"21,5",Some; Comment
8.1.11. Good Practices¶
Always specify:
delimiter=','
tocsv.DictReader()
object
quotechar='"'
tocsv.DictReader()
object
quoting=csv.QUOTE_ALL
tocsv.DictReader()
object
lineterminator='\n'
tocsv.DictReader()
object
encoding='utf-8'
toopen()
function (especially when working with Microsoft Excel)