6.3. Pandas Workflow¶
6.3.1. Rationale¶

6.3.2. Working with Excel file¶
import pandas as pd
df = pd.read_excel(
io='filename.xls',
sheet_name=['Sheet 1'],
encoding='utf-8',
skiprows=1,
skip_blank_lines=True,
parse_dates=['from', 'to'],
)
# Rename Columns to match database columns
df.rename(columns={
'from': 'date_start',
'to': 'date_end',
}, inplace=True)
# Drop all records where "Name" is empty (NaN)
df.dropna(subset=['name'], how='all', inplace=True)
# Add column ``blacklist`` with data
df['blacklist'] = [True, False, True, False]
# Change NaN to 0
df.fillna(0, inplace=True)
# Select columns
columns = ['name', 'date_start', 'date_end', 'blacklist']
# Print results
print( df[columns] )
6.3.3. Working with dirty CSV¶
DATA = 'https://raw.githubusercontent.com/AstroMatt/book-python/master/_data/csv/iris-dirty.csv'
nrows, nfeatures, *species = pd.read_csv(DATA, nrows=0).columns
species = dict(enumerate(species))
df = pd.read_csv(url, skiprows=1, names=['sepal_length', 'sepal_width',
'petal_length', 'petal_width', 'species'])
df['species'].replace(species, inplace=True)
df.plot(kind='density')
6.3.4. Working with CSV¶
import pandas as pd
DATA = 'https://raw.githubusercontent.com/AstroMatt/book-python/master/_data/csv/iris-clean.csv'
# Read data
df = pd.read_csv(DATA)
# Rename Columns
df.columns = [
'Sepal length',
'Sepal width',
'Petal length',
'Petal width',
'Species'
]
# Get first ``n`` records
df.head(5)
# Sepal length Sepal width Petal length Petal width Species
# 0 5.1 3.5 1.4 0.2 0
# 1 4.9 3.0 1.4 0.2 0
# 2 4.7 3.2 1.3 0.2 0
# 3 4.6 3.1 1.5 0.2 0
# 4 5.0 3.6 1.4 0.2 0
# Get last ``n`` records
df.tail(3)
# Sepal length Sepal width Petal length Petal width Species
# 147 6.5 3.0 5.2 2.0 2
# 148 6.2 3.4 5.4 2.3 2
# 149 5.9 3.0 5.1 1.8 2
# Change column Species values
df['Species'].replace({
0: 'setosa',
1: 'versicolor',
2: 'virginica'
}, inplace=True)
# Shuffle columns and reset indexes (drop column with old index)
df = df.sample(frac=1.0).reset_index(drop=True)
# Sepal length Sepal width ... Petal width Species
# 0 5.0 2.0 ... 1.0 versicolor
# 1 6.4 2.7 ... 1.9 virginica
# 2 5.6 3.0 ... 1.5 versicolor
# 3 5.7 2.6 ... 1.0 versicolor
# 4 6.4 3.1 ... 1.8 virginica
# 5 4.6 3.6 ... 0.2 setosa
# 6 5.9 3.0 ... 1.5 versicolor
# Descriptive Statistics
df.describe()
# Sepal length Sepal width Petal length Petal width
# count 150.000000 150.000000 150.000000 150.000000
# mean 5.843333 3.057333 3.758000 1.199333
# std 0.828066 0.435866 1.765298 0.762238
# min 4.300000 2.000000 1.000000 0.100000
# 25% 5.100000 2.800000 1.600000 0.300000
# 50% 5.800000 3.000000 4.350000 1.300000
# 75% 6.400000 3.300000 5.100000 1.800000
# max 7.900000 4.400000 6.900000 2.500000
Function |
Description |
---|---|
|
Number of non-null observations |
|
Sum of values |
|
Mean of values |
|
Mean absolute deviation |
|
Arithmetic median of values |
|
Minimum |
|
Maximum |
|
Mode |
|
Absolute Value |
|
Product of values |
|
Unbiased standard deviation |
|
Unbiased variance |
|
Unbiased standard error of the mean |
|
Unbiased skewness (3rd moment) |
|
Unbiased kurtosis (4th moment) |
|
Sample quantile (value at %) |
|
Cumulative sum |
|
Cumulative product |
|
Cumulative maximum |
|
Cumulative minimum |
6.3.5. Plots¶
6.3.6. Hist¶
import matplotlib.pyplot as plt
import pandas as pd
DATA = 'https://raw.githubusercontent.com/AstroMatt/book-python/master/_data/csv/iris-clean.csv'
df = pd.read_csv(DATA)
df.hist()
plt.show()

Figure 6.8. Visualization using hist¶
6.3.7. Density¶
import matplotlib.pyplot as plt
import pandas as pd
DATA = 'https://raw.githubusercontent.com/AstroMatt/book-python/master/_data/csv/iris-clean.csv'
df = pd.read_csv(DATA)
df.plot(kind='density', subplots=True, layout=(2,2), sharex=False)
plt.show()

Figure 6.9. Visualization using density¶
6.3.8. Box¶
import matplotlib.pyplot as plt
import pandas as pd
DATA = 'https://raw.githubusercontent.com/AstroMatt/book-python/master/_data/csv/iris-clean.csv'
df = pd.read_csv(DATA)
df.plot(kind='box', subplots=True, layout=(2,2), sharex=False, sharey=False)
plt.show()

Figure 6.10. Visualization using density¶
6.3.9. Scatter matrix¶
The in
pandas
version0.22
plotting module has been moved frompandas.tools.plotting
topandas.plotting
As of version
0.19
, thepandas.plotting
library did not exist
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import scatter_matrix
DATA = 'https://raw.githubusercontent.com/AstroMatt/book-python/master/_data/csv/iris-clean.csv'
df = pd.read_csv(DATA)
scatter_matrix(df)
plt.show()

Figure 6.11. Visualization using density¶