11. Decision Tree

../_images/decision-tree.png

Fig. 11.2. Drzewo decyzyjne

from sklearn import tree

# Imput to the classifier
# as of Scikit-learn uses real-valued features, we use:
# 0: bumpy
# 1: smooth
#
# features = [
#    [140, 'smooth'],
#    [130, 'smooth'],
#    [150, 'bumpy'],
#    [170, 'bumpy'],
# ]

features = [
    [140, 1],
    [130, 1],
    [150, 0],
    [170, 0],
]

# Output that we want from classifier
# as of Scikit-learn uses real-valued features, we use:
# 0: apple
# 1: orange
#
# labels = ['apple', 'apple', 'orange', 'orange']
labels = [0, 0, 1, 1]


# create decision tree
clf = tree.DecisionTreeClassifier()

# fit - synonim to "find patterns in data"
clf = clf.fit(features, labels)

# use classifier to predict
output = clf.predict([[160, 0]])

print(output)
# should be: [1]
from sklearn import datasets
from sklearn import tree
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()

# Features
x = iris.data

# Labels
y = iris.target

# Split dataset into test and training set in half
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5)

# Create classifier
decision_tree = tree.DecisionTreeClassifier()

# Train classifier using training data
decision_tree.fit(x_train, y_train)

# Predict
predictions = decision_tree.predict(x_test)

# How accurate was classifier on testing set
# Because of some variation for each run, it might give different results
output = accuracy_score(y_test, predictions)
print(output)
# Output: 0.96

Note

Note identical API for classifiers!

11.1. Visualizing a Decision Tree

import numpy
from sklearn.datasets import load_iris
from sklearn import tree

iris = load_iris()

# select test indexes
# dataset is ordered so 0, 50, 100 is a first of each kind
test_idx = [0, 50, 100]


# training data
train_target = numpy.delete(iris.target, test_idx)
train_data = numpy.delete(iris.data, test_idx, axis=0)

# testing data
test_target = iris.target[test_idx]
test_data = iris.data[test_idx]

# create and train classifier
clf = tree.DecisionTreeClassifier()
clf.fit(train_data, train_target)


print(test_target)
# Output: [0 1 2]

output = clf.predict(test_data)
print(output)
# Output: [0 1 2]


print(test_data[0], test_target[0])
# output: [ 5.1  3.5  1.4  0.2] 0


print(iris.feature_names)
# output: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

print(iris.target_names)
# output: ['setosa' 'versicolor' 'virginica']


# Vizualization of Decision Tree Classifier
from sklearn.externals.six import StringIO
import pydotplus

dot_data = StringIO()
tree.export_graphviz(
    decision_tree=clf,
    out_file=dot_data,
    feature_names=iris.feature_names,
    class_names=iris.target_names,
    filled=True,
    rounded=True,
    impurity=True
)

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_pdf('/tmp/iris.pdf')
../_images/iris-decistion-tree.png

Fig. 11.3. Vizualization of Decision Tree Classifier

11.2. Zadania praktyczne

11.2.1. Prosta klasyfikacja na podstawie features i labels

Stwórz Classifier dla podanych poniżej danych testowych:

Gender Height Weight Foot Size
male 6.00 180 12
male 5.92 190 11
male 5.58 170 12
male 5.92 165 10
female 5.00 100 6
female 5.50 150 8
female 5.42 130 7
female 5.75 150 9

Jaką płeć ma osoba o parametrach?:

  • Height: 6
  • Weight: 130
  • Foot Size: 8

Zwizualizuj drzewo decyzyjne

Hints:
  • preprocessing.LabelEncoder()

11.2.2. Banknoty

Na podstawie datasetu 1372 banknotów z 5 wartościami numerycznymi pobranymi z obrazu odpowiedz czy dany banknot jest prawdziwy wykorzystując algorymt drzewa decyzyjnego.

  • variance of Wavelet Transformed image (continuous).
  • skewness of Wavelet Transformed image (continuous).
  • kurtosis of Wavelet Transformed image (continuous).
  • entropy of image (continuous).
  • class (integer).
Dataset: