Instances (input): the examples the learner sees
Concept: a function mapping inputs to outputs
Target Concept: the answer, i.e. the actual function we want to learn
Hypothesis Class: the set of all functions we are willing to consider
Sample: the training set
Candidate: a concept that might be the target concept
Testing Set: held-out examples used to evaluate the candidate
ID3
Loop:
- A <- best attribute
- Assign A as the decision attribute for node
- For each value of A, create a descendant of node
- Sort training examples to the leaves
- If examples are perfectly classified, stop;
- else iterate over the leaves
best attribute: the one with the highest information gain, i.e. the largest expected reduction in entropy from splitting on it
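A minimal sketch of that attribute-selection computation (the helper names `entropy` and `information_gain` are my own, not from any library):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(examples, labels, attribute_index):
    """Gain(S, A) = Entropy(S) - sum over values v of |Sv|/|S| * Entropy(Sv)."""
    total = len(labels)
    gain = entropy(labels)
    # partition the labels by the attribute's value
    by_value = {}
    for x, y in zip(examples, labels):
        by_value.setdefault(x[attribute_index], []).append(y)
    for subset in by_value.values():
        gain -= len(subset) / total * entropy(subset)
    return gain
```

ID3 evaluates `information_gain` for every remaining attribute and splits on the maximizer; an attribute that separates the classes perfectly gets the full entropy of the parent as its gain.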
ID3 bias
Inductive bias
Restriction bias: restricts the hypothesis space the learner considers (here, decision trees)
Preference bias (this is ID3's inductive bias): which hypotheses in that space it prefers
- good splits at the top
- correct over incorrect
- shorter trees
In sklearn, we can check the Decision Trees documentation.
def classify(features_train, labels_train):
    ### your code goes here -- should return a trained decision tree classifier
    from sklearn.tree import DecisionTreeClassifier
    clf = DecisionTreeClassifier()
    clf.fit(features_train, labels_train)
    return clf
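A quick sanity check of `classify` on toy data (the four points below are made up for illustration; two separable clusters):

```python
from sklearn.tree import DecisionTreeClassifier

def classify(features_train, labels_train):
    # return a trained decision tree classifier
    clf = DecisionTreeClassifier()
    clf.fit(features_train, labels_train)
    return clf

# toy 2-D points: class 0 near the origin, class 1 far away (made-up data)
features = [[0, 0], [0, 1], [5, 5], [6, 5]]
labels = [0, 0, 1, 1]
clf = classify(features, labels)
print(clf.predict([[0.5, 0.5], [5.5, 5.5]]))
```

On cleanly separable training data like this, the tree recovers the obvious split and the two query points fall on opposite sides of it.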
import sys
from class_vis import prettyPicture
from prep_terrain_data import makeTerrainData
import numpy as np
import pylab as pl
features_train, labels_train, features_test, labels_test = makeTerrainData()
#################################################################################
########################## DECISION TREE #################################
#### your code goes here
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
clf.fit(features_train, labels_train)
acc = clf.score(features_test, labels_test)  ### accuracy on the test set
def submitAccuracies():
    return {"acc": round(acc, 3)}
criterion: the measure of split quality
splitter: the strategy used to choose the split at each node
max_depth: maximum depth of the tree
min_samples_split: minimum number of samples required to split an internal node
min_samples_leaf: minimum number of samples required in a leaf
Increasing min_samples_split reduces overfitting.
import sys
from class_vis import prettyPicture
from prep_terrain_data import makeTerrainData
import matplotlib.pyplot as plt
import numpy as np
import pylab as pl
features_train, labels_train, features_test, labels_test = makeTerrainData()
########################## DECISION TREE #################################
### your code goes here--now create 2 decision tree classifiers,
### one with min_samples_split=2 and one with min_samples_split=50
### compute the accuracies on the testing data and store
### the accuracy numbers to acc_min_samples_split_2 and
### acc_min_samples_split_50, respectively
from sklearn.tree import DecisionTreeClassifier
clf1 = DecisionTreeClassifier(min_samples_split=2)
clf1.fit(features_train, labels_train)
clf2 = DecisionTreeClassifier(min_samples_split=50)
clf2.fit(features_train, labels_train)
acc_min_samples_split_2 = clf1.score(features_test, labels_test)
acc_min_samples_split_50 = clf2.score(features_test, labels_test)
def submitAccuracies():
    return {"acc_min_samples_split_2": round(acc_min_samples_split_2, 3),
            "acc_min_samples_split_50": round(acc_min_samples_split_50, 3)}
In sklearn, the criterion defaults to "gini"; we can also use "entropy".
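Switching the criterion is a single constructor argument; a minimal side-by-side run (the XOR-style toy data below is made up, and `random_state` is fixed only for reproducibility):

```python
from sklearn.tree import DecisionTreeClassifier

# XOR-like toy data (made up for illustration)
X = [[0, 0], [1, 1], [0, 1], [1, 0]]
y = [0, 0, 1, 1]

scores = {}
for criterion in ("gini", "entropy"):
    clf = DecisionTreeClassifier(criterion=criterion, random_state=0)
    clf.fit(X, y)
    scores[criterion] = clf.score(X, y)
print(scores)
```

Both criteria rank candidate splits by impurity reduction; on most datasets they produce very similar trees, and on a tiny set like this both fit the training data perfectly.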