Instances (the inputs)

Concept (a function mapping inputs to outputs)

Target Concept (the answer: the particular function we are trying to learn)

Hypothesis Class (the set of all functions we are willing to consider)

Sample (the training set)

Candidate (a concept that might be the target concept)

Testing Set

ID3

Loop:

  • A <-- the best attribute
  • Assign A as the decision attribute for node
  • For each value of A, create a descendant of node
  • Sort the training examples to the leaves
  • If the examples are perfectly classified, stop;
  • else iterate over the leaves (see the sketch after this list)
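A minimal, hedged sketch of this loop in Python. Everything here is illustrative, not from the notes: the dict-of-dicts tree representation, the id3 name, and the information_gain helper (defined right after the gain formula below) are all assumptions.

from collections import Counter

def id3(examples, attributes, target="label"):
    # examples: list of dicts, e.g. {"outlook": "sunny", "label": "no"}
    labels = [ex[target] for ex in examples]
    # Stop when the examples are perfectly classified (or no attributes
    # remain): return a leaf holding the majority label
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    # A <-- best attribute (highest information gain; helper defined below)
    A = max(attributes, key=lambda a: information_gain(examples, a, target))
    # For each value of A, create a descendant and sort examples to it
    node = {"attribute": A, "children": {}}
    remaining = [a for a in attributes if a != A]
    for value in set(ex[A] for ex in examples):
        subset = [ex for ex in examples if ex[A] == value]
        node["children"][value] = id3(subset, remaining, target)
    return node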

The best attribute is the one with the highest information gain:

$$GAIN(S, A) = Entropy(S) - \sum_{v}\frac{|S_v|}{|S|}\,Entropy(S_v)$$

$$Entropy(S) = -\sum_{v} p(v)\log_2 p(v)$$
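Translating the two formulas directly into Python (a sketch; entropy and information_gain are the illustrative helper names used by the ID3 sketch above):

import math
from collections import Counter

def entropy(labels):
    # Entropy(S) = -sum_v p(v) * log2 p(v)
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(examples, attribute, target="label"):
    # GAIN(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)
    labels = [ex[target] for ex in examples]
    remainder = 0.0
    for value in set(ex[attribute] for ex in examples):
        S_v = [ex[target] for ex in examples if ex[attribute] == value]
        remainder += (len(S_v) / len(examples)) * entropy(S_v)
    return entropy(labels) - remainder

For a fifty-fifty binary sample, entropy(["fast", "slow"]) returns 1.0, the maximum for two classes.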

ID3 bias

Inductive bias

Restriction bias: the hypothesis class itself (ID3 only ever considers decision trees)

Preference bias -- ID3's inductive bias: among trees consistent with the data, it prefers

  • trees that put good splits near the top
  • correct trees over incorrect ones
  • shorter trees over longer ones

In sklearn, we can check the Decision Trees documentation.

def classify(features_train, labels_train):
    ### your code goes here--should return a trained decision tree classifier
    from sklearn.tree import DecisionTreeClassifier
    clf = DecisionTreeClassifier()
    clf.fit(features_train, labels_train)
    return clf
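A quick usage sketch (the variable names come from the terrain exercise below):

clf = classify(features_train, labels_train)
pred = clf.predict(features_test)  # predicted labels for the test points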
import sys
from class_vis import prettyPicture
from prep_terrain_data import makeTerrainData

import numpy as np
import pylab as pl

features_train, labels_train, features_test, labels_test = makeTerrainData()

########################## DECISION TREE #################################

#### your code goes here

from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
clf.fit(features_train, labels_train)

### be sure to compute the accuracy on the test set
acc = clf.score(features_test, labels_test)

def submitAccuracies():
    return {"acc": round(acc, 3)}

DecisionTreeClassifier

criterion (the function that measures the quality of a split)

splitter (the strategy used to choose the split at each node)

max_depth (the maximum depth of the tree)

min_samples_split (the minimum number of samples required to split an internal node)

min_samples_leaf (the minimum number of samples required at a leaf node)

Increasing min_samples_split reduces overfitting.
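A hedged sketch using the other complexity-limiting parameters above (the values 4 and 5 are arbitrary, chosen only to show the API):

from sklearn.tree import DecisionTreeClassifier

# Limiting depth and requiring larger leaves both reduce model
# complexity, and hence the risk of overfitting
clf = DecisionTreeClassifier(max_depth=4, min_samples_leaf=5)
clf.fit(features_train, labels_train)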

import sys
from class_vis import prettyPicture
from prep_terrain_data import makeTerrainData

import matplotlib.pyplot as plt
import numpy as np
import pylab as pl

features_train, labels_train, features_test, labels_test = makeTerrainData()

########################## DECISION TREE #################################


### your code goes here--now create 2 decision tree classifiers,
### one with min_samples_split=2 and one with min_samples_split=50
### compute the accuracies on the testing data and store
### the accuracy numbers to acc_min_samples_split_2 and
### acc_min_samples_split_50, respectively

from sklearn.tree import DecisionTreeClassifier
clf1 = DecisionTreeClassifier(min_samples_split=2)
clf1.fit(features_train, labels_train)
clf2 = DecisionTreeClassifier(min_samples_split=50)
clf2.fit(features_train, labels_train)
acc_min_samples_split_2 = clf1.score(features_test, labels_test)
acc_min_samples_split_50 = clf2.score(features_test, labels_test)

def submitAccuracies():
    return {"acc_min_samples_split_2": round(acc_min_samples_split_2, 3),
            "acc_min_samples_split_50": round(acc_min_samples_split_50, 3)}

In sklearn, the default criterion is "gini"; we can also use "entropy".
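Both values are real options of DecisionTreeClassifier; the side-by-side construction here is just a sketch:

from sklearn.tree import DecisionTreeClassifier

clf_gini = DecisionTreeClassifier()                        # criterion="gini" is the default
clf_entropy = DecisionTreeClassifier(criterion="entropy")  # uses information gain, as above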
