|
|
| (96 intermediate revisions by the same user not shown) |
| Line 1: |
Line 1: |
| − | ==Decision Trees==
| |
| − | Noel's lecture (10/06): https://drive.google.com/drive/folders/1TW494XF-laGGJiLFApz8bJfMstG4Yqk_
| |
| | | | |
| − | StatQuest: https://www.youtube.com/watch?v=7VeUPuFGJHk https://www.youtube.com/watch?v=wpNl-JwwplA https://www.youtube.com/watch?v=g9c66TUylZ4
| |
| − |
| |
| − |
| |
| − | [[File:Decision_Trees_terminology1.png|400px|thumb|right|]]
| |
| − |
| |
| − | [[File:Decision_Trees_terminology2.png|400px|thumb|right|]]
| |
| − |
| |
| − | [[File:Decision_Trees_terminology3.png|400px|thumb|right|]]
| |
| − |
| |
| − |
| |
| − | DT is an predictive algorithm that build models in the form of a tree structure, that is composed of a series of branching Boolean tests (tests for which the answer is true or false). The princpel is to use these boolean tests to split the data into smaller and smaller subsets to identify patterns that can be used for prediction. [Noel Cosgrave slides]
| |
| − |
| |
| − | In a DT, each '''internal node''' represents a "test" on an attribute (e.g. whether a coin flip comes up heads or tails), each '''branch''' represents the outcome of the test, and each '''leaf node''' represents a class label (decision taken after computing all attributes). The paths from root to leaf are called '''decision rules/classification rules'''. https://en.wikipedia.org/wiki/Decision_tree
| |
| − |
| |
| − |
| |
| − | <br />
| |
| − | '''A dataset can have many possible decision trees'''
| |
| − |
| |
| − | In practice, we want small & accurate trees
| |
| − |
| |
| − | * What is the inductive bias in Decision Tree learning?
| |
| − | :* Shorter trees are preferred over longer trees: Smaller trees are more general, usually more accurate, and easier to understand by humans (and to communicate!)
| |
| − |
| |
| − | :* Prefer trees that place high information gain attributes close to the root
| |
| − |
| |
| − |
| |
| − | * More succinct hypotheses are preferred. Why?
| |
| − | :* There are fewer succinct hypotheses than complex ones
| |
| − | :* If a succinct hypothesis explains the data this is unlikely to be a coincidence
| |
| − | :* However, not every succinct hypothesis is a reasonable one.
| |
| − |
| |
| − |
| |
| − | <br />
| |
| − | '''Example 1:'''
| |
| − |
| |
| − | A decision tree model can be used to decide whether or not to provide a loan to a customer (wheather or not a custoer is likely to pay a loan)
| |
| − |
| |
| − | {|
| |
| − | <!-- |[[File:DecisionTree-NoelCosgraveSlides_3.png|350px|thumb|center|]] -->
| |
| − | |[[File:Decision_tree_ex1.jpeg|550px|thumb|center|]]
| |
| − | | style="vertical-align: text-top" |
| |
| − | * <math>16</math> training examples
| |
| − | * The <math>x/y</math> values mean that <math>x</math> out of <math>y</math> training examples that reach this leaf node has the class of the leaf. This is the confidence
| |
| − | * The <math>x</math> value is the support count
| |
| − | * <math>x/total</math> training examples is the support
| |
| − | |}
| |
| − |
| |
| − |
| |
| − | <br />
| |
| − | '''Example 2:'''
| |
| − |
| |
| − | A model for predicting the future success of a movie:
| |
| − |
| |
| − | {|
| |
| − | |
| |
| − | {| class="wikitable"
| |
| − | !Movie
| |
| − | !Number of celebrities
| |
| − | !Budget
| |
| − | !Success (True label)
| |
| − | !Success (Predicted label)
| |
| − | |- style="background-color:#ff794d"
| |
| − | |Movie 1
| |
| − | |low
| |
| − | |hight
| |
| − | |boxoffice flop
| |
| − | |boxoffice flop
| |
| − | |- style="background-color:#ff794d"
| |
| − | |Movie 2
| |
| − | |low
| |
| − | |low
| |
| − | |boxoffice flop
| |
| − | |boxoffice flop
| |
| − | |- style="background-color:#ff794d"
| |
| − | |Movie 3
| |
| − | |low
| |
| − | |hight
| |
| − | |boxoffice flop
| |
| − | |boxoffice flop
| |
| − | |- style="background-color:#ff794d"
| |
| − | |Movie 4
| |
| − | |low
| |
| − | |low
| |
| − | |boxoffice flop
| |
| − | |boxoffice flop
| |
| − | |- style="background-color:#ff794d"
| |
| − | |Movie 5
| |
| − | |low
| |
| − | |hight
| |
| − | |boxoffice flop
| |
| − | |boxoffice flop
| |
| − | |- style="background-color:#ff794d"
| |
| − | |Movie 6
| |
| − | |low
| |
| − | |low
| |
| − | |boxoffice flop
| |
| − | |boxoffice flop
| |
| − | |- style="background-color:#ff794d"
| |
| − | |Movie 7
| |
| − | |low
| |
| − | |hight
| |
| − | |boxoffice flop
| |
| − | |boxoffice flop
| |
| − | |- style="background-color:#ff794d"
| |
| − | |Movie 8
| |
| − | |low
| |
| − | |low
| |
| − | |boxoffice flop
| |
| − | |boxoffice flop
| |
| − | |- style="background-color:#ff794d"
| |
| − | |Movie 9
| |
| − | |low
| |
| − | |hight
| |
| − | |boxoffice flop
| |
| − | |boxoffice flop
| |
| − | |- style="background-color:#ff794d"
| |
| − | |Movie 10
| |
| − | |low
| |
| − | |low
| |
| − | |boxoffice flop
| |
| − | |boxoffice flop
| |
| − | |- style="background-color:#66a3ff"
| |
| − | |Movie 11
| |
| − | |hight
| |
| − | |hight
| |
| − | |mainstream hit
| |
| − | |mainstream hit
| |
| − | |- style="background-color:#66a3ff"
| |
| − | |Movie 12
| |
| − | |hight
| |
| − | |hight
| |
| − | |mainstream hit
| |
| − | |mainstream hit
| |
| − | |- style="background-color:#66a3ff"
| |
| − | |Movie 13
| |
| − | |hight
| |
| − | |hight
| |
| − | |mainstream hit
| |
| − | |mainstream hit
| |
| − | |- style="background-color:#66a3ff"
| |
| − | |'''Movie 14'''
| |
| − | |'''low'''
| |
| − | |'''hight'''
| |
| − | |'''mainstream hit'''
| |
| − | |style="background-color:#ff794d"|'''boxoffice flop'''
| |
| − | |- style="background-color:#66a3ff"
| |
| − | |Movie 15
| |
| − | |hight
| |
| − | |hight
| |
| − | |mainstream hit
| |
| − | |mainstream hit
| |
| − | |- style="background-color:#66a3ff"
| |
| − | |Movie 16
| |
| − | |hight
| |
| − | |hight
| |
| − | |mainstream hit
| |
| − | |mainstream hit
| |
| − | |- style="background-color:#66a3ff"
| |
| − | |Movie 17
| |
| − | |hight
| |
| − | |hight
| |
| − | |mainstream hit
| |
| − | |mainstream hit
| |
| − | |- style="background-color:#66a3ff"
| |
| − | |Movie 18
| |
| − | |hight
| |
| − | |hight
| |
| − | |mainstream hit
| |
| − | |mainstream hit
| |
| − | |- style="background-color:#66a3ff"
| |
| − | |Movie 19
| |
| − | |hight
| |
| − | |hight
| |
| − | |mainstream hit
| |
| − | |mainstream hit
| |
| − | |- style="background-color:#66a3ff"
| |
| − | |Movie 20
| |
| − | |hight
| |
| − | |hight
| |
| − | |mainstream hit
| |
| − | |mainstream hit
| |
| − | |- style="background-color:#00e6b0"
| |
| − | |Movie 21
| |
| − | |hight
| |
| − | |low
| |
| − | |critical success
| |
| − | |critical success
| |
| − | |- style="background-color:#00e6b0"
| |
| − | |Movie 22
| |
| − | |hight
| |
| − | |low
| |
| − | |critical success
| |
| − | |critical success
| |
| − | |- style="background-color:#00e6b0"
| |
| − | |Movie 23
| |
| − | |hight
| |
| − | |low
| |
| − | |critical success
| |
| − | |critical success
| |
| − | |- style="background-color:#00e6b0"
| |
| − | |'''Movie 24'''
| |
| − | |'''low'''
| |
| − | |'''hight'''
| |
| − | |'''critical success'''
| |
| − | |style="background-color:#ff794d"|'''boxoffice flop'''
| |
| − | |- style="background-color:#00e6b0"
| |
| − | |Movie 25
| |
| − | |hight
| |
| − | |low
| |
| − | |critical success
| |
| − | |critical success
| |
| − | |- style="background-color:#00e6b0"
| |
| − | |'''Movie 26'''
| |
| − | |'''hight'''
| |
| − | |'''hight'''
| |
| − | |'''critical success'''
| |
| − | |style="background-color:#66a3ff"|'''mainstream hit'''
| |
| − | |- style="background-color:#00e6b0"
| |
| − | |Movie 27
| |
| − | |hight
| |
| − | |low
| |
| − | |critical success
| |
| − | |critical success
| |
| − | |- style="background-color:#00e6b0"
| |
| − | |Movie 28
| |
| − | |hight
| |
| − | |low
| |
| − | |critical success
| |
| − | |critical success
| |
| − | |- style="background-color:#00e6b0"
| |
| − | |Movie 29
| |
| − | |hight
| |
| − | |low
| |
| − | |critical success
| |
| − | |critical success
| |
| − | |- style="background-color:#00e6b0"
| |
| − | |Movie 30
| |
| − | |hight
| |
| − | |low
| |
| − | |critical success
| |
| − | |critical success
| |
| − | |}
| |
| − | |
| |
| − | [[File:DecisionTree-NoelCosgraveSlides_2.png|500px|thumb|center|]]
| |
| − |
| |
| − | [[File:DecisionTree-NoelCosgraveSlides_1.png|500px|thumb|center|]]
| |
| − | |}
| |
| − |
| |
| − |
| |
| − |
| |
| − | <br />
| |
| − | ===The algorithm===
| |
| − |
| |
| − |
| |
| − | <br />
| |
| − | ====Basic explanation of the algorithm====
| |
| − | https://www.youtube.com/watch?v=7VeUPuFGJHk
| |
| − |
| |
| − |
| |
| − | This is a basic but nice explanation of the algorithm.
| |
| − |
| |
| − |
| |
| − | In this example, we want to create a tree that uses '''chest pain''', '''good blood circulation''', and '''blocked artery status''' to predict whether or not a patient has heart disease:
| |
| − |
| |
| − |
| |
| − | <center>
| |
| − | <div style="width: 420pt;">{{#ev:youtube|https://www.youtube.com/watch?v=7VeUPuFGJHk|||||start=110}}
| |
| − | </div>
| |
| − | </center>
| |
| − |
| |
| − |
| |
| − | [[File:Decision_tree-heart_disease_example.png|600px|thumb|center|Taken from https://www.youtube.com/watch?v=7VeUPuFGJHk]]
| |
| − |
| |
| − |
| |
| − | '''The algorithm to build the model is based on the following steps:'''
| |
| − |
| |
| − | * '''We first need to determine which will be the Root:'''
| |
| − |
| |
| − | :* The attribute (Chest pain, Good blood circulation, Blocked arteries) that determines better whether a Patient has '''Heart Disease''' or not will be chosen as the Root.
| |
| − |
| |
| − | :* To do so, we need to evaluate the three attributes by calculating what is known as '''Impurity''', which is a measure of how bad the Attribute is able to separate (determine) our label attribute (Heart Disease in our case).
| |
| − |
| |
| − | :*There are many ways to measure Impurity, one of the popular ones is to calculate the '''Gini''' impurity.
| |
| − |
| |
| − |
| |
| − | :* '''So, let's calculate the Gini impurity for our «Chest pain» attribute:'''
| |
| − |
| |
| − | ::* We look at chest pain for all 303 patients in our data:
| |
| − |
| |
| − | [[File:Decision_trees-StatQuest1.png|600px|thumb|center|Taken from https://www.youtube.com/watch?v=7VeUPuFGJHk]]
| |
| − |
| |
| − |
| |
| − | ::* We calculate the Gini impurity for each branch (True: The patient has chest pain | False: The patient does not have chest pain):
| |
| − |
| |
| − | :::<div style='font-size:15pt'><math>\color{blue}{
| |
| − | Gini\ Impurity = 1 - \sum_{i=1}^{N} P_{i}^{2}
| |
| − | }
| |
| − | </math>
| |
| − | </div>
| |
| − |
| |
| − |
| |
| − | ::* For the '''True''' branch:
| |
| − |
| |
| − | :::<math>
| |
| − | Gini\ Impurity_{True} = 1 - (Probability\ of\ Yes)^{2} + (Probability\ of\ No)^{2}
| |
| − | </math>
| |
| − |
| |
| − | :::<math>
| |
| − | Gini\ Impurity_{True} = 1 - \biggl(\frac{105}{105 + 39}\biggl)^2 - \biggl(\frac{39}{105 + 39}\biggl)^2 = 0.395
| |
| − | </math>
| |
| − |
| |
| − |
| |
| − | ::* For the '''False''' branch:
| |
| − |
| |
| − | :::<math>
| |
| − | Gini\ Impurity_{False} = 0.336
| |
| − | </math>
| |
| − |
| |
| − |
| |
| − | ::*<div style='font-size:13pt'><math>\color{blue}{
| |
| − | Total\ Gini\ Impurity = Weighted\ average\ of\ Gini\ impurities\ for\ the\ leaf\ nodes\ (branches)}
| |
| − | </math></div>
| |
| − |
| |
| − | :::<math>
| |
| − | Gini\ Impurity_{Chest pain} = \biggl( \frac{144}{144 + 159} \biggl)0.395\ +\ \biggl( \frac{159}{144 + 159} \biggl)0.336 = 0.364
| |
| − | </math>
| |
| − |
| |
| − |
| |
| − | :* '''Good Blood Circulation'''
| |
| − | ::<math>
| |
| − | Gini\ Impurity_{Good\ Blood\ Circulation} = 0.360
| |
| − | </math>
| |
| − |
| |
| − |
| |
| − | :* '''Blocked Arteries'''
| |
| − | ::<math>
| |
| − | Gini\ Impurity_{Blocked\ Arteries} = 0.381
| |
| − | </math>
| |
| − |
| |
| − |
| |
| − | [[File:Decision_trees-StatQuest2.png|600px|thumb|center|Taken from https://www.youtube.com/watch?v=7VeUPuFGJHk]]
| |
| − |
| |
| − |
| |
| − | * '''Then, we follow the same method to determine the following nodes'''
| |
| − |
| |
| − |
| |
| − | * '''A node becomes a leaf when the Gini impurity for a node before is lower than the Gini impurity after using a new attribute'''
| |
| − |
| |
| − | [[File:Decision_trees-StatQuest3.png|600px|thumb|center|Taken from https://www.youtube.com/watch?v=7VeUPuFGJHk]]
| |
| − |
| |
| − |
| |
| − | * '''At the end, our DT looks like this:'''
| |
| − |
| |
| − | [[File:Decision_trees-StatQuest4.png|600px|thumb|center|Taken from https://www.youtube.com/watch?v=7VeUPuFGJHk]]
| |
| − |
| |
| − |
| |
| − |
| |
| − | * '''For numeric data:'''
| |
| − | : Ex.: Patient weight
| |
| − |
| |
| − |
| |
| − | * '''Ranked data and Multiple choice data:'''
| |
| − | : Ranked data: For example, "Rank my jokes on a scale of 1 to 4".
| |
| − | : Multiple choice data: For example, "Which color do you like: red, blue or green".
| |
| − |
| |
| − |
| |
| − |
| |
| − | <br /> <br />
| |
| − | Decision Trees, Part 2 - Feature Selection and Missing Data: https://www.youtube.com/watch?v=wpNl-JwwplA
| |
| − |
| |
| − |
| |
| − | <br />
| |
| − | Regression Trees: https://www.youtube.com/watch?v=g9c66TUylZ4
| |
| − |
| |
| − |
| |
| − | <br />
| |
| − |
| |
| − | ====Algorithms addressed in Noel s Lecture====
| |
| − |
| |
| − |
| |
| − | <br />
| |
| − | =====The ID3 algorithm=====
| |
| − | ID3 (Quinlan, 1986) is an early algorithm for learning Decision Trees. The learning of the tree is top-down. The algorithm is greedy, it looks at a single attribute and gain in each step. This may fail when a combination of attributes is needed to improve the purity of a node.
| |
| − |
| |
| − |
| |
| − | At each split, the question is "which attribute should be tested next? Which logical test gives us more information?". This is determined by the measures of '''entropy''' and '''information gain'''. These are discussed later.
| |
| − |
| |
| − |
| |
| − | A new decision node is then created for each outcome of the test and examples are partitioned according to this value. The process is repeated for each new node until all the examples are classified correctly or there are no attributes left.
| |
| − |
| |
| − |
| |
| − | <br />
| |
| − | =====The C5.0 algorithm=====
| |
| − | C5.0 (Quinlan, 1993) is a refinement of C4.5 which in itself improved upon ID3. '''It is the industry standard for producing decision trees'''. It performs well for most problems out-of-the-box. Unlike other machine-learning techniques, it has high explanatory power, it can be understood and explained.
| |
| − |
| |
| − | .
| |
| − |
| |
| − | .
| |
| − |
| |
| − | .
| |
| − |
| |
| − |
| |
| − | <br />
| |
| − |
| |
| − | ===Example in RapidMiner===
| |
| − | <div style="text-align: center;">
| |
| − |
| |
| − | [[File:Decision_tree_RapidMiner_example-Iris_data.mp4|1500px|thumb|center|]]
| |
| − |
| |
| − |
| |
| − | {| style="margin: 0 auto;" |
| |
| − | |[[File:decision_tree_RapidMiner_example-Iris_data3.png|300px|thumb|center|]]
| |
| − | |[[File:decision_tree_RapidMiner_example-Iris_data2.png|500px|thumb|center|]]
| |
| − | |-
| |
| − | |<img style='width:300px' src='https://upload.wikimedia.org/wikipedia/commons/7/78/Petal-sepal.jpg' />
| |
| − | |[[File:decision_tree_RapidMiner_example-Iris_data1.png|500px|thumb|center|]]
| |
| − | |}
| |
| − |
| |
| − |
| |
| − | <pdf width="810" height="500">File:Decision_tree_Example_RapidMiner.pdf</pdf>
| |
| − |
| |
| − | This is the guide provided by Noel [[File:Decision_tree_Example_RapidMiner.pdf]]
| |
| − |
| |
| − | </div>
| |
| − |
| |
| − | <br />
| |