Página de pruebas
- Noel class (10/06) https://drive.google.com/drive/folders/1TW494XF-laGGJiLFApz8bJfMstG4Yqk_
DT is an predictive algorithm that build models in the form of a tree structure, that is composed of a series of branching Boolean tests (tests for which the answer is true or false). The princpel is to use these boolean tests to split the data into smaller and smaller subsets to identify patterns that can be used for prediction. [Noel Cosgrave slides]
In a DT, each internal node represents a "test" on an attribute (e.g. whether a coin flip comes up heads or tails), each branch represents the outcome of the test, and each leaf node represents a class label (decision taken after computing all attributes). The paths from root to leaf are called decision rules/classification rules. https://en.wikipedia.org/wiki/Decision_tree
- A dataset can have many possible decision trees
- In practice, we want small & accurate trees
- What is the inductive bias in Decision Tree learning?
- Shorter trees are preferred over longer trees: Smaller trees are more general, usually more accurate, and easier to understand by humans (and to communicate!)
- Prefer trees that place high information gain attributes close to the root
- More succinct hypotheses are preferred. Why?
- There are fewer succinct hypotheses than complex ones
- If a succinct hypothesis explains the data this is unlikely to be a coincidence
- However, not every succinct hypothesis is a reasonable one.
The model for predicting the future success of a movie can be represented as a simple tree:
This decision tree is used to decide whether or not to provide a loan to a customer
This is a basic but nice explanation of the algorithm.
The algorithm described above is something like this:
- We first need to determine which will be the Root:
- The attribute (Chest pain, Good blood circulation, Blocked arteries) that determines better whether a Patient has Heart Disease or not will be chosen as the Root.
- To do so, we need to evaluate the three attributes by calculating what is known as Impurity, which is a measure of how bad the Attribute is able to separate (determine) our label attribute (Heart Disease in our case).
- There are many ways to measure Impurity, one of the popular ones is to calculate the Gini impurity.
- So, let's calculate the Gini impurity for our «Chest pain» attribute:
- We look at chest pain for all 303 patients in our data:
- We calculate the Gini impurity for each branch (True: The patient has chest pain | False: The patient does not have chest pain):
- For the True branch:
- For the False branch:
- Good Blood Circulation
- Blocked Arteries
- Then, we follow the same method to determine the following nodes
- A node becomes a leaf when the Gini impurity for a node before is lower than the Gini impurity after using a new attribute
- At the end, our DT looks like this:
- For numeric data:
- Ex.: Patient weight
- Ranked data and Multiple choice data:
- Ranked data: For example, "Rank my jokes on a scale of 1 to 4".
- Multiple choice data: For example, "Which color do you like: red, blue or green".
Decision Trees, Part 2 - Feature Selection and Missing Data: https://www.youtube.com/watch?v=wpNl-JwwplA
Regression Trees: https://www.youtube.com/watch?v=g9c66TUylZ4
The ID3 algorithm
ID3 (Quinlan, 1986) is an early algorithm for learning Decision Trees. The learning of the tree is top-down. The algorithm is greedy, it looks at a single attribute and gain in each step. This may fail when a combination of attributes is needed to improve the purity of a node.
At each split, the question is "which attribute should be tested next? Which logical test gives us more information?". This is determined by the measures of entropy and information gain. These are discussed later.
A new decision node is then created for each outcome of the test and examples are partitioned according to this value. The process is repeated for each new node until all the examples are classified correctly or there are no attributes left.
The C5.0 algorithm
C5.0 (Quinlan, 1993) is a refinement of C4.5 which in itself improved upon ID3. It is the industry standard for producing decision trees. It performs well for most problems out-of-the-box. Unlike other machine-learning techniques, it has high explanatory power, it can be understood and explained.
Example in RapidMiner