This introduction provides a stylized example of the capabilities of the R package data.tree. The code for the implementation of the (already a bit outdated) ID3 algorithm was written in less than an hour. This is possible because, thanks to the data.tree package, the implementation of the training algorithm follows the algorithm’s pseudo code almost line by line.

While preparing this example, I asked my nine-year-old daughter, “Anaïs, imagine you have a basket full of mushrooms. How would you go about finding out which ones are poisonous?”

“Dad, why would I want to know?” she replies, rolling her eyes. “No matter if they are poisonous or not, they aredis-gust-ing.”

“OK, fair enough”, I sigh, “wrong daughter.”

“Naomi”, I turn to my seven year old, “how could you find out if a mushroom is poisonous?”

“Well”, she says. “I’d just eat it. If I die, it’s poisonous.”

“But, come on, there must be a better way! Think, please!”

“Think? Can’t you see that I’mplaying? Ask mom if you must!” She groans. So do I.

“Mushrooms? Where did you buy them?” my wife asks.

“Good question, but – … – it doesn’t matter.”

“No Sir, itdoesmatter!”

“Why?” I ask dubiously.

“I’d have to to know if they areorganic, obviously.”

Obviously. I play along, “Well, whatever, assume I found them in the forest, so they are all organic.”

“Aha!”, she says triumphantly. “Then it’s easy! If they are organic, they simplymustbe good for you. Hence, theycannotbe poisonous!”

## Introduction

### About this vignette

This introduction provides a stylized example of the capabilities of the R package data.tree. The code for this post was written in less than an hour. This is possible because, thanks to the data.tree package, the implementation of the training algorithm follows the algorithm’s pseudo code almost line by line.

The code is based on data.tree 0.1.6

### What is ID3?

If you are not a blind believer in organic food, and if you are older than 9, there is a non-zero probability that you’d tackle this problem differently than my dear family members. Still, there are dozens of different methods to choose from, but chances are you’d come across a class of models that are called classification trees. These models let you *classify* observations (e.g. things, outcomes) according to the observations’ qualities, called *features*. Essentially, all of these models consist of creating a *tree*, where each *node* acts as a *router*. You insert your mushroom *instance* at the *root* of the tree, and then, depending on the mushroom’s *features* (size, points, color, etc.), you follow along a different *path*, until a *leaf* node spits out your mushroom’s *class*, i.e. whether it’s edibel or not.

You might have guessed already that there are two different steps involved in using such a model: *training* (i.e. constructing the tree), and *predicting* (i.e. using the tree to predict whether a given mushroom is poisonous). This vignette provides code to do both, using one of the very early algorithms to classify data according to discrete features: ID3.

We will not go into more details about ID3. You will find lots of documentation on this and more refined algorithms on the internet. For example, lecture notes that build on similar data can be found here.

Also, be assured that this example is by no means meant to be used in the real world. It is overly simplistic, and far too little data is used for training. Nor do we claim that this is a complete discussion of the ID3 algorithm, let alone classification models.

On the contrary, the only reason why we chose this example is to provide a simple to grasp application of the `data.tree`

package, in order to demonstrate how easy it is to build hierarchical models with it.

## Feature Selection

As mentioned above, during the prediction step, each node routes our mushroom according to a feature. But how do we chose the feature? Should we first separate our set according to color or size? That is where classification models differ.

In ID3, we pick at each node the feature with the highest *Information Gain*. In a nutshell, this is the feature which splits the sample in the possibly *purest* subsets. For example, in the case of mushrooms, “dots” might be a more sensible feature than “organic”.

### Purity and Entropy

We define a subset to be *completely pure* if it contains only a *single* class. For example, if a subset contains only poisonous mushrooms, it is completely pure. In R, assuming that the last column contains the class (i.e. the category to be predicted), this can be written as:

1 2 3 |
IsPure <- function(data) { length(unique(data[,ncol(data)])) == 1 } |

The *entropy* is a measure of the purity of a dataset.

1 2 3 4 5 |
Entropy <- function( vls ) { res <- vls/sum(vls) * log2(vls/sum(vls)) res[vls == 0] <- 0 -sum(res) } |

If a dataset is *completely pure*, then it has entropy 0:

1 |
Entropy(c(10, 0)) |

1 |
## [1] 0 |

1 |
Entropy(c(0, 10)) |

1 |
## [1] 0 |

If a set contains 5 poisonous and 5 edible mushrooms, the entropy becomes 1, as the purity is at its lowest:

1 |
Entropy(c(5, 5)) |

1 |
## [1] 1 |

We can plot the entropy as a function of the number of edible mushrooms in a set of, say, 100 mushrooms:

1 2 3 |
entropy <- function(edible) Entropy(c(edible, 100 - edible)) entropy <- Vectorize(entropy) curve( entropy, from = 0, to = 100, xname = 'edible') |

### Information Gain

Mathematically, the information gain IG is defined as:

\[ IG(T,a) = H(T)-\sum_{v\in vals(a)}\frac{|\{\textbf{x}\in T|x_a=v\}|}{|T|} \cdot H(\{\textbf{x}\in T|x_a=v\}) \]

In words, the information gain measures the *difference* between the entropy *before the split*, and the weighted sum of the entropies *after the split*:

So, let’s rewrite that in R:

1 2 3 4 5 6 7 8 |
InformationGain <- function( tble ) { tble <- as.data.frame.matrix(tble) entropyBefore <- Entropy(colSums(tble)) s <- rowSums(tble) entropyAfter <- sum (s / sum(s) * apply(tble, MARGIN = 1, FUN = Entropy )) informationGain <- entropyBefore - entropyAfter return (informationGain) } |

For example, using the `mushroom`

data set:

1 2 3 4 |
library(data.tree) data(mushroom) tble <- table(mushroom[,c('color', 'edibility')]) tble |

1 2 3 4 5 |
## edibility ## color edible toxic ## brown 2 0 ## green 1 0 ## red 1 1 |

1 |
InformationGain(tble) |

1 |
## [1] 0.3219281 |

1 |
InformationGain(table(mushroom[,c('size', 'edibility')])) |

1 |
## [1] 0.1709506 |

1 |
InformationGain(table(mushroom[,c('points', 'edibility')])) |

1 |
## [1] 0.3219281 |

## The ID3 algorithm

### Pseudo code

We are all set for the ID3 training algorithm. We start with the entire training data, and with a root. Then:

- if the data-set is pure (e.g. all toxic), then
- construct a leaf having the name of the class (e.g. ‘toxic’)

- else
- choose the feature with the highest information gain (e.g. ‘color’)
- for each value of that feature (e.g. ‘red’, ‘brown’, ‘green’)
- take the subset of the data-set having that feature value
- construct a child node having the name of that feature value (e.g. ‘red’)
- call the algorithm recursively on the child node and the subset

### Implementation in R with the data.tree package

For the following implementation, we assume that the classifying features are in columns 1 to n-1, whereas the class (the edibility) is in the last column.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 |
TrainID3 <- function(node, data) { node$obsCount <- nrow(data) #if the data-set is pure (e.g. all toxic), then if (IsPure(data)) { #construct a leaf having the name of the pure feature (e.g. 'toxic') child <- node$AddChild(unique(data[,ncol(data)])) node$feature <- tail(names(data), 1) child$obsCount <- nrow(data) child$feature <- '' } else { #chose the feature with the highest information gain (e.g. 'color') ig <- sapply(colnames(data)[-ncol(data)], function(x) InformationGain( table(data[,x], data[,ncol(data)]) ) ) feature <- names(ig)[ig == max(ig)][1] node$feature <- feature #take the subset of the data-set having that feature value childObs <- split(data[,!(names(data) %in% feature)], data[,feature], drop = TRUE) for(i in 1:length(childObs)) { #construct a child having the name of that feature value (e.g. 'red') child <- node$AddChild(names(childObs)[i]) #call the algorithm recursively on the child and the subset TrainID3(child, childObs[[i]]) } } } |

### Training with data

We are ready to run the function:

1 2 3 |
tree <- Node$new("mushroom") TrainID3(tree, mushroom) print(tree, "feature", "obsCount") |

1 2 3 4 5 6 7 8 9 10 11 |
## levelName feature obsCount ## 1 mushroom color 5 ## 2 ¦--brown edibility 2 ## 3 ¦ °--edible 2 ## 4 ¦--green edibility 1 ## 5 ¦ °--edible 1 ## 6 °--red size 2 ## 7 ¦--large edibility 1 ## 8 ¦ °--edible 1 ## 9 °--small edibility 1 ## 10 °--toxic 1 |

## Prediction

### The prediction method

Now, let’s predict some classes. For this, we need a predict function, which will route data through our tree:

1 2 3 4 5 |
Predict <- function(tree, features) { if (tree$children[[1]]$isLeaf) return (tree$children[[1]]$name) child <- tree$children[[features[[tree$feature]]]] return ( Predict(child, features)) } |

### Using the prediction method

And now we use it to predict:

1 2 3 4 |
Predict(tree, c(color = 'red', size = 'large', points = 'yes') ) |

1 |
## [1] "edible" |

Oops! Maybe *organic* wasn’t such a bad predictor, after all 🙂

Hi! Thanks for the article, it’s very useful.

I’m new in R and is giving me an error:

“Error in .subset2(x, i, exact = exact) :

attempt to select less than one element in integerOneIndex”

This happens because childObs is null.

At the leaf iteration the “data” represent only a column with data about the prediction and don’t enter on the “if clause” because in my case: unique(data[,ncol(data)]) ==2

There are 3 possible classes in my data, can this be a problem for this algorithm?

Is this a possible solution? (Is the if inside the first if)

TrainID3 <- function(node, data) {

node$obsCount <- nrow(data)

#if the data-set is pure (e.g. all toxic), then

if (IsPure(data) || ncol(data)==1 ) {

#construct a leaf having the name of the pure feature (e.g. 'toxic')

if (ncol(data)==1 ) {

child <- node$AddChild(head(unique(data[,ncol(data)]),1))

node$feature <- head(names(data), 1)

child$obsCount <- nrow(data)

child$feature <- ''}

else{

child <- node$AddChild(unique(data[,ncol(data)]))

node$feature <- tail(names(data), 1)

child$obsCount <- nrow(data)

child$feature <- ''

}

} else {

#calculate the information gain

ig <- sapply(colnames(data)[-ncol(data)],

function(x) InformationGain(

table(data[,x], data[,ncol(data)])

)

)

#chose the feature with the highest information gain (e.g. 'color')

#if more than one feature have the same information gain, then take

#the first one

feature <- names(which.max(ig))

node$feature <- feature

#take the subset of the data-set having that feature value

childObs <- split(data[ ,names(data) != feature, drop = FALSE],

data[ ,feature],

drop = TRUE)

for(i in 1:length(childObs)) {

#construct a child having the name of that feature value (e.g. 'red')

child <- node$AddChild(names(childObs)[i])

#call the algorithm recursively on the child and the subset

TrainID3(child, childObs[[i]])

}

}

}

It worked but it gives-me too many nodes.

Do you have functions to prune too?

Thank you for your time.

While unfortunately I do not have time to look at the details, I can confirm that there are Prune functions in data.tree. Check out

`?Prune`

.Note that the algorithm was implemented by me merely to illustrate one of the use cases of data.tree. For real world cases, you might consider more efficient algorithms, or to refine this. See for example the partykit package. Also, for more insight, you might check out over at crossvalidated (http://stats.stackexchange.com/)