ID3 Classification using data.tree 7

This introduction provides a stylized example of the capabilities of the R package data.tree. The code for the implementation of the (already a bit outdated) ID3 algorithm was written in less than an hour. This is possible because, thanks to the data.tree package, the implementation of the training algorithm follows the algorithm’s pseudo code almost line by line.

While preparing this example, I asked my nine-year-old daughter, “Anaïs, imagine you have a basket full of mushrooms. How would you go about finding out which ones are poisonous?”
“Dad, why would I want to know?” she replies, rolling her eyes. “No matter if they are poisonous or not, they are dis-gust-ing.”
“OK, fair enough”, I sigh, “wrong daughter.”
“Naomi”, I turn to my seven year old, “how could you find out if a mushroom is poisonous?”
“Well”, she says. “I’d just eat it. If I die, it’s poisonous.”
“But, come on, there must be a better way! Think, please!”
“Think? Can’t you see that I’m playing? Ask mom if you must!” She groans. So do I.
“Mushrooms? Where did you buy them?” my wife asks.
“Good question, but – … – it doesn’t matter.”
“No Sir, it does matter!”
“Why?” I ask dubiously.
“I’d have to to know if they are organic, obviously.”
Obviously. I play along, “Well, whatever, assume I found them in the forest, so they are all organic.”
“Aha!”, she says triumphantly. “Then it’s easy! If they are organic, they simply must be good for you. Hence, they cannot be poisonous!”


About this vignette

This introduction provides a stylized example of the capabilities of the R package data.tree. The code for this post was written in less than an hour. This is possible because, thanks to the data.tree package, the implementation of the training algorithm follows the algorithm’s pseudo code almost line by line.

The code is based on data.tree 0.1.6

What is ID3?

If you are not a blind believer in organic food, and if you are older than 9, there is a non-zero probability that you’d tackle this problem differently than my dear family members. Still, there are dozens of different methods to choose from, but chances are you’d come across a class of models that are called classification trees. These models let you classify observations (e.g. things, outcomes) according to the observations’ qualities, called features. Essentially, all of these models consist of creating a tree, where each node acts as a router. You insert your mushroom instance at the root of the tree, and then, depending on the mushroom’s features (size, points, color, etc.), you follow along a different path, until a leaf node spits out your mushroom’s class, i.e. whether it’s edibel or not.

You might have guessed already that there are two different steps involved in using such a model: training (i.e. constructing the tree), and predicting (i.e. using the tree to predict whether a given mushroom is poisonous). This vignette provides code to do both, using one of the very early algorithms to classify data according to discrete features: ID3.

We will not go into more details about ID3. You will find lots of documentation on this and more refined algorithms on the internet. For example, lecture notes that build on similar data can be found here.

Also, be assured that this example is by no means meant to be used in the real world. It is overly simplistic, and far too little data is used for training. Nor do we claim that this is a complete discussion of the ID3 algorithm, let alone classification models.

On the contrary, the only reason why we chose this example is to provide a simple to grasp application of the data.tree package, in order to demonstrate how easy it is to build hierarchical models with it.

Feature Selection

As mentioned above, during the prediction step, each node routes our mushroom according to a feature. But how do we chose the feature? Should we first separate our set according to color or size? That is where classification models differ.

In ID3, we pick at each node the feature with the highest Information Gain. In a nutshell, this is the feature which splits the sample in the possibly purest subsets. For example, in the case of mushrooms, “dots” might be a more sensible feature than “organic”.

Purity and Entropy

We define a subset to be completely pure if it contains only a single class. For example, if a subset contains only poisonous mushrooms, it is completely pure. In R, assuming that the last column contains the class (i.e. the category to be predicted), this can be written as:

The entropy is a measure of the purity of a dataset.

If a dataset is completely pure, then it has entropy 0:

If a set contains 5 poisonous and 5 edible mushrooms, the entropy becomes 1, as the purity is at its lowest:

We can plot the entropy as a function of the number of edible mushrooms in a set of, say, 100 mushrooms:


Information Gain

Mathematically, the information gain IG is defined as:

\[ IG(T,a) = H(T)-\sum_{v\in vals(a)}\frac{|\{\textbf{x}\in T|x_a=v\}|}{|T|} \cdot H(\{\textbf{x}\in T|x_a=v\}) \]

In words, the information gain measures the difference between the entropy before the split, and the weighted sum of the entropies after the split:

So, let’s rewrite that in R:

For example, using the mushroom data set:

The ID3 algorithm

Pseudo code

We are all set for the ID3 training algorithm. We start with the entire training data, and with a root. Then:

  1. if the data-set is pure (e.g. all toxic), then
    1. construct a leaf having the name of the class (e.g. ‘toxic’)
  2. else
    1. choose the feature with the highest information gain (e.g. ‘color’)
    2. for each value of that feature (e.g. ‘red’, ‘brown’, ‘green’)
      1. take the subset of the data-set having that feature value
      2. construct a child node having the name of that feature value (e.g. ‘red’)
      3. call the algorithm recursively on the child node and the subset

Implementation in R with the data.tree package

For the following implementation, we assume that the classifying features are in columns 1 to n-1, whereas the class (the edibility) is in the last column.

Training with data

We are ready to run the function:


The prediction method

Now, let’s predict some classes. For this, we need a predict function, which will route data through our tree:

Using the prediction method

And now we use it to predict:

Oops! Maybe organic wasn’t such a bad predictor, after all 🙂


Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

7 thoughts on “ID3 Classification using data.tree

  • Joao

    Hi! Thanks for the article, it’s very useful.
    I’m new in R and is giving me an error:
    “Error in .subset2(x, i, exact = exact) :
    attempt to select less than one element in integerOneIndex”

    This happens because childObs is null.
    At the leaf iteration the “data” represent only a column with data about the prediction and don’t enter on the “if clause” because in my case: unique(data[,ncol(data)]) ==2

    There are 3 possible classes in my data, can this be a problem for this algorithm?

    Is this a possible solution? (Is the if inside the first if)

    TrainID3 <- function(node, data) {

    node$obsCount <- nrow(data)

    #if the data-set is pure (e.g. all toxic), then
    if (IsPure(data) || ncol(data)==1 ) {
    #construct a leaf having the name of the pure feature (e.g. 'toxic')
    if (ncol(data)==1 ) {
    child <- node$AddChild(head(unique(data[,ncol(data)]),1))
    node$feature <- head(names(data), 1)
    child$obsCount <- nrow(data)
    child$feature <- ''}
    child <- node$AddChild(unique(data[,ncol(data)]))
    node$feature <- tail(names(data), 1)
    child$obsCount <- nrow(data)
    child$feature <- ''

    } else {
    #calculate the information gain
    ig <- sapply(colnames(data)[-ncol(data)],
    function(x) InformationGain(
    table(data[,x], data[,ncol(data)])
    #chose the feature with the highest information gain (e.g. 'color')
    #if more than one feature have the same information gain, then take
    #the first one
    feature <- names(which.max(ig))
    node$feature <- feature

    #take the subset of the data-set having that feature value
    childObs <- split(data[ ,names(data) != feature, drop = FALSE],
    data[ ,feature],
    drop = TRUE)

    for(i in 1:length(childObs)) {
    #construct a child having the name of that feature value (e.g. 'red')
    child <- node$AddChild(names(childObs)[i])

    #call the algorithm recursively on the child and the subset
    TrainID3(child, childObs[[i]])



    It worked but it gives-me too many nodes.
    Do you have functions to prune too?

    Thank you for your time.

    • gluc Post author

      While unfortunately I do not have time to look at the details, I can confirm that there are Prune functions in data.tree. Check out ?Prune.
      Note that the algorithm was implemented by me merely to illustrate one of the use cases of data.tree. For real world cases, you might consider more efficient algorithms, or to refine this. See for example the partykit package. Also, for more insight, you might check out over at crossvalidated (

  • Jignasha

    hello, I need to do prediction using entire test set file instead of only one record…and also need to display prediction accuracy by comparing with actual training data…can anyone help me?