Banner

Abraham Lincoln once said, "Give me six hours to chop down a tree and I will spend the first four sharpening the axe."
Aunt Margaret used to say, "If you dream of a forest, you'd better learn how to plant a tree."
data.tree says, "No matter if you are a lumberjack or a tree hugger. I will be your sanding block, and I will be your seed."

Introduction

Trees

Trees are ubiquitous in mathematics, computer science, data sciences, finance, and in many other fields. Trees are especially useful when we are facing hierarchical data. For example, trees are used:

  • in decision theory (cf. decision trees)
  • in machine learning (e.g. classification trees)
  • in finance, e.g. to classify financial instruments into asset classes
  • in routing algorithms
  • in computer science and programming (e.g. binary search trees, XML)
  • e.g. for family trees

For more details, see the applications vignette by typing vignette("applications", package = "data.tree")

Trees in R

Tree-like structures are already used in R. For example, environments can be seen as nodes in a tree. And CRAN provides numerous packages that deal with tree-like structures, especially in the area of decision theory. Yet, there is no general purpose hierarchical data structure that could be used as conveniently and generically as, say, data.frame.

As a result, people often try to resolve hierarchical problems in a tabular fashion, for instance with data.frames. But often, hierarchies don’t marry with tables, and various workarounds are usually required.

Trees in data.tree

This package offers an alternative. The data.tree package lets you create hierarchies, called data.tree structures. The building block of theses structures are Node objects. The package provides basic traversal, search, and sort operations, and an infrastructure for recursive tree programming. You can decorate Nodes with your own fields and methods, so as to extend the package to your needs.

The package also provides convenience methods for neatly printing and plotting trees. It supports conversion from and to data.frames, lists, and other tree structures such as dendrogram, phylo objects from the ape package, igraph, and other packages.

Technically, data.tree structures are bi-directional, ordered trees. Bi-directional means that you can navigate from parent to chidren and vice versa. Ordered means that the sort order of the children of a parent node is well-defined.

data.tree basics

Definitions

  • data.tree structure: a tree, consisting of multiple Node objects. Often, the entry point to a data.tree structure is the root Node
  • Node: both a class and the basic building block of data.tree structures
  • attribute: an active, a field, or a method. **Not to be confused with standard R attributes, c.f. ?attr, which have a different meaning. Many methods and functions have an attribute arg, which can refer to a an active, a field or a method. For example, see ?Get
  • active (sometimes called property): a field on a Node that can be called like an attribute, but behaves like a function without arguments. For example: node$position
  • field: a named value on a Node, e.g. node$cost <- 2500
  • method: a function acting on an object (on a Node in this context). Many methods are available in OO style (e.g. node$Revert()) or in traditional style (Revert(node))
  • inheritance: in this context, inheritance refers to a situation in which a child Node inherits e.g. an attribute from one of its ancestors. For example, see ?Get, ?SetNodeStyle

Tree creation

There are different ways to create a data.tree structure. For example, you can create a tree programmatically, by conversion from other R objects, or from a file.

Create a tree programmatically

Let’s start by creating a tree programmatically. We do this by creating Node objects, and linking them together so as to define the parent-child relationships.

In this example, we are looking at a company, Acme Inc., and the tree reflects its organisational structure. The root (level 1) is the company. On level 2, the nodes represent departments, and the leaves of the tree represent projects that the company is considering for next year:

library(data.tree)

acme <- Node$new("Acme Inc.")
  accounting <- acme$AddChild("Accounting")
    software <- accounting$AddChild("New Software")
    standards <- accounting$AddChild("New Accounting Standards")
  research <- acme$AddChild("Research")
    newProductLine <- research$AddChild("New Product Line")
    newLabs <- research$AddChild("New Labs")
  it <- acme$AddChild("IT")
    outsource <- it$AddChild("Outsource")
    agile <- it$AddChild("Go agile")
    goToR <- it$AddChild("Switch to R")

print(acme)
##                           levelName
## 1  Acme Inc.                       
## 2   ¦--Accounting                  
## 3   ¦   ¦--New Software            
## 4   ¦   °--New Accounting Standards
## 5   ¦--Research                    
## 6   ¦   ¦--New Product Line        
## 7   ¦   °--New Labs                
## 8   °--IT                          
## 9       ¦--Outsource               
## 10      ¦--Go agile                
## 11      °--Switch to R

As you can see from the previous example, each Node is identified by its name, i.e. the argument you pass into the Node$new(name) constructor. The name needs to be unique among siblings, such that paths to Nodes are unambiguous.

Node inherits from R6 reference class. This has the following implications:

  1. You can call methods on a Node in OO style, e.g. acme$Get("name")
  2. Node exhibits reference semantics. Thus, multiple variables in R can point to the same Node, and modifying a Node will modify it for all referencing variables. In the above code example, both acme$IT and it reference the same object. This is different from the value semantics, which is much more widely used in R.

Create a tree from a data.frame

Creating a tree programmatically is useful especially in the context of algorithms. However, most times you will create a tree by conversion. This could be by conversion from a nested list-of-lists, by conversion from another R tree-structure (e.g. an ape phylo), or by conversion from a data.frame. For more details on all the options, type ?as.Node and refer to the See Also section.

One of the most common conversions is the one from a data.frame in table format. The following code illustrates this. We load the GNI2010 data from the treemap package. This data.frame is in table format, meaning that each row will represent a leaf in the data.tree structure:

library(treemap)
data(GNI2010)
head(GNI2010)
##   iso3              country     continent population  GNI
## 1  ABW                Aruba North America        108    0
## 2  AFG          Afghanistan          Asia      34385  410
## 3  AGO               Angola        Africa      19082 3960
## 4  ALB              Albania        Europe       3205 3960
## 5  ARE United Arab Emirates          Asia       7512    0
## 6  ARG            Argentina South America      40412 8620

Let’s convert that into a data.tree structure! We start by defining a pathString. The pathString describes the hierarchy by defining a path from the root to each leaf. In this example, the hierarchy comes very naturally:

GNI2010$pathString <- paste("world", 
                            GNI2010$continent, 
                            GNI2010$country, 
                            sep = "/")

Once our pathString is defined, conversion to Node is very easy:

population <- as.Node(GNI2010)
print(population, "iso3", "population", "GNI", limit = 20)
##                        levelName iso3 population   GNI
## 1  world                                      NA    NA
## 2   ¦--North America                          NA    NA
## 3   ¦   ¦--Aruba                  ABW        108     0
## 4   ¦   ¦--Antigua and Barbuda    ATG         88 13280
## 5   ¦   ¦--Bahamas                BHS        343 22240
## 6   ¦   ¦--Belize                 BLZ        345  3810
## 7   ¦   ¦--Bermuda                BMU         65     0
## 8   ¦   ¦--Barbados               BRB        274     0
## 9   ¦   ¦--Canada                 CAN      34126 43250
## 10  ¦   ¦--Costa Rica             CRI       4659  6810
## 11  ¦   ¦--Cuba                   CUB      11258     0
## 12  ¦   ¦--Curacao                CUW        143     0
## 13  ¦   ¦--Cayman Islands         CYM         56     0
## 14  ¦   ¦--Dominica               DMA         68  6740
## 15  ¦   ¦--Dominican Republic     DOM       9927  5030
## 16  ¦   ¦--Grenada                GRD        104  6960
## 17  ¦   ¦--Greenland              GRL         57     0
## 18  ¦   ¦--Guatemala              GTM      14389  2740
## 19  ¦   ¦--Honduras               HND       7600  1870
## 20  ¦   °--... 16 nodes w/ 0 sub              NA    NA
## 21  °--... 5 nodes w/ 190 sub                 NA    NA

This is a simple example, and more options are available. Type ?FromDataFrameTable for all the details.

Create a tree from a file

Often, trees are created from one of many file formats. When developing this package, We opted for a multi-step approach, meaning that you first import the file into one of the well-known R data structures. Then you convert these into a data.tree structure. For example, typical import patterns could be:

  • csv -> data.frame in table format (?read.csv) -> data.tree (?as.Node.data.frame)
  • Newick -> ape phylo (?ape::read.tree) -> data.tree (?as.Node.phylo )
  • csv -> data.frame in network format (?read.csv) -> data.tree (c.f. ?FromDataFrameNetwork)
  • yaml -> list of lists (?yaml::yaml.load) -> data.tree (?as.Node.list)
  • json -> list of lists (e.g. ?jsonlite::fromJSON) -> data.tree (?as.Node.list)

If you have a choice, we recommend you consider yaml format to store and share your hierarchies. It is concise, human-readable, and very easy to convert to a data.tree. An example is provided here for illustration. The data represents what platforms and OS versions a group of students use:

library(yaml)
yaml <- "
name: OS Students 2014/15
OS X:
  Yosemite:
    users: 16
  Leopard:
    users: 43
Linux:
  Debian:
    users: 27
  Ubuntu:
    users: 36
Windows:
  W7:
    users: 31
  W8:
    users: 32
  W10:
    users: 4
"

osList <- yaml.load(yaml)
osNode <- as.Node(osList)
print(osNode, "users")
##              levelName users
## 1  OS Students 2014/15    NA
## 2   ¦--OS X               NA
## 3   ¦   ¦--Yosemite       16
## 4   ¦   °--Leopard        43
## 5   ¦--Linux              NA
## 6   ¦   ¦--Debian         27
## 7   ¦   °--Ubuntu         36
## 8   °--Windows            NA
## 9       ¦--W7             31
## 10      ¦--W8             32
## 11      °--W10             4

Node methods

As seen above, a data.tree structure is composed of Node objects, and the entry point to a data.tree structure is always a Node, often the root Node of a tree.

There are different types of methods:

  • OO-style actives (sometimes called properties) on Nodes, such as e.g. Node$isRoot
  • OO-style methods on Nodes, such as e.g. Node$Prune(pruneFun)
  • Classical R methods, such as e.g. Clone(node). There are at least two reasons why these exist (sometimes in co-existence with their OO-style counterpart):
    • performance: OO-style methods maintain their own copy of the function in each Node instance, they are inflating memory usage of the Nodes. For this reason, OO-style methods are kept to a minimum
    • documentation: OO-style methods are difficult to document. So, to find documentation for, say, Node$Prune, you can just type ?Prune. Be aware, however, that this documents the classical R method. Thus, the first argument (node) needs to be omitted when called in OO-style. In short, you can call either mynode$Prune(myPruneFunction) or Prune(mynode, myPruneFunction). The effect is exactly the same.
    • debugging: OO-style methods are difficult to debug. So, in this packge, they just re-direct to their classical R counterpart, which is very easy to debug.

Actives Examples

Actives look and feel like fields, but they are dynamically evaluated. They are documented in the Node documentation, which is accessed by typing ?Node.

Remember our population example:

print(population, limit = 15)
##                        levelName
## 1  world                        
## 2   ¦--North America            
## 3   ¦   ¦--Aruba                
## 4   ¦   ¦--Antigua and Barbuda  
## 5   ¦   ¦--Bahamas              
## 6   ¦   ¦--Belize               
## 7   ¦   ¦--Bermuda              
## 8   ¦   ¦--Barbados             
## 9   ¦   ¦--Canada               
## 10  ¦   ¦--Costa Rica           
## 11  ¦   ¦--Cuba                 
## 12  ¦   ¦--Curacao              
## 13  ¦   ¦--Cayman Islands       
## 14  ¦   ¦--Dominica             
## 15  ¦   °--... 21 nodes w/ 0 sub
## 16  °--... 5 nodes w/ 195 sub
population$isRoot
## [1] TRUE
population$height
## [1] 3
population$count
## [1] 6
population$totalCount
## [1] 214
population$fields
## character(0)
population$fieldsAll
## [1] "GNI"        "continent"  "country"    "iso3"       "population"
population$averageBranchingFactor
## [1] 30.42857

The naming convention of the package is that fields and actives are lower case, whereas methods are upper / CamelCase. RStudio and other IDEs work well with data.tree. If you have a Node, simply type myNode$ + SPACE to get a list of available fields, actives and methods.

OO-Style Methods Examples

Examples of OO-Style methods

You will find more information on these examples below.

Get will traverse the tree and collect specific values for the Nodes it traverses:

sum(population$Get("population", filterFun = isLeaf))
## [1] 6727766

Prune traverses the tree and keeps only the subtrees for which the pruneFun returns TRUE.

population$Prune(pruneFun = function(x) !x$isLeaf || x$population > 1000000)
## [1] 205

Note that the Prune function has side-effects, as it acts on the original population object. The population sum is now smaller:

sum(population$Get("population", filterFun = isLeaf), na.rm = TRUE)
## [1] 2562915

Traditional R Methods

popClone <- Clone(acme)

Traditional S3 generics are available especially for conversion:

as.data.frame(acme)
##                           levelName
## 1  Acme Inc.                       
## 2   ¦--Accounting                  
## 3   ¦   ¦--New Software            
## 4   ¦   °--New Accounting Standards
## 5   ¦--Research                    
## 6   ¦   ¦--New Product Line        
## 7   ¦   °--New Labs                
## 8   °--IT                          
## 9       ¦--Outsource               
## 10      ¦--Go agile                
## 11      °--Switch to R

Though there is also a more specialised non-generic version:

ToDataFrameNetwork(acme)
##          from                       to
## 1   Acme Inc.               Accounting
## 2   Acme Inc.                 Research
## 3   Acme Inc.                       IT
## 4  Accounting             New Software
## 5  Accounting New Accounting Standards
## 6    Research         New Product Line
## 7    Research                 New Labs
## 8          IT                Outsource
## 9          IT                 Go agile
## 10         IT              Switch to R

Climbing a tree (tree navigation)

To climb a tree means to navigate to a specific Node in the data.tree structure.

Custom fields

Just as with, say, a list, we can add any custom field to any Node in a data.tree structure. Let’s go back to our acme company:

acme
##                           levelName
## 1  Acme Inc.                       
## 2   ¦--Accounting                  
## 3   ¦   ¦--New Software            
## 4   ¦   °--New Accounting Standards
## 5   ¦--Research                    
## 6   ¦   ¦--New Product Line        
## 7   ¦   °--New Labs                
## 8   °--IT                          
## 9       ¦--Outsource               
## 10      ¦--Go agile                
## 11      °--Switch to R

We now add costs and probabilities to the projects in each department:

software$cost <- 1000000
standards$cost <- 500000
newProductLine$cost <- 2000000
newLabs$cost <- 750000
outsource$cost <- 400000
agile$cost <- 250000
goToR$cost <- 50000

software$p <- 0.5
standards$p <- 0.75
newProductLine$p <- 0.25
newLabs$p <- 0.9
outsource$p <- 0.2
agile$p <- 0.05
goToR$p <- 1
print(acme, "cost", "p")
##                           levelName    cost    p
## 1  Acme Inc.                             NA   NA
## 2   ¦--Accounting                        NA   NA
## 3   ¦   ¦--New Software             1000000 0.50
## 4   ¦   °--New Accounting Standards  500000 0.75
## 5   ¦--Research                          NA   NA
## 6   ¦   ¦--New Product Line         2000000 0.25
## 7   ¦   °--New Labs                  750000 0.90
## 8   °--IT                                NA   NA
## 9       ¦--Outsource                 400000 0.20
## 10      ¦--Go agile                  250000 0.05
## 11      °--Switch to R                50000 1.00

Note that there is a list of reserved names you cannot use as Node fields:

NODE_RESERVED_NAMES_CONST
##  [1] "AddChild"               "AddChildNode"          
##  [3] "AddSibling"             "AddSiblingNode"        
##  [5] "averageBranchingFactor" "children"              
##  [7] "Climb"                  "clone"                 
##  [9] "count"                  "Do"                    
## [11] "fields"                 "fieldsAll"             
## [13] "Get"                    "GetAttribute"          
## [15] "height"                 "initialize"            
## [17] "isBinary"               "isLeaf"                
## [19] "isRoot"                 "leafCount"             
## [21] "leaves"                 "level"                 
## [23] "levelName"              "name"                  
## [25] "parent"                 "path"                  
## [27] "pathString"             "position"              
## [29] "Prune"                  "Revert"                
## [31] "RemoveAttribute"        "RemoveChild"           
## [33] "root"                   "Set"                   
## [35] "siblings"               "Sort"                  
## [37] "totalCount"             ".*"

Custom fields in constructor

An alternative, often convenient way to assign custom fields is in the constructor, or in the Node$AddChild method:

birds <- Node$new("Aves", vulgo = "Bird")
birds$AddChild("Neognathae", vulgo = "New Jaws", species = 10000)
birds$AddChild("Palaeognathae", vulgo = "Old Jaws", species = 60)
print(birds, "vulgo", "species")
##           levelName    vulgo species
## 1 Aves                  Bird      NA
## 2  ¦--Neognathae    New Jaws   10000
## 3  °--Palaeognathae Old Jaws      60

Custom fields as function

Nothin stops you from setting a function as a field. This calculates a value dynamically, i.e. whenever a field is accessed in tree traversal. For example, you can add a new Node to your structure, and the function will reflect this. Think of this as a hierarchical spreadsheet, in which you can set formulas into cells.

Consider the following example:

birds$species <- function(self) sum(sapply(self$children, function(x) x$species))
print(birds, "species")
##           levelName species
## 1 Aves                10060
## 2  ¦--Neognathae      10000
## 3  °--Palaeognathae      60

data.tree maps the self argument to the Node at hand. Thus, you must name the argument self.

Now, let’s assume we discover a new species. Then, the species on the root adjusts dynamically:

birds$Palaeognathae$species <- 61
print(birds, "species")
##           levelName species
## 1 Aves                10061
## 2  ¦--Neognathae      10000
## 3  °--Palaeognathae      61

This, together with the Set method and recursion, becomes a very powerful tool, as we’ll see later.

Printing

Basic Printing

Basic printing is easy, as you surely have noted in the previous sections. print displays a tree in a tree-grid view. On the left, you have the hierarchy. Then you have a column per variable you want to print:

print(acme, "cost", "p")
##                           levelName    cost    p
## 1  Acme Inc.                             NA   NA
## 2   ¦--Accounting                        NA   NA
## 3   ¦   ¦--New Software             1000000 0.50
## 4   ¦   °--New Accounting Standards  500000 0.75
## 5   ¦--Research                          NA   NA
## 6   ¦   ¦--New Product Line         2000000 0.25
## 7   ¦   °--New Labs                  750000 0.90
## 8   °--IT                                NA   NA
## 9       ¦--Outsource                 400000 0.20
## 10      ¦--Go agile                  250000 0.05
## 11      °--Switch to R                50000 1.00

For more advanced printing, you have a few options.

Formatters

You can use formatters to output a variable in a certain way. You can use formatters in two ways:

  • You can set them on a Node using the SetFormat method. If you do this, then the formatter will be picked up as a default formatter whenever you print, Get, convert to data.frame, etc. Formatters can be set on any Node in a data.tree structure act on any descendant. So you can overwrite a formatter for a sub-tree.
  • You can add an explicit ad-hoc formatter to the Get method (see below). This will overwrite default formatters previously set via the SetFormat method. You can also set the formatter to identity to void a default formatter.

Setting a formatter using the SetFormat method:

SetFormat(acme, "p", formatFun = FormatPercent)
SetFormat(acme, "cost", formatFun = function(x) FormatFixedDecimal(x, digits = 2))
print(acme, "cost", "p")
##                           levelName       cost        p
## 1  Acme Inc.                                           
## 2   ¦--Accounting                                      
## 3   ¦   ¦--New Software             1000000.00  50.00 %
## 4   ¦   °--New Accounting Standards  500000.00  75.00 %
## 5   ¦--Research                                        
## 6   ¦   ¦--New Product Line         2000000.00  25.00 %
## 7   ¦   °--New Labs                  750000.00  90.00 %
## 8   °--IT                                              
## 9       ¦--Outsource                 400000.00  20.00 %
## 10      ¦--Go agile                  250000.00   5.00 %
## 11      °--Switch to R                50000.00 100.00 %

Printing using Get

Formatting with the Get method overwrites any formatters found along the path:

data.frame(cost = acme$Get("cost", format = function(x) FormatFixedDecimal(x, 2)),
           p = acme$Get("p", format = FormatPercent))
##                                cost        p
## Acme Inc.                                   
## Accounting                                  
## New Software             1000000.00  50.00 %
## New Accounting Standards  500000.00  75.00 %
## Research                                    
## New Product Line         2000000.00  25.00 %
## New Labs                  750000.00  90.00 %
## IT                                          
## Outsource                 400000.00  20.00 %
## Go agile                  250000.00   5.00 %
## Switch to R                50000.00 100.00 %

Plotting

plot

data.tree is mainly a data structure. As it is easy to convert data.tree structures to other formats, you have access to a large number of tools to plot a data.tree structure. For example, you can plot a data.tree structure as a dendrogram, as an ape tree, as a treeview, etc. Additionally, data.tree also provides its own plotting facility. It is built on GraphViz/DiagrammeR, and you can access these features via the plot and ToGraphViz functions. For example:

plot(acme)

acme

Styling

Similar to formatters for printing, you can style your tree and store the styling directly in the tree, for later use:

SetGraphStyle(acme, rankdir = "TB")
SetEdgeStyle(acme, arrowhead = "vee", color = "grey35", penwidth = 2)
SetNodeStyle(acme, style = "filled,rounded", shape = "box", fillcolor = "GreenYellow", 
            fontname = "helvetica", tooltip = GetDefaultTooltip)
SetNodeStyle(acme$IT, fillcolor = "LightBlue", penwidth = "5px")
plot(acme)

acme

For details on the styling attributes, see http://graphviz.org/Documentation.php .

Note that, by default, most Node style attributes will be inherited. Though, for example, label will not be inherited. However, inheritance can be avoided for all style attributes, as for the Accounting node in the following example:

SetNodeStyle(acme$Accounting, inherit = FALSE, fillcolor = "Thistle", 
             fontcolor = "Firebrick", tooltip = "This is the accounting department")
plot(acme)

acme

Use Do to set style on specific nodes:

Do(acme$leaves, function(node) SetNodeStyle(node, shape = "egg"))
plot(acme)

acme

Other Visualisations

However, there are also endless other possibilities to visualise data.tree structures. There are more examples in the applications vignette. Type vignette('application', package = "data.tree").

Dendrogram

For example, using dendrogram:

plot(as.dendrogram(CreateRandomTree(nodes = 20)), center = TRUE)