Modeling the Growth of Trees

Dec 8, 2020

How do trees grow? What will happen to them in the future? In Prof. Albert Y. Kim’s seminar SDS 390: Ecological Forecasting, we learned about ecology as a predictive science, and throughout the semester we practiced making ecological predictions with regression analysis. In the second half of the semester, my wonderful teammates Haley and Sophie and I worked on a project that predicts the growth of trees from data on the interspecific competition they face.

Our goal: create a model that reasonably predicts trees’ growth

Interspecific competition is a form of competition in which individuals of different species compete for the same resources in an ecosystem (e.g. water or living space). It is one of the biggest factors that influence a tree’s growth, because the trees around it compete with it for natural resources. Digging a little deeper, however, takes the power of statistics and data science. Our question is: what, specifically, determines the growth rate of a given tree? There are thousands of potential factors. You may argue, for example, that an animal ate some of a tree’s leaves and that this influenced its growth. We cannot possibly include every factor (the data would be overwhelming, and we cannot collect data on everything), but we do have some “census” information on the surrounding trees: for example, their species, biomass (expressed as a weight), diameter at breast height (DBH), and specific leaf area (SLA, the ratio of leaf area to leaf dry mass).

Even with so much data available, we cannot include everything if we want an efficient model that predicts a tree’s growth from the trees around it. We need to decide which parameters to include, which to discard, and why.

Differentiate the trees: species to focal groups

We began with a simple model from one of our assignments, which used the biomass of surrounding trees to predict the growth of maple trees. To get started, we extended this small-scale model, built on tidy data, to a large dataset from the Smithsonian Conservation Biology Institute (SCBI) forestry reserve. Instead of predicting growth only for maple trees, we wanted to predict it for all trees, so we expanded the model so that a tree of any species could be the one whose growth is predicted. This initial model had 70+ parameters, because we used the species of the surrounding trees as parameters. The trees in the dataset are so diverse that this left us with too many variables for the model to be clear and efficient.
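To make the idea of "biomass of surrounding trees" concrete, here is a minimal sketch of how competitor biomass could be totaled per species for one focal tree. It is not the code we used: the field names (`id`, `x`, `y`, `species`, `biomass`), the 7.5-unit radius, the toy data, and the choice of a fixed circular neighborhood are all assumptions made for illustration.

```python
import math
from collections import defaultdict

def competitor_biomass(focal, trees, radius):
    """Sum the biomass of trees within `radius` of `focal`, split by species."""
    totals = defaultdict(float)
    for tree in trees:
        if tree["id"] == focal["id"]:
            continue  # a tree does not compete with itself
        distance = math.hypot(tree["x"] - focal["x"], tree["y"] - focal["y"])
        if distance <= radius:
            totals[tree["species"]] += tree["biomass"]
    return dict(totals)

# Made-up toy data (hypothetical coordinates, species labels, and biomass values).
trees = [
    {"id": 1, "x": 0.0, "y": 0.0, "species": "maple", "biomass": 120.0},
    {"id": 2, "x": 3.0, "y": 4.0, "species": "oak", "biomass": 200.0},
    {"id": 3, "x": 1.0, "y": 1.0, "species": "maple", "biomass": 80.0},
    {"id": 4, "x": 20.0, "y": 0.0, "species": "oak", "biomass": 500.0},
]

# Assumed neighborhood radius; tree 4 is too far away to count as a competitor.
result = competitor_biomass(trees[0], trees, radius=7.5)
print(result)
```

Each species that appears among the neighbors becomes its own column, which is why a diverse forest quickly produces dozens of parameters.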

Here is a visualization of the forest we studied. Each dot represents a tree at its location in the area, and its color represents its species. As we can see, there are many species, and the colors are so similar that it is hard to tell which species is where.

The diversity of the trees is crucial, and we do not want to treat trees of all species as the same. However, we need a broader way to cluster the trees so that we don’t have so many categories. We decided to map the species into focal groups, which left us with around 10 groups, reduced the number of parameters, and made the model much clearer. We clustered the species based on the diagram below, credit to Prof. David Allen at Middlebury College.

With this change, we made the model much simpler.
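The mapping step can be sketched as a simple lookup that collapses per-species biomass into per-group biomass. This is only an illustration: the species names below are placeholders, and their assignments to groups are made up. The real assignments come from the diagram. Only the two group names are taken from this post.

```python
from collections import defaultdict

# Hypothetical lookup table; the real species-to-group assignments
# come from the clustering diagram, not from this sketch.
SPECIES_TO_GROUP = {
    "species_a": "tall trees/light wood",
    "species_b": "tall trees/light wood",
    "species_c": "low SLA",
}

def biomass_by_group(biomass_by_species, lookup):
    """Collapse per-species competitor biomass into per-group totals."""
    totals = defaultdict(float)
    for species, biomass in biomass_by_species.items():
        totals[lookup[species]] += biomass
    return dict(totals)

# Made-up biomass values for three placeholder species.
grouped = biomass_by_group(
    {"species_a": 50.0, "species_b": 30.0, "species_c": 10.0},
    SPECIES_TO_GROUP,
)
print(grouped)
```

Three species columns become two group columns here. In the same way, 70+ species collapse into roughly 10 groups.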

And here’s a glimpse of the model summary:

Note that the coefficients are relative to the baseline group, tall trees/light wood species. What does the model tell us? It tells us how a tree’s predicted growth changes with each parameter. For example, the dbh parameter tells us that, all else being equal, one additional centimeter of diameter at breast height is associated with a tiny increase in tree growth (about .0005 meter). And a tree in the low SLA group is expected to grow .03 meter less than a tree in the baseline group, all else being equal.
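As a worked example of reading these coefficients, the sketch below combines the two rounded values quoted above. It assumes the effects simply add up, with no interaction terms. The function name and the 10 cm example are made up for illustration.

```python
# Rounded coefficients quoted in the text above.
DBH_COEF = 0.0005      # change in predicted growth per extra cm of dbh
LOW_SLA_COEF = -0.03   # shift for the low SLA group relative to the baseline

def predicted_growth_difference(extra_dbh_cm, in_low_sla_group):
    """Difference in predicted growth versus a baseline-group tree, all else equal.

    Assumes a purely additive model (no interaction terms).
    """
    difference = DBH_COEF * extra_dbh_cm
    if in_low_sla_group:
        difference += LOW_SLA_COEF
    return difference

# A low SLA tree that is 10 cm thicker than a baseline-group tree:
# 10 * 0.0005 - 0.03, which is roughly -0.025.
diff = predicted_growth_difference(10, True)
print(round(diff, 4))
```

The group effect is far larger than the per-centimeter dbh effect. It would take about 60 extra centimeters of diameter to offset it.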

Checking linear assumptions

Note: this is a more technical part. Feel free to skip this section :)

Next, we wanted to check whether the model is good enough by checking the assumptions of linear regression. The residual plot of the model (shown below) has a strong megaphone shape, which means that larger predicted values are associated with more widely spread residuals. Unfortunately, this means that one of those assumptions, constant variance of the residuals, is not met.

Because the residual plot shows that the model does not meet the assumptions well, we applied a log transformation to the dependent variable “DBH” so that the model fits them better. Here is what the residual plot looks like after the log transformation:

The megaphone shape is still there, but the trend is much milder, so the transformed model meets the assumptions better.
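To see why a log transformation can tame a megaphone shape, here is a small self-contained sketch on synthetic data, not our forest data. It assumes the response has multiplicative noise, fits a one-predictor least-squares line, and compares the spread of the residuals in the upper and lower halves of the fitted values. The helper names, the seed, and every number in the simulation are made up for illustration.

```python
import math
import random
import statistics

def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def spread_ratio(xs, ys):
    """Residual SD in the upper half of fitted values divided by the lower half.

    A ratio well above 1 is a numeric hint of a megaphone shape.
    """
    intercept, slope = fit_line(xs, ys)
    pairs = sorted(
        (intercept + slope * x, y - (intercept + slope * x))
        for x, y in zip(xs, ys)
    )
    half = len(pairs) // 2
    low = [residual for _, residual in pairs[:half]]
    high = [residual for _, residual in pairs[half:]]
    return statistics.pstdev(high) / statistics.pstdev(low)

# Synthetic data with multiplicative noise (assumed, not the SCBI data).
rng = random.Random(0)
xs = [rng.uniform(0, 3) for _ in range(2000)]
ys = [math.exp(0.5 + 0.8 * x + rng.gauss(0, 0.3)) for x in xs]

raw_ratio = spread_ratio(xs, ys)
log_ratio = spread_ratio(xs, [math.log(y) for y in ys])
print(f"raw: {raw_ratio:.2f}, log: {log_ratio:.2f}")
```

On the raw scale the ratio comes out well above 1, while on the log scale it is close to 1. That is the numeric counterpart of the megaphone becoming milder.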

A very simple model alternative

What if we ignore the species or cluster differences among trees and only use their aggregated biomass? We also tested how well a super simple model would perform, one that treats all trees as the same regardless of species and uses only the cumulative biomass of all competing trees. You can see it in the visualization below.

So the original, relatively complicated model

With R squared adjusted = .302, and AIC = -6779

Becomes this simple model:

With R squared adjusted = .296, and AIC = -6731

A higher adjusted R squared and a lower AIC both indicate a better fit. By both measures, the original model is a slightly better fit: its adjusted R squared is higher and its AIC is lower. But the simplicity of the second model is its advantage, so which one to use depends on the user’s judgment and the specific context. Neither model is completely accurate and neither can precisely predict tree growth, but both may provide some insight into what matters to a tree’s growth.
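For readers who want to see where these two numbers come from, here is a minimal sketch of the standard textbook formulas for adjusted R squared and for the AIC of a linear model with normally distributed errors. The function names are made up, and the sketch makes no claim about exactly how our software counted parameters. The last line only reuses the two AIC values reported above.

```python
import math

def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Penalize R squared for the number of predictors in the model."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

def gaussian_aic(rss, n_obs, n_params):
    """AIC = 2k - 2 ln(L) for a linear model with normal errors.

    `rss` is the residual sum of squares; `n_params` counts every
    estimated parameter, including the error variance.
    """
    log_likelihood = -0.5 * n_obs * (
        math.log(2 * math.pi) + math.log(rss / n_obs) + 1
    )
    return 2 * n_params - 2 * log_likelihood

# With the same fit (same rss), each extra parameter costs 2 AIC points.
penalty = gaussian_aic(10.0, 100, 4) - gaussian_aic(10.0, 100, 3)

# Gap between the two models reported above (simple minus original).
aic_gap = -6731 - (-6779)
print(penalty, aic_gap)
```

Both measures reward a good fit and charge a price for extra parameters. That is why the focal-group model can still come out ahead despite being more complicated.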

Discussion

We also looked for clustering methods other than the dendrogram above. Several R packages offer relevant functionality, such as `factoextra`, `cluster`, and `klaR`, but they did not work well with our species, so we stuck with our current clustering method.

Summary

There is a lot of exciting work involved: thinking about what to include in a model, interpreting it, adjusting the parameters as we go, creating a new model, analyzing how good it is, comparing it to an alternative… Overall, one big takeaway for me about picking and choosing models is that “you win some, you lose some”. While a model with 70+ parameters may predict the growth of each tree more specifically based on the precise species around it, such a model would be difficult to visualize and not as efficient. On the other hand, an overly simplified model may not fit the data as well as a more complex one. So it is essential to find a balance between sophistication and simplicity, and that is, in Albert’s words, when it is not so much a science as an art.