Created By: Delaney Lyman, Anthony Nagygyor, Jayde Thompson

The complete report for this project, including the code and equations used, can be downloaded at the end of the page.


While fires are critical in maintaining the health of many forest ecosystems, their destructive and increasing impact cannot be ignored. The consequences of forest fires are especially devastating for the individuals, communities, flora, and fauna directly affected by them. The ecological consequences include loss of ecosystems and biodiversity, forest degradation, air pollution, soil degradation, and destruction of watersheds. The economic losses come primarily from the cost of keeping the fire under control. In addition, forest fires pose a serious risk to human health and well-being.

Therefore, the ability to predict the area burned by forest fires is essential to mitigating these impacts. The goal of this study is to find the best model to predict forest fire behavior, specifically the area burned, based on time of year.

Exploratory Data Analysis

Our analysis uses forest fire data from Montesinho natural park, located in the Tras-os-Montes northeast region of Portugal, from January 2000 to December 2003 [2]. The main dependent variable is area, the burned area of the forest (in hectares). The independent variables are as follows: spatial coordinates, moisture and drought codes, temperature, humidity, wind, rain, and day of the week.

Figure 1 displays the relationship between the variables used in our model. The boxed components provide numeric ratings of relative potential for wildland fire [3].

Figure 1: Diagram of relationship between variables

Given that the problem statement sought to understand variation in fire prediction based upon the time of the year, the data had to be divided up into modes/segments. Given the need for a fair amount of data to exist in each mode, a histogram was created to understand the distribution of data points over the year.

Figure 2 shows a histogram of the number of data points in each month. Some months have virtually no data, while others have a significant amount. It was therefore decided to group the data into 4 modes/segments based not on the traditional definition of season but on the proximity of the months and the amount of data contained within each. The modes are as follows: Mode 1: February, March, April; Mode 2: June, July; Mode 3: August; Mode 4: September, October.
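The grouping above can be sketched as a simple lookup. This is a minimal illustration, not the code from the report: the three-letter month labels, the function names, and the sample list are assumptions made for the example, and months that the text does not assign to a mode are left unassigned.

```python
from collections import Counter

# Month groupings as listed in the text. Months that are not listed
# (January, May, November, December) are left unassigned in this sketch.
MODE_BY_MONTH = {
    "feb": 1, "mar": 1, "apr": 1,
    "jun": 2, "jul": 2,
    "aug": 3,
    "sep": 4, "oct": 4,
}


def assign_mode(month):
    """Return the mode number for a month name or abbreviation, or None."""
    return MODE_BY_MONTH.get(month.strip().lower()[:3])


def count_by_mode(months):
    """Count how many records fall into each mode (None = unassigned)."""
    return Counter(assign_mode(m) for m in months)


# Hypothetical month labels, for illustration only (not the study's data).
sample_months = ["aug", "sep", "mar", "aug", "jul", "dec"]
mode_counts = count_by_mode(sample_months)
```

Counting records per mode in this way is a quick check that each mode holds enough data before any model is fitted to it.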

Figure 2: Histogram of the number of data points in each month

Methodology: Regression and Classification


The lasso proved to be an adequate method for variable selection, identifying the most important variables for each query. The accuracy of the resulting model is still marginal, however, because the lasso cannot improve the fit of a data set whose predictors appear to be only weakly correlated with the response. It does, as noted, still provide information about the relative importance of the variables.

The results from this model illustrate the different lambda values and predictor variables that were used to best predict area burned. The most common day for fires seemed to be Saturday, and the most common variable seemed to be temperature, as it showed up in Mode 2, Mode 3, Mode 4, and the full data set (all months).
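The way the lasso performs variable selection can be illustrated with a small coordinate-descent sketch. This is not the code from the report. It assumes the objective (1/(2n)) * RSS + lambda * (sum of absolute coefficients), centered predictors with no intercept, and a tiny made-up data set in which the second predictor contributes very little.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0.0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimise (1/(2n)) * RSS + lam * sum(|beta_j|) by cyclic updates.

    Assumes the columns of X and the response y are centered, so no
    intercept is fitted. X is a list of rows, y a list of floats.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # mean squared value of feature j
            for i in range(n):
                pred_without_j = sum(
                    X[i][k] * beta[k] for k in range(p) if k != j
                )
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            rho /= n
            z /= n
            beta[j] = soft_threshold(rho, lam) / z if z > 0 else 0.0
    return beta


# Hypothetical data: y = 2 * x1 + 0.1 * x2 with orthogonal, centered columns.
X = [[-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0], [1.0, 1.0]]
y = [-2.1, 1.9, -1.9, 2.1]

beta_penalised = lasso_coordinate_descent(X, y, lam=0.5)
beta_unpenalised = lasso_coordinate_descent(X, y, lam=0.0)
```

With the penalty switched on, the coefficient of the weak predictor is set exactly to zero while the strong predictor is only shrunk, which is what makes the lasso usable as a variable-selection tool even when the overall fit is poor.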

Step-wise Selection

Like the lasso, stepwise selection did well at identifying which variables were the most important. However, also like the lasso, it cannot improve the fit of data that is relatively uncorrelated.

The results from this method illustrate the number of variables each model selected under both the adjusted R^2 and the Bayesian Information Criterion (BIC). They show that the most important variable for predicting area burned changes with the time of year. Among all variables, temperature again seems to be the most common.
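For reference, the two selection criteria can be computed as in the sketch below. The formulas are common textbook forms (the BIC is the Gaussian linear-model version, up to an additive constant) and may differ from the exact definitions used in the report. The function names and the example numbers are assumptions for illustration.

```python
import math


def adjusted_r2(rss, tss, n, p):
    """Adjusted R^2 for a linear model with p predictors and n observations.

    rss is the residual sum of squares, tss the total sum of squares.
    """
    r2 = 1.0 - rss / tss
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


def bic_gaussian(rss, n, p):
    """BIC for a Gaussian linear model, up to an additive constant.

    Counts p predictors plus one intercept as fitted parameters.
    """
    return n * math.log(rss / n) + (p + 1) * math.log(n)


# Hypothetical comparison: a third predictor barely lowers the RSS.
two_var = adjusted_r2(rss=50.0, tss=100.0, n=21, p=2)
three_var = adjusted_r2(rss=49.9, tss=100.0, n=21, p=3)
```

Both criteria penalise extra predictors, so in the hypothetical comparison the three-variable model scores a lower adjusted R^2 than the two-variable model despite its slightly smaller RSS. A higher adjusted R^2 is better, while a lower BIC is better.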

Linear Discriminant Analysis (LDA)

The first method used to classify the severity of fire weather was linear discriminant analysis (LDA). LDA is a Bayes classifier that assumes the probability density functions of all the classes are multivariate normal and that all the classes share the same covariance matrix.
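For reference, under these two assumptions the Bayes rule reduces to comparing one linear score per class. The notation is the standard textbook form and is an assumption, not taken from the report: $\mu_k$ is the mean of class $k$, $\Sigma$ is the shared covariance matrix, and $\pi_k$ is the prior probability of class $k$.

```latex
\delta_k(x) = x^{\top} \Sigma^{-1} \mu_k
            - \tfrac{1}{2}\, \mu_k^{\top} \Sigma^{-1} \mu_k
            + \log \pi_k,
\qquad
\hat{y}(x) = \arg\max_{k} \, \delta_k(x)
```

The prior enters only through the $\log \pi_k$ term, so a class with a large prior receives a constant boost to its score regardless of the observation $x$.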

Performing LDA on the overall data set gave a misclassification rate of 81.0% when equal priors were used for all 8 classes. Because the data are skewed towards the smaller classes, LDA was performed again with priors based on the distribution of the classes in the training set. The non-equal priors lowered the misclassification rate to 54.8%. As expected, though, this improvement came from nearly everything being classified in class 0 because of its large prior (all days with a fire were classified in class 0). Therefore, although the non-equal priors give a better misclassification rate, the resulting classifier is in fact inferior: it never correctly predicts when fires will occur, which is the whole point of the classifier. For this reason, equal priors were used when creating classifiers for the modes.
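The gap between a good-looking misclassification rate and a useful classifier can be shown with a small sketch. The labels below are hypothetical and are not the study's data, and the helper names are illustrative. The example assumes class 0 is the dominant class.

```python
def misclassification_rate(y_true, y_pred):
    """Fraction of observations whose predicted class is wrong."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)


def nonzero_class_recall(y_true, y_pred):
    """Fraction of observations with a true class other than 0 that are
    classified correctly. Returns None if there are no such observations."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t != 0]
    if not pairs:
        return None
    return sum(1 for t, p in pairs if t == p) / len(pairs)


# Hypothetical labels in which class 0 dominates (illustration only).
y_true = [0, 0, 0, 0, 0, 1, 2, 1, 3, 0]
always_zero = [0] * len(y_true)  # a classifier that only ever outputs class 0

rate = misclassification_rate(y_true, always_zero)
recall = nonzero_class_recall(y_true, always_zero)
```

In this toy example the always-zero classifier is wrong on only 40% of the observations, yet it correctly identifies none of the observations in the other classes. Reporting a per-class measure next to the overall rate makes this failure mode visible.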

K-Nearest Neighbors

When performing k-Nearest Neighbors classification, the number of neighbors considered was varied between 1 and 6. Figure 3 below shows the misclassification rate as a function of the number of neighbors considered. Five neighbors appeared to be optimal, with a misclassification rate of 56.2%, considerably lower than the 81.0% obtained by LDA with equal priors. It was therefore decided that the modes would be compared using 5 neighbors, while also checking whether a different number of neighbors would prove optimal for any of them.
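The procedure described above (classify each test point by majority vote among its k nearest training points, then compare misclassification rates for k from 1 to 6) can be sketched in plain Python. This is an illustration under stated assumptions, not the report's code: it uses Euclidean distance, breaks voting ties in favour of the label of the closest neighbor among the tied classes, and runs on a small made-up data set.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k):
    """Predict the class of x by majority vote among its k nearest
    training points (Euclidean distance). On a tied vote, the label seen
    first among the sorted neighbors wins."""
    neighbors = sorted(
        zip(train_X, train_y), key=lambda pair: math.dist(pair[0], x)
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


def misclassification_by_k(train_X, train_y, test_X, test_y, ks=range(1, 7)):
    """Return {k: misclassification rate on the test set} for each k."""
    rates = {}
    for k in ks:
        wrong = sum(
            knn_predict(train_X, train_y, x, k) != y
            for x, y in zip(test_X, test_y)
        )
        rates[k] = wrong / len(test_y)
    return rates


# Hypothetical data: two clusters plus one mislabeled point near the first.
train_X = [(0.4, 0.4), (0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = [1, 0, 0, 0, 1, 1, 1]
test_X = [(0.5, 0.5), (5.5, 5.5)]
test_y = [0, 1]

rates = misclassification_by_k(train_X, train_y, test_X, test_y)
best_k = min(rates, key=rates.get)  # smallest k with the lowest rate
```

In this toy example a single neighbor is misled by the mislabeled point, while three or more neighbors outvote it. The same loop, run on real training and test sets, would produce a curve of misclassification rate against the number of neighbors.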

Figure 3: Misclassification rate as a function of the number of neighbors

The analysis of the modes was again performed using a training set consisting of 70% of the data for each mode and a test set consisting of the remaining 30%. As with the overall data set, LDA using equal priors was found to be a poor tool for creating classifiers, and its classification of Mode 1 (spring) was particularly poor (a 100% misclassification rate). We hypothesize that this is due to the particularly small size of the data set for Mode 1. K-Nearest Neighbors classification was then performed on the modes, using values of 1 through 6 neighbors. Figures 4-7 below show, for each mode, the misclassification rate as a function of the number of neighbors considered.
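A 70/30 split of one mode's records can be sketched as follows. The function name, the fixed seed, and the use of a random shuffle are assumptions for illustration. The report does not state how its split was drawn.

```python
import random


def split_train_test(rows, train_fraction=0.7, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists.

    The fixed seed is only there to make the illustration reproducible.
    """
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical record identifiers for a single mode (illustration only).
mode_rows = list(range(10))
train_rows, test_rows = split_train_test(mode_rows)
```

With only a handful of records in a mode, the 30% test set holds very few observations, so the misclassification rate measured on it can change by a large amount when a single observation is classified differently.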

Figure 4: Mode 1
Figure 5: Mode 2
Figure 6: Mode 3
Figure 7: Mode 4

As can be seen in Figures 4-7 above, the ideal number of neighbors varies considerably between modes. One hypothesis is that this variation reflects seasonal differences in weather and fire behavior, since the time of year may affect how each variable interacts with fire behavior. LDA does not appear to be an appropriate tool for classifying the fire data. The k-Nearest Neighbors method, however, provides an adequate tool given the available data.


In this paper we sought to understand if different times of year require different models for predicting forest fires. We broke the original data set down into four modes based upon data availability. We modeled the data using several different methods, ranging from regression methods to classification methods. 

The variables that best predict fire potential vary with the time of year. This should be qualified, however, as the degree of fit is known to be relatively weak for all of the modes based on the initial linear regression. Across all of the regression analyses, temperature and the Saturday indicator were the most important variables for predicting the area burned.

Compared to LDA, k-Nearest Neighbors classification was significantly more robust. For all of the modes and for the overall data set, the best classifier (of those considered) was k-Nearest Neighbors. The ideal number of neighbors also varied by mode, again suggesting that there is some seasonal/modal variation. The misclassification rate for all the modes when using their best k-Nearest Neighbors model was in the range of 50-60%. While this rate is relatively high, it is significantly better than random guessing.
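As a rough point of comparison, assume the 8 classes used for the overall data set and a guess drawn uniformly at random from them (both are assumptions made for this illustration). The expected misclassification rate of such a guess is:

```latex
P(\text{misclassified}) = 1 - \frac{1}{K} = 1 - \frac{1}{8} = 0.875
```

That is 87.5%, compared with the 50-60% range reported for the best k-Nearest Neighbors models.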

Suggestions to Improve the Model

The relatively poor results in predicting the size of fires, using both regression and classification, raise the question of whether the data set limited what could be achieved from the beginning. In essence, the predictor variables were limited to rudimentary weather readings (temperature, humidity, etc.), functions of them (FWI, DMC, etc.), or temporal variables (day of the week, month).

To get a better understanding of whether these variables could realistically be expected to predict fire potential, we consulted resources from the United States Forest Service. From these we determined that several other predictor variables would be needed to construct a more robust predictive tool and to better answer the question of whether models for fire prediction should vary by season. These resources suggest using variables that capture more historic information leading up to the data point (such as bulk density, a measurement of the amount of oven-dry fuel per cubic foot of fuel bed) and meso-level details (such as topography and the amount of 1000-hour fuels). Future analyses that predict and model forest fires should look for data that incorporates such details.
