Seasonal Time Series Modeling with Gradient Boosted Tree Regression

Seasonal time series data can be modeled with methods such as Seasonal ARIMA, GARCH and Holt-Winters. These are readily available in statistical packages such as R and Stata. But if you want to model a seasonal time series in Java, only very limited options are available. As a solution, here I will discuss a different approach, where the time series is modeled in Java using regression. I will be using Gradient Boosted Tree (GBT) regression from the Spark ML package.


I will use two datasets:

  • Milk-Production data [1]
  • Fancy data [2].

Milk-Production Data

Let's first have a good look at our dataset.

As we can see, the time series is not stationary: the mean of the series increases over time. In simpler terms, it has a clear upward trend. Therefore, before we start modeling it, we need to make it stationary.

For this, I'm going to use a mechanism called "differencing". Differencing simply creates a new series from the difference between the t-th term and the (t-m)-th term of the series. This can be denoted as follows:
   diff(m):             X'_t = X_t - X_{t-m}

We need to use the lowest m that makes the series stationary, so I will start with m=1. The first difference then becomes:
   diff(1):             X'_t = X_t - X_{t-1}
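The differencing step can be sketched in plain Java as below. This is only a minimal illustration: the class name `Differencing`, the method name `difference` and the sample values are made up for this example and are not part of Spark or of the datasets used here.

```java
import java.util.Arrays;

final class Differencing {

    /** Returns the lag-m differenced series: result[t - m] = series[t] - series[t - m]. */
    static double[] difference(double[] series, int m) {
        if (m < 1 || m >= series.length) {
            throw new IllegalArgumentException("lag m must satisfy 1 <= m < series length");
        }
        // The differenced series is m points shorter than the original one
        double[] diff = new double[series.length - m];
        for (int t = m; t < series.length; t++) {
            diff[t - m] = series[t] - series[t - m];
        }
        return diff;
    }

    public static void main(String[] args) {
        double[] series = {10.0, 12.0, 15.0, 19.0}; // made-up values
        System.out.println(Arrays.toString(difference(series, 1))); // prints [2.0, 3.0, 4.0]
    }
}
```

With m=1 the output has one point fewer than the input, and a larger m shortens it by m points.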

The differenced series has (n-1) data points. Let's plot it against time and see whether it has become stationary.

As we can see, no trend remains in the series (only the repeating pattern), so we can conclude that it is stationary. If diff(1) still shows some trend, try a higher m, say diff(2). Further, if the differenced series has a stationary mean (no trend) but a non-stationary variance (the range of the data changes with time, e.g. dataset [2]), then we need a transformation to get rid of the non-stationary variance. In such a scenario, a logarithmic transformation (log10, ln or similar) would do the job.

For our case, we don't need a transformation, so we can train a GBT regression model on this differenced data. Before we fit a regression model, we need a set of predictor (independent) variables. At the moment we only have the response variable (milk production). Therefore, I introduce four more features to the dataset, which are as follows:
  • index - Instance number (starts from 0 for both the training and testing datasets).
  • time - Instance number from the start of the series (starts from 0 for the training set; for the test set, starts from the last value of the training set + 1).
  • month - Encoded value representing the month of the year (ranges from 1 to 12). Note: this is because it is monthly data; you need to add a "week" feature too if it is weekly data.
  • year - Encoded value representing the year (starts from 1 and continues for the training data; for the test set, starts from the last year value of the training set + 1).
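One possible way to generate these four features for monthly data is sketched below. The class `SeasonalFeatures`, the method `build` and its parameters (`timeOffset`, `firstMonth`) are hypothetical names introduced only for this illustration. The sketch also assumes that the year counter simply rolls over after every December, which matches the description above when the training set ends in December.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SeasonalFeatures {

    /**
     * Builds one row {index, time, month, year} per monthly observation.
     *
     * @param count      number of rows to generate
     * @param timeOffset 0 for the training set, training-set size for the test set
     * @param firstMonth month (1-12) of the very first observation of the training series
     */
    static List<int[]> build(int count, int timeOffset, int firstMonth) {
        List<int[]> rows = new ArrayList<>();
        for (int index = 0; index < count; index++) {
            int time = timeOffset + index;               // position since the start of the series
            int monthsElapsed = (firstMonth - 1) + time; // months since January of year 1
            int month = monthsElapsed % 12 + 1;          // 1..12
            int year = monthsElapsed / 12 + 1;           // 1, 2, 3, ...
            rows.add(new int[] {index, time, month, year});
        }
        return rows;
    }

    public static void main(String[] args) {
        // Two years of training data starting in January, followed by three test rows
        List<int[]> train = build(24, 0, 1);
        List<int[]> test = build(3, train.size(), 1);
        System.out.println(Arrays.toString(train.get(23))); // prints [23, 23, 12, 2]
        System.out.println(Arrays.toString(test.get(0)));   // prints [0, 24, 1, 3]
    }
}
```

Note how `index` restarts at 0 for the test rows while `time` and `year` continue from where the training set ended.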

Once these new features are created, let's train a model:
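import org.apache.commons.lang3.ArrayUtils;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.ml.regression.GBTRegressor;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

SparkConf sparkConf = new SparkConf();
sparkConf.set("spark.driver.allowMultipleContexts", "true");
JavaSparkContext javaSparkContext = new JavaSparkContext(sparkConf);
SQLContext sqlContext = new SQLContext(javaSparkContext);
// ======================== Import data ====================================
// TRAINING_DATA_PATH is a placeholder for the location of the training CSV file
DataFrame trainDataFrame = sqlContext.read().format("com.databricks.spark.csv")
                                            .option("inferSchema", "true")
                                            .option("header", "true")
                                            .load(TRAINING_DATA_PATH);

// Get predictor variable names (every column except the response variable)
String[] predictors = trainDataFrame.columns();
predictors = ArrayUtils.removeElement(predictors, RESPONSE_VARIABLE);

// Combine the predictor columns into a single feature vector column
VectorAssembler vectorAssembler = new VectorAssembler().setInputCols(predictors)
                                                       .setOutputCol("features");

// GBT regressor; hyper-parameter setters are omitted from this listing
GBTRegressor gbt = new GBTRegressor().setLabelCol(RESPONSE_VARIABLE)
                                     .setFeaturesCol("features");

Pipeline pipeline = new Pipeline().setStages(new PipelineStage[] {vectorAssembler, gbt});
// Fit the whole pipeline (feature assembly + GBT) on the training data
PipelineModel pipelineModel = pipeline.fit(trainDataFrame);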

Here I have tuned the hyper-parameters to get the best-fitting line. Next, with the trained model, let's use the testing set (which contains only the newly introduced features) to predict the future.
DataFrame predictions = pipelineModel.transform(testDataFrame);
predictions.select("prediction").show(300);

Following is the prediction result:

It shows a very good fit. But the prediction results we get here are for the differenced series. We need to convert them back to the original series to get the actual prediction. Hence, let's inverse-difference the results:

X'_t = X_t - X_{t-1}
X_t = X'_t + X_{t-1}

Here X'_t is the differenced value and X_{t-1} is the original value at the previous step. We have an issue with the very first data point in the testing series, as we do not have its t-1 value (X_0 is unknown / doesn't exist). Therefore, I assumed that, in the testing series, X_0 is equal to the last value (X_n) of the original training set. With that assumption, following is the result of our prediction:
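The inverse-differencing step itself can be sketched as a running sum, as below. The class `InverseDifferencing`, the method `invert` and the sample numbers are hypothetical and only for illustration. The sketch assumes that, for every test point after the first, X_{t-1} is the previously reconstructed prediction, since the true values are not available in the test set.

```java
import java.util.Arrays;

final class InverseDifferencing {

    /**
     * Rebuilds X_t = X'_t + X_{t-1} for a diff(1) series.
     *
     * @param differenced predicted values of the differenced series
     * @param seed        value used as X_{t-1} for the first point,
     *                    e.g. the last value of the original training set
     */
    static double[] invert(double[] differenced, double seed) {
        double[] restored = new double[differenced.length];
        double previous = seed;
        for (int t = 0; t < differenced.length; t++) {
            restored[t] = differenced[t] + previous;
            previous = restored[t]; // the reconstructed value feeds the next step
        }
        return restored;
    }

    public static void main(String[] args) {
        double[] predictedDiffs = {2.0, 3.0, 4.0}; // made-up predictions
        double lastTrainingValue = 10.0;           // made-up seed
        System.out.println(Arrays.toString(invert(predictedDiffs, lastTrainingValue))); // prints [12.0, 15.0, 19.0]
    }
}
```

Because each reconstructed value is built on the previous one, any prediction error accumulates along the forecast horizon.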

Fancy Data

I followed the same procedure for the "Fancy" dataset [2] too. If we look at the data, we can see that the series is stationary in neither mean nor variance. Therefore, unlike with the previous dataset, we need a transformation in addition to differencing. Here I used a log10 transformation after differencing. Similarly, when converting the predictions back to actual values, I first took the inverse log of the predicted value and then did the inverse-differencing to get the actual prediction. Following is the result:

As we can see, it fits the validation data very closely, though we may further increase the accuracy by further tuning the hyper-parameters.
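For reference, a round trip combining a log10 transformation with differencing can be sketched as below. Note that this sketch takes the logarithm before differencing, which is an assumption of the example and differs from the order described for the Fancy dataset: log10 is only defined for strictly positive values, and differences can be negative. The class `LogDifferencing`, its method names and the sample values are hypothetical.

```java
final class LogDifferencing {

    /** Forward step: d_t = log10(x_t) - log10(x_{t-1}); requires strictly positive values. */
    static double[] forward(double[] series) {
        double[] out = new double[series.length - 1];
        for (int t = 1; t < series.length; t++) {
            if (series[t] <= 0.0 || series[t - 1] <= 0.0) {
                throw new IllegalArgumentException("log10 requires strictly positive values");
            }
            out[t - 1] = Math.log10(series[t]) - Math.log10(series[t - 1]);
        }
        return out;
    }

    /**
     * Inverse step: accumulate the predicted differences on the log scale,
     * starting from the last observed original value, then undo the log with 10^y.
     */
    static double[] inverse(double[] predictedDiffs, double lastObserved) {
        double[] out = new double[predictedDiffs.length];
        double previousLog = Math.log10(lastObserved);
        for (int t = 0; t < predictedDiffs.length; t++) {
            previousLog += predictedDiffs[t];
            out[t] = Math.pow(10.0, previousLog);
        }
        return out;
    }

    public static void main(String[] args) {
        double[] series = {100.0, 1000.0, 10000.0}; // made-up values
        double[] diffs = forward(series);
        double[] restored = inverse(diffs, series[0]);
        for (double value : restored) {
            System.out.println(value);
        }
    }
}
```

Feeding the output of `forward` back into `inverse`, seeded with the first original value, recovers the remaining original values up to floating-point rounding.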