We are often asked this question: how much data is required for forecasting? This is indeed an important question, as the forecast output depends on the data input. In fact, since we mostly use time series methods for forecasting, the accuracy of the forecast largely depends on the length of the data input. Let us examine the importance of data length and its implications for forecasting.
Let’s say we have only 1 month of historical data. What do you think the forecast will look like with this data input? Let’s see in this graph:
Here, when we have only 1 data point to rely upon, we obviously cannot find any patterns in the data. There is no trend or seasonality to look for. The forecast in such a case is a straight line equal to the history, pretty much like a naive forecast.
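To make the straight-line behaviour concrete, here is a minimal sketch of a naive forecast in Python. The function name `naive_forecast` and the sales figure are hypothetical and chosen only for illustration.

```python
def naive_forecast(history, horizon):
    """Repeat the last observed value for every future period."""
    if not history:
        raise ValueError("history must contain at least one observation")
    return [history[-1]] * horizon


# Hypothetical single month of sales (illustrative value, not real data)
one_month_history = [120.0]
print(naive_forecast(one_month_history, horizon=6))
```

With a single observation there is nothing else the method can use, so every future month gets the same value.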
What if we added 2 more months to history and made it 3 months of sales data – would it change anything?
Many companies work with averages and rely on the average of the last 3 months, so this may sound like a good enough start to them. However, the picture above makes it clear that the last three months alone give us no clue about the direction of future sales. All we can do is extend the average of these three months as a straight line, which will not tell us whether next month's sales will be high or low because seasonality is not captured. Such a forecast does not give a reliable estimate of the future and hence cannot be expected to optimize inventory.
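The three-month average described above can be sketched as follows. This is only an illustration: `moving_average_forecast` is a hypothetical helper and the sales numbers are made up.

```python
def moving_average_forecast(history, window, horizon):
    """Extend the mean of the last `window` observations as a flat line."""
    if len(history) < window:
        raise ValueError("history is shorter than the averaging window")
    level = sum(history[-window:]) / window
    return [level] * horizon


# Hypothetical sales for three consecutive months (illustrative values)
three_months = [100.0, 130.0, 160.0]
print(moving_average_forecast(three_months, window=3, horizon=4))
```

Note that the history in this example rises every month, yet the forecast is flat: the average carries no information about direction or season.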
What about 1-year data?
When we look at one year of sales data, it starts to give relatively better information, though still not enough to generate a reliable forecast. We can determine the local level from this data set and get some indication of the trend, but certainly none of seasonality. The only way to add a seasonal component to this data is through business knowledge and manual tweaking of the models. The system algorithms will not detect any seasonality in this data set automatically.
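As a rough illustration of what one year of data can support, the sketch below fits a straight line by ordinary least squares and reports a level and a trend. This is one simple way to estimate these two components, not necessarily what any particular forecasting system does; the function name and the data are hypothetical.

```python
def level_and_trend(history):
    """Fit y = a + b*t by least squares; return (level at last period, slope b)."""
    n = len(history)
    if n < 2:
        raise ValueError("need at least two observations to estimate a trend")
    t_mean = (n - 1) / 2
    y_mean = sum(history) / n
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(history))
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    slope = sxy / sxx
    # Fitted value of the line at the most recent period
    level = y_mean + slope * ((n - 1) - t_mean)
    return level, slope


# Hypothetical 12 months of sales growing by 5 units per month
one_year = [100.0 + 5.0 * t for t in range(12)]
level, slope = level_and_trend(one_year)
print(level, slope)
```

With only one observation per calendar month, a seasonal peak inside the year is indistinguishable from trend or noise, which is why the fit returns only these two numbers.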
This correction based on domain knowledge is a big challenge when we have hundreds or thousands of items to work with, which we usually do. To automate the forecasting process we need to use a system, and for any system to detect seasonality there should be a minimum of 2 data points for every period. For example, for January there should be two data points from two different years to establish seasonality.
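The two-observations-per-period rule above is easy to check before an item is handed to an automatic procedure. The following is a minimal sketch assuming monthly data, with each observation tagged by its calendar month number (1 to 12); `has_enough_seasonal_history` is a hypothetical helper name.

```python
from collections import Counter


def has_enough_seasonal_history(months, min_per_period=2):
    """Return True if every calendar month (1-12) appears at least `min_per_period` times."""
    counts = Counter(months)
    return all(counts[m] >= min_per_period for m in range(1, 13))


# Calendar month of each observation for 12 and 24 months of history
one_year = list(range(1, 13))
two_years = one_year * 2

print(has_enough_seasonal_history(one_year))
print(has_enough_seasonal_history(two_years))
```

A check like this could be used to route items with short history to manual review instead of automatic seasonal models.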
So is 2 years the minimum amount of data that we need?
When we have at least 2 years of data, we can determine the seasonality, trend, and level from this set. Once you have 2 years or more of data, forecasting software is usually capable of identifying these components automatically. The important thing to note here is that although the patterns are established, an aberration caused by external factors that led to abnormal sales in one of the years may confuse the system about the seasonality and lead to poor forecasts. Moreover, factors other than the seasonal effect may shape demand from time to time, such as ad hoc promotions, natural calamities or pandemics, and stock-outs. Hence, if we have a longer history of data, the system may be able to separate these factors from the data properly and form better forecasts.
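To illustrate why two full years are the minimum, here is a deliberately simple sketch that estimates seasonal indices as the average for each calendar position divided by the overall average. It assumes monthly data made up of complete years and ignores trend, so it is a simplification and not the method any specific forecasting package uses. The function name and data are hypothetical.

```python
def seasonal_indices(history, season_length=12):
    """Average of each seasonal position divided by the overall average."""
    if len(history) < 2 * season_length:
        raise ValueError("need at least two full seasons of data")
    overall_mean = sum(history) / len(history)
    indices = []
    for position in range(season_length):
        # All observations that fall on the same calendar position
        values = history[position::season_length]
        indices.append((sum(values) / len(values)) / overall_mean)
    return indices


# Hypothetical two years of sales: flat at 100 with a peak of 200 in the last month
two_years = ([100.0] * 11 + [200.0]) * 2
indices = seasonal_indices(two_years)
print([round(i, 3) for i in indices])
```

With exactly two years, each index rests on only two observations, so one abnormal month shifts its index substantially. More years of history dilute the effect of such outliers.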
Let’s look at longer historical data of 6 years as an example.
When we have 6-7 years of data, it becomes easier to forecast statistically and to find patterns even by visual inspection. The components of the data, such as level, trend, and seasonality, can be easily determined. Along with these components, other cyclical patterns, events, and random variations in the data can be captured by the system, which separates out the noise to generate the most reliable forecast.
In a nutshell, the more data we have, the better it is for statistical forecasts. The minimum requirement for reliable, automatically generated system forecasts is at least 2 years of history. If you have less history, you will need to correct the system forecasts, not only by choosing the relevant models but also by optimizing the weights in those models based on business and product knowledge.
How much more than 2 years you need depends on the industry you are in or the type of product you are forecasting. In the technology industry, going too far back in time does not help, but for a stable consumer good a longer history may give better results. Knowledge of the product life cycle, the industry, and the economy plays a key role in deciding what data to feed into the system for statistical forecasts.