

TimeSeries Analysis 📈 A Complete Guide

https://www.kaggle.com/code/andreshg/timeseries-analysis-a-complete-guide

Goals

  • How to interpret time series data
  • How to handle time series data
  • Common time series models
    • ACF/PACF
    • ARIMA
    • Auto-ARIMA
    • Prophet
    • Augmented Dickey-Fuller (ADF)
!pip install colorama
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
from colorama import Fore

from sklearn.metrics import mean_absolute_error, mean_squared_error
import math

import warnings
warnings.filterwarnings('ignore')

from datetime import datetime, date

np.random.seed(7)
df = pd.read_csv('/content/drive/MyDrive/kaggle/Acea_Smart_Water_Analytics/data/Aquifer_Petrignano.csv')
df
Date Rainfall_Bastia_Umbra Depth_to_Groundwater_P24 Depth_to_Groundwater_P25 Temperature_Bastia_Umbra Temperature_Petrignano Volume_C10_Petrignano Hydrometry_Fiume_Chiascio_Petrignano
0 14/03/2006 NaN -22.48 -22.18 NaN NaN NaN NaN
1 15/03/2006 NaN -22.38 -22.14 NaN NaN NaN NaN
2 16/03/2006 NaN -22.25 -22.04 NaN NaN NaN NaN
3 17/03/2006 NaN -22.38 -22.04 NaN NaN NaN NaN
4 18/03/2006 NaN -22.60 -22.04 NaN NaN NaN NaN
... ... ... ... ... ... ... ... ...
5218 26/06/2020 0.0 -25.68 -25.07 25.7 24.5 -29930.688 2.5
5219 27/06/2020 0.0 -25.80 -25.11 26.2 25.0 -31332.960 2.4
5220 28/06/2020 0.0 -25.80 -25.19 26.9 25.7 -32120.928 2.4
5221 29/06/2020 0.0 -25.78 -25.18 26.9 26.0 -30602.880 2.4
5222 30/06/2020 0.0 -25.91 -25.25 27.3 26.5 -31878.144 2.4

5223 rows × 8 columns

# remove nan rows
df = df[df.Rainfall_Bastia_Umbra.notna()].reset_index(drop=True)
# remove columns that are not useful
df = df.drop(['Depth_to_Groundwater_P24', 'Temperature_Petrignano'], axis=1)
# simplify column names
df.columns = ['date', 'rainfall','depth_to_groundwater','temperature','drainage_volume','river_hydrometry']
targets = ['depth_to_groundwater']
features = [feature for feature in df.columns if feature not in targets]
df.head()
date rainfall depth_to_groundwater temperature drainage_volume river_hydrometry
0 01/01/2009 0.0 -31.14 5.2 -24530.688 2.4
1 02/01/2009 0.0 -31.11 2.3 -28785.888 2.5
2 03/01/2009 0.0 -31.07 4.4 -25766.208 2.4
3 04/01/2009 0.0 -31.05 0.8 -27919.296 2.4
4 05/01/2009 0.0 -31.01 -1.9 -29854.656 2.3
df['date'] = pd.to_datetime(df['date'], format='%d/%m/%Y')
df.head().style.set_properties(subset=['date'], **{'background-color':'dodgerblue'})
  date rainfall depth_to_groundwater temperature drainage_volume river_hydrometry
0 2009-01-01 00:00:00 0.000000 -31.140000 5.200000 -24530.688000 2.400000
1 2009-01-02 00:00:00 0.000000 -31.110000 2.300000 -28785.888000 2.500000
2 2009-01-03 00:00:00 0.000000 -31.070000 4.400000 -25766.208000 2.400000
3 2009-01-04 00:00:00 0.000000 -31.050000 0.800000 -27919.296000 2.400000
4 2009-01-05 00:00:00 0.000000 -31.010000 -1.900000 -29854.656000 2.300000

1. Data Visualization

(features)

  • Rainfall indicates the quantity of rain falling (mm)
  • Temperature indicates the temperature (℃)
  • Drainage Volume indicates the volume of water taken from the drinking water treatment plant (㎥)
  • Hydrometry indicates the river water level (m)

(target)

  • Depth to Groundwater indicates the groundwater level (m from the ground floor)
# To fill in the missing values, as a naive method, we will use ffill

f, ax = plt.subplots(nrows=5, ncols=1, figsize=(15,25))

for i, column in enumerate(df.drop('date', axis=1).columns):
    sns.lineplot(x=df['date'], y=df[column].fillna(method='ffill'), 
                 ax=ax[i], color='dodgerblue')
    ax[i].set_title('Feature : {}'.format(column), fontsize=14)
    ax[i].set_ylabel(ylabel=column, fontsize=14)
    ax[i].set_xlim([date(2009, 1,1), date(2020, 6, 30)])

output_11_0

2. Data Preprocessing

  • Check the time intervals between observations
df = df.sort_values(by='date')

# check time intervals
df['delta'] = df['date'] - df['date'].shift(1)
df[['date', 'delta']].head()
date delta
0 2009-01-01 NaT
1 2009-01-02 1 days
2 2009-01-03 1 days
3 2009-01-04 1 days
4 2009-01-05 1 days
df['delta'].sum(), df['delta'].count()
(Timedelta('4198 days 00:00:00'), 4198)
df = df.drop('delta', axis=1)

2.1 Handling Missing Values

df.isna().sum()
date                     0
rainfall                 0
depth_to_groundwater    27
temperature              0
drainage_volume          1
river_hydrometry         0
dtype: int64
# First replace the anomalous values (0) with np.nan, then fill them in with fillna
f, ax = plt.subplots(nrows=2, ncols=1, figsize=(15,15))

old_hydrometry = df['river_hydrometry'].copy()
df['river_hydrometry'] = df['river_hydrometry'].replace(0, np.nan)
sns.lineplot(x=df['date'], y=old_hydrometry, ax=ax[0], 
             color='darkorange', label='original')
sns.lineplot(x=df['date'], y=df['river_hydrometry'].fillna(np.inf), ax=ax[0],
                  color='dodgerblue', label='modified')
ax[0].set_title('Feature : Hydrometry', fontsize=14)
ax[0].set_ylabel(ylabel='Hydrometry', fontsize=14)
ax[0].set_xlim([date(2009, 1,1), date(2020, 6, 30)])

old_drainage = df['drainage_volume'].copy()
df['drainage_volume'] = df['drainage_volume'].replace(0, np.nan)
sns.lineplot(x=df['date'], y=old_drainage, ax=ax[1], color='darkorange', label='original')
sns.lineplot(x=df['date'], y=df['drainage_volume'].fillna(np.inf), ax=ax[1], color='dodgerblue', label='modified')
ax[1].set_title('Feature: Drainage', fontsize=14)
ax[1].set_ylabel(ylabel='Drainage', fontsize=14)
ax[1].set_xlim([date(2009, 1, 1), date(2020, 6, 30)])
(14245.0, 18443.0)

output_19_1

f, ax = plt.subplots(nrows=1, ncols=1, figsize=(16,5))

sns.heatmap(df.T.isna(), cmap='Blues')
ax.set_title("Missing Values", fontsize=16)

for tick in ax.yaxis.get_major_ticks():
    tick.label.set_fontsize(14)
plt.show()

output_20_0

Missing Values, how to handle

  • Option 1: Fill NaN with Outlier or Zero

In this specific example, filling the missing values with an outlier value such as np.inf or 0 seems very naive. However, using a value like -999 is sometimes a good idea.

  • Option 2: Fill NaN with Mean Value

Filling NaNs with the mean value is also naive and not sufficient, and doesn't seem to be a good option.

  • Option 3: Fill NaN with Last Value with .ffill()

Filling NaNs with the last value could be a bit better.

  • Option 4: Fill NaN with Linearly Interpolated Value with .interpolate()

Filling NaNs with linearly interpolated values is the best option in this small example, but it requires knowledge of the neighbouring values.

f, ax = plt.subplots(nrows=4, ncols=1, figsize=(15, 12))

sns.lineplot(x=df['date'], y=df['drainage_volume'].fillna(0), ax=ax[0], color='darkorange', label = 'modified')
sns.lineplot(x=df['date'], y=df['drainage_volume'].fillna(np.inf), ax=ax[0], color='dodgerblue', label = 'original')
ax[0].set_title('Fill NaN with 0', fontsize=14)
ax[0].set_ylabel(ylabel='Volume C10 Petrignano', fontsize=14)

mean_drainage = df['drainage_volume'].mean()
sns.lineplot(x=df['date'], y=df['drainage_volume'].fillna(mean_drainage), ax=ax[1], color='darkorange', label = 'modified')
sns.lineplot(x=df['date'], y=df['drainage_volume'].fillna(np.inf), ax=ax[1], color='dodgerblue', label = 'original')
ax[1].set_title(f'Fill NaN with Mean Value ({mean_drainage:.0f})', fontsize=14)
ax[1].set_ylabel(ylabel='Volume C10 Petrignano', fontsize=14)

sns.lineplot(x=df['date'], y=df['drainage_volume'].ffill(), ax=ax[2], color='darkorange', label = 'modified')
sns.lineplot(x=df['date'], y=df['drainage_volume'].fillna(np.inf), ax=ax[2], color='dodgerblue', label = 'original')
ax[2].set_title(f'FFill', fontsize=14)
ax[2].set_ylabel(ylabel='Volume C10 Petrignano', fontsize=14)

sns.lineplot(x=df['date'], y=df['drainage_volume'].interpolate(), ax=ax[3], color='darkorange', label = 'modified')
sns.lineplot(x=df['date'], y=df['drainage_volume'].fillna(np.inf), ax=ax[3], color='dodgerblue', label = 'original')
ax[3].set_title(f'Interpolate', fontsize=14)
ax[3].set_ylabel(ylabel='Volume C10 Petrignano', fontsize=14)

for i in range(4):
    ax[i].set_xlim([date(2019, 5, 1), date(2019, 10, 1)])
    
plt.tight_layout()
plt.show()

output_22_0

# the interpolate method looks like the best option

df['drainage_volume'] = df['drainage_volume'].interpolate()
df['river_hydrometry'] = df['river_hydrometry'].interpolate()
df['depth_to_groundwater'] = df['depth_to_groundwater'].interpolate()

2.2 Smoothing data / Resampling

fig, ax = plt.subplots(ncols=2, nrows=3, sharex=True, figsize=(16,12))

sns.lineplot(x=df['date'], y=df['drainage_volume'], color='dodgerblue', ax=ax[0, 0])
ax[0, 0].set_title('Drainage Volume', fontsize=14)

resampled_df = df[['date','drainage_volume']].resample('7D', on='date').sum().reset_index(drop=False)
sns.lineplot(x=resampled_df['date'], y=resampled_df['drainage_volume'], color='dodgerblue', ax=ax[1, 0])
ax[1, 0].set_title('Weekly Drainage Volume', fontsize=14)

resampled_df = df[['date','drainage_volume']].resample('M', on='date').sum().reset_index(drop=False)
sns.lineplot(x=resampled_df['date'], y=resampled_df['drainage_volume'], color='dodgerblue', ax=ax[2, 0])
ax[2, 0].set_title('Monthly Drainage Volume', fontsize=14)

for i in range(3):
    ax[i, 0].set_xlim([date(2009, 1, 1), date(2020, 6, 30)])

sns.lineplot(x=df['date'], y=df['temperature'], color='dodgerblue', ax=ax[0, 1])
ax[0, 1].set_title('Daily Temperature (Acc.)', fontsize=14)

resampled_df = df[['date','temperature']].resample('7D', on='date').mean().reset_index(drop=False)
sns.lineplot(x=resampled_df['date'], y=resampled_df['temperature'], color='dodgerblue', ax=ax[1, 1])
ax[1, 1].set_title('Weekly Temperature (Acc.)', fontsize=14)

resampled_df = df[['date','temperature']].resample('M', on='date').mean().reset_index(drop=False)
sns.lineplot(x=resampled_df['date'], y=resampled_df['temperature'], color='dodgerblue', ax=ax[2, 1])
ax[2, 1].set_title('Monthly Temperature (Acc.)', fontsize=14)

for i in range(3):
    ax[i, 1].set_xlim([date(2009, 1, 1), date(2020, 6, 30)])
plt.show()

output_25_0

# As we can see, downsampling to weekly data smooths the series and helps with the analysis
downsample = df[['date',
                 'depth_to_groundwater', 
                 'temperature',
                 'drainage_volume', 
                 'river_hydrometry',
                 'rainfall'
                ]].resample('7D', on='date').mean().reset_index(drop=False)

df = downsample.copy()

2.3 Stationarity

Some time series models, such as ARIMA, assume that the underlying data is stationary. Stationarity means that the time series has

  • constant mean and mean is not time-dependent
  • constant variance and variance is not time-dependent
  • constant covariance and covariance is not time-dependent

stationarity

The check for stationarity can be done via three different approaches:

  • visually : plot time series and check for trends or seasonality
  • basic statistics : split the time series and compare the mean and variance of each partition (a quick sketch of this follows the list)
  • statistical test : Augmented Dickey Fuller test
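
The basic-statistics approach is not demonstrated in the cells below, so here is a minimal sketch of it (an addition, assuming the weekly-resampled df built in section 2.2): split a series into two halves and compare the mean and variance of each partition; large differences hint at non-stationarity.

# Basic-statistics check (rough heuristic): compare mean/variance of two halves
series = df['depth_to_groundwater'].values
half = len(series) // 2
first_half, second_half = series[:half], series[half:]

print(f'mean:     {first_half.mean():.3f} vs {second_half.mean():.3f}')
print(f'variance: {first_half.var():.3f} vs {second_half.var():.3f}')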
# A year has approximately 52 weeks (52 weeks * 7 days per week ≈ 365 days)
rolling_window = 52
f, ax = plt.subplots(nrows=2, ncols=1, figsize=(15, 12))

sns.lineplot(x=df['date'], y=df['drainage_volume'], ax=ax[0], color='dodgerblue')
sns.lineplot(x=df['date'], y=df['drainage_volume'].rolling(rolling_window).mean(), ax=ax[0], color='black', label='rolling mean')
sns.lineplot(x=df['date'], y=df['drainage_volume'].rolling(rolling_window).std(), ax=ax[0], color='orange', label='rolling std')
ax[0].set_title('Drainage Volume: Non-stationary \nnon-constant mean & non-constant variance', fontsize=14)
ax[0].set_ylabel(ylabel='Drainage Volume', fontsize=14)
ax[0].set_xlim([date(2009, 1, 1), date(2020, 6, 30)])

sns.lineplot(x=df['date'], y=df['temperature'], ax=ax[1], color='dodgerblue')
sns.lineplot(x=df['date'], y=df['temperature'].rolling(rolling_window).mean(), ax=ax[1], color='black', label='rolling mean')
sns.lineplot(x=df['date'], y=df['temperature'].rolling(rolling_window).std(), ax=ax[1], color='orange', label='rolling std')
ax[1].set_title('Temperature: Non-stationary \nvariance is time-dependent (seasonality)', fontsize=14)
ax[1].set_ylabel(ylabel='Temperature', fontsize=14)
ax[1].set_xlim([date(2009, 1, 1), date(2020, 6, 30)])

plt.tight_layout()
plt.show()

output_30_0

In this visual check, we can see that the features do not have a perfectly constant mean and standard deviation, but they are close to it.

Unit Root Test

A unit root is a characteristic of a time series that makes it non-stationary, and the ADF test belongs to the family of unit root tests. Technically, a unit root is said to exist in a time series if $\alpha = 1$ in the equation below.

$Y_{t} = \alpha Y_{t-1} + \beta X_{e} + \epsilon$

where $Y_t$ is the value of the time series at time $t$ and $X_e$ is an exogenous variable.

The presence of a unit root means the time series is non-stationary.

2.3.1 Augmented Dickey-Fuller (ADF)

The Augmented Dickey-Fuller (ADF) test is a type of statistical test called a unit root test. Unit roots are a cause of non-stationarity.

  • Null Hypothesis (H0): Time series has a unit root. (Time series is not stationary).

  • Alternate Hypothesis (H1): Time series has no unit root (Time series is stationary).

If the null hypothesis can be rejected, we can conclude that the time series is stationary.

There are two ways to reject the null hypothesis:

On the one hand, the null hypothesis can be rejected if the p-value is below a set significance level. The default significance level is 5%.

  • p-value > significance level (default: 0.05): Fail to reject the null hypothesis (H0), the data has a unit root and is non-stationary.

  • p-value <= significance level (default: 0.05): Reject the null hypothesis (H0), the data does not have a unit root and is stationary.

On the other hand, the null hypothesis can be rejected if the test statistic is less than the critical value.

  • ADF statistic > critical value: Fail to reject the null hypothesis (H0), the data has a unit root and is non-stationary.
  • ADF statistic < critical value: Reject the null hypothesis (H0), the data does not have a unit root and is stationary.
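
To make these decision rules concrete, here is a small self-contained sketch (not from the original notebook) that applies them to two simulated series: a random walk (alpha = 1, i.e. a unit root) and a stationary AR(1) process with alpha = 0.5.

# Illustrative only: apply the ADF decision rule to simulated data
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(7)
eps = rng.normal(size=500)

random_walk = np.cumsum(eps)            # alpha = 1 -> unit root, non-stationary
ar1 = np.zeros(500)
for t in range(1, 500):                 # alpha = 0.5 -> stationary
    ar1[t] = 0.5 * ar1[t - 1] + eps[t]

for name, series in [('random walk', random_walk), ('AR(1), alpha=0.5', ar1)]:
    adf_stat, p_val = adfuller(series)[:2]
    verdict = 'stationary' if p_val <= 0.05 else 'non-stationary'
    print(f'{name}: ADF statistic = {adf_stat:.2f}, p-value = {p_val:.3f} -> {verdict}')

The random walk should fail to reject H0, while the AR(1) series should reject it, mirroring the rules above.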
# https://www.statsmodels.org/stable/generated/statsmodels.tsa.stattools.adfuller.html

from statsmodels.tsa.stattools import adfuller

result = adfuller(df['depth_to_groundwater'].values)
result
(-2.8802016493166605,
 0.047699190920208856,
 7,
 592,
 {'1%': -3.441444394224128,
  '5%': -2.8664345376276454,
  '10%': -2.569376663737217},
 -734.3154255877616)
# Thanks to https://www.kaggle.com/iamleonie for this function!
f, ax = plt.subplots(nrows=3, ncols=2, figsize=(15, 9))

def visualize_adfuller_results(series, title, ax):
    result = adfuller(series)
    significance_level = 0.05
    adf_stat = result[0]
    p_val = result[1]
    crit_val_1 = result[4]['1%']
    crit_val_5 = result[4]['5%']
    crit_val_10 = result[4]['10%']

    if (p_val < significance_level) & ((adf_stat < crit_val_1)):
        linecolor = 'forestgreen' 
    elif (p_val < significance_level) & (adf_stat < crit_val_5):
        linecolor = 'orange'
    elif (p_val < significance_level) & (adf_stat < crit_val_10):
        linecolor = 'red'
    else:
        linecolor = 'purple'
    sns.lineplot(x=df['date'], y=series, ax=ax, color=linecolor)
    ax.set_title(f'ADF Statistic {adf_stat:0.3f}, p-value: {p_val:0.3f}\nCritical Values 1%: {crit_val_1:0.3f}, 5%: {crit_val_5:0.3f}, 10%: {crit_val_10:0.3f}', fontsize=14)
    ax.set_ylabel(ylabel=title, fontsize=14)

visualize_adfuller_results(df['rainfall'].values, 'Rainfall', ax[0, 0])
visualize_adfuller_results(df['temperature'].values, 'Temperature', ax[1, 0])
visualize_adfuller_results(df['river_hydrometry'].values, 'River_Hydrometry', ax[0, 1])
visualize_adfuller_results(df['drainage_volume'].values, 'Drainage_Volume', ax[1, 1])
visualize_adfuller_results(df['depth_to_groundwater'].values, 'Depth_to_Groundwater', ax[2, 0])

f.delaxes(ax[2, 1])
plt.tight_layout()
plt.show()

output_36_0

If the data is not stationary but we want to use a model such as ARIMA (that requires this characteristic), the data has to be transformed.

The two most common methods to transform a series into a stationary one are:

  • Transformation : e.g. log or square root to stabilize a non-constant variance
  • Differencing : subtract the previous value from the current one

2.3.2 Transformation

# Log Transform of absolute values
# (Log transform of negative values would return NaN)
df['depth_to_groundwater_log'] = np.log(abs(df['depth_to_groundwater']))

f, ax = plt.subplots(nrows=1, ncols=2, figsize=(20, 6))
visualize_adfuller_results(df['depth_to_groundwater_log'], 'Transformed \n Depth to Groundwater', ax[0])

sns.distplot(df['depth_to_groundwater_log'], ax=ax[1])
<Axes: xlabel='depth_to_groundwater_log', ylabel='Density'>

output_39_1

2.3.3 Differencing

# First Order Differencing
ts_diff = np.diff(df['depth_to_groundwater'])
df['depth_to_groundwater_diff_1'] = np.append([0], ts_diff)

f, ax = plt.subplots(nrows=1, ncols=1, figsize=(15, 6))
visualize_adfuller_results(df['depth_to_groundwater_diff_1'], 'Differenced (1. Order) \n Depth to Groundwater', ax)

output_41_0

3. Feature Engineering

df['year'] = pd.DatetimeIndex(df['date']).year
df['month'] = pd.DatetimeIndex(df['date']).month
df['day'] = pd.DatetimeIndex(df['date']).day
df['day_of_year'] = pd.DatetimeIndex(df['date']).dayofyear
df['week_of_year'] = pd.DatetimeIndex(df['date']).weekofyear
df['quarter'] = pd.DatetimeIndex(df['date']).quarter
df['season'] = df['month'] % 12 // 3 + 1

df[['date', 'year', 'month', 'day', 'day_of_year', 'week_of_year', 'quarter', 'season']].head()
date year month day day_of_year week_of_year quarter season
0 2009-01-01 2009 1 1 1 1 1 1
1 2009-01-08 2009 1 8 8 2 1 1
2 2009-01-15 2009 1 15 15 3 1 1
3 2009-01-22 2009 1 22 22 4 1 1
4 2009-01-29 2009 1 29 29 5 1 1

3.1 Encoding Cyclical Features

The new time features are cyclical. For example, the feature month cycles between 1 and 12 every year. While the difference between consecutive months is 1 within a year, at the turn of the year the month feature jumps from 12 (December) to 1 (January). This results in a -11 difference, which can confuse a lot of models.

f, ax = plt.subplots(nrows=1, ncols=1, figsize=(20, 3))

sns.lineplot(x=df['date'], y=df['month'], color='dodgerblue')
ax.set_xlim([date(2009, 1, 1), date(2020, 6, 30)])
plt.show()

output_46_0

month_in_year = 12
df['month_sin'] = np.sin(2*np.pi*df['month']/month_in_year)
df['month_cos'] = np.cos(2*np.pi*df['month']/month_in_year)

f, ax = plt.subplots(nrows=1, ncols=1, figsize=(6, 6))

sns.scatterplot(x=df.month_sin, y=df.month_cos, color='dodgerblue')
plt.show()

output_47_0

3.2 TimeSeries Decomposition

Time series decomposition involves thinking of a series as a combination of level, trend, seasonality, and noise components.

These components are defined as follows:

  • Level: The average value in the series.
  • Trend: The increasing or decreasing value in the series.
  • Seasonality: The repeating short-term cycle in the series.
  • Noise: The random variation in the series.

Decomposition provides a useful abstract model for thinking about time series generally and for better understanding problems during time series analysis and forecasting.

All series have a level and noise. The trend and seasonality components are optional.

It is helpful to think of the components as combining either additively or multiplicatively:

  • Additive : $y(t) = Level + Trend + Seasonality + Noise$
  • Multiplicative : $y(t) = Level * Trend * Seasonality * Noise$

In this case we are going to use the seasonal_decompose() function from the statsmodels library.

from statsmodels.tsa.seasonal import seasonal_decompose

core_columns =  [
    'rainfall', 'temperature', 'drainage_volume', 
    'river_hydrometry', 'depth_to_groundwater'
]

for column in core_columns:
    decomp = seasonal_decompose(df[column], period=52, model='additive', extrapolate_trend='freq')
    df[f"{column}_trend"] = decomp.trend
    df[f"{column}_seasonal"] = decomp.seasonal
df.columns
Index(['date', 'depth_to_groundwater', 'temperature', 'drainage_volume',
       'river_hydrometry', 'rainfall', 'depth_to_groundwater_log',
       'depth_to_groundwater_diff_1', 'year', 'month', 'day', 'day_of_year',
       'week_of_year', 'quarter', 'season', 'month_sin', 'month_cos',
       'rainfall_trend', 'rainfall_seasonal', 'temperature_trend',
       'temperature_seasonal', 'drainage_volume_trend',
       'drainage_volume_seasonal', 'river_hydrometry_trend',
       'river_hydrometry_seasonal', 'depth_to_groundwater_trend',
       'depth_to_groundwater_seasonal'],
      dtype='object')
fig, ax = plt.subplots(ncols=2, nrows=4, sharex=True, figsize=(16,8))

for i, column in enumerate(['temperature', 'depth_to_groundwater']):
    
    res = seasonal_decompose(df[column], period=52, model='additive', extrapolate_trend='freq')

    ax[0,i].set_title('Decomposition of {}'.format(column), fontsize=16)
    res.observed.plot(ax=ax[0,i], legend=False, color='dodgerblue')
    ax[0,i].set_ylabel('Observed', fontsize=14)

    res.trend.plot(ax=ax[1,i], legend=False, color='dodgerblue')
    ax[1,i].set_ylabel('Trend', fontsize=14)

    res.seasonal.plot(ax=ax[2,i], legend=False, color='dodgerblue')
    ax[2,i].set_ylabel('Seasonal', fontsize=14)
    
    res.resid.plot(ax=ax[3,i], legend=False, color='dodgerblue')
    ax[3,i].set_ylabel('Residual', fontsize=14)

plt.show()

output_52_0

3.3 Lag

We shift() (lag) the seasonal component of each variable to compare its correlation with the other variables.

weeks_in_month = 4

for column in core_columns:
    df[f'{column}_seasonal_shift_b_2m'] = df[f'{column}_seasonal'].shift(-2 * weeks_in_month)
    df[f'{column}_seasonal_shift_b_1m'] = df[f'{column}_seasonal'].shift(-1 * weeks_in_month)
    df[f'{column}_seasonal_shift_1m'] = df[f'{column}_seasonal'].shift(1 * weeks_in_month)
    df[f'{column}_seasonal_shift_2m'] = df[f'{column}_seasonal'].shift(2 * weeks_in_month)
    df[f'{column}_seasonal_shift_3m'] = df[f'{column}_seasonal'].shift(3 * weeks_in_month)
df.columns
Index(['date', 'depth_to_groundwater', 'temperature', 'drainage_volume',
       'river_hydrometry', 'rainfall', 'depth_to_groundwater_log',
       'depth_to_groundwater_diff_1', 'year', 'month', 'day', 'day_of_year',
       'week_of_year', 'quarter', 'season', 'month_sin', 'month_cos',
       'rainfall_trend', 'rainfall_seasonal', 'temperature_trend',
       'temperature_seasonal', 'drainage_volume_trend',
       'drainage_volume_seasonal', 'river_hydrometry_trend',
       'river_hydrometry_seasonal', 'depth_to_groundwater_trend',
       'depth_to_groundwater_seasonal', 'rainfall_seasonal_shift_b_2m',
       'rainfall_seasonal_shift_b_1m', 'rainfall_seasonal_shift_1m',
       'rainfall_seasonal_shift_2m', 'rainfall_seasonal_shift_3m',
       'temperature_seasonal_shift_b_2m', 'temperature_seasonal_shift_b_1m',
       'temperature_seasonal_shift_1m', 'temperature_seasonal_shift_2m',
       'temperature_seasonal_shift_3m', 'drainage_volume_seasonal_shift_b_2m',
       'drainage_volume_seasonal_shift_b_1m',
       'drainage_volume_seasonal_shift_1m',
       'drainage_volume_seasonal_shift_2m',
       'drainage_volume_seasonal_shift_3m',
       'river_hydrometry_seasonal_shift_b_2m',
       'river_hydrometry_seasonal_shift_b_1m',
       'river_hydrometry_seasonal_shift_1m',
       'river_hydrometry_seasonal_shift_2m',
       'river_hydrometry_seasonal_shift_3m',
       'depth_to_groundwater_seasonal_shift_b_2m',
       'depth_to_groundwater_seasonal_shift_b_1m',
       'depth_to_groundwater_seasonal_shift_1m',
       'depth_to_groundwater_seasonal_shift_2m',
       'depth_to_groundwater_seasonal_shift_3m'],
      dtype='object')

4. Exploratory Data Analysis

f, ax = plt.subplots(nrows=5, ncols=1, figsize=(15, 12))
f.suptitle('Seasonal Components of Features', fontsize=16)

for i, column in enumerate(core_columns):
    sns.lineplot(x=df['date'], y=df[column + '_seasonal'], ax=ax[i], color='dodgerblue', label='P25')
    ax[i].set_ylabel(ylabel=column, fontsize=14)
    ax[i].set_xlim([date(2017, 9, 30), date(2020, 6, 30)])
    
plt.tight_layout()
plt.show()

output_58_0

f, ax = plt.subplots(nrows=1, ncols=2, figsize=(16, 8))

corrmat = df[core_columns].corr()

sns.heatmap(corrmat, annot=True, vmin=-1, vmax=1, cmap='coolwarm', ax=ax[0])
ax[0].set_title('Correlation Matrix of Core Features', fontsize=16)

shifted_cols = [
    'depth_to_groundwater_seasonal',         
    'temperature_seasonal_shift_b_2m',
    'drainage_volume_seasonal_shift_2m', 
    'river_hydrometry_seasonal_shift_3m'
]
corrmat = df[shifted_cols].corr()

sns.heatmap(corrmat, annot=True, vmin=-1, vmax=1, cmap='coolwarm', ax=ax[1])
ax[1].set_title('Correlation Matrix of Lagged Features', fontsize=16)


plt.tight_layout()
plt.show()

output_59_0

4.1 Autocorrelation Analysis

ACF and PACF plots: After a time series has been stationarized by differencing, the next step in fitting an ARIMA model is to determine whether AR or MA terms are needed to correct any autocorrelation that remains in the differenced series. Of course, with software like Statgraphics, you could just try some different combinations of terms and see what works best. But there is a more systematic way to do this. By looking at the autocorrelation function (ACF) and partial autocorrelation (PACF) plots of the differenced series, you can tentatively identify the numbers of AR and/or MA terms that are needed.

  • Autocorrelation Function (ACF): the correlation of the series with its own lagged values. The lag after which the ACF cuts off (drops inside the confidence band) suggests q, the order of the moving-average (MA) part.

  • Partial Autocorrelation Function (PACF): the correlation of the series with its lagged values after removing the effect of the intermediate lags. The lag after which the PACF cuts off suggests p, the order of the autoregressive (AR) part.

Autocorrelation plots help in detecting seasonality.
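
As a rough heuristic (an addition, not part of the original notebook), we can count how many initial consecutive lags of the ACF and PACF have a confidence interval that excludes zero: the ACF cut-off suggests q and the PACF cut-off suggests p. The sketch below applies this to the differenced series from section 2.3.3.

# Heuristic sketch: suggest q from the ACF cut-off and p from the PACF cut-off
from statsmodels.tsa.stattools import acf, pacf

series = df['depth_to_groundwater_diff_1']

def initial_significant_lags(values, confint):
    # count leading lags whose confidence interval excludes zero
    count = 0
    for lag in range(1, len(values)):
        lower, upper = confint[lag]
        if lower <= 0 <= upper:   # first non-significant lag -> stop counting
            break
        count += 1
    return count

acf_vals, acf_ci = acf(series, nlags=40, alpha=0.05)
pacf_vals, pacf_ci = pacf(series, nlags=40, alpha=0.05)

print('suggested q (ACF cut-off): ', initial_significant_lags(acf_vals, acf_ci))
print('suggested p (PACF cut-off):', initial_significant_lags(pacf_vals, pacf_ci))

This is only a starting point; auto_arima (section 5.1.3) performs a more thorough search over (p, d, q).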

from pandas.plotting import autocorrelation_plot

autocorrelation_plot(df['depth_to_groundwater_diff_1'])
plt.show()

output_62_0

from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.graphics.tsaplots import plot_pacf

f, ax = plt.subplots(nrows=2, ncols=1, figsize=(16, 8))

plot_acf(df['depth_to_groundwater_diff_1'], lags=100, ax=ax[0])
plot_pacf(df['depth_to_groundwater_diff_1'], lags=100, ax=ax[1])

plt.show()

output_63_0

5. Modeling


Time series can be either univariate or multivariate:

  • Univariate time series only has a single time-dependent variable.
  • Multivariate time series have multiple time-dependent variables.

But first of all, we are going to see how the cross-validation technique works in time series analysis.

from sklearn.model_selection import TimeSeriesSplit

N_SPLITS = 3

X = df['date']
y = df['depth_to_groundwater']

folds = TimeSeriesSplit(n_splits=N_SPLITS)
f, ax = plt.subplots(nrows=N_SPLITS, ncols=2, figsize=(16, 9))

for i, (train_index, valid_index) in enumerate(folds.split(X)):
    X_train, X_valid = X[train_index], X[valid_index]
    y_train, y_valid = y[train_index], y[valid_index]

    sns.lineplot(
        x=X_train, 
        y=y_train, 
        ax=ax[i,0], 
        color='dodgerblue', 
        label='train'
    )
    sns.lineplot(
        x=X_train[len(X_train) - len(X_valid):(len(X_train) - len(X_valid) + len(X_valid))], 
        y=y_train[len(X_train) - len(X_valid):(len(X_train) - len(X_valid) + len(X_valid))], 
        ax=ax[i,1], 
        color='dodgerblue', 
        label='train'
    )

    for j in range(2):
        sns.lineplot(x= X_valid, y= y_valid, ax=ax[i, j], color='darkorange', label='validation')
    ax[i, 0].set_title(f"Rolling Window with Adjusting Training Size (Split {i+1})", fontsize=16)
    ax[i, 1].set_title(f"Rolling Window with Constant Training Size (Split {i+1})", fontsize=16)

for i in range(N_SPLITS):
    ax[i, 0].set_xlim([date(2009, 1, 1), date(2020, 6, 30)])
    ax[i, 1].set_xlim([date(2009, 1, 1), date(2020, 6, 30)])
    
plt.tight_layout()
plt.show()

output_67_0

The idea of this plot is to understand which train and validation sets we are using to fit the model in each iteration.

5.1 Models for Univariate Time Series


First of all, we are going to analyze univariate time series forecasting.

  • Univariate time series: Only one variable is varying over time. For example, data collected from a sensor measuring the temperature of a room every second. Therefore, each second, you will only have a one-dimensional value, which is the temperature.
train_size = int(0.85 * len(df))
test_size = len(df) - train_size

univariate_df = df[['date', 'depth_to_groundwater']].copy()
univariate_df.columns = ['ds', 'y']

train = univariate_df.iloc[:train_size, :]

x_train, y_train = pd.DataFrame(univariate_df.iloc[:train_size, 0]), pd.DataFrame(univariate_df.iloc[:train_size, 1])
x_valid, y_valid = pd.DataFrame(univariate_df.iloc[train_size:, 0]), pd.DataFrame(univariate_df.iloc[train_size:, 1])

print(len(train), len(x_valid))
510 90

5.1.1 Prophet

The first model we are going to try (which can also handle multivariate problems) is Facebook Prophet.

Prophet, or “Facebook Prophet,” is an open-source library for univariate (one variable) time series forecasting developed by Facebook.

Prophet implements what they refer to as an additive time series forecasting model, and the implementation supports trends, seasonality, and holidays.

!pip install prophet --quiet
from sklearn.metrics import mean_absolute_error, mean_squared_error
import math

from prophet import Prophet


# Train the model
model = Prophet()
model.fit(train)

# x_valid = model.make_future_dataframe(periods=test_size, freq='w')

# Predict on valid set
y_pred = model.predict(x_valid)

# Calculate metrics
score_mae = mean_absolute_error(y_valid, y_pred.tail(test_size)['yhat'])
score_rmse = math.sqrt(mean_squared_error(y_valid, y_pred.tail(test_size)['yhat']))

print(Fore.GREEN + 'RMSE: {}'.format(score_rmse))
INFO:prophet:Disabling weekly seasonality. Run prophet with weekly_seasonality=True to override this.
INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
DEBUG:cmdstanpy:input tempfile: /tmp/tmpiz72lvw3/dp_3lfk5.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmpiz72lvw3/11aw674z.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/usr/local/lib/python3.9/dist-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=12253', 'data', 'file=/tmp/tmpiz72lvw3/dp_3lfk5.json', 'init=/tmp/tmpiz72lvw3/11aw674z.json', 'output', 'file=/tmp/tmpiz72lvw3/prophet_modelhri1r5m1/prophet_model-20230423124655.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000']
12:46:55 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
12:46:56 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing


RMSE: 1.1737898100396633
# Plot the forecast
f, ax = plt.subplots(1)
f.set_figheight(6)
f.set_figwidth(15)

model.plot(y_pred, ax=ax)
sns.lineplot(x=x_valid['ds'], y=y_valid['y'], ax=ax, color='orange', label='Ground truth') #navajowhite

ax.set_title(f'Prediction \n MAE: {score_mae:.2f}, RMSE: {score_rmse:.2f}', fontsize=14)
ax.set_xlabel(xlabel='Date', fontsize=14)
ax.set_ylabel(ylabel='Depth to Groundwater', fontsize=14)

plt.show()

output_76_0

5.1.2 ARIMA

The second model that I want to try is ARIMA.

The Auto-Regressive Integrated Moving Average (ARIMA) model describes the autocorrelations in the data. The model assumes that the time-series is stationary. It consists of three main parts:

  • Auto-Regressive (AR) filter (long term):

$y_t = c + \alpha_{1}y_{t-1} + … + \alpha_{p}y_{t-p} + \epsilon_t = c+\sum_{i=1}^{p}\alpha_{i}y_{t-i}+\epsilon_t$ -> p

  • Integration filter (stochastic trend)

-> d

  • Moving Average (MA) filter (short term):

$y_t = c + \epsilon_t+\beta_1\epsilon_{t-1} + … + \beta_{q}\epsilon_{t-q} = c + \epsilon_{t}+\sum_{i=1}^{q}\beta_i\epsilon_{t-i}$ -> q

ARIMA : $y_t = c+\alpha_{1}y_{t-1} + … + \alpha_{p}y_{t-p}+\epsilon_t+\beta_{1}\epsilon_{t-1}+…+\beta_{q}\epsilon_{t-q}$

ARIMA( p, d, q)

  • p: Lag order (reference PACF in Autocorrelation Analysis)
  • d: Degree of differencing. (reference Differencing in Stationarity)
  • q: Order of moving average (check out ACF in Autocorrelation Analysis)

Steps to analyze ARIMA

  • Step 1 — Check stationarity: If a time series has a trend or seasonality component, it must be made stationary before we can use ARIMA to forecast.
  • Step 2 — Difference: If the time series is not stationary, it needs to be stationarized through differencing. Take the first difference, then check for stationarity. Take as many differences as it takes. Make sure you check seasonal differencing as well.
  • Step 3 — Filter out a validation sample: This will be used to validate how accurate our model is. Use train test validation split to achieve this
  • Step 4 — Select AR and MA terms: Use the ACF and PACF to decide whether to include an AR term(s), MA term(s), or both.
  • Step 5 — Build the model: Build the model and set the number of periods to forecast to N (depends on your needs).
  • Step 6 — Validate model: Compare the predicted values to the actuals in the validation sample.
from statsmodels.tsa.arima.model import ARIMA

# Fit model
model = ARIMA(y_train, order=(1,1,1))
model_fit = model.fit()

# Prediction with ARIMA
y_pred = model_fit.forecast(90, alpha=0.05)

# Calculate metrics
score_mae = mean_absolute_error(y_valid, y_pred)
score_rmse = math.sqrt(mean_squared_error(y_valid, y_pred))

print(Fore.GREEN + 'RMSE: {}'.format(score_rmse))
RMSE: 1.402615843711107
y_pred.plot()
<Axes: >

output_80_1

model_fit.plot_diagnostics(figsize=(16,8))
plt.show()

output_81_0

fig, ax = plt.subplots()

ax.plot(y_pred, label='Forecast')
ax.plot(y_valid, label='Ground truth')
ax.legend(loc='upper left')
ax.set_title("ARIMA Result")
Text(0.5, 1.0, 'ARIMA Result')

output_82_1

# f, ax = plt.subplots(1)
# f.set_figheight(6)
# f.set_figwidth(15)

# model_fit.plot_predict(start=1, end=599, ax=ax)
# sns.lineplot(x=x_valid.index, y=y_valid['y'], ax=ax, color='orange', label='Ground truth')

# ax.set_title(f'Prediction \n MAE: {score_mae:.2f}, RMSE: {score_rmse:.2f}', fontsize=14)
# ax.set_xlabel(xlabel='Date', fontsize=14)
# ax.set_ylabel(ylabel='Depth to Groundwater', fontsize=14)

# ax.set_ylim(-35, -18)
# plt.show()

5.1.3 Auto-ARIMA

!pip install pmdarima --quiet
import pmdarima as pm

model = pm.auto_arima(y_train, start_p=1, start_q=1,
                      test='adf',       # use adftest to find optimal 'd'
                      max_p=3, max_q=3, # maximum p and q
                      m=1,              # frequency of series
                      d=None,           # let model determine 'd'
                      seasonal=False,   # No Seasonality
                      start_P=0, 
                      D=0, 
                      trace=True,
                      error_action='ignore',  
                      suppress_warnings=True, 
                      stepwise=True)

print(model.summary())
Performing stepwise search to minimize aic
 ARIMA(1,1,1)(0,0,0)[0] intercept   : AIC=-631.136, Time=1.14 sec
 ARIMA(0,1,0)(0,0,0)[0] intercept   : AIC=-242.692, Time=0.17 sec
 ARIMA(1,1,0)(0,0,0)[0] intercept   : AIC=-574.047, Time=0.56 sec
 ARIMA(0,1,1)(0,0,0)[0] intercept   : AIC=-427.347, Time=0.37 sec
 ARIMA(0,1,0)(0,0,0)[0]             : AIC=-243.054, Time=0.14 sec
 ARIMA(2,1,1)(0,0,0)[0] intercept   : AIC=-629.209, Time=2.58 sec
 ARIMA(1,1,2)(0,0,0)[0] intercept   : AIC=-629.237, Time=2.10 sec
 ARIMA(0,1,2)(0,0,0)[0] intercept   : AIC=-492.779, Time=0.42 sec
 ARIMA(2,1,0)(0,0,0)[0] intercept   : AIC=-611.065, Time=0.25 sec
 ARIMA(2,1,2)(0,0,0)[0] intercept   : AIC=-628.351, Time=2.63 sec
 ARIMA(1,1,1)(0,0,0)[0]             : AIC=-632.995, Time=0.54 sec
 ARIMA(0,1,1)(0,0,0)[0]             : AIC=-428.258, Time=0.29 sec
 ARIMA(1,1,0)(0,0,0)[0]             : AIC=-575.735, Time=0.15 sec
 ARIMA(2,1,1)(0,0,0)[0]             : AIC=-631.069, Time=1.78 sec
 ARIMA(1,1,2)(0,0,0)[0]             : AIC=-631.097, Time=0.61 sec
 ARIMA(0,1,2)(0,0,0)[0]             : AIC=-494.001, Time=0.22 sec
 ARIMA(2,1,0)(0,0,0)[0]             : AIC=-612.866, Time=0.17 sec
 ARIMA(2,1,2)(0,0,0)[0]             : AIC=-630.210, Time=0.92 sec

Best model:  ARIMA(1,1,1)(0,0,0)[0]          
Total fit time: 15.143 seconds
                               SARIMAX Results                                
==============================================================================
Dep. Variable:                      y   No. Observations:                  510
Model:               SARIMAX(1, 1, 1)   Log Likelihood                 319.497
Date:                Sun, 23 Apr 2023   AIC                           -632.995
Time:                        12:48:50   BIC                           -620.297
Sample:                             0   HQIC                          -628.016
                                - 510                                         
Covariance Type:                  opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1          0.9196      0.021     43.766      0.000       0.878       0.961
ma.L1         -0.4885      0.037    -13.357      0.000      -0.560      -0.417
sigma2         0.0167      0.001     24.810      0.000       0.015       0.018
===================================================================================
Ljung-Box (L1) (Q):                   0.02   Jarque-Bera (JB):               185.01
Prob(Q):                              0.90   Prob(JB):                         0.00
Heteroskedasticity (H):               1.17   Skew:                             0.22
Prob(H) (two-sided):                  0.32   Kurtosis:                         5.92
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
model.plot_diagnostics(figsize=(16,8))
plt.show()

output_87_0

As we saw in the previous steps, auto_arima confirms that (1,1,1) is the best configuration for (p,d,q).

So how do we interpret the diagnostics plot?

  • Top left: The residual errors seem to fluctuate around a mean of zero, within roughly (-4, 4), with a uniform variance.

  • Top right: The density plot suggests a normal distribution with mean zero.

  • Bottom left: Most of the blue dots lie on the red line, so the distribution appears to have very little skew.

  • Bottom right: The correlogram, aka the ACF plot, shows that the residual errors are not autocorrelated.

5.1.4 LSTM

We are going to use a multi-layered LSTM recurrent neural network to predict the last value of a sequence of values.

The following data pre-processing and feature engineering steps need to be done before constructing the LSTM model.

  • Create the dataset, ensure all data is float.
  • Normalize the features.
  • Split into training and test sets.
  • Convert an array of values into a dataset matrix.
  • Reshape into X=t and Y=t+1.
  • Reshape input to be 3D (num_samples, num_timesteps, num_features).
from sklearn.preprocessing import MinMaxScaler

data = univariate_df.filter(['y'])
#Convert the dataframe to a numpy array
dataset = data.values

scaler = MinMaxScaler(feature_range=(-1, 0))
scaled_data = scaler.fit_transform(dataset)

scaled_data[:10]
array([[-0.81796644],
       [-0.79970385],
       [-0.7745311 ],
       [-0.74679171],
       [-0.73099704],
       [-0.71253702],
       [-0.7023692 ],
       [-0.68410661],
       [-0.66890424],
       [-0.65528134]])
# Defines the rolling window
look_back = 52
# Split into train and test sets
train, test = scaled_data[:train_size-look_back,:], scaled_data[train_size-look_back:,:]

def create_dataset(dataset, look_back=1):
    X, Y = [], []
    for i in range(look_back, len(dataset)):
        a = dataset[i-look_back:i, 0]
        X.append(a)
        Y.append(dataset[i, 0])
    return np.array(X), np.array(Y)

x_train, y_train = create_dataset(train, look_back)
x_test, y_test = create_dataset(test, look_back)

# reshape input to be [samples, time steps, features]
x_train = np.reshape(x_train, (x_train.shape[0], 1, x_train.shape[1]))
x_test = np.reshape(x_test, (x_test.shape[0], 1, x_test.shape[1]))

print(len(x_train), len(x_test))
406 90
from keras.models import Sequential
from keras.layers import Dense, LSTM

#Build the LSTM model
model = Sequential()
model.add(LSTM(128, return_sequences=True, input_shape=(x_train.shape[1], x_train.shape[2])))
model.add(LSTM(64, return_sequences=False))
model.add(Dense(25))
model.add(Dense(1))

# Compile the model
model.compile(optimizer='adam', loss='mean_squared_error')

#Train the model
model.fit(x_train, y_train, batch_size=1, epochs=10, validation_data=(x_test, y_test))

model.summary()
Epoch 1/10
406/406 [==============================] - 7s 9ms/step - loss: 0.0085 - val_loss: 0.0024
Epoch 2/10
406/406 [==============================] - 3s 7ms/step - loss: 0.0020 - val_loss: 0.0029
Epoch 3/10
406/406 [==============================] - 2s 5ms/step - loss: 0.0027 - val_loss: 0.0032
Epoch 4/10
406/406 [==============================] - 2s 6ms/step - loss: 0.0011 - val_loss: 3.3838e-04
Epoch 5/10
406/406 [==============================] - 2s 5ms/step - loss: 0.0014 - val_loss: 3.9864e-04
Epoch 6/10
406/406 [==============================] - 2s 5ms/step - loss: 0.0016 - val_loss: 5.4464e-04
Epoch 7/10
406/406 [==============================] - 3s 7ms/step - loss: 0.0020 - val_loss: 3.9537e-04
Epoch 8/10
406/406 [==============================] - 3s 8ms/step - loss: 0.0010 - val_loss: 6.5017e-04
Epoch 9/10
406/406 [==============================] - 2s 5ms/step - loss: 8.9141e-04 - val_loss: 0.0036
Epoch 10/10
406/406 [==============================] - 2s 6ms/step - loss: 9.3876e-04 - val_loss: 2.3816e-04
Model: "sequential_6"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 lstm_10 (LSTM)              (None, 1, 128)            92672     
                                                                 
 lstm_11 (LSTM)              (None, 64)                49408     
                                                                 
 dense_10 (Dense)            (None, 25)                1625      
                                                                 
 dense_11 (Dense)            (None, 1)                 26        
                                                                 
=================================================================
Total params: 143,731
Trainable params: 143,731
Non-trainable params: 0
_________________________________________________________________
# Let's predict with the model
train_predict = model.predict(x_train)
test_predict = model.predict(x_test)

# invert predictions
train_predict = scaler.inverse_transform(train_predict)
y_train = scaler.inverse_transform([y_train])

test_predict = scaler.inverse_transform(test_predict)
y_test = scaler.inverse_transform([y_test])

# Get the root mean squared error (RMSE) and MAE
score_rmse = np.sqrt(mean_squared_error(y_test[0], test_predict[:,0]))
score_mae = mean_absolute_error(y_test[0], test_predict[:,0])
print(Fore.GREEN + 'RMSE: {}'.format(score_rmse))
13/13 [==============================] - 1s 3ms/step
3/3 [==============================] - 0s 6ms/step
RMSE: 0.2233293828264469
fig, ax = plt.subplots()

ax.plot(test_predict[:,0], label='Forecast')
ax.plot(y_test[0], label='Ground truth')
ax.legend(loc='upper left')
ax.set_title("LSTM Result")
Text(0.5, 1.0, 'LSTM Result')

output_95_1

5.2 Models for Multivariate Time Series

Finally, we are going to analyze multivariate time series forecasting.

Multivariate time series: multiple variables vary over time. For example, a tri-axial accelerometer produces three accelerations, one for each axis (x, y, z), and they vary simultaneously over time.

feature_columns = [
    'rainfall',
    'temperature',
    'drainage_volume',
    'river_hydrometry',
]
target_column = ['depth_to_groundwater']

train_size = int(0.85 * len(df))

multivariate_df = df[['date'] + target_column + feature_columns].copy()
multivariate_df.columns = ['ds', 'y'] + feature_columns

train = multivariate_df.iloc[:train_size, :]
x_train, y_train = pd.DataFrame(multivariate_df.iloc[:train_size, [0,2,3,4,5]]), pd.DataFrame(multivariate_df.iloc[:train_size, 1])
x_valid, y_valid = pd.DataFrame(multivariate_df.iloc[train_size:, [0,2,3,4,5]]), pd.DataFrame(multivariate_df.iloc[train_size:, 1])

train.head()
ds y rainfall temperature drainage_volume river_hydrometry
0 2009-01-01 -31.048571 0.000000 1.657143 -28164.918857 2.371429
1 2009-01-08 -30.784286 0.285714 4.571429 -29755.789714 2.314286
2 2009-01-15 -30.420000 0.028571 7.528571 -25463.190857 2.300000
3 2009-01-22 -30.018571 0.585714 6.214286 -23854.422857 2.500000
4 2009-01-29 -29.790000 1.414286 5.771429 -25210.532571 2.500000

5.2.1 Multivariate Prophet

from prophet import Prophet


# Train the model
model = Prophet()
model.add_regressor('rainfall')
model.add_regressor('temperature')
model.add_regressor('drainage_volume')
model.add_regressor('river_hydrometry')

# Fit the model with train set
model.fit(train)

# Predict on valid set
y_pred = model.predict(x_valid)

# Calculate metrics
score_mae = mean_absolute_error(y_valid, y_pred['yhat'])
score_rmse = math.sqrt(mean_squared_error(y_valid, y_pred['yhat']))

print(Fore.GREEN + 'RMSE: {}'.format(score_rmse))
INFO:prophet:Disabling weekly seasonality. Run prophet with weekly_seasonality=True to override this.
INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
DEBUG:cmdstanpy:input tempfile: /tmp/tmpiz72lvw3/kv3cqdzg.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmpiz72lvw3/u34z6wgv.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/usr/local/lib/python3.9/dist-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=422', 'data', 'file=/tmp/tmpiz72lvw3/kv3cqdzg.json', 'init=/tmp/tmpiz72lvw3/u34z6wgv.json', 'output', 'file=/tmp/tmpiz72lvw3/prophet_modelrmldq8w6/prophet_model-20230423125057.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000']
12:50:57 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
12:50:58 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing


RMSE: 1.0144259311109194
# Plot the forecast
f, ax = plt.subplots(1)
f.set_figheight(6)
f.set_figwidth(15)

model.plot(y_pred, ax=ax)
sns.lineplot(x=x_valid['ds'], y=y_valid['y'], ax=ax, color='orange', label='Ground truth') #navajowhite

ax.set_title(f'Prediction \n MAE: {score_mae:.2f}, RMSE: {score_rmse:.2f}', fontsize=14)
ax.set_xlabel(xlabel='Date', fontsize=14)
ax.set_ylabel(ylabel='Depth to Groundwater', fontsize=14)

plt.show()

output_101_0

6. Conclusion

The best results come from the univariate LSTM (with a rolling window of one year) and the multivariate Prophet model.

7. References

Here I am going to reference some useful links that I used to build this notebook.

Special reference for the helpful information and plots - https://www.kaggle.com/iamleonie/intro-to-time-series-forecasting

ARIMA - https://towardsdatascience.com/time-series-forecasting-arima-models-7f221e9eee06

Auto-ARIMA - https://www.machinelearningplus.com/time-series/arima-model-time-series-forecasting-python/

Keras LSTM - https://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/

Prophet - https://towardsdatascience.com/time-series-prediction-using-prophet-in-python-35d65f626236

Special reference - https://www.kaggle.com/iamleonie/intro-to-time-series-forecasting/notebook#Models

Cyclical features - https://towardsdatascience.com/cyclical-features-encoding-its-about-time-ce23581845ca

ADF - https://medium.com/@cmukesh8688/why-is-augmented-dickey-fuller-test-adf-test-so-important-in-time-series-analysis-6fc97c6be2f0

ACF/PACF - https://towardsdatascience.com/significance-of-acf-and-pacf-plots-in-time-series-analysis-2fa11a5d10a8

LSTM - https://towardsdatascience.com/time-series-analysis-visualization-forecasting-with-lstm-77a905180eba
