Out-of-sample prediction: how to implement it?

Python

Antonio0608, 2021-11-25 23:20:22

Good day.
I recently started studying neural networks. I have read a lot of books and looked through articles on the Internet.
I took a standard neural network as a basis and began to study it: data preparation, hyperparameter tuning, and so on. Gradually it all started to make sense. But then I realized that there is very little information on how to predict outside the sample.
Here is the network. It predicts the level of air pollution in Beijing.
Many people have probably seen this network and learned from it.
But how to make a forecast for the next day or week is not clear.
Can anyone explain?

from math import sqrt
from numpy import concatenate
from matplotlib import pyplot
from pandas import read_csv
from pandas import DataFrame
from pandas import concat
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_squared_error
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
 
# convert series to supervised learning
def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
  n_vars = 1 if type(data) is list else data.shape[1]
  df = DataFrame(data)
  cols, names = list(), list()
  # input sequence (t-n, ... t-1)
  for i in range(n_in, 0, -1):
    cols.append(df.shift(i))
    names += [('var%d(t-%d)' % (j+1, i)) for j in range(n_vars)]
  # forecast sequence (t, t+1, ... t+n)
  for i in range(0, n_out):
    cols.append(df.shift(-i))
    if i == 0:
      names += [('var%d(t)' % (j+1)) for j in range(n_vars)]
    else:
      names += [('var%d(t+%d)' % (j+1, i)) for j in range(n_vars)]
  # put it all together
  agg = concat(cols, axis=1)
  agg.columns = names
  # drop rows with NaN values
  if dropnan:
    agg.dropna(inplace=True)
  return agg
 
# load dataset
dataset = read_csv('pollution.csv', header=0, index_col=0)
values = dataset.values
# integer encode direction
encoder = LabelEncoder()
values[:,4] = encoder.fit_transform(values[:,4])
# ensure all data is float
values = values.astype('float32')
# normalize features
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(values)
# frame as supervised learning
reframed = series_to_supervised(scaled, 1, 1)
# drop the (t) columns we don't want to predict, keeping only var1(t) as the target
n_features = scaled.shape[1]
reframed.drop(reframed.columns[n_features + 1:], axis=1, inplace=True)
print(reframed.head())
 
# split into train and test sets
values = reframed.values
n_train_hours = 365 * 24
train = values[:n_train_hours, :]
test = values[n_train_hours:, :]
# split into input and outputs
train_X, train_y = train[:, :-1], train[:, -1]
test_X, test_y = test[:, :-1], test[:, -1]
# reshape input to be 3D [samples, timesteps, features]
train_X = train_X.reshape((train_X.shape[0], 1, train_X.shape[1]))
test_X = test_X.reshape((test_X.shape[0], 1, test_X.shape[1]))
print(train_X.shape, train_y.shape, test_X.shape, test_y.shape)
 
# design network
model = Sequential()
model.add(LSTM(50, input_shape=(train_X.shape[1], train_X.shape[2])))
model.add(Dense(1))
model.compile(loss='mae', optimizer='adam')
# fit network
history = model.fit(train_X, train_y, epochs=50, batch_size=72, validation_data=(test_X, test_y), verbose=2, shuffle=False)
# plot history
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='test')
pyplot.legend()
pyplot.show()
 
# make a prediction
yhat = model.predict(test_X)
test_X = test_X.reshape((test_X.shape[0], test_X.shape[2]))
# invert scaling for forecast
inv_yhat = concatenate((yhat, test_X[:, 1:]), axis=1)
inv_yhat = scaler.inverse_transform(inv_yhat)
inv_yhat = inv_yhat[:,0]
# invert scaling for actual
test_y = test_y.reshape((len(test_y), 1))
inv_y = concatenate((test_y, test_X[:, 1:]), axis=1)
inv_y = scaler.inverse_transform(inv_y)
inv_y = inv_y[:,0]
# calculate RMSE
rmse = sqrt(mean_squared_error(inv_y, inv_yhat))
print('Test RMSE: %.3f' % rmse)

1 answer(s)

dmshar, 2021-11-26
@Antonio0608

It's strange: you say you have read a lot of books and articles, and yet you ask a rather strange question. What is there to predict within the sample? Prediction ALWAYS means obtaining values outside of your original sample.
And the claim that there is little information on the Internet is hard to understand. There is not just a lot of information on this topic, there is an enormous amount.
I just typed it into Google and immediately got:
https://habr.com/ru/post/495884
https://habr.com/ru/post/505338/
Maybe the problem is that you jumped straight into neural networks without understanding the basics of machine learning?
In essence, the answer is this: to build an "out of sample" forecast, you first build a model (train the neural network on the training dataset), then validate it (i.e. check its performance on the test dataset), and only then feed it new values of the independent variables (at minimum a point in time, and possibly the values of other independent variables at that point in time), and the model returns the predicted value.
But to understand this better, I recommend that you get acquainted with the basic concepts of ML, including time series theory. Then there will be no such schoolboy questions in the future.
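For the network in the question, that last step might look roughly like the sketch below. It assumes the fitted `model` and `scaler` from the question's code, and that the most recent observed row has the same column order as the training data (pollution first, wind direction already label-encoded). The names `forecast_next`, `forecast_horizon`, `DemoScaler` and `DemoModel` are hypothetical; the two demo classes are only stand-ins so the sketch runs without Keras or scikit-learn. The multi-step part also assumes the non-target features stay at their last observed values, which is a crude simplification.

```python
import numpy as np


def forecast_next(model, scaler, last_row, target_col=0):
    """One-step-ahead forecast (raw units) from the most recent observed row."""
    last_row = np.asarray(last_row, dtype="float32").reshape(1, -1)
    scaled = scaler.transform(last_row)            # same scaling as in training
    x = scaled.reshape((1, 1, scaled.shape[1]))    # [samples, timesteps, features]
    yhat = np.asarray(model.predict(x)).reshape(1, 1)
    # Invert scaling: place the prediction into the target column of a
    # full-width row, mirroring what the question's code does for the test set.
    padded = scaled.copy()
    padded[0, target_col] = yhat[0, 0]
    return float(scaler.inverse_transform(padded)[0, target_col])


def forecast_horizon(model, scaler, last_row, steps, target_col=0):
    """Recursive multi-step forecast: each prediction becomes the next input.

    ASSUMPTION: all non-target features are held at their last observed
    values. In practice you would supply forecasts for them (e.g. weather).
    """
    row = np.asarray(last_row, dtype="float32").copy()
    forecasts = []
    for _ in range(steps):
        nxt = forecast_next(model, scaler, row, target_col)
        forecasts.append(nxt)
        row[target_col] = nxt                      # feed the prediction back in
    return forecasts


class DemoScaler:
    """Hypothetical stand-in exposing transform/inverse_transform like MinMaxScaler."""

    def __init__(self, lo, hi):
        self.lo = np.asarray(lo, dtype="float32")
        self.hi = np.asarray(hi, dtype="float32")

    def transform(self, x):
        return (x - self.lo) / (self.hi - self.lo)

    def inverse_transform(self, x):
        return x * (self.hi - self.lo) + self.lo


class DemoModel:
    """Hypothetical stand-in for the trained network: halves the last scaled target."""

    def predict(self, x):
        return x[:, 0, :1] * 0.5


demo_scaler = DemoScaler([0, 0, 0], [100, 50, 10])
demo_model = DemoModel()
print(forecast_next(demo_model, demo_scaler, [40, 25, 5]))
print(forecast_horizon(demo_model, demo_scaler, [40, 25, 5], steps=3))
```

With the real objects, the call would be something like `forecast_next(model, scaler, values_before_scaling[-1])`, where `values_before_scaling` is a hypothetical name for the float array that was passed to `scaler.fit_transform`. Errors accumulate quickly in a recursive forecast, so a week ahead at hourly resolution should be treated with caution.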