August 1, 2018 · EN

Deep Learning Approach for Predicting Stock Market via Numerai

Predicting the future values of their favorite stocks is what every newcomer hopes to accomplish, and it is what disappoints them most once they actually become insiders.

With simple statistical approaches, it is quite difficult to estimate market values, which change through manipulation as well as through their numerical features. This is especially true if you use only standard features (such as buys, sells, volume, RSI, MACD, etc.). Raw prediction of stocks (especially cryptocurrencies) using both numerical features and news-based scenarios is a separate topic that I will cover in later posts.

In this article, I'll share an experimental deep learning approach to a data science competition that Numerai operates.

*I assume that the reader uses JupyterLab as the development environment.

First, we'll install the Numerai Python client (numerapi) to connect to the server, download the cleaned datasets, and submit predictions under a registered username.

Dependencies can be installed from a Jupyter cell by prepending an exclamation mark to pip (for example, !pip install numerapi):

numerapi_installation

Using the API, connect to the server. There are five sub-datasets, one of which is called bernie; the others can be seen in the downloaded files:

import numerapi
public_id  = 'xxxxxxx'
secret_key = 'xxxxxxx'
napi       = numerapi.NumerAPI(public_id, secret_key)
SESSION    = 'bernie'

Download the dataset. The call downloads and unzips it; the folder name should be something like numerai_dataset_XXX:

datasetname = napi.download_current_dataset(unzip=True).replace('.zip', '')

Now it's time to load the training and tournament sets into variables. At the time of writing, the active tournament is number 118:

import pandas as pd

training        = pd.read_csv(datasetname+'/numerai_training_data.csv')
tournament      = pd.read_csv(datasetname+'/numerai_tournament_data.csv', 
                              header=0, index_col=None)
validation_data = tournament[tournament.data_type=='validation']

The training set should look like the following:

training_head

training_head2

Let's print out the shapes of the datasets we have for our prediction task:

print("""
    training shape: {}
    tournament shape: {}""".format(training.shape, tournament.shape))

dataset_details

Looking at the training set, we see both feature columns and other alphanumeric columns. To process the raw data, we extract the features as follows: we first fetch the feature values for X, then assign the matching Y labels according to the target column names.

features = [f for f in list(training) if 'feature' in f]
X        = training[features].values
X        = X.reshape(X.shape[0], 1, X.shape[1])

all_sessions = ['bernie', 'charles', 'elizabeth', 'jordan', 'ken']
for s in all_sessions:
    globals()['Y_{}'.format(s)] = training['target_{}'.format(s)].values
    
Y = globals()['Y_'+SESSION]

This way, we now have X and Y values. Y refers to the current session (for now, bernie).

As you know, globals()['variable_name'] is practical when you create a new variable with a dynamic name.
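
As an alternative to globals(), a plain dictionary keyed by session name keeps the same dynamic lookup without creating module-level variables. The following is a minimal sketch on a made-up three-row frame; training_demo, demo_sessions, and targets are hypothetical names used only for illustration and are not part of the original code.

```python
import pandas as pd

# Tiny synthetic stand-in for the training frame (all values are made up).
training_demo = pd.DataFrame({
    "id": ["a", "b", "c"],
    "feature1": [0.1, 0.5, 0.9],
    "feature2": [0.2, 0.4, 0.6],
    "target_bernie": [0, 1, 0],
    "target_ken": [1, 1, 0],
})
demo_sessions = ["bernie", "ken"]

# Same column filter as in the post: keep columns whose name contains 'feature'.
demo_features = [c for c in training_demo.columns if "feature" in c]

# One dictionary entry per session instead of one global variable per session.
targets = {s: training_demo["target_{}".format(s)].values for s in demo_sessions}

Y_demo = targets["bernie"]
print(demo_features, Y_demo)
```

Switching sessions then only requires changing the dictionary key.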

We'll repeat the same process for the validation set, which is included in the tournament file:

validation_data = tournament[tournament.data_type=='validation']
valX            = validation_data[features].values
valX            = valX.reshape(valX.shape[0], 1, valX.shape[1])
for s in all_sessions:
    globals()['valY_{}'.format(s)] = validation_data['target_{}'.format(s)].values
    
valY = globals()['valY_'+SESSION]

We'll submit predictions for the whole tournament dataset. We reshape it to the same uniform shape (dim1, 1, dim2) and also keep the ID of each row in the test set:

testX = tournament[features].values
testX = testX.reshape(testX.shape[0], 1, testX.shape[1])
ids   = tournament['id']
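
To make the (dim1, 1, dim2) reshape concrete, here is a small sketch on a made-up 4x3 array. It only shows that the reshape adds a middle axis of length 1 and leaves the values untouched; flat, cube, and restored are illustrative names.

```python
import numpy as np

# 4 rows with 3 features each (made-up values).
flat = np.arange(12, dtype=float).reshape(4, 3)

# Insert a "time window" axis of length 1 between rows and features.
cube = flat.reshape(flat.shape[0], 1, flat.shape[1])
print(flat.shape, cube.shape)

# Dropping the middle axis gives back the original matrix.
restored = cube[:, 0, :]
```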

We now have training and validation datasets with 50 features each:

dataset_shapes

Let's create our deep learning model. We'll implement an autoencoder to both compress and generalize our dataset. Generalizing should make the model more robust against unseen and noisy data, since we do not know what the test data will bring.

First, import the libraries we'll be using:

import numpy as np
import keras
from keras.models import Model
from keras.layers import (Input, Dense, Dropout, Flatten, Conv1D,
                          AveragePooling1D, AlphaDropout)
from livelossplot import PlotLossesKeras
from keras.initializers import lecun_normal
from AdamW import AdamW  # local AdamW.py file, see the note below
from keras.callbacks import ReduceLROnPlateau

Notice that we're importing AdamW, an improved version of Adam that decouples weight decay from the gradient update.
AdamW can be downloaded from GitHub, and here is the related article. Copy the contents of the file into a file named 'AdamW.py' next to your notebook, and it can then be imported.

Let's write the autoencoder:

timewindow  = X.shape[1]
numfeatures = X.shape[2]

inputs   = Input(shape=(timewindow, numfeatures), name='input')
x        = inputs

x  = Dense(128, activation="relu", name="encoderlayer1", 
           kernel_initializer=lecun_normal())(x)
x  = Dense(64,  activation="relu", name="encoderlayer2", 
           kernel_initializer=lecun_normal())(x)

x  = Dense(32,  activation="relu", name="encoder", 
           kernel_initializer=lecun_normal())(x)

x  = Dense(64,  activation="relu", name="decoderlayer1", 
           kernel_initializer=lecun_normal())(x)
x  = Dense(128, activation="relu", name="decoderlayer2", 
           kernel_initializer=lecun_normal())(x)
x  = Dense(numfeatures, activation='sigmoid', name="decoder", 
           kernel_initializer=lecun_normal())(x)


aemodel = Model(inputs=inputs, outputs=x)
aemodel.summary()

from keras.utils import plot_model
plot_model(aemodel, to_file='aemodel.png')
from IPython.display import Image
Image(url="aemodel.png")

This will build an autoencoder with the following configuration:

autoencoder

Let's compile our model with AdamW's parameters:

batch_size = 16384
epochs     = 100
b, B, T    = batch_size, X.shape[0], epochs
wd         = 0.005 * (b/B/T)**0.5

aemodel.compile(loss='mean_squared_error',  optimizer=AdamW(weight_decay=wd), 
                metrics=['mae'])
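
The wd expression above scales a base value of 0.005 by sqrt(b / (B * T)), where b is the batch size, B the number of training samples, and T the number of epochs. The sketch below wraps that arithmetic in a helper; normalized_weight_decay is a hypothetical name, and the sample count of 400000 is an assumed figure for illustration, not the real size of the Numerai training set.

```python
import math

def normalized_weight_decay(base, batch_size, num_samples, epochs):
    """Scale a base weight decay by sqrt(b / (B * T)), as in the post's wd line."""
    return base * math.sqrt(batch_size / (num_samples * epochs))

# Assumed sample count, used only to show how the value moves with batch size.
wd_large_batch = normalized_weight_decay(0.005, 16384, 400000, 100)
wd_small_batch = normalized_weight_decay(0.005, 1024, 400000, 100)
print(wd_large_batch, wd_small_batch)
```

With everything else fixed, a batch 16 times larger yields a weight decay 4 times larger, because the batch size enters under the square root.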

We'll first train for a few rounds on copies of X and valX with a small amount of Gaussian noise added, using the clean data as the target, and then train once more on the exact data:

ae_histories = []
for i in range(5):
    noise_factor = 0.00001
    noisyX = X + noise_factor * np.random.normal(loc=0.0, scale=1, size=X.shape)
    noisyvalX = valX + noise_factor * np.random.normal(loc=0.0, scale=1, size=valX.shape)

    ae_histories.append(
        aemodel.fit(noisyX, X, batch_size=batch_size, epochs=epochs, shuffle=True, 
                    callbacks=[PlotLossesKeras()], 
                    validation_data=(noisyvalX, valX)))

ae_histories.append(
    aemodel.fit(X, X, batch_size=batch_size, epochs=epochs, shuffle=False, 
                callbacks=[PlotLossesKeras()], 
                validation_data=(valX, valX)))
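
To get a feel for how small the perturbation is, the following sketch applies the same noise formula to a made-up array with the same layout as X. The array contents and the fixed seed are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up stand-in for X: 1000 rows, time window of 1, 50 features in [0, 1].
clean = rng.uniform(0.0, 1.0, size=(1000, 1, 50))

# Same recipe as in the training loop: clean data plus scaled Gaussian noise.
noise_factor = 0.00001
noisy = clean + noise_factor * rng.normal(loc=0.0, scale=1.0, size=clean.shape)

# Largest absolute change introduced by the noise.
max_shift = np.abs(noisy - clean).max()
print(max_shift)
```

With this noise factor the inputs change only far past the decimal point, so you may want to experiment with larger values if the goal is a noticeably denoising autoencoder.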

Now we have a fully working autoencoder that can compress the given 50-feature data to 32 features.

Let's run it:

compressed_model = Model(inputs=aemodel.input,
                         outputs=aemodel.get_layer("encoder").output)
compressed_X     = compressed_model.predict(X)
compressed_valX  = compressed_model.predict(valX)

compressed_X and compressed_valX are our compressed training and validation sets. We'll feed them into a classifier that makes the actual predictions. This classifier first applies two convolution layers (128 filters with kernel size 3, then 256 filters with kernel size 5), each followed by average pooling with pool size 2. It's an experimental classifier that applies convolutions to stock data; we will not compare whether using convolutions yields better results than leaving them out. Note that the fully connected layers use selu as the activation and lecun_normal as the weight initializer. Finally, the model uses AlphaDropout, a dropout variant that keeps the mean and variance of its inputs. This combination makes our classifier a self-normalizing network.

timewindow  = compressed_X.shape[1]
numfeatures = compressed_X.shape[2]

inputs   = Input(shape=(timewindow, numfeatures))
x        = inputs

x  = Conv1D(filters=128, kernel_size=3, padding='same', activation='relu', 
            name='convfeatures1')(x)
x  = AveragePooling1D(pool_size=2, padding='same')(x)
x  = Conv1D(filters=256, kernel_size=5, padding='same', activation='relu', 
            name='convfeatures2')(x)
x  = AveragePooling1D(pool_size=2, padding='same')(x)

x  = Dense(256, activation='selu', kernel_initializer=lecun_normal())(x)
x  = AlphaDropout(0.1)(x)
x  = Dense(128, activation='selu', kernel_initializer=lecun_normal())(x)
x  = AlphaDropout(0.1)(x)
x  = Dense(64, activation='selu', kernel_initializer=lecun_normal())(x)
x  = AlphaDropout(0.1)(x)

x  = Flatten()(x)
x  = Dense(1, activation='sigmoid')(x)

classifier = Model(inputs=inputs, outputs=x)
classifier.summary()

from keras.utils import plot_model
plot_model(classifier, to_file='classifier.png')
from IPython.display import Image
Image(url= "classifier.png")

The model summary should look like the following:
classifier
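
The self-normalizing property mentioned above can be checked numerically. The sketch below is a plain NumPy illustration, not part of the classifier: it applies SELU to standard-normal samples and compares the resulting mean and variance with ReLU. The constants are the standard SELU values; the sample size and seed are arbitrary choices.

```python
import numpy as np

# Standard SELU constants from the self-normalizing networks paper.
ALPHA = 1.6732632423543772
SCALE = 1.0507009873554805

def selu(z):
    """Scaled exponential linear unit."""
    return SCALE * np.where(z > 0, z, ALPHA * (np.exp(z) - 1.0))

rng = np.random.default_rng(0)
z = rng.normal(loc=0.0, scale=1.0, size=1_000_000)  # zero mean, unit variance

selu_out = selu(z)
relu_out = np.maximum(z, 0.0)

# SELU keeps mean close to 0 and variance close to 1; ReLU shifts both.
print("selu:", selu_out.mean(), selu_out.var())
print("relu:", relu_out.mean(), relu_out.var())
```

lecun_normal initialization is what keeps the pre-activations of each Dense layer near zero mean and unit variance, which is the situation simulated here.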

We're almost done. Let's train our classifier:

batch_size = 1024
epochs     = 100

b, B, T = batch_size, X.shape[0], epochs
wd = 0.005 * (b/B/T)**0.5

classifier.compile(loss='binary_crossentropy',
                   optimizer=AdamW(weight_decay=wd),
                   metrics=['accuracy'])

REDUCE_LR = ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=5, min_lr=1e-5)

history = classifier.fit(compressed_X, Y, batch_size=batch_size, epochs=epochs, 
                         shuffle=True, callbacks=[PlotLossesKeras(), REDUCE_LR],
                         validation_data=(compressed_valX, valY))

REDUCE_LR is a Keras callback that checks after each epoch whether the validation loss has improved. If it has not improved for 5 consecutive epochs (patience=5), the callback halves the learning rate (factor=0.5), down to a minimum of 1e-5 (min_lr), so that training can keep making progress once it stalls.
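
The plateau rule can be illustrated with a few lines of plain Python. This is a simplified sketch of the idea, not the Keras implementation: it ignores the callback's min_delta and cooldown options, and simulate_reduce_lr and the loss values are made up for illustration.

```python
def simulate_reduce_lr(val_losses, lr, factor=0.5, patience=5, min_lr=1e-5):
    """Return the learning rate in effect after each epoch."""
    best = float("inf")
    wait = 0
    schedule = []
    for loss in val_losses:
        if loss < best:
            # Validation loss improved: remember it and reset the counter.
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                # No improvement for `patience` epochs: shrink the rate.
                lr = max(lr * factor, min_lr)
                wait = 0
        schedule.append(lr)
    return schedule

# Three improving epochs followed by ten stagnant ones (made-up values).
losses = [0.70, 0.695, 0.693] + [0.694] * 10
schedule = simulate_reduce_lr(losses, lr=1e-3)
print(schedule)
```

In this run the rate is halved after the fifth stagnant epoch and again after the tenth.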

Your training progress should look like the following:
progress

It's time to write out our predictions for the test data. The following lines predict on the compressed tournament set and write the results to a CSV file. Since Numerai limits the outputs to between 0.3 and 0.7, you should keep an eye on the minimum and maximum of your predicted values:

compressed_testX = compressed_model.predict(testX)  # compress the tournament set too
pred        = classifier.predict(compressed_testX)
pred        = pred.reshape(pred.shape[0])
print("min: {}, max: {}".format(pred.min(), pred.max()))
results_df  = pd.DataFrame(data={'probability_{}'.format(SESSION): pred})
joined      = pd.DataFrame(ids).join(results_df)
joined.to_csv('predictions_{}.csv'.format(SESSION), index=False)
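
One way to keep an eye on the range is to measure how many predictions fall outside the 0.3 to 0.7 band and, if needed, clip them. The sketch below uses made-up values; whether clipping is the right remedy (rather than retraining or rescaling) is an assumption you should verify for your own model, and pred_demo, low, and high are illustrative names.

```python
import numpy as np

# Made-up predictions standing in for the classifier output.
pred_demo = np.array([0.12, 0.48, 0.51, 0.93])
low, high = 0.3, 0.7  # bounds mentioned in the post

# Fraction of predictions outside the allowed band.
outside = np.mean((pred_demo < low) | (pred_demo > high))

# Clip out-of-range values to the nearest bound.
clipped = np.clip(pred_demo, low, high)
print(outside, clipped)
```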

If you are satisfied with your model and results, feel free to submit your file:

submission_id = napi.upload_predictions("predictions_bernie.csv")

progress1

Then check your current competition status:

napi.submission_status()

result

We're done! Now you can stake NMR to take part in the competitions.
The complete code can be found on GitHub: https://github.com/ucekmez/numerai_challenge

Note: You don't have to use an autoencoder to compress this dataset, since Numerai already provides cleaned, well-prepared data. If you work with your own dataset, you may need one. Try using only the classifier to see whether the resulting predictions differ. Be curious!

Please feel free to improve this post with your suggestions.