Predict future stock prices using Machine learning
As AI and machine learning grow by the day, the number of areas where these techniques can be applied is increasing at a steady pace. Today we find AI in many areas, such as healthcare, finance and the military, and its use will continue to grow. Automated stock prediction tools that analyze trading behaviour, financial data and so on have, however, been in use for many years.
Looking into the future of stock prices is a very challenging task (also impossible 🙃), but by using machine learning to learn certain patterns of a stock, we can get some assistance in making a qualified decision on whether to buy a stock or not.
In this article, we will explore some techniques on how we can apply machine learning with Tensorflow and Python to see if we can get some good predictions on the future prices of a particular stock.
Prerequisites
To keep this article from getting too long and boring, I will not walk through the installation and setup of everything. I will simply list what we need below and share some good links on how you can get started. I will also expect you to have some knowledge of TensorFlow/ML techniques and the Python language.
For this guide, we will use the following:
- Python
- TensorFlow and Keras
- PyCharm (other editors work as well)
- Pandas
- scikit-learn
- Numpy
Some links that might be helpful:
Goal
The goal of this article is of course to create a model that can predict the future price based on some input data. But how can we achieve this? Well, I am not an economist, but I do have some knowledge of the stock market, so I will do my best.
SMA50
For our input data, we will use something called SMA50. SMA is a commonly used indicator in stock trading that traders use to see the trend of a stock. SMA stands for simple moving average, and 50 is the number of days to look back. So basically, the formula is the sum of the closing prices of the last 50 days divided by 50. Traders use other window lengths as well, such as SMA200, but we will go with 50 days.
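To make the formula concrete, here is a minimal sketch of a moving average over a short, made-up price list. The function name simple_moving_average and the prices are hypothetical and only for illustration; a window of 3 is used instead of 50 so the numbers are easy to check by hand.

```python
def simple_moving_average(prices, window):
    """Return the SMA for every full window in `prices`, oldest first."""
    if window <= 0:
        raise ValueError("window must be positive")
    # Each value is the mean of the `window` prices ending just before index i.
    return [
        sum(prices[i - window:i]) / window
        for i in range(window, len(prices) + 1)
    ]


# Hypothetical closing prices, not real market data.
closes = [10.0, 11.0, 12.0, 13.0, 14.0]
sma3 = simple_moving_average(closes, window=3)
print(sma3)  # [11.0, 12.0, 13.0]
```

With window=50 and a real list of closing prices, the same function would give one SMA50 value per day once 50 days of history are available.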
Output
The output data we will use is the average price for the next 25 days. This way, we should be able to see whether the trend is moving up or down. So we will pass in a series of the 50 most recent SMA50 values and get out the expected average price for the upcoming 25 days.
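The idea of a forward-looking average can be sketched like this. The helper forward_average and the prices are hypothetical, and a horizon of 3 stands in for the 25 days; this shows the concept only, not the exact indexing used later in parse_stock_data.py.

```python
def forward_average(prices, index, horizon):
    """Mean of the `horizon` prices that follow position `index`."""
    future = prices[index + 1:index + 1 + horizon]
    if len(future) < horizon:
        raise ValueError("not enough future prices")
    return sum(future) / horizon


# Hypothetical closing prices, not real market data.
closes = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]

# Standing at index 2 (price 14.0), average the next three prices: 16, 18, 20.
target = forward_average(closes, index=2, horizon=3)
trend = "up" if target > closes[2] else "down"
print(target, trend)  # 18.0 up
```

Comparing the forward average with the current price is what lets us read the result as an upward or downward trend.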
Set up our data
To start off our project, we need to set up our data. We will create two files in our project: parse_stock_data.py and stock.json.
In the stock.json file, paste in the full response from this URL: Yahoo finance API. It contains 5 years of price history for the Apple stock, which we will use for this project.
Importing and setup
In our parse_stock_data.py file, we will start by importing the necessary libraries and setting up the global variables that we will use.
    import pandas as pd
    import csv
    import numpy as np

    data = pd.read_json("./stock.json")
    adj_closes = data['chart']['result'][0]['indicators']['adjclose'][0]['adjclose']

    SMA_LENGTH = 50
    DAYS_LENGTH = 25
SMA_LENGTH is the number of days back in history that we use for our simple moving average calculation, and DAYS_LENGTH is the number of days ahead over which we average the future price. As stated above, we will use SMA50 and get the average price for the upcoming 25 days.
Next up, we will create a function that writes our train_data.csv file with all the SMA50 values. In the same file, add the following code:
    def sma(start, end, days=SMA_LENGTH):
        # Simple moving average of the closes in adj_closes[start:end]
        closes = adj_closes[start:end]
        tot = 0
        for close in closes:
            tot += close
        return tot / days


    def createSMAData():
        start = 0
        end = SMA_LENGTH
        smas = []
        smas_end = SMA_LENGTH
        smas_start = 0

        # One SMA50 per day, leaving room at the end for the future average
        while end < len(adj_closes) - DAYS_LENGTH:
            smas.append(sma(start, end))
            start += 1
            end += 1

        with open('train_data.csv', 'w', newline='') as file:
            writer = csv.writer(file)

            # Header row: SMA0 .. SMA49
            field = []
            i = 0
            while i < SMA_LENGTH:
                field.append("SMA" + str(i))
                i += 1
            writer.writerow(field)

            # Each row is a sliding window of 50 consecutive SMA50 values
            while smas_end < len(smas):
                sma_row = smas[smas_start:smas_end]
                writer.writerow(sma_row)
                smas_end += 1
                smas_start += 1
This function will generate a file in the root of the project with all the data we need to train our models. Example from the file:
    SMA0,SMA1,SMA2,SMA3,SMA4,SMA5,SMA6,SMA7,SMA8,SMA9,SMA10,SMA11,SMA12,SMA13,SMA14,SMA15,SMA16,SMA17,SMA18,SMA19,SMA20,SMA21,SMA22,SMA23,SMA24,SMA25,SMA26,SMA27,SMA28,SMA29,SMA30,SMA31,SMA32,SMA33,SMA34,SMA35,SMA36,SMA37,SMA38,SMA39,SMA40,SMA41,SMA42,SMA43,SMA44,SMA45,SMA46,SMA47,SMA48,SMA49
    49.39500602722168,49.14914573669434,48.88692665100098,48.616263198852536,48.34057975769043,48.06411293029785,47.74800048828125,47.44402542114258,47.168003692626954,46.869580307006835,46.53680236816406,46.22504035949707,45.903683776855466,45.5953337097168,45.30482833862305,44.99176986694336,44.69056434631348,44.414981307983396,44.04805503845215,43.703880310058594,43.34815971374512,43.04257568359375,42.72673851013184,42.43001678466797,42.14548034667969,41.84486389160156,41.532374954223634,41.212782287597655,40.96819534301758,40.75642700195313,40.51730438232422,40.25156707763672,39.98347496032715,39.75901573181152,39.577036209106446,39.3965941619873,39.29303825378418,39.17304786682129,39.04325439453125,38.973047485351564,38.9595923614502,38.947386779785155,38.9408992767334,38.92418045043945,38.90455833435058,38.85978630065918,38.818267059326175,38.784401168823244,38.71871589660645,38.6945597076416
    49.14914573669434,48.88692665100098,48.616263198852536,48.34057975769043,48.06411293029785,47.74800048828125,47.44402542114258,47.168003692626954,46.869580307006835,46.53680236816406,46.22504035949707,45.903683776855466,45.5953337097168,45.30482833862305,44.99176986694336,44.69056434631348,44.414981307983396,44.04805503845215,43.703880310058594,43.34815971374512,43.04257568359375,42.72673851013184,42.43001678466797,42.14548034667969,41.84486389160156,41.532374954223634,41.212782287597655,40.96819534301758,40.75642700195313,40.51730438232422,40.25156707763672,39.98347496032715,39.75901573181152,39.577036209106446,39.3965941619873,39.29303825378418,39.17304786682129,39.04325439453125,38.973047485351564,38.9595923614502,38.947386779785155,38.9408992767334,38.92418045043945,38.90455833435058,38.85978630065918,38.818267059326175,38.784401168823244,38.71871589660645,38.6945597076416,38.685178680419924
    48.88692665100098,48.616263198852536,48.34057975769043,48.06411293029785,47.74800048828125,47.44402542114258,47.168003692626954,46.869580307006835,46.53680236816406,46.22504035949707,45.903683776855466,45.5953337097168,45.30482833862305,44.99176986694336,44.69056434631348,44.414981307983396,44.04805503845215,43.703880310058594,43.34815971374512,43.04257568359375,42.72673851013184,42.43001678466797,42.14548034667969,41.84486389160156,41.532374954223634,41.212782287597655,40.96819534301758,40.75642700195313,40.51730438232422,40.25156707763672,39.98347496032715,39.75901573181152,39.577036209106446,39.3965941619873,39.29303825378418,39.17304786682129,39.04325439453125,38.973047485351564,38.9595923614502,38.947386779785155,38.9408992767334,38.92418045043945,38.90455833435058,38.85978630065918,38.818267059326175,38.784401168823244,38.71871589660645,38.6945597076416,38.685178680419924,38.70105392456055
    .........
The file should have 50 SMA50 values per row, and around 1130 rows in total.
Next up, we will create a function that will produce our output data. As stated, it will be the average price of the upcoming 25 days.
    def createAverageData():
        with open('average.csv', 'w', newline='') as file:
            writer = csv.writer(file)
            field = ["Average"]
            writer.writerow(field)

            # Start far enough in that the 25-day window lies after the
            # closes used by the first row of train_data.csv
            closes_start = DAYS_LENGTH + (SMA_LENGTH * 2)
            while closes_start < len(adj_closes):
                # Average of the 25 closes ending at closes_start
                closes_chunk = adj_closes[closes_start - DAYS_LENGTH + 1:closes_start + 1]
                average = np.sum(closes_chunk) / len(closes_chunk)
                writer.writerow([average])
                closes_start += 1
This file should have only one column, titled Average, and as many rows as our train_data.csv file.
Finally, we will create a function that takes the last 50 SMA50 values from the stock data and writes them to a separate file. These values can be used to get the predicted price for the days ahead of us.
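A quick way to sanity-check the row counts is sketched below. Following the loops above, both files should end up with len(adj_closes) - (2 * SMA_LENGTH + DAYS_LENGTH) data rows. The helpers expected_rows and count_csv_rows are hypothetical names, and 1258 is only an assumed number of trading days for five years of history.

```python
import csv

SMA_LENGTH = 50   # same values as in parse_stock_data.py
DAYS_LENGTH = 25


def expected_rows(num_closes):
    """Rows both files should contain for a history of `num_closes` prices."""
    return max(0, num_closes - (2 * SMA_LENGTH + DAYS_LENGTH))


def count_csv_rows(path):
    """Count the data rows in a CSV file, excluding the header line."""
    with open(path, newline="") as f:
        return max(0, sum(1 for _ in csv.reader(f)) - 1)


# 1258 is an assumed number of trading days, not taken from the real file.
print(expected_rows(1258))  # 1133
```

After generating the files, count_csv_rows("train_data.csv") and count_csv_rows("average.csv") should return the same number.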
And then we will call our three functions to produce our files.
    def createLatestSMAs():
        end = len(adj_closes) - SMA_LENGTH
        latest_smas = []

        while end < len(adj_closes):
            latest_smas.append(sma(end - SMA_LENGTH, end))
            end += 1

        with open('latest-smas.csv', 'w', newline='') as latest_file:
            writer = csv.writer(latest_file)
            # Same SMA0..SMA49 header as train_data.csv, one name per value,
            # so the file can be read back with one column per SMA
            field = ["SMA" + str(i) for i in range(SMA_LENGTH)]
            writer.writerow(field)
            writer.writerow(latest_smas)


    createAverageData()
    createSMAData()
    createLatestSMAs()
All right, now we can run this file and we should have three new csv files in the project root folder.
- latest-smas.csv (Will be used to predict the future)
- train_data.csv (Will be used to train and test our model (X value))
- average.csv (Will be used to train the model output (Y value))
In the next section, we will set up our test and training data.
Training and test data
Create a file called stock_data.py. In this file, we will read our csv files and create the test and training data. We will also normalize the data to help our model perform at its best. If you are an ML beginner: normalization rescales the values in the data set to a common range, here between 0 and 1, which makes the training algorithms work more efficiently. Since the prices in our data vary a lot, training would be very noisy if we didn't normalize them.
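As a small illustration of what this rescaling does, here is a sketch of min-max normalization on a plain list. The function name min_max_scale and the prices are made up for the example; the real function in this article works on pandas data instead.

```python
def min_max_scale(values, lo=None, hi=None):
    """Rescale `values` so that `lo` maps to 0.0 and `hi` maps to 1.0."""
    lo = min(values) if lo is None else lo
    hi = max(values) if hi is None else hi
    if hi == lo:
        raise ValueError("cannot scale when min and max are equal")
    return [(v - lo) / (hi - lo) for v in values], lo, hi


# Hypothetical prices, not real market data.
prices = [40.0, 100.0, 160.0]
scaled, lo, hi = min_max_scale(prices)
print(scaled)  # [0.0, 0.5, 1.0]
```

The lowest price becomes 0, the highest becomes 1, and everything else lands proportionally in between.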
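In the stock_data.py file, we start with our imports and our normalization function:
    import pandas as pd
    from sklearn.model_selection import train_test_split


    def normalize(df, min=None, max=None):
        # Reuse the given min/max when provided, otherwise compute them per column
        if min is None:
            min = df.min()
        if max is None:
            max = df.max()
        df_norm = (df - min) / (max - min)
        return df_norm, min, max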
Next, we will create two functions. One for getting the train/test data and one for getting our data that we will use to predict after we have trained our model.
    def getTrainingData():
        smas = pd.read_csv("./train_data.csv")
        averages = pd.read_csv("./average.csv")
        num_features, min, max = normalize(pd.DataFrame(smas))
        x_train, x_test, y_train, y_test = train_test_split(
            num_features, averages, test_size=0.2, shuffle=True)
        return x_train, x_test, y_train, y_test, min, max


    def getTestData(min, max):
        data = pd.read_csv("./test.csv")
        data = normalize(pd.DataFrame(data), min, max)
        data = data[0].astype("float32")
        return data
One thing to note is that we pass min and max into getTestData(). We need to use the same min and max that came from normalizing our training data; otherwise the values would not be on the same scale and the prediction would be off.
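Here is a small sketch of why the shared min and max matter. The helper scale and all the prices are hypothetical; the point is only that the same price should always map to the same normalized value.

```python
def scale(values, lo, hi):
    """Min-max scale `values` using the given bounds."""
    return [(v - lo) / (hi - lo) for v in values]


train_prices = [40.0, 100.0, 160.0]  # hypothetical training prices
new_prices = [100.0, 130.0]          # hypothetical rows we want to predict

train_lo, train_hi = min(train_prices), max(train_prices)

# Reusing the training bounds keeps 100.0 at 0.5, just like during training.
consistent = scale(new_prices, train_lo, train_hi)

# Using the new data's own bounds maps the same 100.0 to 0.0 instead.
inconsistent = scale(new_prices, min(new_prices), max(new_prices))

print(consistent)    # [0.5, 0.75]
print(inconsistent)  # [0.0, 1.0]
```

A model trained on the first scale would read the second set of numbers as completely different prices.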
In the next section, we will create our model and train it.
Build our model
Now, we will create a new file called main.py. In this file, we will set up our neural network with TensorFlow and train it with our training data.
First, we import the libraries and set up our global variables:
    from tensorflow import keras
    from tensorflow.keras import layers

    import stock_data

    x_train, x_test, y_train, y_test, min, max = stock_data.getTrainingData()
    x_train = x_train.astype("float32")
    x_test = x_test.astype("float32")
Next, we will create our function that will build our model.
    def buildModel():
        model = keras.Sequential([
            layers.Dense(64, input_shape=(x_train.shape[1],), activation="relu"),
            layers.Dropout(.2),
            layers.Dense(32, activation="relu"),
            layers.Dense(1)
        ])
        # keras.optimizers.legacy is not supported under Keras 3;
        # use keras.optimizers.Adam there instead
        model.compile(loss="mse",
                      optimizer=keras.optimizers.legacy.Adam(learning_rate=0.01),
                      metrics=['mse'])
        return model
The first dense layer takes our 50 input values and has 64 neurons with the relu activation function. We then apply a dropout of 20%, which randomly deactivates 20% of those neurons' outputs at each training step. After that we add a hidden layer with 32 neurons, again with relu activation, and our output layer has only 1 neuron, since we are only looking for one value (the average price of the 25 days ahead).
We then compile the model, setting the loss function to mean squared error and the optimizer to the Adam algorithm with a learning rate of 0.01.
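To get a feeling for the size of this network, we can count its trainable parameters by hand. The sketch below assumes 50 inputs (one per SMA column) and the layer sizes described above; dense_params is a hypothetical helper, and the dropout layer is left out because it has no weights.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units


# 50 SMA inputs -> 64 neurons -> 32 neurons -> 1 output
layer_sizes = [50, 64, 32, 1]
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(counts)

print(counts)  # [3264, 2080, 33]
print(total)   # 5377
```

If the model is built as described, model.summary() should report the same numbers, which makes this a handy check that the input shape is what we expect.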
Next, we will train the model as well.
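    model = buildModel()
    # Stop when the validation loss has not improved for 5 epochs
    callback = keras.callbacks.EarlyStopping(monitor="val_loss", patience=5)
    model.fit(x_train, y_train, epochs=800, validation_split=0.2,
              verbose=True, callbacks=[callback])
We are using an early stopping callback to limit overfitting. It monitors the loss on the validation split (val_loss) rather than the training mse, since the training error usually keeps falling even after the model has started to overfit. With the patience set to 5, training stops once that value has not improved for 5 epochs.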
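The idea behind patience can be sketched in plain Python. This is a simplified imitation of the concept, not the actual Keras implementation; stopping_epoch and the loss values are made up, and a patience of 3 is used to keep the example short.

```python
def stopping_epoch(losses, patience):
    """Return the epoch at which training would stop, or None if it never does."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(losses):
        if loss < best:
            # New best value: remember it and reset the counter.
            best = loss
            wait = 0
        else:
            # No improvement: stop once this has happened `patience` times in a row.
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Hypothetical monitored values per epoch.
losses = [5.0, 4.0, 3.5, 3.6, 3.7, 3.4, 3.5, 3.6, 3.7]
print(stopping_epoch(losses, patience=3))  # 8
print(stopping_epoch(losses, patience=2))  # 4
```

A lower patience stops earlier, at the first short plateau, while a higher one gives the model more chances to recover before giving up.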
And finally, we will predict some values and see how the model performs. For our prediction, we will create a new csv file called test.csv, where we will add some sample rows from our train_data.csv. It should look something like this:
    SMA0,SMA1,SMA2,SMA3,SMA4,SMA5,SMA6,SMA7,SMA8,SMA9,SMA10,SMA11,SMA12,SMA13,SMA14,SMA15,SMA16,SMA17,SMA18,SMA19,SMA20,SMA21,SMA22,SMA23,SMA24,SMA25,SMA26,SMA27,SMA28,SMA29,SMA30,SMA31,SMA32,SMA33,SMA34,SMA35,SMA36,SMA37,SMA38,SMA39,SMA40,SMA41,SMA42,SMA43,SMA44,SMA45,SMA46,SMA47,SMA48,SMA49
    168.76798614501953,169.16262573242187,169.62475708007813,170.0324478149414,170.46494873046876,170.83720153808594,171.19690643310548,171.60852600097655,172.03567321777345,172.44666687011718,172.8891458129883,173.41687530517578,173.95217498779297,174.35041107177733,174.8113522338867,175.29566284179688,175.76203155517578,176.29274536132812,176.81284729003906,177.34289001464845,177.87633697509764,178.41832916259764,178.91917694091796,179.3166925048828,179.7226318359375,180.14574523925782,180.5927978515625,181.1275018310547,181.6901220703125,182.12492980957032,182.521787109375,182.92938201904298,183.3174526977539,183.71911010742187,184.15711853027344,184.5795477294922,185.05410675048827,185.52866577148438,185.9393112182617,186.2872412109375,186.62618377685547,186.83450256347658,186.9745135498047,187.11052978515625,187.16565551757813,187.17903747558594,187.19461791992188,187.1868753051758,187.12175567626954,187.06639923095702
    113.60285537719727,113.8467546081543,114.0992253112793,114.32479522705079,114.6397444152832,114.91950592041016,115.14902740478516,115.33829803466797,115.49349502563477,115.64363388061524,115.7581869506836,115.93440383911133,116.16229690551758,116.45442657470703,116.72635223388671,116.96029418945312,117.18584197998047,117.3372557067871,117.53539321899414,117.75163818359376,118.07019668579102,118.38613952636719,118.73917236328126,119.04187927246093,119.2926123046875,119.59683410644531,119.82822006225587,120.14442184448242,120.45241134643555,120.8060694885254,121.07524856567383,121.51246170043946,121.91247924804688,122.2448583984375,122.50221694946289,122.76207611083984,123.11983657836915,123.5675291442871,124.0978727722168,124.56367080688477,125.0133317565918,125.36420455932617,125.59365997314453,125.88392211914062,126.21767501831054,126.51876052856446,126.9133203125,127.36818328857422,127.80002685546874,128.19720932006837
    169.59946350097655,169.9606719970703,170.28271270751952,170.4220913696289,170.50297210693358,170.60860931396485,170.82977935791016,170.97728729248047,170.93600769042968,170.78618286132811,170.58757476806642,170.20788635253905,169.95459014892577,169.7710433959961,169.49444274902345,169.31917144775392,169.23268859863282,169.16718597412108,168.976865234375,168.65734985351563,168.2882873535156,167.94827178955077,167.54102233886718,167.0568521118164,166.51456451416016,166.07337036132813,165.63347747802734,165.2597427368164,165.04726623535157,164.92036224365233,164.858291015625,164.8233801269531,164.80777282714843,164.79614227294923,164.8680905151367,164.98922790527342,165.15151947021485,165.32184936523439,165.52028747558595,165.8421844482422,166.11271392822266,166.3556396484375,166.60649353027344,166.82601135253907,166.73950775146486,166.6027603149414,166.52369995117186,166.31897857666016,166.16810760498046,166.06917846679687
Make sure you know which row you picked so you can compare with the average.csv values.
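If you would rather not copy rows by hand, the selection could be scripted along these lines. This is only a sketch: pick_samples is a hypothetical helper, and the tiny files it runs on here are made up so the example works on its own. It copies the chosen data rows (0-based, header excluded) into a new file and returns the matching values from the averages file.

```python
import csv
import os
import tempfile


def pick_samples(train_path, average_path, row_indices, out_path):
    """Copy selected data rows to `out_path`; return their expected averages."""
    with open(train_path, newline="") as f:
        train_rows = list(csv.reader(f))
    with open(average_path, newline="") as f:
        average_rows = list(csv.reader(f))

    header, data = train_rows[0], train_rows[1:]
    averages = [float(row[0]) for row in average_rows[1:]]

    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        for i in row_indices:
            writer.writerow(data[i])
    return [averages[i] for i in row_indices]


# Tiny made-up files so the sketch runs without the real data.
with tempfile.TemporaryDirectory() as tmp:
    train = os.path.join(tmp, "train_data.csv")
    avg = os.path.join(tmp, "average.csv")
    out = os.path.join(tmp, "test.csv")
    with open(train, "w", newline="") as f:
        csv.writer(f).writerows([["SMA0", "SMA1"], [1, 2], [3, 4], [5, 6]])
    with open(avg, "w", newline="") as f:
        csv.writer(f).writerows([["Average"], [2.5], [4.5], [6.5]])

    expected = pick_samples(train, avg, [0, 2], out)
    with open(out, newline="") as f:
        picked = list(csv.reader(f))

print(expected)  # [2.5, 6.5]
print(picked)    # [['SMA0', 'SMA1'], ['1', '2'], ['5', '6']]
```

With the real files, something like pick_samples("train_data.csv", "average.csv", [10, 500, 900], "test.csv") would write test.csv and hand back the values to compare the predictions with (the row numbers here are arbitrary).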
And finally, we add this to the main.py file before running it.
    prediction = model.predict(stock_data.getTestData(min, max))
    print(prediction)
Output
I picked three random rows to predict, to see how the model performs. The output I expect is:
    179.4
    122.6
    149.9
And from the prediction, we get:
    [[176.61264 ]
     [122.721794]
     [159.76903 ]]
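To put a number on how close these predictions are, we can compute the error per row. The sketch below uses only the three expected and predicted values shown above.

```python
expected = [179.4, 122.6, 149.9]
predicted = [176.61264, 122.721794, 159.76903]

# Absolute error and error relative to the expected value, per row.
abs_errors = [abs(p - e) for e, p in zip(expected, predicted)]
pct_errors = [100 * a / e for a, e in zip(abs_errors, expected)]

# Mean absolute percentage error over the three samples.
mape = sum(pct_errors) / len(pct_errors)

for e, p, pct in zip(expected, predicted, pct_errors):
    print(f"expected {e:7.1f}  predicted {p:8.2f}  error {pct:4.1f}%")
print(f"mean error {mape:.1f}%")
```

The three predictions are off by roughly 1.6%, 0.1% and 6.6%, which gives a mean error of about 2.7% for this small sample.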
Not too bad 😃 Next challenge would be to predict the future price from our latest-smas.csv file and see if we should go all in or not 😉 No seriously, don't do that...
Summary
So there we have it, our stock prediction model. Apple is a good stock to begin developing a prediction model with, since it is relatively stable with low volatility. Smaller companies tend to jump up and down much more unpredictably and will therefore be much harder to predict. But regardless of which stock you try, there are so many factors (mostly human) that decide the price, and relying only on AI to make a choice is not a good idea, not in my book at least. Utilizing the power of historical trends and other factors could, however, at least give us a hint or some guidance for our next purchase.
I really hope you enjoyed this guide. If you have any feedback or if you have any great ideas on what data we should train our model with, please don't hesitate to contact us.
Have a great day, and thanks for reading.