Unlocking Stock Market Predictions with LSTM: A Deep Dive into Data-Driven Insights

Dillip Singh
12 min read · Sep 21, 2024


[Infographic: the intersection of data science, machine learning, and cryptocurrency trends, with charts and graphs highlighting market analysis and AI applications.]

Every data point tells a story, and I’m here to share mine — how I transformed insights into impactful decisions and navigated the ever-evolving landscape of data science.

Visit: GitHub | Upwork | Facebook | Instagram | LinkedIn | View-My-Site | View-My-Portfolio

Source Code/File/Data: All the data, code, and files used in this article can be found on GitHub. Click the link: Forex Market Prediction with RNN and LSTM Model

In the fast-paced world of finance, data analysis and prediction are key to decision making. As more investors and analysts turn to machine learning to forecast market movements, tools like LSTMs (Long Short-Term Memory networks) have become popular for time series analysis. This article will guide you through a stock market analysis that combines technical indicators with an LSTM model to predict the direction of stock prices.

Let’s dive in.

Step 1: Data Collection and Setup

The first step in any analysis is collecting data. For this project, we’ll download historical data for the Russell 1000 Index from Yahoo Finance. This index tracks large-cap US companies and provides valuable insight into broad market movements.

# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import yfinance as yf
import pandas_ta as ta

# Download financial data
data = yf.download(tickers='^RUI', start='2012-01-01', end='2024-01-20')
# Display the first 10 rows of the data
data.head(10)

Library Imports: The code imports essential libraries for various functionalities:

NumPy: Handles numerical computations and NaN values.

Matplotlib: Provides tools for visualizing data.

Pandas: Facilitates data manipulation with DataFrames.

yfinance: Allows for easy access to historical financial data from Yahoo Finance.

pandas_ta: Adds capabilities for technical analysis on the financial data.

Data Retrieval: The code downloads historical stock data for the Russell 1000 Index over a specified date range (from January 1, 2012, to January 20, 2024).

Data Preview: The data.head(10) command displays the first ten rows of the downloaded dataset, which is useful for quickly inspecting the data structure and contents.

Here, we pull data for the Russell 1000 Index from 2012 to 2024, forming the foundation of our analysis. The yfinance library fetches stock data, while pandas and matplotlib handle data manipulation and visualization.
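
Before adding indicators, it’s worth a quick sanity check (an optional step, not part of the original walkthrough) to confirm the date range and look for missing values:

# Optional sanity check: confirm the downloaded date range and count NaNs
print(data.index.min(), data.index.max())  # first and last trading day
print(data.isna().sum())                   # NaN count per column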

Step 2: Adding Technical Indicators

Technical indicators like the Relative Strength Index (RSI) and Exponential Moving Averages (EMAs) provide crucial insights into market conditions.

# Adding technical indicators to the dataset
data['RSI'] = ta.rsi(data.Close, length=15) # Calculate the Relative Strength Index (RSI) over 15 periods
data['EMAF'] = ta.ema(data.Close, length=20) # Calculate the Exponential Moving Average (EMA) for 20 periods
data['EMAM'] = ta.ema(data.Close, length=100) # Calculate the EMA for 100 periods
data['EMAS'] = ta.ema(data.Close, length=150) # Calculate the EMA for 150 periods

# Create a target variable based on the difference between adjusted close and open prices
data['Target'] = data['Adj Close'] - data.Open # Calculate the difference
data['Target'] = data['Target'].shift(-1) # Shift the target variable to align with the next day's data

# Classify the target variable into binary classes (1 or 0) based on whether the target is positive
data['TargetClass'] = (data['Target'] > 0).astype(int)  # Vectorized comparison; avoids deprecated positional indexing

# Create a new column for the next day's adjusted close price
data['TargetNextClose'] = data['Adj Close'].shift(-1) # Shift adjusted close to align with the next day

# Clean the dataset
data.dropna(inplace=True) # Remove rows with any NaN values
data.reset_index(inplace=True) # Reset the index after dropping rows
data.drop(['Volume', 'Close', 'Date'], axis=1, inplace=True) # Drop unnecessary columns

Adding Indicators: The code uses the pandas_ta library to add several technical analysis indicators to the dataset:

  • RSI: The Relative Strength Index is calculated over 15 periods to identify overbought or oversold conditions.
  • EMAs: Three Exponential Moving Averages are computed over 20, 100, and 150 periods to smooth out price data and identify trends.

Target Variable Creation:

  • A new column called Target holds the difference between the adjusted close and open prices, shifted by one row so it aligns with the next day’s data.
  • A binary label (TargetClass) indicates whether the price will rise (1) or fall (0) the next day.
  • A further column, TargetNextClose, contains the adjusted close price for the next day.

Data Cleaning:

  • Rows containing NaN values are dropped to ensure a clean dataset.
  • The index is reset after dropping rows, and unnecessary columns (Volume, Close, and Date) are removed to simplify the DataFrame.

Step 3: Data Cleaning and Preparation

Clean data ensures accurate predictions. Having already dropped missing values and unnecessary columns at the end of Step 2, we now select the subset of columns the model will use.

# Select the first 11 columns from the DataFrame and create a new dataset
data_set = data.iloc[:, 0:11] # Use iloc to slice the DataFrame for columns 0 to 10

# Set Pandas options to display all columns in the output
pd.set_option('display.max_columns', None)

# Display the first 20 rows of the new dataset
data_set.head(20)

# Uncomment below lines to print additional information about the dataset
#print(data_set.shape) # Print the shape (rows, columns) of the new dataset
#print(data.shape) # Print the shape of the original dataset
#print(type(data_set)) # Print the type of the new dataset

Creating a New Dataset: The code creates a new DataFrame, data_set, which contains the first 11 columns of the original data DataFrame. This allows for focused analysis on a specific subset of features.

  • Display Settings: The Pandas option display.max_columns is set to None, enabling the display of all columns in the output without truncation. This is useful for reviewing data without missing any columns.
  • Previewing the Data: The data_set.head(20) command displays the first 20 rows of the newly created dataset, allowing for an initial review of its contents.
  • Optional Information: The commented-out lines can be used to print additional information, such as the shape of the new dataset (data_set.shape), the shape of the original dataset (data.shape), and the type of the new dataset (type(data_set)).

Step 4: Feature Scaling

Scaling is crucial in preparing data for machine learning models, particularly for LSTMs, which are sensitive to the scale of their input data.

# Import the scaler and initialize it to transform features to the range [0, 1]
from sklearn.preprocessing import MinMaxScaler

sc = MinMaxScaler(feature_range=(0, 1))

# Fit the scaler to the dataset and transform it
data_set_scaled = sc.fit_transform(data_set)

# Print the scaled dataset
print(data_set_scaled)

Importing the Scaler: The MinMaxScaler from sklearn.preprocessing is imported. This tool is used to scale features to a specific range, which is useful for many machine learning algorithms that require normalized data.

  • Initializing the Scaler: The MinMaxScaler is initialized with a feature range of (0, 1). This means all values in the dataset will be transformed to fall between 0 and 1.
  • Scaling the Dataset: The fit_transform method is called on the data_set DataFrame. This method first computes the minimum and maximum values for each feature (column) and then scales the data accordingly. The result is stored in data_set_scaled.
  • Output: The scaled dataset is printed, showing the transformed values. Each original value is replaced by a corresponding value between 0 and 1, which helps in maintaining the relative distances among data points while normalizing the dataset.
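
For intuition, min-max scaling applies x' = (x − x_min) / (x_max − x_min) to each column independently. A minimal sketch with made-up index levels:

# What MinMaxScaler computes per column (values here are illustrative)
import numpy as np

col = np.array([3050.0, 3120.0, 2980.0, 3200.0])
scaled = (col - col.min()) / (col.max() - col.min())
print(scaled)  # 2980 maps to 0.0, 3200 maps to 1.0, the rest fall in between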

Step 5: Preparing Data for LSTM

To capture time-based trends, we organize the data into sequences of 30 days, which the LSTM model can then learn from.

# Initialize an empty list to hold features
X = []

# Set the number of previous time steps to consider for each sample
backcandles = 30

# Print the total number of samples in the scaled dataset
print(data_set_scaled.shape[0])

# Loop through each feature (excluding the target columns)
for j in range(8):  # The first 8 columns are features
    X.append([])  # Create a sublist for each feature
    for i in range(backcandles, data_set_scaled.shape[0]):
        # Append slices of the dataset to X, capturing the 'backcandles' previous entries
        X[j].append(data_set_scaled[i-backcandles:i, j])

# Move the first axis to the third position for reshaping the array correctly
X = np.moveaxis(X, [0], [2])

# Prepare the target variable, adjusting the length to match X
# yi contains the last column (target) starting from the 'backcandles' index
X, yi = np.array(X), np.array(data_set_scaled[backcandles:, -1])
# Reshape the target variable to ensure it has the correct dimensions
y = np.reshape(yi, (len(yi), 1))

# Note: y is already scaled, since it comes from data_set_scaled;
# refitting the scaler on yi alone would overwrite the per-column scaling

# Reshape X to ensure compatibility with LSTM input shape (samples, time steps, features)
# Uncomment if necessary
# X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], 1))

# Print the resulting feature and target arrays
print(X)
print(X.shape) # Print the shape of X
print(y)
print(y.shape) # Print the shape of y

Feature Initialization: An empty list X is created to store the features. The variable backcandles is set to 30, which indicates the number of previous time steps to include for each input sample.

  1. Looping Through Features: The code iterates over the first 8 columns of the scaled dataset (assuming these are the features). For each feature, it appends slices of data, capturing the previous 30 entries for each time step starting from the 30th index.
  2. Reshaping the Feature Array: The first axis of the X list is moved to the third position using np.moveaxis(), reshaping the data to fit the expected input format for LSTM models.
  3. Preparing the Target Variable: The target variable yi is extracted from the last column of the scaled dataset, starting from the index corresponding to backcandles. This ensures the length of y matches the output size of X.
  4. Reshaping Target Variable: The target variable y is reshaped into a 2D array for compatibility with machine learning models.
  5. Print Statements: The code prints the contents and shapes of both X and y to verify their structures and confirm that the transformation was successful.
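
To see what the loop produces, here is a toy version (made-up numbers, 4 features instead of 8) that verifies the final shape is (samples, time steps, features):

# Toy check of the windowing logic above
import numpy as np

toy = np.arange(40, dtype=float).reshape(10, 4)  # 10 rows, 4 'features'
backcandles_toy = 3
X_toy = []
for j in range(4):
    X_toy.append([])
    for i in range(backcandles_toy, toy.shape[0]):
        X_toy[j].append(toy[i - backcandles_toy:i, j])
X_toy = np.moveaxis(np.array(X_toy), [0], [2])
print(X_toy.shape)  # (7, 3, 4): samples, time steps, features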

Step 6: Splitting the Data

We’ll split the data into training and testing sets to evaluate model performance.

# Split the dataset into training and testing sets
splitlimit = int(len(X) * 0.8) # Set split limit to 80% of the data length
print(splitlimit) # Print the index for the split

# Create training and testing sets for features
X_train, X_test = X[:splitlimit], X[splitlimit:]
# Create training and testing sets for targets
y_train, y_test = y[:splitlimit], y[splitlimit:]

# Print the shapes of the training and testing sets
print(X_train.shape) # Shape of the training features
print(X_test.shape) # Shape of the testing features
print(y_train.shape) # Shape of the training targets
print(y_test.shape) # Shape of the testing targets

# Print the training target values
print(y_train) # Display the training targets

Split Limit Calculation: The variable splitlimit is calculated as 80% of the total length of X. This will be used as the index to divide the dataset into training and testing subsets.

  • Creating Training and Testing Sets:
  • X_train and y_train are created by taking the first 80% of X and y, respectively.
  • X_test and y_test are created by taking the remaining 20%.
  • Shape Verification: The shapes of X_train, X_test, y_train, and y_test are printed to ensure that the split was performed correctly and that the dimensions are as expected.
  • Displaying Training Targets: The values of y_train are printed to review the target data that will be used for training.
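
Note that the split is chronological, not random: shuffling before splitting would leak future information into the training set. For reference, the same split can be written with scikit-learn (an equivalent alternative, not what the code above uses):

# Equivalent chronological split; shuffle=False preserves time order
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)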

Step 7: Building the LSTM Model

We’re now ready to construct and train our LSTM model.

from keras.models import Sequential  # Import Sequential model (not used here but commonly used)
from keras.layers import LSTM, Dropout, Dense, TimeDistributed # Import necessary layers
import tensorflow as tf # Import TensorFlow for model building
import keras # Import Keras
from keras import optimizers # Import optimizers
from keras.callbacks import History # Import History for tracking training
from keras.models import Model # Import functional API model
from keras.layers import Input, Activation, concatenate # Import additional layers
import numpy as np # Import NumPy for numerical operations

# Set random seeds for reproducibility
# tf.random.set_seed(20) # Uncomment to set TensorFlow seed (if needed)
np.random.seed(10) # Set NumPy random seed

# Define the LSTM model architecture
lstm_input = Input(shape=(backcandles, 8), name='lstm_input') # Input layer with shape based on time steps and features
inputs = LSTM(150, name='first_layer')(lstm_input) # LSTM layer with 150 units
inputs = Dense(1, name='dense_layer')(inputs) # Dense layer for output
output = Activation('linear', name='output')(inputs) # Output layer with linear activation

# Create the model using functional API
model = Model(inputs=lstm_input, outputs=output)
# Compile the model with Adam optimizer and mean squared error loss
adam = optimizers.Adam() # Instantiate Adam optimizer
model.compile(optimizer=adam, loss='mse') # Compile the model

# Fit the model to the training data
model.fit(x=X_train, y=y_train, batch_size=15, epochs=30, shuffle=True, validation_split=0.1) # Train the model

Import Libraries: Necessary Keras and TensorFlow modules are imported for building the LSTM model.

Set Random Seed: Setting the random seed ensures reproducibility of results across different runs.

  • Model Architecture:
  • An Input Layer is defined with shape (backcandles, 8), indicating the number of previous time steps (backcandles) and the number of features (8).
  • An LSTM Layer with 150 units is added to process the sequential data.
  • A Dense Layer follows the LSTM to provide a single output.
  • The final Activation Layer uses a linear activation function for regression tasks.
  • Model Compilation: The model is compiled with the Adam optimizer and mean squared error (MSE) loss function, making it suitable for regression.
  • Model Training: The model is trained on the training data (X_train and y_train) for 30 epochs, using a batch size of 15 and a validation split of 10% to evaluate performance during training.
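
Two optional refinements, not part of the original training call: model.summary() prints the layer shapes and parameter counts for a quick architecture check, and an EarlyStopping callback halts training once the validation loss stops improving:

# Optional: inspect the architecture, then train with early stopping
from keras.callbacks import EarlyStopping

model.summary()  # layer shapes and parameter counts
early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
model.fit(x=X_train, y=y_train, batch_size=15, epochs=30, shuffle=True,
          validation_split=0.1, callbacks=[early_stop])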

Step 8: Predicting on the Test Dataset

# Use the trained model to make predictions on the test dataset
y_pred = model.predict(X_test) # Predict outputs for the test set

# Optional: Convert predicted values to binary classes based on a threshold (commented out)
# y_pred = np.where(y_pred > 0.43, 1, 0) # Classify predictions into binary classes (1 or 0)

# Print the first 10 predicted values and their corresponding actual test values
for i in range(10):
    print(y_pred[i], y_test[i])  # Display the predicted and actual values side by side

Model Prediction: The model’s predict method is called on the test dataset (X_test), generating predictions (y_pred).

  • Optional Classification: There’s a commented-out line that could convert the predicted values into binary classes (0 or 1) based on a threshold of 0.43, which would be useful for classification tasks.
  • Display Predictions: A loop prints the first 10 predictions alongside the corresponding actual values from the test set (y_test), allowing for a quick comparison to evaluate model performance.
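
Eyeballing ten pairs is only a start; here is a sketch of quantifying the fit (both arrays are in the scaled [0, 1] space, so the RMSE is unitless):

# Quick quantitative check: RMSE plus day-over-day directional agreement
rmse = np.sqrt(np.mean((y_pred.flatten() - y_test.flatten()) ** 2))
direction = np.mean(np.sign(np.diff(y_pred.flatten())) == np.sign(np.diff(y_test.flatten())))
print(f"RMSE (scaled): {rmse:.4f}")
print(f"Directional agreement: {direction:.2%}")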

Step 9: Plotting the Predicted Results

# Set the size of the plot
plt.figure(figsize=(16, 8))

# Plot the actual test values
plt.plot(y_test, color='green', label='Test') # Actual test values in green

# Plot the predicted values
plt.plot(y_pred, color='red', label='Pred') # Predicted values in red

# Add a legend to differentiate between actual and predicted values
plt.legend()

# Display the plot
plt.show()

Figure Size: The plot’s size is set to 16 inches by 8 inches for better visibility.

  • Plot Actual Values: The actual values from the test dataset (y_test) are plotted in green.
  • Plot Predicted Values: The predicted values (y_pred) from the model are plotted in red.
  • Legend: A legend is added to distinguish between the actual and predicted values.
  • Show Plot: Finally, the plot is displayed, providing a visual comparison of the model’s predictions against the actual test values.
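
Keep in mind the chart compares values in the scaled [0, 1] space. A sketch of mapping predictions back to price units, assuming (as in this pipeline) that TargetNextClose is the last column the scaler was fitted on:

# Undo min-max scaling for the target column only
col = data_set.shape[1] - 1  # index of TargetNextClose in data_set
span = sc.data_max_[col] - sc.data_min_[col]
y_test_price = y_test * span + sc.data_min_[col]
y_pred_price = y_pred * span + sc.data_min_[col]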

Project Summary

This project focuses on predicting stock market movements using Long Short-Term Memory (LSTM) networks. The process involves data collection, preprocessing, feature engineering, model training, and evaluation. The primary aim is to forecast future prices based on historical data and technical indicators.

Why LSTM?

  • Sequential Data Handling: LSTMs are specifically designed to work with sequential data, making them ideal for time series forecasting like stock prices.
  • Memory Capabilities: They can remember long-term dependencies due to their unique architecture, which helps capture trends and patterns in historical data.

Benefits of LSTM

  1. Performance: Often yields better predictive performance on time series data than traditional methods.
  2. Flexibility: Can handle varying input lengths and sequences, adapting to different datasets.
  3. Complex Patterns: Capable of modeling complex nonlinear relationships in data.

Disadvantages of LSTM

  1. Computational Complexity: LSTMs can be computationally intensive and require significant resources for training, especially on large datasets.
  2. Overfitting: Risk of overfitting, particularly with small datasets or overly complex models.
  3. Tuning Requirements: Requires careful hyperparameter tuning to achieve optimal performance.

How to Improve

  1. Data Augmentation: Increase the size and diversity of the training dataset through techniques like data augmentation or synthetic data generation.
  2. Feature Engineering: Experiment with additional features such as sentiment analysis, macroeconomic indicators, or other technical indicators.
  3. Regularization: Implement dropout layers or L2 regularization to prevent overfitting (a sketch follows this list).
  4. Hyperparameter Optimization: Utilize techniques such as grid search or Bayesian optimization to find optimal hyperparameters.
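
A sketch of item 3, reusing the Step 7 architecture with dropout and an L2 weight penalty added (the rate and penalty here are illustrative, not tuned):

# Same model with dropout and L2 regularization added
from keras.layers import Input, LSTM, Dropout, Dense
from keras.models import Model
from keras.regularizers import l2

reg_input = Input(shape=(backcandles, 8))
x = LSTM(150, kernel_regularizer=l2(1e-4))(reg_input)
x = Dropout(0.2)(x)  # randomly zero 20% of units during training
reg_output = Dense(1, activation='linear')(x)
model_reg = Model(inputs=reg_input, outputs=reg_output)
model_reg.compile(optimizer='adam', loss='mse')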

Suggested ML Models to Use

  1. GRU (Gated Recurrent Unit): Similar to LSTM but with a simpler architecture; often performs well with less computational overhead (a sketch follows this list).
  2. ARIMA (AutoRegressive Integrated Moving Average): A traditional statistical method that can be effective for univariate time series forecasting.
  3. Random Forests: Good for regression tasks and can handle large datasets with less risk of overfitting.
  4. XGBoost: An efficient implementation of gradient boosting that often performs well in predictive tasks.
  5. Transformers: Emerging models for time series forecasting that leverage attention mechanisms, potentially outperforming LSTMs in some scenarios.
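
As a concrete example of option 1, swapping the LSTM layer for a GRU is nearly a one-line change with the same input and output shapes (a sketch, reusing the Step 7 hyperparameters):

# GRU variant of the Step 7 model
from keras.layers import Input, GRU, Dense
from keras.models import Model

gru_input = Input(shape=(backcandles, 8))
x = GRU(150)(gru_input)
gru_output = Dense(1, activation='linear')(x)
model_gru = Model(inputs=gru_input, outputs=gru_output)
model_gru.compile(optimizer='adam', loss='mse')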

Conclusion

Using LSTM for time series prediction offers several advantages, particularly for capturing complex patterns in sequential data. However, it also comes with challenges that can be addressed through careful model design and optimization techniques. Exploring alternative models like GRU, ARIMA, or ensemble methods can provide additional insights and potentially improve predictive performance.

Source Code/File: All the data and code used in this article can be found on GitHub. If interested, click the link: Forex Market Prediction with RNN and LSTM Model

Projects you may find interesting :

Forex Market Prediction with RNN and LSTM Model
UAE Real Estate Market Research Project — 2024
Power BI Dashboard for IT Health Services
Sales Analysis with Python
Sales Dashboard with Excel
Electric Vehicle Market Research Project-2024

Thank You For Your Time.

#PredictiveAnalytics #FinancialForecasting #DataAnalysis #AI #TimeSeries #BigData


Written by Dillip Singh

Data Analyst with an interest in research, data analysis, trend forecasting, and AI solutions. Skilled in transforming raw data into actionable insights.