Predicting Stock Data with Cassandra and TensorFlow

Obioma Anomnachi@anant.us on September 12, 2023

In this blog post, we will delve into AstraDB and TensorFlow, two powerful tools that can change the way you handle and analyze large-scale datasets. We will explore the fundamentals of AstraDB, an Apache Cassandra-compatible database, and its integration with TensorFlow, a popular open-source machine learning framework. The main goal of this blog is to show how the tools work and integrate together through a simple demonstration of a real-life scenario, and to help you use this GitHub repo to get hands-on with TensorFlow right away!

Introduction

The scenario that this blog and the associated repo demo cover is time-series forecasting of a specific stock price. The problem itself is common and widely known. That being said, this is a technology demo and is in no way intended as market advice. The purpose of the demo is to show how Astra and TensorFlow can work together for time-series forecasting. We will show how AstraDB can store the required data and model information and serve it back to the code.

Cassandra 5 is releasing soon. In the meantime, AstraDB is a managed Cassandra service that you can use to create a cluster right now!

Set-Up

For this tutorial we need an AstraDB instance up and running, as well as a keyspace and a table. To set these up, follow the steps below:

Setting up an Astra DB database

Go to https://astra.datastax.com/ and sign up or sign in to the dashboard.

Click on the create database button.

Fill in the details of your DB. For this tutorial we will be using tensorflow_demo as a database name and tf_keyspace for the keyspace name.

Create a Secure Connect Bundle and Tokens

After creating the database and keyspace, generate a secure connect bundle and download it. To do so, click the Get Bundle button, pick the region, and click download.

Now click on the “create a custom token” link and adjust the role for the token to allow creation and read/write access. For a quick setup, feel free to set the role for the token and credentials to Database Administrator.

AstraDB tokens play a crucial role in ensuring the security of the database. Tokens are used to authenticate and authorize access to data stored in AstraDB. Each token represents a specific level of access permissions, such as Read-only or Read-Write, allowing administrators to control and restrict data access based on user roles and privileges. This ensures that only authorized users can interact with the database, safeguarding the integrity and confidentiality of the data stored in AstraDB. In our tutorial we’ve used an administrator token with read write permissions on the database to be able to store and retrieve data, as well as create and drop tables and keyspaces.

Getting Test Data

Now that you have the database, credentials and bundle file on your machine, let’s make sure you get the data needed to run this tutorial. This data is open source and available on many platforms. One of them is Kaggle, where people have already done the work of gathering this data via APIs.

Here is the link to that data: https://www.kaggle.com/datasets/jacksoncrow/stock-market-dataset

For this tutorial we’ve used the Dell stock market price data, but you can use a different stock with little or no code change as long as the data follows the same format.

Another source that fits this problem is the Tesla dataset, which is also openly available on GitHub: https://github.com/plotly/datasets/blob/master/tesla-stock-price.csv

The simplest option, and the one we recommend, is to clone our GitHub repository https://github.com/Anant/tensorflow-astra-demo (which we will cover anyway), where we have also added sample data so you can jump right into the code.

Now let’s jump right into the code!

Code for the tutorial

The code for this tutorial can be found in our repo on GitHub (https://github.com/Anant/tensorflow-astra-demo). If you are already familiar with git and JupyterLab, just go ahead and clone the repo and rerun the notebook steps using:

git clone https://github.com/Anant/tensorflow-astra-demo

The readme in the repo also has all the details needed to get started with the code.

Using the Repo

In this section, we cover the steps you’ll need to follow to complete the tutorial from the GitHub repo and highlight some important details.

After cloning the code, navigate to the project directory, and start by installing the dependencies (note that if you are using the GitPod deployment method then this will happen automatically) using the following command:

pip install -r requirements.txt

Note that the Python version used in this example is 3.10.11, so make sure you either have that installed or create a virtual environment using pyenv [https://realpython.com/intro-to-pyenv/] or anaconda [https://docs.anaconda.com/free/anaconda/install/index.html] (whatever works best for you).

Jupyter[Lab] Installation

Last but not least, and before jumping to running the code, make sure you have Jupyter installed on your machine. You can do that by running the command below.

pip install jupyterlab

Note that here we are installing jupyterlab, which is a more advanced version of jupyter notebooks. It offers many features in addition to the traditional notebooks.

Once installed, type jupyter-lab in the terminal. It should open a window in your browser listing your working directory’s content. If you are in the cloned directory, you should see the notebook and be able to follow the steps.

If it doesn’t start automatically, you can navigate to the JupyterLab server by clicking on one of the URLs printed in your terminal.

Updating the Local Secrets File

Now that you’ve run the jupyter command from your working directory, you should be able to see the project tree in your browser and navigate to the files. First, let’s configure the secrets before starting any coding: open the local_secrets.py file and fill in the details obtained from the AstraDB website when you set up the database.
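For reference, here is a minimal sketch of how values like those in local_secrets.py might be used to open a session with the DataStax Python driver (cassandra-driver). The `AstraSecrets` class and its field names are hypothetical and only serve this illustration; use whatever names the repo's local_secrets.py actually defines.

```python
from dataclasses import dataclass


@dataclass
class AstraSecrets:
    """Hypothetical container for the values kept in local_secrets.py."""
    secure_bundle_path: str   # path to the downloaded secure connect bundle (.zip)
    client_id: str            # Client ID of the generated token
    client_secret: str        # Client Secret of the generated token
    keyspace: str = "tf_keyspace"


def cloud_config(secrets: AstraSecrets) -> dict:
    """Build the `cloud` argument that the driver's Cluster accepts for Astra."""
    return {"secure_connect_bundle": secrets.secure_bundle_path}


def connect(secrets: AstraSecrets):
    """Open a session against Astra DB and select the keyspace."""
    # Imported lazily so the helpers above load even without the driver installed.
    from cassandra.auth import PlainTextAuthProvider
    from cassandra.cluster import Cluster

    auth_provider = PlainTextAuthProvider(secrets.client_id, secrets.client_secret)
    cluster = Cluster(cloud=cloud_config(secrets), auth_provider=auth_provider)
    session = cluster.connect()
    session.set_keyspace(secrets.keyspace)
    return session
```

Calling `connect(...)` with your own bundle path and token values should return a `session` object like the one used in the snippets that follow.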

Now we can navigate to the notebook. Let’s start by reading the data and storing it to AstraDB.

Storing the Data in AstraDB

This step simply showcases that you can use AstraDB to feed data to your models. AstraDB is above all else a powerful and scalable database. Data has to be stored somewhere and, whether you are on Astra or Cassandra already or just getting started with the tools, this tutorial shows you how easy it is to use Astra for your backend.

It’s easy to use CQL to create a table and load the sample data from this tutorial into it. The logic is as follows:

# Create training data table
query = """
CREATE TABLE IF NOT EXISTS training_data (
    date text,
    open float,
    high float,
    low float,
    close float,
    adj_close float,
    volume float,
    PRIMARY KEY (date)
)
"""

session.execute(query)

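# Load the data into your table, using a prepared statement so that values
# are bound by the driver instead of being formatted into the query string
insert_stmt = session.prepare(
    "INSERT INTO training_data (date, open, high, low, close, adj_close, volume) "
    "VALUES (?, ?, ?, ?, ?, ?, ?)"
)

for row in data.itertuples(index=False):
    session.execute(insert_stmt, (str(row[0]), *(float(v) for v in row[1:7])))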

Once done with the previous step, let’s read the data back and make sure it is ordered by date, as a time-series forecasting problem requires. This is a data management step to make sure data is ordered for the series.

Use the code below to read the data for your time series and prepare it for use by TensorFlow:

# Retrieve data from the table
query = "SELECT * FROM training_data"
result_set = session.execute(query)

# Convert the result set to a pandas DataFrame
data = pd.DataFrame(list(result_set)).sort_values(by=['date'])

Analyzing the Data

We can now write some code to understand and analyze the data with some graphics. The below chart for example shows the Open and Close prices of the stock (the price when the market opens and when the market closes) for the timeframe of 4 years between 2016 and 2020.

You can find more graphs on the notebook describing some other factors and analyzing the fluctuation of the prices over time.

Let’s move now to building the model itself.

Building the Model

For this example we will be using an LSTM (Long Short-Term Memory) model, a deep learning model known for its ability to recognize patterns in sequential data, which suits the problem we are demonstrating.

It’s also important to note that we will be using Keras (a deep learning API for Python) on top of TensorFlow for this example. The model is composed of 2 LSTM layers, 1 dense layer, 1 dropout layer and finally another dense layer.
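As a rough sketch, the layer stack described above could be written with Keras as follows. The layer widths, the dropout rate, the optimizer and the `window_size` parameter are assumptions made for illustration, not necessarily the values used in the notebook.

```python
def build_model(window_size: int, n_features: int = 1):
    """Return a compiled Keras model: LSTM -> LSTM -> Dense -> Dropout -> Dense."""
    # Imported inside the function so the sketch loads even without TensorFlow.
    from tensorflow import keras

    model = keras.Sequential([
        keras.Input(shape=(window_size, n_features)),
        keras.layers.LSTM(64, return_sequences=True),  # hands the full sequence to the next LSTM
        keras.layers.LSTM(64),                         # keeps only the last hidden state
        keras.layers.Dense(32),
        keras.layers.Dropout(0.5),                     # assumed rate, used only while training
        keras.layers.Dense(1),                         # one predicted (scaled) price
    ])
    model.compile(optimizer="adam", loss="mean_squared_error")
    return model
```

The first LSTM layer sets `return_sequences=True` because a stacked LSTM needs the whole output sequence as its input, while the second one only passes its final state to the dense layers.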

For this tutorial we’ve opted for an LSTM, but you can use whatever model works best for you, including any other regression model that solves the same problem. TensorFlow offers a bunch of them!

Another important aspect to note is the data normalization and scaling.

The model’s performance is very sensitive to the scale and variation of the actual data values; thus, it’s crucial to normalize the data and transform it into a consistent range or distribution. We’ve used one of the most popular scalers in machine learning, the min-max scaler, for data normalization. We also split the data into train and test sets.
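To make the idea concrete, here is a dependency-free sketch of min-max scaling to the range [0, 1] and of its inverse. It illustrates the arithmetic only; in practice a library scaler (for example scikit-learn's MinMaxScaler with its default range) does the same job. The prices below are made up.

```python
def fit_min_max(values):
    """Return the (min, max) pair learned from the given values."""
    return min(values), max(values)


def scale(values, lo, hi):
    """Map values into [0, 1] using x' = (x - lo) / (hi - lo)."""
    span = hi - lo
    if span == 0:
        raise ValueError("cannot scale a constant series")
    return [(v - lo) / span for v in values]


def inverse_scale(scaled, lo, hi):
    """Undo scale(): x = x' * (hi - lo) + lo."""
    return [s * (hi - lo) + lo for s in scaled]


closes = [10.0, 12.5, 15.0, 20.0]          # made-up closing prices
lo, hi = fit_min_max(closes)
scaled = scale(closes, lo, hi)             # [0.0, 0.25, 0.5, 1.0]
restored = inverse_scale(scaled, lo, hi)   # back on the original price scale
```

The same `lo` and `hi` have to be reused when mapping predictions back to prices, which is why the fitted scaler is kept around after training.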

In this case, and for the sake of the demo, we used 95% of the data for training and the rest for testing. Feel free to play with the train-test split parameter to compare the results. You can even add more data and experiment with its configuration.

Once done transforming and configuring the data and the model, we can start the fit step, also known as the training step.
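The sketch below shows one way to do a chronological 95/5 split and to turn a series into the fixed-length windows a sequence model trains on. The function names and the window size are assumptions for illustration; the notebook may prepare its inputs differently.

```python
def chronological_split(series, train_fraction=0.95):
    """Split without shuffling so the test set is strictly later than the training set."""
    cut = round(len(series) * train_fraction)
    return series[:cut], series[cut:]


def make_windows(series, window_size):
    """Turn a series into (inputs, targets): each input is `window_size`
    consecutive values and its target is the value that follows them."""
    inputs, targets = [], []
    for end in range(window_size, len(series)):
        inputs.append(series[end - window_size:end])
        targets.append(series[end])
    return inputs, targets


series = list(range(100))                    # stand-in for scaled closing prices
train, test = chronological_split(series)    # 95 training points, 5 test points
x_train, y_train = make_windows(train, window_size=3)
```

Before being passed to an LSTM, windows like these are typically converted to an array of shape (samples, window_size, features).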

Using the data

Once you have the model trained on the training data, you’ll be able to use it to make a time series prediction.

Running the following command will generate the predicted data:

predictions = model.predict(x_test)

Note that since we’ve scaled our values using the min-max scaler, the generated predictions at this point all fall within the scaling range. To get them back to the original scale we need to invert that transformation, which is as simple as calling the scaler’s inverse_transform method:

predictions = scaler.inverse_transform(predictions)

The following graph shows the training data evolution over time as well as the test vs forecast generated by our model.

Notice the different colors at the far right of the graph. The orange color is the testing subset of the actual data. The green line represents the forecasted stock price. In this case, our forecasting did a relatively good job predicting the drop in stock price. That Predictions line is the main outcome of any forecast.

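Note that for this regression problem we’ve also used some model evaluation metrics, namely MSE (mean squared error) and RMSE (root mean squared error). Below is the snippet of code corresponding to their implementation:

# Evaluation metrics: predictions and y_test must have the same shape
# (for example both (n, 1)), otherwise NumPy broadcasting skews the result
mse = np.mean((predictions - y_test) ** 2)
rmse = np.sqrt(mse)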
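As a small worked example of the same two metrics, using made-up numbers and only the standard library:

```python
import math

actual = [11.0, 12.0, 16.0]      # made-up true prices
predicted = [10.0, 12.0, 14.0]   # made-up model outputs

# Squared difference for each pair of values: [1.0, 0.0, 4.0]
squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
mse = sum(squared_errors) / len(squared_errors)   # 5 / 3
rmse = math.sqrt(mse)                             # same unit as the prices
```

Because RMSE is expressed in the same unit as the price, it is usually the easier of the two to interpret.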

Model Storage

Let’s now move on to model storage. In this section we will be interacting again with our database AstraDB, and more specifically, storing the model and the error metrics in a table. We picked a simple structure for this for visibility purposes.

The table schema is as follows:

The model data itself is stored as a blob: we take the model data, convert it to JSON, and then store it in our table.
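A minimal sketch of that idea follows. The `models` table, its column names and the helper functions are hypothetical and are not the repo's actual schema. The JSON string would typically come from Keras' `model.to_json()`, which describes the architecture only, so weights would need to be saved separately.

```python
import json


def model_json_to_blob(model_json: str) -> bytes:
    """Encode a model's JSON description as bytes suitable for a CQL blob column."""
    json.loads(model_json)            # fail early if the string is not valid JSON
    return model_json.encode("utf-8")


def blob_to_model_json(blob: bytes) -> str:
    """Decode a blob read back from the table into the original JSON string."""
    return bytes(blob).decode("utf-8")


# Hypothetical table; the column names are assumptions, not the repo's schema.
CREATE_MODELS_TABLE = """
CREATE TABLE IF NOT EXISTS models (
    model_name text PRIMARY KEY,
    model_data blob,
    mse float,
    rmse float
)
"""


def save_model(session, name, model_json, mse, rmse):
    """Store the serialized model and its error metrics in one row."""
    session.execute(CREATE_MODELS_TABLE)
    insert_stmt = session.prepare(
        "INSERT INTO models (model_name, model_data, mse, rmse) VALUES (?, ?, ?, ?)"
    )
    session.execute(
        insert_stmt,
        (name, model_json_to_blob(model_json), float(mse), float(rmse)),
    )
```

Reading the row back and passing `blob_to_model_json(row.model_data)` to `keras.models.model_from_json` would then rebuild the architecture as `loaded_model`.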

We can also take a look at the model’s summary, and make sure that it corresponds exactly to what we trained and stored previously:

loaded_model.summary()

You should see something like this:

And you’ve made it to the end of this tutorial!

As we wrap up this blog, you’ve now learned how to interact with AstraDB and use it as a source of your model’s artifacts and data.

In this blog, we have focused on one specific integration of TensorFlow with AstraDB, but it’s important to note that AstraDB (and Cassandra more generally) has multiple use cases and can be integrated with various tools and applications. While our discussion centered on its integration with TensorFlow, Cassandra’s versatility extends beyond this particular context. So stay tuned for more exciting and cool stuff!

Conclusion

In this blog, we explored the powerful combination of TensorFlow and Cassandra, this time by demoing TensorFlow with AstraDB. Throughout the tutorial, we covered essential steps, such as setting up the environment, installing the necessary dependencies, and integrating TensorFlow with AstraDB. We explored how to perform common tasks like data ingestion, preprocessing, and model training using TensorFlow’s extensive library of functions. Additionally, we discussed how AstraDB’s flexible schema design and powerful query capabilities enable efficient data retrieval and manipulation.

By combining TensorFlow and AstraDB, developers can unlock a world of possibilities for creating advanced machine learning applications. From large-scale data analysis to real-time predictions, this powerful duo empowers us to build intelligent systems that can handle massive datasets and deliver accurate results.

Getting help

You can reach out to us on the Planet Cassandra Discord Server to get specific support for this demo. You can also reach out to the Astra team through the chat on Astra’s website. TensorFlow has a ton of use cases that can benefit virtually any enterprise. You’ve learned to do basic time series prediction using TensorFlow. Let us know what you do next! Happy coding!

Resources

https://github.com/Anant/tensorflow-astra-demo

Astra Resources

Tensorflow Resources

Stock Market price prediction using LSTM
