Loading Vector Data into Cassandra in Parallel using Ray
This blog will delve into the nuances of combining the prowess of DataStax Astra with the power of Ray, and is a companion to this demo on GitHub. We’ll explore the step-by-step procedure, the pitfalls to avoid, and the advantages this dynamic duo brings to the table. Whether you’re a data engineer, a developer looking to optimize your workflows, or just a tech enthusiast curious about the latest in data solutions, this guide promises insights aplenty. Soon you’ll be able to use Cassandra 5 in place of Astra DB in this demo, but for now Astra DB is a great way to get started with a vector-search-capable Cassandra database!
Introduction
Vector search is a technology that works by turning the data we are interested in into numerical representations of locations in a coordinate system, called vectors. Similar items end up with vector locations close to each other in this space, so we can take an item and find the items most similar to it. A database that holds and operates on vectors is called a vector store. This functionality is coming to Cassandra 5.0, which is releasing soon; to preview it, we can make use of DataStax Astra. In this case, the items are bits of text that are embedded: embedding runs text through a machine learning model that returns vectors representing the data. You can think of embedding as translating data from real text into vectors.
Ray is a processing engine for Python code that is specialized for distributed machine learning tasks. In this tutorial we use Ray Core to parallelize running text through a specific embedding model so that we can load the resulting vectors into Astra.
We use Ray to speed up the process of embedding our text chunks. While using a pre-trained machine learning model is less taxing than training a new one in terms of the number of calculations involved, running the model a large number of times can still take a long time and a significant amount of computing power. Ray lets us run the process on multiple machines simultaneously to reduce how long it takes to complete.
Prerequisites
For this tutorial you will need:
- An Astra Account from DataStax. Sign up for a Free Tier Astra account here.
- A Colaboratory account from Google.
Before you can proceed further you will need to set up your Astra database. After creating a free account, you will need to create a database within that account and a keyspace within that database. All of this can be done using the Astra UI.
Setting up your Astra DB
There is a ton of great documentation for how to create an Astra Database which is included in the Knowledge Resources at the bottom of this STACK page.
In brief, go to astra.datastax.com, create or sign in to your account, and create a new database; they are free for most users. For this demo, we use a database called vector-search and a keyspace named vector_search. To create the database, select the “Databases” tab in the left menu, click the “Create Database” button on the right, and fill out the needed information for your database.
Note that our demo instructions use the keyspace name: vector_search.
Once you’ve created your database, you’ll need to generate a token and download the Secure Connect bundle from the Connect tab in order to connect to it. Choose the permissions that make the most sense for your use case.
Astra DB tokens play a crucial role in ensuring the security of the database. Tokens are used to authenticate and authorize access to data stored in Astra DB. Each token represents a specific level of access permission, such as read-only or read/write, allowing administrators to control and restrict data access based on user roles and privileges. This ensures that only authorized users can interact with the database, safeguarding the integrity and confidentiality of the stored data. For this demo, choose Database Administrator: in our tutorial we use an administrator token with read/write permissions on the database so we can store and retrieve data, as well as create and drop tables and keyspaces.
Never share your token or Secure Connect bundle with anyone. Together they contain everything needed to access your database.
Running the Code
Once the setup is complete, open Google Colab. Under File → Open Notebook, go to the GitHub tab and paste in the link to astra_vector_search.ipynb from our repo in order to open the notebook in Colab.
Download the files local_creds_secrets.py and requirements.txt from the GitHub repo. Then download your Secure Connect bundle and generate a token from the Astra UI.
Edit local_creds_secrets.py, pasting your client ID and client secret from Astra DB into the empty strings on the specified rows. Change the file name on the secure_connect_bundle line to match the file name of your Secure Connect bundle. If you changed the keyspace name when creating your database, enter it on the db_keyspace line.
In Google Colab, open the file sidebar on the left of the screen and upload local_creds_secrets.py, requirements.txt, and your Secure Connect bundle. Then restart the runtime. To run individual cells from the notebook, select them and press Ctrl+Enter.
Explaining and Using the Code
Within the notebook, you can click on any cell to edit its contents. To execute a cell and see its output, select the cell and press Shift+Enter. As you make modifications, save your notebook regularly using the save icon or Ctrl+S. Once you’re done, you can release the notebook’s resources from the Runtime menu by selecting Disconnect and delete runtime.
Notebook Cell Explanation
The first cell in the notebook installs the Python dependencies listed in requirements.txt.
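In Colab, that cell typically looks like the following, assuming requirements.txt was uploaded to the session as described above:

!pip install -r requirements.txt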
The second cell imports Ray and starts a Ray runtime whose runtime environment includes the specified dependencies.
You can instead connect your notebook to an existing Ray cluster during this step: provide the address of the head node of an external Ray cluster to ray.init in order to run Ray processes on that cluster.
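A minimal sketch of both variants is below; the package list and head-node address are illustrative placeholders rather than the demo’s exact values:

import ray

# Start a local Ray runtime whose workers have the notebook's
# dependencies available (the pip list here is illustrative).
ray.init(runtime_env={"pip": ["langchain", "sentence-transformers", "cassandra-driver"]})

# Alternatively, attach to an existing cluster via the Ray Client:
# ray.init(address="ray://<head-node-ip>:10001")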
After that, we load more packages and create the RecursiveCharacterTextSplitter object, which splits long text objects into shorter ones of a specified maximum length.
Then we use ArxivLoader (a langchain utility for downloading text versions of scientific research papers from arxiv.org) to pull a paper and split it up into chunks using the previously defined text splitter.
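Roughly, those two steps look like the sketch below; the chunk size, overlap, and arXiv paper ID are illustrative values, not necessarily those used in the demo notebook:

from langchain.document_loaders import ArxivLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split long documents into overlapping chunks of bounded length.
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)

# Download the text of one paper from arxiv.org and chunk it.
docs = ArxivLoader(query="1706.03762", load_max_docs=1).load()
chunks = splitter.split_text(docs[0].page_content)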
Once that is complete, we create a Ray Dataset from our collection of split text chunks. We do some minor pre-processing: ArxivLoader inserts a newline character wherever there is a line break in the paper, and we replace each one with a normal space character. We parallelize this simple operation by calling the Ray flat_map function on our collection, applying the standard Python replace function to each chunk of text.
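Continuing from the chunks above, that step looks roughly like this:

import ray

# Build a Ray Dataset from the text chunks, then swap newlines for spaces.
ds = ray.data.from_items([{"text": chunk} for chunk in chunks])
ds = ds.flat_map(lambda row: [{"text": row["text"].replace("\n", " ")}])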
Next, after we define the name of our embedding model (“intfloat/multilingual-e5-small” from HuggingFace, a model which turns up to 512 tokens of text into a vector of length 384), we create a class which loads the model upon creation and uses the model to return our embedded vectors when called.
Then we use the Ray map_batches function with the Embed class as an argument to apply the embedding model to each piece of text in our dataset. The map_batches function applies a user-provided function or callable class to all elements of a Ray Dataset, splitting the data into batches so the work can be parallelized.
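A sketch of that pattern follows. The demo’s actual class may differ in detail; this version loads the model with the sentence-transformers library, and the batch size and actor-pool size are illustrative:

from ray.data import ActorPoolStrategy
from sentence_transformers import SentenceTransformer

class Embed:
    def __init__(self):
        # Load the model once per actor; every subsequent call reuses it.
        self.model = SentenceTransformer("intfloat/multilingual-e5-small")

    def __call__(self, batch):
        # Each text chunk becomes a vector of length 384.
        batch["embeddings"] = self.model.encode(list(batch["text"]))
        return batch

# Run Embed as a pool of actors, each embedding one batch at a time.
embedded = ds.map_batches(Embed, batch_size=64, compute=ActorPoolStrategy(size=4))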
After that, we connect to Astra and create the table and index for storing the vector data. Our table is named papers and is defined using the following schema:
CREATE TABLE IF NOT EXISTS papers
(
id int PRIMARY KEY,
name TEXT,
description TEXT,
item_vector VECTOR<FLOAT, 384>
);
We also create a Storage Attached Index (SAI) named ann_index, defined below:
CREATE CUSTOM INDEX IF NOT EXISTS ann_index
ON {table_name}(item_vector) USING 'StorageAttachedIndex';
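The connection itself uses the Python Cassandra driver together with the Secure Connect bundle and token credentials you saved in local_creds_secrets.py. Here is a minimal sketch, with placeholder credentials and the file and keyspace names from this demo’s setup:

from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider

# Authenticate with the token's client ID and secret (placeholders here).
cloud_config = {"secure_connect_bundle": "secure-connect-vector-search.zip"}
auth_provider = PlainTextAuthProvider("<client_id>", "<client_secret>")
cluster = Cluster(cloud=cloud_config, auth_provider=auth_provider)
session = cluster.connect("vector_search")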
Once the table and index are created, we can load our data into the database. This is where we take the Python object that contains the article text and the embedded vectors and process each entry: we build an insert query for each row and send it to the database. Do this by running the cell in the notebook under the header “Insert vector records into AstraDB”.
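As a sketch, the per-row insert might look like this, continuing from the session and embedded dataset above and assuming a driver version with vector support (3.28 or later); the name value is a placeholder:

# Prepare the insert once, then execute it for every embedded chunk.
insert_stmt = session.prepare(
    "INSERT INTO papers (id, name, description, item_vector) VALUES (?, ?, ?, ?)"
)
for i, row in enumerate(embedded.iter_rows()):
    session.execute(insert_stmt, (i, "paper-title", row["text"], list(row["embeddings"])))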
We can then search the table using an ANN (Approximate Nearest Neighbor) query, like this:
SELECT * FROM vector_search.papers
ORDER BY item_vector ANN OF {embedding}
LIMIT 2;
where {embedding} is the vector produced by running our example sentence through the same embedding model we used for the documents.
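From Python, that search might be issued as below, reusing the same embedding model for the query text; the query sentence is a placeholder, and the vector parameter binding assumes a recent driver version:

from sentence_transformers import SentenceTransformer

# Embed the query text with the same model used for the documents.
model = SentenceTransformer("intfloat/multilingual-e5-small")
query_vector = list(model.encode("How does attention work in transformers?"))

rows = session.execute(
    "SELECT * FROM vector_search.papers ORDER BY item_vector ANN OF %s LIMIT 2",
    (query_vector,),
)
for r in rows:
    print(r.id, r.description[:80])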
Conclusion
That puts us at the end of the embedding process.
In summary, the first step in making use of Astra’s new vector storage feature is to load pieces of vectorized text into an Astra database. This means running bits of text through a vector embedding model, often tens of thousands to millions of times. To speed this up, we can use a multiprocessing framework to split the work into manageable pieces and complete it quickly across many machines. For this we use Ray, a Python compute framework optimized for machine learning tasks. We create a temporary Ray cluster inside the Google Colab resources and pull down data using ArxivLoader, which we then split, vectorize, and upload into Astra. Once all that is done, we can query the data using Astra’s ANN vector search feature.
Getting help
You can reach out to us on the Planet Cassandra Discord Server to get specific support for this demo. You can also reach out to the Astra team through the chat on Astra’s website. You can use this demo to introduce yourself to Ray — a pivotally important data engineering tool that takes many of the benefits of Spark and tunes them specifically for AI/ML and vector processing. We’ve found it works very well with Cassandra and we’re sure you will too. Happy coding!
Resources
https://astra.datastax.com
https://www.anyscale.com/blog/llm-open-source-search-engine-langchain-ray
https://www.anyscale.com/blog/turbocharge-langchain-now-guide-to-20x-faster-embedding
https://github.com/Anant/ray-vector-embedding/tree/main
https://docs.ray.io/en/latest/
https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.map_batches.html