Creating a SaaS Machine Learning Solution with RAG Pipelines — Part 1 [Week 1]

Gopal Singh
5 min read · Mar 5, 2024


Unlocking Advanced Natural Language Processing Capabilities for Businesses

If you are a developer, whatever your stack, it’s improbable that you haven’t heard about LLMs. So, why don’t we start making something cool out of them? Shall we?

In this series, I will be coding 4 SaaS applications, and the series will run for the next 30 days.

You can keep track of this repository for reference -

https://github.com/Akshay594/SaaS-LLM

The tech stack I will be using throughout this series -

  1. Python FastAPI
  2. React.js
  3. LlamaIndex
  4. LangChain
  5. AWS services for hosting
  6. A SaaS domain name

So, we will go from setting up the backend to building the front end, and then we will deploy the entire application with the help of AWS services.

Let’s first decide what business problem we want to solve with this SaaS app.

Objective -

We will create a SaaS application that helps manage student data across a university’s database.

We will be able to automate tasks such as sending students prompt-generated emails letting them know where they excelled in an exam and where they fell short.

Here is the sample data from Kaggle, which I downloaded from the following link: https://www.kaggle.com/datasets/rohithmahadevan/students-marksheet-dataset

Name    | Gender | Age | Section | Science | English | History | Maths
Bronnie | Female | 13  | C       | 21      | 81      | 62      | 49
Lemmie  | Male   | 15  | B       | 29      | 41      | 17      | 40
Danya   | Female | 14  | C       | 12      | 87      | 16      | 96
Denna   | Female | 14  | B       | 15      | 53      | 82      | 33
Jocelin | Male   | 14  | A       | 43      | 6       | 3       | 21
Malissa | Female | 14  | C       | 98      | 51      | 85      | 76
Ichabod | Female | 14  | B       | 38      | 74      | 54      | 60
Corrine | Male   | 15  | A       | 39      | 16      | 22      | 49
Tate    | Male   | 15  | C       | 35      | 25      | 37      | 27
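
Before wiring this into a RAG pipeline, it’s worth a quick sanity check with pandas. A minimal sketch, assuming you saved the downloaded CSV as data/student_marks.csv (your filename may differ):

import pandas as pd

# Load the marksheet CSV; adjust the path to wherever you saved the file.
df = pd.read_csv("data/student_marks.csv")

print(df.shape)      # (rows, columns); the full dataset has 250 student rows
print(df.head())     # first few rows, matching the sample table above
print(df.describe()) # summary statistics for the numeric mark columns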

I know there are plenty of tutorials on the internet that may or may not cover this topic, but I had a hard time finding them, so here I am, making one for my own mental clarity.

In this section, we will set up the class for a RAG pipeline that can ingest structured and unstructured data with the help of llama-index.

Let’s get to the coding now. But first, we need to understand how exactly llama-index helps you ingest data and talk to it.

There are four stages in the llama-index RAG pipeline (sketched end to end right after this list) -

  1. Loading the data
  2. Indexing the data
  3. Storing the index
  4. Querying the data
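
Before we build the class step by step, here is a minimal sketch of what those four stages look like end to end in llama-index. The data and storage folder names are my own placeholders, and an OpenAI API key is assumed to be set in the environment:

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# 1. Load: read every file in the folder into Document objects.
documents = SimpleDirectoryReader(input_dir="data").load_data()

# 2. Index: embed the documents into a vector index.
index = VectorStoreIndex.from_documents(documents)

# 3. Store: persist the index to disk so it can be reloaded later.
index.storage_context.persist(persist_dir="storage")

# 4. Query: ask natural-language questions against the indexed data.
query_engine = index.as_query_engine()
print(query_engine.query("Tell me about this data."))

Now let’s go through each stage in detail.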

Loading the data -

To query the data, we first need the data. To read it, we need connectors (also called readers) for different types of data: structured, unstructured, images, audio spectrograms, and so on.

We will start with a very basic one: CSV/Excel files. Throughout the series, we will use https://www.kaggle.com/datasets/rohithmahadevan/students-marksheet-dataset

You can download the data from there and get ready for the next steps.

Now let’s start building the class, which accepts a folder path with the data in it.

from llama_index.core import SimpleDirectoryReader

class StudentAgent:
    def __init__(self, file_path: str) -> None:
        # Read every file in the given folder into Document objects.
        reader = SimpleDirectoryReader(input_dir=file_path)
        self.documents = reader.load_data()
        print(f"Document has been loaded from : {file_path} folder.")

agent = StudentAgent("data")

And that’s how my setup looks, with a `data` folder and `app.py`.
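
For reference, this is the folder layout the snippet assumes (the CSV filename is just whatever you named the Kaggle download):

project/
├── app.py
└── data/
    └── student_marks.csv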

That’s the basic setup done so far! Let’s index this data and start querying it already.

What is Indexing?
Indexing in LLM applications turns text into numbers so that information can be found quickly and accurately.

Let’s imagine you want to find information about “cat” in a sentence.

We’ll use the simple indexed sentence: “The cat sat on the mat,” where each word is represented by a single number.

Indexed Sentence:

  1. The — [0.1]
  2. cat — [0.8]
  3. sat — [0.4]
  4. on — [0.2]
  5. the — [0.1]
  6. mat — [0.7]

To find info about “cat” once each word has been turned into a number, like “cat” = [0.8], we simply look for the number [0.8] in our list.

If you ask, “Where did the cat sit?”, the model sees [0.8] and quickly finds the sentence about “cat” to answer your question. This way, it easily spots the right info using numbers.
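
To make that concrete, here is a toy sketch of the same idea in plain Python. The numbers are the made-up ones from the example above, not real embeddings; in practice each word (or chunk of text) maps to a high-dimensional vector, and we search for the closest vectors rather than an exact match:

# Toy "index": each word mapped to its made-up number from the example.
index = {"The": 0.1, "cat": 0.8, "sat": 0.4, "on": 0.2, "the": 0.1, "mat": 0.7}

def lookup(value, idx):
    # Return the word whose number is closest to the query value,
    # mimicking nearest-neighbor search over embeddings.
    return min(idx.items(), key=lambda item: abs(item[1] - value))

word, number = lookup(0.8, index)
print(word)  # cat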

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
import os

os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"

class StudentAgent:
    def __init__(self, file_path: str) -> None:
        reader = SimpleDirectoryReader(input_dir=file_path)
        self.documents = reader.load_data()
        self.index = None
        print(f"Document has been loaded from : {file_path} folder.")

    def make_index(self):
        # Embed the loaded documents into an in-memory vector index
        # (uses the OpenAI embedding API by default, hence the key above).
        self.index = VectorStoreIndex.from_documents(self.documents)
        print("Index has been made: ", self.index)

agent = StudentAgent("data")
agent.make_index()

After running the file with the `python app.py` command you should see the following output on your terminal.

Document has been loaded from : `data` folder.
Index has been made: <llama_index.core.indices.vector_store.base.VectorStoreIndex object at 0x14ba92f50>
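
One thing to note: make_index re-embeds the documents on every run, which means an embedding API call each time. That is where the “storing” stage from the list above comes in. A minimal sketch, with storage as my placeholder directory name:

from llama_index.core import StorageContext, load_index_from_storage

# After building the index once, save it to disk...
agent.index.storage_context.persist(persist_dir="storage")

# ...and on later runs, reload it instead of re-embedding everything.
storage_context = StorageContext.from_defaults(persist_dir="storage")
index = load_index_from_storage(storage_context)

We won’t need persistence for this first experiment, so the class keeps the index in memory.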

Next, we will query the index that we have created from the loaded documents.

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
import os

os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"

class StudentAgent:
    def __init__(self, file_path: str) -> None:
        reader = SimpleDirectoryReader(input_dir=file_path)
        self.documents = reader.load_data()
        self.index = None
        print(f"Document has been loaded from : `{file_path}` folder.")

    def make_index(self):
        self.index = VectorStoreIndex.from_documents(self.documents)
        print("Index has been made: ", self.index)

    def make_query(self, query):
        # Build a query engine over the index: it retrieves the most
        # relevant chunks and sends them to the LLM with the question.
        query_engine = self.index.as_query_engine()
        return query_engine.query(query)

agent = StudentAgent("data")
agent.make_index()
response = agent.make_query("Tell me about this data.")
print(response)

You should see the following output -

Document has been loaded from : `data` folder.
Index has been made: <llama_index.core.indices.vector_store.base.VectorStoreIndex object at 0x12d49bf90>
This data appears to be a collection of student information and their corresponding marks or grades. Each entry includes details such as student ID, name, gender, age, grade, and marks in different subjects. The data seems to be structured in a CSV format with rows representing individual students and columns representing different attributes.

Okay, this seems fine. But can we run any data analysis over it? Let’s try -

agent = StudentAgent("data")
agent.make_index()
response = agent.make_query("How many students do we have in the data?")
print(response)

We get the following output -

There are 50 students in the data.

This is certainly not true, since we have 250 students in our data.
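
You can see what went wrong by inspecting what the query engine actually handed to the LLM. A hedged sketch: the response object exposes the retrieved chunks as source_nodes, and with the default settings only the top-k most similar chunks reach the model, never the whole CSV:

response = agent.make_query("How many students do we have in the data?")

# Only a few chunks of the CSV are retrieved, so any "count" the
# LLM gives is a guess based on partial context.
print(len(response.source_nodes))
for node in response.source_nodes:
    print(node.node.get_content()[:200])

So, let’s address this retrieval issue in the next article.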

Thanks,

Gopal

If you like my work, please subscribe to my newsletter on Substack!

https://gopalsingh.substack.com


Written by Gopal Singh

Data Scientist, Programmer, Writer. Let’s connect: https://www.linkedin.com/in/theunblunt
