Creating a SaaS Machine Learning Solution with RAG Pipelines — Part 1 [Week 1]

Gopal Singh
5 min read · Mar 5, 2024


Unlocking Advanced Natural Language Processing Capabilities for Businesses

If you are a developer, whatever your stack, it’s improbable that you haven’t heard about LLMs. So, why don’t we start making something cool out of them? Shall we?

In this series, I will be coding 4 SaaS applications, and the series will run for the next 30 days.

You can keep track of this repository for reference -

https://github.com/Akshay594/SaaS-LLM

The tech stack I will be using throughout this series -

  1. Python FastAPI
  2. React.js
  3. LlamaIndex
  4. LangChain
  5. AWS services for hosting
  6. A SaaS domain name

So, we will go from setting up the backend to building the front end, and then we will deploy the entire application with the help of AWS services.

Let’s first decide what business problem we want to solve with this SaaS app.

Objective -

We will create a SaaS application that helps manage student data across a university’s database.

We will be able to automate tasks such as sending students prompt-generated emails letting them know where they excelled in an exam and where they fell short.

Here is the sample data from Kaggle, which I downloaded from the following link: https://www.kaggle.com/datasets/rohithmahadevan/students-marksheet-dataset

Name    | Gender | Age | Section | Science | English | History | Maths
Bronnie | Female | 13  | C       | 21      | 81      | 62      | 49
Lemmie  | Male   | 15  | B       | 29      | 41      | 17      | 40
Danya   | Female | 14  | C       | 12      | 87      | 16      | 96
Denna   | Female | 14  | B       | 15      | 53      | 82      | 33
Jocelin | Male   | 14  | A       | 43      | 6       | 3       | 21
Malissa | Female | 14  | C       | 98      | 51      | 85      | 76
Ichabod | Female | 14  | B       | 38      | 74      | 54      | 60
Corrine | Male   | 15  | A       | 39      | 16      | 22      | 49
Tate    | Male   | 15  | C       | 35      | 25      | 37      | 27
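
Before wiring this into a RAG pipeline, it’s worth a quick sanity check with pandas. A minimal sketch, assuming you saved the downloaded CSV as data/student_marks.csv (your filename may differ):

import pandas as pd

# Load the marksheet CSV; adjust the path to wherever you saved the file.
df = pd.read_csv("data/student_marks.csv")

print(df.shape)      # (rows, columns); the full dataset has 250 student rows
print(df.head())     # first few rows, matching the sample table above
print(df.describe()) # summary statistics for the numeric mark columns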

I know there are plenty of tutorials on the internet that may or may not cover this topic, but I had a hard time finding them, so here I am, making one for my own mental clarity.

In this section, we will set up the class for a RAG pipeline that can ingest structured and unstructured data with the help of llama-index.

Let’s get to the coding now. But first, we need to understand how exactly llama-index helps you ingest data and talk to it.

There are four stages in the llama-index RAG pipeline (sketched end to end right after this list) -

  1. Loading the data
  2. Indexing the data
  3. Storing the index
  4. Querying the data
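
Before we build the class step by step, here is a minimal sketch of what those four stages look like end to end in llama-index. The data and storage folder names are my own placeholders, and an OpenAI API key is assumed to be set in the environment:

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# 1. Load: read every file in the folder into Document objects.
documents = SimpleDirectoryReader(input_dir="data").load_data()

# 2. Index: embed the documents into a vector index.
index = VectorStoreIndex.from_documents(documents)

# 3. Store: persist the index to disk so it can be reloaded later.
index.storage_context.persist(persist_dir="storage")

# 4. Query: ask natural-language questions against the indexed data.
query_engine = index.as_query_engine()
print(query_engine.query("Tell me about this data."))

Now let’s go through each stage in detail.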

Loading the data -

To query the data, we first need the data. To read it, we need connectors (also called readers) for different types of data: structured, unstructured, images, audio spectrograms, and so on.

We will start with a very basic one: CSV/Excel files. Throughout the series, we will use https://www.kaggle.com/datasets/rohithmahadevan/students-marksheet-dataset

You can download the data from there and get ready for the next steps.

Now let’s start building the class, which accepts a folder path with the data in it.

from llama_index.core import SimpleDirectoryReader

class StudentAgent:
    def __init__(self, file_path: str) -> None:
        # Read every file in the given folder into Document objects.
        reader = SimpleDirectoryReader(input_dir=file_path)
        self.documents = reader.load_data()
        print(f"Document has been loaded from : {file_path} folder.")

agent = StudentAgent("data")

And that’s how my setup looks, with a `data` folder and `app.py`.
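
For reference, this is the folder layout the snippet assumes (the CSV filename is just whatever you named the Kaggle download):

project/
├── app.py
└── data/
    └── student_marks.csv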

That’s the basic setup done so far! Let’s index this data and start querying it already.

What is Indexing?
Indexing in LLM applications turns text into numbers so that information can be found quickly and accurately.

Let’s imagine you want to find information about “cat” in a sentence.

We’ll use the simple indexed sentence: “The cat sat on the mat,” where each word is represented by a single number.

Indexed Sentence:

  1. The — [0.1]
  2. cat — [0.8]
  3. sat — [0.4]
  4. on — [0.2]
  5. the — [0.1]
  6. mat — [0.7]

To find info about “cat” once each word has been turned into a number, like “cat” = [0.8], we simply look for the number [0.8] in our list.

If you ask, “Where did the cat sit?”, the model sees [0.8] and quickly finds the sentence about “cat” to answer your question. This way, it easily spots the right info using numbers.
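
To make that concrete, here is a toy sketch of the same idea in plain Python. The numbers are the made-up ones from the example above, not real embeddings; in practice each word (or chunk of text) maps to a high-dimensional vector, and we search for the closest vectors rather than an exact match:

# Toy "index": each word mapped to its made-up number from the example.
index = {"The": 0.1, "cat": 0.8, "sat": 0.4, "on": 0.2, "the": 0.1, "mat": 0.7}

def lookup(value, idx):
    # Return the word whose number is closest to the query value,
    # mimicking nearest-neighbor search over embeddings.
    return min(idx.items(), key=lambda item: abs(item[1] - value))

word, number = lookup(0.8, index)
print(word)  # cat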

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
import os

os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"

class StudentAgent:
    def __init__(self, file_path: str) -> None:
        reader = SimpleDirectoryReader(input_dir=file_path)
        self.documents = reader.load_data()
        self.index = None
        print(f"Document has been loaded from : {file_path} folder.")

    def make_index(self):
        # Embed the loaded documents into an in-memory vector index
        # (uses the OpenAI embedding API by default, hence the key above).
        self.index = VectorStoreIndex.from_documents(self.documents)
        print("Index has been made: ", self.index)

agent = StudentAgent("data")
agent.make_index()

After running the file with the `python app.py` command you should see the following output on your terminal.

Document has been loaded from : `data` folder.
Index has been made: <llama_index.core.indices.vector_store.base.VectorStoreIndex object at 0x14ba92f50>
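
One thing to note: make_index re-embeds the documents on every run, which means an embedding API call each time. That is where the “storing” stage from the list above comes in. A minimal sketch, with storage as my placeholder directory name:

from llama_index.core import StorageContext, load_index_from_storage

# After building the index once, save it to disk...
agent.index.storage_context.persist(persist_dir="storage")

# ...and on later runs, reload it instead of re-embedding everything.
storage_context = StorageContext.from_defaults(persist_dir="storage")
index = load_index_from_storage(storage_context)

We won’t need persistence for this first experiment, so the class keeps the index in memory.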

Next, we will query the index that we have created from the loaded documents.

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
import os

os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"

class StudentAgent:
    def __init__(self, file_path: str) -> None:
        reader = SimpleDirectoryReader(input_dir=file_path)
        self.documents = reader.load_data()
        self.index = None
        print(f"Document has been loaded from : `{file_path}` folder.")

    def make_index(self):
        self.index = VectorStoreIndex.from_documents(self.documents)
        print("Index has been made: ", self.index)

    def make_query(self, query):
        # Build a query engine over the index: it retrieves the most
        # relevant chunks and sends them to the LLM with the question.
        query_engine = self.index.as_query_engine()
        return query_engine.query(query)

agent = StudentAgent("data")
agent.make_index()
response = agent.make_query("Tell me about this data.")
print(response)

You should see the following output -

Document has been loaded from : `data` folder.
Index has been made: <llama_index.core.indices.vector_store.base.VectorStoreIndex object at 0x12d49bf90>
This data appears to be a collection of student information and their corresponding marks or grades. Each entry includes details such as student ID, name, gender, age, grade, and marks in different subjects. The data seems to be structured in a CSV format with rows representing individual students and columns representing different attributes.

Okay, this seems fine. But can we run any data analysis over it? Let’s try -

agent = StudentAgent("data")
agent.make_index()
response = agent.make_query("How many students do we have in the data?")
print(response)

We get the following output -

There are 50 students in the data.

This is certainly not true, since we have 250 students in our data.
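
You can see what went wrong by inspecting what the query engine actually handed to the LLM. A hedged sketch: the response object exposes the retrieved chunks as source_nodes, and with the default settings only the top-k most similar chunks reach the model, never the whole CSV:

response = agent.make_query("How many students do we have in the data?")

# Only a few chunks of the CSV are retrieved, so any "count" the
# LLM gives is a guess based on partial context.
print(len(response.source_nodes))
for node in response.source_nodes:
    print(node.node.get_content()[:200])

So, let’s address this retrieval issue in the next article.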

Thanks,

Gopal

If you like my work, please subscribe to my newsletter on Substack!

https://gopalsingh.substack.com


Written by Gopal Singh

Data Scientist, Programmer, Writer. Let’s connect: https://www.linkedin.com/in/theunblunt
