Local DeepSeek-R1 AI RAG using JavaScript (TypeScript)
What are we going to build in this tutorial?
This tutorial is about building a Retrieval Augmented Generation (RAG) pipeline in JavaScript (TypeScript).
- Download DeepSeek-R1 using Ollama and then run the LLM model locally (privacy first, your data does not get transferred to third parties)
- Use create-next-app to quickly set up a Next.js 15 project
- Create some UI components for our chat bot using React 19
- Build a simple frontend and a backend using the ai SDK (which will handle the most difficult bits of streaming a response)
- Install PostgreSQL and pgvector to store our vectors (embeddings)
- Create a Retrieval-Augmented Generation (RAG) pipeline using LangChain.js, that will extract the content from our local (markdown) documents, turn each chunk into an embedding vector and finally store it in our vector database
- Find relevant knowledge using a similarity search in our vector database
- And finally add the knowledge to the prompt, to give the AI access to your own custom knowledge base (before it starts reasoning / answering)
The final source code for this project is available in the js-deepseek-r1-local-rag GitHub repository
Local (offline) DeepSeek-R1 using Ollama
In this chapter we will start by installing Ollama and then ask it to pull the DeepSeek model. Once the model is running on our own hardware we can test a few prompts using our terminal
Ollama installation
Windows: To install Ollama on Windows, download the Ollama setup exe from the ollama website; once it has finished downloading, click on the download in your browser or open the download folder and double click on the file to start the Ollama installation
After the installation has finished, Ollama will get added to your PATH; you might want to restart your terminal to make sure it is aware of the new PATH
Linux: To install Ollama on Linux you can use the following command:
curl -fsSL https://ollama.com/install.sh | sh
Or check out the "manual installation" on Linux documentation in the Ollama repository
macOS: To install Ollama on macOS I recommend using brew (the ollama formula), or if you prefer you can download a zip file for macOS from the ollama website
Ollama quickstart
We will start with the deepseek-r1:1.5b model, which is the smallest deepseek-r1 model that is available. We will then make a first test to see how well it performs on our hardware; if your hardware is powerful enough you can later switch to a version that is more hungry for computational power, or if you prefer you can use another model from the Ollama library, like the Llama 3 model.
Ollama has the following estimate when it comes to resources that need to be available on your local machine:
You should have at least 8 GB of RAM available to run the 7B models, 16 GB to run the 13B models, and 32 GB to run the 33B models.
In your terminal use the following command to pull the DeepSeek-R1-Distill-Qwen-1.5B model using ollama:
ollama run deepseek-r1:1.5b
As this is the first time we ask Ollama to run that model, it will first download it. After the download has finished, Ollama will run the model and then let you enter your first message; try asking it something like this:
Please identify yourself
DeepSeek-R1 will do some thinking and then reply with a short answer, mine was like this:
Hi! I'm DeepSeek-R1, an AI assistant created by DeepSeek. I'm here to help answer questions, provide information, and assist with various tasks. Thank you for asking! How can I assist you today?
Great, we have verified that DeepSeek-R1 is running locally (offline) and can answer our questions
To exit the conversation (the ollama shell) press Ctrl+D (macOS: ⌘D, Linux: Ctrl+D)
Ollama commands (optional)
Ollama can be used for more than just relaying questions to the model; an easy way to find out more about those commands is to type /? as a message to Ollama, after you press Enter you will get a list of useful Ollama commands
To list the models that are currently installed locally use this command:
ollama list
To stop a model that is currently running, use the stop command followed by the model name, for example:
ollama stop deepseek-r1:1.5b
To see what version of Ollama is currently installed use:
ollama -v
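Another useful command is ollama ps, which lists the models that are currently loaded into memory (running):
ollama ps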
AI Frontend using Next.js 15 and React 19
In this chapter we will install Next.js 15 and React 19 to create a page for our React 19 chat bot components.
Prerequisites
This tutorial assumes that you already have a development environment set up:
- you will need to have git installed
- an IDE like VSCode. We will use the VSCode terminal but feel free to use your favorite command line tool instead.
- you also need to have Node.js installed, preferably a version >= 22.7.0 as those versions have support for running TypeScript
Next.js 15 installation using create-next-app (CNA)
To install Next.js 15 and React 19 we are going to use create-next-app (CNA)
First you may want to create a GitHub repository for your new project, which I explain in the GitHub post.
Next clone that project locally using VSCode. For more details have a look at the VSCode post
Now that our project is open in VSCode we use the terminal to run the CNA command (to install the latest Next.js canary version using CNA):
npx create-next-app@canary
When you get asked if you agree to install the latest CNA canary, press y to accept
When asked for a name make sure you only enter a dot (.), to tell CNA that we are already in the project folder (and so it won't create one). As the name for the project CNA will use the current folder name; if you later want to change the name, you can do so by editing the package.json in the root of the project and changing the name to something else.
I suggest you answer the few questions like this (or customize the answers to your liking):
Would you like to use TypeScript? **Yes**
Would you like to use ESLint? **Yes**
Would you like to use Tailwind CSS? **No**
Would you like your code inside a `src/` directory? **No**
Would you like to use App Router? (recommended) **Yes**
Would you like to use Turbopack for `next dev`? **Yes**
Would you like to customize the import alias (`@/*` by default)? **Yes**
What import alias would you like configured? **@/\***
Cleanup
Next we do a little bit of cleanup, first we delete the files in /public and then we simplify the layout and homepage like this:
import type { Metadata } from 'next'
import './globals.css'
export const metadata: Metadata = {
title: 'Create Next App',
description: 'Generated by create next app',
}
export default function RootLayout({
children,
}: Readonly<{
children: React.ReactNode
}>) {
return (
<html lang="en">
<body>
{children}
</body>
</html>
)
}
import styles from './page.module.css'
export default function Home() {
return (
<div className={styles.page}>
</div>
)
}
React 19 chat UI components
In this chapter we will focus on getting a working MVP as quickly as possible, so we won't add a styling library to the project, but feel free to improve the UI later, by for example adding Tailwind, ShadCN or MUI (with pigment css) to your stack
Base components
Let's get started by creating a base button component
VSCode can create the folders and file itself in one go. In VSCode right click somewhere in the file explorer and then select New File..., then instead of just the file name enter the full path "components/base/Button.tsx" and press Enter; VSCode will then create the two folders as well as the file for you
Add the following code for our button component:
import { type ButtonHTMLAttributes } from 'react'
export type BaseButtonProps = ButtonHTMLAttributes<HTMLButtonElement>
const BaseButton: React.FC<BaseButtonProps> = (props) => {
const { children, ...rest } = props
return (
<button {...rest}>
{children}
</button>
)
}
export default BaseButton
We will also need an input field component:
import { type InputHTMLAttributes } from 'react'
export type BaseInputProps = InputHTMLAttributes<HTMLInputElement>
const BaseInput: React.FC<BaseInputProps> = (props) => {
return (
<input {...props} />
)
}
export default BaseInput
And finally our third base component will be for the form element:
import { type FormHTMLAttributes } from 'react'
export type BaseFormProps = FormHTMLAttributes<HTMLFormElement>
const BaseForm: React.FC<BaseFormProps> = (props) => {
const { children, ...rest } = props
return (
<form {...rest}>
{children}
</form>
)
}
export default BaseForm
We have created these base components so that we have an abstraction layer between the chat client and any UI library we might add in the future. The same idea is behind re-exporting the type from React: by using a custom type we make sure that in the future we can customize it without having to edit any component that uses the type
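As a (purely hypothetical) example of what this abstraction buys us, here is a sketch of how the Button could later evolve, for instance by adding a custom variant prop to the exported type, without any component that imports BaseButtonProps needing to change:
import { type ButtonHTMLAttributes } from 'react'
// hypothetical future version of components/base/Button.tsx (not needed for this tutorial)
export type BaseButtonProps = ButtonHTMLAttributes<HTMLButtonElement> & {
    variant?: 'primary' | 'secondary'
}
const BaseButton: React.FC<BaseButtonProps> = (props) => {
    const { children, variant = 'primary', ...rest } = props
    // the variant could later get mapped to a css module class or a UI library prop
    return (
        <button data-variant={variant} {...rest}>
            {children}
        </button>
    )
}
export default BaseButton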
Chat bot UI
Now that we have created our 3 base components we can finally build a minimalistic chat client (components/chat/Client.tsx):
'use client'
import Form from '@/components/base/Form'
import Input from '@/components/base/Input'
import Button from '@/components/base/Button'
const ChatClient: React.FC = () => {
const onSubmitHandler = (event: React.FormEvent<HTMLFormElement>) => {
const formData = new FormData(event.currentTarget)
event.preventDefault()
// use the entries (names and values) or values (only values)
Array.from(formData.entries()).forEach(([name, value], index) => {
console.log('name, value, index: ', name, value, index)
})
}
return (
<Form onSubmit={onSubmitHandler}>
<Input name="prompt" />
<Button type="submit">Submit</Button>
</Form>
)
}
export default ChatClient
For now we just added a simple form and a submit handler that will log the content of the formdata
And then we update the code of the page to add our newly created ChatClient UI component:
import styles from './page.module.css'
import ChatClient from '@/components/chat/Client'
export default function Home() {
return (
<div className={styles.page}>
<ChatClient />
</div>
)
}
Line 2: we import our new chat client component
Line 8: we put the ChatClient component into our Next.js page
To test our basic chat client we first need to start the dev server:
npm run dev
And if we then visit the page at http://localhost:3000 we can submit the form to see the formdata console log (in the browser dev tools console tab)
AI client and backend in JavaScript
Next we are going to use the "AI" SDK to transform our simple form into a chat client. Then we will create an API route to pass the questions to ollama and then return a stream of responses.
One drawback of using DeepSeek-R1 is that it has no support for tools yet (there is an open issue #9 regarding support for functions and structured outputs).
The tools support in the ai SDK is one of its strengths; if you want support for tools, have a look at the ollama models list and use the tools filter
AI SDK client component
First we install the AI SDK (by Vercel) as well as its React hooks package (which contains the useChat hook we will use in the next step):
npm i ai @ai-sdk/react --save-exact
Next we update our chat client like this:
'use client'
import Form from '@/components/base/Form'
import Input from '@/components/base/Input'
import Button from '@/components/base/Button'
import { useChat } from '@ai-sdk/react'
const ChatClient: React.FC = () => {
const { messages, input, handleInputChange, handleSubmit } = useChat(
{
onError: (error) => {
console.error('useChat onError: ', error)
},
}
)
return (
<>
{messages.map(message => (
<div key={message.id}>
{message.role === 'user' ? 'User: ' : 'AI: '}
{message.content}
</div>
))}
<Form onSubmit={handleSubmit}>
<Input name="prompt" value={input} onChange={handleInputChange} />
<Button type="submit">Submit</Button>
</Form>
</>
)
}
export default ChatClient
Line 6: We first import useChat from the @ai-sdk/react package we just installed and then use it to replace our onSubmitHandler
Lines 10 to 16: useChat gives us two useful values, the messages list (all the messages of the current session) and the current value for the input field, and it gives us two functions: one that will handle input field changes for us and a second one to handle submitting the form
Lines 20 to 25: in the JSX part we add some code to display the chat messages
Lines 27 to 30: we finally set the input field value to the input value from the ai SDK. The handleInputChange function will get triggered on every input field change, and the handleSubmit function when the user clicks the submit button
AI SDK backend (Next.js API route)
The second part of our chat bot is its backend, for which we use a Next.js API route handler
The ai SDK has several official providers that connect to different AI APIs, but as we are running ollama locally we will instead use the ollama (community) provider
First we install the ollama-ai-provider package:
npm i ollama-ai-provider --save-exact
Now that we have the ai SDK as well as the ollama provider we can create a new API route (app/api/chat/route.ts) to handle requests and stream the DeepSeek-R1 answer back to our client:
import { createOllama } from 'ollama-ai-provider'
import { streamText } from 'ai'
import { type CoreMessage } from 'ai'
interface ChatRequest {
messages: {
role: string
content: string
}[]
}
// sets max streaming response time to 30 seconds
export const maxDuration = 30
export async function POST(req: Request) {
const body = await req.json() as ChatRequest
const messages = body.messages as CoreMessage[]
const ollamaProvider = createOllama({
baseURL: 'http://localhost:11434/api',
})
const ollamaModel = ollamaProvider('deepseek-r1:1.5b')
const lastMessage = messages[messages.length - 1].content
const prompt = `You are a helpful AI chat bot.
You should answer the following question: "${lastMessage}"\n\n`
const result = streamText({
model: ollamaModel,
prompt: prompt,
temperature: 0.1,
})
return result.toDataStreamResponse({
getErrorMessage: (error) => {
const message = error instanceof Error ? error.message : (typeof error === 'string' ? error : 'Unknown error')
console.error(message)
return message
},
})
}
Line 2: the ai SDK has a good DX, all you need to do is import the streamText function
Lines 5 to 10: we create an interface for our messages
Lines 20 to 22: we create an ollama provider (the provider will open a connection and handle the communication between the model and our API route), we define the URL to our local ollama and set the port number 11434 as this is the default value (an improvement for later would be to create an env file and add an entry for an Ollama URL variable, see the sketch after this list)
Line 24: we tell the provider which model we want, for this tutorial I used DeepSeek-R1 as I wanted to test its reasoning capabilities, but if you prefer you can of course use another model from the Ollama library
Lines 28 and 29: for the prompt we did nothing fancy, we just tell the AI we want it to act as a helpful AI chat bot and then pass the last message from the user
Lines 31 to 35: next we call the streamText function from the ai SDK with a few options that specify which model we want to use, what our prompt will be and a temperature
a low temperature will generate uniform answers and a higher temperature (up to 1) will generate more diverse and unpredictable responses
Lines 37 to 43: we start the stream using the toDataStreamResponse function and add a custom callback to catch potential errors, like problems connecting to ollama
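A minimal sketch of what that env file improvement could look like, assuming a hypothetical OLLAMA_BASE_URL entry in a .env.local file (which Next.js loads automatically on the server); the hardcoded URL from above works fine for this tutorial, this is just an illustration:
// .env.local (hypothetical entry)
// OLLAMA_BASE_URL=http://localhost:11434/api
// in app/api/chat/route.ts we could then fall back to the default URL if the variable is not set
const ollamaProvider = createOllama({
    baseURL: process.env.OLLAMA_BASE_URL ?? 'http://localhost:11434/api',
})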
Now we can finally test our chat client, make sure Ollama and the Next.js dev server are running and then visit http://localhost:3000.
Next ask the AI something, like "Please identify yourself" and then click Submit. You should now see the first reasoning steps getting streamed in. After a while DeepSeek-R1 will stop reasoning and send its final answer.
If Ollama is not running you will get an error like "Cannot connect to API", in which case you should check that Ollama is really running, and if that does not fix it make sure the port you use in the API route is really the port on which Ollama is listening.
Setup PostgreSQL and pgvector
In this chapter we will install PostgreSQL on our local machine and then add the pgvector plugin. pgvector turns PostgreSQL into a vector store that lets you store and search through AI embeddings (vectors). It might not be the fastest solution, but it has the advantage that we will only need one database for both our app data as well as the vectors.
An alternative to installing PostgreSQL and pgvector manually is to download and use the pgvector docker container.
PostgreSQL installation
Windows: download the PostgreSQL Windows installer
macOS: using the brew command:
brew install postgresql@17
Linux: PostgreSQL packages are available for several Linux distributions:
For example on Ubuntu/Debian you can use:
apt install postgresql
Windows: PostgreSQL will run as a service in the background, you can stop / pause / restart the service by using the Windows Services tool (click on Windows start and type "Services" to find it)
pgvector for PostgreSQL Installation
I have added instructions below for Windows, if you use macOS or Linux I recommend you follow the installation instructions from the pgvector readme.
On Windows:
First we need to make sure that we have the C++ build tools for Visual Studio installed. If you have the build tools or even the full version of Visual Studio installed then you can skip this step, else either install the Visual Studio IDE (which includes the build tools) or install the Build Tools for Visual Studio as a standalone
When it has finished downloading, double click on the Visual Studio Installer to launch it
Next, in the Installed tab you should see either Visual Studio or the standalone Build Tools listed; click on Modify, then locate Desktop development with C++ and check the checkbox in its top right corner
Now you will see the details of what is about to get installed on the left:
Finally click on the Install button (in the bottom left, next to the close button) to apply the changes and install the C++ development tools
To make sure everything we need got installed, make sure the following file exists:
C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Auxiliary\Build\vcvars64.bat
Depending on what version you installed you might have to adjust the path, for example I downloaded version 2022; also, as I downloaded the Visual Studio Build Tools, the path contains "BuildTools", if you downloaded the Visual Studio Community edition then you should have "Community" in your path.
The next step is to open the Command Prompt for VS 2022 (the default Command Prompt will work too) as administrator (right click the name and then "Run as administrator"); you can find it by clicking on start (bottom left Windows icon) and then typing "cmd"
Now that the VS command prompt is open we type the following commands, line by line, just make sure you adjust the versions of the tools:
cd %TEMP%
git clone https://github.com/pgvector/pgvector.git --branch v0.8.0
cd pgvector
"C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Auxiliary\Build\vcvars64.bat"
set "PGROOT=C:\Program Files\PostgreSQL\17"
nmake /F Makefile.win
nmake /F Makefile.win install
What the commands will do:
- first we go into the temp folder
- we use git to clone the pgvector repository (here we install version 0.8.0 which as of Jan. 2025 is the latest, you may want to check out the pgvector tags on GitHub to make sure this is still the latest version)
- then we go into the pgvector folder
- we set up the Visual Studio environment by running a bat file, adjust the path based on what Visual Studio edition or tools you have installed, see the note above
- we set the path to PostgreSQL, make sure to adjust the version to match the one that is installed on your device
- finally we build pgvector and then install it using nmake
Enabling pgvector for a database
Next we need to run a query using the pgAdmin query tool. If you haven't installed pgAdmin yet or want to update it, visit the pgAdmin 4 downloads page
When migrating to a new version of pgAdmin on Windows, you might get this Error:
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process
In which case the solution is to delete everything in the %APPDATA%\pgAdmin folder (to reset pgAdmin) and then start pgAdmin again
Open pgAdmin and connect to your local server
First we create a new database (right click Databases and choose Create), in this tutorial we will name it vector_store, click on Save to create the database
Next we right click on the database name and select Query Tool
Our first query will enable pgvector for the current database (copy the following query into your Query Tool and click the Execute (play) button to run the query):
CREATE EXTENSION vector;
You need to activate the vector extension for each of your databases in which you want to store vectors
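If you are not sure whether the extension got enabled for the database you are connected to, you can (optionally) run this query in the Query Tool; it should return one row containing the installed pgvector version:
SELECT extversion FROM pg_extension WHERE extname = 'vector';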
AI Retrieval-Augmented Generation (RAG)
In this chapter we will use a technique called Retrieval-Augmented Generation (RAG) to provide the AI with information from our local (markdown) documents. This helps the AI answer questions using information from your own documents
To achieve this we will write a script that will convert our content into embeddings and then store them in our vector database.
Embeddings convert text into numbers (vectors) that capture the meaning of a piece of content. Similar words (or chunks) get similar numbers, making it easy to find related content, for example (a small cosine similarity sketch follows this list):
- "cat" → [0.2, 0.5, -0.1]
- "kitten" → [0.3, 0.4, -0.2]
- "vibrator" → [-0.8, 0.1, 0.9]
Finally we will improve our chat API to be able to query our vector database and hopefully find relevant content that we can add to the prompt (a prompt can contain instructions for the LLM to understand the context and what it should do, additional information from sources it did not know about and the initial user question).
Node.js RAG script (in TypeScript)
In this tutorial, we will use a Node.js version with native TypeScript support to run our script(s) (without having to compile them to javascript first).
LangChain.js Installation
Python is usually the language of choice when data engineers create a project, but as a web developer I prefer building prototypes using JavaScript (TypeScript) for the client side components as well as the Node.js backend. So instead of using the Python version of the LangChain framework we will use LangChain.js, which is written in JavaScript (TypeScript)
LangChain is a toolset to build RAG pipelines and other workflows powered by an AI model
npm i langchain --save-exact
Data loader
We create a new file scripts/embeddings.mts and then add the first feature of our RAG pipeline (the file extension of our script is .mts (TypeScript module), but .ts would work too):
import { DirectoryLoader } from 'langchain/document_loaders/fs/directory'
import { TextLoader } from 'langchain/document_loaders/fs/text'
import path from 'node:path'
async function loadDocuments(documentsPath: string) {
const loader = new DirectoryLoader(
path.join(process.cwd(), documentsPath),
{
'.md': filePath => new TextLoader(filePath),
'.mdx': filePath => new TextLoader(filePath),
}
)
const docs = await loader.load()
return docs
}
async function main() {
try {
const documentsPath = 'docs'
const documents = await loadDocuments(documentsPath)
console.log('documents loaded: ', documents.length)
} catch (error) {
console.error('Error: ', error)
}
}
await main()
Lines 1 to 3: we import the DirectoryLoader and TextLoader from LangChain.js, we also import the path module from Node.js
Lines 9 to 15: we use the LangChain.js DirectoryLoader to scan a directory for documents and then load each document using the LangChain TextLoader. In this example we scan for both markdown and MDX documents, feel free to adjust the list of file extensions to your needs
Line 24: we set the documents path to docs, as we will download and add some documents to a /docs folder next (an improvement would be to add a variable for the documents path in an .env file)
Lines 25 to 26: we count the number of documents we found and then console log that value
LangChain.js offers a variety of loaders for different file formats
Next we need some documents to test our documents loader. I used the markdown documentation from projects like Next.js app router docs, Node.js API docs and React.dev docs to test how well this RAG prototype works.
Then create a new docs folder and put the markdown documents in it. It doesn't matter if your documents are in sub folders as the directory loader will use a glob library to find all documents
Finally use the following command to run the script (using Node.js):
node --experimental-transform-types scripts/embeddings.mts
To make it easier to call the script in the future we add the command to our package.json scripts:
"scripts": {
"dev": "next dev --turbopack",
"build": "next build",
"start": "next start",
"lint": "next lint",
"embeddings": "node --experimental-transform-types scripts/embeddings.mts"
},
And then we use our new command like this:
npm run embeddings
Split documents into chunks
Next we will use the LangChain.js MarkdownTextSplitter to split markdown (MDX) documents into smaller chunks.
import { DirectoryLoader } from 'langchain/document_loaders/fs/directory'
import { TextLoader } from 'langchain/document_loaders/fs/text'
import path from 'node:path'
import { MarkdownTextSplitter } from 'langchain/text_splitter'
import { type Document } from 'langchain/document'
async function loadDocuments(documentsPath: string) {
const loader = new DirectoryLoader(
path.join(process.cwd(), documentsPath),
{
'.md': filePath => new TextLoader(filePath),
'.mdx': filePath => new TextLoader(filePath),
}
)
const docs = await loader.load()
return docs
}
async function chunkNorris(documents: Document[]) {
// for each document remove the frontmatter
documents.forEach((document) => {
document.pageContent = document.pageContent.replace(/---[\s\S]*?---/, '')
})
const markdownSplitter = new MarkdownTextSplitter({
chunkSize: 1000,
chunkOverlap: 0,
})
// to get an idea what the default separators are
//console.log(MarkdownTextSplitter.getSeparatorsForLanguage('markdown'))
// custom markdown separators
markdownSplitter.separators = [
'\n## ', // h2 headers
'\n### ', // h3 headers
//'\n#{1,6}s', // h1-h6 headers
'\n```', // code blocks
'\n\n', // paragraphs
'\n', // line breaks
]
const chunks = await markdownSplitter.splitDocuments(documents)
return chunks
}
async function main() {
try {
const documentsPath = 'docs'
const documents = await loadDocuments(documentsPath)
console.log('documents loaded: ', documents.length)
const chunks = await chunkNorris(documents)
console.log('chunks created: ', chunks.length)
} catch (error) {
console.error('Error: ', error)
}
}
await main()
Line 4: we import the LangChain Markdown (text) splitter; LangChain.js also offers a lot of other splitters for different kinds of content
Line 5: we import the Document type from LangChain
Lines 23 to 51: we first remove the frontmatter from the markdown documents, as embedding a block of frontmatter as a chunk is not recommended (it adds very little semantic value)
Next we specify the maximum size for our chunks; the second value tells the splitter how much overlap you want to have between two chunks. Having some overlap can be useful to add context; in this example we set the value to zero to not have any overlap, but feel free to change that number, for example to 100 or 200, to add some overlap
Finding the right way to split documents isn't easy - there's no one-size-fits-all solution. You'll need to experiment with:
- Different chunk sizes
- Amount of overlap between chunks
- Separators fine tuning, what is the best approach when it comes to splitting headings, code blocks, and paragraphs
If chunks are too small, they might miss important context. If too large, they'll include irrelevant information. Technical documentation with code blocks needs a different approach than medical or legal documents
Once your pipeline is complete, come back to optimize these settings for your specific content type
We have a (commented out) console log that prints the default separators the markdown splitter would use; the separators are an important part of the text splitter as they have a big influence on the quality of your chunks
Finally we customize the separators a bit; in our case we create chunks of technical documentation, so splitting by heading and also splitting at the start and end of code blocks makes sense. We chose to only split 2nd and 3rd level headings, but if you prefer you could also split all headings (no matter what level)
Lines 59 and 60: we count the number of chunks we produced and then console log the value
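To get a feeling for how your current chunkSize, chunkOverlap and separator settings perform, you could (optionally) add a small helper like this one to the embeddings script and call it from main() after chunkNorris; it only logs a few statistics about the chunks:
function logChunkStats(chunks: Document[]) {
    const lengths = chunks.map(chunk => chunk.pageContent.length)
    const total = lengths.reduce((sum, length) => sum + length, 0)
    console.log('smallest chunk (characters): ', Math.min(...lengths))
    console.log('largest chunk (characters): ', Math.max(...lengths))
    console.log('average chunk size (characters): ', Math.round(total / lengths.length))
}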
Extracting frontmatter metadata (optional)
In the previous code example we removed the frontmatter from our markdown content, because passing it as a string to the AI would diminish its semantic value.
A better solution would be to pass the frontmatter string to a parser and then use the metadata to improve the information we will give the AI.
To parse the frontmatter you could install a package like js-yaml:
npm i js-yaml --save-exact
and if you use Typescript add the types:
npm i @types/js-yaml --save-exact --save-dev
Then you edit the embeddings.mts file and first add the new import for js-yaml:
import yaml from 'js-yaml'
And then add a custom frontmatter parser to the chunkNorris function:
documents.forEach((document) => {
const frontmatter = document.pageContent.match(/---([\s\S]*?)---/)
if (frontmatter) {
const yamlContent = frontmatter[1]
const frontmatterData = yaml.load(yamlContent)
// we store the frontmatter in the document metadata
// you might prefer to do something more useful with it
// but that would go beyond the scope of this tutorial
document.metadata.frontmatter = frontmatterData
console.log('document.metadata: ', document.metadata)
}
})
// for each document remove the frontmatter
documents.forEach((document) => {
document.pageContent = document.pageContent.replace(/---[\s\S]*?---/, '')
})
Finally create a markdown test document:
---
title: Frontmatter example
keywords: ['DeepSeek-R1', 'JavaScript']
modified: 2025-02-28T23:00:00.000Z
permalink: https://chris.lu/
section: Web development
---
# Hello, World!
Now run the embeddings script (npm run embeddings) and then check what the console log has printed in your terminal
Embeddings Model
In this step we will add the @langchain/ollama dependency, which will allow LangChain.js to access our local AI embeddings model by using the ollama JavaScript (TypeScript) SDK
First we need to install @langchain/ollama:
npm i @langchain/ollama --save-exact
In this tutorial we will use the nomic-embed-text embedding model. You could also re-use DeepSeek-R1 as the embedding model and run some tests to compare the quality of the knowledge that gets returned; if you switch to another embedding model make sure to make the changes in both files (the embeddings script and the API route), then empty the database and regenerate the embeddings
Before we start writing more code we need to pull the nomic model using ollama, use the following command in your terminal:
ollama pull nomic-embed-text:latest
Now we can create a new function that will use @langchain/ollama to create embeddings for each of our chunks:
import { DirectoryLoader } from 'langchain/document_loaders/fs/directory'
import { TextLoader } from 'langchain/document_loaders/fs/text'
import path from 'node:path'
import { MarkdownTextSplitter } from 'langchain/text_splitter'
import { type Document } from 'langchain/document'
import { OllamaEmbeddings, type OllamaEmbeddingsParams } from '@langchain/ollama'
async function loadDocuments(documentsPath: string) {
const loader = new DirectoryLoader(
path.join(process.cwd(), documentsPath),
{
'.md': filePath => new TextLoader(filePath),
'.mdx': filePath => new TextLoader(filePath),
}
)
const docs = await loader.load()
return docs
}
async function chunkNorris(documents: Document[]) {
// for each document remove the frontmatter
documents.forEach((document) => {
document.pageContent = document.pageContent.replace(/---[\s\S]*?---/, '')
})
const markdownSplitter = new MarkdownTextSplitter({
chunkSize: 1000,
chunkOverlap: 0,
})
// to get an idea what the default separators are
//console.log(MarkdownTextSplitter.getSeparatorsForLanguage('markdown'))
// custom markdown separators
markdownSplitter.separators = [
'\n## ', // h2 headers
'\n### ', // h3 headers
//'\n#{1,6}s', // h1-h6 headers
'\n```', // code blocks
'\n\n', // paragraphs
'\n', // line breaks
]
const chunks = await markdownSplitter.splitDocuments(documents)
return chunks
}
async function processChunks(chunks: Document[]) {
const embeddingModel = getEmbeddingModel()
const chunksContent = chunks.map(chunk => chunk.pageContent)
const embeddings = await embeddingModel.embedDocuments(chunksContent)
return embeddings
}
function getEmbeddingModel() {
const options: OllamaEmbeddingsParams = {
// https://ollama.com/library/nomic-embed-text
model: 'nomic-embed-text:latest',
//model: 'deepseek-r1:1.5b',
baseUrl: 'http://localhost:11434',
}
const embeddings = new OllamaEmbeddings(options)
return embeddings
}
async function main() {
try {
const documentsPath = 'docs'
const documents = await loadDocuments(documentsPath)
console.log('documents loaded: ', documents.length)
const chunks = await chunkNorris(documents)
console.log('chunks created: ', chunks.length)
const embeddings = await processChunks(chunks)
console.log('embeddings created: ', embeddings.length)
} catch (error) {
console.error('Error: ', error)
}
}
await main()
Line 6: we import the OllamaEmbeddings package as well as the params type
Lines 54 to 59: to be able to process the chunks we first need an embedding model, then for the pageContent part of each document we create an embedding using the embedding model. An AI embedding is a numerical representation (a vector) of the semantic meaning of our content
Lines 61 to 70: just a small helper function (that we will re-use in a bit) to get a fully configured embedding model using the OllamaEmbeddings function from @langchain/ollama
Lines 80 and 81: we print out the number of embeddings we have generated (it should be equal to the number of chunks)
Storing the vectors
Now that we have the embeddings we need to store them somewhere (so that later we can do similarity queries).
This is why in the previous chapter we installed PostgreSQL with pgvector, so that PostgreSQL becomes a vector store for our embeddings.
To connect to our vector store we need to install two packages. The first is @langchain/community (the LangChain.js "community" package), as it contains the PGVectorStore class which we will use to initialize a connection to our vector store; the second is pg (node-postgres), a well known PostgreSQL client for Node.js:
npm i @langchain/community pg --save-exact
And then the types for pg (node-postgres):
npm i @types/pg --save-exact --save-dev
Next we add a function that uses the LangChain.js PGVectorStore (backed by pgvector):
import { DirectoryLoader } from 'langchain/document_loaders/fs/directory'
import { TextLoader } from 'langchain/document_loaders/fs/text'
import path from 'node:path'
import { MarkdownTextSplitter } from 'langchain/text_splitter'
import { type Document } from 'langchain/document'
import { OllamaEmbeddings, type OllamaEmbeddingsParams } from '@langchain/ollama'
import { PGVectorStore, type PGVectorStoreArgs } from '@langchain/community/vectorstores/pgvector'
import pg, { type PoolConfig, type Pool as PoolType } from 'pg'
async function loadDocuments(documentsPath: string) {
const loader = new DirectoryLoader(
path.join(process.cwd(), documentsPath),
{
'.md': filePath => new TextLoader(filePath),
'.mdx': filePath => new TextLoader(filePath),
}
)
const docs = await loader.load()
return docs
}
async function chunkNorris(documents: Document[]) {
// for each document remove the frontmatter
documents.forEach((document) => {
document.pageContent = document.pageContent.replace(/---[\s\S]*?---/, '')
})
const markdownSplitter = new MarkdownTextSplitter({
chunkSize: 1000,
chunkOverlap: 0,
})
// to get an idea what the default separators are
//console.log(MarkdownTextSplitter.getSeparatorsForLanguage('markdown'))
// custom markdown separators
markdownSplitter.separators = [
'\n## ', // h2 headers
'\n### ', // h3 headers
//'\n#{1,6}s', // h1-h6 headers
'\n```', // code blocks
'\n\n', // paragraphs
'\n', // line breaks
]
const chunks = await markdownSplitter.splitDocuments(documents)
return chunks
}
async function processChunks(chunks: Document[]) {
const embeddingModel = getEmbeddingModel()
const chunksContent = chunks.map(chunk => chunk.pageContent)
const embeddings = await embeddingModel.embedDocuments(chunksContent)
return embeddings
}
function getEmbeddingModel() {
const options: OllamaEmbeddingsParams = {
// https://ollama.com/library/nomic-embed-text
model: 'nomic-embed-text:latest',
//model: 'deepseek-r1:1.5b',
baseUrl: 'http://localhost:11434',
}
const embeddings = new OllamaEmbeddings(options)
return embeddings
}
export function createPool(): PoolType {
const postgresOptions: PoolConfig = {
host: '127.0.0.1',
port: 5432,
user: 'postgres',
password: '123',
database: 'vector_store',
}
const pool = new pg.Pool(postgresOptions)
return pool
}
async function storeVectors(vectors: number[][], chunks: Document[], pgPool: PoolType) {
const embeddings = getEmbeddingModel()
const options: PGVectorStoreArgs = {
pool: pgPool,
tableName: 'vectors',
columns: {
idColumnName: 'id',
vectorColumnName: 'vector',
contentColumnName: 'content',
metadataColumnName: 'metadata',
},
// note to self: supported distance strategies: cosine (default),
// innerProduct, or euclidean
distanceStrategy: 'cosine',
}
const vectorStore = await PGVectorStore.initialize(embeddings, options)
await vectorStore.addVectors(vectors, chunks)
return vectorStore
}
async function endVectorStorePool(vectorStore: PGVectorStore) {
// closes all clients and then releases the pg pool
await vectorStore.end()
}
async function main() {
const pgPool = createPool()
pgPool.on('error', (error) => {
console.error('Unexpected error on idle client', error)
process.exit(-1)
})
try {
const documentsPath = 'docs'
const documents = await loadDocuments(documentsPath)
console.log('documents loaded: ', documents.length)
const chunks = await chunkNorris(documents)
console.log('chunks created: ', chunks.length)
const embeddings = await processChunks(chunks)
console.log('embeddings created: ', embeddings.length)
const vectorStore = await storeVectors(embeddings, chunks, pgPool)
console.log('vectors stored')
await endVectorStorePool(vectorStore)
console.log('postgres pool released')
} catch (error) {
console.error('Error: ', error)
}
}
await main()
Line 7: we import the PGVectorStore class from the LangChain.js community package
Line 8: we import node postgres
Lines 74 to 87: we create a pool of connections using node-postgres
Lines 89 to 107: next we initialize the LangChain.js PGVectorStore. We first pass the PostgreSQL connections pool, then we set a name for our new table and define which columns we want to have in that table. The last configuration option we set is the distance strategy we intend to use later when we do our similarity searches. The most popular strategies are cosine (default), innerProduct and euclidean; in this tutorial we will use cosine (I also did some tests and euclidean performed very well too, but cosine seemed to produce slightly better results; make sure to experiment a bit with different strategies using your own content to determine which one works best for you)
Lines 109 to 112: another small helper function to close the client connections to our PostgreSQL database and then end the pool
Lines 116 to 121: we use our createPool function to get a pool of connections and use the error listener to track any potential errors (if there is an error it means we can't connect to the database, so it's best we exit the script at this point)
Lines 131 to 134: finally we can store our embeddings in our vector store; when that is done we end the connections pool using a built-in function from the LangChain PGVectorStore, which closes all client connections to our database and then releases the node-postgres pool
You can now run the final embeddings script:
npm run embeddings
If you get an ECONNREFUSED error in your terminal, make sure Ollama (and PostgreSQL) are running, and if they are, make sure you are using the correct ports.
If you get the error database "vector_store" does not exist make sure you followed all the steps from the Enabling pgvector for a database chapter
To check if the embeddings got stored successfully, start pgAdmin, right click on the database and choose the query tool, in the query tool run the following query:
SELECT * FROM vectors
Find knowledge (and add it to the prompt)
We have created embeddings and stored them in our vector store, which means we are ready to run some similarity search queries. If the search is successful we will use the retrieved info to improve the instructions we give the AI through our prompt
First we are going to move the code for the PostgreSQL connections pool to a library file (lib/postgres.ts); we could later update the embeddings.mts we just created so that it uses this new lib too:
import pg, { type PoolConfig, type Pool } from 'pg'
export type PoolType = Pool
export function createPool(): PoolType {
const postgresOptions: PoolConfig = {
host: '127.0.0.1',
port: 5432,
user: 'postgres',
password: '123',
database: 'vector_store',
}
const pool = new pg.Pool(postgresOptions)
return pool
}
export async function endPool(pool: PoolType) {
return pool.end()
}
Next we edit our AI backend (the Next.js API route) and modify it like this:
import { createOllama } from 'ollama-ai-provider'
import { streamText, embed } from 'ai'
import { type CoreMessage } from 'ai'
import { createPool, endPool } from '@/lib/postgres'
interface ChatRequest {
messages: {
role: string
content: string
}[]
}
interface EmbeddingsRow {
content: string
metadata: unknown
vector: number[]
}
// sets max streaming response time to 30 seconds
export const maxDuration = 30
const findKnowledge = async (question: string) => {
const pgPool = createPool()
pgPool.on('error', (error) => {
console.error('Error connecting to Postgres:', error)
})
try {
const embedding = await getEmbedding(question)
const embeddingString = `[${embedding.map(e => e.toFixed(6)).join(',')}]`
// good old sql :)
const query = `
SELECT content, metadata, vector <=> $1 AS distance
FROM vectors
ORDER BY distance
LIMIT 5
`
const result = await pgPool.query<EmbeddingsRow>(query, [embeddingString])
if (result.rows.length > 0) {
const knowledge = result.rows.map(row => ({
content: row.content,
metadata: row.metadata
}))
const knowledgeContent = knowledge.map(k => k.content).join(' ')
return knowledgeContent
} else {
return null
}
} catch (error) {
let message = 'Error while finding knowledge'
const errorString = error instanceof Error ? error.message : (typeof error === 'string' ? error : 'Unknown error')
message += errorString !== '' ? ` (${errorString})` : ''
throw new Error(message)
} finally {
await endPool(pgPool)
}
}
const getEmbedding = async (question: string): Promise<number[]> => {
const ollamaProvider = createOllama({
baseURL: 'http://localhost:11434/api'
})
const embeddingModel = ollamaProvider.embedding('nomic-embed-text:latest')
//const embeddingModel = ollamaProvider.embedding('deepseek-r1:1.5b')
try {
const result = await embed({
model: embeddingModel,
value: question,
})
return result.embedding
} catch (error) {
console.error('Error getting embeddings:', error)
throw new Error('Error getting embeddings')
}
}
export async function POST(req: Request) {
const body = await req.json() as ChatRequest
const messages = body.messages as CoreMessage[]
const ollamaProvider = createOllama({
baseURL: 'http://localhost:11434/api',
})
const ollamaModel = ollamaProvider('deepseek-r1:1.5b')
const lastMessage = messages[messages.length - 1].content
let knowledge = null
try {
if (typeof lastMessage !== 'string') {
throw new Error('Message content must be a string')
}
knowledge = await findKnowledge(lastMessage)
} catch (error) {
const errorMessage = error instanceof Error ? error.message : 'unknown error'
return new Response(errorMessage, { status: 500 })
}
let prompt = `You are a helpful AI chat bot.
You should answer the following question: "${lastMessage}"\n\n`
if (typeof knowledge === 'string' && knowledge !== '') {
prompt += `When answering you should use the following knowledge:
${knowledge}\n\n`
}
prompt += 'End every response with the sentence "Happy coding!"'
const result = streamText({
model: ollamaModel,
prompt: prompt,
temperature: 0.1,
})
return result.toDataStreamResponse({
getErrorMessage: (error) => {
const message = error instanceof Error ? error.message : (typeof error === 'string' ? error : 'Unknown error')
console.error(message)
return message
},
})
}
Line 2: we add the embed function to imports from the ai SDK
Line 4: we import our new postgres lib
Lines 13 to 17: we add a new interface for our embeddings table rows
Lines 22 to 69: we have added the findKnowledge function, which will first establish a connection to the database and then run a query that uses the cosine distance to find content that is similar to the question of the user. When using the cosine distance the score ranges from 0 to 2, where 0 means the vectors are identical, 1 means they are not similar and 2 means they are complete opposites (so a lower score is better). We turned both our content and the user question into embeddings, and by using the cosine distance we calculate a score that represents how far apart the question and the content are. If the distance is big the content is likely NOT similar, and if the distance is small there is a high chance that the content is similar
Lines 71 to 92: the getEmbedding function is similar to what we did in the embeddings script earlier; instead of converting chunks to embeddings it turns the question into an embedding
Lines 108 to 116: we call our findKnowledge function and add some basic error handling
Lines 121 to 126: we add the knowledge we found to the prompt and tell the AI that we want it to use that knowledge when creating a response to the user's message
Cosine Similarity vs. Cosine Distance vs. Euclidean Distance in pgvector
In the previous chapter we used the cosine distance, but pgvector supports several distance functions
You could experiment with other distance functions, for example by replacing the query using the cosine distance with a query using the euclidean distance (also known as L2 distance):
const query = `
SELECT content, metadata
FROM vectors
ORDER BY vector <-> $1
LIMIT 5
`
The score for euclidean distance has no upper bound; it starts at zero when two vectors are identical and increases the further the vectors are apart (lower is better)
Or you could switch to using the cosine similarity (cosine similarity and cosine distance are inversely related):
const query = `
SELECT content, metadata, 1 - (vector <=> $1) AS similarity
FROM vectors
ORDER BY similarity DESC
LIMIT 5
`
When using cosine similarity you will get a score that ranges from -1 to 1 where 1 is a perfect score, 0 means not similar and -1 means they are the complete opposite (higher is better)
The results will be different but not automatically better, which means you will have to run a number of experiments to be able to determine which one works best for your content
pgvector indexes have some requirements / limitations: they only work with queries that are ordered ascending (ASC works, DESC does not, so you probably want to avoid using cosine similarity and instead use the cosine distance) and you also need to set a LIMIT in your queries
We have not added an index in this tutorial, but I recommend you do this as a next step as it will improve the performance of your similarity queries a lot
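As a hedged example (not part of the tutorial code): if you stay with the cosine distance operator (<=>) you could create an HNSW index on the vector column, for instance by running a query like this one from the embeddings script after the vectors have been stored. Note that pgvector can only index a vector column with a fixed number of dimensions, so you may first need to pass the dimensions option to PGVectorStore (for nomic-embed-text that would be 768):
// optional: create an HNSW index using the cosine operator class, so that
// "ORDER BY vector <=> $1 ... LIMIT n" queries can make use of the index
await pgPool.query(
    'CREATE INDEX IF NOT EXISTS vectors_vector_hnsw_idx ON vectors USING hnsw (vector vector_cosine_ops)'
)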
Now make sure the Next.js dev server is running and then open localhost:3000 in your browser to ask the AI some questions about things it can only know because you have added that knowledge to the prompt
Congratulations 🎉 you just built a chatbot and a RAG pipeline in JavaScript (TypeScript) that run on your own hardware
What's next?
You could improve the chatbot itself, by adding more options to your UI. For example you could add a select box to the chatbot UI that lets the user choose from a list of options, then you use that information to search for documents that have the same value in their frontmatter metadata.
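A hypothetical sketch of what such a metadata filter could look like in the findKnowledge query, assuming you stored the frontmatter in the (jsonb) metadata column like we did in the embeddings script and that the value of the select box gets sent along with the chat request (selectedSection is a made-up variable name):
const query = `
    SELECT content, metadata, vector <=> $1 AS distance
    FROM vectors
    WHERE metadata->'frontmatter'->>'section' = $2
    ORDER BY distance
    LIMIT 5
`
const result = await pgPool.query<EmbeddingsRow>(query, [embeddingString, selectedSection])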
Having a working prototype is a great first step but as we saw earlier in the tutorial, how you split the documents can have a big impact on the quality of the knowledge you will be able to retrieve later, so in my opinion this is where you should start applying optimizations. You could also install other embedding models and compare the quality of the embeddings they produce.
Another area that needs some improvement is the part that uses the user message to find relevant content. In this tutorial we only use the last message, but what if the user asks a question by entering multiple messages that are related to each other? There is also a lot of room for improvement regarding how you analyze the words that are in the user message; you could add sentiment analysis to find out if the user is frustrated, confused or hopefully satisfied and then tell the AI in the prompt to respond accordingly.
Or you could improve the prompt by providing the AI with more or better quality knowledge. Building a RAG does not mean you have to rely solely on similarity searches. For example, you could also query additional sources, like a third party API, use a local search engine like Elasticsearch, or scrape data from the web.
Later, when improving the MVP to build something that can go into production, you will have to think about how to scale your project. The RAG pipeline could get improved by first making a list of all the documents that need to get parsed and then storing that list in a database. Next you could use that list to build a queue of tasks where each task consists of calling a serverless instance that will load a single document, split it into chunks, create the embeddings and then store them using a serverless PostgreSQL service with support for pgvector.
Finally, what is very important is that you come up with a framework to measure the quality of the responses the AI gives you, so that you can constantly verify whether the changes you just made have improved or maybe decreased the quality of the answers the AI can produce.