In Part 1, we explored the theoretical foundations of RAG systems. Now, let’s build a working implementation! We’ll create a RAG system that answers questions about recent Wikipedia articles, showing how LLMs can work with fresh data.
Table of Contents
1. System Architecture
2. Building the Indexing Pipeline
3. Building the Query Pipeline
4. Best Practices and Optimization
5. Common Issues and Solutions
6. What’s Next in the Series?
1. System Architecture
A RAG system operates in two distinct phases: indexing and querying. Understanding these phases is crucial for both development and production deployment.
1. Indexing Phase
The indexing phase prepares your documents for efficient retrieval. During this phase, the system:
- Fetches and processes documents from your data sources
- Splits them into optimal chunks for context retention
- Generates vector embeddings for semantic understanding
- Stores them in a vector database for quick access
In production environments, this phase typically runs as a background process. You might run it on a fixed schedule or trigger it whenever source documents update. This keeps your knowledge base current without impacting real-time query performance.
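For example, here is a minimal sketch of a nightly rebuild. It assumes the node-cron package (not one of this tutorial’s dependencies) and simply re-runs the build-index script we create in section 2.4:
import cron from 'node-cron';
import { execFile } from 'node:child_process';

// Hypothetical sketch: re-run the indexing script every night at 02:00
// so queries always work against reasonably fresh data.
cron.schedule('0 2 * * *', () => {
  execFile('node', ['scripts/build-index.js'], (error) => {
    if (error) console.error('Re-index failed:', error);
    else console.log('Knowledge base re-indexed');
  });
});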
2. Query Phase
The query phase handles real-time user interactions. When a query comes in:
- The system processes the user’s question
- Retrieves the most semantically relevant chunks
- Uses an LLM to generate a coherent response
This process happens in real-time as part of your application’s request-response cycle, typically completing within seconds.
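Condensed into code, the two phases look roughly like this. This is only a sketch, using the helper functions we build throughout the rest of this tutorial:
// Indexing phase (background): fetch, chunk, embed, persist
const documents = await loadDocuments();               // section 2.3
const vectorStore = await setupVectorStore(documents); // section 2.4
await saveVectorStore();

// Query phase (real-time): retrieve relevant chunks and generate an answer
const { answer } = await queryDocuments('What happened recently?'); // section 3.1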
2. Building the Indexing Pipeline
2.1 Project Setup
We’ll use Next.js as our application framework and LangChain as our RAG toolkit. This combination provides a robust foundation for building production-ready RAG systems.
npx create-next-app@latest rag-tutorial --javascript --tailwind --eslint
cd rag-tutorial
npm install @langchain/openai @langchain/community langchain hnswlib-node
Environment Setup
Proper environment configuration is essential in both development and production. We need to manage API keys and other sensitive settings securely.
npm install dotenv
Create your environment files:
# .env (for development)
OPENAI_API_KEY=your_api_key_here
# .env.local (for Next.js)
OPENAI_API_KEY=your_api_key_here
Important: These files contain sensitive credentials. Add them to your .gitignore to prevent accidental commits.
For scripts outside Next.js, import the environment configuration:
import 'dotenv/config';
2.2 Getting Our Data
We’re using Wikipedia as our data source for several practical reasons:
- The API provides clean, structured content
- Articles are regularly updated
- Content is well-formatted and requires minimal preprocessing
- Perfect for demonstrating how RAG handles current events
Let’s create a simple script to fetch and prepare data from Wikipedia. This is a great example of how RAG systems can work with current information that exists outside of the LLM’s training cutoff date. Create a new file scripts/fetch-articles.js:
import { writeFileSync } from 'fs';

async function fetchArticles() {
  // Wikipedia API endpoint for full article content
  const url = 'https://en.wikipedia.org/w/api.php?action=query&format=json&prop=extracts&titles=Donald_Trump&explaintext=1&origin=*';

  try {
    const response = await fetch(url);
    const data = await response.json();

    // Extract the page content (need to get the first page from pages object)
    const pageId = Object.keys(data.query.pages)[0];
    const article = {
      title: data.query.pages[pageId].title,
      content: data.query.pages[pageId].extract
    };

    // Save to file
    writeFileSync(
      './data/article-1.txt',
      `${article.title}\n\n${article.content}`
    );

    console.log('Article saved successfully');
  } catch (error) {
    console.error('Error fetching article:', error);
  }
}

fetchArticles();
Create a data directory and run the script:
mkdir data
node scripts/fetch-articles.js
Output:
Article saved successfully
Important: Since we’re using ES modules outside of Next.js, add "type": "module" to your package.json file:
{
  "name": "rag-tutorial",
  "type": "module",
  ...
}
Why Wikipedia?
While we could fetch data from various news sources, Wikipedia offers several advantages for this tutorial:
- Reliable API: Wikipedia provides a stable, well-documented API that’s easy to work with
- Structured Data: The content is already well-formatted and cleaned
- Current Events: Wikipedia pages are regularly updated with recent information
- Educational Purpose: Using Donald Trump’s page as an example helps demonstrate how RAG systems can provide information about recent events that occurred after the LLM’s training cutoff date
2.3 Document Processing
The document processing step is critical for retrieval quality. We implement intelligent chunking that:
- Maintains semantic coherence with 1000-character chunks
- Preserves context through 200-character overlaps
- Retains metadata for traceability
- Uses natural break points like paragraphs
First, let’s create a robust document loader that handles chunking and metadata. Create lib/document-loader.js:
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
import * as fs from 'fs';

export async function loadDocuments() {
  const text = fs.readFileSync('./data/article-1.txt', 'utf8');

  const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 1000,
    chunkOverlap: 200,
    separators: ['\n\n', '\n', '. ', ' ', '']
  });

  const documents = await splitter.createDocuments(
    [text],
    [{
      source: 'wikipedia',
      date: new Date().toISOString(),
      type: 'article'
    }]
  );

  console.log(`Split into ${documents.length} chunks`);
  return documents;
}

export function analyzeChunks(documents) {
  return documents.map((doc, index) => ({
    chunkIndex: index,
    length: doc.pageContent.length,
    completeSentences: doc.pageContent.split('. ').length,
    text: doc.pageContent,
    metadata: doc.metadata
  }));
}
Chunk Quality Note: Analyzing your chunks helps identify potential issues like:
- Chunks that are too short or too long
- Chunks with incomplete sentences
- Uneven chunk distribution
Poor chunking can lead to lower quality responses, so it’s worth monitoring during development.
Let’s analyze our chunks. Create scripts/analyze-chunks.js:
import { loadDocuments, analyzeChunks } from '../lib/document-loader.js';

async function analyzeDocument() {
  const documents = await loadDocuments();
  const analysis = analyzeChunks(documents);

  console.log(`\nDocument Analysis Results:`);
  console.log(`Total chunks: ${documents.length}`);

  // Add some statistics
  const avgLength = Math.round(
    analysis.reduce((sum, chunk) => sum + chunk.length, 0) / analysis.length
  );
  const avgSentences = Math.round(
    analysis.reduce((sum, chunk) => sum + chunk.completeSentences, 0) / analysis.length
  );

  console.log(`\nStatistics:`);
  console.log(`- Average chunk length: ${avgLength} characters`);
  console.log(`- Average sentences per chunk: ${avgSentences}`);

  console.log('\nSample chunks:');
  analysis.slice(0, 3).forEach(chunk => {
    console.log(`\nChunk ${chunk.chunkIndex}:`);
    console.log(`- Length: ${chunk.length} characters`);
    console.log(`- Sentences: ${chunk.completeSentences}`);
    console.log(`- Preview: "${chunk.text?.substring(0, 100)}..."`);
    console.log('---');
  });
}

analyzeDocument().catch(console.error);
Run the analysis:
node scripts/analyze-chunks.js
Example output:
Split into 165 chunks
Document Analysis Results:
Total chunks: 165
Statistics:
- Average chunk length: 624 characters
- Average sentences per chunk: 5
Sample chunks:
Chunk 0:
- Length: 12 characters
- Sentences: 1
- Preview: "Donald Trump..."
---
Chunk 1:
- Length: 774 characters
- Sentences: 6
- Preview: "Donald John Trump (born June 14, 1946) is an American politician, media personality, and businessman..."
---
Chunk 2:
- Length: 871 characters
- Sentences: 9
- Preview: "presidential campaign.
Trump won the 2016 presidential election. His election and policies sparked n..."
---
2.4 Vector Store Management
The vector store serves as our primary retrieval mechanism. It:
- Stores document embeddings efficiently
- Enables semantic similarity search
- Persists data between application restarts
- Supports incremental updates
Create lib/storage.js to handle vector store operations:
import { OpenAIEmbeddings } from "@langchain/openai";
import { HNSWLib } from "@langchain/community/vectorstores/hnswlib";

let vectorStore = null;

export async function setupVectorStore(documents) {
  if (vectorStore) return vectorStore;

  const embeddings = new OpenAIEmbeddings();

  // HNSWLib parameters explanation:
  // - Space: 'cosine' for cosine similarity
  // - numDimensions: 1536 for OpenAI embeddings
  vectorStore = await HNSWLib.fromDocuments(documents, embeddings);
  return vectorStore;
}

export async function saveVectorStore(directory = "./data") {
  if (!vectorStore) throw new Error("Vector store not initialized");
  await vectorStore.save(directory);
}

export async function loadVectorStore(directory = "./data") {
  const embeddings = new OpenAIEmbeddings();
  vectorStore = await HNSWLib.load(directory, embeddings);
  return vectorStore;
}

// Helper function to manage vector store lifecycle
export async function getOrCreateVectorStore(documents) {
  try {
    // Try to load existing vector store
    return await loadVectorStore();
  } catch (error) {
    // Create new vector store if loading fails
    console.log("Creating new vector store...");
    const store = await setupVectorStore(documents);
    await saveVectorStore();
    return store;
  }
}
Create a script to build our index, scripts/build-index.js:
import { loadDocuments } from '../lib/document-loader.js';
import { setupVectorStore, saveVectorStore } from '../lib/storage.js';

async function buildIndex() {
  console.log('Loading documents...');
  const documents = await loadDocuments();

  console.log('Building vector store...');
  await setupVectorStore(documents);

  // Persist the index to ./data so the query pipeline can load it later
  await saveVectorStore();

  console.log('Index built successfully!');
}

buildIndex().catch(console.error);
Run it:
node scripts/build-index.js
Output:
Loading documents...
Split into 165 chunks
Building vector store...
Index built successfully!
3. Building the Query Pipeline
3.1 Query Engine
The query engine coordinates the retrieval and response generation process. It:
- Manages the retrieval chain
- Handles prompt templates
- Orchestrates the LLM interaction
- Formats the final response
Create lib/prompts.js for our prompt templates:
import { PromptTemplate } from "@langchain/core/prompts";

export const PROMPT_TEMPLATES = {
  default: `Answer the question based only on the following context:
{context}
Question: {question}
Answer:`,

  detailed: `You are analyzing Wikipedia content to provide accurate information.
Context information:
-------------------
{context}
-------------------
Please provide a detailed answer that:
1. Directly answers the question
2. Notes dates and timeline information
3. Expresses uncertainty when appropriate
4. Mentions if events are recent/ongoing
Question: {question}
Detailed answer:`,
};

export function createPrompt(type = 'default') {
  return PromptTemplate.fromTemplate(PROMPT_TEMPLATES[type]);
}
Create lib/query-engine.js:
import { ChatOpenAI } from "@langchain/openai";
import { StringOutputParser } from "@langchain/core/output_parsers";
import { RunnableSequence, RunnablePassthrough } from "@langchain/core/runnables";
import { formatDocumentsAsString } from "langchain/util/document";
import { createPrompt } from './prompts.js';
import { loadDocuments } from './document-loader.js';
import { getOrCreateVectorStore } from './storage.js';

let chain = null;

export async function initializeChain(promptType = 'default') {
  if (chain) return chain;

  const documents = await loadDocuments();
  const vectorStore = await getOrCreateVectorStore(documents);
  const retriever = vectorStore.asRetriever();

  // Temperature of 0.7 balances creativity and accuracy:
  // - 0.0: Most deterministic, best for fact retrieval
  // - 0.7: Good balance for natural responses while maintaining accuracy
  // - 1.0: Maximum creativity, may diverge from context
  const model = new ChatOpenAI({
    modelName: 'gpt-3.5-turbo',
    temperature: 0.7,
  });

  const prompt = createPrompt(promptType);

  chain = RunnableSequence.from([
    {
      context: retriever.pipe(formatDocumentsAsString),
      question: new RunnablePassthrough(),
    },
    prompt,
    model,
    new StringOutputParser(),
  ]);

  return chain;
}

export async function queryDocuments(question, options = {}) {
  const queryChain = await initializeChain(options.promptType);

  try {
    const response = await queryChain.invoke(question);
    return {
      answer: response,
      metadata: {
        promptType: options.promptType || 'default',
        timestamp: new Date().toISOString()
      }
    };
  } catch (error) {
    console.error('Query failed:', error);

    // Handle specific error cases
    if (error.message.includes('rate limit')) {
      throw new Error('Rate limit exceeded. Please try again later.');
    }
    if (error.message.includes('maximum context length')) {
      throw new Error('Question too complex. Please try a simpler query.');
    }

    throw error;
  }
}
3.2 Testing the System
Create scripts/test-queries.js:
import { queryDocuments } from '../lib/query-engine.js';

async function testQueries() {
  const questions = [
    "What are Donald Trump's most recent political activities?",
    "What legal challenges is Trump currently facing?",
    "Who won and who was the runner up in the 2024 Republican primaries?"
  ];

  console.log('Testing RAG system with sample queries...\n');

  for (const question of questions) {
    console.log(`Question: ${question}`);
    console.log('Thinking...');

    const startTime = Date.now();
    const answer = await queryDocuments(question);
    const duration = Date.now() - startTime;

    console.log('\nAnswer:', answer);
    console.log(`Duration: ${duration}ms\n`);
    console.log('-'.repeat(50), '\n');
  }
}

testQueries().catch(console.error);
Run the tests:
node scripts/test-queries.js
Example output:
Testing RAG system with sample queries...
Question: What are Donald Trump's most recent political activities?
Thinking...
Split into 165 chunks
Answer: {
answer: 'In the 2022 midterm elections, Donald Trump endorsed over 200 candidates for various offices, most of whom supported his false claim that the 2020 presidential election was stolen from him. He continued fundraising, raised more than twice as much as the Republican Party itself, and focused on how elections are run and on ousting election officials who had resisted his attempts to overturn the 2020 election results.',
metadata: { promptType: 'default', timestamp: '2024-11-25T21:12:02.827Z' }
}
Duration: 2170ms
--------------------------------------------------
Question: What legal challenges is Trump currently facing?
Thinking...
Answer: {
answer: 'Trump is currently facing a civil fraud case filed by the attorney general of New York, in which he was found liable, ordered to pay a penalty of over $450 million, and barred from serving as an officer or director of any New York corporation or legal entity for three years. Additionally, he is facing federal charges related to national defense information under the Espionage Act, making false statements, conspiracy to obstruct justice, and other charges.',
metadata: { promptType: 'default', timestamp: '2024-11-25T21:12:04.824Z' }
}
Duration: 1995ms
--------------------------------------------------
Question: Who won and who was the runner up in the 2024 Republican primaries?
Thinking...
Answer: {
answer: 'Donald Trump won the 2024 Republican primaries and Kamala Harris was the runner up.',
metadata: { promptType: 'default', timestamp: '2024-11-25T21:12:05.914Z' }
}
Duration: 1090ms
--------------------------------------------------
4. Best Practices and Optimization
Based on our implementation, here are key practices we’ve followed:
- Chunk Size Optimization
  - We used 1000-character chunks with a 200-character overlap
  - This balances context retention with processing efficiency
  - Larger chunks would provide more context but increase API costs
  - Smaller chunks might miss important relationships
- Prompt Engineering
  - Our prompt templates restrict the model to the retrieved context
  - The detailed template asks the model to note dates, express uncertainty, and flag recent or ongoing events
  - The temperature of 0.7 balances accuracy with natural responses
- Error Handling
  - Each component has proper error handling
  - The system gracefully handles missing or invalid data
  - API rate limits and timeouts are considered (a timeout wrapper is sketched after this list)
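Rate-limit retries are shown in section 5; timeouts are not handled anywhere in the tutorial code, so here is a minimal sketch using Promise.race (the 15-second limit is an arbitrary choice):
import { queryDocuments } from '../lib/query-engine.js';

// Sketch: reject if a query takes longer than `ms` milliseconds
const withTimeout = (promise, ms = 15000) =>
  Promise.race([
    promise,
    new Promise((_, reject) =>
      setTimeout(() => reject(new Error(`Query timed out after ${ms}ms`)), ms)
    ),
  ]);

const result = await withTimeout(queryDocuments('What happened in 2024?'));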
5. Common Issues and Solutions
When implementing this RAG system, you might encounter these common challenges:
1. OpenAI API Rate Limits
Error: Rate limit exceeded. Please try again later.
Solution: Implement exponential backoff or use token bucket rate limiting:
const wait = (ms) => new Promise(resolve => setTimeout(resolve, ms));

const queryWithRetry = async (question, retries = 3) => {
  for (let i = 0; i < retries; i++) {
    try {
      return await queryDocuments(question);
    } catch (error) {
      if (error.message.includes('rate limit') && i < retries - 1) {
        await wait(Math.pow(2, i) * 1000);
        continue;
      }
      throw error;
    }
  }
};
2. Vector Store Persistence
If you see Error: Vector store not initialized, check the following (a quick persistence check is sketched after this list):
- Directory permissions
- File corruption in the ./data directory
- Sufficient disk space
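You can also confirm that an index was actually persisted before calling loadVectorStore(). A minimal sketch; it assumes HNSWLib.save() writes an hnswlib.index file into the directory, so verify the file names your LangChain version actually produces:
import { existsSync } from 'fs';
import { join } from 'path';

// Sketch: does a persisted index appear to exist in the save directory?
// Assumption: HNSWLib.save() writes an 'hnswlib.index' file there.
function hasPersistedIndex(directory = './data') {
  return existsSync(join(directory, 'hnswlib.index'));
}

if (!hasPersistedIndex()) {
  console.warn('No persisted vector store found; run scripts/build-index.js first');
}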
3. Large Documents
For maximum context length errors:
- Reduce chunk size (try 500 characters)
- Decrease chunk overlap
- Use more selective retrieval (see the sketch after this list)
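For example, in lib/query-engine.js you could cap how many chunks the retriever returns, and in lib/document-loader.js shrink the chunks themselves. This is only a sketch; the values are arbitrary starting points:
// In lib/query-engine.js: retrieve only the 2 most similar chunks
// so the assembled prompt stays well under the model's context window
const retriever = vectorStore.asRetriever(2);

// In lib/document-loader.js: smaller chunks also mean shorter prompts
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 500,
  chunkOverlap: 100,
});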
4. Memory Issues
If processing large articles:
- Process documents in batches (a sketch follows this list)
- Implement streaming for large files
- Monitor memory usage during embedding generation
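Here is a sketch of an incremental build that could stand in for setupVectorStore in lib/storage.js; the batch size of 50 is arbitrary:
import { OpenAIEmbeddings } from "@langchain/openai";
import { HNSWLib } from "@langchain/community/vectorstores/hnswlib";

// Sketch: embed and index chunks in batches to bound memory use
const BATCH_SIZE = 50;

export async function setupVectorStoreInBatches(documents) {
  const embeddings = new OpenAIEmbeddings();

  // Initialize the store with the first batch...
  const store = await HNSWLib.fromDocuments(documents.slice(0, BATCH_SIZE), embeddings);

  // ...then add the remaining chunks batch by batch
  for (let i = BATCH_SIZE; i < documents.length; i += BATCH_SIZE) {
    await store.addDocuments(documents.slice(i, i + BATCH_SIZE));
  }

  return store;
}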
6. What’s Next in the Series?
This tutorial got you started with a working RAG system that can process Wikipedia articles and answer questions about them. In the upcoming parts of this series, we’ll expand this foundation:
Part 3: Advanced RAG Features
- Hybrid search implementations
- Response streaming
- Advanced caching strategies
- Multi-document querying
- Cost optimization techniques
Part 4: Building the Web Interface
- Creating a Next.js API endpoint around the query engine we built
- Implementing real-time streaming responses
- Building an interactive UI with:
- Query input
- Response streaming display
- Source attribution
- Confidence scoring
- Adding proper error handling and loading states
Try It Now
While waiting for the next parts, you can experiment with this implementation:
- Try different chunk sizes and overlap settings
- Test with other Wikipedia articles (a parameterized fetch script is sketched below)
- Modify the prompt templates
- Add more error handling cases
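For instance, here is a hypothetical variant of the fetch script that takes an article title from the command line; you would also point loadDocuments at the new file:
// Sketch: node scripts/fetch-article.js "Artificial intelligence"
import { writeFileSync } from 'fs';

async function fetchArticle(title) {
  const url = `https://en.wikipedia.org/w/api.php?action=query&format=json&prop=extracts&titles=${encodeURIComponent(title)}&explaintext=1&origin=*`;
  const response = await fetch(url);
  const data = await response.json();

  const pageId = Object.keys(data.query.pages)[0];
  const page = data.query.pages[pageId];

  // Save under a filename derived from the title
  writeFileSync(`./data/${title.replace(/\s+/g, '_')}.txt`, `${page.title}\n\n${page.extract}`);
  console.log(`Saved "${page.title}"`);
}

fetchArticle(process.argv[2] ?? 'Donald Trump');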