RAG Systems Deep Dive Part 3: Advanced Features and Performance Optimization

In Part 2, we built a functional RAG system using LangChain to answer questions about Wikipedia articles. Now, let’s enhance our implementation with advanced features, better retrieval, and production-ready optimizations.

Table of Contents

  1. System Architecture Overview
  2. Enhancing Retrieval Quality
  3. Production Optimizations
  4. Best Practices and Common Challenges
  5. Next Steps

1. System Architecture Overview

Our enhanced RAG system builds on Part 2’s foundation with three major improvements:

  1. Hybrid Retrieval: Combines vector and keyword search for better accuracy
  2. Streaming Responses: Provides real-time feedback for long responses
  3. Production Optimizations: Adds caching, monitoring, and error handling

[Diagram showing component interaction]

Your project structure will look like this:
```text
rag-tutorial/
├── lib/
│   ├── hybrid-retriever.js   # Enhanced retrieval
│   ├── enhanced-query.js     # Query processing
│   ├── cache.js              # Response caching
│   ├── metrics.js            # Performance monitoring
│   └── optimized-engine.js   # Production engine
├── scripts/
│   ├── test-hybrid.js        # Hybrid retrieval test
│   └── test-streaming.js     # Streaming test
└── package.json
```

2. Enhancing Retrieval Quality

Let’s start by improving how our system finds relevant information. Our Part 2 implementation used basic vector similarity search, but we can do better.

2.1 Hybrid Search Implementation

Vector search, while powerful for semantic understanding, has limitations. When you search “Who won the 2024 election?”, a pure vector search handles paraphrases like “victory” or “defeat” well, but it may rank a chunk containing the literal phrase “2024 election” below loosely related text. In other words, vector search can miss exact phrase matches that would be obvious to humans.

Hybrid search addresses these limitations by combining two approaches:

  1. Vector (Semantic) Search:

    • Uses embeddings to understand meaning
    • Great for conceptual matching
    • Handles paraphrasing and synonyms
    • Can miss exact matches
  2. Keyword (Lexical) Search:

    • Uses traditional text matching (TF-IDF)
    • Excellent for exact matches
    • Handles specific terms and phrases
    • Misses semantic relationships

By combining both vector and keyword search, we get the best of both worlds: semantic understanding and precise matching. While the Python version of LangChain includes TF-IDF (Term Frequency-Inverse Document Frequency) for sophisticated keyword matching, in JavaScript we’ll implement a simpler but effective text matching approach. This demonstrates the core concepts while remaining practical for production use.
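
For intuition, TF-IDF scores a term highly when it appears often in a document but rarely across the corpus. Here is a minimal illustrative sketch of that calculation; it is not part of our retriever and the function name is our own:

```javascript
// Illustrative only: tfIdf(term, doc) = termFrequency * log(totalDocs / docsContainingTerm)
// Assumes `term` is already lowercased.
function tfIdf(term, doc, corpus) {
  const occurrences = (text) =>
    text.toLowerCase().split(/\s+/).filter(word => word === term).length;

  const tf = occurrences(doc);                               // term count in this document
  const df = corpus.filter(d => occurrences(d) > 0).length;  // documents containing the term
  if (tf === 0 || df === 0) return 0;

  return tf * Math.log(corpus.length / df);                  // rare terms weigh more
}
```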

Our implementation focuses on:

  • Parallel execution of both search types
  • Smart result combination
  • Configurable weighting
  • Clean interface integration

Create lib/hybrid-retriever.js:

```javascript
import { MemoryVectorStore } from "langchain/vectorstores/memory";

export class HybridRetriever {
  constructor(documents, embeddings, options = {}) {
    this.vectorStore = new MemoryVectorStore(embeddings);
    this.documents = documents;
    // Weights and minScore are configurable for future score blending;
    // the combiner in getRelevantDocuments currently just prioritizes vector results.
    this.options = {
      vectorWeight: options.vectorWeight || 0.7,
      keywordWeight: options.keywordWeight || 0.3,
      minScore: options.minScore || 0.3
    };
  }

  static async fromDocuments(documents, embeddings) {
    const hybrid = new HybridRetriever(documents, embeddings);
    await hybrid.vectorStore.addDocuments(documents);
    return hybrid;
  }

  // Simple keyword search: rank documents by raw term-occurrence counts
  keywordSearch(query, documents, k = 2) {
    const queryTerms = query.toLowerCase().split(/\s+/).filter(Boolean);

    // Score documents based on term matches
    const scoredDocs = documents.map(doc => {
      const content = doc.pageContent.toLowerCase();
      const score = queryTerms.reduce((acc, term) => {
        // Escape regex metacharacters so terms like "u.s." match literally
        const safeTerm = term.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
        const exactMatches = (content.match(new RegExp(safeTerm, 'g')) || []).length;
        return acc + exactMatches;
      }, 0);

      return { document: doc, score };
    });

    // Return top k results
    return scoredDocs
      .filter(doc => doc.score > 0)
      .sort((a, b) => b.score - a.score)
      .slice(0, k)
      .map(doc => ({
        ...doc.document,
        metadata: { ...doc.document.metadata, score: 0.8, source: 'keyword' }
      }));
  }

  async getRelevantDocuments(query, options = { vectorK: 3, keywordK: 2 }) {
    // Get results from both methods
    const [vectorDocs, keywordDocs] = await Promise.all([
      this.vectorStore.similaritySearch(query, options.vectorK),
      Promise.resolve(this.keywordSearch(query, this.documents, options.keywordK))
    ]);

    // Combine results, prioritizing vector search
    const seenContents = new Set();
    const combined = [];

    // Add vector results first
    for (const doc of vectorDocs) {
      if (!seenContents.has(doc.pageContent)) {
        seenContents.add(doc.pageContent);
        combined.push({
          ...doc,
          metadata: { ...doc.metadata, score: 1.0, source: 'vector' }
        });
      }
    }

    // Add keyword results
    for (const doc of keywordDocs) {
      if (!seenContents.has(doc.pageContent)) {
        seenContents.add(doc.pageContent);
        combined.push(doc);
      }
    }

    return combined;
  }
}
```
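
A quick usage sketch, reusing the document loader from Part 2:

```javascript
import { OpenAIEmbeddings } from "@langchain/openai";
import { loadDocuments } from './lib/document-loader.js';
import { HybridRetriever } from './lib/hybrid-retriever.js';

const documents = await loadDocuments();
const retriever = await HybridRetriever.fromDocuments(documents, new OpenAIEmbeddings());

// Up to 3 vector matches and 2 keyword matches, de-duplicated by content
const docs = await retriever.getRelevantDocuments("2024 election results", {
  vectorK: 3,
  keywordK: 2
});
console.log(docs.map(d => d.metadata.source)); // e.g. [ 'vector', 'vector', 'keyword' ]
```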

2.2 Enhanced Query Engine

The enhanced query engine builds on this hybrid retriever to create a more sophisticated query-processing pipeline. Create lib/enhanced-query.js:

```javascript
import { ChatOpenAI, OpenAIEmbeddings } from "@langchain/openai";
import { StringOutputParser } from "@langchain/core/output_parsers";
import { RunnableSequence } from "@langchain/core/runnables";
import { PromptTemplate } from "@langchain/core/prompts";
import { HybridRetriever } from './hybrid-retriever.js';

const ENHANCED_PROMPT = `You are analyzing Wikipedia content to provide accurate information. The context comes from both vector (semantic) and keyword searches.

Context information (ranked by relevance):
-----------------------------------------
{context}
-----------------------------------------

Please provide a detailed answer that:
1. Directly answers the question
2. Notes dates and timeline information
3. References which search method found the information
4. Expresses uncertainty for very recent events

Question: {question}

Detailed answer:`;

async function formatDocuments(docs) {
  return docs.map((doc, i) => {
    const source = doc.metadata?.source || 'unknown';
    const score = doc.metadata?.score || 'N/A';
    return `[${i + 1}] (${source} search, score: ${score})\n${doc.pageContent}\n`;
  }).join('\n');
}

export async function createEnhancedQueryEngine(documents) {
  const embeddings = new OpenAIEmbeddings();
  const retriever = await HybridRetriever.fromDocuments(documents, embeddings);
  
  const model = new ChatOpenAI({
    modelName: 'gpt-3.5-turbo',
    temperature: 0.3,
  });

  const prompt = PromptTemplate.fromTemplate(ENHANCED_PROMPT);

  const chain = RunnableSequence.from([
    {
      // This properly formats the input for the prompt
      context: async (input) => {
        const docs = await retriever.getRelevantDocuments(input.question);
        return formatDocuments(docs);
      },
      question: (input) => input.question,
    },
    // The prompt template converts the input into a prompt
    prompt,
    // The model processes the prompt
    model,
    // The output parser converts the model output to a string
    new StringOutputParser()
  ]);

  return chain;
}
```

2.3 Testing the Enhanced Query Engine

Create a new file scripts/test-hybrid.js:

```javascript
import { loadDocuments } from '../lib/document-loader.js';
import { createEnhancedQueryEngine } from '../lib/enhanced-query.js';

async function testHybridQueries() {
  console.log('Loading documents...');
  const documents = await loadDocuments();
  
  console.log('Initializing enhanced query engine...');
  const queryEngine = await createEnhancedQueryEngine(documents);

  const questions = [
    "What are Donald Trump's most recent political activities?",
    "What legal challenges is Trump currently facing?",
    "Who won and who was the runner up in the 2024 Republican primaries?"
  ];

  console.log('\nTesting enhanced retrieval with sample queries...\n');

  for (const question of questions) {
    console.log(`Question: ${question}`);
    console.log('Processing...');
    
    const startTime = performance.now();
    const response = await queryEngine.invoke({ question });
    const duration = performance.now() - startTime;
    
    console.log('\nAnswer:', response);
    console.log(`Duration: ${Math.round(duration)}ms\n`);
    console.log('-'.repeat(50), '\n');
  }
}

testHybridQueries().catch(console.error);
```

Run the script with node scripts/test-hybrid.js.

Output:

```text
Loading documents...
Split into 165 chunks
Initializing enhanced query engine...

Testing enhanced retrieval with sample queries...

Question: What are Donald Trump's most recent political activities?
Processing...

Answer: Donald Trump's most recent political activities include his continued dominance within the Republican Party, where he has been described as a modern party boss. He has focused on fundraising, raising more than twice as much as the Republican Party itself, and profiting from fundraisers held at Mar-a-Lago. Trump has also been actively involved in endorsing over 200 candidates for various offices in the 2022 midterm elections, most of whom supported his false claim that the 2020 presidential election was stolen from him. Additionally, he has been focused on how elections are run and has been involved in efforts to oust election officials who resisted his attempts to overturn the 2020 election results.
Duration: 2540ms

-------------------------------------------------- 

Question: What legal challenges is Trump currently facing?
Processing...

Answer: As of September 2023, Donald Trump is facing multiple legal challenges. In September 2022, the attorney general of New York filed a civil fraud case against Trump, his three oldest children, and the Trump Organization. Trump was fined $110,000 for failing to turn over records subpoenaed by the attorney general. In February 2024, the court found Trump liable, ordering him to pay a penalty of more than $350 million plus interest, totaling over $450 million. Additionally, Trump was barred from serving as an officer or director of any New York corporation or legal entity for three years. Trump has stated his intention to appeal the verdict. Furthermore, Trump was sued for violating the Domestic and Foreign Emoluments Clauses of the U.S. Constitution, with cases being dismissed by the U.S. Supreme Court as moot after the end of his term.
Duration: 2811ms

-------------------------------------------------- 

Question: Who won and who was the runner up in the 2024 general election?
Processing...

Answer: In the 2024 general election, Donald Trump was elected as the 47th president of the United States, defeating incumbent vice president Kamala Harris. Trump won with 49.9% of the vote and a margin of 1.6% over his opponent, making him the first Republican to win the popular vote since 2004. Kamala Harris was the runner-up in the election. The Associated Press and BBC News described Trump's victory as an extraordinary comeback for a former president.
Duration: 2134ms

--------------------------------------------------
```

2.4 Analyzing Our Results

Let’s examine how well our hybrid retrieval system performed:

  1. Recent Political Activities Query:
What are Donald Trump's most recent political activities?

The response focuses on older events (2022 midterms and fundraising) but misses the most significant recent developments: his 2024 presidential campaign, victory, and upcoming inauguration. This shows a limitation in our system where recency isn’t properly prioritized in the retrieval process.

  2. Legal Challenges Query:
What legal challenges is Trump currently facing?

While it provides good detail about the New York civil fraud case, it misses several major legal challenges including his criminal conviction, the Georgia case, and the E. Jean Carroll verdicts. This suggests our retrieval system isn’t effectively gathering information spread across multiple chunks.

  3. 2024 Election Query:
Who won and who was the runner up in the 2024 general election?

This response is the most accurate and complete, correctly identifying Trump’s victory over Kamala Harris and including specific vote percentages. It works well because the information is likely contained in a single, well-defined chunk of text.

These results show that while hybrid retrieval can be effective, it still faces challenges with:

  • Prioritizing recent information
  • Gathering comprehensive information spread across multiple chunks
  • Consistently finding the most relevant content

Key differences from Part 2’s implementation:

  • Hybrid Retrieval: Combines both vector and keyword search results
  • Source Attribution: The prompt and formatting show which retrieval method found each piece of information
  • Scoring: Documents are scored based on their retrieval method and relevance
  • Enhanced Context: The prompt template is designed to make better use of the dual retrieval sources

The hybrid approach should provide more comprehensive results, especially for:

  • Questions with specific names or dates
  • Queries that combine conceptual and factual elements
  • Recent events that might be described with varying terminology

3. Production Optimizations

Let’s explore two key optimizations that make our RAG system production-ready: streaming responses and basic caching.

3.1 Streaming Responses

Long responses can take several seconds to generate, leading to poor user experience. Streaming solves this by sending chunks of the response as they’re generated. This provides immediate feedback and better user engagement.

Update lib/enhanced-query.js to support streaming:

```javascript
import { ChatOpenAI, OpenAIEmbeddings } from "@langchain/openai";
import { StringOutputParser } from "@langchain/core/output_parsers";
import { RunnableSequence } from "@langchain/core/runnables";
import { PromptTemplate } from "@langchain/core/prompts";
import { HybridRetriever } from './hybrid-retriever.js';

// ENHANCED_PROMPT and formatDocuments are the same definitions used earlier in this file

export async function createStreamingQueryEngine(documents) {
  const embeddings = new OpenAIEmbeddings();
  const retriever = await HybridRetriever.fromDocuments(documents, embeddings);

  const model = new ChatOpenAI({
    modelName: 'gpt-3.5-turbo',
    temperature: 0.1,
    streaming: true   // emit tokens as they are generated
  });

  const prompt = PromptTemplate.fromTemplate(ENHANCED_PROMPT);

  const chain = RunnableSequence.from([
    {
      context: async (input) => {
        const docs = await retriever.getRelevantDocuments(input.question);
        return formatDocuments(docs);
      },
      question: (input) => input.question,
    },
    prompt,
    model,
    new StringOutputParser()
  ]);

  return chain;
}
```

The streaming implementation will be fully utilized in Part 4’s web interface. For now, we can test it with a simple script (call it scripts/test-streaming.js):

```javascript
import { loadDocuments } from '../lib/document-loader.js';
import { createStreamingQueryEngine } from '../lib/enhanced-query.js';

async function testStreaming() {
  const documents = await loadDocuments();
  const engine = await createStreamingQueryEngine(documents);

  const stream = await engine.stream({
    question: "What are Trump's current legal challenges?"
  });

  // Print each chunk as soon as it arrives
  for await (const chunk of stream) {
    process.stdout.write(chunk);
  }
}

testStreaming().catch(console.error);
```

3.2 Basic Response Caching

Implementing a simple cache can significantly improve response times for repeated queries. Here’s a straightforward implementation using an LRU (Least Recently Used) eviction strategy, backed by the lru-cache package (npm install lru-cache). Create lib/cache.js:

```javascript
import { LRUCache } from 'lru-cache';

export class SimpleCache {
  constructor(options = {}) {
    this.cache = new LRUCache({
      max: options.maxSize || 100,
      ttl: options.ttl || 1000 * 60 * 60  // 1 hour default
    });
  }

  async get(query) {
    return this.cache.get(query);
  }

  async set(query, response) {
    this.cache.set(query, response);
  }
}
```

Now integrate it with our query engine; this wrapper could live in lib/optimized-engine.js from the project structure:

```javascript
import { SimpleCache } from './cache.js';
import { createEnhancedQueryEngine } from './enhanced-query.js';

export class CachedQueryEngine {
  constructor(cache, queryEngine) {
    this.cache = cache;
    this.queryEngine = queryEngine;
  }

  // Async factory, since createEnhancedQueryEngine returns a promise
  static async fromDocuments(documents) {
    const queryEngine = await createEnhancedQueryEngine(documents);
    return new CachedQueryEngine(new SimpleCache(), queryEngine);
  }

  async query(question, options = {}) {
    // Skip the cache for streaming requests; chunks can't be replayed from it
    if (options.stream) {
      return this.queryEngine.stream({ question });
    }

    const cachedResponse = await this.cache.get(question);
    if (cachedResponse) {
      return cachedResponse;
    }

    const response = await this.queryEngine.invoke({ question });
    await this.cache.set(question, response);
    return response;
  }
}
```
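
Usage then looks like this (the async factory mirrors HybridRetriever.fromDocuments, since the underlying chain is built asynchronously):

```javascript
const engine = await CachedQueryEngine.fromDocuments(documents);

// First call hits the model...
console.time('cold');
await engine.query("What legal challenges is Trump currently facing?");
console.timeEnd('cold');    // ~2-3 seconds, as in the test run above

// ...the repeat is served from the LRU cache
console.time('cached');
await engine.query("What legal challenges is Trump currently facing?");
console.timeEnd('cached');  // typically a few milliseconds
```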

4. Best Practices and Common Challenges

Based on our implementation experience, here are key considerations for production RAG systems:

4.1 Error Handling

Common error scenarios to handle:

  • API rate limits
  • Token context length exceeded
  • Network timeouts
  • Invalid or missing documents

Here’s a retry wrapper with exponential backoff for the rate-limit case:

```javascript
// Retry with exponential backoff: 1s, 2s, 4s...
async function queryWithRetry(queryEngine, question, maxRetries = 3) {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await queryEngine.query(question);
    } catch (error) {
      if (error.message.includes('rate limit') && i < maxRetries - 1) {
        await new Promise(r => setTimeout(r, Math.pow(2, i) * 1000));
        continue;
      }
      throw error;
    }
  }
}
```
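
Network timeouts, another scenario from the list above, can be handled with a Promise.race wrapper. This is a sketch; the 30-second default is an arbitrary choice:

```javascript
async function queryWithTimeout(queryEngine, question, timeoutMs = 30000) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Query timed out after ${timeoutMs}ms`)), timeoutMs);
  });

  try {
    // Whichever settles first wins: the query result or the timeout rejection
    return await Promise.race([queryEngine.query(question), timeout]);
  } finally {
    clearTimeout(timer); // avoid keeping the event loop alive
  }
}
```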

4.2 Performance Considerations

Key areas to monitor and optimize:

  1. Document Retrieval

    • Tune chunk sizes (around 1,000 tokens worked well for us)
    • Adjust the ratio of vector to keyword results
    • Monitor retrieval times
  2. Response Generation

    • Use streaming for long responses
    • Implement basic caching
    • Set appropriate timeouts
  3. Resource Usage (see the metrics sketch below)

    • Monitor memory usage for vector storage
    • Track API token consumption
    • Watch cache hit rates
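
The project structure reserves lib/metrics.js for this kind of tracking. A minimal sketch of what it might contain (the class and field names are our own choices):

```javascript
// lib/metrics.js - a minimal performance tracker (names are illustrative)
export class Metrics {
  constructor() {
    this.queries = 0;
    this.cacheHits = 0;
    this.totalLatencyMs = 0;
  }

  // Call once per query with its wall-clock latency
  recordQuery(latencyMs, { cacheHit = false } = {}) {
    this.queries += 1;
    this.totalLatencyMs += latencyMs;
    if (cacheHit) this.cacheHits += 1;
  }

  summary() {
    return {
      queries: this.queries,
      cacheHitRate: this.queries ? this.cacheHits / this.queries : 0,
      avgLatencyMs: this.queries ? this.totalLatencyMs / this.queries : 0
    };
  }
}
```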

5. Next Steps

In Part 4, we’ll build a complete web interface for our RAG system, including:

  • Real-time streaming display
  • Source attribution
  • Error handling
  • Performance monitoring

The interface will demonstrate how to effectively use the streaming and caching capabilities we’ve implemented here.

Conclusion

Our enhanced RAG system now provides better search accuracy through hybrid retrieval, improved user experience with streaming, and better performance through caching. While there’s always room for more sophisticated optimizations, this implementation provides a solid foundation for production use.
