RAG Systems Deep Dive Part 1: Core Concepts and Architecture

Large Language Models (LLMs) are incredibly powerful, possessing vast knowledge from their training data. However, they face a fundamental limitation: they can’t access or reason about your specific, private, or recent data. Even with today’s models supporting massive context windows of 100K+ tokens, organizations often need to work with datasets that are orders of magnitude larger.

Quick Reference: Key Terms

  • RAG: Retrieval-Augmented Generation - combining LLMs with the ability to search through your own data
  • Embeddings: Numerical representations of text that capture meaning (think: words → numbers)
  • Vector Database: Specialized storage for embeddings that enables semantic search
  • Semantic Search: Finding information based on meaning rather than exact keyword matches
  • Context Window: The amount of text an LLM can process at once
  • Chunks: Smaller pieces of documents optimized for retrieval and processing

1. Why RAG?

Imagine you’re building an AI assistant that needs to answer questions about your company’s internal documentation, which spans thousands of documents and is updated daily. You have two options:

  1. Fine-tuning the LLM: This means retraining the model on your data. While effective, it’s:

    • Expensive and computationally intensive
    • Time-consuming to implement
    • Difficult to update (requires retraining for new information)
    • Challenging to maintain version control
  2. Using RAG: This approach dynamically fetches relevant information and feeds it to the LLM as context. It’s:

    • More cost-effective
    • Easy to update (just add new documents to your index)
    • Transparent, with clear data lineage
    • Flexible and scalable

2. How RAG Works: A Simple Example

Let’s say you’re building an AI assistant for a conference organization. Someone asks: “Who attended last year’s conference?”

Here’s how RAG handles this:

  1. Query Processing: The question is converted into a format that can be used to search your data
  2. Retrieval: The system searches a vector database containing your conference records
  3. Context Assembly: Relevant information (e.g., “The 2023 conference attendees were: Alice, Bob, Carol…”) is retrieved
  4. Generation: The LLM uses this specific context along with its general knowledge to formulate a natural response

[Figure: RAG architecture, from query processing through retrieval and context assembly to generation]
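Here is the same flow as a rough TypeScript sketch. The helpers embedQuery, vectorStore, buildPrompt, and llm are hypothetical stand-ins, not a real library API; they exist only to make the four steps concrete.

```typescript
// Hypothetical stand-ins for the components involved (not a real library API)
type Chunk = { text: string; score: number };
declare function embedQuery(question: string): Promise<number[]>;
declare const vectorStore: { search(v: number[], opts: { topK: number }): Promise<Chunk[]> };
declare function buildPrompt(question: string, contexts: string[]): string;
declare const llm: { complete(prompt: string): Promise<string> };

async function answer(question: string): Promise<string> {
  const queryVector = await embedQuery(question);                    // 1. Query processing
  const chunks = await vectorStore.search(queryVector, { topK: 5 }); // 2. Retrieval
  const prompt = buildPrompt(question, chunks.map((c) => c.text));   // 3. Context assembly
  return llm.complete(prompt);                                       // 4. Generation
}
```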

This simple flow masks sophisticated technology working behind the scenes. At its core, RAG depends on two key innovations:

  1. Vector Embeddings: A way to convert text into numerical representations that capture meaning
  2. Semantic Search: The ability to find information based on meaning rather than just keywords

3. The Building Blocks

To implement RAG, we need to:

  1. Prepare Data: Convert our documents into a format that’s optimized for semantic search

    • Break documents into appropriate chunks
    • Convert text into vector embeddings
    • Store in a vector database
  2. Query System: Create a pipeline that can:

    • Process user questions
    • Find relevant information
    • Combine context with LLM capabilities
    • Generate accurate responses
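With LlamaIndex.js, both halves of this list collapse into a handful of calls. Here is a minimal end-to-end sketch, assuming the llamaindex npm package running as an ES module with an OpenAI key configured; method names and signatures differ slightly between versions, so treat it as illustrative rather than definitive.

```typescript
import { Document, VectorStoreIndex } from "llamaindex";

// Prepare data: wrap raw text in Documents; chunking, embedding, and storage happen on indexing
const documents = [
  new Document({ text: "The 2023 conference attendees were: Alice, Bob, Carol." }),
];
const index = await VectorStoreIndex.fromDocuments(documents);

// Query system: retrieve relevant chunks and let the LLM generate a grounded answer
const queryEngine = index.asQueryEngine();
const response = await queryEngine.query({ query: "Who attended last year's conference?" });
console.log(response.toString());
```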

In the following sections, we’ll dive deep into each of these components, understanding how they work and why they matter. While this article focuses on the theoretical foundations, Part 2 of this series will provide hands-on implementation using LlamaIndex.js, showing you how to build these systems in practice.

But first, let’s understand the fundamental concept that makes RAG possible: vector embeddings and semantic space.

4. Introduction to Vector Space and Embeddings

At the heart of every modern RAG system lies the concept of vector embeddings and semantic space.

4.1 Understanding Vector Space

Vector space is a mathematical construct where text is represented as high-dimensional vectors. When we talk about “embeddings,” we’re referring to the process of converting text into these vectors in a way that preserves semantic meaning. For example:

  • The sentence “I love programming” might be converted into a vector like [0.2, -0.5, 0.8, …]
  • A similar sentence “I enjoy coding” would have a similar vector representation
  • An unrelated sentence would be represented by a very different vector

This is why vector similarity (often measured by cosine similarity) can capture semantic similarity - sentences with similar meanings end up close to each other in this high-dimensional space.
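To make this concrete, here is a small toy example that computes cosine similarity over made-up three-dimensional vectors; real embeddings behave the same way, just with hundreds or thousands of dimensions.

```typescript
// Cosine similarity: dot product of the vectors divided by the product of their lengths
function cosineSimilarity(a: number[], b: number[]): number {
  const dot = a.reduce((sum, ai, i) => sum + ai * b[i], 0);
  const normA = Math.sqrt(a.reduce((sum, ai) => sum + ai * ai, 0));
  const normB = Math.sqrt(b.reduce((sum, bi) => sum + bi * bi, 0));
  return dot / (normA * normB);
}

// Made-up vectors for illustration only
const loveProgramming = [0.2, -0.5, 0.8];
const enjoyCoding = [0.25, -0.45, 0.75];
const weatherToday = [-0.7, 0.6, 0.1];

console.log(cosineSimilarity(loveProgramming, enjoyCoding));  // ~0.99: similar meaning
console.log(cosineSimilarity(loveProgramming, weatherToday)); // ~-0.4: unrelated
```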

LlamaIndex.js typically uses OpenAI embedding models such as text-embedding-3-small, which converts text into 1536-dimensional vectors (the larger text-embedding-3-large produces 3072 dimensions by default). These vectors capture nuanced semantic relationships:

  • Synonyms cluster together in vector space
  • Related concepts have smaller angular distances
  • Antonyms tend to point in opposite directions

This mathematical representation enables “semantic search” - finding relevant information based on meaning rather than just keywords.

Cost Considerations for Embeddings

While LlamaIndex supports various embedding models, be aware of cost implications:

  • OpenAI’s text-embedding-3-large: Highest quality but most expensive (~$0.13/1M tokens)
  • text-embedding-3-small: Good balance of quality and cost (~$0.02/1M tokens)
  • Open-source alternatives (like Sentence Transformers): Free, but you run them on your own infrastructure

For development and testing, consider using smaller models or caching embeddings to manage costs. For example, embedding a 10-million-token corpus costs roughly $0.20 with text-embedding-3-small versus about $1.30 with text-embedding-3-large. Production systems should balance quality needs with budget constraints.

5. RAG Architecture Deep Dive

Now that we understand why RAG is useful and how vector embeddings work, let’s explore how LlamaIndex helps us build these systems. LlamaIndex abstracts away much of the complexity, but understanding the key components helps us make better implementation choices.

5.1 Document Processing with LlamaIndex

LlamaIndex provides built-in tools for handling documents through its SimpleDirectoryReader and various data connectors. Let’s understand what happens when you process documents:

Document Loading

LlamaIndex handles various document types automatically:

  • PDF files
  • Markdown documents
  • Text files
  • Word documents
  • And many more through LlamaHub connectors

Instead of worrying about different file formats, you can focus on your data:
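For example, loading a folder of mixed file types is a single call. The sketch below assumes SimpleDirectoryReader from the llamaindex package; the exact loadData signature varies slightly across versions.

```typescript
import { SimpleDirectoryReader } from "llamaindex";

// The reader detects each file type by extension and picks a suitable parser
const reader = new SimpleDirectoryReader();
const documents = await reader.loadData({ directoryPath: "./docs" });

// Each Document carries its text plus metadata such as the source file name
for (const doc of documents) {
  console.log(doc.metadata, doc.text.slice(0, 80));
}
```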

Key Benefits:

  • Automatic format detection
  • Metadata preservation
  • Built-in error handling
  • Extensible through custom connectors

Note: Additional connectors from LlamaHub require separate installation. We’ll cover this in Part 2.

Text Chunking

LlamaIndex manages document chunking for you, but understanding the options helps choose the right approach:

  1. Default Chunking: Out of the box, LlamaIndex:

    • Splits text into manageable pieces
    • Preserves paragraph boundaries when possible
    • Handles overlap automatically
  2. Customizable Options: You can configure:

    • Chunk size (default is optimized for most cases)
    • Overlap between chunks
    • Splitting strategy (character, token, or sentence-based)

Quick tip:

  • Use default chunking (512 tokens) for general Q&A
  • Use smaller chunks (256 tokens) for precise retrieval
  • Use larger chunks (1024 tokens) for summarization
  • Increase overlap (from default 20 tokens to 50-100) when context preservation is crucial
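Applying these tips is usually a one-or-two-line configuration change. The sketch below assumes the global Settings object from recent LlamaIndex.TS releases; property names may differ in older versions.

```typescript
import { Settings } from "llamaindex";

// Global chunking defaults, applied whenever documents are split into chunks.
// Smaller chunks favour precise retrieval; larger ones favour summarization.
Settings.chunkSize = 512;   // tokens per chunk
Settings.chunkOverlap = 50; // tokens shared between neighbouring chunks
```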

5.2 Storage Options in LlamaIndex

LlamaIndex supports various ways to store your processed documents and their embeddings:

Built-in Storage

  1. Simple Storage: Best for:

    • Development and testing
    • Small to medium datasets
    • Quick prototypes
  2. Persistent Storage: Supported options:

    • Local disk storage
    • Vector databases (Chroma, Pinecone)
    • SQL databases with vector extensions

Choosing Storage

Your choice depends on your needs:

Development:

  • Use simple storage
  • Local persistence for iteration

Production:

  • Vector databases for scale
  • Consider managed solutions

Storage Options Compared:

  1. Simple Storage

    • Uses local JSON files by default
    • Good for up to ~10,000 documents
    • No additional setup required
  2. Vector Databases: Quick guide:

    • Chroma: Open-source, easy setup, good for small-medium projects
    • Pinecone: Managed service, scalable, better for production
    • PGVector: PostgreSQL extension, good if you’re already using Postgres

Choose Chroma when:

  • Starting a new project
  • Need quick local setup
  • Working with <500K vectors

Choose Pinecone when:

  • Need high availability
  • Working with >1M vectors
  • Require managed scaling

Choose PGVector when:

  • Already using PostgreSQL
  • Need SQL querying capabilities
  • Want to keep everything in one database
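Whichever backend you pick, the wiring in LlamaIndex.js looks similar. Here is a minimal sketch of simple, local persistence, assuming the storageContextFromDefaults helper from the llamaindex package; option names can vary between versions.

```typescript
import { SimpleDirectoryReader, VectorStoreIndex, storageContextFromDefaults } from "llamaindex";

// Persist the index to local JSON files instead of keeping it only in memory
const storageContext = await storageContextFromDefaults({ persistDir: "./storage" });

const documents = await new SimpleDirectoryReader().loadData({ directoryPath: "./docs" });
const index = await VectorStoreIndex.fromDocuments(documents, { storageContext });
// On later runs, the persisted files can be reloaded instead of re-embedding every document
```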

5.3 Retrieval in Practice

LlamaIndex provides several ways to retrieve information:

Basic Retrieval

Default approach:

  1. Convert query to embedding
  2. Find similar chunks
  3. Return top matches
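In LlamaIndex.js this default flow is exposed through a retriever. A rough sketch follows; API details differ between versions, so treat the exact calls as indicative.

```typescript
import { Document, VectorStoreIndex } from "llamaindex";

const index = await VectorStoreIndex.fromDocuments([
  new Document({ text: "The 2023 conference attendees were: Alice, Bob, Carol." }),
]);

// 1-2. Embed the query and find the most similar chunks
const retriever = index.asRetriever({ similarityTopK: 3 });
const results = await retriever.retrieve({ query: "Who attended last year's conference?" });

// 3. Top matches come back with similarity scores attached
for (const { node, score } of results) {
  console.log(score, node);
}
```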

Advanced Options

LlamaIndex offers enhanced retrieval methods:

  1. Similarity Search

    • Default method
    • Works well for most cases
    • Configurable similarity threshold
  2. Hybrid Search

    • Combines keyword and semantic search
    • Better for specific terms
    • More complex but more accurate

Choosing Retrieval Methods:

Similarity Search is better for:

  • General questions
  • Conceptual queries
  • When exact matches aren’t required

Hybrid Search is better for:

  • Technical documentation
  • When specific terms matter
  • Questions containing proper nouns or exact phrases

5.4 Managing Context

LlamaIndex helps manage how much context is sent to the LLM. This is crucial because:

  • Every LLM has a maximum context window (e.g., 4K, 16K, or 128K tokens)
  • More context means higher API costs
  • Too much context can dilute the relevance of information

Context Optimization

LlamaIndex automatically handles these challenges:

  • Maximum Context Length

    • Respects LLM token limits
    • Automatically truncates when needed
    • Preserves most relevant information
  • Smart Selection

    • Prioritizes recent and relevant chunks
    • Removes redundant information
    • Balances context quality and quantity

Understanding Token Limits

Token limits affect both cost and quality:

  • Too few tokens → Missing context → Incomplete answers
  • Too many tokens → Higher costs → Potential information overload

LlamaIndex’s default settings optimize for:

  • GPT-3.5: 4K-token context window
  • GPT-4: 8K-token context window
  • Claude: 8K-100K tokens, depending on the version

You can adjust these based on your specific LLM and needs.
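To make the trade-off concrete, here is a small, framework-agnostic sketch of fitting retrieved chunks into a token budget. It uses the rough four-characters-per-token heuristic rather than a real tokenizer, and the function names are illustrative only, not part of LlamaIndex.

```typescript
// Very rough token estimate: ~4 characters per token for English text
const approxTokens = (text: string): number => Math.ceil(text.length / 4);

// Keep the most relevant chunks (assumed already sorted) until the budget is spent
function fitToBudget(chunks: string[], maxContextTokens: number, reservedForAnswer = 512): string[] {
  const budget = maxContextTokens - reservedForAnswer;
  const selected: string[] = [];
  let used = 0;
  for (const chunk of chunks) {
    const cost = approxTokens(chunk);
    if (used + cost > budget) break; // stop before overflowing the context window
    selected.push(chunk);
    used += cost;
  }
  return selected;
}
```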

Smart Context Features

  1. Auto-retrieval

    • Automatically fetches relevant context
    • Manages token limits based on LLM capacity
    • Optimizes for response quality while controlling costs
    • Handles pagination for large result sets
  2. Context Reranking

    • Orders information by relevance score
    • Removes redundant content
    • Preserves important details
    • Ensures context stays within token limits
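Conceptually, the reranking step can be pictured as a sort-and-deduplicate pass over the retrieved chunks. The sketch below is an illustrative, framework-agnostic version, not LlamaIndex's actual implementation.

```typescript
type Scored = { text: string; score: number };

// Order chunks by relevance score and drop exact duplicates
function rerank(chunks: Scored[]): Scored[] {
  const seen = new Set<string>();
  return [...chunks]
    .sort((a, b) => b.score - a.score) // highest relevance first
    .filter((c) => {
      const key = c.text.trim().toLowerCase();
      if (seen.has(key)) return false; // redundant content
      seen.add(key);
      return true;
    });
}
```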

Pro Tip: Monitor your token usage patterns in development. This helps optimize costs and performance when moving to production. LlamaIndex provides built-in token tracking that we’ll explore in Part 2.

6. Looking Ahead: Advanced Features

While LlamaIndex handles the basics well, it also supports advanced use cases:

Multi-Modal Support

Beyond Text:

  • Image processing (coming soon)
  • PDF with images
  • Structured data

Integration Options

Extend functionality:

  • Custom embedding models
  • Different LLM providers
  • External tools and services
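Swapping models is typically a couple of global settings. The sketch below assumes the OpenAI, OpenAIEmbedding, and Settings exports from recent LlamaIndex.TS versions; the package layout has shifted between releases, so check your version's docs.

```typescript
import { OpenAI, OpenAIEmbedding, Settings } from "llamaindex";

// Swap in a specific LLM and embedding model globally (model names are just examples)
Settings.llm = new OpenAI({ model: "gpt-4o-mini", temperature: 0 });
Settings.embedModel = new OpenAIEmbedding({ model: "text-embedding-3-small" });
```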

These concepts form the foundation for Part 2 of our series, where we’ll implement these features using LlamaIndex.js code. Understanding these components helps us make better choices in our implementations, even though LlamaIndex handles much of the complexity for us.

7. Conclusion: Building Effective RAG Systems

We’ve covered the fundamental concepts and architecture of RAG systems, focusing on how LlamaIndex makes implementation accessible while maintaining flexibility for advanced use cases. Key takeaways include:

  1. RAG’s Advantage:

    • Provides a practical alternative to fine-tuning
    • Enables dynamic access to your data
    • Offers cost-effective scaling
  2. Core Components:

    • Vector embeddings for semantic understanding
    • Efficient document processing and storage
    • Smart retrieval and context management
  3. Implementation Considerations:

    • Choose appropriate chunking strategies
    • Select storage based on scale requirements
    • Configure retrieval methods for your use case

In Part 2 of this series, we’ll translate these concepts into practice, building a complete RAG system with LlamaIndex.js. I’ll show you how to:

  • Set up document processing
  • Configure vector storage
  • Implement retrieval strategies
  • Optimize system performance

Stay tuned for hands-on implementation examples and practical tips for building production-ready RAG applications.

