Introduce PageIndex to me. Why pursue vectorless RAG?
This is a very cutting-edge and pertinent question. The rise of PageIndex and No-Vector RAG (Vectorless RAG) is essentially a reflection on and correction of the limitations of the current mainstream Vector RAG (Retrieval-Augmented Generation based on vector databases).
Simply put, PageIndex is a technology based on Inverted Index, aiming to precisely locate relevant "Pages" or complete contexts through keyword matching, rather than relying on fuzzy semantic vector search on slices.
Below, I will break down the concept of PageIndex in detail for you, as well as why we are starting to shift towards or combine "vectorless RAG."
In traditional Vector RAG, we chop documents into small pieces (Chunks), convert them into vectors, and then search for similar vectors. In the PageIndex mode, the logic returns to the origin of search engines:
Indexing Stage: The system no longer slices and vectorizes documents. Instead, it builds a global inverted index (similar to the index page at the end of a book). It records which pages (Pages) or documents each keyword appears in.
Retrieval Stage: When a user asks a question, the system extracts keywords from the query and uses the inverted index to directly locate the complete pages containing these keywords.
Generation Stage: The system feeds these complete page contents (not fragmented Chunks) as context to the LLM (Large Language Model), allowing the LLM to read, understand, and extract the answer itself.
Core Difference: Vector RAG feeds the LLM "shredded paragraphs"; PageIndex feeds the LLM "complete pages containing clues."
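To make the three stages concrete, here is a minimal sketch of the idea in plain Python. The page texts, tokenizer, and scoring are deliberately naive placeholders; a real system would sit on something like Lucene/ElasticSearch (BM25) and smarter keyword extraction:

```python
import re
from collections import defaultdict

# Toy corpus: each entry is a whole page, not a chunk.
pages = {
    1: "This agreement defines the breach of contract clause ...",
    5: "In case of breach, the defaulting party shall notify the other party ...",
    10: "The compensation amount is capped at 20% of the total contract value ...",
}

def tokenize(text):
    # Naive word tokenizer; production systems use proper analyzers.
    return re.findall(r"\w+", text.lower())

# Indexing stage: keyword -> set of page numbers (the inverted index).
inverted_index = defaultdict(set)
for page_no, text in pages.items():
    for token in tokenize(text):
        inverted_index[token].add(page_no)

def retrieve_pages(query, top_k=3):
    # Retrieval stage: rank pages by how many query keywords they contain.
    scores = defaultdict(int)
    for keyword in tokenize(query):
        for page_no in inverted_index.get(keyword, set()):
            scores[page_no] += 1
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [pages[p] for p in ranked[:top_k]]

# Generation stage would then hand these complete pages to the LLM as context.
print(retrieve_pages("breach of contract compensation amount"))
```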
Although Vector RAG is popular, it exposes three core pain points in practical implementation. Vectorless RAG aims to solve precisely these problems:
Pain Point 1 of Vector RAG (context fragmentation): To store in a vector database, long documents must be chopped into 500-1000 word chunks. This leads to loss of context. For example, the "breach of contract clause" of an agreement might be on page 5, but the "compensation amount" is on page 10. Vector retrieval might only retrieve the chunk from page 5, causing the LLM to be unable to answer a complete logical question.
Advantage of No-Vector: PageIndex retrieves whole Pages or Documents. Because models now support 128k/200k/1M-token context windows, the LLM can take in these larger units and understand logical relationships across paragraphs.
Pain Point 2 of Vector RAG (imprecision): Vector search is fuzzy search based on "semantic similarity."
Example: You search for "error code 1024." The vector might retrieve content related to "error code 1025" because they are mathematically close in space, but for a programmer, this is completely wrong.
Failure with Proper Nouns: Vector search often fails when searching for specific SKU numbers, names, or extremely low-frequency vocabulary.
Advantage of No-Vector: The inverted index is exact matching. You search for "1024," and it only gives you pages containing "1024." This is crucial for fields requiring rigor, such as law, finance, and code.
Pain Point 3 of Vector RAG (opacity and cost): Vectors are a black box. You don't know why the system thinks A and B are similar. Furthermore, maintaining a high-performance vector database (Vector DB) is costly, and updating the index (adding, deleting, modifying, querying) is more complex than with traditional databases.
Advantage of No-Vector: Keyword-based retrieval is completely explainable ("because this page contains these three search terms"). Moreover, inverted index technology (like Lucene, ElasticSearch) is already very mature, cheap, and efficient.
For a more intuitive understanding, let's compare these two architectures:
| Feature | Vector RAG (Traditional Mainstream) | No-Vector RAG / PageIndex |
| --- | --- | --- |
| Retrieval Unit | Chunks (Text Slices) | Pages / Full Documents (Complete Pages) |
| Matching Logic | Semantic Similarity (Cosine Similarity) | Keyword Exact Match (Exact Match / BM25) |
| Context Integrity | Low (Prone to taking out of context) | High (Preserves complete contextual logic) |
| Suitable Scenarios | Open-ended Q&A, Intent Understanding, Fuzzy Queries | Precise Data Queries, Long Document Analysis, Proper Noun Retrieval |
| Dependent Technology | Embedding Model + Vector DB | Inverted Index (e.g., ElasticSearch) + LLM Long Context |
| Computational Cost | Slow indexing, inference requires vector computation | Fast indexing, inference mainly consumes LLM Tokens |
You might ask: "Isn't inverted index technology decades old? Why is it being highlighted now?"
This is because the context window of LLMs has become larger.
Before (Early 2023): LLMs could only see 4k Tokens. If you stuffed an entire page (maybe 2k Tokens) in, two pages would fill it up. So slicing (Vector RAG) was necessary.
Now: Claude 3 supports 200k, Gemini 1.5 Pro supports 1M+ Tokens.
Result: We no longer need to painstakingly "shred" knowledge. We just need to use PageIndex to quickly locate even 10 relevant pages, then throw all 10 pages to the LLM and say: "Read within this scope and find the answer for me."
This "Retrieve Pages -> LLM Reads All" model often performs better than the "Embed Chunks -> Retrieve Chunks -> LLM Guesses" model.
Pursuing vectorless RAG (PageIndex) is not about completely abandoning vectors, but about solving the problems of vector retrieval being "not precise enough" and "losing context."
Vector RAG is like a librarian with good intuition who roughly knows which books are related to your question, but only tears out a few pages for you.
PageIndex is like a rigorous indexing system that opens all pages mentioning the keywords for you and lets you (the LLM) read the entire chapter yourself.
The future trend is hybrid (Hybrid RAG): First, use PageIndex to ensure keyword coverage (precision), then use vectors to supplement semantic omissions (generalization).
Next, would you like me to show you a simple Python code example comparing the differences in actual code implementation between "vector-based retrieval" and "keyword-index-based retrieval"?
How was the original vector RAG done? Why does it sound so bad? How was long text compressed?
This is a very sharp question. The reason vector RAG "sounds bad" is that we were attacking its fuzziness (its weakness) with precision (PageIndex's strength).
In fact, vector RAG is the most mainstream technology currently; it's not "bad," but it sacrifices "precision" for "generality."
To answer your question, we need to delve into the underlying logic of vector RAG to see how it actually "compresses" and "dismembers" long text.
The core logic of vector RAG is transforming text into mathematical coordinates. Its workflow typically involves four steps:
Step 1 (Chunking): First, chop a long document (like a 100-page PDF) into numerous small paragraphs. For example, cut every 500 words into a chunk. Consequence: the originally coherent logic of the article is forcibly interrupted.
Step 2 (Embedding): Use a model (like OpenAI's text-embedding-3) to turn those 500 words into a set of numbers (typically 1536 floating-point numbers). This set of numbers represents the "semantic position" of that text.
Step 3 (Query Embedding): When you ask a question, your question is also turned into a set of numbers.
Step 4 (Similarity Retrieval): The database calculates which text segment's numbers are closest (cosine similarity) to your question's numbers and retrieves those few segments.
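Put into code, the four steps look roughly like this. It is a sketch only: the file path is a stand-in for any long document, and the sentence-transformers model is just one example of an embedding model (the text-embedding-3 model mentioned above would be called through OpenAI's API instead):

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # assumes this package is installed

# Step 1: Chunking -- fixed-size cuts, which is exactly where context gets broken.
def chunk(text, size=500):
    return [text[i:i + size] for i in range(0, len(text), size)]

document = open("contract.txt").read()           # placeholder for a long document
chunks = chunk(document)

# Step 2: Embedding -- each chunk is compressed into one vector.
model = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model
chunk_vectors = model.encode(chunks)             # shape: (num_chunks, dim)

# Step 3: The question is turned into a vector the same way.
question = "What is the compensation amount for breach of contract?"
query_vector = model.encode([question])[0]

# Step 4: Cosine similarity picks the nearest chunks.
def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

scores = [cosine(query_vector, v) for v in chunk_vectors]
top_chunks = [chunks[i] for i in np.argsort(scores)[::-1][:5]]
# top_chunks (shredded fragments) are all the LLM ever gets to see.
```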
This is the most hardcore part of your question. In this process, the text is compressed twice, which is also the source of information loss.
Compression No. 1 (Physical cutting): Imagine watching a movie (long text). An editor cuts the film into countless 30-second short videos (Chunks).
Compression No. 2 (Vectorization): This is the most abstract step. So-called "vectorization" is actually an extremely lossy semantic compression.
Principle: The embedding model reads those 500 words and then tries to summarize what those 500 words are about using 1536 dimensions (numbers).
Analogy: Suppose you want to introduce your friend (long text) to someone.
Complete Introduction (Original Text): "His name is Xiaoming. He likes spicy food. He was bitten by a dog when he was young, so he's afraid of dogs. He just went through a breakup recently..."
Vectorization (After Compression): [Height: 180, Weight: 70kg, Gender: Male, Mood Index: 0.2]
Why is it "bad"?
This compression loses details. If your question is "What happened to Xiaoming when he was young?", looking solely at that set of numbers (height, weight) cannot deduce the detail "bitten by a dog."
Embedding essentially compresses a rich piece of text into a "fuzzy main idea." It remembers "this passage is about personal information," but might forget the specific "phone number."
It's not that the technology itself is bad, but it has inherent flaws when handling knowledge-intensive tasks, mainly manifested in the following three points:
1. Fuzzy matching: Vector search is based on "similarity."
Example: You search for "drinks without sugar."
Vector Result: Might recommend "sugary drinks."
Reason: In mathematical space, the phrases "with sugar" and "without sugar" are very close (because they both talk about sugar and drinks). Vectors struggle to distinguish that crucial word "without." An inverted index (keyword matching) wouldn't make this mistake, because it must match the word "without" (a small sketch after these three points shows how to check this yourself).
2. Broken cross-paragraph logic: Many complex documents (like legal contracts, technical manuals) require understanding across paragraphs.
Scenario: The definition is on page 1, and the specific execution is on page 10.
Vector RAG: It might retrieve the execution steps from page 10, but because it cut off page 1, the LLM has no idea what the proper nouns mentioned mean, so it starts fabricating.
3. Hard to debug: When a user asks a question and the vector database retrieves the wrong fragment, developers find it hard to debug.
You don't know why this string of 1536 numbers and that string of numbers are judged as "similar." This makes the system difficult to optimize.
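Returning to the "drinks without sugar" example under the first point, here is a quick way to check this behavior yourself, assuming the sentence-transformers package is installed. The exact similarity score depends on the model; the point is only that negation words tend to move the vector very little, while a keyword check cannot succeed without the literal token:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model
query = "drinks without sugar"
candidate = "sugary drinks"

# Vector view: one similarity score for the whole phrase.
embeddings = model.encode([query, candidate])
print("cosine similarity:", util.cos_sim(embeddings[0], embeddings[1]).item())

# Keyword view: the match fails unless every query token literally appears.
required = set(query.lower().split())
present = set(candidate.lower().split())
print("keyword match:", required <= present)  # False: "without" (and "sugar") are missing
```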
Despite its many shortcomings, vector RAG has one advantage that PageIndex cannot match: It understands the implied meaning of "human language."
User: "My car won't start."
Doc: "A depleted battery may cause engine start failure."
In this case:
PageIndex (Keyword) might fail because the user didn't say "battery" or "start failure."
Vector RAG (Vector) would succeed because it knows "won't start" and "start failure" are semantically the same thing.
Conclusion: The reason the original vector RAG seems "bad" to you is that it is a "fuzzy generalization," while PageIndex is a "precise localization." The current trend is to stop compressing text. Now that LLMs can read a million tokens at once, we increasingly prefer throwing the uncompressed original text directly at the LLM and letting the smartest model read it itself, rather than relying on that "intermediary who only measures height" (the embedding model).
Why not perform semantic chunking before doing RAG? Why not cut paragraphs finer?
Your intuition is very sharp. In fact, "Semantic Chunking" and "cutting into finer granularity (Small-to-Big)" are precisely the two main directions for current RAG optimization.
It's not that people "don't do" these; it's that they each bring new side effects.
It's like cutting a cake:
Fixed-size chunking (traditional): Cut every 5 cm regardless. Simple and crude, likely to ruin the pattern.
Semantic chunking: Cut along the pattern. Effective, but slow, and hard to define where a "pattern" ends.
Cutting finer: Cut into crumbs. You can eat each crumb, but you don't know if it belonged to the strawberry on top or the crust below.
Let me break down in detail why these two solutions haven't completely solved the problem:
Semantic chunking uses NLP models to judge "has this passage finished?" and cuts only when it's done, rather than rigidly cutting by word count.
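As a reference point, one common way to implement this is to embed every sentence and cut wherever the similarity between adjacent sentences drops. The sketch below assumes the sentence-transformers package and an arbitrary threshold of 0.5; the per-sentence embedding pass is exactly what makes it slow:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model

def semantic_chunks(sentences, threshold=0.5):
    # Start a new chunk whenever adjacent sentences drift apart semantically.
    embeddings = model.encode(sentences)   # one embedding per sentence: the slow part
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        similarity = util.cos_sim(embeddings[i - 1], embeddings[i]).item()
        if similarity < threshold:         # likely topic change -> cut here
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```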
While it sounds perfect, there are three major pitfalls in engineering implementation:
Pitfall 1: It is slow.
Traditional word-count-based chunking can be done with one line of Python code (text[0:500]), taking 0.0001 seconds.
Semantic chunking requires a model to "read" the article, calculate the similarity between adjacent sentences, or have an LLM judge "is this a topic change?" Processing a large file might take minutes or longer. For systems requiring high real-time performance, this is unacceptable.
Pitfall 2: Topic boundaries are ambiguous.
Example: A passage first discusses "product pricing," then immediately discusses "refund policy."
Do you cut after "pricing" is finished? If you do, then when a user asks "What price is used for refunds for this product?", RAG is stumped: "price" is in the previous chunk, "refund" is in this one, and the connection between them is broken.
Pitfall 3: It still misses long-range context.
Even if you perfectly chunk by paragraph, this paragraph might still depend on definitions from pages earlier.
For example, a paragraph on page 10 says: "Execute according to the aforementioned agreement..."
Semantic chunking ensures this passage is complete, but it still doesn't include the "aforementioned agreement" from page 1.
You might think: "If cutting large leads to noise, why not cut at the sentence level? Then use whichever sentence is retrieved, wouldn't that be most precise?"
This touches upon the classic paradox in the RAG field: Retrieval Granularity vs. Comprehension Granularity.
Cutting too fine (e.g., by sentence) leads to the following fatal problems:
Problem 1: Referential relationships get severed.
Original Text: "Musk founded SpaceX. It significantly reduced rocket launch costs."
After Fine-Grained Chunking:
Chunk A: "Musk founded SpaceX."
Chunk B: "It significantly reduced rocket launch costs."
Search: User asks "What reduced launch costs?"
Result: Vector finds Chunk B.
Show to LLM: LLM sees "It reduced costs." LLM asks: "Who is 'it'?"
Outcome: Because it's cut too fine, the referential relationship is lost. This fragment becomes useless data.
Problem 2: Low information density becomes noise.
Vector search requires a passage to have sufficient "information density" for accurate positioning.
If cut into short sentences: "Yes, I agree." or "According to the regulations as follows:"
The vectors generated from these short sentences are extremely generic and lack distinctive features. When users search, these short sentences appear as noise, crowding out the ranking of truly useful information.
Problem 3: The context window fills with isolated fragments.
Suppose the LLM's window can hold 5 chunks.
Large Chunks: You can see 5 complete paragraphs and roughly understand the context.
Extremely Fine Chunks: You can see 5 isolated sentences. It's like giving you 5 jigsaw puzzle pieces and asking you to guess the whole picture—extremely difficult.
To solve the two problems you raised, the most mature solution now is called "Small-to-Big" (retrieve small chunks, feed large chunks to LLM).
This is a compromise that combines the advantages of "cutting fine" and "not cutting apart":
Child Chunk (Small Chunk): Cut the document into extremely fine sentences or small paragraphs (e.g., 100 words). Use these for vectorization and retrieval. Because small chunks have pure semantics, retrieval precision is high.
Parent Chunk (Large Chunk): Record which larger parent paragraph (e.g., 1000 words) or even the entire document this small chunk belongs to.
At retrieval time, the system does not give the matched small chunk to the LLM.
Instead, it finds the chunk's "parent" (Parent Chunk) via ID and throws the complete 1000-word paragraph containing the context to the LLM.
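A minimal sketch of the Small-to-Big bookkeeping: child chunks are what gets indexed and retrieved, each carrying a pointer to its parent, and only the parent is handed to the LLM. The retrieve_child_ids function is a placeholder for the actual vector (or keyword) search over child chunks:

```python
# Parent chunks: large, context-preserving units (e.g., ~1000 words each).
parents = {
    "p1": "Full 1000-word section covering product pricing and the refund policy ...",
    "p2": "Full 1000-word section covering delivery and returns ...",
}

# Child chunks: small, precise units used only for retrieval.
children = [
    {"id": "c1", "parent_id": "p1", "text": "Refunds use the original purchase price."},
    {"id": "c2", "parent_id": "p1", "text": "The product is priced at 99 dollars."},
    {"id": "c3", "parent_id": "p2", "text": "Returns are accepted within 30 days."},
]

def retrieve_child_ids(question):
    # Placeholder: in practice this is a vector or keyword search over child chunks.
    return ["c1"]

def build_context(question):
    hit_ids = set(retrieve_child_ids(question))
    # Map each retrieved child back to its parent, deduplicating parents.
    parent_ids = {c["parent_id"] for c in children if c["id"] in hit_ids}
    # The LLM sees the full parent sections, not the tiny child snippets.
    return "\n\n".join(parents[pid] for pid in parent_ids)

print(build_context("What price is used for refunds?"))
```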
Summary:
Not using semantic chunking -> Because it's too slow; overlapping chunking (Sliding Window, sketched below) is used instead to mitigate boundary issues.
Not using fine granularity for generation -> Because there's no context, it's incomprehensible.
So, the current trend is actually: Use extremely fine granularity to "aim" (index), and use extremely coarse granularity to "read" (generate). PageIndex is essentially the ultimate form of this logic: use keywords to aim, use the full text to read.
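The overlapping (sliding-window) chunking mentioned in the summary is simple enough to show in a few lines; the chunk size and overlap here are arbitrary example values:

```python
def sliding_window_chunks(text, size=500, overlap=100):
    # Fixed-size chunks that overlap, so text near a boundary appears in both neighbors.
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

chunks = sliding_window_chunks("some very long document text ...", size=500, overlap=100)
```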
Does vectorless RAG solve these problems?
This is an excellent follow-up question. The simple answer is: It solves the "chunking" and "context" problems almost perfectly, but it also introduces fatal new flaws of its own.
Vectorless RAG (PageIndex) is not a panacea; it's more like a "counterintuitive" brute-force aesthetic.
We can view it as a trade-off of "exchanging computational power for intelligence."
By abandoning "preprocessing chunking" and directly leveraging the LLM's super-long context capability, vectorless RAG uproots the previously mentioned stubborn issues:
Vector RAG Problem: Cutting breaks mid-sentence, cross-paragraph logic is lost.
Vectorless RAG Solution: Since we don't know where to cut, simply don't cut.
It directly throws the entire page, or even entire chapter, containing the keywords to the LLM.
Effect: Problems like "unclear pronoun reference" and "cross-paragraph definitions" instantly disappear. Because the LLM sees coherent original text, it can understand who "it" refers to and what "the aforementioned agreement" is.
Vector RAG Problem: Searching for "1024" returns "1025"; searching for rare names fails.
Vectorless RAG Solution: Returns to inverted index (Ctrl+F logic).
Effect: Only pages containing the exact word "1024" are retrieved. For hard criteria like contract numbers, SKUs, error codes, and names, accuracy jumps from 70% to 100%.
Vector RAG Problem: The vector database is a black box; it's hard to know why it retrieved that nonsense.
Vectorless RAG Solution: The logic is transparent.
Effect: Why was this page retrieved? Because this page has these three keywords. If retrieval is wrong, it's an issue with the keyword extraction strategy, which is easy to fix.
Everything has a cost. Vectorless RAG essentially sacrifices "semantic understanding" to gain "precise context." This leads to two new pain points:
Pain Point 1: Loss of semantic understanding.
This is the biggest weakness of vectorless RAG.
Scenario: User searches for "how to save money?" The document says "reduce costs by optimizing processes."
Vector RAG: Can find it. Because it knows "save money" ≈ "reduce costs."
Vectorless RAG: Cannot find it. Because the document doesn't contain the words "save money."
Remedy: Perform "Query Expansion" with an LLM before searching, rewriting the user's question into multiple keyword variants (save money -> reduce costs, cut expenses, economize); a sketch follows after the second pain point. But this adds complexity and latency.
Pain Point 2: Cost and latency explode.
Vector RAG: Only shows the LLM 5 fragments (approx. 1000 Tokens). Cheap and fast.
Vectorless RAG: Might show the LLM 10 complete pages (approx. 10,000 to 20,000 Tokens).
Money: Your API bill could skyrocket 10 to 20 times.
Slow: The time to first token (TTFT) when an LLM reads 20,000 tokens is completely different from when it reads 1,000 tokens.
Risk of Getting Lost (Lost in the Middle): Although LLMs claim to support 200k context, experiments show that when the context is too long, LLMs tend to ignore information in the middle and only remember the beginning and end.
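Coming back to the query-expansion remedy mentioned under the first pain point, here is a sketch of how it usually sits in front of a keyword index. Both call_llm and search are placeholders (any chat model and any inverted-index lookup, such as the one sketched earlier); the prompt wording is illustrative only:

```python
def expand_query(question, call_llm):
    # Ask an LLM to rewrite a natural-language question into keyword variants.
    prompt = (
        "Rewrite the following question as 3-5 short keyword phrases that could "
        "appear verbatim in a document. One phrase per line.\n\n"
        f"Question: {question}"
    )
    variants = [line.strip() for line in call_llm(prompt).splitlines() if line.strip()]
    return [question] + variants

def keyword_search_expanded(question, call_llm, search):
    # Run the keyword index once per variant and merge the hits.
    hits = []
    for query in expand_query(question, call_llm):
        hits.extend(search(query))          # e.g., "save money", "reduce costs", ...
    return list(dict.fromkeys(hits))        # deduplicate while keeping order
```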
To help you see the essence, let's use an "open-book exam" analogy:
| Mode | Analogy | Scenario | Result |
| --- | --- | --- | --- |
| Vector RAG | Flashcard Cheating | The teacher doesn't let you bring the book, only allows you to cut the book into pieces and make cheat sheets (Chunks) before the exam. During the exam, you guess which cheat sheet to use based on the question. | If the cheat sheet is cut badly, or you guess wrong, you fail. |
| Vectorless RAG | Open-Book Exam | The teacher allows you to bring the whole book. You quickly flip to the relevant chapter using the table of contents (index), then read from beginning to end to find the answer. | You can understand the complete logic and will never take things out of context. But flipping and reading is slow, and it's tiring. |
No one "replaces" the other. The strongest architecture now is Hybrid RAG, combining strengths and compensating for weaknesses:
One path uses Vector to find "semantically relevant" content (solving the "save money" vs. "reduce costs" problem).
One path uses Keyword to find "exactly matching" content (solving the "contract number" problem).
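One common way to merge the two paths is Reciprocal Rank Fusion (RRF): each retriever returns a ranked list, and a document's final score depends only on its rank in each list, so keyword scores and vector scores never need to share a scale. A minimal sketch (k=60 is the conventionally used constant):

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    # Fuse several ranked result lists into one, rewarding agreement between retrievers.
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# One path from the vector retriever, one from the keyword (inverted-index) retriever.
vector_hits = ["page_10", "page_3", "page_7"]
keyword_hits = ["page_5", "page_10", "page_2"]
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
# page_10 ranks first because both retrievers agree on it.
```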
Conclusion: Vectorless RAG solves the two most painful engineering problems: "context fragmentation" and "precision." But it heavily relies on the LLM's long-text capability and the thickness of your wallet.
Next, would you like to learn about how the specific architecture of "Hybrid RAG" is built? Or see how to optimize the cost of long-text RAG?