
Article Summary

This article addresses the pain points of deploying embedding models in a RAG workflow and proposes building a local embedding model with Ollama. After comparing the limitations of cloud service APIs in cost, privacy, and controllability, it chooses Ollama as a lightweight deployment tool and combines it with the Chatbox knowledge base to vectorize and store documents for retrieval-augmented question answering. In practice, the setup handles long-text chunking and lets the embedding model be swapped freely, and querying a personal blog knowledge base needs roughly a hundred times fewer tokens than feeding every article to the model. Local deployment also keeps the data on your own machine, providing a reusable minimum viable loop for building a RAG system.

Qwen3-14B · 2026-06-18

Contents

1. Introduction
2. From Text to Vectors: Understanding Embedding Models
- 2.1 What is an Embedding Model?
- 2.2 Why do we need embedding models?
3. Ollama Embedding Model Deployment in Practice
4. Chatbox creates a knowledge base
- 4.1 Adding Ollama’s embedding model to Chatbox
- 4.2 Create a knowledge base and import markdown documents
5. Using the Knowledge Base in Chatbox
6. Afterword

1. Introduction

In the previous article, I went through the core concepts and workflow of RAG (see: Home Data Center Series: Understanding RAG from Scratch (Part 1): Principles and Complete Process Analysis), but that was still theory on paper. To truly get RAG off the ground, I have to take the first step in practice. The question is: where is the best place to start?

I think the place to start is the embedding model, because in the RAG process the retrieval quality depends heavily on how text is converted into vectors. This step determines whether relevant content can be recalled later.

When it comes to embedding models, many people's first reaction may be to call a cloud service API directly, such as OpenAI's text-embedding-3-small and text-embedding-3-large. These two models are the most widely used in the community (Microsoft Azure hosts the same series). Cohere and some Chinese vendors (for example Zhipu's embedding-2, or hosted versions of BAAI's open-source bge series) provide similar embedding APIs. While this approach is indeed hassle-free and needs no environment configuration, the problems are also obvious:

Cost constraints: As the data scale increases, API call fees will continue to accumulate.
Privacy concerns: The corpus needs to be transferred to a third-party server and may not be suitable for internal or sensitive data.
External reliance: Once the service adjusts prices, limits traffic or even goes offline, your system will be affected.
Difficult to compare and replace: Switching between different suppliers is not easy and the cost of experimentation is high.

For these reasons, I prefer a self-built embedding model. This not only gives you full control over the data and environment, but also lets you switch models flexibly and compare their results freely, which is better suited to learning and exploration.

Among self-built options, Ollama (readers who are not familiar with Ollama can refer to my other article: Home Data Center Series: Building Private AI: A Detailed Tutorial on Building an Open Source Large Language Model Locally Based on Ollama) is a good starting point. It is designed specifically for local deployment and wraps the entire "download, load, call" process very concisely. A few commands are enough to run an embedding model on your computer without dealing with the underlying inference framework or worrying about compatibility issues, which makes it well suited to readers who want to get started quickly.

However, before putting "a self-built embedding model with Ollama" into practice, you need to understand what an embedding model is.

2. From Text to Vectors: Understanding Embedding Models

2.1 What is an Embedding Model?

If I had to explain it in one simple sentence: an embedding model is a model that converts text into a vector of numbers. Does that sound a bit abstract?

For example, suppose you have a sentence - "Apple is a fruit", when you give it to the embedding model, it will return a set of numbers, which may have hundreds or even thousands of dimensions. For example:

[0.12, -0.08, 0.95, ... , 0.34]

Regarding "dimension", we can understand it this way: if there are only two numbers, it is like marking a point on a two-dimensional plane (X, Y coordinates); if there are three numbers, it is a point in three-dimensional space (X, Y, Z); and when there are hundreds or thousands of numbers, the sentence is placed in a "high-dimensional semantic space". It is difficult for us humans to imagine what 1000 dimensions would look like, but for computers, it is just a lot more coordinate axes. Generally speaking, the more dimensions there are, the finer the distinctions the representation can express.

For example, describing a person with only "height and weight" is a two-dimensional feature. Adding "age, occupation, and hobbies" increases the dimensionality and makes the description more accurate. Embedding models work similarly, using hundreds or thousands of dimensions to characterize text allows for more comprehensive capture of semantic details.

The individual numbers have no intuitive meaning on their own, but together they determine "where this sentence is located in the semantic space." In other words, the embedding model is like a translator, translating the natural language we can understand into "semantic coordinates" that computers can operate on.

With this representation, the computer can determine the similarity between two texts by comparing the distance between the coordinates. For example:

“Apple is a fruit” and “Banana is a fruit” are very close in the embedding space;
"Apple is a fruit" and "Apple has released a new mobile phone" both contain "apple", but their positions are relatively far apart.

This is the magic of the embedding model: it no longer relies on simple keyword matching, but can capture the proximity at the "semantic level".
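The "distance between coordinates" idea can be sketched with cosine similarity, one common way to compare embedding vectors. The three-dimensional vectors below are invented purely for illustration; a real embedding model returns hundreds of dimensions, and its actual values will differ.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "embeddings", made up for this example only.
apple_is_fruit = [0.9, 0.1, 0.0]
banana_is_fruit = [0.8, 0.2, 0.1]
apple_new_phone = [0.1, 0.2, 0.9]

fruit_pair = cosine_similarity(apple_is_fruit, banana_is_fruit)
mixed_pair = cosine_similarity(apple_is_fruit, apple_new_phone)
print(f"fruit vs fruit: {fruit_pair:.3f}")
print(f"fruit vs phone: {mixed_pair:.3f}")
```

With these toy values the two fruit sentences score much higher than the fruit and phone pair, which is the behavior described above. A score near 1 means the vectors point in almost the same direction.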

In RAG, the role of the embedding model is to translate your question and the documents in the knowledge base into the same semantic space, and then complete the search based on which documents are closest.

So why do we need this method instead of using traditional search directly? The answer involves the limitations of keyword search and the advantages of semantic vectors.

2.2 Why do we need embedding models?

2.2.1 Limitations of Keywords

In the previous section, we saw that embedding models convert text into numerical vectors, and that more dimensions generally allow a more refined semantic expression. So why do we need this kind of vector representation instead of traditional keyword search? Because keyword search has its limitations. Simply put, keyword search only looks at the literal words, not at what they mean. This causes several obvious problems:

Mismatch: For example, if you search for "nutritional value of apples", and the document says "Apple releases new products", the keyword "apple" is matched, but the semantics are completely irrelevant.
Insufficient recalls: You searched for "benefits of bananas", and the document said "bananas are rich in potassium". The keyword match may not be accurate enough, and the ranking is low, or even not retrieved.
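Both problems can be reproduced with a deliberately naive keyword matcher. This is only a sketch: the scoring function, the query, and the two documents are made up for illustration, and real search engines use far more sophisticated ranking.

```python
import re

def keyword_score(query, document):
    """Count how many distinct query words appear literally in the document."""
    query_words = set(re.findall(r"[a-z]+", query.lower()))
    doc_words = set(re.findall(r"[a-z]+", document.lower()))
    return len(query_words & doc_words)

query = "apple nutrition"
documents = [
    "Apple releases a new phone",       # shares a word, unrelated meaning
    "This fruit is rich in vitamin C",  # relevant meaning, no shared word
]

scores = {doc: keyword_score(query, doc) for doc in documents}
for doc, score in scores.items():
    print(score, doc)
```

The phone news gets a match because it contains the word "apple" (a mismatch), while the document that actually talks about nutrition scores zero because it shares no literal word with the query (insufficient recall).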

2.2.2 Advantages of Semantic Vectors

Embedding models map text into a high-dimensional space and use "semantic distance" to measure similarity. In other words, they don't look at the literal meaning of the text, but rather at the similarity of its meaning. For example:

Text A: "Apple is a fruit"
Text B: "Banana is a fruit"
Text C: "Apple released a new phone"

In the vector space, the vectors of A and B are very close because they both describe the semantics of fruit; the distance between A and C is farther, even though they both contain the keyword "apple".

In this way, when searching for "nutritional value of apples", the system can prioritize recalling truly relevant fruit documents rather than news about Apple.

Note: Vectors are a universal language in the AI world and are very important. Those interested can refer to my other article: Home Data Center Vector Series: The Universal Language in the AI World.

2.2.3 The Value of Embedding Models in the RAG Process

In the Retrieval-Augmented Generation (RAG) process, embedding models play a crucial role: they not only convert text into vectors but also determine what information the model "sees" during the retrieval phase, directly impacting the quality of the generated results. Imagine asking the model, "What is the nutritional value of apples?" The knowledge base contains both documents such as "Apples are rich in vitamin C" and news reports such as "Apple releases a new phone." If the embedding model is accurate, the model will primarily retrieve documents related to the former, and the generated answer will naturally be correct. However, if the embedding model is not good enough, the model may also consider irrelevant news as reference, resulting in a biased response.

The embedding model also improves the controllability of RAG: it clearly divides the semantic space, allowing the generative model to only consider relevant content and reduce interference from irrelevant information. For beginners, this can be understood as "helping the model filter out noise," making the answers more reliable and predictable.

Embeddings are also highly interpretable: you can quantify the relevance between questions and documents, view the top five most relevant documents and their similarity scores, and clearly understand the model's input. This is extremely helpful for experimentation and tuning: switching embedding models might reveal completely different top-relevant documents, and thus, the generated answers, too. This makes performance differences observable and analyzable.
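Inspecting the most relevant documents and their scores can be sketched as a small top-k ranking function. The document titles and three-dimensional vectors here are hypothetical stand-ins; in a real setup the vectors would come from the embedding model, and a knowledge base tool would do this ranking internally.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vector, indexed_docs, k=5):
    """Return up to k (score, title) pairs, highest similarity first."""
    scored = [
        (cosine_similarity(query_vector, vector), title)
        for title, vector in indexed_docs.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Hypothetical index: title -> made-up embedding vector.
index = {
    "Apples are rich in vitamin C": [0.9, 0.1, 0.0],
    "Bananas are rich in potassium": [0.8, 0.2, 0.1],
    "Apple releases a new phone": [0.1, 0.2, 0.9],
}
query_vector = [0.85, 0.15, 0.05]  # stand-in for an embedded question about fruit

results = top_k(query_vector, index, k=2)
for score, title in results:
    print(f"{score:.3f}  {title}")
```

Running the same ranking with vectors from a different embedding model and comparing the two lists is one simple way to make the performance differences mentioned above visible.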

Finally, the value of embedding models lies in their scalability: vectors can be stored in a database and used in a variety of scenarios, including question-answering, recommendation, and similarity search. Once a reliable embedding system is established, the retrieval and generation loop can be smoothly extended by adding new documents or switching to a generative model.

In short, for beginners, mastering the embedding model first is like laying the foundation for RAG. With this solid foundation, subsequent adjustments to the segmentation strategy, search parameters, or model generation will yield more controllable and intuitive results. This is precisely why I chose the embedding model as my first step in practicing RAG.

3. Ollama Embedding Model Deployment in Practice

3.1 Preparing the Ollama environment

This part depends on the platform you use. For example, I use an M4 Pro Mac mini. For detailed steps on downloading and deploying Ollama and using large language models, please refer to my previous article: Home Data Center Series: Building Private AI: A Detailed Tutorial on Building an Open Source Large Language Model Locally Based on Ollama. I will not repeat them here.

However, the macOS version of Ollama now allows you to directly set the model download location in the GUI settings and enable the "cross-domain access" function, which is much more convenient:

As for the cross-domain settings of other versions of Ollama, you can refer to my other article: Deploying Llama 3.2 on a Home Data Center Mac mini (M4 Pro): A Complete Guide to High-Efficiency Operation and Cross-Domain Access Optimization with Ollama.

3.2 Choose an embedding model according to your needs

Among the models supported by Ollama (see: https://ollama.com/library), there are several mainstream embedding models to choose from, each with different accuracy, scale, and resource consumption. To make them easier to compare, I categorize them into small, medium, and large:

Scale	Model name	Model size	Suitable scenarios
Small	nomic-embed-text	36.4M	Blog articles, personal notes, small knowledge bases, fast deployment, low resource consumption
Small	snowflake-arctic-embed 22M/33M	22M / 33M	Small text base, fast experimentation, low hardware requirements
Small	granite-embedding 30M	30M	Single-machine small-scale text vectorization, CPU can run
Medium	snowflake-arctic-embed 110M/137M/335M	110M / 137M / 335M	Enterprise documents, cross-departmental knowledge bases, hundreds of thousands of documents, requiring high semantic accuracy
Medium	granite-embedding 278M	278M	Medium-sized enterprise documents, multi-document semantic retrieval, requires GPU or high CPU memory
Large	mxbai-embed-large	335M	Large-scale enterprise knowledge bases, complex question-answering systems, multi-language scenarios, and high-memory GPUs are recommended.
Large	snowflake-arctic-embed2	568M	Multilingual knowledge base, large-scale vector library, high precision requirements, high-performance server required

1. Small Model

These models are small, such as nomic-embed-text (36M), the 22M or 33M versions of snowflake-arctic-embed, and the 30M version of granite-embedding. Their advantage is that they are lightweight and fast, and they can run on almost any modern personal computer. For blog posts, personal notes, or small-scale knowledge bases, this type of model is more than enough. The deployment cost is low: a CPU can run them, and a GPU makes them even faster. Memory usage is also low, and 8-16GB of RAM is usually enough. If you want to vectorize blog .md files, a small model is the most economical and practical choice.

2. Medium-sized model

Larger models, such as the 110M, 137M, and 335M versions of snowflake-arctic-embed or the 278M version of granite-embedding, are suitable for processing enterprise-level documents or cross-departmental knowledge bases, maintaining high semantic accuracy even with hundreds of thousands of documents. These models have slightly higher hardware requirements: a GPU with 4–12GB of video memory is ideal; a CPU can run them, but at a slower speed, requiring approximately 16–32GB of memory. Medium-sized models offer a good balance between accuracy and resource consumption, making them the preferred choice for formal business scenarios.

3. Large Models

Models such as mxbai-embed-large (335M) or snowflake-arctic-embed2 (568M, supports multiple languages) are designed for large-scale enterprise applications. They can handle repositories of more than one million documents, offer strong semantic understanding, and are suitable for complex question-answering systems or multilingual search scenarios. However, their hardware requirements are significantly higher: a high-performance GPU with at least 16GB of video memory, at least 32GB of RAM, and more vector storage space. For typical personal blog posts, using a large model is not only a waste of resources, but also expensive to deploy and slow to run.

For most personal or small-scale text libraries, especially for vectorizing blog post .md files, a small model is enough. It is lightweight, fast, and low-cost, and lets you experience the complete RAG process locally. Medium or large models are only necessary when processing large-scale enterprise documents or in scenarios with extremely high requirements for semantic accuracy.

Based on the above analysis, my requirement is to vectorize local ".md" format blog articles, and the most suitable embedding model is "nomic-embed-text".

3.3 Pulling the appropriate embedding model in Ollama

As analyzed in the previous section, the most suitable embedding model for my scenario is "nomic-embed-text", so the next step is to pull the model in Ollama:

ollama pull nomic-embed-text

If successful, the result is as follows:

You can view the model information using the following command:

ollama show nomic-embed-text

3.4 Testing the Embedding Model Usability

Use the following command to test whether the "nomic-embed-text" embedding model can properly vectorize the sentence "The sky is blue because of Rayleigh scattering" through the API:

curl http://localhost:11434/api/embeddings \
  -d '{"model": "nomic-embed-text", "prompt": "The sky is blue because of Rayleigh scattering"}'

If successful, you will get a set of vector outputs:

Note: If you use other embedding models, you need to replace the vector model in the above command with your actual model name.
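The same request can also be made from a script. The following is a minimal sketch using only the Python standard library; the helper names are my own, and it assumes the service listens on the default address used in the curl command and that the response body has the form {"embedding": [...]}.

```python
import json
import urllib.request

# Same endpoint as the curl command above (assumes the default local address).
OLLAMA_URL = "http://localhost:11434/api/embeddings"

def build_embedding_request(model, prompt, url=OLLAMA_URL):
    """Build (but do not send) a POST request equivalent to the curl command."""
    payload = json.dumps({"model": model, "prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def parse_embedding_response(body):
    """Extract the vector from a response body of the form {"embedding": [...]}."""
    return json.loads(body)["embedding"]

def embed(model, prompt):
    """Send the request to a running Ollama service and return the vector."""
    request = build_embedding_request(model, prompt)
    with urllib.request.urlopen(request, timeout=30) as response:
        return parse_embedding_response(response.read())

# Usage (requires the Ollama service to be running locally):
# vector = embed("nomic-embed-text", "The sky is blue because of Rayleigh scattering")
# print(len(vector))
```

Building and parsing are kept separate from the network call so that each piece can be checked without a running service.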

3.5 Tips: How Ollama works

In Docker, the familiar workflow is to first pull an image using docker pull , then start a container using docker run . Without the run step, the image simply sits on your local disk, completely unusable. This logic can easily lead to a stereotype: running is the only key action that truly brings something to life.

Although Ollama's command names look similar, its operating mechanism is completely different. It also provides an "ollama pull" command to download models and an "ollama run" command that looks very much like Docker's run on the surface, but the essence is not the same: the core of Ollama is a resident service that runs in the background of the local machine after installation (by default it listens on http://127.0.0.1:11434). As long as the service is running, local models can be called at any time through the API without being started manually from the command line. The "ollama run" command is therefore more of a convenient way to try a model interactively.

For example, when you execute "ollama run llama3" in the terminal, it essentially just sends a request to the backend service and prints the result in the command line, making it easy to try it out. For generative models, this is indeed an intuitive way to experience it. However, for embedding models, it does not support generating text at all, so using the command "ollama run nomic-embed-text" will result in an error:

In real-world applications, both text generation and embedding are performed through API calls. Generative models output text via POST /api/generate, while embedding models return vectors via POST /api/embeddings. These APIs are Ollama's actual delivery interface and the approach you should rely on in production environments. Therefore, unlike in Docker, the "run" command in Ollama is not a required switch; it simply lets users experiment with models on the command line.

According to this mechanism, many friends may worry that if Ollama can be called without running, does it mean that all models are running in the background all the time? In fact, this is not the case.

Ollama's models are stored locally as files on disk and are only loaded into memory or the GPU when an API call is made. After the call is complete, the models are temporarily retained in memory as a cache to ensure a fast response next time. However, if resources are limited or the service is restarted, these models are unloaded to free up system resources. In other words, the models do not consume computing power endlessly, but are instead started and released on demand. This mechanism ensures immediacy of calls while avoiding resource waste.
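How long a model stays cached can be influenced per request. As far as I know, Ollama's API documentation describes a `keep_alive` request field for this (with a default of five minutes); the helper below is a hypothetical sketch that only builds the JSON body and does not contact the service.

```python
import json

def embedding_payload(model, prompt, keep_alive=None):
    """Build the JSON body for POST /api/embeddings, optionally with keep_alive."""
    body = {"model": model, "prompt": prompt}
    if keep_alive is not None:
        # A duration string such as "30m", or 0 to unload right after the request.
        body["keep_alive"] = keep_alive
    return json.dumps(body)

default_body = embedding_payload("nomic-embed-text", "hello")
pinned_body = embedding_payload("nomic-embed-text", "hello", keep_alive="30m")
unload_body = embedding_payload("nomic-embed-text", "hello", keep_alive=0)

print(default_body)
print(pinned_body)
print(unload_body)
```

Leaving the field out keeps the service default, which is usually fine for a personal knowledge base.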

4. Chatbox creates a knowledge base

4.1 Adding Ollama’s embedding model to Chatbox

After deploying the local Ollama embedding model and testing its usability, the next step is to connect it to Chatbox as a callable model provider (readers who are not familiar with Chatbox can refer to my other article: The most convenient AI app front-end for the home data center series: Chatbox: A comprehensive introduction and usage guide).

First, open the model management interface of Chatbox and find the "Add model provider" or similar entry. Select Ollama as the provider and fill in the address of the local service, which is usually http://localhost:11434. Next, click the "Get" button to refresh, and in the list of available models select the embedding model you downloaded previously, such as "nomic-embed-text", as shown below:

Then, choose to set the model type to embedding and click the settings button to the right of "nomic-embed-text":

Select "Embedding" from the "Model Type" drop-down menu and save:

success:

4.2 Create a knowledge base and import markdown documents

The knowledge base is created:

After successfully importing all .md format articles:

Note: The chunking (in the red box) in the image above means breaking a long article into smaller segments, much like breaking a book into pages. This is because embedding models can only process text of a limited length at a time. If the entire article isn't chunked, the model won't be able to fit the content and will likely miss information. After chunking, each segment is converted into a vector and stored in the knowledge base. Later, when a user asks a question, the system can quickly find the most relevant segments, rather than searching through the entire article.
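Chunking itself is easy to sketch. The function below splits text into fixed-size character windows with a small overlap so that sentences cut at a boundary still appear whole in one of the neighbors. The sizes and the character-based strategy are assumptions for illustration; Chatbox's actual chunking logic is not documented here and may work differently.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into windows of chunk_size characters that overlap by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks

article = "0123456789" * 12  # stand-in for a 120-character article
chunks = chunk_text(article, chunk_size=50, overlap=10)
for number, chunk in enumerate(chunks, start=1):
    print(number, len(chunk))
```

Each resulting chunk would then be sent to the embedding model separately, which keeps every request within the model's input limit.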

5. Using the Knowledge Base in Chatbox

Then, construct a reasonable question, prompting the AI to refer to the content in the knowledge base to make a judgment, such as: "Refer to my blog posts in the knowledge base, compare my understanding of different technical fields, and point out my strengths and areas for improvement." The result is as follows:

The effect is very good. My biggest headache before was to let ChatGPT analyze the advantages and disadvantages of some of my articles. It was okay for one article, but it was very troublesome to analyze a series of articles. Now that I have a knowledge base and imported all the blog articles, this requirement has become very simple.

In addition, the total tokens I spent this time asking the question were only 4345:

Compared to the hundreds of thousands of tokens that might be required to feed all articles into the model at once, this embedding-based retrieval-augmented approach uses on the order of a hundred times fewer tokens. This not only significantly reduces costs, but also keeps the model's answers focused on the most relevant content, making the evaluation both efficient and accurate.

Why does this work? In the RAG process, the system doesn't cram the entire knowledge base into the context. Instead, it first uses embedding vectors to quickly identify the document fragments most relevant to your question. These fragments are then combined with the prompt and passed to the model. This way, irrelevant content doesn't occupy tokens, giving the model a cleaner "view" and a more focused reasoning space. The result is both cost-effective and reliable.
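The "combine fragments with the prompt" step can be sketched as plain string assembly. The template wording, the function name, and the sample fragments below are hypothetical; Chatbox builds its own prompt internally and its format is not shown in this article.

```python
def build_rag_prompt(question, retrieved_fragments, max_fragments=3):
    """Combine the top retrieved fragments and the question into one prompt string."""
    selected = retrieved_fragments[:max_fragments]
    context = "\n\n".join(
        f"[{number}] {fragment}" for number, fragment in enumerate(selected, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

# Hypothetical fragments, already sorted from most to least relevant.
fragments = [
    "Ollama runs as a resident background service.",
    "Embedding models return vectors via the API.",
    "Chunking splits long articles into segments.",
    "An unrelated paragraph about network cabling.",
]

prompt = build_rag_prompt("How does Ollama serve embedding models?", fragments)
print(prompt)
```

Only the selected fragments reach the model, so the prompt length depends on `max_fragments` and the chunk size rather than on the size of the whole knowledge base.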

In a previous article, I mentioned that the RAG process can be divided into five core steps: Split text → Vectorize → Store → Retrieve → Generate answers. The knowledge base function provided by Chatbox handles three of them for us: text splitting, vector storage, and retrieval, so documents are stored neatly and the most relevant content can be found quickly. The second step, vectorization, still requires us to provide an embedding model ourselves.

Note: LobeChat's knowledge base and Chatbox's knowledge base are similar in core functionality: both segment and vectorize documents for storage, and retrieve the most relevant content when users ask questions. They differ slightly in how they handle embedding models: LobeChat has a built-in embedding model by default and automatically generates vectors when documents are uploaded, without user intervention; Chatbox, on the other hand, can use user-provided embedding models (such as a local Ollama model). In theory this gives you control over the version and quality of the embedding model, but in practice, beyond selecting a model and uploading documents, adjustable options such as chunk size, overlap, and vector storage method remain limited. In other words, LobeChat aims to be ready to use out of the box, while Chatbox offers a choice of model, but its customization and controllability aren't significantly better than building your own LobeChat server.

It's important to note that LobeChat's knowledge base has a certain deployment barrier: full functionality requires enabling a server-side database and configuring S3-compatible object storage (such as MinIO, COS, or OSS). Domain name and cross-domain access settings also need to be handled, which makes LobeChat's knowledge base somewhat cumbersome for personal or lightweight scenarios. Chatbox, on the other hand, is more lightweight and particularly well-suited to personal knowledge bases that are updated only occasionally. Its only real complication is the need to provide your own embedding model.

6. Afterword

After this series of operations, I finally managed to run a complete, minimally viable RAG system: from deploying the embedding model in Ollama, to creating a knowledge base in Chatbox, to uploading blog posts and holding retrieval-augmented conversations. While the process involves many steps, the core logic is very clear: first, make the embedding model stable and controllable, then build the knowledge base, and finally generate answers with a large language model. This is the first bridge from theory to practice, and also the most "visible" breakthrough point in the RAG process.

In practice, I've also discovered some interesting insights. For example, many readers and beginners may assume that a vector database must be built before using the knowledge base. In fact, solutions like Chatbox, which come with built-in SQLite, are more than sufficient for data volumes at the level of a personal blog. As long as the embedding model works properly, the system will immediately generate vectors upon uploading a document and make them searchable. This facilitates rapid verification and iteration, significantly lowering the barrier to entry for RAG implementation.

Another lesson I learned is that many people (including me), influenced by the "docker run" mindset, mistakenly believe that embedding models must be started from the command line before they can be invoked. In reality, an embedding model in Ollama is exposed through a service that frontends can call across domains via an API. "ollama run" simply provides a local interactive experience and is not required. This design makes deployment more flexible and makes it easy to use the same embedding model across multiple terminals and frontends.

Finally, I want to emphasize that while this article demonstrates a minimal viable pipeline, RAG's potential is far greater than this. You can later optimize the chunking strategy, introduce larger vector databases, adjust search parameters, or even enhance search capabilities by combining multiple models. What I've demonstrated today is only the first step in putting the theory into practice: a "minimum viable closed loop." Subsequent iterations and optimizations are entirely in your hands.

Note: Starting from a certain version of Chatbox, its built-in document parser may generate text structures that do not conform to the expected input of the embedding model when processing long Chinese articles. This causes the locally deployed Ollama embedding model (such as nomic-embed-text) to frequently trigger the "the input length exceeds the context length" error when new articles are added to the knowledge base.

It's worth noting that when the same embedding model is called directly via the Ollama API with equivalent or even longer text, the embedding request completes normally. This indicates that the problem is not with the embedding model itself, nor is it entirely caused by the runtime environment (macOS/Linux) or hardware computing power. It is more likely due to anomalies in intermediate results introduced by Chatbox's built-in parser during the document parsing, text concatenation, or chunking stages.

Because the parsing logic, segmentation strategy, and intermediate text of this parser are all unobservable and unconfigurable, this problem is currently difficult to avoid by changing the embedding model or the operating platform when using the built-in parser. Therefore, at least under current conditions, there is a clear compatibility boundary between the Chatbox knowledge base functionality and a local Ollama embedding model when processing long Chinese documents.

In simpler terms: until Chatbox officially fixes or refactors its built-in document parser, its knowledge base functionality is not suitable for directly using a local Ollama embedding model to process long Chinese articles. In this setup, the problem is not something that can be solved by simply "adjusting it a bit more"; it is limited by the implementation of the parsing chain itself.
