Contents
- 1. When embedding moves from experimentation to fundamental capabilities
- 2. Deploying embedding services in a controlled environment
- 3. Summary and Conclusion: Embedding as the Operating Location of Basic Capabilities
1. When embedding moves from experimentation to fundamental capabilities
In a previous article introducing RAG, I mentioned that the core tool for vectorizing segmented text is the "embedding model" (see: Understanding RAG from Scratch (Part 1): Principles and Complete Process Analysis). A subsequent practical article then walked through building a simple local knowledge base with Chatbox and a self-hosted Ollama embedding model (see: Practical application of Ollama's self-built embedding model + Chatbox knowledge base).
At that stage, embedding was more like an "experimental capability": it was used to verify whether the process was feasible, whether the model's performance was acceptable, and whether local deployment could functionally replace cloud services. As long as a complete process could be run once, issues such as stability, operating environment, and long-term maintenance were often not given high priority.
However, as its use cases expand, the role of embedding is subtly changing. Whether for a continuously maintained local knowledge base, or for the potential introduction of RAG, full-text search, similarity calculation, or more engineering-oriented knowledge graphs and AI-enhanced blog features, embedding is no longer a temporary step that can be "generated once". It is gradually evolving into a permanently present basic capability.
When embedding changes from "occasional use" to "repeated use," some previously overlooked issues begin to emerge. For example, should it continue to be attached to a desktop system (e.g., using an Ollama instance on macOS as a long-term runtime environment)? Should it share the same runtime boundary with the operating system, toolchain, and desktop environment used daily? Once system updates, environment changes, or tool upgrades introduce uncertainty, will this fundamental capability be unintentionally compromised?
From an engineering perspective, embedding has very distinct characteristics: it is a non-autoregressive computation process and is not part of the real-time user interaction chain, so it places higher requirements on stability, reproducibility, and consistency of long-term operating behavior. In other words, it is closer to "infrastructure" than to an "application" meant to run on a personal desktop system.
Based on this judgment, I began to re-examine the deployment method of embedding: rather than letting it continue to depend on desktop environments such as macOS and passively drift with system upgrades and tool changes, it is better to clearly separate it and put it into an independent environment with clear operating boundaries, predictable behavior, and controllable update rhythm, so that it can exist as a dedicated service for a long time.
Meanwhile, from a resource perspective, the embedding function is not highly dependent on GPUs, and its overall computing resource requirements are relatively controllable. In most practical scenarios, even a regular Intel CPU or similar configuration is sufficient to support the vectorization needs of daily knowledge base articles. This makes deploying embedding independently on general-purpose computing nodes a cost-effective and stable option from an engineering perspective.
Therefore, I ultimately used PVE to create an LXC container running a pinned Linux release on my home Intel mini PC, dedicated to hosting the Ollama embedding model service. This reduces the impact of system updates and runtime environment changes on the knowledge base building process, while keeping the service controllable and reproducible.
In the following sections, I start from embedding's engineering characteristics to explain why it is suited to independent deployment. I also demonstrate, in a real-world environment, how to use LXC in PVE to deploy a Linux environment dedicated to embedding services, laying a stable foundation for later knowledge base and RAG applications.
Note:
From a purely engineering efficiency perspective, the most economical and stable way to implement embedding functionality is actually to use the embedding models provided by mature commercial services directly, for example the embedding API provided by OpenAI. Such services are far superior to self-built solutions in terms of model quality, stability, and long-term maintenance costs, and they largely avoid the runtime uncertainty issues discussed in this article.
However, in practical use in China, this path is not always feasible. On the one hand, there are objective restrictions on bank cards and settlement methods in the payment process; on the other hand, even if non-mainstream access methods exist, the availability and stability of official APIs are difficult to guarantee. For a basic capability like embedding, introducing additional uncertainty and maintenance costs is actually not worthwhile. Furthermore, some commonly used third-party API service providers (such as OpenRouter and OhMyGPT) almost exclusively provide interfaces for large language models, while embedding models are typically not provided due to limitations in call volume, computing power, and application scenarios.
For this reason, I chose to abandon reliance on external commercial services and instead deploy a Linux environment via LXC in PVE to build my own Ollama embedding model service. This choice was not about pursuing optimal performance or results, but a comprehensive weighing of controllability, sustainability, and overall cost.
2. Deploying embedding services in a controlled environment
2.1 Create an LXC in PVE that "exists only for embedding services".
After clarifying the need to separate embedding capabilities from the desktop system, the next question is not "how to deploy Ollama", but rather: in what kind of environment should embedding capabilities operate?
In my actual scenario, this problem can be broken down into three more specific engineering decisions:
1. Why use LXC?
2. Why choose Debian 12?
3. How many resources does this environment actually need to be considered "reasonable"?
1. Why use LXC instead of VMs or container orchestration solutions?
The operational characteristics of embedding services are actually quite clear: they run persistently; they are stable over a long period; they do not require elastic scaling; and they do not need to be frequently rebuilt or destroyed. Embedding is more like an infrastructure capability than a one-off task or a short-lifecycle service.
Under these circumstances, LXC's advantages over a full virtual machine are very direct:
- It provides a complete Linux system view rather than a highly abstracted runtime.
- Ollama can be natively managed using systemd as a long-running system service.
- It has lower resource consumption, and its startup and maintenance costs are closer to those of a "real server".
As for Docker or more complex container orchestration solutions, they seem over-designed in the embedding scenario: there is no need for multi-instance expansion, rapid image iteration, or complex dependency isolation. Introducing an additional abstraction layer will only increase the mental burden of long-term maintenance.
Therefore, LXC is not the "technologically more advanced" choice here, but rather the choice that most closely matches how embedding services actually operate.
2. Why choose Debian 12?
After deciding to use LXC, the choice of operating system follows the same logic: stability takes precedence over new features.
There are several main reasons why Debian 12 became my final choice:
- The release cadence is stable, the life cycle is clear, and the long-term behavior is predictable.
- It is very widely used in server and container environments.
- Ollama and its underlying components (such as llama.cpp) have higher compatibility and test coverage on Debian-based distributions.
For embedding services, the operating system is not the subject of study, but rather a platform that carries the workload. Choosing Debian 12 is essentially choosing the path with the slowest changes and the fewest surprises.
Note: I chose Debian simply because it's the one I'm most familiar with. If you're familiar with other Linux distributions, you can choose the one you're most comfortable with; there's no need to worry about it.
3. Why don't resource specifications need to be too large?
The last issue is resource allocation. The computational characteristics of embedding were outlined earlier: it is a non-autoregressive, linear computational process that is not on the main chain of real-time user interaction and does not face sustained high-concurrency pressure.
Therefore, the resource allocation goal of this LXC is not "to run as fast as possible", but rather to be stable, predictable, and capable of long-term operation without frequent intervention.
In actual deployment, I assigned the following specifications to this LXC as a reference configuration: CPU: 2 cores; Memory: 4 GB; Disk: 20 GB.
This configuration is more than sufficient for the embedding service: the overall embedding time is mainly affected by the length of the input text and the computational characteristics of the model itself, rather than system resource bottlenecks. In daily use, even processing multiple medium-length Chinese articles consecutively will not put continuous pressure on the CPU or memory.
In other words, configuring excessively high-specification resources for the embedding service will not only fail to bring substantial benefits, but may also obscure its true role in the overall system: it should be a basic service that "works quietly", not a performance node that needs constant attention and optimization.
It should be noted that this resource allocation is not "extreme compression," but rather a conscious choice based on the actual use case of embedding. In the LXC environment of PVE, the cost of resource adjustment and expansion is inherently low; adjustments can be made only if the model size or usage changes in the future.
Having clarified the above three prerequisites, the creation process of LXC becomes relatively simple:



I've explained the specific steps for creating an LXC in PVE in many previous articles, so I won't go into detail here.
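For readers who prefer the command line to the PVE web UI, the reference specification above can be sketched with `pct`, PVE's container CLI. This is a minimal sketch rather than the exact procedure I followed: the container ID `200`, the hostname `ollama-embed`, the storage names `local` and `local-lvm`, the bridge `vmbr0`, and the template file name are all assumptions to adapt to your own node. The function only prints the command so that it can be reviewed before anything is created.

```shell
# Hypothetical values: adjust to your own PVE node before use.
CTID=200
TEMPLATE="local:vztmpl/debian-12-standard_12.7-1_amd64.tar.zst"  # assumed template name

# Print (not run) a pct command matching the reference spec:
# 2 cores, 4 GB memory, 20 GB root disk.
pct_create_cmd() {
  printf 'pct create %s %s' "$CTID" "$TEMPLATE"
  printf ' --hostname ollama-embed'
  printf ' --cores 2 --memory 4096'
  printf ' --rootfs local-lvm:20'
  printf ' --net0 name=eth0,bridge=vmbr0,ip=dhcp'
  printf ' --unprivileged 1 --onboot 1\n'
}

pct_create_cmd
```

After reviewing the output, run the printed command on the PVE host and start the container with `pct start 200`. If the model size or usage changes later, a command such as `pct set 200 --memory 8192` adjusts the allocation in place.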
2.2 Deploying Ollama in Debian 12 LXC
In the previous section, the role of this LXC was clearly defined: it is not a general-purpose computing node, nor an interaction portal, but a long-running component that exists solely for the embedding service. Under these circumstances, Ollama's role here becomes extremely narrow: it is only responsible for loading the embedding model and providing a stable vectorization interface to the outside world.
Because the goal is so clear, deploying Ollama does not require a complicated installation process. From an engineering perspective, the most important thing here is not "playability" or "customizability," but rather the predictability of its behavior: whether the installation process is clear, whether the operation is stable, and whether the upgrade path is well-defined.
The official installation script provided by Ollama perfectly aligns with this approach. It installs pre-compiled official binaries, runs as a systemd service by default, does not rely on GPU or platform features, and does not make any deep, intrusive modifications to the system. For a basic service running in LXC with a single responsibility and requiring long-term stable operation, this "restrained" installation method is actually the most suitable choice.
Therefore, in this LXC, the entire installation process can be completed directly using the official script (Ollama recently changed the installation package format from tar.gz to tar.zst, but many Linux distributions do not include zstd by default, so it needs to be installed separately):
apt update
apt install -y zstd
apt install -y curl
curl -fsSL https://ollama.com/install.sh | sh

In a normal network environment, the installation script can be used directly to complete the process (as shown in the image above). Unfortunately, the network environment in China is not ideal, so a VPN or similar tool is required. Taking the proxy address http://192.168.1.1:8080 as an example:
export http_proxy="http://192.168.1.1:8080"
export https_proxy="http://192.168.1.1:8080"
curl -fsSL https://ollama.com/install.sh | sh
It's worth noting that directly using a global transparent proxy on the outgoing router can also cause problems because `curl | sh` has strict network requirements: the download must be complete, TLS cannot be interrupted, and the shell cannot receive half-finished scripts. The router's transparent proxy may experience instability at the TCP/TLS layer due to resets, fragmentation, or IPv6 fallback, causing the script to fail midway.
Therefore, the safest approach is to explicitly specify the proxy through environment variables, so that all download requests within the script go through the same stable channel.
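The "half-finished script" risk can also be reduced by downloading the installer to a file first and executing it only once the download has completed. The following is a minimal sketch of that alternative, not what the official instructions prescribe; it assumes the proxy address from the example above, and the temporary path and the helper names `looks_like_script` and `install_ollama_via_file` are hypothetical.

```shell
# Proxy address taken from the example above; replace with your own.
PROXY_URL="http://192.168.1.1:8080"
INSTALLER="/tmp/ollama-install.sh"   # hypothetical temporary path

# Succeed only if the file is non-empty and its first line is a shebang.
looks_like_script() {
  [ -s "$1" ] && head -n 1 "$1" | grep -q '^#!'
}

install_ollama_via_file() {
  # Exported variables are inherited by the installer's own downloads.
  export http_proxy="$PROXY_URL" https_proxy="$PROXY_URL"
  # curl exits non-zero on HTTP errors (-f) and on interrupted transfers,
  # so a failed download is never executed.
  curl -fsSL -o "$INSTALLER" https://ollama.com/install.sh || return 1
  looks_like_script "$INSTALLER" || return 1
  sh "$INSTALLER"
}

# Usage (as root inside the LXC): install_ollama_via_file
```

The trade-off is one extra file on disk in exchange for the guarantee that the shell only ever sees a complete script.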
After installation, Ollama will start automatically as a system service. You can check its running status using the systemctl command to confirm that the service is working properly.
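As a rough sketch of that check, the commands below combine a systemd status query with a request to Ollama's version endpoint on the default address. The helper names `ollama_url` and `check_ollama` are hypothetical, and the sketch assumes that `OLLAMA_HOST`, if set, has the plain `host:port` form.

```shell
# Build a URL for the Ollama API; 127.0.0.1:11434 is the default listen address.
ollama_url() {
  printf 'http://%s%s\n' "${OLLAMA_HOST:-127.0.0.1:11434}" "$1"
}

check_ollama() {
  # Fail early if systemd does not consider the service active.
  systemctl is-active --quiet ollama || {
    echo "ollama.service is not active" >&2
    return 1
  }
  # Warn (but continue) if the service would not start on boot.
  systemctl is-enabled --quiet ollama ||
    echo "warning: ollama.service is not enabled at boot" >&2
  # Ask the running server for its version as a basic liveness check.
  curl -fsS "$(ollama_url /api/version)"
}

# Usage inside the LXC: check_ollama
```

For a more detailed view, including recent log lines, `systemctl status ollama` remains the quickest option.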

Note 1: If you need to update the Ollama version, simply rerun the installation script.
Note 2: In this article, Ollama is deployed in the Debian 12 LXC in its default CPU-only operating mode. In embedding scenarios, this configuration is sufficient for most knowledge base construction and daily vectorization needs, so no additional acceleration dependencies are introduced. However, if the host machine is equipped with an NVIDIA or AMD graphics card and the application has higher throughput or concurrency requirements, Ollama also supports GPU acceleration through appropriate parameters and runtime environment configuration (e.g., using NVIDIA Container Toolkit or ROCm). This typically involves a series of prerequisites such as host drivers, container permissions, and runtime parameters, which significantly increases configuration complexity and maintenance costs. Since this article focuses on the long-term stable operation and boundary control of embedding services rather than on extreme performance, the GPU acceleration path is not explained in detail here. Readers with such needs can refer to the official Ollama documentation and the deployment instructions for their GPU platform.
2.3 (Optional) Configure Ollama's listening address and model download directory
After completing the Ollama installation in the Debian 12 LXC, the service listens only on 127.0.0.1:11434 by default. This address means that it can only be accessed from inside the LXC. If you want to access the service from the host machine or other devices on the local area network, you need to change the listening address to the IP assigned to the LXC's network interface (such as 192.168.x.y).
Ollama's listening behavior is not exposed through configuration files, but rather controlled through environment variables of the systemd service. Its service file path is:
/etc/systemd/system/ollama.service
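These variables are not set by default; add the following lines to the [Service] section yourself. systemd only treats a line as a comment when it starts with #, so the comments are placed on their own lines rather than after the values:
# Listen on the LXC network interface IP; the port can be customized.
Environment="OLLAMA_HOST=192.168.x.y:11434"
# Allow all cross-origin requests.
Environment="OLLAMA_ORIGINS=*"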
As shown below:

By default, Ollama stores model files in a system directory. For a long-running infrastructure service like this LXC, decoupling model data from system files makes later migration, backup, and disk management easier. To specify a different model download location, first create the directory and set its permissions correctly:
mkdir -p /ollama/models
chown -R ollama:ollama /ollama
chmod -R 755 /ollama
At the same time, as before, add the following environment variables to the ollama.service file:
Environment="OLLAMA_MODELS=/ollama/models"
As shown below:

Finally, restart the ollama service:
systemctl daemon-reload
systemctl restart ollama
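One caveat worth considering: re-running the installation script (the update path in Note 1) may regenerate /etc/systemd/system/ollama.service, in which case edits made directly to that file would need to be reapplied. The same settings can instead live in a systemd drop-in file, which is kept separate from the main unit. The following is a minimal sketch under that assumption; the helper name `write_ollama_dropin` is hypothetical, `override.conf` is the file name that `systemctl edit` uses by convention, and the address is the example one used later in this article.

```shell
# Write a drop-in holding the listening address and the model directory.
# $1: drop-in directory, $2: listen address (host:port), $3: model directory
write_ollama_dropin() {
  mkdir -p "$1" || return 1
  cat > "$1/override.conf" <<EOF
[Service]
# Listen on the LXC network interface instead of 127.0.0.1
Environment="OLLAMA_HOST=$2"
# Store model files outside the default location
Environment="OLLAMA_MODELS=$3"
EOF
}

# Usage (as root inside the LXC):
#   write_ollama_dropin /etc/systemd/system/ollama.service.d \
#     192.168.10.104:11434 /ollama/models
#   systemctl daemon-reload && systemctl restart ollama
```

The model directory still has to exist and be owned by the ollama user, exactly as in the steps above; the drop-in only changes where the settings are stored.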
You can check which address Ollama is listening on using `ss -ntpl` (or `netstat -ntpl`, if net-tools is installed) to confirm that the configuration has taken effect.
ss -ntpl
netstat -ntpl
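The same check can be scripted so that it gives a yes/no answer instead of a table to read. This is a minimal sketch assuming the example address 192.168.10.104:11434; `has_listener` is a hypothetical helper name, not part of Ollama or of the `ss` tool.

```shell
# Read `ss -ntl` output on stdin; succeed if some socket's local
# address column (the fourth field) equals the given address:port.
has_listener() {
  awk -v want="$1" '$4 == want { found = 1 } END { exit found ? 0 : 1 }'
}

# Usage inside the LXC:
#   ss -ntl | has_listener 192.168.10.104:11434 && echo "listening"
```

If the helper fails while the service is active, the service is most likely still bound to 127.0.0.1, which usually means the environment variable was not picked up or the service was not restarted.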
Ollama outputs the following when running normally:


At this point, Ollama is running stably as a system-level service in the Debian 12 LXC. It is still just a "runtime environment" and has not loaded any models yet. The next section introduces the model files needed for embedding so that the LXC can begin handling actual vectorization tasks.
2.4 Download and run the embedding model
2.4.1 Download the embedding model
In the preceding steps, Ollama has been running stably as a systemd service in the Debian 12 LXC, listening on the specified network address. However, this environment is still just an "empty runtime" at this point: it has not loaded any models yet, nor has it actually undertaken any vectorization tasks.
Ollama's current model library does not offer many embedding options. Among them, nomic-embed-text is a relatively mature and widely used embedding model, and its performance in Chinese scenarios is quite stable. Therefore, when migrating to the Linux environment, I did not change the model but continued to use nomic-embed-text as the foundation for embedding capabilities. The model is downloaded via the ollama pull command:
ollama pull nomic-embed-text
However, before actually executing this command, there is an easily overlooked but crucial premise to explain: Ollama is designed as a typical client-server architecture. ollama serve is the long-running service process, while commands such as ollama pull and ollama run are just client calls. They do not start the service automatically, but communicate with the already running Ollama service through network requests.
In the previous section, to facilitate LAN access, I changed the Ollama service's listening address from the default 127.0.0.1:11434 to the actual IP address of the LXC network interface. However, an `Environment` setting in a systemd unit applies only to that service's process; it is not automatically injected into an interactive shell.
This means that if you execute `ollama pull` directly in the terminal, the client will still attempt to connect to the default 127.0.0.1:11434, resulting in the error message "could not connect to ollama server", even though Ollama is actually running normally at the correct address.

Therefore, before pulling the model, you need to explicitly specify the Ollama service address in the current shell, for example:
export OLLAMA_HOST=192.168.10.104:11434
Alternatively, specify temporarily within a single command:
OLLAMA_HOST=192.168.10.104:11434 ollama pull nomic-embed-text
Once it's confirmed that the client and server are using the same address, the model download can proceed normally.
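To avoid repeating the export in every new shell, the address can also be persisted for login shells. This is a sketch under assumptions: the helper name `persist_ollama_host` and the file name `ollama-host.sh` are hypothetical, and it relies on Debian's /etc/profile sourcing the `*.sh` files in /etc/profile.d.

```shell
# Write a one-line profile snippet that exports the server address.
# $1: target file, $2: server address (host:port)
persist_ollama_host() {
  printf 'export OLLAMA_HOST=%s\n' "$2" > "$1"
}

# Usage (as root inside the LXC):
#   persist_ollama_host /etc/profile.d/ollama-host.sh 192.168.10.104:11434
# Then open a new login shell and confirm the download with:
#   ollama list
```

Because the snippet only affects interactive login shells, it does not interfere with the systemd service, which keeps reading its own `Environment` settings.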

2.4.2 Embedding Model Loading and Execution Method
The responsibilities of the ollama pull phase are limited to downloading the model file and saving it locally. It does not load the model immediately, nor does it launch any new running instance. Whether for model downloading or for subsequent calls, the ollama CLI is essentially just a client; it always relies on an Ollama server that is already available.
The actual model loading occurs the first time the model is successfully called. When an embedding model is first invoked, Ollama loads it into memory in the background and keeps it resident. As long as the instance has not been unloaded, subsequent embedding requests reuse the same running instance without re-initializing the model for each request.
Therefore, verifying whether a model is "truly usable" is not about whether the model was pulled successfully, but about whether a full call chain can be completed. Within the LXC, this can be verified using the following minimal command:
OLLAMA_HOST=192.168.10.104:11434 ollama run nomic-embed-text "Hello world"
Note: The current Ollama CLI does not provide a separate embed subcommand, so ollama run is used here as the minimal verification method.
If the command successfully returns a set of vector data, it means that three things are true simultaneously: the Ollama server is reachable; the embedding model has been successfully loaded into memory; and the entire embedding path from the client call to the vector output is established. Normal output is shown in the following figure:

During this process, I deliberately ran no dialogue model, web UI, or RAG logic in this LXC. Its responsibility is strictly limited to one thing: providing embedding capabilities stably and predictably. This trade-off is not a functional compromise, but a layered choice in engineering: embedding is more like infrastructure than an experimental object that requires frequent intervention and debugging.
At this point, this Debian 12 LXC truly begins to take on the task of vectorization.
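Since other machines will consume the service over the network rather than through the CLI, the same verification can also be made against Ollama's HTTP API. The following is a minimal sketch, assuming the example address 192.168.10.104:11434 and the `/api/embed` endpoint of a reasonably recent Ollama release; the variable `OLLAMA_ADDR` and the helper names `embed_payload` and `embed_text` are hypothetical.

```shell
# Address of the Ollama server inside the LXC (example address from above).
OLLAMA_ADDR="${OLLAMA_ADDR:-192.168.10.104:11434}"

# Build the JSON request body. This does not escape quotes or
# backslashes inside the text, so it only suits simple input.
embed_payload() {
  printf '{"model":"%s","input":"%s"}' "$1" "$2"
}

# Send one embedding request and print the raw JSON response.
embed_text() {
  curl -fsS "http://${OLLAMA_ADDR}/api/embed" \
    -H 'Content-Type: application/json' \
    -d "$(embed_payload nomic-embed-text "$1")"
}

# Usage from any machine on the LAN: embed_text "Hello world"
```

A successful response is a JSON object whose `embeddings` field holds the vector. For arbitrary text, a proper JSON encoder is a safer way to build the request body than `printf`.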
3. Summary and Conclusion: Embedding as the Operating Location of Basic Capabilities
At this point, the Ollama embedding model service based on PVE + LXC has achieved its engineering goals: to run stably in a Linux environment with clear operating boundaries and predictable behavior, and to continuously provide embedding capabilities through a network interface.
This article does not focus on "how to make embedding work," but rather on the question of what kind of environment embedding should run in once it evolves from a one-off experimental capability into a repeatedly invoked fundamental capability. When stability, reproducibility, and long-term consistency become core requirements, the flexibility inherent in desktop systems can actually become a source of uncertainty.
Separating embedding from desktop environments like macOS and placing it in a self-contained environment with a controlled update schedule and clear dependencies essentially reduces long-term risk for the subsequent knowledge base and RAG process. This adjustment does not aim for optimal performance or results, but prioritizes predictable behavior and sustainable operation.
It's worth noting that, when conditions permit, directly using the embedding API provided by mature commercial services remains the more efficient and stable option. The solution discussed in this article is primarily applicable to scenarios where relying on external services is neither feasible nor convenient, but where long-term stable embedding operation is desired.
Based on this foundation, further discussion of model performance, vector database selection, and the specific implementation of RAG becomes meaningful. At least in my experience, clarifying the embedding's execution location is far more important than prematurely stacking higher-level structures.