Contents
- 1 Preface
- 2 Build your own open source large language model based on Ollama
- 3 Installing the llama3.2 model in Ollama and general use
- 4 Afterword
Preface
I have written several articles about the basics of AI large language models, including an introduction to local large language model UIs and API providers (see: Starting the AI journey: A detailed introduction to local large language model UI and large language model API providers), a Lobechat installation and deployment tutorial (see: Docker series based on the open source large language model UI framework: Lobechat detailed deployment tutorial), and a detailed setup and usage tutorial (see: Home Data Center Series Unlock the full potential of Lobechat: A complete guide from setup to actual use). However, the API providers covered in those articles are basically well-known commercial providers, such as OpenAI (GPT series), Google (Gemini series), and Anthropic (Claude series).
However, in addition to commercial API providers, there are also open source large language model frameworks that can be built locally, such as the famous Ollama. This article uses the deployment of a large language model under the Ollama framework as an example to demonstrate how to build your own open source large language model locally.
Note: The hardware requirements for running a local open source large language model are not low. Models with a large parameter scale demand a lot of memory, and if you want fast responses, an NVIDIA graphics card is needed, so there is still a fairly high threshold. Because I installed it on a mini PC with an Intel CPU and no discrete graphics card, it can only rely on the CPU, so the response speed is hard to describe.
Build your own open source large language model based on Ollama
About Ollama
Ollama is a framework focused on running and managing large language models (LLMs) locally (official website: https://ollama.com/). It provides a user-friendly interface and tools for conveniently loading, managing, and running a variety of large language models. Without relying on cloud servers, Ollama lets users run LLMs efficiently on their own computers, and it also offers a wealth of extended functions. The core features of Ollama are as follows:
- Run locally: Ollama supports running large language models directly on local hardware, which is particularly attractive for users who are sensitive to data privacy because the data is completely saved locally.
- Model support: Ollama is compatible with a variety of popular models, and users can choose and adjust the appropriate model according to different task requirements. The available models include open source LLMs, some of which approach the quality of models like GPT-4.
- Friendly interface: Provides intuitive UI and CLI (command line interface). Users can interact with the model through a graphical interface and execute model commands in the terminal.
- Easy installation and management: Ollama allows users to easily download, install and update models, manage multiple model instances, and improve the ease of use and flexibility of local models.
- Customization and Extension: Supports fine-tuning and parameter adjustment of multiple models, and users can optimize the model effect according to specific tasks. In addition, Ollama allows local models to be used for personalized application development, allowing them to be seamlessly integrated into different projects.
Ollama’s application scenarios include but are not limited to:
• Data privacy protection: Ollama is suitable for use in enterprise, organizational, and personal development environments where high privacy and data locality are required.
• Development and testing: Ollama is an ideal tool for researching and testing LLMs, helping developers debug model performance locally.
• Resource optimization: Complex models can be run without relying on cloud services, suitable for projects that require local computing.
Overall, Ollama is suitable for users who want to use large language models in a local private environment and want flexible management of model selection and parameters.
Local large language models supported by Ollama
Ollama supports a variety of open source and optimized large language models that can easily be run locally without relying on a network connection, and that can stand in for commercial large language models in terms of function and performance. The following are some common locally runnable models supported by Ollama and their applications:
1. LLaMA (Large Language Model Meta AI)
• Introduction: An open source model family launched by Meta, available in different sizes (such as LLaMA-7B, LLaMA-13B, and LLaMA-30B). It performs well across many tasks and can run locally, making it suitable for common language tasks such as text generation and question answering. Llama 3 is the third generation of the series open sourced by Meta AI; its 8B and 70B parameter models deliver a large performance jump over Llama 2. Thanks to technical improvements in pre-training and post-training, the Llama 3 models are among the best at the 8B and 70B parameter scales. The improvements also greatly reduced the false rejection rate, improved consistency, and increased the diversity of model responses, along with notable gains in reasoning, code generation, and instruction following.
• Commercial models it can replace: Similar to OpenAI's GPT-3 and GPT-3.5; it is especially suitable for small projects and personal use by developers.
2. Mistral
• Introduction: Mistral is a performance-optimized open source model that achieves excellent results with less computation and can run efficiently on resource-constrained devices. Its architecture strikes a good balance across multiple tasks.
• Commercial models it can replace: It can stand in for GPT-3-class commercial models, and is especially suitable for applications that require small-model deployment and low computing resources.
3. Falcon
• Introduction: Falcon is a powerful open source model launched by TII (Technology Innovation Institute) with strong natural language generation and understanding capabilities. Falcon 40B performs well on multiple benchmarks and is suitable for a variety of complex text tasks.
• Commercial models it can replace: It can stand in for commercial models at the GPT-3.5 and GPT-4 level. Its performance is close to ChatGPT in some scenarios, and it handles complex tasks well.
4. Vicuna
• Introduction: A model instruction-fine-tuned on top of LLaMA, specifically optimized for instruction-following tasks. Vicuna's training data comes from large-scale user conversation data, making it well suited to dialogue scenarios.
• Commercial models it can replace: Suitable for replacing models like ChatGPT or Anthropic's Claude series, performing especially well in interactive tasks such as conversation, question answering, and customer support.
5. Code LLaMA
• Introduction: An LLaMA variant from Meta optimized for code generation tasks, supporting code completion, generation, and explanation in multiple programming languages. Designed specifically for developers, it improves accuracy on code generation tasks.
• Commercial models it can replace: A suitable alternative to GitHub Copilot or OpenAI's Codex, with similar capabilities in code completion and programming support.
6. GPT-J & GPT-NeoX
• Introduction: A family of open source models developed by EleutherAI with different parameter scales (such as GPT-J-6B and GPT-NeoX-20B, etc.), designed to support various natural language generation tasks with excellent performance.
• Commercial models they can replace: GPT-J and GPT-NeoX can stand in for GPT-3-level commercial models, suitable for everyday conversation and generative tasks.
7. Dolly
• Introduction: An open source model launched by Databricks that supports localization and customized fine-tuning and is good at handling simple dialogue tasks and question and answer.
• Commercial models it can replace: Suitable for replacing models like Bard or GPT-3, especially in small projects that require personalized fine-tuning.
8. Bloom
• Introduction: A large open source model launched by the BigScience collaborative project that supports multi-language processing, excels in long text generation tasks, and performs well in multi-language applications.
• Commercial models it can replace: It can stand in for commercial models such as GPT-3.5 or Claude 2, and is particularly suitable for tasks that require multilingual support.
Summary
There are other models that I will not list one by one. With Ollama's support, these models can be run locally, providing a low-cost alternative to commercial models, especially for scenarios that require data privacy, high security, customization, or cost sensitivity. They usually consume fewer resources than commercial models and suit users with specific tasks or customization needs.
Ollama Deployment
Deployment methods and differences
Ollama can be deployed either via Docker or from source. In terms of performance, there is usually no significant difference between source deployment and Docker deployment, because model inference ultimately depends on the hardware rather than the packaging of the runtime environment. However, in specific scenarios the following points may produce minor differences:
- Resource Allocation: Docker containers sometimes limit direct access to hardware resources, especially in the default configuration. Although you can adjust Docker settings to provide greater resource access, resource management for Docker deployments is slightly more indirect than running source code directly on the host.
- System Overhead: Docker containers add a small amount of system overhead (such as storage layer and network layer). These overheads are usually small, but may accumulate into tiny delays in high-performance computing scenarios.
- Driver compatibility: If you use a GPU, source code deployment can directly access the local GPU driver, while Docker requires a dedicated container and driver configuration for GPU support. NVIDIA provides the nvidia-docker tool to optimize Docker's GPU support, so it can usually achieve performance close to that of source code deployment.
Brief summary: in an environment without a GPU there is basically no difference between source and Docker deployment, and in that case Docker deployment is recommended; if a GPU is available, source deployment is recommended.
Deployment in Docker mode
CPU only
docker run --name ollama -d --restart=always --net=public-net \
-v /docker/ollama:/root/.ollama \
-p 11434:11434 \
-e OLLAMA_ORIGINS="*" \
ollama/ollama
Parameter Notes:
-v /docker/ollama:/root/.ollama: maps "/docker/ollama" on the host to "/root/.ollama" inside the container
-e OLLAMA_ORIGINS="*": the origins allowed for cross-origin access; the default only allows local access, configure as needed
Nvidia GPU
Install NVIDIA Container Toolkit (configured on the host)
Install using Apt
- Configure the repository
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
| sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
| sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
| sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
- Install NVIDIA Container Toolkit
sudo apt-get install -y nvidia-container-toolkit
Install using Yum or Dnf
- Configure the repository
curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo \
| sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
- Install NVIDIA Container Toolkit
sudo yum install -y nvidia-container-toolkit
Configure Docker to use Nvidia drivers
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
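Before creating the Ollama container, it may be worth confirming that Docker can actually see the GPU. A common sanity check (not Ollama-specific; the toolkit injects nvidia-smi into the container):
docker run --rm --gpus all ubuntu nvidia-smi
# if the GPU table is printed, the NVIDIA runtime is wired up correctly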
Create and run the container
docker run --name ollama -d --restart=always --net=public-net \
--gpus=all \
-v ollama:/root/.ollama \
-p 11434:11434 \
-e OLLAMA_KEEP_ALIVE="10m" \
-e OLLAMA_ORIGINS="*" \
ollama/ollama
Parameter Notes:
--gpus=all: enables NVIDIA GPU support and instructs Docker to give the container access to all GPUs available on the host. This option relies on "nvidia-container-toolkit" being installed and on the host having working NVIDIA drivers; if there is no GPU or the NVIDIA support libraries are not installed, --gpus=all will not work and GPU support is ignored.
-e OLLAMA_KEEP_ALIVE="10m": how long a model stays loaded in video memory after the last request; the default is "5m", change as needed
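Optionally, to confirm that the Ollama container itself can see the GPU (assuming the container is named "ollama" as above):
docker exec -it ollama nvidia-smi
# the GPU should be listed; if not, re-check the NVIDIA Container Toolkit setup above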
AMD GPU
To run Ollama with an AMD GPU using Docker, use the rocm tag and the following command:
docker run -d --device /dev/kfd --device /dev/dri -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama:rocm
Source code deployment
Ollama's source installation officially supports three operating systems: macOS, Linux, and Windows. On macOS and Windows it is a traditional application installer, while on Linux it is just a single installation command:
curl -fsSL https://ollama.com/install.sh | sh
The installation steps are very simple, so I won’t waste space introducing them here.
Unlike the Docker installation, a source installation sets its environment variables on the host itself, and there are more of them to configure, such as "OLLAMA_HOST": the host and port to bind to, which defaults to "127.0.0.1:11434"; configure as needed. If the local large language model UI and Ollama are on the same device, the default value is fine; if they are not on the same device and remote access fails, try setting this variable.
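For reference, on a Linux source install Ollama runs as a systemd service, so one way to set such variables (the address below is just an example) is via a systemd override:
sudo systemctl edit ollama.service
# add the following under the [Service] section:
# [Service]
# Environment="OLLAMA_HOST=0.0.0.0:11434"
sudo systemctl daemon-reload
sudo systemctl restart ollama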
Installing a ready-made model version
Next, run the large model version you need locally. For the latest model version, model parameter scale, and model size of the large model, please refer to the parameters on the official GitHub page (https://github.com/ollama/ollama):
Note: The unit "B" here refers to the number of parameters of the model, which is measured in "Billion". In other words, 1B means that the model has 1 billion parameters, and 7B means 7 billion parameters. The number of model parameters generally reflects the complexity and capabilities of the model - the more parameters, the richer the language features that the model can capture and express, but at the same time it requires more computing resources and memory. In large language models, parameters represent learned values such as weights and biases in the model. These parameters determine how the model processes and generates text data. Generally speaking, the larger the number of parameters, the better the model performs, but it requires higher hardware performance support to run efficiently.
Different model scales have different requirements for host memory:
Although the above figure only mentions 7B, 13B, and 33B, we can usually make a rough estimate of the required memory by scaling linearly with the model's parameter count. For example, assuming a 7B model requires 8 GB of memory and that the relationship between parameter count and memory requirement is roughly linear, we can deduce:
• 1B Model: Approximately 1.1 – 1.2 GB RAM
• 2B Model: Approximately 2 – 2.3 GB RAM
• 3B Model: Approximately 3 – 3.5 GB RAM
• 8B Model: Approximately 9 – 9.5 GB RAM
These are just rough estimates, and specific requirements may vary depending on model architecture, optimization methods, or loading patterns. Actual memory requirements may vary slightly depending on the actual distribution and segmentation of the model, but the overall calculation idea can be used as a reference.
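As a rough back-of-the-envelope check of this linear scaling (my own approximation based on the 7B ≈ 8 GB figure above, not an official number), a small shell one-liner reproduces the estimates:
for p in 1 2 3 7 8 13 33; do awk -v p="$p" 'BEGIN { printf "%2dB model: ~%.1f GB RAM\n", p, p * 8 / 7 }'; done
# prints roughly 1.1, 2.3, 3.4, 8.0, 9.1, 14.9, and 37.7 GB respectively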
Custom Model
Additional knowledge: GGUF model file format
"GGUF" refers to a model file format optimized for local use, especially for running LLMs (large language models) on devices with limited resources. It is a binary format with cross-platform and model compatibility advantages, so it is faster to load and reason. This format is becoming increasingly popular in the open source large language model community (Reddit, Discord, and other AI community forums) because it can improve local performance without relying on cloud resources.
Some open source model libraries also provide a large number of open source models that support the GGUF format. Take "Hugging Face" (one of the most popular platforms at present) as an example. Through the Hugging Face model library, you can search and download pre-trained models shared by many communities and institutions, sometimes directly including the GGUF format or corresponding conversion tools.
In addition, some models that support the GGUF format (such as open source models such as LLaMA) will be published directly by the development team on platforms such as GitHub. The official repository or release page usually lists the supported formats and provides download links.
Creating a custom model
Ollama can create custom models by importing GGUF-format model files via a Modelfile. The steps are as follows.
- First create a file named "Modelfile" and add the following FROM instruction to the file to specify the local file path of the model to be imported:
FROM ./vicuna-33b.Q4_0.gguf
- Create a model named example in Ollama (the model name can be set as you like):
ollama create example -f Modelfile
- Running a custom model
ollama run example
Note: The Modelfile is the configuration file that describes a custom model. Common instructions include the model's base source (specified with FROM), runtime parameters set with PARAMETER (such as temperature, context window size, and stop sequences), the prompt TEMPLATE, a SYSTEM message, an optional ADAPTER for applying LoRA weights, and metadata such as LICENSE. Together these options make loading and running the model flexible enough to adapt to different application needs. Since there is a lot of content and I can't dig into it for now, I won't say much here; friends who need it can study it on their own.
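For reference, a minimal illustrative Modelfile might look like this (the file path, parameter values, and system prompt are all just examples):
# base model: a local GGUF file or the name of a model already pulled into Ollama
FROM ./vicuna-33b.Q4_0.gguf
# runtime parameters: sampling temperature and context window size
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
# a system prompt baked into the custom model
SYSTEM """You are a concise assistant that answers in plain language."""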
Installing the llama3.2 model in Ollama and general use
Install llama3.2
Taking the "llama3.2 3B" model as an example, since I deployed it through the CPU-only docker method, the command to run the model is:
docker exec -it ollama ollama run llama3.2
Then there is a period of waiting. The specific waiting time depends on the model size you choose and the network speed:
Similar to the docker run command, the ollama run command actually performs a pull first to download the model and then runs it, so you can also pull the model separately in advance and run it later with ollama run:
docker exec -it ollama ollama pull llama3.2
You can then use the ollama list command to view the downloaded large language models:
docker exec -it ollama ollama list
Use the ollama ps command to view the currently running large language models:
docker exec -it ollama ollama ps
Use the ollama stop command together with the ollama rm command to stop and delete a running large language model:
docker exec -it ollama ollama stop <model name>
docker exec -it ollama ollama rm <model name>
You can also run the following to see what parameters the ollama command supports:
docker exec -it ollama ollama -h
If you install Ollama from source code, just run the corresponding model download and installation command in the shell:
ollama run llama3.2
The other commands are the same as under Docker, except that they can be run directly without the "docker exec -it ollama" prefix.
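Incidentally, if you just want a one-off answer without entering the interactive session, ollama run also accepts a prompt as an argument (the prompt text here is just an example; prefix with "docker exec -it ollama" under the Docker deployment):
ollama run llama3.2 "Summarize what Ollama does in one sentence."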
You can also try other different models. The commands corresponding to the specific models can be seen in the picture above.
In fact, after running the command ollama run llama3.2, the container has already entered an interactive session with llama3.2:
However, this interaction method is not practical and can only be used for testing. To be truly practical, it still needs to rely on a local large language model UI, such as Lobechat.
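A UI like Lobechat talks to Ollama through the HTTP API it exposes on port 11434; a minimal sketch of calling that API directly (the model name and prompt are just examples):
curl http://127.0.0.1:11434/api/generate -d '{"model": "llama3.2", "prompt": "Briefly introduce Ollama.", "stream": false}'
# returns the full response as a single JSON object instead of a token stream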
Add llama 3.2 model from Ollama to Lobechat
Method 1: Add the llama 3.2 model entry manually (not recommended, cumbersome)
At the time of writing, Lobechat did not yet have built-in support for the llama 3.2 models, so you can use them by manually adding a custom model entry:
According to my previous article (Home Data Center Series Unlock the full potential of Lobechat: A complete guide from setup to actual use), you can create a new llama 3.2 3B model version:
Then there is the Llama 3.2 3B model version:
Click "Llama 3.2 3B" in the red box above to enter the specific model setting page. After configuring as needed, click "OK" to save:
Method 2: Get the downloaded model list directly from Ollama (recommended)
This method is the simplest. If communication with Ollama is normal and the check passes, you can simply click the "Get Model List" button to fetch the models already downloaded in Ollama. For example, I downloaded two models, llama3.2 and gemma2:2b, so clicking Get Model List shows them directly, as below:
Note: The only issue that needs attention is what I noted in the picture above: even if Ollama and Lobechat are on the same device, the "Ollama Service Address" box is left blank (meaning the default http://127.0.0.1:11434 is used), Ollama itself is running normally, and the corresponding model has been started with ollama run, the connectivity check may still fail.
If you encounter this problem, try changing the address from 127.0.0.1 to the real IP address of the device's network interface; it is probably a bug.
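A quick way to tell whether the address you configured is actually reachable is to request the model list from it (the IP below is a placeholder for your device's real address):
curl http://192.168.1.10:11434/api/tags
# if this fails from another device, re-check the OLLAMA_HOST / OLLAMA_ORIGINS settings mentioned earlier and any firewall rules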
Select the Llama 3.2 model in the Lobechat "Conversation" interface
In the chat section on the right side of the Lobechat "Conversation" interface, select the corresponding model to the right of the assistant's avatar. I chose Llama 3.2 3B here, as shown below:
Then try a simple conversation:
This directly pushed the CPU of the Docker-dedicated virtual machine to 78%:
Then I asked it to write the simplest shell script:
The CPU of the Docker-dedicated virtual machine reached about 90%:
CPU-only deployment was still not enough, so I tried a smaller model, "llama3.2:1b". Compared with 3B, 1B requires fewer resources and should in theory be faster, but the output speed only feels slightly faster:
The CPU utilization is slightly lower, but actually not much different:
I think the bottleneck is my CPU: with a CPU this weak, it makes little difference whether I run 1B or 3B.
Note: There are also many Chinese fine-tuned versions of llama3 on GitHub. After optimization, the response speed and support for Chinese have been greatly improved under the same configuration. If you are interested, you can look for them. Just search for "llama3" on GitHub. Some of the platforms and communities mentioned above can also be a search channel.
Afterword
Originally, I wanted to investigate whether Intel's integrated graphics could provide acceleration, but after some research it does not seem very worthwhile. Ollama mainly supports NVIDIA GPUs because it relies on CUDA to accelerate deep learning workloads (CUDA is NVIDIA's proprietary GPU computing platform, widely used for training and inference acceleration of AI models), and Intel's integrated graphics lack CUDA support. Although Intel is also developing GPU solutions for AI inference (such as Intel Arc discrete graphics and its oneAPI framework), Ollama has not yet announced support for this hardware and framework, let alone Intel's integrated graphics. Even if I managed to get something working, it would not be very meaningful.
However, running Ollama on Apple M-series chips (such as the M3 Pro) can achieve quite good performance, especially for small and medium-sized large language models (such as models below 7B parameters): Apple's M-series chips (including M1, M2, M3 and their Pro, Max, and Ultra variants) have built-in neural engines and powerful GPUs that are specifically optimized for machine learning tasks. So the performance is "good" when running large language models locally, but I don't know how "good" it is; my MacBook Pro is only an M1, and I really have no desire to test it.
I have decided to buy a Mac mini with an M4 Pro next year to test it. I am just trying my best to do research; this is dedication to science, not a waste of money.
In addition, here is a simple comparison between Apple M series (taking M3 Pro as an example) and NVIDIA GPU running Ollama:
• Inference speed
For models that are not too large (such as 7B or smaller models), the performance of M3 Pro can be close to that of entry-level NVIDIA graphics cards, or even faster than some low-end GPUs. However, for large-scale models (such as 13B or larger), the CUDA cores and Tensor cores of NVIDIA professional GPUs (such as A100, H100, RTX 4090, etc.) can provide significant acceleration, especially in terms of video memory and parallel processing capabilities, so they will be faster.
• Memory Limits
The M3 Pro usually has a fairly large amount of unified memory (18GB to 36GB), which can support small and medium-sized models running in memory, but very large models (such as 33B or 65B) will be limited by memory and may not fit fully, resulting in failures or the need to load data in batches. Professional NVIDIA GPUs have larger video memory (for example, the A100 has 40GB) and can handle larger models.
• Power efficiency and quietness
M-series chips such as the M3 Pro are very efficient and quiet, with low heat and power consumption, making them suitable for long-term operation. In contrast, NVIDIA professional graphics cards consume more power and generate more heat, and usually require active cooling.
Summary: Running Ollama on the M3 Pro gets good acceleration from the CPU and Neural Engine, which is suitable for daily use and small-model inference (the threshold is still not low, since M3 (Pro) machines are not cheap; an M2 (Pro) should also be competitive. In short, the better the hardware, the better the experience). If you want higher inference speed and smooth operation of large models, an NVIDIA professional graphics card is still the stronger choice.