Preface
In a previous article, I wrote about running the llama3.2 3B model locally with Ollama and calling it from Lobechat (see: Home Data Center Series: Building Private AI: A Detailed Tutorial on Building an Open Source Large Language Model Locally Based on Ollama). However, due to hardware limitations, I could only deploy it with Docker in CPU-only mode on an Intel mini PC. Without GPU support, the i5-13400 produced only 1-2 words per second, which was very frustrating, so I started wondering what it would feel like to run Ollama on an M4 Mac mini.
After two weeks of patience, I finally placed an order. Considering that I might be running local large models frequently, and may even want to learn some video editing later, I gritted my teeth and ordered the base configuration of the M4 Pro:
When running large local models (such as Llama 3.2), the M4 and M4 Pro perform differently, even though both have a 16-core Neural Engine. This is mainly due to differences in CPU and GPU performance:
1. CPU performance: The M4 and M4 Pro differ in core count and per-core performance. The M4 Pro has more high-performance cores, while the M4 is more limited in core count and performance. This means the M4 Pro is stronger in tasks that need multi-threaded parallel processing (such as data preprocessing during inference).
2. GPU performance: The M4 Pro has more GPU cores and higher compute throughput. For inference workloads dominated by large matrix multiplications, the M4 Pro's GPU delivers faster computation, so it is usually quicker when running inference on large models such as Llama 3.2.
3. Memory bandwidth: The M4 Pro has higher memory bandwidth, which also matters for running large models. The greater the bandwidth, the faster data moves between the CPU, GPU, and Neural Engine, reducing model loading and inference latency.
4. Neural Engine utilization: Although the M4 and M4 Pro have the same number of Neural Engine cores, overall system architecture and data-movement speed affect real-world performance, so the M4 Pro still has the advantage in complex AI inference tasks.
So overall, although the two chips share the same Neural Engine, the M4 Pro outperforms the M4 in large-model inference, and the gap widens on more complex, resource-hungry tasks, which conveniently proves that gritting my teeth and buying the M4 Pro was not a waste of money.
After it arrived and two days of tinkering, the usual productivity tools were basically set up, and I could finally start verifying how well Ollama runs llama3.2.
I originally assumed installation and operation would be so simple that there was no need for a new article; I would just add an animated GIF of the actual running effect to my earlier Ollama article once it was done. But when I actually got started, I hit quite a few pitfalls, so in the end I decided to write a dedicated article.
Ollama deployment process on Mac
Install Ollama, then download and run the llama3.2 model
I mentioned the difference between Docker and source code deployment in my previous article:
So this time I naturally deployed in source-code mode: download the macOS installation file from the official Ollama website (https://ollama.com/download) (I downloaded version v0.4.1):
It is actually an app that can be run directly: double-click it and it copies itself into the Applications folder (I won't bother with a screenshot). Then use the following commands to download and run the llama3.2 3B model:
ollama pull llama3.2
ollama run llama3.2
Once it is running successfully, you can already do Q&A locally:
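If you want to double-check what was actually installed before moving on, the standard Ollama CLI provides a couple of quick commands (nothing Mac-specific here):

ollama --version   # prints the installed Ollama version (v0.4.1 in my case)
ollama list        # lists the locally downloaded models and their on-disk sizes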
Note 1: If you run ollama run llama3.2 directly, Ollama will automatically pull the model file when it finds no local copy of llama3.2; this is similar to docker run implicitly performing a docker pull.
Note 2: You can also download other models. For specific models, versions, parameter scales, sizes, and commands, see the figure below:
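As an illustration, pulling a different size is just a matter of changing the tag; the exact tag names should be checked against the Ollama model library, but for example the smaller 1B variant of Llama 3.2 can be fetched like this:

ollama pull llama3.2:1b   # the 1B text-only variant, even lighter than the 3B one
ollama run llama3.2:1b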
Note 3: The models in Ollama all use 4-bit quantization by default (otherwise you simply couldn't run them~).
Additional knowledge: model "quantization" is a technique that reduces compute and memory consumption by lowering the precision of model weights and activation values. Traditionally, deep learning weights and activations are stored and computed as 32-bit floating point numbers (FP32); quantization reduces these values to 16 bits, 8 bits, or even 4 or 2 bits. After quantization, the model's memory footprint and computational load drop dramatically, allowing large models that were previously impossible to run on resource-constrained hardware (such as ordinary graphics cards or phones) to run there, while optimized algorithms keep the accuracy loss to a minimum.
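As a rough back-of-the-envelope illustration (ignoring the KV cache and other runtime overhead): a 3B-parameter model stored as FP32 needs about 3B × 4 bytes ≈ 12 GB just for its weights, FP16 halves that to about 6 GB, and 4-bit quantization brings it down to roughly 3B × 0.5 bytes ≈ 1.5 GB, which is exactly why quantized models fit comfortably into the memory of a Mac mini or even a phone.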
Calling Ollama from a "locally deployed" Lobechat
Note 1: "local deployment" of Lobechat here means that the device running Ollama, the device running Lobechat, and the device whose browser acts as the client are all the same machine, so the browser accesses Lobechat at http://127.0.0.1:3210.
Note 2: For detailed Lobechat deployment and settings, please refer to my other two articles: Docker series based on the open source large language model UI framework: Lobechat detailed deployment tutorial, and Home Data Center Series: Unlock the full potential of Lobechat: A complete guide from setup to actual use. For brevity, this article will not repeat them and assumes everyone already knows the drill.
Use a docker run command of the following form to deploy a local Lobechat on the same device where Ollama is deployed:
docker run --name lobe-chat -d --restart=always \
  -p 3210:3210 \
  -e ACCESS_CODE=xxx \
  lobehub/lobe-chat
Visit http://127.0.0.1:3210, and the "Check" in the Ollama section under "Language Model" passes:
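If the check ever fails at this step, a quick way to confirm whether Ollama itself is up (independent of Lobechat) is to hit its HTTP API directly; by default it listens on port 11434:

curl http://127.0.0.1:11434            # should return "Ollama is running"
curl http://127.0.0.1:11434/api/tags   # returns the locally available models as JSON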
Since Lobechat had not yet added a model ID for Llama 3.2 when I wrote this article, I had to create one myself (this step is not strictly necessary: if the model has already been downloaded in Ollama and can be picked up automatically via "Get Model List" in the screenshot above, you can simply select it there, but I found the automatic retrieval a bit unreliable):
Configure the newly created Llama 3.2 3B model:
As usual, let's have it generate a simple shell script to see the generation speed:
It took roughly 5 seconds, while the same task previously took 1 minute 41 seconds in CPU-only mode, so this is roughly 20 times faster.
Note: I originally wanted to post an animated GIF converted from the CPU-only screen recording, but the 1 minute 41 second recording is too large to convert to a GIF, so I won't post it. In any case, the output speed there was about 2 words per second.
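For a rough, reproducible timing outside of Lobechat, you can also call Ollama's generate API directly and time it; the prompt below is just an arbitrary example of the same kind of request:

time curl -s http://127.0.0.1:11434/api/generate \
  -d '{"model": "llama3.2", "prompt": "Write a simple shell script that prints disk usage", "stream": false}' \
  > /dev/null

The JSON response (if you don't discard it) also contains eval_count and eval_duration fields, from which tokens per second can be computed.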
Let's also check llama3.2's knowledge cutoff:
It is almost a year out of date, but that's acceptable; after all, it's free.
So far, for friends without demanding requirements, accessing the local Lobechat at the default http://127.0.0.1:3210 is already enough to use Ollama.
However, this comes with a big limitation: the Ollama check in Lobechat only passes when the browser visits Lobechat at http://127.0.0.1:3210; if I switch to another address, for example the real IP address of the local network card, http://192.168.10.115:3210, the check fails:
This is because Ollama restricts cross-domain access by default for security reasons and only allows calls originating from 127.0.0.1. So when the local Lobechat is accessed via the loopback address (http://127.0.0.1:3210), the call to Ollama passes the check; when it is accessed via the physical network card's IP address (http://192.168.10.130:3210), it does not.
The trouble this causes is that, by default, my Ollama cannot serve a Lobechat deployed on another host in the LAN.
That may not matter to most people, but in my case I have already deployed the server-side database version of Lobechat in my home data center, and in normal use the calls will definitely come from that device, so for me the cross-domain access problem has to be solved.
Solving the Ollama cross-domain access problem
Solutions to cross-domain issues in other deployment modes
I actually touched on Ollama's cross-domain access briefly in my previous article, but back then I was only deploying with Docker for testing and it had no practical value, so I did not dig deeper. In short, with the Docker deployment you only need to add the environment parameters -e OLLAMA_ORIGINS="*" and -e OLLAMA_HOST="0.0.0.0" at deployment time to solve the cross-domain problem.
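For reference, a minimal sketch of such a Docker deployment might look like the following (image name, port, and volume path are the Ollama defaults; adjust to taste):

docker run -d --name ollama \
  -p 11434:11434 \
  -v ollama:/root/.ollama \
  -e OLLAMA_HOST="0.0.0.0" \
  -e OLLAMA_ORIGINS="*" \
  ollama/ollama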
If Ollama is deployed from source on Linux, you only need the following steps to allow cross-domain access:
1. Edit the service file corresponding to ollama
sudo systemctl edit ollama.service
2. Add the following content
[Service]
Environment="OLLAMA_HOST=0.0.0.0"
Environment="OLLAMA_ORIGINS=*"
3. Save and exit
4. Reload systemd and restart Ollama
sudo systemctl daemon-reload
sudo systemctl restart ollama
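To confirm the override actually took effect, you can inspect the unit's environment and then hit the API from another machine on the LAN (replace 192.168.x.x with your server's real address):

systemctl show ollama --property=Environment   # should list OLLAMA_HOST and OLLAMA_ORIGINS
curl http://192.168.x.x:11434                  # run from another host; should return "Ollama is running"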
Setting up Ollama cross-domain access on Windows is similar to Linux and is also simple: set the same two environment variables (OLLAMA_HOST and OLLAMA_ORIGINS) as system environment variables and then restart Ollama.
Solving cross-domain issues when deploying Ollama from source on macOS
The real question is how to solve the cross-domain problem on the Mac. Officially, you supposedly only need to run the following commands and then restart the Ollama application:
launchctl setenv OLLAMA_ORIGINS "*"
launchctl setenv OLLAMA_HOST "0.0.0.0"
Of course, if you only need to allow specific domain names, run the following command:
launchctl setenv OLLAMA_ORIGINS "google.com,apple.com"
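Before restarting the app you can also verify whether the variables were actually registered with launchd:

launchctl getenv OLLAMA_ORIGINS   # should print * (or the domains you set)
launchctl getenv OLLAMA_HOST      # should print 0.0.0.0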
However, in practice I found that this did not work; it was only when I followed the steps below:
export OLLAMA_ORIGINS="*"
export OLLAMA_HOST=0.0.0.0
ollama serve
that the check passed:
This left me a little confused, so let me sort things out in order.
1. This is the blank state before the Ollama app is opened:
2. After only opening the Ollama app, there are 3 processes:
At this point, running the command ollama run llama3.2 gives the following result:
3. After running the command ollama serve:
running ollama run llama3.2 again succeeds:
The processes at this point are as follows:
4. Kill all ollama-related processes to get back to the blank state, then run only ollama serve:
Running ollama run llama3.2 again succeeds:
The conclusion, therefore, is that once the first run has finished the installation, the Ollama app itself is essentially useless (some earlier versions reportedly ran ollama serve automatically when the app was opened, which would give the app a purpose, but that is not what I observed here). What actually matters for running local large models afterwards are the two commands ollama serve and ollama run.
Note: when a local large model is started with ollama run, the Ollama app is launched automatically, and quitting the Ollama app also shuts down the running model. So it is not entirely fair to say the Ollama app is useless; at the very least it serves as a quick way to exit the model.
Additional knowledge: launchctl setenv is used on macOS to set system-level environment variables. These take effect for the whole system, including all user processes, and can "theoretically" be read by all applications and terminal sessions.
So why did the earlier launchctl setenv OLLAMA_ORIGINS "*" and launchctl setenv OLLAMA_HOST "0.0.0.0" settings fail?
On macOS, environment variables set with launchctl setenv are not always immediately visible to every application or command-line tool. macOS's environment variable management has some specific limitations, especially for commands and services started directly from a terminal:
1. Scope of environment variables
Environment variables set with launchctl setenv are only available to processes started by launchd; they are not passed directly to commands you start from a terminal. For example, an ollama process started with ollama serve in a terminal gets its environment from that terminal session, not from launchctl.
So when opening the Ollama app used to automatically run ollama serve, the launchctl setenv method still worked; but now opening the Ollama app does not automatically run ollama serve, and you can only run it manually in a terminal, so the launchctl setenv method is naturally useless (at least for me).
2. Isolation of terminal sessions
When you run export OLLAMA_ORIGINS="*" in a terminal and then start ollama from that same terminal, it picks up the environment variable directly and never needs to read it from launchctl. That is why export is usually more reliable for commands run directly from a terminal.
Note: setting Ollama's environment variables with launchctl setenv is not necessarily ineffective; it may depend on the macOS version, so you can try it first. If it works, you can skip the export commands.
To sum up, if you just want to start Ollama temporarily and need it to accept cross-domain calls, you only need to run the following commands in order (since ollama serve occupies the foreground, run the final ollama run llama3.2 in a second terminal window):
export OLLAMA_ORIGINS="*"
export OLLAMA_HOST=0.0.0.0
ollama serve
ollama run llama3.2
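A quick sanity check from another host on the LAN (using the Mac's real LAN IP, e.g. the 192.168.10.115 address mentioned above) confirms that Ollama is now reachable beyond the loopback interface:

curl http://192.168.10.115:11434/api/tags   # should now return the model list instead of a refused connection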
If you need the environment variables to take effect in every terminal and every process started from one, you can add the export commands to your ~/.zshrc or ~/.bash_profile (depending on which shell you use) so they are loaded automatically every time you open a new terminal. You can follow my steps below.
The default shell on macOS is zsh; of course, you can confirm it with the following command:
echo $SHELL
For example, my Mac output is as follows:
After confirming which shell you are using, edit that shell's configuration file (for zsh it is .zshrc in your home directory) with whatever text editor you are comfortable with; I'm used to vim:
vim ~/.zshrc
Then add the following content in the editing interface and save and exit:
export OLLAMA_ORIGINS="*"
export OLLAMA_HOST=0.0.0.0
Finally reload the configuration:
source ~/.zshrc
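You can verify that the variables are present in the new shell session:

echo $OLLAMA_ORIGINS   # should print *
echo $OLLAMA_HOST      # should print 0.0.0.0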
After that, every time you open a terminal to run ollama serve and ollama run with cross-domain access required, you no longer need to re-run the export commands.
In addition: if you want a large model to start automatically at boot, you only need to create a script along the following lines (ollama serve is put in the background here so that the subsequent ollama run can actually execute):
export OLLAMA_ORIGINS="*"
export OLLAMA_HOST=0.0.0.0
ollama serve &   # run the server in the background, otherwise the next command never executes
sleep 5          # give the server a moment to come up
ollama run llama3.2
Then set the script to run at startup. That said, Macs are rarely shut down anyway, so I don't think it matters much.
Although I think llama3.2 3B (Llama 3.2's small text-only language models come in two sizes, 1B and 3B) should be enough for general personal use, in the spirit of "if there are difficulties, face them; if there are none, create some and face them anyway", I set my sights on the llama3.2-vision 11B model released only on November 6:
High Difficulty Challenge
llama 3.2 vision model introduction
Llama 3.2 Vision, released by Meta, is a powerful open-source multimodal model that combines visual understanding and reasoning capabilities. It can handle tasks such as visual grounding, document question answering, and image-text retrieval, and by applying Chain-of-Thought (CoT) reasoning it excels at visual reasoning. Architecturally, Llama 3.2 Vision consists of the Llama 3.1 LLM, a vision tower, and an image adapter.
Llama 3.2 Vision supports combined image-and-text input as well as plain text input. In image-text mode the model accepts English input; in plain-text mode it supports multiple languages, including English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.
Field test
Run the llama 3.2 vision model using the following command and ensure it runs successfully:
ollama run llama3.2-vision
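Besides the interactive CLI, the vision model can also be called through the same generate API by passing the image as base64 in an images array; a minimal sketch (the image path is just a placeholder):

IMG_B64=$(base64 -i ./test.jpg | tr -d '\n')   # encode the image; -i is the macOS base64 input flag
curl -s http://127.0.0.1:11434/api/generate \
  -d "{\"model\": \"llama3.2-vision\", \"prompt\": \"What is in this picture?\", \"stream\": false, \"images\": [\"$IMG_B64\"]}"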
Then complete the relevant configuration in Lobechat (as described earlier in the article: create a new model under Lobechat's Ollama language model settings and customize its configuration):
And call it in the session interface:
Then give it the same request used earlier when testing llama3.2 3B to compare the generation speed:
It feels no different from 3B. Let’s take a look at image recognition:
Not bad.
Note: during the process I also watched GPU utilization in Activity Monitor. From receiving the prompt to finishing the text output, GPU utilization hovered between 0% and 33%. Since the utilization fluctuates and is hard to capture, I didn't bother with a screenshot, but overall the load is not high (after all, the questions are not hard) and there is plenty of headroom left.
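Besides Activity Monitor, Ollama's own process listing also shows where a loaded model is actually running and how much memory it takes:

ollama ps   # shows the loaded model, its size, and the CPU/GPU split (e.g. "100% GPU")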
Summary
In general, I am satisfied with how efficiently Ollama runs large models on the M4 Pro; llama3.2-vision is enough for me. I will try models with more parameters later to see where the M4 Pro's limit is.
Why not buy an Orin Nano? It can run 7-8B models and is cheaper.
Honestly, the key point is that the Mac is my production machine. The main reason I switched to the M4 Pro Mac mini is that my M1 MacBook Pro production machine was due for replacement; running local large models is only a secondary goal. If you are buying a device specifically to run local large models, the Mac has no advantage in either performance or price.
Hello author, I am trying to decide which gives the bigger improvement / better value: the M4 Pro chip, or the regular M4 chip with 32 GB of memory? I mainly want to run local large models plus RAG search and summarization. Could you give me some advice? Thank you very much.
It mainly depends on how large a model you want to run. Generally, the larger the parameter count (say, above 20B), the higher the memory requirement, and at that point 32 GB of memory matters more than the performance drop from the M4 Pro to the regular M4. For a moderate 7B-13B RAG setup, the base M4 Pro with 24 GB of memory is sufficient (even with a vector retrieval index included).
Thank you very much! Very detailed answer
The lowest-spec Mac mini can already run an 8B model, so your 24 GB M4 Pro is more than capable. A 3B model is not enough for daily use; it just isn't smart enough.
That's true. As for what to run on it next, I'm just experimenting for now, mainly to test the machine.