Article Summary

To address the limitations of explicit structured recommendations and the unused blank area on the article page, this article builds an embedding-based semantic recommendation chain for a blog. The core semantics of each article are extracted into a `semantic-source.json` file, vectors are generated with the nomic-embed-text model, and cosine similarity is then used to build a `semantic-path.json` file, which enables recommendations based on proximity in semantic space. This approach keeps the original explicit structured recommendations while adding a "Guess What You Want" module that can surface cross-category and cross-tag associations, making the recommendations more diverse and in-depth. Because the similarity computation runs offline, no vector calculation is needed when a page is served.

Qwen3-14B · 2026-06-18

1. Why is a "Guess What You Want" feature still needed?

Because it overlaps with the "Right-side Menu" feature later added to the article page, I decided after some consideration to remove the "Guess What You Want" feature. This article is nevertheless retained as a reference example for implementing similar features.

In a previous article (see: Building a Lightweight Knowledge Index for Blogs (Part 3): Design and Implementation of a Lightweight Recommendation System for WordPress Based on Pre-computed Semantic Indexes), I added a "Related Articles" feature below the "Content Structure Tips" section (it was previously called "Recommended Reading," but that term felt too broad, so I renamed it to the more accurate "Related Articles").

After running it for a while, I am quite satisfied with this feature. Since the recommendation logic is based on explicit structures such as article title, tags, categories, and body content, it can recommend content that is highly relevant to the current article in most cases.

However, one issue has always bothered me: the large blank area to the right of "Related Articles" makes the page look unfinished.

Since that space is empty anyway, I naturally wondered whether I could fill it with more recommended content. The problem is that the "Related Articles" section already does a good job of finding similar articles. If the new module also relied on explicit structures such as categories, tags, series, and knowledge maps, its results would likely overlap heavily with "Related Articles" and produce a lot of duplicate content. In other words, a new feature that re-recommends the same articles with a similar algorithm would add little value.

Since the "Related Articles" on the left are responsible for uncovering explicit connections, the content on the right should perhaps take on another responsibility: discovering potential connections hidden beneath the surface of the articles, connections that cannot be directly described by categories and tags. Among the more mature technical solutions currently available, the most suitable for this task is the semantic space formed after vectorization (for those unfamiliar with the concept of vectors, please refer to my other article: Vectors: The Universal Language of the AI World).

Therefore, I decided to add a brand-new recommendation module on the right side: "Guess What You Want". Unlike "Related Articles", which relies on explicit structure, it is based entirely on vector data generated from article embeddings, using semantic similarity to find articles that may be related but do not necessarily share a category or tag.

2. Implementation ideas for the "Guess What You Want" feature

2.1 Why not just use the existing article-index.json?

Following the approach of the previous articles in this series, both "Related Articles" and the "series article cards" start from WordPress article content, which is used to generate semantic-index.json and is then processed into article-index.json. The WordPress frontend reads article-index.json directly to render the display.

Intuitively, "Guess What You Want" seems able to reuse this structure directly: article-index.json already contains information such as article title, URL, and summary. In theory, with a few more fields added, it could also be used for embedding calculations. After evaluating it in practice, however, I abandoned this approach. The problem is not whether there are enough fields, but that its design goals are inherently unsuited to embedding.

The `article-index.json` file is essentially a data structure for front-end display, and its core value is "displayability" rather than "semantic integrity." It retains primarily organized result information, such as titles, links, summaries, category tags, series relationships, and topic spaces. This content is well-suited for building "related articles" because it relies on explicit structural relationships.

However, the input requirements for Embedding are completely different; it relies on the original semantic expression of the article, rather than a structured result. Continuing to use `article-index.json` introduces a direct problem: the semantics have already been compressed once before entering the Embedding layer. For example, HTML tags, code blocks, and unstructured content are cleaned up or weakened at this layer, and contextual information is also summarized. This is reasonable for presentation systems, but for semantic modeling, it means information loss.

More importantly, `article-index.json` already contains a complete explicit structure system (categories, tags, series, knowledge maps, etc.). This system is an advantage in "Related Articles" because it makes recommendations stable and interpretable, but it becomes a constraint in "Guess What You Want", because the goal of this feature is to go beyond the explicit structure and discover potential relationships that cannot be expressed by categories or tags. If we still rely on these structural fields, the final result is likely to converge on the same category, the same tag, and the same series. The difference between "Guess What You Want" and "Related Articles" would then be significantly weakened.

If we trace back to the `semantic-index.json` file generated in the first step of the "Related Articles" feature, technically it already has the conditions for embedding, as it still retains information such as article titles, body content, and summaries. However, the design goal of `semantic-index.json` is not for embedding, but rather to serve as an intermediate layer for the subsequent semantic indexing system. In addition to the body content, it also incorporates data structures geared towards indexing and recommendation systems, such as keywords, tags, and related articles.

In other words, while semantic-index.json can be used for embedding, it is multi-purpose index data rather than a semantic input layer prepared specifically for vectorization.

Therefore, there are actually two different design approaches here: the first approach is to directly reuse semantic-index.json, extract the content or summary fields from it, and then generate a vector; the second approach is to build a separate data source specifically for Embedding.

Ultimately, I chose the latter. Instead of expanding the existing indexing system, I built a separate data processing chain dedicated to embeddings:

WordPress original content
↓
semantic-source.json
↓
semantic-vector.json
↓
semantic-path.json

Reasons for building a separate embedding data chain

The embedding chain is likely to continue to expand in the future. For example, capabilities such as article-level vectors, paragraph-level vectors, knowledge node vectors, and topic clustering may all rely on the same set of semantic input sources.

If we continue to reuse semantic-index.json, the vector system will become coupled with the existing index system; however, by introducing semantic-source.json separately, the embedding chain will have an independent data entry point, and subsequent additions of new semantic computing capabilities or adjustments to the index system architecture will not affect each other.

2.2 semantic-source.json: Constructing a semantic source suitable for embedding

The first step in this new data pipeline is to generate semantic-source.json. It is not a continuation of semantic-index.json, but rather a semantic input layer specifically for embedding, extracted from the original WordPress content, bypassing the original indexing system.

The reason such an intermediate layer is needed is that the raw post content stored in WordPress is not suitable for direct use in embedding calculations. Besides the main text, an article often contains a large number of HTML tags, code blocks, image links, shortcodes, and various formatting information. While this content is essential for webpage display, a significant portion of it constitutes noisy data for semantic models.

Therefore, before entering the Embedding stage, the article content needs to undergo a preprocessing specifically for semantic computation to retain as much of the article's true meaning as possible, while removing those parts that are not closely related to semantic understanding.

After processing, each article will eventually be organized into a very simple structure:

{
  "id": "14257",
  "title": "Building a Lightweight Knowledge Index for Blogs (Part 3): Design and Implementation of a Lightweight Recommendation System for WordPress Based on Pre-computed Semantic Indexes",
  "url": "/technology/homedatacenter14257/",
  "content": "..."
}

Where:

`id` is used to link back to the article in later steps;
`title` stores the article title;
`url` is used as the final link target;
`content` stores the semantic text used to generate the embedding.

The core of the entire file actually lies in this content field.

Simply exporting the full text of an article can generate vectors, but the results may not be ideal: firstly, blog articles vary greatly in length, with some being only a few hundred words long and others reaching tens of thousands of words; secondly, not all content has the same value for semantic modeling.

Taking technical articles as an example, while numerous configuration files, command-line outputs, and code examples are important to readers, their contribution to determining the article's theme is often far less than that of the article title, core arguments, and key paragraphs.

Therefore, the goal of semantic-source.json is not to save the full text of the article, but to extract an information summary that can represent the core semantics of the article.

In the design phase, I adopted a relatively simple strategy: first, retain the article titles; then, extract the most representative content from the body text; for excessively long articles, control the final output size through length limits to avoid a single article occupying too much contextual space. The goal of this approach is not to pursue absolute completeness, but to find a balance between semantic fidelity and processing efficiency.

Based on the results, semantic-source.json is more like a "semantically simplified article library" specifically prepared for embedding. It retains the core content of the articles while avoiding a large amount of noise information unrelated to semantic calculation.

With this organized semantic source data, the next step is to further convert these texts into vector representations, thus entering the actual embedding stage.

2.3 semantic-vector.json: Converting articles into vectors

Once the semantic-source.json is ready, the next step is clear: convert this text content into vectors that machines can understand and compute.

For humans, understanding an article relies on language, experience, and context. For example, when we see terms like "home data center," "PVE virtualization platform," and "Cloudflare Tunnel," we naturally associate them with each other and can even determine whether two articles are discussing similar topics.

However, computers lack this capability. In their view, text is essentially just a sequence of characters, devoid of any concept of "meaning." Therefore, if we want the system to automatically discover potential connections between articles, we must first map the text into a representation that can be mathematically calculated.

The essence of an embedding model is to perform this transformation. It reads the article content and encodes it into a high-dimensional vector. With the nomic-embed-text model in Ollama, each article is ultimately converted into a 768-dimensional floating-point array. This array itself is not human-readable, but its position in the vector space reflects the semantic features expressed by the article.

Simply put, articles that are semantically similar are usually closer in the vector space; while articles with greater thematic differences will be distributed further apart.

For example:

Two articles about Cloudflare Tunnel will usually have vectors that sit quite close together;
Articles about PVE virtualization and home data centers may also form a relatively concentrated area;
In contrast, non-technical topics such as Buddhist philosophy and life insights are often located further away from these technical articles.

This is also the core reason why Embedding can achieve semantic recommendation: the system no longer relies on categories, tags or series relationships, but directly finds related content based on the semantic similarity of the articles themselves.
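The idea of "closer in vector space" can be illustrated with cosine similarity on toy data. The 3-dimensional vectors below are invented for illustration only (real nomic-embed-text vectors have 768 dimensions), and the variable names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors.

    Values near 1.0 mean the vectors point in almost the same direction;
    values near 0.0 mean they are nearly unrelated.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up "embeddings": two Cloudflare Tunnel articles and one on Buddhism.
tunnel_a = [0.9, 0.1, 0.0]
tunnel_b = [0.8, 0.2, 0.1]
buddhism = [0.0, 0.2, 0.9]

sim_close = cosine_similarity(tunnel_a, tunnel_b)
sim_far = cosine_similarity(tunnel_a, buddhism)
print(f"tunnel_a vs tunnel_b: {sim_close:.3f}")
print(f"tunnel_a vs buddhism: {sim_far:.3f}")
```

The two technical articles score far higher against each other than either does against the non-technical one, which is the property the recommendation module relies on.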

After vectorization, the final output will be semantic-vector.json.

Compared to `semantic-source.json`, this file no longer stores the complete text, but instead stores the vector results corresponding to the article and necessary metadata. For example:

{
  "id": "14289",
  "title": "Reconstructing Creativity in the AI Era: From Information Processing to Cognitive Collaboration",
  "url": "/technology/cognition14289/",
  "vector": [0.0123, -0.0841, ...]
}

Where:

`id` associates the vector with its WordPress post;
`title` and `url` are mainly used for later display and debugging;
`vector` is the semantic vector generated by the embedding model.

semantic-vector.json can be understood as the "computation layer" of the entire semantic recommendation system. If semantic-source.json stores the original semantic expression of each article, then semantic-vector.json stores the coordinates of those semantics in the vector space. Its responsibility is not to display content, but to provide the base data for subsequent similarity calculation.

However, even at this stage, the system still cannot directly provide recommendations to the user. This is because `semantic-vector.json` only records the vector for each article, but has not yet calculated the relationship network between the articles.

Therefore, the final step is to calculate the nearest neighbors of each article based on the distance between these vectors and save the results.

This is what we will introduce in the next section: semantic-path.json.

2.4 semantic-path.json: Vector-based semantic associations

Once the semantic-vector.json file is generated, the semantic computation chain has completed its most crucial step: each article has been mapped into a high-dimensional vector space.

However, if we stop at this level, the system remains unusable. The reason is simple: vectors by themselves cannot be used directly for recommendations. They can only express "what this article is," but cannot answer "which articles it most resembles." What "Guess What You Want" truly needs is the relationship between articles, not the semantic description of a single article.

Therefore, the problem this layer needs to solve is very clear: to transform the distance relationships in the vector space into directly usable structured associations. The implementation is essentially a batch similarity calculation. The system uses each article as a benchmark to perform vector comparisons with all articles, typically using cosine similarity as the metric, and selects the Top-K most similar articles as its semantic neighbors.

In a real-world storage structure, `semantic-path.json` solidifies this relationship into a very straightforward data format. For example, the structure of an article might look like this:

{
  "id": "14289",
  "links": [
    { "target_id": "14136", "score": 0.87 },
    { "target_id": "14228", "score": 0.82 },
    { "target_id": "14164", "score": 0.79 }
  ]
}

Where:

`id` represents the current article;
`links` is the set of articles that are semantically closest to it;
`target_id` is the ID of an associated article;
`score` is a matching score calculated from vector similarity, used for sorting and filtering the Top-K results.

In this way, the "distance relationships" that originally existed only in high-dimensional space are condensed into structured data, and that data is what semantic-path.json stores.
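The batch Top-K calculation can be sketched as follows. This is a minimal pure-Python illustration, not the author's build script: the function names and the toy 3-dimensional vectors are assumptions, and a real build over 768-dimensional vectors would more likely use NumPy for speed.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_semantic_paths(items, top_k=3):
    """Compare every article with all others and keep its top_k neighbors.

    `items` is a list of dicts with "id" and "vector" keys, shaped like the
    semantic-vector.json records. Returns semantic-path.json style records.
    """
    paths = []
    for source in items:
        scored = [
            {
                "target_id": other["id"],
                "score": round(
                    cosine_similarity(source["vector"], other["vector"]), 4
                ),
            }
            for other in items
            if other["id"] != source["id"]  # never link an article to itself
        ]
        scored.sort(key=lambda link: link["score"], reverse=True)
        paths.append({"id": source["id"], "links": scored[:top_k]})
    return paths


# Toy data with made-up vectors (hypothetical, for illustration only).
demo = [
    {"id": "1", "vector": [0.9, 0.1, 0.0]},
    {"id": "2", "vector": [0.8, 0.2, 0.1]},
    {"id": "3", "vector": [0.0, 0.2, 0.9]},
]
paths = build_semantic_paths(demo, top_k=2)
for record in paths:
    print(record)
```

The comparison is quadratic in the number of articles, which is acceptable for an offline build on a personal blog.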

From a system design perspective, the role of this layer is to move the similarity query, which would otherwise be computed in real time on each visit, forward to the build phase. At runtime, vector calculations are no longer performed; instead, pre-calculated relationship results are read directly. Without this layer, "Guess What You Want" would have to load all vectors and perform a full calculation and sort on every request. This is not a major issue when the number of articles is small, but as it grows, it quickly becomes a performance bottleneck.
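With the relationships pre-computed, the runtime lookup reduces to a dictionary read. The sketch below is in Python for illustration only (the actual frontend is WordPress). It assumes semantic-path.json is a JSON array of records shaped like the example above; the function names and the `limit` and `min_score` parameters are hypothetical.

```python
import json


def index_records(records):
    """Map each article id to its pre-computed list of links."""
    return {record["id"]: record["links"] for record in records}


def load_path_index(path_file):
    """Read semantic-path.json once and index it by article id."""
    with open(path_file, "r", encoding="utf-8") as f:
        return index_records(json.load(f))


def recommend(index, article_id, limit=5, min_score=0.0):
    """Return up to `limit` stored links whose score is at least `min_score`."""
    links = index.get(str(article_id), [])
    return [link for link in links if link["score"] >= min_score][:limit]


# In-memory demo using the sample record shown earlier.
sample = [
    {
        "id": "14289",
        "links": [
            {"target_id": "14136", "score": 0.87},
            {"target_id": "14228", "score": 0.82},
            {"target_id": "14164", "score": 0.79},
        ],
    }
]
index = index_records(sample)
top_two = recommend(index, 14289, limit=2)
print(top_two)
```

No vectors are loaded and no similarity is computed at this stage; an unknown article id simply yields an empty list.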

The significance of `semantic-path.json` lies in "moving" this step forward: trading offline computation for online performance, making the recommendation logic a pure data reading process. Of course, this design also implies a trade-off: the relationships between articles are fixed at the time of generation. Once built, subsequent recommendation results will not adjust in real time with minor changes in the model or data, but will rely on the structure calculated at that time.

However, for systems like personal blogs with limited content and a relatively stable update schedule, this trade-off is reasonable. It results in greater stability, simpler runtime logic, and more controllable system complexity.

At this point, the entire embedding chain is essentially closed:

Original WordPress content
↓
semantic-source.json (semantic input layer)
↓
semantic-vector.json (vector representation layer)
↓
semantic-path.json (semantic relation layer)
↓
Guess What You Want / Semantic recommendation system

In abstract terms, these three steps correspond to language expression, mathematical representation, and relational modeling, respectively, while semantic-path.json is the layer that returns from the "vector space" to the "usable structure".

At this point, the data foundation for "Guess What You Want" is complete. All that remains is for WordPress to use these relationship results to build a stable and interpretable recommendation display.

3. From semantic computation to recommendation display

3.1 Generation script for semantic-source.json

In the previous chapter, we introduced the role and data structure of semantic-source.json. It is the starting point of the entire embedding data chain, responsible for extracting input data suitable for semantic computation from the original WordPress content.

Therefore, the first step is to read the article content from WordPress and generate the semantic-source.json file. The corresponding script, build_semantic_source.py, is as follows:

import requests
import re
import html
import json
import os
from urllib.parse import urlparse

# Adjust to match your environment
BASE_URL = "http://127.0.0.1"

# Overall length limit for the semantic content
MAX_CONTENT_LENGTH = 8000

# Maximum length of a single paragraph
MAX_PARAGRAPH_LENGTH = 800


# -------------------------
# Step 1: Fetch WordPress posts
# -------------------------
def fetch_posts(base_url):
    posts = []
    page = 1
    total_pages = None

    while True:
        url = f"{base_url}/wp-json/wp/v2/posts?page={page}&per_page=100"

        # A timeout keeps the script from hanging if the site does not respond
        resp = requests.get(url, timeout=30)

        if resp.status_code != 200:
            raise Exception(
                f"Request failed at page {page}, status: {resp.status_code}"
            )

        data = resp.json()

        if total_pages is None:
            total_pages = int(resp.headers.get("X-WP-TotalPages", 1))
            print(f"Total pages: {total_pages}")

        posts.extend(data)

        print(
            f"Fetched page {page}, total posts: {len(posts)}"
        )

        if page >= total_pages:
            break

        page += 1

    return posts


# -------------------------
# URL -> path
# -------------------------
def extract_path(url: str) -> str:
    if not url:
        return ""

    return urlparse(url).path


# -------------------------
# HTML cleaning (preserves section structure)
# -------------------------
def clean_html(raw_html):
    text = raw_html or ""

    text = re.sub(
        r"<pre.*?>.*?</pre>",
        " ",
        text,
        flags=re.DOTALL | re.IGNORECASE
    )

    # Also match <code> tags that carry attributes (e.g. class names)
    text = re.sub(
        r"<code[^>]*>.*?</code>",
        " ",
        text,
        flags=re.DOTALL | re.IGNORECASE
    )

    text = re.sub(
        r"<script.*?>.*?</script>",
        " ",
        text,
        flags=re.DOTALL | re.IGNORECASE
    )

    text = re.sub(
        r"<style.*?>.*?</style>",
        " ",
        text,
        flags=re.DOTALL | re.IGNORECASE
    )

    # shortcode
    text = re.sub(r"\[/?[^\]]+\]", " ", text)

    # Headings
    text = re.sub(
        r"</?(h1|h2|h3|h4|h5|h6)[^>]*>",
        "\n",
        text,
        flags=re.IGNORECASE
    )

    # Paragraphs
    text = re.sub(
        r"</?p[^>]*>",
        "\n",
        text,
        flags=re.IGNORECASE
    )

    # br
    text = re.sub(
        r"<br\s*/?>",
        "\n",
        text,
        flags=re.IGNORECASE
    )

    # Remove any remaining HTML tags
    text = re.sub(r"<[^>]+>", " ", text)

    text = html.unescape(text)

    # URL
    text = re.sub(r"https?://\S+", " ", text)

    # Special symbols
    text = re.sub(
        r"[丨|｜•·■◆►▶●]+",
        " ",
        text
    )

    # Collapse spaces and tabs (newlines are kept)
    text = re.sub(r"[ \t]+", " ", text)

    # Collapse runs of three or more newlines
    text = re.sub(r"\n{3,}", "\n\n", text)

    return text.strip()


# -------------------------
# Split the body into paragraphs
# -------------------------
def split_paragraphs(text):
    paragraphs = [
        p.strip()
        for p in re.split(r"\n+", text)
        if p.strip()
    ]

    result = []

    for p in paragraphs:
        while len(p) > MAX_PARAGRAPH_LENGTH:
            result.append(p[:MAX_PARAGRAPH_LENGTH])
            p = p[MAX_PARAGRAPH_LENGTH:]

        if p:
            result.append(p)

    return result


# -------------------------
# Step 2: Build the semantic content
# -------------------------
def build_semantic_content(title, clean_text):
    paragraphs = split_paragraphs(clean_text)

    if not paragraphs:
        return title

    filtered = []

    for p in paragraphs:
        p = p.strip()

        # Too short
        if len(p) < 80:
            continue

        # Bare section heading
        if re.match(r"^\d+(\.\d+)*\s+", p):
            if len(p) < 40:
                continue

        filtered.append(p)

    if not filtered:
        filtered = paragraphs

    total = len(filtered)

    selected = []

    # Opening paragraphs
    selected.extend(filtered[:3])

    # Evenly spaced samples across the full text
    if total > 10:
        sample_ratios = [
            0.10,
            0.20,
            0.30,
            0.40,
            0.50,
            0.60,
            0.70,
            0.80,
            0.90
        ]

        for ratio in sample_ratios:
            pos = int(total * ratio)

            if 0 <= pos < total:
                selected.append(filtered[pos])

    # Closing paragraphs
    if total >= 2:
        selected.extend(filtered[-2:])
    elif total >= 1:
        selected.append(filtered[-1])

    # Deduplicate while preserving order
    unique_paragraphs = []
    seen = set()

    for p in selected:
        if p not in seen:
            unique_paragraphs.append(p)
            seen.add(p)

    semantic_text = (
        f"{title}\n"
        + "\n".join(unique_paragraphs)
    )

    return semantic_text[:MAX_CONTENT_LENGTH].strip()


# -------------------------
# Step 3: Build semantic-source
# -------------------------
def build_semantic_source(posts):
    articles = []

    for post in posts:
        raw_content = post.get("content", {}).get("rendered", "")

        clean_text = clean_html(raw_content)

        title = post.get("title", {}).get("rendered", "")

        semantic_content = build_semantic_content(
            title,
            clean_text
        )

        article = {
            "id": str(post.get("id", "")),
            "title": title,
            "url": extract_path(post.get("link", "")),
            "content": semantic_content
        }

        articles.append(article)

    return articles


# -------------------------
# Write JSON output
# -------------------------
def write_json(data, output_path=None):
    if output_path is None:
        base_dir = os.path.dirname(os.path.abspath(__file__))
        output_path = os.path.join(
            base_dir,
            "semantic-source.json"
        )

    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(
            data,
            f,
            ensure_ascii=False,
            indent=2
        )

    print(f"\nJSON written to: {output_path}")


# -------------------------
# Main
# -------------------------
if __name__ == "__main__":
    posts = fetch_posts(BASE_URL)

    print(f"\nTotal posts fetched: {len(posts)}")

    semantic_source = build_semantic_source(posts)

    print(
        f"Semantic source built: "
        f"{len(semantic_source)} articles"
    )

    write_json(semantic_source)

    print("\nSample article:\n")

    if semantic_source:
        print(
            json.dumps(
                semantic_source[0],
                ensure_ascii=False,
                indent=2
            )
        )

The logic of this script is actually not complicated.

First, it retrieves all published posts via the WordPress REST API, then extracts the post titles, URLs, and body content. Since embedding calculations are required later, the focus here is not on the page display, but rather on the semantic information conveyed by the posts themselves.

After retrieving the main text, the script first performs content cleaning. WordPress posts typically contain a large amount of display-related information, such as HTML tags, code blocks, shortcodes, style content, and various links. This content is necessary for page rendering, but often constitutes noise for semantic calculations. Therefore, at this stage, the script tries to preserve the natural language expression of the post while removing semantically irrelevant content, thereby improving the quality of subsequent vectorization processing.

After cleaning, the script will break the main text down into paragraphs and filter out overly short content and titles that only contain chapter numbers. The purpose of this is to ensure that subsequent content used in embedding has a relatively complete semantic expression, rather than being distracted by a large number of fragmented titles or navigation information.

Subsequently, the script does not use the entire article directly, but instead extracts a set of representative content segments from the whole text. Besides retaining the beginning section to illustrate the theme and background, it also samples evenly from different positions throughout the text at a fixed ratio, and additionally retains the content from the end. Compared to simply extracting the first few paragraphs or a fixed length of text, this method better covers the overall structure of the article, ensuring that the generated semantic representation not only reflects the topics discussed at the beginning but also retains important information from the middle and later sections.
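The selection order described above can be traced on paragraph indices alone. The helper below is a hypothetical illustration, not part of build_semantic_source.py. It uses integer arithmetic in place of the script's `int(total * ratio)` (the two may differ by one in rare floating-point rounding cases), and it deduplicates by index, whereas the script deduplicates by paragraph text.

```python
def select_indices(total, head=3, tail=2, sample_threshold=10):
    """Return the indices of the paragraphs that would be kept, in order.

    head:             number of opening paragraphs always kept
    tail:             number of closing paragraphs always kept
    sample_threshold: evenly spaced samples are added only above this count
    """
    selected = list(range(min(head, total)))

    # Samples at 10%, 20%, ..., 90% of the article, only for longer articles
    if total > sample_threshold:
        for step in range(1, 10):
            selected.append(total * step // 10)

    selected.extend(range(max(total - tail, 0), total))

    # Drop repeated indices while preserving the selection order
    seen = set()
    ordered = []
    for i in selected:
        if i not in seen:
            seen.add(i)
            ordered.append(i)
    return ordered


long_article = select_indices(20)
short_article = select_indices(5)
print(long_article)
print(short_article)
```

For a 20-paragraph article this keeps 12 paragraphs spread over the whole text, while a 5-paragraph article is kept in full because the head and tail together already cover it.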

Finally, the script combines the title with these filtered content fragments into a unified semantic text and writes it to semantic-source.json. Subsequent embedding processes no longer directly process the original WordPress post, but instead perform vectorized calculations based on this preprocessed and compressed semantic data.

3.2 Generation script for semantic-vector.json

This section assumes you have access to an embedding model.

If there are no special requirements regarding cost and deployment method, you can directly call OpenAI's text-embedding-3-small, or the embedding interfaces provided by other mainstream large model service providers. These commercial models usually have good Chinese understanding and semantic expression capabilities, and are relatively simple to implement.

However, for scenarios like personal blogs, the number of articles often continues to grow. If a cloud API needs to be called every time the semantic index is regenerated, it will not only incur additional costs in the long run, but also increase dependence on external services.

Since I previously wrote an article about deploying the local embedding model nomic-embed-text in a home data center using Ollama, this article uses nomic-embed-text directly to generate the vectors (for those unfamiliar, please refer to my previous article: Making Embedding an Infrastructure: Deploying Standalone Embedded Services in PVE + LXC). Of course, it is even better to deploy Ollama on a Mac mini (M series), as embedding will be much faster. In my actual production environment, for example, I use the Qwen3-embedding:8B embedding model directly. For specific setup and configuration instructions, please refer to my previous article: Deploy Llama 3.2 on Mac mini (M4 Pro): A complete guide to achieve efficient operation and cross-domain access optimization through Ollama.

It's important to note that this article doesn't focus on which specific embedding model is used (e.g., OpenAI's commercial model, locally deployed nomic-embed-text, or Qwen series embedding models), but rather on the method for constructing the semantic indexing chain. Therefore, as long as the model provides a standard embedding interface, the overall implementation approach remains consistent.

Different models do differ in semantic expressive power, computational resource consumption, and adaptability to Chinese contexts, but these differences are implementation-level choices and do not affect the overall architecture design.
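One way to keep the model choice an implementation detail is to isolate the response parsing behind a single function. The sketch below is an assumption-laden illustration rather than part of the author's scripts: the function names and the `PARSERS` table are hypothetical. The Ollama response shape matches the one read by build_semantic_vector.py, and the second shape is the `data[0].embedding` layout returned by OpenAI's embeddings endpoint.

```python
def parse_ollama_embedding(response_json):
    """Ollama /api/embed responds with {"embeddings": [[...], ...]}."""
    embeddings = response_json.get("embeddings") or []
    if not embeddings:
        raise ValueError("no embedding found in Ollama-style response")
    return embeddings[0]


def parse_openai_embedding(response_json):
    """OpenAI-style responses carry {"data": [{"embedding": [...]}, ...]}."""
    data = response_json.get("data") or []
    if not data:
        raise ValueError("no embedding found in OpenAI-style response")
    return data[0]["embedding"]


# Hypothetical registry: the rest of the chain only ever sees a list of floats.
PARSERS = {
    "ollama": parse_ollama_embedding,
    "openai": parse_openai_embedding,
}


def extract_vector(backend, response_json):
    """Turn a backend-specific JSON response into a plain vector."""
    return PARSERS[backend](response_json)


# Fake responses with made-up numbers, used instead of real network calls.
ollama_vector = extract_vector("ollama", {"embeddings": [[0.1, 0.2, 0.3]]})
openai_vector = extract_vector("openai", {"data": [{"embedding": [0.4, 0.5]}]})
print(ollama_vector, openai_vector)
```

Everything downstream of `extract_vector` stays unchanged when the backend is swapped. Vectors from different models must not be mixed in one index, however, because their dimensions and coordinate spaces differ.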

After completing the semantic-source.json file, the next step is to convert the articles into vector representations. Therefore, the task at this stage is very clear: read the semantic-source.json file and generate corresponding vector data for each article.

The code for the corresponding script build_semantic_vector.py is as follows:

import json
import requests
import os
import time
import numpy as np
import hashlib

# -------------------------
# Ollama embedding API
# -------------------------
# Adjust to match your environment
OLLAMA_URL = "http://127.0.0.1:11434/api/embed"
MODEL_NAME = "nomic-embed-text"

# -------------------------
# config
# -------------------------
MAX_SEGMENT_LENGTH = 1500
MAX_SEGMENTS = 8


# -------------------------
# load data
# -------------------------
def load_semantic_source(input_path=None):
    if input_path is None:
        base_dir = os.path.dirname(os.path.abspath(__file__))
        input_path = os.path.join(base_dir, "semantic-source.json")

    with open(input_path, "r", encoding="utf-8") as f:
        data = json.load(f)

    print(f"Loaded semantic source: {len(data)} articles")
    return data


# -------------------------
# load cache
# -------------------------
def load_existing_vectors(output_path=None):
    if output_path is None:
        base_dir = os.path.dirname(os.path.abspath(__file__))
        output_path = os.path.join(base_dir, "semantic-vector.json")

    if not os.path.exists(output_path):
        return {}

    with open(output_path, "r", encoding="utf-8") as f:
        data = json.load(f)

    return {item["id"]: item for item in data}


# -------------------------
# hash
# -------------------------
def make_hash(text):
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# -------------------------
# split text
# -------------------------
def split_text_for_embedding(text):
    text = text.strip()
    if not text:
        return []

    paragraphs = []

    for line in text.split("\n"):
        line = line.strip()
        if not line:
            continue

        while len(line) > MAX_SEGMENT_LENGTH:
            paragraphs.append(line[:MAX_SEGMENT_LENGTH])
            line = line[MAX_SEGMENT_LENGTH:]

        if line:
            paragraphs.append(line)

    return paragraphs[:MAX_SEGMENTS]


# -------------------------
# embedding
# -------------------------
def get_embedding(text, retries=3):

    payload = {
        "model": MODEL_NAME,
        "input": text
    }

    for i in range(retries):
        try:
            resp = requests.post(OLLAMA_URL, json=payload, timeout=120)

            if resp.status_code == 200:
                data = resp.json()
                emb = data.get("embeddings", [])
                if emb:
                    return emb[0]

            print(f"[Retry {i}] HTTP {resp.status_code}: {resp.text}")

        except Exception as e:
            print(f"[Retry {i}] Exception: {e}")

        time.sleep(1.5 * (i + 1))

    raise Exception("Embedding failed")


# -------------------------
# build vectors (incremental)
# -------------------------
def build_semantic_vectors(articles, existing):
    vectors = []
    total = len(articles)

    for index, article in enumerate(articles, start=1):

        title = article.get("title", "")
        content = article.get("content", "")
        article_id = article.get("id", "")

        content_hash = make_hash(title + content)

        cached = existing.get(article_id)

        # -------------------------
        # skip unchanged
        # -------------------------
        if cached and cached.get("hash") == content_hash:
            print(f"[{index}/{total}] Skip: {title}")
            vectors.append(cached)
            continue

        print(f"[{index}/{total}] Embedding: {title}")

        segments = split_text_for_embedding(content)

        if not segments:
            print(f"[WARN] Empty content: {title}")
            continue

        embeddings = []

        for seg in segments:
            print(f"Segment length: {len(seg)}")

            emb = get_embedding(seg)

            if emb:
                embeddings.append(np.array(emb))

            time.sleep(0.1)

        if not embeddings:
            continue

        avg_vector = np.mean(embeddings, axis=0).tolist()

        vectors.append({
            "id": article_id,
            "title": title,
            "hash": content_hash,
            "vector": avg_vector
        })

    return vectors


# -------------------------
# write
# -------------------------
def write_json(data, output_path=None):
    if output_path is None:
        base_dir = os.path.dirname(os.path.abspath(__file__))
        output_path = os.path.join(base_dir, "semantic-vector.json")

    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False)

    print(f"\nJSON written to: {output_path}")


# -------------------------
# main
# -------------------------
if __name__ == "__main__":

    articles = load_semantic_source()
    existing = load_existing_vectors()

    vectors = build_semantic_vectors(articles, existing)

    write_json(vectors)

    print("\nSample vector:\n")

    if vectors:
        print({
            "id": vectors[0]["id"],
            "title": vectors[0]["title"],
            "vector_dimensions": len(vectors[0]["vector"])
        })

Unlike the previous section which generated semantic-source.json, this step truly begins the Embedding computation phase.

The script first reads the article content prepared in semantic-source.json and then tries to load the existing vector cache, semantic-vector.json; if the cache does not exist, the script starts from an empty one. For each article, the script computes a hash of the title and content to determine whether a semantic vector has already been generated and whether the content is unchanged. Only new articles, or articles whose content has changed, are sent to the locally deployed Ollama service, where the nomic-embed-text model generates the corresponding semantic vectors; unchanged articles reuse their cached vectors directly, which avoids repeated computation.
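The skip-or-recompute decision can be looked at in isolation. The following is a minimal sketch rather than the script itself: `needs_embedding` is a hypothetical helper, and the cache layout simply mirrors the `id` and `hash` fields that build_semantic_vector.py writes.

```python
import hashlib


def make_hash(text):
    # MD5 is used here purely as a change detector, not for security.
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def needs_embedding(article, cache):
    """Return True when the article is new or its title/content changed.

    `cache` maps article id -> previously stored record with a "hash" field.
    (Hypothetical helper, for illustration only.)
    """
    current_hash = make_hash(article.get("title", "") + article.get("content", ""))
    cached = cache.get(article.get("id"))
    return cached is None or cached.get("hash") != current_hash


# A cache holding one previously embedded article.
cache = {
    "a1": {"id": "a1", "hash": make_hash("Hello" + "world"), "vector": [0.1, 0.2]}
}

unchanged = {"id": "a1", "title": "Hello", "content": "world"}
edited = {"id": "a1", "title": "Hello", "content": "world, revised"}
brand_new = {"id": "a2", "title": "New", "content": "post"}

print(needs_embedding(unchanged, cache))  # False
print(needs_embedding(edited, cache))     # True
print(needs_embedding(brand_new, cache))  # True
```

Because the hash covers only the title and content, other changes (for example a changed URL) would not trigger a re-embedding, which is presumably the intended behavior since they do not affect the vector.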

Considering that some articles are quite long, submitting the entire content at once would not only increase the processing load on the model but may also affect vector quality. Therefore, the script first segments the content, breaking the article down into multiple smaller semantic fragments, and then performs embedding calculations on each fragment separately.
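To make the segmentation rule concrete, here is a small stand-alone sketch of the same splitting idea with deliberately tiny limits. `split_text` and its parameters are illustrative names; the real script uses `MAX_SEGMENT_LENGTH = 1500` and `MAX_SEGMENTS = 8`.

```python
def split_text(text, max_len, max_segments):
    """Split on newlines, then hard-cut any line longer than max_len."""
    segments = []
    for line in text.strip().split("\n"):
        line = line.strip()
        # Cut overly long lines into fixed-size chunks.
        while len(line) > max_len:
            segments.append(line[:max_len])
            line = line[max_len:]
        if line:
            segments.append(line)
    # Anything beyond max_segments is dropped.
    return segments[:max_segments]


sample = "short line\n\n" + "x" * 25 + "\nlast"
parts = split_text(sample, max_len=10, max_segments=8)
print(parts)
# ['short line', 'xxxxxxxxxx', 'xxxxxxxxxx', 'xxxxx', 'last']
```

One consequence worth keeping in mind: every non-empty line becomes its own segment, so with a cap on the number of segments, an article made of many short paragraphs is represented only by its first few paragraphs. Whether that matters depends on how compact the content in semantic-source.json already is.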

After all segments have been vectorized, the average of these segment vectors is calculated to obtain the final semantic vector representing the entire article.
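The averaging step is a plain element-wise mean. The sketch below reproduces it without NumPy so it can be run anywhere; `mean_vector` is a hypothetical name, and the three-dimensional vectors are toy values (real embeddings have hundreds of dimensions or more).

```python
def mean_vector(vectors):
    """Element-wise average of equally sized vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same dimension")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Two segment vectors for one article (toy values).
segment_vectors = [[1.0, 0.0, 2.0], [3.0, 4.0, 0.0]]
article_vector = mean_vector(segment_vectors)
print(article_vector)  # [2.0, 2.0, 1.0]
```

Note that every segment carries the same weight regardless of its length, which matches what `np.mean(embeddings, axis=0)` does in the script.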

This combination of incremental computation and caching mechanisms does not aim for absolute precision, but rather to achieve a relatively reasonable balance between computational cost, response time, and semantic coverage. For scenarios with limited content, such as personal blogs, this approach is sufficient to achieve adequately stable semantic expression capabilities.

At this point, the article is no longer text in the traditional sense; it has been transformed into a vector representation that computers can use for mathematical operations and similarity comparisons. The subsequent generation of semantic-path.json is based on these vectors.

Notice:

This implementation was initially built based on the embedding model nomic-embed-text during the debugging phase. In the design of the semantic indexing system, the embedding model is essentially a "definer of semantic space". Therefore, replacing the embedding model is not a simple "model name switch", but a system-level change that may affect the distribution of the entire semantic retrieval results.

Different models (such as nomic-embed-text used during the debugging phase and qwen3-embedding:8b that I am currently using) differ in the following aspects:

Input tokenization method
Vector space distribution structure
API parameters and return format
Semantic compression strategies

This code retains the nomic-embed-text version as the baseline implementation, and its main value lies in:
1. Low resource consumption, suitable for local debugging and rapid verification;
2. The semantic expression is relatively stable, making it easy to build an initial index system;
3. As a lightweight baseline model, it helps to compare the performance differences of different embedding models.

In production environments (such as the currently used qwen3-embedding:8b), the embedding request and index generation logic need to be adapted and adjusted based on the actual model characteristics, rather than simply replacing the model name.
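Since vectors produced by different models live in different semantic spaces, mixing cached vectors from one model with fresh vectors from another would make the similarity scores meaningless. One possible safeguard, sketched here under the assumption that recomputing everything after a model switch is acceptable, is to fold the model name into the cache hash. The published script hashes only the title and content; `make_cache_key` is a hypothetical variation, not part of it.

```python
import hashlib


def make_cache_key(model_name, title, content):
    # A separator that is unlikely to appear in real text keeps the three
    # fields from running into each other.
    raw = "\x1f".join([model_name, title, content])
    return hashlib.md5(raw.encode("utf-8")).hexdigest()


key_nomic = make_cache_key("nomic-embed-text", "Title", "Body")
key_qwen = make_cache_key("qwen3-embedding:8b", "Title", "Body")

# Same article, different model: the keys differ, so the cached vector
# would be treated as stale and regenerated.
print(key_nomic != key_qwen)  # True
```

With this variation, changing `MODEL_NAME` invalidates every cached entry on the next run instead of silently reusing vectors from the old model.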

3.3 Generating script for semantic-path.json

After completing the semantic-vector.json file, the next step is to establish semantic relationships between the articles.

Therefore, the task at this stage is very clear: read the article vectors stored in `semantic-vector.json`, calculate the similarity between articles, and finally generate `semantic-path.json` and save it in the following path in WordPress:

`wp-content/themes/theme_directory/cache/semantic-path.json`

Because I'm using the Argon theme, the corresponding save path is:

wp-content/themes/argon-theme-master/cache/semantic-path.json

The script for this step is build_semantic_path.py:

import json
import os
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# -------------------------
# Configuration
# -------------------------
TOP_K = 6
SEMANTIC_THRESHOLD = 0.72

# URLs to exclude from processing; here, the map pages on my blog
EXCLUDED_PATH_KEYWORDS = [
    "/cloudflaremap/",
    "/aimap/",
    "/singingmap/",
    "/roadmap/"
]

# -------------------------
# Step 1
# -------------------------
def load_semantic_source(input_path=None):
    if input_path is None:
        base_dir = os.path.dirname(
            os.path.abspath(__file__)
        )
        input_path = os.path.join(
            base_dir,
            "semantic-source.json"
        )

    with open(
        input_path,
        "r",
        encoding="utf-8"
    ) as f:
        data = json.load(f)

    print(
        f"Loaded semantic source: {len(data)} articles"
    )

    return data

# -------------------------
# Step 2
# -------------------------
def load_semantic_vector(input_path=None):
    if input_path is None:
        base_dir = os.path.dirname(
            os.path.abspath(__file__)
        )
        input_path = os.path.join(
            base_dir,
            "semantic-vector.json"
        )

    with open(
        input_path,
        "r",
        encoding="utf-8"
    ) as f:
        data = json.load(f)

    print(
        f"Loaded semantic vectors: {len(data)} articles"
    )

    return data

# -------------------------
# Filter out map pages
# -------------------------
def is_excluded(article):
    url = article.get("url", "")

    return any(
        keyword in url
        for keyword in EXCLUDED_PATH_KEYWORDS
    )

# -------------------------
# Step 3
# -------------------------
def build_article_map(
    source_data,
    vector_data
):
    vector_map = {
        item["id"]: item
        for item in vector_data
    }

    article_map = {}

    for item in source_data:
        if is_excluded(item):
            continue

        article_id = item["id"]

        vector_item = vector_map.get(article_id)

        if not vector_item:
            continue

        article_map[article_id] = {
            "id": article_id,
            "title": item.get("title", ""),
            "url": item.get("url", ""),
            "vector": vector_item.get(
                "vector",
                []
            )
        }

    print(
        f"Merged articles: {len(article_map)}"
    )

    return article_map

# -------------------------
# semantic similarity
# -------------------------
def calculate_semantic_similarity(a, b):
    va = np.array(
        a["vector"]
    ).reshape(1, -1)

    vb = np.array(
        b["vector"]
    ).reshape(1, -1)

    return float(
        cosine_similarity(va, vb)[0][0]
    )

# -------------------------
# semantic recommendation
# -------------------------
def build_semantic_recommendation(
    current_article,
    article_map
):
    scored = []

    for article_id, article in article_map.items():
        if article_id == current_article["id"]:
            continue

        if is_excluded(article):
            continue

        sim = calculate_semantic_similarity(
            current_article,
            article
        )

        scored.append({
            "id": article["id"],
            "title": article["title"],
            "url": article["url"],
            "score": sim
        })

    scored.sort(
        key=lambda x: x["score"],
        reverse=True
    )

    filtered = [
        item
        for item in scored
        if item["score"] >= SEMANTIC_THRESHOLD
    ]

    if len(filtered) < TOP_K:
        filtered = scored[:TOP_K]

    return filtered[:TOP_K]

# -------------------------
# Main flow
# -------------------------
def build_semantic_path(article_map):
    results = {}

    total = len(article_map)

    for idx, (
        article_id,
        article
    ) in enumerate(
        article_map.items(),
        start=1
    ):
        print(
            f"[{idx}/{total}] "
            f"{article['title']}"
        )

        items = build_semantic_recommendation(
            article,
            article_map
        )

        results[article_id] = {
            "article_id": article_id,
            "article_title": article["title"],
            "article_url": article["url"],
            "items": items
        }

    return results

# -------------------------
# Write JSON; adjust the path for your setup. Here it goes straight into the theme directory of my own WordPress install.
# -------------------------
def write_json(data, output_path=None):
    if output_path is None:
        output_path = (
            "/docker/wordpress/html/"
            "wp-content/themes/"
            "argon-theme-master/cache/"
            "semantic-path.json"
        )

    output_dir = os.path.dirname(
        output_path
    )

    if (
        output_dir
        and not os.path.exists(output_dir)
    ):
        os.makedirs(
            output_dir,
            exist_ok=True
        )

    with open(
        output_path,
        "w",
        encoding="utf-8"
    ) as f:
        json.dump(
            data,
            f,
            ensure_ascii=False,
            indent=2
        )

    print(
        f"\nJSON written to: {output_path}"
    )

# -------------------------
# Main
# -------------------------
if __name__ == "__main__":
    source_data = load_semantic_source()

    vector_data = load_semantic_vector()

    article_map = build_article_map(
        source_data,
        vector_data
    )

    semantic_path = build_semantic_path(
        article_map
    )

    write_json(semantic_path)

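    # Print one entry as a sanity check; skipped when nothing was generated,
    # since next() on an empty iterator would raise StopIteration.
    if semantic_path:
        first_key = next(
            iter(semantic_path)
        )

        print("\nSample:\n")

        print(
            json.dumps(
                semantic_path[first_key],
                ensure_ascii=False,
                indent=2
            )
        )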

The script first reads all article vectors stored in `semantic-vector.json` and then calculates the semantic similarity between articles. Cosine similarity serves as the measure of how close two articles are in the semantic space: the higher the value, the closer the articles.
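For reference, the cosine similarity of two vectors a and b is their dot product divided by the product of their lengths: cos(a, b) = (a · b) / (|a| |b|). The script delegates this to scikit-learn; the sketch below is a dependency-free equivalent for illustration only, and the zero-vector handling is my own choice.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equally sized vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Direction is undefined for a zero vector; treat it as unrelated.
        return 0.0
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # 1.0  (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))   # 0.0  (orthogonal)
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0 (opposite)
```

The first example shows why cosine similarity suits this use: vector length does not matter, only direction. As a possible optimization, scikit-learn's `cosine_similarity` also accepts a whole matrix and returns all pairwise scores in one call, which would avoid calling it once per article pair.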

For each article, the script iterates through the vectors of all other articles and sorts them from highest to lowest similarity. It then keeps the top `TOP_K` results (6 in this configuration) as recommendation candidates for that article.
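The selection logic at the end of `build_semantic_recommendation` can be reproduced in a few lines. This is a sketch with made-up scores; `pick_recommendations` is a hypothetical name that mirrors the sort, threshold filter, and fallback from the script.

```python
def pick_recommendations(scored, top_k, threshold):
    """Sort by score, prefer items above the threshold, fall back to top_k."""
    ranked = sorted(scored, key=lambda item: item["score"], reverse=True)
    above = [item for item in ranked if item["score"] >= threshold]
    if len(above) < top_k:
        # Not enough strong matches: fall back to the best top_k overall.
        return ranked[:top_k]
    return above[:top_k]


candidates = [
    {"id": "a", "score": 0.91},
    {"id": "b", "score": 0.75},
    {"id": "c", "score": 0.60},
    {"id": "d", "score": 0.40},
]

picked = pick_recommendations(candidates, top_k=3, threshold=0.72)
print([item["id"] for item in picked])  # ['a', 'b', 'c']
```

One observation that follows from this structure: because the items above the threshold are always a prefix of the sorted list, both branches return the same top `top_k` items. With the fallback in place, `SEMANTIC_THRESHOLD` therefore does not change the output. If a hard cutoff is wanted, the fallback would have to be removed, at the cost of some articles getting fewer than `TOP_K` recommendations.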

The blog also contains navigation pages, learning maps, and other special pages. These pages take part in building the site structure, but they are not suitable as recommended content, so the script excludes them during the calculation.

In the final generated semantic-path.json, each article will store a set of pre-calculated semantic relationships.
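Based on the fields that `build_semantic_path` and `build_semantic_recommendation` write, a single entry has roughly the shape shown below. The IDs, titles, URLs, and scores are placeholders invented for illustration, not real data from the blog.

```python
import json

# One entry of semantic-path.json, keyed by article ID (placeholder values).
sample_path = {
    "101": {
        "article_id": "101",
        "article_title": "Example article A",
        "article_url": "https://example.com/a/",
        "items": [
            {"id": "205", "title": "Example article B",
             "url": "https://example.com/b/", "score": 0.83},
            {"id": "317", "title": "Example article C",
             "url": "https://example.com/c/", "score": 0.79},
        ],
    }
}

print(json.dumps(sample_path, ensure_ascii=False, indent=2))
```

The `items` list is already sorted by descending score, so a consumer can simply take the first few entries without any further computation.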

3.4 WordPress Frontend Integration

At this point, the entire semantic computation chain is essentially complete. For the WordPress frontend, it doesn't need to understand embeddings, vector spaces, or similarity algorithms. Its only task is very simple: based on the current post ID, it retrieves the corresponding list of related posts from `semantic-path.json` and displays the results. The entire frontend stage is essentially just a data reading process.
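To show how little the read side has to do, here is the same lookup expressed in Python rather than PHP. It is only an illustration of the data flow; `lookup_recommendations` and the sample data are hypothetical.

```python
import json


def lookup_recommendations(semantic_path, post_id, limit=3):
    """Return up to `limit` precomputed recommendations for a post."""
    # JSON object keys are always strings, so normalise the ID first.
    entry = semantic_path.get(str(post_id))
    if not entry:
        return []
    return [
        {"title": item["title"], "url": item["url"]}
        for item in entry.get("items", [])[:limit]
    ]


# Simulate the file round trip: an integer key becomes a string key in JSON.
raw = json.dumps({
    101: {"items": [
        {"title": "B", "url": "/b/", "score": 0.83},
        {"title": "C", "url": "/c/", "score": 0.79},
    ]}
})
semantic_path = json.loads(raw)

print(lookup_recommendations(semantic_path, 101, limit=1))
# [{'title': 'B', 'url': '/b/'}]
print(lookup_recommendations(semantic_path, 999))
# []
```

No embedding model, vector, or similarity function appears anywhere in this step, which is the point of precomputing the relationships offline.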

The corresponding PHP code in WordPress is as follows (the left side shows the "Related Articles" feature, and the right side shows the "Guess What You Want" feature):

function add_recommend_list_after_content($content) {
    if (!is_single() || !in_the_loop() || !is_main_query()) {
        return $content;
    }

    $post_id = get_the_ID();

    // =========================
    // Filter out map / structure pages
    // =========================
    $map_slugs = [
        '/map/',
        '/cloudflaremap/',
        '/aimap/',
        '/roadmap'
    ];

    $current_url = $_SERVER['REQUEST_URI'] ?? '';

    foreach ($map_slugs as $slug) {
        if (strpos($current_url, $slug) !== false) {
            return $content;
        }
    }

    // =========================
    // Load article-index.json
    // =========================
    $article_index_file = get_template_directory() . '/cache/article-index.json';

    if (!file_exists($article_index_file)) {
        return $content;
    }

    $article_index = json_decode(
        file_get_contents($article_index_file),
        true
    );

    if (!isset($article_index[$post_id])) {
        return $content;
    }

    // =========================
    // Load semantic-path.json
    // =========================
    $semantic_path_file = get_template_directory() . '/cache/semantic-path.json';
    $semantic_path = [];
    if (file_exists($semantic_path_file)) {
        $semantic_path = json_decode(
            file_get_contents($semantic_path_file),
            true
        );
    }

    $current = $article_index[$post_id];

    // =========================
    // Left: related articles (article-index.json)
    // =========================
    $related_ids = $current['related'] ?? [];
    $related_list = [];

    foreach ($related_ids as $related_id) {
        if (!isset($article_index[$related_id])) {
            continue;
        }
        $related_list[] = [
            'title' => $article_index[$related_id]['title'],
            'url'   => $article_index[$related_id]['url']
        ];
    }
    $related_count = 5;
    $related_list = array_slice($related_list, 0, $related_count);

    // =========================
    // Right: further reading (semantic-path.json)
    // =========================
    $semantic_list = [];
    $semantic_count = 3;
    if (!empty($semantic_path) && isset($semantic_path[$post_id])) {
        $semantic_items = $semantic_path[$post_id]['items'] ?? [];
        foreach ($semantic_items as $item) {
            // Deduplicate: skip items already shown under related articles
            $duplicate = false;
            foreach ($related_list as $related_item) {
                if ($related_item['url'] === $item['url']) {
                    $duplicate = true;
                    break;
                }
            }
            if ($duplicate) {
                continue;
            }

            $semantic_list[] = [
                'title' => $item['title'],
                'url'   => $item['url']
            ];
            if (count($semantic_list) >= $semantic_count) {
                break;
            }
        }
    }

    // =========================
    // Nothing to show
    // =========================
    if (empty($related_list) && empty($semantic_list)) {
        return $content;
    }

    // =========================
    // UI
    // =========================
    $html = '<div class="post-recommend-wrapper" style="display:flex; gap:0; margin:28px 0; padding:16px 18px; background:#f0f7ff; border-left:4px solid #3b82f6; border-radius:6px; font-size:14px; line-height:1.7; color:#444;">';

    // =====================================================
    // Left: related articles
    // =====================================================
    $html .= '<div style="flex:1; padding-right:16px; min-width:0;">';

    // Heading text: "Related Articles"
    $html .= '<div style="font-size:13px; color:#666; margin-bottom:8px; font-weight:600;"> 📎 相关文章 </div>';

    if (!empty($related_list)) {
        foreach ($related_list as $item) {
            $html .= '<div style="margin:6px 0;">
                <a href="' . esc_url($item['url']) . '" target="_blank" style="color:#1d4ed8; text-decoration:none; font-weight:500;">
                ' . esc_html($item['title']) . '
                </a>
            </div>';
        }
    } else {
        // Placeholder text: "No related articles yet"
        $html .= '<div style="color:#999;"> 暂无相关文章 </div>';
    }

    $html .= '</div>';

    // =====================================================
    // Divider
    // =====================================================
    $html .= '<div style="width:1px; background:#dbeafe; margin:0 10px;"></div>';

    // =====================================================
    // Right: Guess What You Want
    // =====================================================
    $html .= '<div style="flex:1; padding-left:16px; min-width:0;">';

    // Heading text: "Guess What You Want"
    $html .= '<div style="font-size:13px; color:#666; margin-bottom:8px; font-weight:600;"> 🔗 猜你所想 </div>';

    if (!empty($semantic_list)) {
        foreach ($semantic_list as $item) {
            $html .= '<div style="margin:6px 0;">
                <a href="' . esc_url($item['url']) . '" target="_blank" style="color:#2563eb; text-decoration:none; font-weight:500;">
                ' . esc_html($item['title']) . '
                </a>
            </div>';
        }
    } else {
        // Placeholder text: "No suggestions yet"
        $html .= '<div style="color:#999;"> 暂无猜你所想 </div>';
    }

    $html .= '</div>';

    $html .= '</div>';

    return $content . $html;
}

add_filter('the_content', 'add_recommend_list_after_content', 20);

The above PHP code can be added using the Code Snippets plugin or by adding it to functions.php.

From an overall logical perspective, this code mainly completes three steps.

First, the system determines whether the current page is an article page and filters out special pages such as knowledge maps and learning maps. These pages serve navigation and indexing functions and are not suitable for inclusion in the recommendation system, so they are directly excluded from the recommendation logic.

The program then reads two data files, article-index.json and semantic-path.json, from the path "wordpress/html/wp-content/themes/argon-theme-master/cache/". article-index.json stores the structured relationships used to generate the "Related Articles" list on the left, while semantic-path.json stores the embedding-based semantic relationships used to generate the "Guess What You Want" section on the right.

Once the current article ID is obtained, the system extracts the corresponding recommendation results from both datasets and assembles them into the final display content. To avoid duplicate recommendations, an article that already appears in the "Related Articles" list will not appear again in the "Guess What You Want" area.
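The assembly and deduplication step can be sketched compactly as follows. This is an illustrative Python version of the logic, not a replacement for the PHP: `merge_lists` is a hypothetical name, the limits of 5 and 3 come from the PHP code, and a set lookup stands in for the nested loop.

```python
def merge_lists(related, semantic_items, related_limit=5, semantic_limit=3):
    """Cap the related list, then fill the semantic list without repeats."""
    related = related[:related_limit]
    seen_urls = {item["url"] for item in related}

    semantic = []
    for item in semantic_items:
        if item["url"] in seen_urls:
            continue  # already shown on the left
        semantic.append({"title": item["title"], "url": item["url"]})
        if len(semantic) >= semantic_limit:
            break
    return related, semantic


related = [{"title": "A", "url": "/a/"}, {"title": "B", "url": "/b/"}]
semantic_items = [
    {"title": "B", "url": "/b/", "score": 0.90},  # duplicate of a related item
    {"title": "C", "url": "/c/", "score": 0.85},
    {"title": "D", "url": "/d/", "score": 0.80},
]

left, right = merge_lists(related, semantic_items)
print([item["title"] for item in right])  # ['C', 'D']
```

This also explains why the build script stores six candidates per article even though only three are shown: the extra candidates leave room for some of them to be removed as duplicates of the related list.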

From an implementation perspective, WordPress's role here is actually quite simple. It is neither responsible for vector generation nor similarity calculation; it simply reads the pre-generated JSON data and outputs it to the article page according to preset styles.

Therefore, regardless of the embedding model used in the backend or how the semantic relationships are generated, the frontend code does not need to be changed. For WordPress, `semantic-path.json` is essentially just a regular data file.

Thus, the entire implementation loop of the "Guess What You Want" feature is complete. Starting from the original article content, through semantic organization, vectorization processing, and similarity calculation, the semantic relationships between articles are finally formed and presented to readers in the form of a recommendation list on the front end.

3.5 Demonstration and Recommendation Analysis of the "Guess What You Want" Feature

To observe the differences between the two recommendation mechanisms more intuitively, this section uses the article "Understanding RAG from Scratch (Part 1): Principles and Complete Process Analysis" as an example. Below it, the traditional "Related Articles" section provides the following recommendations:

Understanding RAG from Scratch (Part 2): Running a Local RAG Demo on a Mac Mini – A Practical Guide to Minimal Architecture
Practical application of Ollama's self-built embedded model + Chatbox knowledge base
The most convenient AI App front-end: Chatbox - A comprehensive introduction and user guide
Vectors: The Universal Language of the AI World
Starting the AI journey: A detailed introduction to local large language model UI and large language model API providers

The recommendations provided in the "Guess What You Want" section are as follows:

Building a Lightweight Knowledge Index for Your Blog (Part 2): Implementation of JSON Structure and Script Generation
Building a Lightweight Knowledge Index for Your Blog (Part 1): Structure Design and Construction Process
Building a Lightweight Knowledge Index for Blogs (Part 3): Design and Implementation of a Lightweight Recommendation System for WordPress Based on Pre-computed Semantic Indexes

At first glance, this result seems somewhat similar to the traditional "related articles" because the recommended content also focuses on related topics such as knowledge indexing, semantic retrieval, and recommendation systems. However, the underlying implementation logic of the two is actually different.

"Related Articles" essentially relies on the article structure. Whether through categories, tags, series relationships, or manually compiled knowledge maps, it focuses on "which system this article belongs to," so its recommendations usually extend along an established knowledge path. "Guess What You Want" focuses on something else entirely: what exactly the article is discussing.

For example, while "Building a Lightweight Knowledge Index for Blogs (Part 3)" and the first two articles in the series are related, semantically they also revolve around semantic index construction, data organization, and recommender system design. Within the semantic space established by embedding, these articles are naturally clustered in similar locations, so even without relying on categories, tags, or series relationships, they are highly likely to be interconnected.

This is also one of the biggest differences between semantic recommendation and traditional related articles: it does not care which category an article is placed in, but tries to understand the actual content expressed in the article, and then establishes a connection based on semantic similarity.

Of course, in some cases, this association may transcend existing classification systems. For example, two articles, though on different topics, might still be close in semantic space if they discuss similar concepts such as knowledge organization, system design, or cognitive structure, and thus be recommended together. However, whether and to what extent this cross-domain association occurs largely depends on the embedding model used and the way the semantic space is constructed.

From this perspective, "Related Articles" is more like continuing along a predetermined path on a given map, while "Guess What You Want" is like finding the nearest neighbors in a semantic space. The two are not substitutes for each other; they connect content along different dimensions.

Initially, I simply wanted to use an embedding model to build a "Guess What You Want" feature that recommends related content based on vector similarity. However, because I had not thought enough about segmentation strategies, cleaning rules, and semantic coverage early on, the feature did not perform consistently. So I took a step back and implemented a "Related Articles" module based on more intuitive structural information to guarantee a basic content association experience.

Later, as the overall process was gradually optimized, including adjustments to text cleaning, segmentation strategies, and vector construction, the "Guess What You Want" feature gradually achieved the expected results. However, a natural phenomenon emerged: its recommendations were often quite similar to those of "Related Articles."

In principle, this is not surprising—both are essentially making judgments on "content relevance," one based on semantic vectors and the other on structure or explicit relationships. Therefore, when the content quality is high and the theme is clear, it is normal for the results to converge.

I initially considered keeping only one module, but ultimately decided to keep both. The reason is quite simple: although their recommendation results are similar, their focuses are not entirely the same. "Guess What You Want" leans more towards expanding potential interests in the semantic space, while "Related Articles" focuses more on deterministic connections at the structural level. The former provides room for exploration, while the latter provides anchor points for reading; the combination of the two makes the content navigation more stable and hierarchical.

However, to keep the primary and secondary information distinct, "Guess What You Want" displays at most 3 recommendations.

4 Conclusion

Over the past few years, technologies such as semantic retrieval, vector databases, and RAG have gradually become important infrastructure for AI applications. However, for personal blogs, the real problems that need to be solved are often not so complex. In most cases, what we need is not a semantic retrieval system that updates in real time and supports hundreds of millions of data points, but simply to establish relationships between articles that are more consistent with the content itself.

Therefore, throughout the implementation process, I always adhered to one principle: keep the complexity in the build phase and the simplicity in the runtime phase. Through the three-layer structure of semantic-source.json, semantic-vector.json, and semantic-path.json, all semantic calculations are completed in advance, and WordPress is only responsible for reading the results and displaying the content.

While this approach sacrifices real-time computing power, it achieves extremely low operating and maintenance costs. For scenarios like personal blogs, where update frequency is far lower than access frequency, this trade-off is often more reasonable. In the end, "Related Articles" maintains the knowledge structure, while "Guess What You Want" discovers semantic connections; together, they form a recommendation mechanism that balances interpretability and exploration.

However, for me, the greatest value of this system may not lie in the recommendations themselves. What's truly meaningful is the semantic network of connections built by Embedding—in traditional blog systems, the connections between articles usually come from categories, tags, series articles, and manually maintained knowledge maps; but after introducing semantic connections, articles begin to have a connection relationship based on the content itself.

This relationship does not rely on a pre-designed classification system or manual organization, but rather forms naturally from the content actually expressed in the article. In other words, classification, tags, and knowledge maps describe the knowledge structure as seen by the author; while the semantic connections established by embedding attempt to discover the relationships presented within the content itself.

In a sense, this can be seen as an attempt to gradually evolve blogs from information collections into knowledge networks.

📚 Series: Building a "Lightweight Knowledge Index" for Blogs (5 / 8)

