Building a Lightweight Knowledge Index for Your Blog (Part 2): Implementation of JSON Structure and Script Generation

1. Introduction

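At the end of the first article in this series (see Building a Lightweight Knowledge Index for Your Blog (Part 1): Structure Design and Construction Process), we had converged the whole knowledge index system into a clear structural model: from article retrieval and content organization to relationship building and result output, it forms a complete data processing chain.

However, that description still stays at the "structural level": it answers "which parts make up the system", not "how those parts actually run".

So in this article, we take the structural design from the previous one and turn it into a script system that can actually run. It should be said, though, that this step is not merely "translating the process into code". More precisely, it is a reorganization from the execution point of view: we no longer understand the system in terms of "functional stages", but organize the whole implementation around the "function call chain" and the "data flow".

From this point of view, the stages defined in the previous article (data acquisition, structure organization, semantic generation, and relationship computation) map naturally onto several core steps in the script. They are no longer just conceptual divisions; each is carried out in turn, as a concrete function, along the execution chain.

It should be emphasized that the goal of this implementation is not sophisticated engineering or heavy optimization, but one thing above all: the whole process must run through stably and clearly. The implementation therefore favors direct, readable, easy-to-debug structures and avoids introducing complex mechanisms prematurely.

In a sense, what this article really accomplishes is a conversion "from structure to execution": turning a system that holds together logically into a data processing flow that can run continuously.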

2. Script runtime environment and initialization

Before delving into the script logic, it's crucial to understand that the semantic indexing script's runtime environment doesn't rely on complex frameworks or special runtimes. Its entire execution foundation is essentially a standard Linux Python environment. Therefore, the goal of this step isn't to "build the system," but rather to consolidate the production environment from a "usable operating system state" into a "stable runtime environment for executable Python scripts."

This example uses Debian Linux (Debian 11), a typical server environment characterized by its stable structure, clear dependencies, and default provision of basic Python runtime capabilities.

1. Python runtime environment verification

In most Debian 11 base installations, the system already comes with a Python 3.x interpreter. First, you need to confirm that your current environment has the necessary Python execution capabilities:

python3 --version

If the output looks like Python 3.9.x, the Python runtime already exists and you can proceed directly to the next step.

2. Package management tool (pip) confirmation

Python scripts rely on the pip tool for dependency management, so you need to verify that pip is available.

python3 -m pip --version

If the system prompts that pip does not exist, you need to install it. In Debian systems, this can be done through the official package manager.

apt update
apt install python3-pip -y

After installation, double-check that pip is available.

3. Runtime dependency installation

This script has very few external dependencies. Currently it relies only on requests, for HTTP requests:

python3 -m pip install requests

After installation, you can verify that the environment works by simply importing the module:

python3 -c "import requests; print(requests.__version__)"

If no errors are reported, the dependency environment is ready.

4. Minimum validation of the operating environment

Before proceeding with the full script execution, two key capabilities need to be confirmed:

  • Python can execute scripts normally.
  • HTTP requests can access the WordPress REST API.

Verification can be performed with a minimal call, assuming the local WordPress address is http://127.0.0.1/ (in actual use, replace it with the address of your own deployment):

python3 -c "import requests; print(requests.get('http://127.0.0.1/wp-json').status_code)"

If an HTTP status code (such as 200 or 301/302) is returned, it means that the local WordPress API is accessible.

image.png

5. Environment convergence

After completing the above steps, the production environment has effectively converged from a "general-purpose Linux system" into an execution environment with the following capabilities:

  • Python 3 runtime available
  • pip dependency management is available
  • HTTP request capability is normal
  • The local WordPress API is accessible.

In other words, the system has met all the prerequisites for running the semantic indexing script. After this stage, the environment initialization task is complete, and what follows is no longer "preparation work," but rather the step-by-step debugging and layer-by-layer construction process of the script itself.
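The local checks above can also be bundled into one small preflight script. The following is only a minimal sketch: the function name `preflight` and the minimum version `(3, 7)` are assumptions made for illustration, and it covers the interpreter and module checks only, not the HTTP check.

```python
import importlib.util
import sys


def preflight(min_version=(3, 7), modules=("requests",)):
    """Report whether the interpreter and the required modules are available."""
    # min_version is an illustrative assumption, not a requirement from the article
    report = {"python_ok": sys.version_info >= min_version}
    for name in modules:
        # find_spec returns None when a top-level module cannot be found
        report[name] = importlib.util.find_spec(name) is not None
    return report


for check, ok in preflight().items():
    print(f"{check}: {'OK' if ok else 'MISSING'}")
```

Running it before the main script gives a quick yes/no answer for each prerequisite without touching WordPress.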

3. Construction of Semantic Indexes

3.1 Overall Script Structure

Before starting the specific implementation, you can first take a look at the overall structure of this script.

Functionally, this script is not complex. Essentially, it takes the stages broken down in Chapter 6 and strings them together sequentially to form a complete data processing flow. The difference is that this time, these steps are no longer just conceptual divisions, but are concretely organized into an executable piece of logic.

To express this more intuitively, the script can be roughly understood as having the following structure:

fetch_posts() → build_articles() → generate_semantic_info() → compute_related() → write_json()

These five steps correspond to different stages of the entire process:

First is fetch_posts(). This step retrieves the raw post data from WordPress. Its output is a set of post records that have not yet been processed and still retain the structure returned by the REST API.

Next is build_articles(), which organizes the raw data into a unified internal structure. In this stage, the basic fields of each article are extracted and arranged in the form required for subsequent processing.

On top of that, generate_semantic_info() is responsible for generating semantic information, such as summaries and keywords. After this step, each article has semantic fields that can be used in calculations.

Then comes compute_related(), the most crucial step in the entire process. Here, the script compares the articles based on the existing semantic information and generates the corresponding relationships.

Finally, write_json() gathers all processing results and outputs them as the final JSON file.

In terms of execution, these steps are executed strictly sequentially. The output of the previous stage directly serves as the input for the next stage until the final result is generated. The entire process does not depend on any external state, nor does it require intermediate persistence; all data is processed during script execution.

From a data perspective, this process can be understood as a gradual "structure enrichment": initially there is only the original content, then basic fields, semantic information, and relationships are gradually added, eventually forming a complete data record.

Because of this design approach, the script itself does not require complex control logic. It is more like a linear processing pipeline, with each step responsible for only one relatively independent task, while the overall complexity is distributed across various stages.
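To make the "linear pipeline" idea concrete, here is a small sketch in which each stage is just a function that takes the data and returns an enriched copy. The helper `run_pipeline` and the two stand-in stages are hypothetical and exist only for illustration; the real stages are the five functions named above.

```python
from functools import reduce


def run_pipeline(initial, stages):
    """Feed the output of each stage into the next one, in order."""
    return reduce(lambda data, stage: stage(data), stages, initial)


# Stand-in stages (hypothetical): each one only adds its own field.
def add_summary(articles):
    return [{**a, "summary": a["title"].lower()} for a in articles]


def add_related(articles):
    return [{**a, "related": []} for a in articles]


result = run_pipeline([{"id": "1", "title": "Hello"}], [add_summary, add_related])
print(result)
```

Each stage sees only the output of the previous one, which is why no shared state or intermediate persistence is needed.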

In the following sections, we will follow this structure, starting with data acquisition, and gradually implement each step.

3.2 Retrieving article data (fetch_posts)

The first step in the script is to retrieve all post data from WordPress. This is implemented as a simple pagination request: starting from the first page, the REST API is called page by page until all posts have been retrieved. Since the API returns data in paginated form, the data from each page must be accumulated using a loop.

At the code level, this step can be implemented directly as a separate function:

import requests

BASE_URL = "http://127.0.0.1"


def fetch_posts(base_url):
    posts = []
    page = 1
    total_pages = None

    while True:
        url = f"{base_url}/wp-json/wp/v2/posts?page={page}&per_page=100"
        resp = requests.get(url)

        if resp.status_code != 200:
            raise Exception(f"Request failed at page {page}, status: {resp.status_code}")

        data = resp.json()

        # Get the total number of pages on the first request
        if total_pages is None:
            total_pages = int(resp.headers.get("X-WP-TotalPages", 1))
            print(f"Total pages: {total_pages}")

        posts.extend(data)
        print(f"Fetched page {page}, total posts: {len(posts)}")

        if page >= total_pages:
            break

        page += 1

    return posts


if __name__ == "__main__":
    posts = fetch_posts(BASE_URL)
    print(f"\nTotal posts fetched: {len(posts)}")

The logic here is quite straightforward:

  • page controls the current page number being requested.
  • Each request returns one page of articles.
  • Append the results to the posts list
  • The total number of pages (X-WP-TotalPages) is obtained in the first request and used as the termination condition for traversal.

After the function completes execution, `posts` contains all the post data for the current blog. Each item is the raw structure returned by the REST API, without any processing.

If we use a simplified structure to represent this, the data would look something like this:

{
    "id": 123,
    "title": { "rendered": "Article title" },
    "link": "http://127.0.0.1/post",
    "content": { "rendered": "HTML body" },
    "tags": [1, 2, 3]
}

As you can see, the data here still maintains the original structure of the WordPress REST API, with the fields scattered across different levels without any organization or processing.

At this point, the script has completed the first step of "data takeover": retrieving the complete collection of posts from WordPress. Subsequent processing will no longer rely on the WordPress API, but will instead perform structural transformations and semantic calculations based on this data.

In actual operation, the script retrieves article data page by page based on the returned pagination information, eventually converging into a complete set of articles. Taking my blog as an example, the final number of articles retrieved was 237.

image.png

This is consistent with the current number of blog posts:
image.png

This indicates that the pagination logic and termination conditions have been correctly implemented.
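The same termination logic can also be checked without a live WordPress instance. The sketch below mirrors the loop in fetch_posts but takes the page source as a parameter. The names `fetch_all` and `fake_get_page` are hypothetical, and the fake source simply serves 237 records at 100 per page.

```python
def fetch_all(get_page):
    """Collect items page by page; get_page(page) returns (items, total_pages)."""
    items, page, total_pages = [], 1, None
    while True:
        data, reported_total = get_page(page)
        if total_pages is None:
            total_pages = reported_total  # read once, on the first response
        items.extend(data)
        if page >= total_pages:
            break
        page += 1
    return items


# Fake source: 237 records served 100 per page (3 pages).
RECORDS = list(range(237))


def fake_get_page(page, per_page=100):
    start = (page - 1) * per_page
    total_pages = -(-len(RECORDS) // per_page)  # ceiling division
    return RECORDS[start:start + per_page], total_pages


print(len(fetch_all(fake_get_page)))
```

If the loop stopped one page early or requested one page too many, the collected list would no longer match the source records.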

3.3 Building the basic structure (build_articles)

After obtaining the original article data, the next step is to organize this data into a unified internal structure.

The content returned by a REST API is intended for web page display, with different fields scattered across different levels. Performing further processing directly on top of this would add unnecessary complexity. Therefore, before proceeding with semantic processing, a structural "convergence" is necessary: extracting the fields that will be used later and organizing them into a uniform format.

This process can be accomplished using a simple conversion function:

def build_articles(posts):
    articles = []

    for post in posts:
        article = {
            "id": str(post.get("id", "")),
            "title": post.get("title", {}).get("rendered", ""),
            "url": post.get("link", ""),
            "content": post.get("content", {}).get("rendered", ""),
            "tags": post.get("tags", [])
        }
        articles.append(article)

    return articles

The process here is very straightforward: iterate through each record in the posts, extract the fields that will be used later, and reorganize them into a new data structure.

It is important to note that because the data structure returned by the WordPress REST API is nested, both default values and structural safety need to be considered when retrieving fields. For example, fields such as title and content are read through chained .get() calls with default values, to avoid exceptions caused by incomplete data.
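One detail of chained .get() calls is worth spelling out: the default in `post.get("title", {})` applies only when the key is absent. If a record ever carried the key with an explicit null value, the second .get() would be called on None and raise an AttributeError. The sketch below shows a slightly more defensive accessor; the helper name `rendered` is hypothetical, and whether such records occur on a given site is an assumption, not something the article reports.

```python
def rendered(post, field):
    """Return post[field]["rendered"], or "" when any level is missing or null."""
    value = post.get(field)
    if not isinstance(value, dict):
        return ""
    return value.get("rendered") or ""


complete = {"title": {"rendered": "Hello"}}
missing = {}
null_title = {"title": None}

print(repr(rendered(complete, "title")))
print(repr(rendered(missing, "title")))
print(repr(rendered(null_title, "title")))
```

For data that always has the expected shape, this behaves the same as the chained .get() form used in build_articles.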

After this transformation, the data format will undergo a significant change. In a simplified example, the structure would look something like this:

{
    "id": "123",
    "title": "\u6587\u7ae0\u6807\u9898",
    "url": "https:\/\/example.com\/post",
    "content": "<p>HTML body<\/p>",
    "tags": [
        1,
        2,
        3
    ]
}

As you can see, compared to the nested structure returned by the original WordPress, the data here has been "converged" into a single, unified field representation, with each post corresponding to an independent record. Information that was originally scattered across different levels has been extracted and standardized into the same structure.

Up to this point, the script has completed a structural reorganization: the data source remains WordPress, but the data format has been transformed into an internal unified model. This structure will not change the original content, but will serve as the standard input for subsequent semantic processing and relational calculations.

After the script containing the "fetch_posts" and "build_articles" sections completes execution, it will output an additional sample article to quickly verify the correctness of the structure transformation. This output is for debugging and verification only and does not represent the complete dataset. It is simply the first record extracted from the "articles" list (the latest post on the current blog) to check whether the fields have been standardized as expected, such as whether fields like id, title, url, and content have been correctly extracted from the nested structure of WordPress.

image.png

The sample article shown in the CLI above is essentially a structured snapshot of the processing results at this stage, used to confirm whether the build_articles stage was executed successfully.

3.4 Generate semantic information (generate_semantic_info)

After completing the basic structure, the next step is to supplement each article with semantic information that can participate in the computation. In the current implementation, this step mainly includes two parts: generating summaries and organizing keywords. These do not replace the original content, but are appended to the existing structure as a more compact expression.

The corresponding implementation can be written as a separate processing function:

import re
import html


def clean_html(raw_html):
    # Remove tags
    text = re.sub('<[^<]+?>', '', raw_html or "")

    # Unescape HTML entities (e.g. &nbsp;, which the next step collapses to a space)
    text = html.unescape(text)

    # Compress whitespace (multiple spaces/line breaks → one space)
    text = re.sub(r'\s+', ' ', text)

    return text.strip()


def generate_semantic_info(articles):
    for article in articles:
        content = article.get("content", "")

        # Clean HTML
        text = clean_html(content)

        # Generate summary (ensuring clean text)
        summary = f"{article['title']}\n{text[300:600]}"

        # Keywords (temporarily reusing tags)
        keywords = article.get("tags", [])

        article["summary"] = summary
        article["keywords"] = keywords

    return articles

The processing method at this stage is very straightforward: while traversing each article, perform basic text processing on the original content, extract semantic information that is easier to participate in the calculation, and then write these results back into the same data record.


The summary in the code above is not simply a section of text taken from the beginning of the main body. Since the beginning of my blog posts is usually used to introduce the problem, while the specific solutions often appear in the later parts, when generating the summary, I deliberately avoid the beginning area and prioritize the middle section to more accurately reflect the core semantics of the article. This is a more economical approach.

Of course, if the completeness of semantic expression matters more to you than the size increase caused by the summary field, you can also take a longer stretch of body text directly from the beginning of the article, for example:

summary = f"{article['title']}\n{text[:2000]}"

This approach further increases the size of semantic-index.json. However, since later features typically do not send the whole JSON file directly to the frontend but use it as base data for a backend or middleware, in many scenarios a moderately longer summary actually helps improve the accuracy of semantic expression.
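One consequence of the `text[300:600]` slice is that for cleaned text shorter than 300 characters the slice is empty, so the summary reduces to the title alone. A possible middle ground between the two variants is to fall back to the beginning of the text in that case. The helper below is only a sketch; the name `pick_excerpt` and its parameters are hypothetical.

```python
def pick_excerpt(text, start=300, length=300):
    """Take text[start:start+length]; fall back to the beginning for short texts."""
    middle = text[start:start + length]
    return middle if middle else text[:length]


short_text = "a" * 100
long_text = "a" * 300 + "b" * 400

print(len(pick_excerpt(short_text)))   # short text: taken from the beginning
print(pick_excerpt(long_text)[:5])     # long text: taken from the middle
```

With the default parameters, long articles get exactly the same slice as `text[300:600]`, and only short ones are treated differently.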

In addition, the current implementation directly reuses the tag ID returned by WordPress. The purpose of this approach is not to pursue the completeness of semantic expression, but to make a trade-off between implementation complexity and usability: at this stage, keywords exist only as a stable structured identifier to participate in the subsequent calculation process, rather than as the main source of semantics.

In other words, the keywords here do not serve the function of "explaining the content," but rather act as a low-cost structural signal to supplement the lack of other semantic information.

From a system design perspective, this actually means that the tag system is still retained, but its role has changed—it is no longer the sole basis for determining the relationship between articles, but rather serves as an auxiliary signal, participating in similarity calculations along with summary, title, and other content.

Therefore, the effectiveness of this approach does not depend entirely on the granularity of the tag system itself, although tag quality still affects the final result. If tag definitions are chaotic or inconsistent in granularity, their contribution to the overall score decreases accordingly, but this does not completely destroy the usability of the system.


The processing here involves several basic steps: removing HTML tags, restoring entity characters (e.g., `&nbsp;`), and compressing unnecessary whitespace. On this basis, the script takes a slice of the cleaned text as the summary; at the same time, it directly reuses the original article's tags as keywords for subsequent calculations.
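To see the three cleaning steps on a concrete input, the snippet below runs clean_html (copied from the implementation above) on a small, made-up HTML fragment.

```python
import html
import re


def clean_html(raw_html):
    text = re.sub('<[^<]+?>', '', raw_html or "")  # remove tags
    text = html.unescape(text)                     # &nbsp; -> U+00A0, &amp; -> &
    text = re.sub(r'\s+', ' ', text)               # collapse whitespace, incl. U+00A0
    return text.strip()


sample = "<p>Hello&nbsp;&amp; welcome</p>\n\n<p>to   the blog</p>"
cleaned = clean_html(sample)
print(cleaned)  # Hello & welcome to the blog
```

The non-breaking space produced by unescaping `&nbsp;` counts as whitespace for `\s` on Python 3 strings, so it is folded into a normal space together with the line breaks.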

After processing, each article will have new fields added to its original structure. A simplified example can be used to represent this data format:

{
    "id": "123",
    "title": "\u6587\u7ae0\u6807\u9898",
    "url": "https:\/\/example.com\/post",
    "content": "<p>HTML body<\/p>",
    "tags": [
        1,
        2,
        3
    ],
    "summary": "\u8fd9\u662f\u4ece\u6e05\u6d17\u540e\u7684\u6b63\u6587\u4e2d\u622a\u53d6\u7684\u4e00\u6bb5\u6587\u672c...",
    "keywords": [
        1,
        2,
        3
    ]
}

As we can see, the change here compared to the previous stage is not in the structure itself, but in the addition of new semantic fields on top of the original structure. These fields do not attempt to "understand" the content of the article, but rather provide a more compact and computable form of expression, allowing subsequent comparisons to be conducted independently of the full text.

Up to this point, each article has met the basic requirements for participating in relation calculation. The next step will be to construct the relationships between articles based on this semantic information.

After the current script (which consists of three parts: "fetch_posts", "build_articles", and "generate_semantic_info") completes execution, the script will also output a sample data to verify whether the semantic information has been generated as expected:

image.png

Unlike the previous stage, the focus here is no longer on the structure itself, but on whether the newly added semantic fields are usable. For example, this can be observed through the sample article in the CLI:

  • The summary field no longer contains HTML tags, but is instead cleaned and continuous text;
  • Entity characters (such as &nbsp;) in the original content have been restored;
  • Excess whitespace and line breaks are compressed, making the summary a readable piece of natural language;
  • The keywords field has been populated for subsequent association calculations.

This output can be understood as a snapshot of the results of the semantic processing stage: it does not represent the final output file, but is used to confirm whether the transformation from "content to semantic expression" has been completed.

Once the output at this stage is confirmed to be stable, the article data has met the basic conditions for participating in relation calculations.

3.5 Constructing Relationships (compute_related)

After generating summaries and keywords, each article has the basic semantic information needed for comparison. The next step is to calculate the relationships between articles based on these structured fields and write the results to the related field.

The overall implementation can be broken down into a simple calculation process: taking the current article as the reference, compare it with the other articles one by one; calculate a similarity score from multiple fields (such as title, summary, and tags); then sort the articles by score and select the most relevant ones as the result.

Unlike the previous stage, similarity calculation here no longer relies on a single field, but combines information from different sources for evaluation. Each field provides a different type of "semantic signal," which together determine the degree of similarity between articles.

In terms of implementation, we can first define some basic functions to handle the similarity calculation between text and sets:

import re


def tokenize(text):
    return set(re.findall(r'\w+', text.lower()))


def jaccard(set_a, set_b):
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)

Based on this, a multi-field similarity calculation function can be defined:

def compute_score(a, b):
    # tag similarity
    tags_a = set(a.get("keywords", []))
    tags_b = set(b.get("keywords", []))
    tag_score = jaccard(tags_a, tags_b)

    # summary similarity
    summary_a = tokenize(a.get("summary", ""))
    summary_b = tokenize(b.get("summary", ""))
    summary_score = jaccard(summary_a, summary_b)

    # title similarity
    title_a = tokenize(a.get("title", ""))
    title_b = tokenize(b.get("title", ""))
    title_score = jaccard(title_a, title_b)

    # weighted fusion
    score = (
        0.5 * summary_score +
        0.3 * tag_score +
        0.2 * title_score
    )

    return score

In this calculation, the fields play different roles: summary, as the primary expression of the content, provides the core semantic information; tags, as a supplement from the existing structure, provide a stable signal when the tag system is reasonably complete; title, as high-density text, strengthens the match in certain situations.
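A small worked example helps to see how the three signals combine. The two records below are made up for illustration; the functions are the ones defined above, repeated so that the snippet runs on its own.

```python
import re


def tokenize(text):
    return set(re.findall(r'\w+', text.lower()))


def jaccard(set_a, set_b):
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def compute_score(a, b):
    tag_score = jaccard(set(a.get("keywords", [])), set(b.get("keywords", [])))
    summary_score = jaccard(tokenize(a.get("summary", "")), tokenize(b.get("summary", "")))
    title_score = jaccard(tokenize(a.get("title", "")), tokenize(b.get("title", "")))
    return 0.5 * summary_score + 0.3 * tag_score + 0.2 * title_score


a = {"title": "Python JSON index",
     "summary": "build a json index with python",
     "keywords": [1, 2, 3]}
b = {"title": "JSON index output",
     "summary": "write the json index to disk",
     "keywords": [2, 3, 4]}

# tags:    {2, 3} shared out of {1, 2, 3, 4}           -> 2 / 4  = 0.5
# summary: {json, index} shared out of 10 distinct     -> 2 / 10 = 0.2
# title:   {json, index} shared out of 4 distinct      -> 2 / 4  = 0.5
# score:   0.5 * 0.2 + 0.3 * 0.5 + 0.2 * 0.5           = 0.35
print(round(compute_score(a, b), 3))
```

Because every component is a Jaccard value between 0 and 1 and the weights sum to 1, the fused score also stays between 0 and 1.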

Based on this, a corresponding list of related articles can be built for each article:

def compute_related(articles, top_n=5):
    for article in articles:
        related = []

        for other in articles:
            if article["id"] == other["id"]:
                continue

            score = compute_score(article, other)

            if score > 0:
                related.append({
                    "id": other["id"],
                    "score": round(score, 3)
                })

        # Sort by score and take the top N articles
        related.sort(key=lambda x: x["score"], reverse=True)
        article["related"] = related[:top_n]

    return articles

The processing logic here remains within a controllable and intuitive range:

  • For each article, iterate through all other articles.
  • Similarity scores are calculated based on multiple fields.
  • Filter out irrelevant items (score of 0).
  • Sort by score and keep the top few items.

After this process is completed, the data structure will change again. Each article will no longer just contain its own information, but will also include a set of related content. For example:

{
    "id": "123",
    "title": "Article Title",
    "url": "...",
    "summary": "...",
    "keywords": [1, 2, 3],
    "related": [
        { "id": "456", "score": 0.812 },
        { "id": "789", "score": 0.637 }
    ]
}

At this point, the data has transformed from an "independent collection of content" into a structured set of related records. Each article establishes connections with other content through its related field.

This step does not pursue complex semantic understanding, but rather, while maintaining simplicity and controllability, generates a stable and sortable set of association results through the combination of multiple signals. For the current layer of the structure, this processing is sufficient to support the subsequent recommendation and display logic.

The final output sample shows the structure of a complete article, including not only basic information (id, title, summary, keywords) but also a list of related articles obtained through semantic computation.

image.png

As you can see, each article is no longer an isolated data unit, but is embedded in a collection with a local relational structure. Each record in related represents a "neighboring node" of the current article in the semantic space.

Compared to methods that rely solely on tags, this approach introduces text-based similarity calculations, allowing the connections between articles to no longer depend entirely on the existing structure but rather to complement each other at the content level. This also means that even if the tagging system is not perfect, the system can still establish a certain degree of semantic relationships based on the text information.
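How much the text signal contributes depends on how the text is split into tokens. With the `\w+` pattern, an unbroken run of Chinese characters counts as a single token, so two Chinese summaries overlap only when whole phrases between punctuation marks are identical. One possible refinement, shown here purely as a sketch, is to split such runs into overlapping two-character pieces. The function `tokenize_mixed` is hypothetical, and the range U+4E00 to U+9FFF (the basic CJK Unified Ideographs block) is a simplifying assumption.

```python
import re

# Runs of CJK ideographs, or runs of other word characters (Latin letters, digits, ...)
_RUNS = re.compile(r'[\u4e00-\u9fff]+|[^\W\u4e00-\u9fff]+')


def tokenize_mixed(text):
    tokens = set()
    for run in _RUNS.findall(text.lower()):
        if '\u4e00' <= run[0] <= '\u9fff' and len(run) > 1:
            # overlapping two-character pieces for CJK runs
            tokens.update(run[i:i + 2] for i in range(len(run) - 1))
        else:
            tokens.add(run)
    return tokens


print(re.findall(r'\w+', "语义索引的构建"))   # one single token
print(sorted(tokenize_mixed("Python脚本生成")))
```

Such a tokenizer could be swapped in for tokenize without touching the rest of the scoring logic, since both return a set of strings.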

At this point, a preliminary semantic association structure based on content rather than tags has been built into the system.


It's important to note that the implementation in this section has a computational complexity of O(n²), meaning the computational cost increases quadratically with the number of articles. At the current blog size, this overhead is perfectly acceptable. However, if the data size expands further, methods such as inverted indexes or candidate set filtering could be considered to narrow the computational scope.
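As a rough illustration of candidate set filtering, the sketch below builds an inverted index from tag IDs to article IDs and uses it to limit the comparison to articles that share at least one tag. The function names are hypothetical. The trade-off should be stated clearly: with an index over tags only, pairs that overlap in text but share no tag are never scored, so summary or title tokens would have to be indexed as well to keep that signal.

```python
from collections import defaultdict


def build_tag_index(articles):
    """Map each keyword (tag ID) to the set of article IDs that carry it."""
    index = defaultdict(set)
    for article in articles:
        for tag in article.get("keywords", []):
            index[tag].add(article["id"])
    return index


def candidates_for(article, index):
    """IDs of articles sharing at least one keyword with `article`."""
    found = set()
    for tag in article.get("keywords", []):
        found |= index.get(tag, set())
    found.discard(article["id"])
    return found


articles = [
    {"id": "1", "keywords": [10, 20]},
    {"id": "2", "keywords": [20]},
    {"id": "3", "keywords": [30]},
]
index = build_tag_index(articles)
print(sorted(candidates_for(articles[0], index)))  # ['2']
```

Only the candidates returned for an article would then be passed to compute_score, instead of every other article in the list.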


3.6 Output JSON (write_json)

After the relationships are established, the data for each article has evolved from "independent structured records" into a complete unit containing semantic relationships. At this point, the articles data in memory is no longer just a temporary calculation result, but a data set that can be consumed directly by external systems.

The next step is to solidify this entire structure into a single JSON file, which will serve as the final output of the semantic indexing system. This process can be accomplished through a very straightforward serialization operation:

import json
import os


def write_json(articles, output_path=None):
    # Default output to the current script directory
    if output_path is None:
        base_dir = os.path.dirname(os.path.abspath(__file__))
        output_path = os.path.join(base_dir, "semantic-index.json")

    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(articles, f, ensure_ascii=False, indent=2)

    print(f"JSON written to: {output_path}")

From an implementation perspective, this step itself is not complex; it essentially persists the in-memory data structure to disk. However, its significance in the entire process lies not in "writing the file" itself, but in the fact that the boundary of the semantic computation results is formally fixed.

Prior to this step, all steps (fetch, build, semantic, compute_related) occurred during the computation process, but at this step, the system output is converged into a stable data product.

The output JSON file structure is roughly as follows:

[
    {
        "id": "123",
        "title": "Article Title",
        "url": "https://example.com/post",
        "summary": "...",
        "keywords": [1, 2, 3],
        "related": [
            { "id": "456", "score": 0.812 },
            { "id": "789", "score": 0.635 }
        ]
    }
]

As you can see, each article already contains a complete information hierarchy:

  • Basic information (id / title / url)
  • Semantic compression result (summary)
  • Structural information (keywords)
  • Related information

In essence, this file is no longer a "list of articles" but a static representation of a local semantic graph structure obtained by computing over the content.
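To illustrate how an external consumer might read this file, the sketch below loads the JSON, keys the records by ID, and resolves the related entries of one article to titles. The function names `load_index` and `related_titles` are hypothetical, and the demo writes a tiny sample file to a temporary directory rather than using a real semantic-index.json.

```python
import json
import os
import tempfile


def load_index(path):
    """Load the JSON file and key the records by article ID."""
    with open(path, encoding="utf-8") as f:
        return {article["id"]: article for article in json.load(f)}


def related_titles(index, article_id):
    """Resolve the related IDs of one article to (title, score) pairs."""
    pairs = []
    for item in index[article_id].get("related", []):
        target = index.get(item["id"])
        if target is not None:  # skip IDs that are missing from the file
            pairs.append((target["title"], item["score"]))
    return pairs


# Demo with a throwaway file shaped like the structure above.
sample = [
    {"id": "123", "title": "First", "related": [{"id": "456", "score": 0.812}]},
    {"id": "456", "title": "Second", "related": []},
]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "semantic-index.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f, ensure_ascii=False)
    index = load_index(path)

print(related_titles(index, "123"))  # [('Second', 0.812)]
```

Since related stores only IDs and scores, a consumer needs this kind of lookup step to display titles or URLs.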

3.7 Complete Script Example

After breaking it down in the previous sections, we have already mastered:

  1. Article retrieval: retrieve all articles from the WordPress API.
  2. Building structured data: organize them into a list of articles that includes ID, title, URL, content, and tags.
  3. Generating semantic information: clean the HTML and extract the summary and keywords.
  4. Calculating relationships: generate the related field based on title, summary, and tags.
  5. Outputting the JSON file.

In this section, I've integrated these steps into a complete, runnable script. After running it, you'll get a structured list of articles with relationships, written to a JSON file at a specified path, which can be used directly for front-end display or recommendation systems.

The features of this script are as follows:

  • Full process coverage: from retrieval to processing to output, everything is done with a single command.
  • Clear structure: each step is encapsulated in a function for easy modification and extension.
  • Good readability: the JSON file is written formatted, and Chinese characters are displayed correctly.
  • Easy to extend: more complex logic can be added to summary generation, keyword extraction, or similarity calculation without rewriting the entire process.

The complete Python script example is as follows:

import requests
import re
import html
import json
import os

# Modify according to your actual environment
BASE_URL = "http://127.0.0.1"


# -------------------------
# Step 1: Get articles
# -------------------------
def fetch_posts(base_url):
    posts = []
    page = 1
    total_pages = None

    while True:
        url = f"{base_url}/wp-json/wp/v2/posts?page={page}&per_page=100"
        # A timeout keeps the script from hanging forever on a stalled connection
        resp = requests.get(url, timeout=30)

        if resp.status_code != 200:
            raise Exception(f"Request failed at page {page}, status: {resp.status_code}")

        data = resp.json()

        # The total page count is reported in a response header
        if total_pages is None:
            total_pages = int(resp.headers.get("X-WP-TotalPages", 1))
            print(f"Total pages: {total_pages}")

        posts.extend(data)
        print(f"Fetched page {page}, total posts: {len(posts)}")

        if page >= total_pages:
            break
        page += 1

    return posts


# -------------------------
# Step 2: Build the basic article structure
# -------------------------
def build_articles(posts):
    articles = []
    for post in posts:
        article = {
            "id": str(post.get("id", "")),
            "title": post.get("title", {}).get("rendered", ""),
            "url": post.get("link", ""),
            "content": post.get("content", {}).get("rendered", ""),
            "tags": post.get("tags", [])
        }
        articles.append(article)
    return articles


# -------------------------
# Step 3: HTML cleaning
# -------------------------
def clean_html(raw_html):
    # Strip tags, decode HTML entities, then collapse whitespace
    text = re.sub('<[^<]+?>', '', raw_html or "")
    text = html.unescape(text)
    text = re.sub(r'\s+', ' ', text)
    return text.strip()


# -------------------------
# Step 4: Generate semantic information
# -------------------------
def generate_semantic_info(articles):
    for article in articles:
        content = article.get("content", "")
        text = clean_html(content)

        # Summary = title + a slice from the middle of the text
        summary = f"{article['title']}\n{text[300:600]}"
        keywords = article.get("tags", [])

        article["summary"] = summary
        article["keywords"] = keywords
    return articles


# -------------------------
# Step 5: Similarity tools
# -------------------------
def tokenize(text):
    return set(re.findall(r'\w+', text.lower()))


def jaccard(set_a, set_b):
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def compute_score(a, b):
    # tags
    tags_a = set(a.get("keywords", []))
    tags_b = set(b.get("keywords", []))
    tag_score = jaccard(tags_a, tags_b)

    # summary
    summary_a = tokenize(a.get("summary", ""))
    summary_b = tokenize(b.get("summary", ""))
    summary_score = jaccard(summary_a, summary_b)

    # title
    title_a = tokenize(a.get("title", ""))
    title_b = tokenize(b.get("title", ""))
    title_score = jaccard(title_a, title_b)

    # weighted fusion
    return (
        0.5 * summary_score +
        0.3 * tag_score +
        0.2 * title_score
    )


# -------------------------
# Step 6: Build associations
# -------------------------
def compute_related(articles, top_n=5):
    for article in articles:
        related = []
        for other in articles:
            if article["id"] == other["id"]:
                continue
            score = compute_score(article, other)
            if score > 0:
                related.append({
                    "id": other["id"],
                    "score": round(score, 3)
                })
        related.sort(key=lambda x: x["score"], reverse=True)
        article["related"] = related[:top_n]
    return articles


# -------------------------
# Step 7: Output JSON
# -------------------------
def write_json(articles, output_path=None):
    # The file is written to the path given by `output_path`. If it is not
    # specified, `semantic-index.json` is created in the directory that
    # contains this script.
    if output_path is None:
        base_dir = os.path.dirname(os.path.abspath(__file__))
        output_path = os.path.join(base_dir, "semantic-index.json")

    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(articles, f, ensure_ascii=False, indent=2)

    print(f"JSON written to: {output_path}")


# -------------------------
# Main flow
# -------------------------
if __name__ == "__main__":
    # Step 1: fetch posts
    posts = fetch_posts(BASE_URL)
    print(f"\nTotal posts fetched: {len(posts)}")

    # Step 2: build the basic structure
    articles = build_articles(posts)
    print(f"Total articles built: {len(articles)}")

    # Steps 3-4: clean HTML and generate semantic info
    articles = generate_semantic_info(articles)
    print("Semantic info generated.")

    # Steps 5-6: compute related articles
    articles = compute_related(articles)
    print("Related articles computed.")

    # Step 7: write JSON
    write_json(articles)

    # Sample check (skipped when no articles were fetched)
    if articles:
        print("\nSample article:")
        print({
            "id": articles[0]["id"],
            "title": articles[0]["title"],
            "summary": articles[0]["summary"],
            "keywords": articles[0]["keywords"],
            "related": articles[0]["related"]
        })
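
To make the weighted fusion in compute_score more tangible, here is a small worked example. It is only a sketch: tokenize and jaccard are re-declared so the snippet runs on its own, the 0.5 / 0.3 / 0.2 weights are the ones used in the script, and the two articles (titles, summaries, tag IDs) are made up purely for illustration.

```python
import re

def tokenize(text):
    # Same rule as the script: lowercase, then collect runs of word characters
    return set(re.findall(r"\w+", text.lower()))

def jaccard(set_a, set_b):
    # |A ∩ B| / |A ∪ B|, defined as 0.0 when either set is empty
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)

# Two made-up articles; the keywords stand in for WordPress tag IDs
a = {"title": "Pitch basics", "summary": "pitch and interval structure", "keywords": [1, 2]}
b = {"title": "Interval basics", "summary": "interval structure and notation", "keywords": [2, 3]}

summary_score = jaccard(tokenize(a["summary"]), tokenize(b["summary"]))  # 3 shared / 5 total
tag_score = jaccard(set(a["keywords"]), set(b["keywords"]))              # 1 shared / 3 total
title_score = jaccard(tokenize(a["title"]), tokenize(b["title"]))        # 1 shared / 3 total

# Weighted fusion with the same weights as the script
score = 0.5 * summary_score + 0.3 * tag_score + 0.2 * title_score
print(round(summary_score, 3), round(tag_score, 3), round(title_score, 3), round(score, 3))
```

For this toy pair the three partial scores are 0.6, 0.333 and 0.333, and the fused score rounds to 0.467. Because every component is a plain set overlap, it is easy to see which field contributed to a given score when a recommendation looks wrong.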

To make this more concrete: in the semantic-index.json file I actually generated, the related field for the article "The Awakening of Sound: Fundamentals (Part 3): Visualization of Interval Structure and Historical Conventions" is shown in the image below:

image.png

The image above shows the 5 articles most relevant to this one, sorted by similarity score from highest to lowest. Their IDs are 14026, 14017, 13767, 14048, and 13751, which correspond, in order, to:
1. "The Awakening of Sound: Fundamentals (Part 2): From Note Names to Simplified Musical Notation"
2. "The Awakening of Sound: Fundamentals (Part 1): Pitch Structure and Auditory Stability"
3. "The Awakening of Sound (Part 3): The Misconceptions and Scientific Interpretation of 'High Position'"
4. "The Awakening of Sound (Part 6): How to Determine Whether You Belong to the Low, Middle, or High Range?"
5. "The Awakening of Sound (Part 2): Misunderstood Sounds: The Illusion of High Notes and the Illusion of Low Notes"

As can be seen, the recommended articles are closely related to the topic of the source article, indicating that the accuracy of the related field is high. This also confirms that the script achieves its goal: from article acquisition and structured processing to semantic generation and relationship building, the whole pipeline produces a reliable semantic index that provides a directly usable data foundation for later recommendation or display features.
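
A check like this does not have to rely on a screenshot. The sketch below resolves the IDs in related back to titles, assuming the index has the structure produced by the script (a list of objects with id, title, and related). The helper name related_titles and the tiny sample index are hypothetical and exist only for illustration.

```python
def related_titles(articles, article_id):
    """Return (title, score) pairs for the related entries of one article."""
    by_id = {a["id"]: a for a in articles}
    article = by_id.get(str(article_id))  # the script stores IDs as strings
    if article is None:
        return []
    result = []
    for item in article.get("related", []):
        other = by_id.get(item["id"])
        if other is not None:
            result.append((other["title"], item["score"]))
    return result

# Tiny made-up index used only to demonstrate the lookup
sample = [
    {"id": "1", "title": "A", "related": [{"id": "2", "score": 0.4}, {"id": "3", "score": 0.1}]},
    {"id": "2", "title": "B", "related": []},
    {"id": "3", "title": "C", "related": []},
]

for title, score in related_titles(sample, 1):
    print(f"{score:.3f}  {title}")
```

With a real index, the sample list would be replaced by the result of json.load on semantic-index.json.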

3.8 Location and Updates of JSON Files

Once semantic-index.json has been generated, a practical problem remains: where to put the file, and how to keep it up to date afterwards.

By default, the script writes the file to the directory that contains the script itself, but in actual deployments the output location is usually adjusted to suit the runtime environment. Common practices include:

  • Upload it to object storage and serve it through a CDN.
  • Place it in the website's static resource directory and deploy it along with the website.
  • Keep it as a local data file on the server and let the backend program read it as needed.

These methods differ mainly in access path, caching strategy, and update mechanism, but they all serve the same purpose: making sure the system can reliably access the JSON file.

Note that although semantic-index.json is a static file, its content is not permanent. Because the index is calculated from the current set of blog posts, it becomes stale and needs to be regenerated whenever a new post is published or the content of an existing post changes.

For personal blogs that are updated infrequently, the simplest approach is usually to run the script manually after publishing a new article. To automate this further, the script can be combined with cron jobs or CI/CD workflows so that the index is regenerated at fixed intervals and the old file is overwritten. From a maintenance perspective, this kind of JSON file is best treated as a "reproducible build artifact": it is never edited by hand, only recalculated by the script from the current content.
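
When the file is overwritten by a scheduled job, a reader may request it while it is only half written. One common way to reduce that risk is to write to a temporary file and then swap it into place. The sketch below is an assumption about how this could be done, not part of the original script: write_json_atomic is a hypothetical replacement for write_json, and the 0644 permission is only a guess at what a web server needs.

```python
import json
import os
import tempfile

def write_json_atomic(articles, output_path):
    """Write the index to a temp file, then swap it into place in one step."""
    directory = os.path.dirname(os.path.abspath(output_path))
    # Create the temp file in the target directory so the final rename
    # stays on the same filesystem
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(articles, f, ensure_ascii=False, indent=2)
        # mkstemp creates an owner-only file; relax it so other processes
        # (for example a web server) can read it. 0644 is an assumption.
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, output_path)  # replaces the old file in one step
    except BaseException:
        # Do not leave a stray temp file behind if anything fails
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise

# Demonstration in a throwaway directory: the second write replaces the first
with tempfile.TemporaryDirectory() as demo_dir:
    target = os.path.join(demo_dir, "semantic-index.json")
    write_json_atomic([{"id": "1", "title": "old"}], target)
    write_json_atomic([{"id": "1", "title": "new"}], target)
    with open(target, encoding="utf-8") as f:
        print(json.load(f))
```

Readers then see either the complete old file or the complete new one, which matters most when regeneration runs unattended.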


In addition, file size needs to be considered in practice. As the number of articles grows, semantic-index.json grows with it. For example, the index file generated from my current blog content has already reached approximately 6MB.

image.png

Loading the entire file directly on the front end not only increases network transmission overhead but also incurs additional browser parsing costs.
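
One possible way to reduce this cost is to publish a slimmer copy of the index for the front end. The sketch below assumes that the rendered HTML in the content field accounts for much of the file size (this has not been measured here) and that a related-articles widget only needs id, title, url, and related. The function name slim_index and the sample data are hypothetical.

```python
import json

def slim_index(articles, keep=("id", "title", "url", "related")):
    """Return a copy of the index that keeps only the listed fields."""
    return [{k: a[k] for k in keep if k in a} for a in articles]

# Made-up article with a large content field, for illustration only
sample = [{
    "id": "1",
    "title": "A",
    "url": "/a",
    "content": "<p>" + "x" * 5000 + "</p>",
    "summary": "A\n...",
    "keywords": [3],
    "related": [{"id": "2", "score": 0.4}],
}]

full = json.dumps(sample, ensure_ascii=False, indent=2)
# Compact separators drop the whitespace that indent=2 would add
slim = json.dumps(slim_index(sample), ensure_ascii=False, separators=(",", ":"))
print(len(full), "->", len(slim))
```

The full index can still be kept on the server for features that need summary or content, while the browser only downloads the slim copy.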


In practice, this script doesn't require many system resources. The entire process consists mainly of sequential requests, text processing, and simple similarity calculations. At the current data scale, it runs stably even on a typical VPS with 2 CPU cores and 2-3GB of RAM. For example, on my Chicago VPS with 3 cores and 4.5GB of RAM, the script runs from start to finish and produces semantic-index.json in about 20 seconds.

Therefore, for most personal blog scenarios, this solution has achieved a relatively ideal balance between deployment costs and operational overhead.

4. Feature extensions based on semantic-index.json

Once semantic-index.json has been generated, the blog's semantic indexing system has only just reached the "usable" stage. It's important to be clear that this JSON file is not the final implementation of any specific feature; it is an intermediate structure between "content" and "function" that supports all later semantic extensions.

In other words, generating semantic-index.json is essentially a way of pre-processing semantics: relationships that would otherwise have to be calculated dynamically at runtime are fixed in advance as a structured result. Later features typically only need to read, filter, and reorganize this result.

Therefore, from a system structure perspective, the entire process is actually closer to the following relationship:

WordPress original posts
↓
Semantic index generation script
↓
semantic-index.json
↓
Extraction and processing by each feature as needed
↓
Final output

This also means that the focus of subsequent extended functions is no longer "re-understanding the article", but "how to utilize existing semantic results".

For example, a related-article recommendation feature may only need the first few results in the related field; a content aggregation feature may care more about the relationship between keywords and articles; search, navigation, or AI summarization features may rely more on summary, title, or even the raw content field. Different features focus on different parts of the data, but they have one thing in common: they are all built on the same semantic-index.json.
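
As a rough illustration of "read, filter, and reorganize", the sketch below derives two views from the same index: the top few related IDs for a recommendation feature, and a keyword-to-articles grouping for a content aggregation feature. It assumes the field names produced by the script (id, keywords, related); the helper names top_related and group_by_keyword, and the sample data, are hypothetical.

```python
from collections import defaultdict

def top_related(article, n=3):
    """IDs of the first n related articles (already sorted by score)."""
    return [item["id"] for item in article.get("related", [])[:n]]

def group_by_keyword(articles):
    """Map each keyword (a WordPress tag ID) to the IDs of articles using it."""
    groups = defaultdict(list)
    for article in articles:
        for keyword in article.get("keywords", []):
            groups[keyword].append(article["id"])
    return dict(groups)

# Made-up index entries, for illustration only
sample = [
    {"id": "1", "keywords": [12, 7], "related": [{"id": "2", "score": 0.5}, {"id": "3", "score": 0.2}]},
    {"id": "2", "keywords": [12], "related": [{"id": "1", "score": 0.5}]},
    {"id": "3", "keywords": [], "related": []},
]

print(top_related(sample[0], n=1))
print(group_by_keyword(sample))
```

Neither helper recomputes any similarity; both only reshape what the generation script already stored.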

From this perspective, the real importance of this file lies not in the JSON format itself, but in the fact that it stores a set of pre-calculated semantic relationships. Because of this, later blog extensions typically do not need to iterate through the full text again, recalculate article similarities, or re-analyze content; they only need to read the existing data.

The core value of this approach is essentially an "offline pre-computation" concept—concentrating complex computations in the generation phase while minimizing operating costs during actual use. This is particularly important for personal blogs because it means that even without vector databases, embedding models, GPUs, or complex AI infrastructure, it is still possible to build a content system with "semantic capabilities."

Although this approach cannot compare with a true vector semantic system in terms of theoretical expressive power, it has a very obvious advantage in scenarios such as personal blogs: low cost, high controllability, easy maintenance, and it has already been able to solve a large number of real-world problems.

Often, a truly long-term, sustainable, and maintainable engineering solution has a higher practical value than the "theoretically most advanced" solution.

The current structure built on semantic-index.json is, in essence, exactly such a balance between complexity, cost, and effectiveness.

5. Summary and Outlook

Looking back, this implementation is not really about any single feature. It adds a new layer of capability to the blog: content originally written only for human readers now has a foundation for machine processing and understanding. The method used in this article is one engineering realization of that capability.

Currently, this semantic index uses a weighted calculation method based on keywords, titles, and summaries, which is essentially an engineered semantic approximation. It doesn't attempt to fully understand the language, but rather compresses and expresses the content at a controllable cost, establishing relationships between articles on this basis. This approach can be seen as a compromise or even a transitional solution: when the data scale is small or the tagging system is relatively stable, it can already provide sufficiently usable results while maintaining simplicity, interpretability, and ease of debugging.

In a sense, this is also the most important value of this approach: it doesn't pursue the "theoretically strongest" semantic capabilities, but rather tries to find a balance more suitable for personal blog scenarios. In many technical discussions, "semantic processing" almost naturally points to embedding, vector databases, and large model-related systems. However, for personal blogs, while such solutions are more powerful, they often mean higher complexity: requiring additional model runtime environments, more complex data storage structures, and ongoing maintenance costs.

The current approach actually offers an alternative: without introducing a heavyweight system, a semantic relationship network with practical value can be established through reasonable structural design and offline pre-computation mechanisms. The core advantage of this approach is not merely its "simplicity," but its ability to enter the "semantic content organization" stage at extremely low cost.

For many personal blogs, what they truly lack is not a state-of-the-art semantic model, but rather a foundational structure that allows content to connect with each other. This current solution precisely addresses this problem: it enables previously isolated articles to form relationships; it transforms the blog from a "time-stacked collection of content" into a content system with internal connectivity.

Of course, from a longer-term perspective, this is not the end. As the number of articles increases and the requirements for semantic precision rise, this rule-based and field-combination approach will eventually approach its limit. The more natural direction of evolution in the future will still be vectorization (embedding): mapping text to a higher-dimensional semantic space, measuring the relationships between content through distance rather than rules, thereby obtaining a more granular and generalizable semantic expression.
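
To give a feel for what "measuring relationships through distance" means, here is a minimal sketch of cosine similarity, one commonly used measure for comparing embedding vectors. It is not part of the current system. The three-dimensional vectors are made up for illustration; real embeddings come from a model and usually have hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Made-up vectors standing in for article embeddings
article_a = [0.9, 0.1, 0.0]
article_b = [0.8, 0.2, 0.1]
article_c = [0.0, 0.1, 0.9]

print(round(cosine_similarity(article_a, article_b), 3))
print(round(cosine_similarity(article_a, article_c), 3))
```

In such a setup, the score from cosine_similarity would take the place of the weighted Jaccard score, while the surrounding structure (articles, related, top_n) could stay the same.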

Even so, this current layer of structure will not lose its value. Regardless of whether embedding is introduced in the future, the article structure, semantic fields, relationships, and the indexing system itself remain crucial foundations of the entire system. In other words, the current implementation is not a "wrong path," but rather a sustainable, evolving intermediate layer: it can play a direct role in the current stage while also reserving sufficient space for future upgrades.

From the implementation process, this practice itself verifies one thing: in many scenarios, it is not necessarily necessary to introduce complex models or heavy systems from the outset. Through reasonable structural design and appropriate engineering methods, a sufficiently usable semantic approximation can be obtained at a lower cost and gradually evolve as needs grow.

This approach of "starting from usability and then gradually approaching higher semantic capabilities" is often more realistic and sustainable for personal projects than pursuing a complete large model system from the beginning.

And this is precisely the state that this attempt truly aims to achieve.


📚 Series: Building a "Lightweight Knowledge Index" for Your Blog (2 / 3)


All blog content is original; please credit the source when reprinting. The blog's RSS address is https://blog.tangwudi.com/feed and you are welcome to subscribe. If needed, you can also join the Telegram group to discuss problems together.