Building a Lightweight Knowledge Index for Your Blog (Part 1): Structure Design and Construction Process

1. Introduction

I've been thinking a lot lately about what features I could add to my blog's structure, such as the content structure hints at the bottom of article pages (see the article "From Content to Structure: A Low-Cost Approach to Blog Evolution").

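However, there are quite a few more advanced capabilities, such as automatically recommending related articles, aggregating content by topic, or even letting the blog gradually grow into a "knowledge network" structure, that cannot be achieved with simple tags, categories, or even the structural hints mentioned above.

The reason is that these approaches share a common problem: they rely on human-defined structure rather than on relationships that arise from the text itself. Tags can be applied, but their granularity is coarse; categories can be assigned, but their dimensions are limited; and even the structural hints I am experimenting with are, in essence, still a mapping produced by human abstraction.

This is not to say that tags or categories have no value. On the contrary, they are an already existing "structural signal" and remain very effective in many scenarios. The problem is that as the blog grows, if the relationships between content depend entirely on these human-made structures, a very real problem gradually surfaces: they cannot evolve automatically.

To give a concrete example: when I write a new article, it may be related in some respects to several earlier pieces, but if I do not deliberately give it the same tags or place it in an existing structure, the article is effectively "isolated" within the blog's overall structure.

In other words, although the blog already "has content", it still lacks the ability to organize the relationships between that content automatically.

What I really want is a different state: once an article is written, it is not merely displayed but is also organized into the overall structure by the blog "itself", which knows which content it is close to, which topic it belongs to, and roughly where it sits in the whole body of content.

Once this capability is in place, many features that would otherwise need manual maintenance emerge naturally, such as automatically recommending related articles, aggregating by content rather than by tags, and, further down the line, knowledge graphs and AI question answering.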


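In many technical contexts, any mention of "content understanding" or "semantic relationships" immediately brings to mind vector representations (embeddings), a more expressive and more general way of modeling semantics.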

However, such solutions typically require the introduction of additional models, computational processes, and new data storage methods. While these costs are reasonable for large-scale systems, they may not be the most suitable starting point for a content-centric personal blog with a relatively limited scale.

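Looked at another way, if the goal is only to answer a more specific question, namely "which articles are closer to each other", then whether a full vector semantic representation is really necessary is open to debate.

In this article, I do not introduce embeddings directly. Instead, I choose a more engineering-oriented and lightweight approach: build a compressed representation of the content, and construct the relationships between articles on top of it.

It is important to emphasize that this approach does not abandon the existing tag system. Rather, it introduces a content-based representation alongside tags, so that relationships between articles depend less on human-defined structure.

From this angle, the approach is more like adding a "lightweight knowledge index" based on content relationships on top of the blog's existing structure: it does not aim for full semantic understanding, but lets content form, at low cost, an association structure that can be computed, organized, and referenced.

In other words, the goal here is not to pursue "the strongest semantic expressive power," but to build a content-association scheme that strikes a better balance between cost, complexity, and effectiveness. Tags can still exist as a structural signal, but they are no longer the only basis; the content itself also takes part in computing relationships.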


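So the question becomes clear: how do we build a structured index layer for the blog that can describe the relationships between content? This is exactly what I have been tinkering with lately: adding a Lightweight Knowledge Index to the blog's content.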

What needs to be done next can actually be summarized in one sentence:

Transform the original human-friendly text content into a machine-friendly structured representation.

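But the question is: how should this "structured representation" be designed? Should it keep enhancing the existing tag and category system? Introduce an entirely new data structure? Or simply rely on an off-the-shelf search or vectorization solution?

All of these look like feasible paths, but from the perspective of maintainability and the actual cost of implementation, the differences between them are very large.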

2. From "Feasible Paths" to an "Implementable Solution"

If we want to introduce a "structured representation" layer to a blog, what are some practical and feasible paths?

The most intuitive approach is to continue enhancing the existing tagging and categorization system: for example, refining tags, adding dimensions, or even artificially unifying some naming rules to make the "connections" between content more obvious.

But I soon discovered that this approach essentially only pushed the problem a step further—it still relied on manual maintenance, and this "correlation" was discrete and coarse-grained: it either existed or it didn't, making it difficult to express subtle differences like "somewhat related" or "highly related." Once the number of articles increased, the maintenance costs and inconsistencies of this approach would rapidly amplify.

What if we stop relying on tags and instead introduce a more complete content structure? For example, we could establish a clearer hierarchy for the articles, manually maintain "related articles" or "series articles," or even, as I tried before, add structural hints to the articles to give the content a clearer overall structure.

This approach is effective in certain situations—especially during the writing process, where it can indeed help clarify thoughts. However, the problem lies in its heavy reliance on human design: deciding which content is relevant, how it's connected, and at which level it's placed—essentially, the author still makes the judgments. Once the scale expands, or older articles need to be revised, this structure gradually becomes a continuous burden. In other words, this path merely builds "structural capability" on continuous human input, rather than endowing the system with this capability itself.

Taking it a step further, we can completely change our approach—instead of trying to define the structure ourselves, we can directly leverage existing search or vectorization capabilities to let the machine "understand" the relationships between content. For example, we can vectorize the article content, mapping the text into a semantic space, and then determine the similarity between content based on distance. In terms of capabilities, this type of solution is almost perfect: both full-text search and vector-based semantic similarity calculation can effectively solve the problem of "content association," and can even bring a more natural search experience.

However, the problems are equally obvious: this means introducing additional services, storage, and computing processes, and may even require some modifications to the existing blog architecture. For a content-driven blog with a relatively simple structure, such a solution, while powerful, is somewhat "overly cumbersome."

Reaching this point, a very real contradiction emerges: on the one hand, lightweight solutions (tags, manual structures) struggle to express true semantic relationships; on the other hand, powerful solutions (search, vectorization) introduce excessive complexity. What I need is a solution that falls somewhere in between—one that doesn't rely on extensive manual maintenance, can evolve naturally with the content, doesn't require a complex backend system, and can be implemented directly on the existing blog structure; and simultaneously, can express the "relationships between content" to a certain extent, providing a foundation for subsequent features.

From another perspective, this question can actually be rephrased as: Is there a way to add a layer of "machine-understandable" representation to the content without changing the overall architecture of the blog? If the answer is yes, then this representation itself should be lightweight, independent, and capable of being generated and consumed offline.

Following this line of thought, the problem has actually shifted from "which option to choose" to another, more specific question: if we consider "semantic representation" as an independent data product, in what form should it exist? In other words, we no longer care about "which system to use to implement it," but rather we need to first determine: is there a data representation method that is simple enough, universal enough, and can be directly consumed by the front end?

3. Choosing a Form for the Structured Representation

If we treat "semantic representation" as an independent data product, then the problem becomes very specific: in what form should this data be stored? Intuitively, this seems to be just a matter of "formatting," but when considered in conjunction with the constraints mentioned in Chapter 2, it is actually limited to a relatively narrow range.

First, this data needs to be directly consumed by the front-end. This means its representation should not rely on complex parsing processes, nor should it be bound to a specific runtime environment. Ideally, when the browser needs to use this data, it should be able to retrieve it like a regular resource and directly participate in the page logic, such as generating recommended content, enhancing structural hints, or serving as a data source for search. It's important to note that "direct use" here doesn't mean the front-end must load the complete data at once, but rather that the data structure itself should be inherently front-end friendly, allowing it to be loaded and used appropriately based on the specific scenario.

Secondly, this data should be generated during the build phase, rather than dynamically calculated when a user visits the site. This may seem like a minor implementation detail, but it has significant implications. If the calculation is performed at request time, performance, caching, and overall architectural complexity will all increase rapidly. However, if it is generated offline, the data becomes more like a "build artifact" that can be deployed, cached, and distributed along with the page.

Furthermore, this data needs to possess a certain degree of independence. It shouldn't be attached to an existing database structure, nor should it be hidden behind a backend interface. Instead, it should exist as "explicitly" content—accessible, debuggable, replaceable, and even migrated between different environments. This is crucial for subsequent evolution.

When these conditions are considered together, they actually imply a fairly clear form: it should be a data file that can be directly loaded by the browser, generated during the build phase, and exists independently.

Up to this point, we haven't discussed any specific technology choices, but the "options" have actually been compressed quite a bit. Because once it's defined as a "file," it means a structured way of expressing it is needed; and once it needs to be used directly by the front-end, this way of expressing it must be natively supported by the browser; and if readability and debugging convenience are also taken into account, the format itself cannot be too obscure.

Under these constraints, some options can be naturally excluded: for example, custom serialization formats, while flexible, lack universality; some more compact encoding methods, while advantageous in size, introduce additional parsing costs; and some solutions that depend on specific runtime environments, which are difficult to meet the premise of "direct consumption by the front end".

In other words, when these constraints are layered upon each other, the problem is no longer "choosing one from many options," but rather that only a few ways of expressing things remain that won't feel awkward.

Following this line of thought, a very natural candidate emerges—JSON. Its advantage lies not in any particular outstanding capability, but in its high degree of fit with these constraints: it can be directly parsed and used by browsers, it is itself a way of expressing structured data, it can naturally support nested relationships, and it is almost the "default language" in the current front-end ecosystem, with both toolchains and usage habits being very mature.


Of course, this doesn't mean there are no other options. Some formats with similar structural expressive capabilities (such as XML, YAML, CSV, or even binary structures) can be used, but they either lack direct browser support, requiring additional parsing; or they are somewhat out of step with the current front-end ecosystem in terms of readability or usage habits; and some solutions, while having advantages in performance or size, seem somewhat excessive for the current scenario where "maintainability" and "simplicity of implementation" are prioritized.

Considering these trade-offs, JSON is not the only possibility, but it is the one that requires almost no further explanation and can be directly accepted—it has no obvious shortcomings and introduces very little additional complexity. In other words, the key here is not "I chose JSON," but rather that when "semantic representation" is defined as a data file generated offline and directly consumed by the front end, JSON is almost a natural choice.


Once this is determined, the focus of the problem shifts. The real question is not "whether to use JSON," but rather: how should we express the semantic information of an article in this JSON?

In other words, from this moment on, the problem has shifted from "form choice" to "data modeling." And this is the part of the entire semantic representation layer design that truly requires significant effort.

4. Structural Design of Semantic Representation

After deciding to use JSON as the carrier of "semantic representation", the question quickly became specific: what information should this JSON contain?

From the most intuitive perspective, the first things that come to mind are often basic fields such as title, links, and tags. However, if it's just this much content, there's no need to introduce an additional layer of structured representation, because this information already exists in the existing blog system.

This also means that the significance of this layer of JSON is not to repeat existing data, but to extract information that originally existed in the content but the system could not directly use—that is, the so-called "semantics".

In its most basic form, an article first needs to be uniquely identifiable and accessible; therefore, its initial structure is often very simple, containing only a title and links.

{ "title": "...", "url": "..." }

However, this structure remains at the "descriptive level," merely telling the system what it is without providing any "understanding" capability.

If you want to take things a step further, you usually start with a summary. This is because the main text is often too long to be processed directly, while a summary can significantly compress the expression while retaining the core information.

{ "title": "...", "url": "...", "summary": "..." }

Here, the summary field is not a replacement for the main text, but a semantic entry point that is more suitable for calculation and comparison, providing a unified and controllable input for subsequent processing.

Building on this, if one wishes to further abstract the article's theme, keywords will naturally be introduced. Keywords are essentially a discrete expression of content; they are more abstract than a summary, but also easier to use for classification and filtering.

{ "title": "...", "url": "...", "summary": "...", "keywords": ["...", "..."] }

Up to this point, this structure can describe an article quite completely. However, it is important to note that all of this still revolves around the "single article," answering "what is this article?" rather than "what is its relationship with other content?"

Once we start considering the "relationships between articles," a more fundamental question inevitably arises: how do we reliably point to an article in structured data?

The most direct approach is to use URLs, but URLs are essentially access paths that bind "identity" and "location" together. Once the path changes, the entire relationship system will be affected. Another approach is to use internal system IDs, such as primary keys in a database. However, while this method is stable, it depends on the specific implementation and lacks semantic readability.

In this case, introducing a separate identifier field, id, makes sense. It doesn't depend on the access path or the database structure; it simply exists as a stable unit of reference.

{ "id": "...", "title": "...", "url": "...", "summary": "...", "keywords": ["...", "..."] }

Once this id field appears, the nature of the structure begins to change. The article is no longer just an object to be described, but a node that can be referenced, and from this moment "relationships" can truly be expressed.

Building on this, if we further introduce a mechanism to describe "associations," the structure will naturally evolve as follows:

{
  "id": "current-article",
  "title": "...",
  "url": "...",
  "summary": "...",
  "keywords": ["...", "..."],
  "related": [
    { "id": "article-1", "score": 0.91 },
    { "id": "article-2", "score": 0.84 },
    { "id": "article-3", "score": 0.76 }
  ]
}

Here, the related field is not simply a "collection of related articles," but rather a set of links originating from the current article. Each item points to another article and expresses the strength of the connection through its score.

When this structure emerges, the perspective has actually shifted: the article is no longer an isolated unit of content, but a node in a network, and the related field defines the connections between nodes.

Overall, this structure implicitly represents a network of relationships between content. However, this network is not dynamically calculated at runtime, but rather determined during the build phase. In actual use, each article only needs to read the connections relevant to itself, without needing to recalculate the entire relationship structure.

In other words, the association is not "computed" at the time of access, but "solidified" at the generation stage.
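As a rough illustration of reading these pre-computed links, the sketch below looks up one article and lists its stored neighbors. It assumes the index is a JSON array of records shaped like the example above; the sample data, the array layout, and the function name related_titles are illustrative assumptions, not part of the original implementation.

```python
import json

# Hypothetical index content; a real build would load this from the generated file.
INDEX_JSON = """
[
  {"id": "a", "title": "A", "url": "/a",
   "related": [{"id": "c", "score": 0.76}, {"id": "b", "score": 0.91}]},
  {"id": "b", "title": "B", "url": "/b", "related": []},
  {"id": "c", "title": "C", "url": "/c", "related": []}
]
"""


def related_titles(index, article_id, limit=3):
    """Return (title, score) pairs for one article's stored links, strongest first."""
    by_id = {entry["id"]: entry for entry in index}
    links = by_id[article_id].get("related", [])
    ranked = sorted(links, key=lambda link: link["score"], reverse=True)
    # Links whose target is missing from the index are skipped rather than raising.
    return [
        (by_id[link["id"]]["title"], link["score"])
        for link in ranked[:limit]
        if link["id"] in by_id
    ]


index = json.loads(INDEX_JSON)
print(related_titles(index, "a"))
```

Nothing is recomputed here: the lookup only sorts and resolves links that were already stored.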

How the weights of these connections are obtained is not actually a problem that this layer needs to solve. For this current layer, what is more important is not "how the scores are calculated," but that this representation itself provides a foundation that can be directly used for ranking and recommendation.

From a more holistic perspective, the fields in this layer of structure can be naturally divided into two categories: one category comes from the blog system itself, which is existing structured data, such as titles and links; the other category comes from further extraction or calculation of the content, such as summaries, keywords, and relationships between articles. The former provides the basic information skeleton, while the latter supplements semantic capabilities; together, they constitute the complete form of this layer of semantic representation.
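One way to make this split visible is to write the record down as a typed structure. The following is only a sketch: the class names ArticleRecord and RelatedLink are assumptions introduced for illustration, while the field names follow the JSON shown earlier.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class RelatedLink:
    id: str       # identifier of the target article
    score: float  # strength of the connection


@dataclass
class ArticleRecord:
    # Category 1: taken from the blog system as-is.
    id: str
    title: str
    url: str
    # Category 2: extracted or calculated from the content during the build.
    summary: str = ""
    keywords: list[str] = field(default_factory=list)
    related: list[RelatedLink] = field(default_factory=list)


record = ArticleRecord(id="current-article", title="...", url="...")
record.related.append(RelatedLink(id="article-1", score=0.91))

# asdict() converts nested dataclasses too, so the result is ready for JSON output.
print(asdict(record))
```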

5. The process of generating semantic relations

Once this network of relationships is clearly defined, the next question becomes inevitable: how were these relationships constructed?

If we go back to the initial input, there are essentially only two things: the structured information of the article and the content of the article itself. The former is already structured and can be used directly; while the latter is the true source of the entire semantic layer. This also means that the JSON in Chapter 4 was not "directly generated," but rather underwent a transformation process from content to structure.

From a data source perspective, this process isn't complicated. Fields like titles and links can be directly obtained from the blog system; what truly needs processing is only a portion—the article's body content. And that's where the problem begins. The body content is text naturally designed for human reading; its information is continuous and unstructured. Using it directly for calculations is unlikely to be satisfactory in terms of either efficiency or effectiveness.

Therefore, before proceeding with relational construction, the first step is to transform this content into a more suitable form of expression. The most direct way is to compress the main text and extract a summary that represents the core content. This process does not aim for complete reproduction, but rather to preserve the "semantic density" as much as possible, making the text shorter and more focused, thus providing a stable input for subsequent processing. Based on the summary, keywords can be further abstracted. Compared to the continuous expression of a summary, keywords are a discrete semantic representation, breaking down the content into several labelable thematic units, making them more suitable for filtering and initial judgment.

With these two expressions in place, the calculation of relationships truly has a foundation. In the most intuitive way, the relevance between two articles can be judged by keyword overlap: the more keywords shared, the closer the themes usually are. This method is simple to implement and highly interpretable, but its limitations are equally obvious—keywords are discrete; they can express "whether they are related," but it's difficult to express "the degree of relevance."
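A minimal sketch of keyword overlap follows. The text does not say how overlap is scored, so this uses the Jaccard ratio (shared keywords divided by all distinct keywords) as one assumed option; the function name keyword_overlap is likewise illustrative.

```python
def keyword_overlap(keywords_a, keywords_b):
    """Jaccard overlap: shared keywords divided by all distinct keywords."""
    set_a = {k.lower() for k in keywords_a}
    set_b = {k.lower() for k in keywords_b}
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


# Two shared keywords out of four distinct ones.
print(keyword_overlap(["wordpress", "json", "index"], ["JSON", "index", "search"]))
```

With only a handful of keywords per article, the score can take very few distinct values, which is the coarseness described above.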

Therefore, to achieve a more nuanced understanding of relationships, a more continuous mode of expression is needed. A common approach is to calculate similarity directly based on the text, such as through word frequency or weights, converting the text into a comparable numerical representation. This method is more stable than using keywords and more easily reflects differences in "similarity."
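As one possible reading of "word frequency" similarity, the sketch below counts terms in two texts and compares the counts with cosine similarity. The tokenizer is a deliberate simplification that assumes space-separated, Latin-script text; other languages would need a proper word segmenter. The function names are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Count lowercase word tokens; a crude stand-in for a real tokenizer."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the two term-count vectors (0.0 to 1.0)."""
    va, vb = term_vector(text_a), term_vector(text_b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


print(cosine_similarity("building a json index", "a json index for search"))
```

Unlike keyword overlap, the result varies continuously with how much vocabulary the two texts share.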

Taking it a step further, we can map text into a unified semantic space. In this space, the distance between texts can be used to represent the degree of their correlation. Compared to the previous methods, this approach no longer relies on the consistency of specific words, but focuses more on the overall semantic proximity. However, regardless of the specific method used, they are essentially doing the same thing: the question of whether pieces of content are related to each other, which originally relied on human judgment, is transformed into a process that can be calculated.

Once this process is complete, each article will have a set of "other articles most closely related to it," along with the corresponding correlation strength. These results are precisely the source of the related field discussed in Chapter 4: those weighted reference relationships are not defined by humans, but obtained through calculation.

It's important to note that this process typically occurs offline. That is, it's not executed in real-time when a user visits the page, but rather completed all at once during the build phase. A new version is generated only when new articles are added or existing content changes.

One direct result of this approach is that the runtime system can be significantly simplified: the front-end doesn't need to participate in any calculations; it only needs to read the prepared data to complete sorting, recommendation, and other operations. Looking back from this perspective, the JSON in Chapter 4 isn't actually an "intermediate structure," but rather the final product of the entire process. It carries both the semantic information of individual articles and solidifies the relationships between them.

Therefore, the "semantic representation layer" here is not essentially introducing a complex online system, but rather adding an offline processing step: it extracts information that was originally scattered throughout the content, transforms it into structured data, and completes all the necessary calculations during the construction phase. From this perspective, this process is actually introducing an "offline computing capability" to the blog.

Once this process is complete, the blog itself undergoes a slight change: it is no longer just a collection of content organized by time or category, but has an additional layer of structure based on semantic relationships. This structure forms the basis for the various capabilities that follow.

However, it's important to note that what's described here is still only the "computation result itself," not how this process actually operates within the system. In other words, how this semantic information is constructed step by step and organized into a complete processing chain constitutes the problem that needs to be solved next.

6. Semantic Index Building Process

6.1 Overall Process Overview

In the previous chapters, we determined the final form of the semantic representation: an offline index file based on JSON. However, between design and implementation, a clear execution process is needed to actually build it. In my implementation, this process is condensed into a standalone script with a very straightforward function: to retrieve post data from WordPress, process it, and then generate the final JSON semantic index file.

Overall, this process can be abstracted into four consecutive stages:

  • The first step is to retrieve post data from WordPress;
  • The second step is to process and semantically transform the acquired article data;
  • The third step is to construct the relationships between articles based on semantic information;
  • The fourth step is to output the final result as a unified JSON structure.

These stages may seem very linear, but the key point is that each step is independent of WordPress's own operating mechanism; the data takeover and reconstruction are completed entirely within the script. In other words, WordPress here merely acts as a "data source" and no longer participates in any logical calculations.

The entire process does not depend on the blog's runtime environment, but is completed offline in a one-time calculation. Therefore, it is more like an independent data processing pipeline than part of the system's operation. In this pipeline, different stages assume different responsibilities and work together to complete the transformation from "content" to "structured semantics".
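The four stages can be pictured as a small function that simply chains them. This is a structural sketch only: build_index and its four parameters are hypothetical names, and the stand-in stages at the bottom exist just to make the example run.

```python
def build_index(fetch_posts, transform, relate, write):
    """Run the four stages in order; each stage is passed in as a plain function."""
    raw_posts = fetch_posts()                           # 1. retrieve post data
    records = [transform(post) for post in raw_posts]   # 2. process and transform
    records = relate(records)                           # 3. construct relationships
    write(records)                                      # 4. output the result
    return records


# Stand-in stages, only to show how the pieces connect.
output = []
demo_records = build_index(
    fetch_posts=lambda: [{"id": 1, "title": "First post"}],
    transform=lambda post: {"id": str(post["id"]), "title": post["title"]},
    relate=lambda records: [dict(r, related=[]) for r in records],
    write=output.append,
)
print(demo_records)
```

Keeping the stages as separate functions means any one of them can be swapped out without touching the others.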

If we break it down further from an engineering perspective, this pipeline is essentially doing one thing: reorganizing the content data scattered across the CMS into a data structure oriented towards "relational computing". The fields originally used for display (title, body, tags) are recombined here into inputs that can participate in calculations, while fields that did not exist originally (summary, keywords, associations) are gradually generated in this process.

For this reason, this script is not a simple data transfer tool, but more like a "builder": it is responsible for transforming the relatively loose content structure in WordPress into a data product with a stable structure and clear semantic boundaries.

In terms of execution, this build process is typically a batch task completed in one go. When articles are added or updated, the script only needs to be run again to regenerate a complete index file. Therefore, in terms of the operational model, it is closer to the "build phase" than the "service phase," which also determines that the complexity of the entire system can be kept at a very low level.

6.2 Retrieving Post Content from WordPress

The first step in the entire process is to obtain complete post data from WordPress. This step may seem simple, but it actually determines the input boundaries for all subsequent processing, so it is necessary to prioritize determining a stable and controllable data acquisition method.

Among the available options, RSS is the most intuitive choice, but it's not suitable for this layer of implementation. The reason is simple: RSS is designed for content distribution, not data processing. It typically only contains summary information, its fields are relatively fixed, making it difficult to obtain the complete text, and it offers little flexibility in controlling the returned data structure.

In contrast, WordPress's REST API is closer to being a "data interface." It allows direct access to complete article information, including fields such as title, links, body text, and tags, and supports features like pagination and filtering, making it more suitable as an input source for offline processing workflows.

In actual implementation, the interface used is:

/wp-json/wp/v2/posts

This API returns a list of articles by default, with each record containing the main fields of an article. The key fields include:

  • id: the unique identifier of an article in the system
  • title.rendered: the article title
  • link: the article access address
  • content.rendered: the article body (HTML format)
  • tags: a list of tag IDs

It's important to note that this API returns results in pages by default. This means that a single request cannot retrieve all articles; you must iterate through the pages using pagination parameters.

The per_page parameter controls the number of articles per page (maximum 100), and the page parameter specifies the page number. Therefore, a complete data retrieval process typically takes the form of a pagination loop: starting from the first page, subsequent pages are requested sequentially until all articles have been retrieved.

Two details need attention during this process. First, how to determine whether all data has been retrieved: one direct approach is to terminate the loop when the returned result is empty; a more robust approach is to read the X-WP-TotalPages response header to obtain the total number of pages in advance and define the traversal range. Second, the retrieved body content is usually in HTML format, which requires extra attention in subsequent processing. At this stage, it can be kept as raw data and cleaned and extracted in later steps.
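The pagination loop described above might look roughly like the following. It is a sketch using only the Python standard library, not the original script: the function names and the injectable get parameter are assumptions, and error handling, retries, and authentication are left out.

```python
import json
import urllib.request


def http_get_json(url):
    """Fetch one page; return (parsed JSON body, total page count from the header)."""
    with urllib.request.urlopen(url) as resp:
        total_pages = int(resp.headers.get("X-WP-TotalPages", "1"))
        return json.load(resp), total_pages


def fetch_all_posts(base_url, per_page=100, get=http_get_json):
    """Request pages 1..N in sequence, where N comes from X-WP-TotalPages."""
    posts, page, total_pages = [], 1, 1
    while page <= total_pages:
        url = f"{base_url}/wp-json/wp/v2/posts?per_page={per_page}&page={page}"
        batch, total_pages = get(url)
        if not batch:  # fallback stop condition: an empty page
            break
        posts.extend(batch)
        page += 1
    return posts


# Usage (placeholder domain, not executed here):
# posts = fetch_all_posts("https://example.com")
```

Passing the HTTP function in as a parameter keeps the loop itself easy to exercise without touching a live site.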

In this way, the script can acquire all the article data from the blog at once and use it as input for subsequent semantic processing. Up to this point, the entire process is still just "data acquisition" and has not yet involved any semantic-level computation, but it has already provided a complete and stable data foundation for all subsequent steps.

6.3 Processing and semantic transformation of the acquired article data

After obtaining the raw post data from WordPress, the next step is to organize this content into a unified structure and generate semantic information that can participate in subsequent calculations.

The core of this stage is not "adding data", but reorganizing existing content to transform it from "content for display" into "input that can be used for calculation".

Structurally, the first step is field alignment. The data retrieved via the REST API already contains information such as titles, links, body text, and tags, but their organization is still geared towards page display. Therefore, these fields need to be mapped to a more stable internal structure. For example, title.rendered is converted to a unified title field, link is used as the access address, and content.rendered is retained as the original content.
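A sketch of this mapping is shown below. The input keys are the REST API fields listed in section 6.2; the output key names (wp_id, content_html, tag_ids) and the function name align_fields are assumptions chosen for illustration. How the stable id from Chapter 4 is derived is left open here, so the WordPress id is only carried along.

```python
def align_fields(post):
    """Map one REST API post object to a flat internal record."""
    return {
        "wp_id": post["id"],                        # WordPress's own identifier
        "title": post["title"]["rendered"],
        "url": post["link"],
        "content_html": post["content"]["rendered"],  # kept raw, cleaned later
        "tag_ids": post.get("tags", []),
    }


sample_post = {
    "id": 42,
    "title": {"rendered": "Hello"},
    "link": "https://example.com/hello/",
    "content": {"rendered": "<p>Body</p>"},
    "tags": [3, 7],
}
print(align_fields(sample_post))
```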

Building upon this, the main content needs some processing. Since the main content returned by the REST API is typically in HTML format, it needs to be converted into plain text, a more suitable format for processing, before subsequent calculations. This process itself is not complex, and its goal is not to "perfectly reproduce the content," but rather to remove structural tags so that the text can serve as a consistent input for subsequent steps.
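For the HTML-to-text step, one dependency-free option is Python's built-in html.parser module. The sketch below is an assumed approach rather than the original implementation: it drops tags, skips script and style content, decodes character references, and collapses whitespace.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Strip tags and collapse whitespace; text nodes are joined with spaces."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


print(html_to_text("<p>Hello <b>world</b></p><script>x()</script>"))
```

Joining text nodes with a space can split a word that has inline markup in the middle of it, which is acceptable when the goal is a consistent input rather than a faithful rendering.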

After basic text cleaning, the next step is to generate a simplified representation of the article. The most direct way is to extract a summary from the main text that represents the main content. This process can employ very simple strategies, such as truncating the first part of the text or performing light compression. The key is not precision, but providing a semantic carrier with higher information density and more controllable length.
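The truncation strategy mentioned above can be sketched in a few lines. The 200-character default and the name make_summary are arbitrary assumptions; the function cuts at a word boundary when the text contains spaces and falls back to a hard cut otherwise.

```python
def make_summary(text, max_chars=200):
    """Return the first part of the text, cut at a word boundary where possible."""
    text = " ".join(text.split())  # normalize whitespace first
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    if " " in cut:
        # Drop the trailing partial word so the summary does not end mid-word.
        cut = cut.rsplit(" ", 1)[0]
    return cut + "..."


print(make_summary("one two three four", max_chars=9))
```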

Compared to summaries, keywords offer another form of expression. In this implementation, tags themselves are a natural starting point and can be used directly as keywords. Additionally, words can be extracted from the main text or summary as needed to enhance the expressive power.
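Because the posts endpoint returns tags as a list of IDs, using them as keywords requires a lookup from ID to tag name. The sketch below assumes such a lookup has already been built (for example from the /wp-json/wp/v2/tags endpoint); the function name and the sample mapping are illustrative.

```python
def keywords_from_tags(tag_ids, tag_names):
    """Map tag IDs to names, dropping unknown IDs and case-insensitive duplicates."""
    seen, keywords = set(), []
    for tag_id in tag_ids:
        name = tag_names.get(tag_id)
        if name and name.lower() not in seen:
            seen.add(name.lower())
            keywords.append(name)
    return keywords


# Hypothetical ID-to-name mapping for the example.
tag_names = {3: "WordPress", 7: "JSON"}
print(keywords_from_tags([3, 7, 99, 3], tag_names))
```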

After these processes, each article is no longer just a collection of original content, but has been transformed into a data record with a unified structure. It retains basic access information and also possesses semantic fields that can participate in calculations, thus providing a foundation for the next step of building relationships.

6.4 Constructing Relationships Between Articles Based on Semantic Information

In the previous section, each article was organized into a data record with a unified structure and semantic information that could be used for computation. Based on this, the next problem to be solved becomes very clear: how are these articles connected?

If the previous steps were about answering "what is each article?", then this stage is about answering "what is its relationship with other content?"

From an implementation perspective, the core idea of this step is actually not complicated: compare each article with all other articles and calculate their similarity based on semantic information. This "semantic information" mainly comes from the summaries and keywords generated in the previous stage, which provide two different forms of expression: continuous text and discrete features, respectively.

In practice, the entire process can be understood as a combination of traversal and comparison: taking one article as the current object, it is compared with every other article in the collection, and a numerical value representing "relevance" is calculated for each pair. This value does not need absolute precision; its significance lies in providing a sorting criterion, so that more relevant content is ranked higher.

Once all comparisons are complete, a set of candidate results can be generated for the current article. Next, these results are simply sorted according to their relevance, and the top-ranked subset is selected to obtain a stable set of related content.
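The traversal, scoring, and top-N selection described above can be sketched as follows. The text does not specify a scoring formula, so the weighted Jaccard overlap used here is an assumption, as are the field names id, summary, and keywords and the 0.7 keyword weight; only the related field comes from the structure discussed in this article.

```python
import re


def _tokens(text):
    """Split a summary into a set of lowercase word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a, b):
    """Overlap of two sets: |a ∩ b| / |a ∪ b|, or 0.0 if both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def relevance(x, y, keyword_weight=0.7):
    """Assumed scoring: weighted mix of keyword overlap and summary overlap."""
    kw = jaccard(set(x["keywords"]), set(y["keywords"]))
    sm = jaccard(_tokens(x["summary"]), _tokens(y["summary"]))
    return keyword_weight * kw + (1 - keyword_weight) * sm


def build_related(records, top_n=3):
    """Fill each record's 'related' field with the ids of its top_n neighbours."""
    for current in records:
        scored = []
        for other in records:
            if other["id"] == current["id"]:
                continue  # never relate an article to itself
            score = relevance(current, other)
            if score > 0:
                scored.append((score, other["id"]))
        # Highest score first; ties broken by id to keep the output stable.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        current["related"] = [article_id for _, article_id in scored[:top_n]]
    return records
```

The pairwise loop is quadratic in the number of articles, which is generally fine at personal-blog scale. Breaking ties by id is what makes repeated builds produce the same related list.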

It is during this process that the "related" field in the structure defined above is populated. Each article is no longer just an independent node; through these relationships, it forms an implicit network of connections with other content.

It's important to note that this approach doesn't aim for "perfect semantic understanding." At this stage, relevance calculation leans towards an engineered approximation: constructing a sufficiently reasonable set of associations based on available information. Compared to complex models or algorithms, this method is simpler, easier to control, and already meets the needs of most content recommendation scenarios.

Overall, this stage represents a leap from "single-point description" to "relationship network." The previous section constructed a set of structured data, while here, this data is further organized into a connected whole.

After completing this layer of construction, the remaining work becomes relatively straightforward: output this data, which already contains complete relational information, into a final unified JSON structure.

6.5 Outputting the Final Results as a Unified JSON Structure

In the previous section, the relationships between articles were calculated and organized into structured data records. At this stage, all semantic processing has been completed, and the remaining question is relatively straightforward: how can these results be saved in a form that can be used in practice?

From a processing perspective, all the preceding calculations operate on one article at a time. After each article is processed, a data record containing its basic information and relationships is generated. For practical use, however, these scattered results need to be aggregated into a single dataset.

In the current implementation, this step is handled very directly: the structured results of all articles are organized into an array and output as a single JSON file.
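A minimal sketch of this output step is shown below. The function name write_index and the write-then-rename pattern are assumptions added for illustration; the part taken from the text is simply that all records are serialized as one JSON array in one file.

```python
import json
from pathlib import Path


def write_index(records, path):
    """Hypothetical helper: write all article records as one JSON array."""
    target = Path(path)
    tmp = target.with_name(target.name + ".tmp")

    # ensure_ascii=False keeps non-ASCII titles and summaries readable.
    payload = json.dumps(records, ensure_ascii=False, indent=2)
    tmp.write_text(payload, encoding="utf-8")

    # Rename into place so readers never see a half-written file.
    tmp.replace(target)
    return target
```

Writing to a temporary file first is optional, but it fits the "generate once at build time, read at runtime" model: the published file is either the old version or the complete new one.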

The reason for choosing this approach, rather than continuing to rely on interfaces or introducing additional storage systems, is that this layer of data has a very clear use case. It is not a data source that requires frequent updates or dynamic queries, but rather it is generated once during the build phase and directly read at runtime. Therefore, fixing it as a static file is actually the simplest and most stable solution.

From a system perspective, this JSON file can be understood as the "final product" of the entire semantic representation layer. It contains both the basic information of each article and the relationships between articles, thus forming a complete content index.

In practical use, the front-end only needs to load this file and find the corresponding data record based on the current article's identifier to obtain relevant recommendations and other information, without needing to participate in any additional calculations. Because of this, all the complex processing steps mentioned earlier are ultimately reduced to a simple data read.
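The read side can be sketched in a few lines. It is shown here in Python purely for illustration; on a real page the same logic would run in the theme's front-end code. The function names and the id field are hypothetical, and the sketch assumes the file is a JSON array of records whose related field holds article identifiers.

```python
import json


def load_index(raw_json):
    """Parse the JSON array and key the records by article id."""
    records = json.loads(raw_json)
    return {record["id"]: record for record in records}


def related_for(index, article_id):
    """Return the full records of the articles related to article_id."""
    record = index.get(article_id)
    if record is None:
        return []
    # Skip ids that are missing from the index, e.g. deleted articles.
    return [index[rid] for rid in record.get("related", []) if rid in index]
```

No similarity calculation happens here: the lookup is a dictionary access followed by a list of further accesses, which is the "simple data read" the build step was designed to enable.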

Thus, from data acquisition and semantic processing to relationship building and result output, the entire process has formed a complete closed loop. The final form of this closed loop is this JSON structure that can be directly consumed.

7 Summary: From Structural Design to a Workable Semantic Index

Up to this point, the design of this semantic index has formed a complete closed loop: from obtaining article data from WordPress, through organizing the structure and generating semantic information, to building relationships, it finally converges into a JSON structure that can be consumed directly.

Structurally, this design already possesses a crucial capability: it is no longer just a "collection of articles", but an index structure that can describe the relationships between articles. Each article carries its own semantic information and is connected to other content through the "related" field, so that the content as a whole forms an implicit network.

In this process, we deliberately controlled complexity, keeping only the core structure and processing logic so that the entire solution can be implemented at low cost. This also means that it is not a complete engineering solution relying on a complex system, but rather a basic form that can actually be implemented.

However, when this structure actually enters the implementation stage, some more practical problems will also emerge, and these problems are often not given priority in the design stage.

For example:

  • How should data acquisition be organized? How can all articles be reliably traversed?
  • How should the generation of semantic information be handled specifically? What strategies are more suitable?
  • How should the relevance between articles be calculated and ranked?
  • How should the final generated JSON file be used by the blog system?

These issues do not change the rationality of the structure itself, but they determine how the solution behaves in a real-world environment. In other words, so far, we have completed a "deterministic design at the structural level"; the next challenge is "how to translate this design into a piece of code that can run stably." It is at this dividing line that the focus of the problem begins to shift: from "how to define the semantic structure" to "how to implement this structure in a specific environment."

These topics will be discussed further in the next article.


📚 Series: Building a "Lightweight Knowledge Index" for Your Blog (1 / 3)

