A solution for implementing a WordPress multi-active architecture (simplified version) in a personal blog.

1 Background

The reason I first thought about building a WordPress multi-active solution is simple: China Telecom cut off my home Internet connection for 3 days without any warning (see the article: Home Data Center Series: Talking about the options for building a personal blog website based on the current situation where the Internet was disconnected at home). At that time my blog's WordPress had only one node, in my home data center, so the blog went down and was unreachable. I urgently migrated the data to a Tencent Cloud lightweight server and restored access after about 2 hours, but I still remember how frustrating that period was. It was then that I made up my mind: I needed a real WordPress active-active architecture (at that point I was only thinking of active-active), so that even if the node at home has a problem, traffic can switch seamlessly to the backup node right away, and the embarrassment of "a perfectly good blog simply disappearing" never happens again.

However, at that time I had been building and maintaining the blog for less than a year. Much of my knowledge was still fragmentary, and the overall system was far from complete. So after all that trouble I had to settle for second best and first put together a half-baked disaster recovery plan to keep similar accidents from happening again (see the article: Home Data Center Series: Using Cloudflare Tunnel to automatically take over the disaster recovery site when the main WordPress site fails).

This disaster recovery solution seemed to solve the urgent problem at the time, but looking back it has a very obvious, almost fatal flaw: how can the Tencent Cloud node accurately and reliably determine that the Mac mini at home (the main site) has really failed? The approaches I could think of then were quite crude. One was to run a script on Tencent Cloud at regular intervals that requests a non-cached live page on https://blog.tangwudi.com to judge whether the home service is healthy. The other was to access the WordPress page on the Mac mini directly through its Tailscale intranet IP and see whether it loads normally.

However, both of these methods have obvious shortcomings:

  • The former is easily affected by external factors, including the network path itself, so the detection result may not reflect the actual availability of the site;
  • The latter checks the running state of WordPress on the Mac mini more directly, but it depends entirely on Tailscale at both ends. If Tailscale misbehaves on either side, the Tencent Cloud node may misjudge the situation and end up in the UP state at the same time as the main node.
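To make the weakness of both checks concrete, here is a minimal Python sketch of such a probe. The function name, the timeout, and the example address in the usage note are illustrative assumptions, not the original script. The key point is in the except branch: the probe returns False both when WordPress is really down and when only the path between the two machines is broken.

```python
import urllib.request


def page_loads(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return True only if the page answers with a 2xx status in time.

    `opener` is injectable so the function can be exercised without a network.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # URLError, HTTPError, timeouts and connection resets are all
        # OSError subclasses. From here the probe cannot tell
        # "WordPress is down" apart from "the link (or Tailscale) between
        # the two machines is down", which is the source of misjudgment.
        return False
```

A call such as page_loads("http://100.64.0.1/") (a made-up Tailscale-style address) would cover the second method; pointing it at a non-cached page of the public domain would cover the first.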

In addition, this disaster recovery switch is only a "passive takeover": it gives 1 active node plus 1 disaster recovery node, not true multi-active synchronization (at the time I had no idea how to solve WordPress database synchronization, nor how Cloudflare Tunnel's multiple connectors work). This pushed me to keep exploring active-active, and eventually dynamically expandable multi-active, WordPress architectures suited to a personal blog. The full transformation was completed after the VPS moved from Tencent Cloud to Racknerd (see the article: The second reconstruction of the home data center series blog architecture: service migration and active-active disaster recovery practice caused by VPS relocation), and it was later optimized further (see the article: The third optimization of the home data center blog architecture series: WordPress dual-active node adjustment and database import considerations).

Note: This article focuses on describing the architecture. For the technical implementation and configuration steps, please refer to the hands-on articles linked throughout the text; they are not repeated here, to keep the structure clear.

2 Technical challenges faced by WordPress multi-active solutions

However, if you really want to implement a dynamically scalable, truly multi-active WordPress architecture, it is far from being a matter of simply preparing a few more servers to back up each other and synchronize the database regularly.

The technical challenges it faces can be summarized from three levels:

image.png

1. Traffic scheduling: local access and disaster recovery switching

The first is how to intelligently direct user traffic to the closest and healthiest node. This involves the common global traffic scheduling solutions such as DNS round robin, GeoDNS, or F5 GTM, as well as the problem of health checks: how do you avoid sending user traffic to a node that is already down but not yet detected as such? How do you switch quickly after a node fails, and automatically return the node to the traffic pool once it recovers? These requirements sound ordinary, but they need very solid health monitoring and traffic switching mechanisms behind them.

2. Data synchronization: multi-node consistency and file resources

Then there is the more troublesome problem of multi-node data consistency. For WordPress this is not only a matter of synchronizing articles and comments at the database level. There is also the question of whether the media files uploaded by users should be synchronized to each node's local disk, or simply placed on object storage or a distributed file system.

Even if the database uses master-slave, dual-master, or the more advanced Galera Cluster, it is still necessary to solve the problems of write conflict, data delay, and split brain; even if all of these are done, caches such as Redis and Memcached must find a way to synchronize across regions, otherwise different nodes may still read inconsistent page states.

3. Write conflict and read-write separation

Next is the writing problem. In a multi-active scenario, if users initiate writing on different nodes (such as posting comments, registering, uploading files), it is easy to cause conflicts or version confusion. In traditional architectures, read-write separation is often used to avoid this situation: all write requests are fixed to the primary node, and other nodes are only responsible for reading, and delayed replication or eventual consistency is used to ensure that there is not too much confusion.


It all seems natural, but in fact each step needs many mechanisms working together. On an architecture diagram these things are just a few arrows, but if you really want to implement them one by one, almost every link costs far more effort than expected: cross-region file synchronization, distributed database consistency protocols, global DNS liveness detection and dynamic routing. This is also why most personal bloggers choose the single node + CDN approach to save trouble, and few go for a real multi-active architecture - it is too easy to dig a hole for yourself.


Fortunately, mine is just a simple personal blog (registration is closed, so I don't have to worry about writes of user data). Most requests are reads, and most writes are "controllable writes" (publishing or modifying articles on my own schedule). "Uncontrollable writes" are pitifully rare (someone occasionally leaves a comment). So on these three seemingly serious multi-active issues, I can actually make some bold compromises:

1. At the traffic scheduling level

For personal blogs, there is actually no need to pursue enterprise-level millisecond-level multi-dimensional health detection, automatic removal of failed nodes, and dynamic load weight adjustment. As long as the traffic can be reasonably distributed to multiple available nodes, the risk of single point failure has been greatly reduced.
If a node suddenly fails, it is completely acceptable for occasional requests to be assigned to the failed node and return errors.
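As a rough back-of-the-envelope illustration of why even a crude spread across nodes helps, the sketch below computes the chance that at least one node is up. It assumes node failures are independent (home broadband and a VPS in another country plausibly are), and the 0.99 figure in the usage note is a made-up number, not a measurement of my nodes.

```python
def combined_availability(per_node: float, nodes: int) -> float:
    """Probability that at least one of `nodes` independent nodes is up."""
    if not 0.0 <= per_node <= 1.0:
        raise ValueError("per_node must be between 0 and 1")
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    # All nodes are down at once with probability (1 - p) ** n,
    # so at least one is up with probability 1 - (1 - p) ** n.
    return 1.0 - (1.0 - per_node) ** nodes
```

With a hypothetical 0.99 per node, combined_availability(0.99, 2) comes out at about 0.9999: the second node removes most of the downtime even without any fine-grained health checking.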

2. Data synchronization and consistency

My site is mainly for reading articles, which are basically static content. Even if there is some delay in data synchronization between different nodes, it will not affect the normal access experience of most users. In this scenario, dynamic data such as comments may occasionally be out of sync for a period of time, which will not cause any substantial problems (maybe no one will even notice it~). As for media files, as long as they are consistent between nodes in advance, there is no need to use those complex distributed file or object storage solutions.

3. At the level of write conflict and read-write separation

If it is a scenario like e-commerce or social network, write conflicts and data consistency must be taken seriously, and usually rely on single-point write or distributed coordination and consistency solutions to ensure correctness. However, my blog has very few write requests, and there is almost no pressure of high-concurrency writes. Even if I occasionally operate in the background or have visitors post comments, it rarely triggers the real risk of conflicts, let alone affects the overall data integrity.

Therefore, these seemingly compromising practices of "lowering the technical standards of multi-active" are actually rational choices made based on the actual use scenarios of my personal blog: even the traffic scheduling, data consistency, and write conflict issues that need to be taken very seriously in professional architectures can be weakened and delayed to a certain extent. As long as I can avoid the blog being paralyzed due to a home Internet outage, it is enough for me.

What's more, by choosing the right combination of technologies, many seemingly missing parts can actually be compensated to a certain extent.

3 Multi-active traffic scheduling based on Cloudflare Tunnel

The traffic scheduling methods used by self-built multi-active data centers in traditional commercial environments (such as intelligent DNS resolution and global load balancing systems represented by F5 GTM) are obviously not suitable for distributing traffic among the multi-active nodes of a personal blog.
Not only are these solutions expensive and complex to configure, they also often rely on their own BGP Anycast and regional health detection, or require cooperation among multiple carriers to complete global routing optimization, which is simply out of reach for an individual webmaster.

Fortunately, the Tunnel provided by Cloudflare (formerly known as Argo Tunnel) naturally has multi-node ingress distribution and nearest-node scheduling capabilities. It detects availability on Cloudflare's global edge network for me and directs traffic to the node that is closest to the visitor and responds fastest. This not only greatly reduces the effort I would otherwise have to put into DNS, health checks, and line tuning, but also takes care of removing failed nodes.


It should be noted that although Cloudflare Tunnel provides multi-node ingress distribution and nearest-node scheduling for free, strictly speaking this is a feature intended for availability and failover, and it is only a best-effort mechanism. The official description is as follows:

image.png

The official document is: Tunnel availability and failover.

Its multi-active implementation simply lets you run connectors for the same tunnel on multiple servers at the same time, so that Cloudflare's edge network can forward requests to the most appropriate node based on the user's geographic location and network conditions. If a node fails or can no longer be reached, traffic automatically switches to the other available nodes (for more on the relationship between physical locations, tunnels, and connectors, see the article: Cloudflare tutorial series for home data centers (Part 9) Introduction to common Zero Trust functions and multi-scenario usage tutorials).

However, it should be made clear that this is not intelligent traffic scheduling in the traditional sense: there is no fine-grained control over traffic ratios, and no strict round-robin or hash-based distribution algorithm. Cloudflare's scheduling relies more on health detection between edge nodes and connectors and on a dynamic estimate of network distance. Most of the time it achieves reasonably sensible nearest-node access, but it is only "doing its best". For example, in terms of network latency the Chicago node is several times faster than my home data center. In theory all user requests should prefer the Chicago node, but in practice a small number of requests are occasionally sent back home. For large commercial systems that require strict millisecond-level optimization, this uncertainty is unacceptable, but it is completely tolerable for my personal blog.

So in general, this mode is more suitable as an entry guarantee in high-availability scenarios to avoid single point failures, rather than for accurate traffic load balancing. For me, as long as there are other nodes to take over the traffic when the network cable is unplugged at home or a server is down or there is a power outage, the blog can continue to be available to the outside world, which makes me satisfied.


A single Cloudflare Tunnel supports up to 25 active replicas (official documentation: https://developers.cloudflare.com/cloudflare-one/account-limits/):

image.png

So, theoretically, this multi-active architecture can support expansion to 25 nodes at most. There is still an upper limit, but it is enough for a personal blog.

4 Additional Discussion: Cloudflare Tunnel and “Fully Closed” Security Architecture

In the previous chapters, we mainly discussed how to use Cloudflare Tunnel to implement traffic scheduling in a multi-active architecture. But in fact, the benefits that Cloudflare Tunnel brings to me are not only the intelligent distribution of traffic, but also the fundamental change of the traditional security boundary thinking based on "public IP + port".

In the past, we relied on firewall rules to barely maintain server security: for example, open only 80/443 for HTTP(S), open 22 for SSH only to fixed IP ranges, and drop everything else. However, as long as these well-known ports are exposed on the Internet, they will inevitably be targeted by countless scanners - masscan, zmap, nmap, automated weak-password brute-forcing, all kinds of web vulnerability scanning scripts - knocking on your door almost every minute. Even with a well-configured firewall and Fail2Ban, the logs keep growing like a pile of garbage, full of connection probes, suspicious requests, and even DDoS floods. The worst part is that "being seen" itself reveals your existence: once an attacker detects that ports 22 and 443 are alive, you can be added to the pool for the next, more targeted round of attacks.

Cloudflare Tunnel almost eliminates this problem at the root. Its core mechanism is that the VPS actively initiates an encrypted persistent connection to the Cloudflare edge, without opening any port to the public network. This means that all unsolicited incoming traffic (TCP, UDP, ICMP) can be dropped directly on the VPS, making the machine effectively disappear from the Internet. All port scans, vulnerability probes, and automated attacks against your IP fail at the first step, because to them this machine does not exist at all (one more prerequisite: no DNS record may point to the VPS's public IP, otherwise it still leaks; the first step of any penetration test shows why).

In this way, the application entry point is published through the public hostname configured in the tunnel, and all security pressure is naturally transferred to Cloudflare. As long as WAF, rate limiting, bot management, and DDoS protection are configured in the dashboard, that is enough to withstand most threats (how well you use Cloudflare becomes the key here; readers who are not familiar with it can refer to: cloudflare learning map), and your VPS becomes a true "black hole node": with no entrance, there is naturally no way to break in:

image.png

It is precisely based on this security strategy that when I later encountered needs such as file synchronization, I completely abandoned the old idea of "publishing HTTP services through VPS public IP + port direct connection", leaving no opportunity for any crawlers or scanners on the Internet to detect website services.

Note 1: The ideal is appealing, but reality is harsher. In theory, such a fully closed strategy means even SSH operations can go through Tailscale's private network (Tailscale is also based on outbound connections, which fits this closed model well; I used it this way on the Tencent Cloud lightweight server before). However, cross-border Tailscale traffic was severely degraded in my scenario and was almost unusable. I had to settle for second best and open a high port for the SSH service (not port 22 itself), allowing only SSH public key authentication and disabling password login, to keep the exposure as small as possible while still being able to use scp, rsync, and other tools that transfer files over SSH.

Note 2: By default, Cloudflare Tunnel uses an outbound connection based on QUIC (UDP 443 + TLS), which has very recognizable traffic characteristics and is easily identified and interfered with in the domestic network environment. To be safer, it is recommended to switch it to HTTP/2 mode to reduce the risk of being targeted. For the specific steps, please refer to the article: Home Data Center Series From QUIC to HTTP2: Creating a More Private and Stable Cloudflare Tunnel Solution.

5 Lightweight data synchronization and consistency practice based on rsync (multi-active scenario)

5.1 Selection of Data Synchronization Solution

To build a multi-active architecture on the multi-node ingress distribution and nearest-node scheduling that Cloudflare Tunnel provides, as introduced in the previous chapter, there is a natural prerequisite: the content served by each node must be exactly the same. Otherwise, no matter how smart the traffic scheduling is, if users see inconsistent data on different nodes, multi-active loses its meaning.

In typical business scenarios, to achieve data consistency among multiple nodes, the following solutions are usually adopted:

  • Master-Slave Replication: For example, the most common master-slave structure of MySQL and PostgreSQL, single-point writing, multi-point reading, and real-time playback of binlog or WAL logs on remote nodes.
  • Bidirectional replication or multi-master (Master-Master): Both locations can be written, and additional conflict detection and resolution are required, such as Galera Cluster and MySQL Group Replication.
  • Distributed Databases (NewSQL/NoSQL): Such as CockroachDB, TiDB, and Cassandra, they ensure data consistency across nodes through distributed protocols such as Paxos/Raft, at the cost of more complex maintenance and higher network quality requirements.
  • File-level distributed storage: For example, CephFS and GlusterFS distribute files and metadata among multiple nodes through multiple copies and consistency protocols.

These are very common in large companies and multinational Internet projects. They can keep reads and writes from nodes in different geographic locations strictly consistent, but they come with a very significant prerequisite: they require reliable network quality and extremely low latency. Otherwise they lead to serious problems such as increased latency, lock timeouts, and even split brain.

Unfortunately, back to my scenario - my home data center is in China, and another node is in Racknerd's VPS in Chicago. The middle is all cross-border lines, with high network latency, large jitter, and frequent packet loss. It is simply unrealistic to use the enterprise-level distributed synchronization solutions mentioned above. The TCP connection alone may be reset at any time, not to mention the database cluster that relies on stable heartbeats and strict consistency.

So in the end I had to choose a more lightweight, loosely consistent approach: deploy an independent WordPress instance in both the home data center and the Chicago node, export the WordPress database on the home node, synchronize the dump file to the Chicago node, and import it there, so that the content is consistent most of the time (in fact, as long as I don't publish or modify articles and there are no new comments, it is fully consistent). This way, even when the cross-border line is bad, it never blocks normal reads and writes on the local node.

Note: Because there are currently few comments (sometimes not even one in a week), I still trigger the WordPress database export manually. If the number of comments grows, I will need to move to scheduled export and synchronization.
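The export and import steps can be sketched as follows. This is only an illustration under several assumptions: the database is MySQL or MariaDB, the mysqldump and mysql clients are installed directly on both hosts (with Docker the commands would need to be wrapped accordingly), credentials come from an option file, and the database name, SSH target, port, and paths are all hypothetical. The functions only build the commands so they can be inspected before being passed to subprocess.run.

```python
import shlex
from typing import List


def export_command(db_name: str, dump_path: str) -> List[str]:
    """Command that dumps the WordPress database on the home node."""
    return [
        "mysqldump",
        "--single-transaction",        # consistent InnoDB snapshot, no long table locks
        f"--result-file={dump_path}",  # write the dump to this file
        db_name,
    ]


def import_command(ssh_target: str, ssh_port: int,
                   db_name: str, remote_dump: str) -> List[str]:
    """Command that loads an already transferred dump on the remote node."""
    # The redirection has to happen on the remote side, so it is passed
    # to ssh as a single quoted shell string.
    remote = f"mysql {shlex.quote(db_name)} < {shlex.quote(remote_dump)}"
    return ["ssh", "-p", str(ssh_port), ssh_target, remote]
```

Between the two steps, the dump file still has to be transferred to the remote node.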

5.2 Why choose rsync as the synchronization tool?

As for the specific method of synchronizing database files, I have actually tried many solutions.

At first, I tried running Syncthing between my home data center and my Chicago VPS (for detailed configuration, see the article: Docker series: A detailed tutorial on how to synchronize multiple folders using Docker based on syncthing), using Tailscale's private IPs for point-to-point automatic synchronization. This solution works well within a LAN or within the country, with almost zero latency (this is how I used to synchronize database files with the Tencent Cloud server). However, once it crosses the border it becomes terrible: Tailscale traffic is severely degraded on the cross-border link, and even SSH or loading a web page is very slow, let alone transferring database files of tens or hundreds of megabytes.

However, because I had already designed a "fully closed" security architecture, I was not willing to expose a synchronization port for Syncthing on the public network, so I had to give up this idea regretfully.

Since I had already opened a high SSH port on the Chicago VPS for operations, I turned to the most primitive and worry-free solution: transferring the database file directly with SCP. Under normal circumstances this method is simple and easy to use, and combined with SSH public key authentication it needs almost no configuration (for the specific configuration of SSH public key authentication, see the article: Debian series configuration ssh public key login). However, once the cross-border line acts up, SCP's all-or-nothing, single-pass transfer becomes a fatal shortcoming. It often times out at some percentage of progress and has to start all over again, even if only a few MB were left.

Finally, I returned to rsync, which is more professional and better suited to unstable links: it supports resuming interrupted transfers, and the resume behavior can be tuned with options such as --partial and --append-verify, which keeps the transfer cost close to the minimum. Combined with a simple shell loop that automatically retries on failure, it quietly moves the data over bit by bit even when the line quality fluctuates, and it became my preferred choice for multi-active database synchronization (for details, see the article: Home Data Center Series: Using rsync to elegantly solve the problem of cross-border VPS file synchronization).
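The retry idea can be sketched like this. The original uses a shell loop; this version is written in Python for illustration only, and the timeout value, retry count, delay, port, and paths are assumptions rather than my real settings. The --partial and --append-verify options are the ones named above.

```python
import subprocess
import time
from typing import Callable, List


def build_rsync_command(src: str, dest: str, ssh_port: int) -> List[str]:
    return [
        "rsync",
        "--partial",         # keep a partially transferred file after a failure
        "--append-verify",   # resume by appending, then checksum the whole file
        "--timeout=60",      # abort if no data moves for 60 seconds
        "-e", f"ssh -p {ssh_port}",
        src,
        dest,
    ]


def run_command(cmd: List[str]) -> int:
    """Run the command and return its exit code (0 means success)."""
    return subprocess.run(cmd).returncode


def sync_with_retry(cmd: List[str],
                    run: Callable[[List[str]], int] = run_command,
                    max_attempts: int = 20,
                    delay_seconds: float = 10.0,
                    sleep: Callable[[float], None] = time.sleep) -> int:
    """Retry until rsync exits with 0; return the number of attempts used."""
    for attempt in range(1, max_attempts + 1):
        if run(cmd) == 0:
            return attempt
        if attempt < max_attempts:
            sleep(delay_seconds)  # give the line a moment to recover
    raise RuntimeError(f"rsync still failing after {max_attempts} attempts")
```

Because each retry resumes from the partial file instead of starting over, a flaky line costs extra attempts rather than extra full transfers.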

6 Primary write node and write conflict avoidance in multi-active architecture

In the previous chapter, the database dump was exported from the WordPress instance on the home data center node and synchronized to the WordPress instance on the Chicago node, which already fixes the home data center node as the primary write node. This is also what suits my daily management best. After all, I usually manage WordPress on the Mac mini directly through its intranet IP to publish and modify articles, and to approve and reply to comments.


In order to explain the mode of this master write node more clearly, let me add some details on the operation and maintenance level.

The Chicago node only opens one high-level SSH port and only allows public key authentication, so it is impossible to log in to the backend management directly with the public IP like at home (of course, you can also use SSH port forwarding, but I am too lazy to open a tunnel every time).

Although tailscale can be used to access WordPress in Chicago, the cross-border speed is too slow. It can only be used as an emergency measure and is not suitable for daily management.

Therefore, making WordPress in the home data center the only primary write node and then pushing the modified data to other nodes is the most convenient and easy way. The most important thing is that even if more nodes are added in the future, it is just a matter of adding a few more commands in the script. The script still only needs to be executed once to complete the synchronization, which will not significantly increase the operation and maintenance burden.


If it were just these "write" actions, there would be no such thing as write conflicts, because these are all controllable writes. The most critical "write" action is low-frequency but uncontrollable: comments. If we only rely on the multi-active traffic scheduling capability of Cloudflare Tunnel, the result is that there is a high probability that the comments will be written to the WordPress node in Chicago, and a small probability that they will be written to the WordPress node in the home data center. This is unacceptable for me who uses the home data center node as the main write node. So how can I ensure that the comments can be written to the home data center node?

The most direct approach in the early days was to force all user comment requests to be written to the master node's database, ensuring a single, centrally managed data source. Common ideas for achieving this include:

  1. Cross-domain writing solution: Configure CORS headers in WordPress on the master node to allow pages served by the other, read-only nodes to submit comments directly to the master node (such as comment.tangwudi.com) through JavaScript AJAX requests. This method is relatively simple to implement, but it runs into browser cross-domain restrictions, requires proper configuration of response headers such as Access-Control-Allow-Origin, and needs cookie scope issues to be handled.

  2. Reverse proxy write-back solution: In the web server of each read-only node, reverse-proxy the user's POST requests for admin-ajax.php or wp-comments-post.php to the master node. This method is transparent to the user, requires no front-end changes, and has no cross-domain issues, but in practice you need to watch details such as session persistence and cache bypass, otherwise it easily causes problems such as lost authentication or CSRF verification failures.

I have tried both methods. Both are too much hassle and full of traps, which really annoyed me.

In the end, I chose a solution that is both lightweight and best suited to my distributed network situation: using a Cloudflare Worker as a smart relay layer for comment writes.

By deploying Workers on Cloudflare's global edge nodes, I can intercept comment submission requests at the location closest to the visitor, and then have Workers proactively forward them to my backend nodes for actual writing.

There are several obvious advantages to doing this:

  • No front-end code needs to be modified (unlike CORS, which requires rewriting the Ajax submission logic).
  • It naturally sidesteps the browser's cross-domain security restrictions.
  • It can also add a layer of protection, auditing, and rate limiting at the edge.

In the Worker-based writing solution, it can be further divided into two different strategies:

1. Write only to the master node (home data center)

This is the most "classic" single-point writing method, which can ensure that all comment data is concentrated in the database of the home node, completely avoiding conflicts or data forks that may be caused by multi-point writing. However, the disadvantages are also obvious: first, if the home node is temporarily disconnected, the comment submission will fail directly, and the user experience will be very poor; second, before being synchronized to other nodes, if the visitor is assigned to the Chicago node by Cloudflare, he will not be able to see the comments just submitted immediately.

2. Write to all nodes at the same time

That is, after receiving a comment request, the Worker POSTs the same comment to both the home node and the Chicago node, so that multiple databases are written at the same time. This way, no matter which node a subsequent visit is dispatched to, the visitor sees the comment right away, which greatly reduces the visibility gap caused by the synchronization cycle.

However, it should be noted that this "homemade" multi-point write gives up the strict consistency of a single data source. In rare cases it may cause replies to comments to fail: for example, once caching is involved, the auto-increment comment IDs generated on different nodes can differ, so the ID in the cached page a visitor sees does not match the ID on the node that actually receives the reply. But for an extremely low-concurrency scenario such as my personal blog, this occasional inconsistency is completely acceptable in exchange for better instant visibility (and I control when I reply to comments, unless several visitors start chatting among themselves in the comment area~). Even if there are occasional duplicates or misplaced replies, that is far better than the comment function breaking outright because of a home Internet outage.
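The ID mismatch is easier to see with a toy model. The class below is not WordPress code; it is a hypothetical stand-in for each node's comment table with its own auto-increment counter, and the scenario (one comment reaching only one node) is just one made-up way the counters can drift apart.

```python
class CommentTable:
    """Toy stand-in for one node's comment table (not real WordPress code)."""

    def __init__(self):
        self.next_id = 1
        self.rows = {}

    def insert(self, text: str) -> int:
        comment_id = self.next_id
        self.rows[comment_id] = text
        self.next_id += 1
        return comment_id


def simulate_drift():
    home, chicago = CommentTable(), CommentTable()
    # A normal comment is written to both nodes and gets ID 1 on each.
    home.insert("first comment")
    chicago.insert("first comment")
    # Hypothetical incident: home is offline, this one lands on Chicago only.
    chicago.insert("posted while home was offline")
    # The next comment reaches both nodes again, but the counters have drifted.
    id_home = home.insert("hello")
    id_chicago = chicago.insert("hello")
    return home, chicago, id_home, id_chicago
```

After simulate_drift(), the same "hello" comment has ID 2 at home and ID 3 in Chicago. A reply form rendered by Chicago would reference parent 3, which does not exist at home, so the reply cannot be attached to the intended comment there.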

After weighing these options, I finally chose the second one: have the Cloudflare Worker write comments to all nodes simultaneously. For a personal blog where almost no one has "real-time conversations" in the comment section, this is the simplest and most direct high-availability solution (for the technical details of this part, see the article: Home Data Center Series: Using Cloudflare Worker to Solve the Comment Synchronization Problem in WordPress Multi-Active Architecture).
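The fan-out logic itself is small. The sketch below models it in Python purely to show the decision flow; a real Worker would be written in JavaScript or TypeScript, and the backend URLs, the injected send function, and the "any 2xx or 3xx counts as accepted" rule are assumptions for illustration, not the code from the linked article.

```python
from typing import Callable, Dict, List
from urllib.parse import urlsplit

COMMENT_PATH = "/wp-comments-post.php"


def is_comment_submission(method: str, url: str) -> bool:
    """Only POSTs to the WordPress comment endpoint are intercepted."""
    return method.upper() == "POST" and urlsplit(url).path == COMMENT_PATH


def fan_out_comment(body: bytes, backends: List[str],
                    send: Callable[[str, bytes], int]) -> Dict[str, int]:
    """POST the same comment body to every backend; map backend -> status."""
    results = {}
    for backend in backends:
        try:
            results[backend] = send(backend + COMMENT_PATH, body)
        except OSError:
            results[backend] = 0  # 0 marks a backend that could not be reached
    return results


def accepted(results: Dict[str, int]) -> bool:
    """The comment counts as accepted if at least one node took it."""
    return any(200 <= status < 400 for status in results.values())
```

With this shape, a home outage only turns one entry of the result into a failure; the visitor still gets a successful submission as long as any other node accepts the comment.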

7 Cloudflare APO: Transforming the Multi-Active Architecture

7.1 Limitations of the Existing Multi-Active Architecture

At this point, you may notice a very interesting problem: although the multi-active traffic scheduling based on Cloudflare Tunnel, combined with rsync synchronization and writes controlled through the primary write node, has achieved true multi-node availability, its essence has not actually departed from the traditional WordPress operating model. On every visit, the request still has to go back to some node to execute PHP + MySQL, generate the page dynamically, and return it to the user.

So it still depends on:

  • The PHP / Nginx / MySQL performance of each node.
  • Redis or Memcached for object caching.
  • Various classic local optimization plugins such as WP Rocket, Autoptimize, and Query Monitor.
  • Even the node not failing, the hard disk not suddenly dying, the database not crashing, php-fpm not exhausting memory...

The most important thing is that under this architecture there is no real buffer zone:

  • Once a node has a PHP error, database downtime, or even a brief network fluctuation, Cloudflare Tunnel will still distribute user requests to this node, and users will directly encounter 502, 504, or 500 errors.
  • If there is a minor problem with the Chicago node, users in the eastern United States will be the first to be affected; if the home data center's network cable is unplugged, domestic users will feel it first.

In this mode, although the multi-active architecture improves availability and reduces the risk of single point failure, the user experience is still easily exposed directly at the node level without any buffer.

7.2 Cloudflare APO

7.2.1 What is Cloudflare APO?

When it comes to improving WordPress performance, most people first think of installing various cache plugins, such as W3 Total Cache, WP Super Cache, or using Redis/Memcached for object caching. These solutions can indeed allow a single server to handle more user requests, but they still run locally on the node and still rely on the node's CPU, memory, and IO performance to process requests.

Cloudflare APO fundamentally changes this idea, turning WordPress' local access mode into a cloud cache access mode. So, what exactly is Cloudflare APO?

Cloudflare APO (Automatic Platform Optimization) is a "smart edge full-site cache" solution launched by Cloudflare for WordPress (and various platforms that will be gradually supported in the future). Its core concept is very simple: instead of letting every visit reach your server, generate HTML dynamically with PHP + MySQL, and send it back, it is better to cache the complete page (HTML) directly on the Cloudflare node closest to the user.

The result of this is:

  • The vast majority of user requests are served directly from Cloudflare's 300+ data centers around the world and do not trigger a return to the origin at all.
  • Cloudflare will only fetch the page from your server on the first visit or after the cache expires.
  • What users get is a "complete page that has been generated at the edge", which is not only fast, but also reduces the pressure on your server.
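The three points above boil down to a cache lookup with an expiry time. The toy model below shows one edge location in Python; it is a simplification for intuition only, not how Cloudflare implements APO, and the `ttl` value and the `fetch_origin` callback are hypothetical.

```python
from typing import Callable, Dict, Tuple


class EdgeCache:
    """Toy model of one edge location caching full HTML pages with a TTL."""

    def __init__(self, fetch_origin: Callable[[str], str], ttl: float) -> None:
        self.fetch_origin = fetch_origin
        self.ttl = ttl
        self.store: Dict[str, Tuple[str, float]] = {}  # url -> (html, expires_at)

    def get(self, url: str, now: float) -> Tuple[str, str]:
        """Return (html, status) where status is "HIT" or "MISS"."""
        entry = self.store.get(url)
        if entry is not None and now < entry[1]:
            return entry[0], "HIT"       # answered at the edge, origin untouched
        html = self.fetch_origin(url)    # only here does PHP + MySQL actually run
        self.store[url] = (html, now + self.ttl)
        return html, "MISS"
```

With a TTL of 60 seconds, a request at t=0 is a MISS, a request at t=10 is a HIT, and a request at t=61 goes back to the origin again.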

7.2.2 Differences between Cloudflare APO and traditional CDN

Many people may ask: What is the difference between this and ordinary CDN? Isn't it possible to cache static resources all over the world?

There is actually a big difference:

  • Traditional CDN: usually only responses with appropriate Cache-Control or Expires headers are cached. Most websites add these headers only to JS, CSS, and images by default, so HTML pages are often not cached. To let a CDN cache HTML, you usually have to configure additional rules on the server or the CDN. Dynamic sites like WordPress do not generate cacheable response headers for HTML pages by default, so making them work well with a traditional CDN means manually installing a bunch of cache plug-ins and writing purge hook scripts, which is very troublesome.
  • Cloudflare APO: it is specially tailored for WordPress. Through the official Cloudflare plugin installed in WordPress, it senses in real time whether new articles have been published, comments have been approved, menus have been modified... As soon as there is a change, the cached copies on Cloudflare's edge nodes around the world are invalidated and the HTML page cache is refreshed automatically, truly achieving "full-site edge cache of dynamic content."

In other words, APO provides a dynamic full-site caching solution for WordPress, allowing your blog to deliver full pages directly from the edge almost anywhere in the world, without you having to worry about stale content or write complex custom purge logic.

To explain more clearly, after enabling APO, Cloudflare will:

  • Cache complete HTML pages across the entire site, not just images and scripts.
  • Monitor whether your site publishes or updates articles, receives comments, or changes its menu structure.
  • Once a site data update is detected, actively call the invalidation API in the background to tell Cloudflare's edge nodes across the entire network to clear the cache of the related pages.

This is much better than a generic Nginx FastCGI Cache or a generic CDN, because you don't need to write purge scripts manually; you don't need to install a bunch of caching plugins in WordPress (such as W3TC's page cache + cache invalidation hook); and for comments and new articles, APO can update all edge nodes almost in real time. Therefore APO, which costs $5/month, can be said to be one of the most worthwhile, no-brainer investments for WordPress users.
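To make the "automatic purge" part less abstract, here is a minimal sketch of the kind of request a purge hook would otherwise have to send by hand. It uses Cloudflare's public purge-by-URL endpoint (`POST /zones/{zone_id}/purge_cache` with a `files` list). Which pages to purge for a given change is my own simplified assumption (the post itself plus the home page), the zone ID and token are placeholders, and the function only builds the request without sending it.

```python
import json
import urllib.request
from typing import List

API_BASE = "https://api.cloudflare.com/client/v4"


def urls_to_purge(site: str, post_path: str) -> List[str]:
    """Assumed rule: updating one post invalidates the post page and the home page."""
    site = site.rstrip("/")
    return [site + post_path, site + "/"]


def build_purge_request(zone_id: str, token: str,
                        urls: List[str]) -> urllib.request.Request:
    """Build (but do not send) a purge-by-URL request for the given zone."""
    body = json.dumps({"files": urls}).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/zones/{zone_id}/purge_cache",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",   # placeholder API token
            "Content-Type": "application/json",
        },
    )
```

Passing the result to `urllib.request.urlopen` would actually send it. With APO this bookkeeping is handled for you, which is the whole point of the comparison above.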

Of course, you can also use Cloudflare Worker to manually implement a free "beggar version of APO" (for details, please refer to my previous article: Home Data Center Series Cloudflare Tutorial (VII) Introduction to CF Worker Functions and Practical Operation, Verification and Research on Related Technical Principles of Implementing "Beggar Version APO for WordPress" Function to Accelerate Website Access Based on Worker). However, the dynamic content caching and version management of that solution depend on a plug-in called "Cloudflare Page Cache". This plug-in has not been updated for many years, and Cloudflare's official GitHub page no longer provides it, so I don't know whether it still works properly. In addition, the solution depends on Cloudflare KV, and the free KV quota always makes me a little worried: even with very light usage, I am likely to receive a reminder email from Cloudflare saying "50% used", which is painful to see.

So, this kind of free DIY method is really unnecessary: WordPress is already running, so just spend $5 and let Cloudflare take care of cache hosting, active purge, and global acceleration, which saves far more worry and effort. And if you have already subscribed to Cloudflare Pro, APO is included, which means you don't have to pay the $5 at all, a really good deal for WordPress users.

Note: For a comparison of the functional differences between Free and Pro users, please refer to this article: Home Data Center Series Cloudflare Pro In-depth Experience: From Free to Pro, is it worth upgrading?

However, APO caching may stop working because of some unexpected factor (such as cookies). I ran into this once, and it took me a lot of time to troubleshoot. Friends who are interested can refer to the article: Home Data Center Series Cloudflare APO Cache Failure Analysis: Causes and Solutions for Cf-Cache-Status BYPASS.
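When troubleshooting this kind of problem, the first thing to look at is the Cf-Cache-Status response header. The helper below is a small sketch that turns that header into a hint; the wording of the hints is mine, and the cookie check is only a heuristic for the situation described above, not a complete diagnosis.

```python
from typing import Mapping


def diagnose_cache(headers: Mapping[str, str]) -> str:
    """Turn a response's Cf-Cache-Status header into a short human-readable hint."""
    # HTTP header names are case-insensitive, so normalise before looking up.
    lowered = {k.lower(): v for k, v in headers.items()}
    status = lowered.get("cf-cache-status", "").upper()
    if status == "HIT":
        return "served from the edge cache"
    if status in ("MISS", "EXPIRED"):
        return "fetched from the origin; the next request should be a HIT"
    if status == "BYPASS":
        hint = "cache bypassed"
        if "set-cookie" in lowered:
            hint += "; the response sets a cookie, which is a common cause"
        return hint
    if status == "DYNAMIC":
        return "not considered cacheable; HTML caching may not be active for this URL"
    return "no Cf-Cache-Status header or an unhandled value"
```

The headers themselves can be collected with something like `curl -sI` against a page of the blog and then passed in as a dictionary.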


In fact, many bloggers know that publishing a WordPress site directly on a VPS public IP is not only unsafe but also lacks any high-availability guarantee, yet they still do it in the end. Why? Because once CDN caching is involved, the complexity immediately increases tenfold, and you keep running into pitfalls and debugging. In comparison, although running "naked" has many hidden dangers, it is simple and direct.

This is precisely why WordPress (and even most dynamic CMS) has long been criticized for poor performance and weak security:

  • PHP + MySQL cannot withstand high concurrency and is prone to crashes.
  • Once the VPS is attacked (such as the XML-RPC amplification attack common in old versions of WordPress, or WP-JSON interface brute force enumeration), it is easy to get 502 errors.
  • Without a cache buffer layer, any minor failure will be noticed by users immediately.

In reality, small, low-traffic websites (personal blogs, portfolios) usually just use a public IP + Nginx + PHP-FPM and don't bother with anything more complicated. Once you pursue global acceleration, resistance to high concurrency, and multi-active, you fall into the pit of complex cache management. APO solves these common pitfalls in one go with "WordPress awareness + automatic purge + full HTML caching". For this reason, only those who have maintained a WordPress CDN cache by themselves can fully appreciate how much operation and maintenance effort APO saves.


7.2.3 Providing a Real “Buffer Zone” for Multi-Active Architecture

For scenarios like mine where a WordPress multi-active architecture is already in place, APO changes the model from "every regular request goes back to one of several origin nodes" to a response mode that relies primarily on the Cloudflare edge network, with back-to-origin requests to the WordPress nodes only as a supplement:

  • In a multi-active architecture without APO, user access will be distributed to different WordPress nodes by Cloudflare Tunnel, and the nodes will directly take on the task of generating pages with PHP + MySQL.
  • With APO, users hardly reach specific nodes at all. Roughly 99% of requests are answered by Cloudflare at the edge, and the nodes are only asked to generate pages after the cache expires.

This not only significantly improves performance but also greatly enhances availability: even if a node experiences a brief network fluctuation, a PHP error, or a temporarily unavailable database, users will hardly notice it, because Cloudflare has cached the complete HTML page on its global edge network. The multi-active architecture is then no longer just "multi-node availability"; it also gains a strong isolation layer that protects the user experience.

In other words, even if one of your nodes is already smoking and catching fire, visitors can still get the complete page directly from the nearest Cloudflare node without noticing any abnormality in the backend. This is the real value provided by APO under the multi-active architecture: even if there is a node failure, there are other nodes that can take over the return to the source and continue to provide fresh blood for the global edge cache.

Of course, if you only have a single-node architecture, the situation is different - once the source site fails completely, Cloudflare can continue to use the old cache to serve users for a short period of time, but it will soon display a prompt that "source site is unavailable" and the cache cannot be further updated, which means that there is no subsequent guarantee.

Note 1: Such an architecture also brings some new troubles, for example the comment ID inconsistency problem mentioned earlier. Since the origin is now several independent WordPress nodes, comments submitted by visitors at different times or from different locations may be written to the databases of different nodes in a different order, so the generated auto-increment IDs differ. This problem does not exist at all under a single-node architecture, but it is the norm in the current distributed back-to-origin multi-active environment. It is not caused by APO, but while APO accelerates access and shields node failures, it also makes the differences among nodes easier to expose, which has to be balanced through back-end data synchronization.
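A tiny simulation makes the ID divergence in Note 1 easy to see. Each node is reduced to an auto-increment counter; the node names and the arrival orders are made up for the example.

```python
from typing import Dict, List


class Node:
    """A WordPress node reduced to its comment table with an auto-increment ID."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.next_id = 1
        self.ids: Dict[str, int] = {}  # comment text -> assigned ID

    def insert(self, comment: str) -> int:
        self.ids[comment] = self.next_id
        self.next_id += 1
        return self.ids[comment]


def replay(node: Node, arrivals: List[str]) -> None:
    """Insert comments into a node in the order they happened to arrive there."""
    for c in arrivals:
        node.insert(c)


home, vps = Node("home"), Node("vps")
replay(home, ["A", "B"])   # home receives comment A first
replay(vps, ["B", "A"])    # the same two writes reach the VPS in the other order
mismatched = [c for c in home.ids if home.ids[c] != vps.ids[c]]
```

After these two replays, comment A has ID 1 at home but ID 2 on the VPS, so a reply that says "parent = 1" points at different comments depending on which node receives it.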

Note 2: In a multi-active architecture with APO enabled, you do not need as many nodes as a traditional setup would. Because most user requests are already answered directly from Cloudflare's edge network, the servers only handle back-to-origin requests on cache misses. The purpose of the multi-active architecture therefore shifts from sharing load and improving concurrency to a purer availability guarantee: one node per geographical location, such as my current setup of one at home and one overseas, is enough to deal with most risks, since it is very unlikely that both fail at the same time. Even if one of the nodes goes down or loses its network, Cloudflare can still dispatch the back-to-origin request to the other node, and users will not notice any backend problem. Compared with the days when you had to add machines, tune PHP, install Redis, and stack cache plug-ins everywhere just to withstand more traffic, this architecture is much easier to run, and the focus of operation and maintenance shifts entirely to keeping the nodes online rather than raising the limit of a single machine.
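The "another node takes over the back-to-origin request" behaviour in Note 2 can be pictured as a simple ordered failover. This is only a conceptual sketch: how Cloudflare Tunnel actually selects a connector is not described here, and the origin names and the `fetch` callback are hypothetical.

```python
from typing import Callable, List, Optional, Tuple


def fetch_with_failover(path: str,
                        origins: List[str],
                        fetch: Callable[[str, str], Tuple[int, str]]
                        ) -> Tuple[Optional[str], int, str]:
    """Try each origin in order; return (origin_used, status, body) of the first healthy reply."""
    last_status, last_body = 502, "all origins failed"
    for origin in origins:
        try:
            status, body = fetch(origin, path)
        except OSError:
            continue  # network-level failure: move on to the next node
        if status < 500:
            return origin, status, body
        last_status, last_body = status, body  # remember the most recent 5xx
    return None, last_status, last_body
```

With two nodes in different locations, the page only becomes unavailable for a cache miss when both the connection to the first node and the request to the second one fail.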

8 WordPress optimization ideas under multi-active architecture

8.1 Thoughts

In the previous chapters, the core technical structure has actually been built: multi-node + Cloudflare Tunnel's local scheduling, rsync's lightweight synchronization, and APO directly pushing HTML to the edge, making this WordPress multi-active architecture both distributed and high-performance.

However, from a technical perspective, this is just "usable". If you want to maintain long-term operations more elegantly, you can further reduce the weight, cloudify, and automate the multi-active scenarios based on this architecture.

8.2 Weight reduction: no longer requiring the burden of traditional local optimization

In the single-node era, WordPress optimization was almost a cliché:

  • Install Redis / Memcached locally for object caching
  • W3TC, WP Rocket, Autoptimize, various cache plugins are stacked layer by layer
  • Also be careful not to over-cache, which can cause confusion in comments, paging, and sessions.

In the current architecture, almost all access traffic is taken over by APO, and HTML pages are returned directly at the edge, with only a small number of comments written back to the source node. This means that Redis and various local cache plug-ins can be removed, and PHP and Nginx no longer need to be specially tuned for pressure resistance. You only need to ensure that php-fpm does not crash. WordPress has returned to its purest publishing system, with less burden on the machine and lower maintenance difficulty.

8.3 Cloudification: Don’t store data locally if it can be handed over to Cloudflare

Now that the site has been transformed into a multi-node one, many functions that previously relied heavily on local machines can also be cloud-based, such as:

In short, moving as much functionality as possible to the Cloudflare layer will keep the backend cleaner.

8.4 Automation and Observability: Less Worry about Operations and Maintenance

Multi-nodes sound great, but daily maintenance is obviously more troublesome than single-node: Which node died? Which synchronization failed? Are comments written normally?

It is obviously unrealistic to watch the logs manually. Here I mainly use the health checks and notifications provided by Cloudflare, combined with my own webhook (pushed to the phone via Bark; for technical details, please refer to the article: Docker series builds a message push server based on bark server), to make multi-node monitoring more automated.

  • Cloudflare Health Checks: This feature allows you to configure HTTP probes directly on Cloudflare, for example periodically visiting a specific page of the blog such as https://blog.tangwudi.com/xxx. If the probe fails several times in a row, an alarm is triggered.
  • Cloudflare Notifications: On the panel, you can configure health checks and various security events (WAF triggered, IP blocked, etc.) as event sources, and support sending alerts via email, Slack, Webhook, etc.
  • Webhook + Bark: I wrote a lightweight relay script to receive webhook notifications from Cloudflare Worker and forward them to Bark for iPhone notifications; some Cloudflare Notifications events can also be pushed directly to the phone through Bark.
  • Shell + Mail/Bark: The local rsync synchronization script also does simple error capture. For example, if the synchronization fails, it will automatically try again a few times. If there are consecutive failures, it will remind me via email or Bark.
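The "retry a few times, then notify" logic of the last bullet can be sketched as follows. My actual script is a shell script, so this Python version is only an equivalent illustration: the retry count, the delay, and the `notify` callback (which would send the mail or Bark push) are hypothetical, and `run` and `sleep` are injectable purely so the function can be tried without touching rsync.

```python
import subprocess
import time
from typing import Callable, List


def run_with_retry(cmd: List[str],
                   notify: Callable[[str], None],
                   attempts: int = 3,
                   delay: float = 30.0,
                   run: Callable[[List[str]], int] = lambda c: subprocess.run(c).returncode,
                   sleep: Callable[[float], None] = time.sleep) -> bool:
    """Run cmd until it exits 0; after `attempts` consecutive failures call notify()."""
    code = -1
    for i in range(1, attempts + 1):
        code = run(cmd)
        if code == 0:
            return True
        if i < attempts:
            sleep(delay)  # wait a little before trying again
    notify(f"sync failed {attempts} times in a row (last exit code {code}): {' '.join(cmd)}")
    return False
```

A call could look like `run_with_retry(["rsync", "-az", "--delete", SRC, DST], notify=send_push)`, where `SRC`, `DST`, and `send_push` are placeholders for your own paths and notification function.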

In this way, even though the multi-node architecture adds complexity, you hardly need to keep an eye on it during daily maintenance, and you can receive a push notification on your mobile phone as soon as something happens.
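What keeps the phone from buzzing on every brief hiccup is the "consecutive failures" rule. The class below is a minimal sketch of that rule, assuming a threshold of three failed probes; the threshold Cloudflare Health Checks uses is configurable and not specified in this article.

```python
class FailureStreak:
    """Raise an alert only after `threshold` consecutive failed probes."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.streak = 0
        self.alerting = False

    def record(self, ok: bool) -> str:
        """Feed one probe result; return "alert", "recovered" or "quiet"."""
        if ok:
            self.streak = 0
            if self.alerting:
                self.alerting = False
                return "recovered"
            return "quiet"
        self.streak += 1
        if self.streak >= self.threshold and not self.alerting:
            self.alerting = True
            return "alert"   # fire once, not on every further failure
        return "quiet"
```

A single failed probe followed by a success stays silent; only an unbroken run of failures produces one alert, and the first success afterwards produces one recovery message.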

For the specific implementation of this function, please refer to the article: Home Data Center Series Cloudflare Monitoring Alarm Combination Practice: Health Check + Event Alarm Brings Lightweight Operation and Maintenance Experience. It has also been mentioned in some previous articles, but in a somewhat scattered way. After all, operation and maintenance needs are everywhere.

9 The last insurance of multi-active architecture: static site backup

To be honest, for a small blog with little traffic, the Cloudflare Tunnel multi-active architecture plus APO global edge caching is already self-indulgent enough. Adding a pure static site on top as a last line of defense makes it sound like I am guaranteeing the continuous availability of some "core financial business system", which is a bit ridiculous.

But this is just my personal obsession - I have spent more than half a year perfecting this architecture bit by bit. Since I can do one more step, why not just prepare for the most extreme accidents?

So there is this step: use the Simply Static plugin to generate pure static pages for the entire WordPress site and deploy them to Cloudflare Pages. This gives you an ultimate life-saving solution that can still provide normal read access even if the database is completely down, PHP cannot run, and Cloudflare Tunnel is completely disconnected.

If all dynamic nodes unfortunately fail one day, the static version of the blog deployed on Cloudflare Pages (https://staticblog.tangwudi.com) can still take over and serve visitors. Even if you can't see the latest comments or post new ones, it's better than a bare 502.
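How the switch to the static copy is triggered is not covered here. Purely as an illustration, and assuming the static export keeps the same paths as the dynamic site, mapping a failing URL to its static counterpart could look like this (the two hostnames are the ones used in this article; the 5xx rule is my assumption):

```python
from urllib.parse import urlsplit, urlunsplit

DYNAMIC_HOST = "blog.tangwudi.com"
STATIC_HOST = "staticblog.tangwudi.com"


def static_fallback_url(url: str) -> str:
    """Map a URL on the dynamic blog to the same path on the static copy."""
    parts = urlsplit(url)
    if parts.netloc != DYNAMIC_HOST:
        return url  # not our blog: leave untouched
    # The static export has no PHP, so query strings and fragments are dropped.
    return urlunsplit((parts.scheme, STATIC_HOST, parts.path, "", ""))


def choose_url(url: str, dynamic_status: int) -> str:
    """Assumed rule: use the static copy when the dynamic site answers with a 5xx."""
    return static_fallback_url(url) if dynamic_status >= 500 else url
```

The same mapping could live in a Worker or simply be applied by hand by changing a DNS record or redirect rule when everything else is down.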

As for the specific usage of the Simply Static plug-in, I have recorded it in detail in another article: Home Data Center Series WordPress Websites Use Simply Static Plugin to Make Sites Static, so I will not go into details here.

10 Conclusion

At this point, I have basically sorted out all the troubles I have been going through for the past six months with the WordPress multi-active architecture: from running WordPress on a single node on my Mac mini at home, to being forced to set up a disaster recovery node on Tencent Cloud's lightweight server after the telecom network was inexplicably blocked; and then after the Tencent Cloud server expired, due to the expensive renewal price, I simply migrated to Racknerd's Chicago VPS node and officially started to work on dual-active. Along the way, I gradually came up with the current complete architecture based on Cloudflare Tunnel for global local distribution, redundant back-to-source nodes, rsync to ensure data consistency, Worker to coordinate multi-point writing, and finally APO to "extract" PHP from the user access link:

[Figure: overview of the complete multi-active architecture]

Its essence is to turn WordPress, the core combat unit that once had to be online at all times, into a redundant back-origin node that can be replaced at any time and is not afraid of being offline for a period of time: even if PHP crashes, users can still read the complete page from Cloudflare's edge node, and can hardly detect any abnormalities in the backend; if a node database crashes or the VPS is temporarily shut down for maintenance, it is only a matter of synchronization and repair later; even if comments fail to be written occasionally, it is only a minor episode for a few users, which is much more calm than the previous single-node mode where people would be terrified when the server crashed.

Friends who want to try building it themselves can directly refer to the solution architecture and related deployment scripts I have compiled. The repository address is: https://github.com/tangwudi1979/multiwp-tunnel.


This article is just a summary of my past six months of continuous mistakes, corrections, tinkering, and exploration, written mainly for myself. But it is precisely because of this "idealized goal" that I could happily start over again and again, gradually clarifying details that were originally vague, and in the process learning more about the underlying network, cache consistency, and multi-active availability. Perhaps these solutions have no direct reference value for most people, but for me they are a genuine personal network exploration note and a small milestone worth recording from this period.

