Home Data Center Series: the second blog architecture overhaul, with service migration and active-active disaster recovery prompted by a VPS move

1. Introduction

With my Tencent Cloud Lighthouse server approaching its expiration date, I kept hesitating over whether to renew. The first year cost 99 yuan; the second year, if I remember right, was around 300 yuan (I don't remember exactly, roughly that). And the third year? I checked the renewal price and, well, wow:

image.png

Originally, I had already concluded that my ICP filing was useless (China Telecom has been cracking down on home broadband with inbound HTTP traffic on public IPs, so pointing a domestic CDN's back-to-origin host directly at the dynamic domain name of my home broadband's public IP no longer works). On top of that, daily use of a domestic VPS brings endless annoyances (Docker, Git, apt and so on can't be reached normally, and everything needs workarounds, which is exhausting). I had wanted to give up long ago, but kept the filing around just in case it came in handy some day.

But this renewal price shattered my last attachment and finally made up my mind: cancel the domain's ICP filing and move the domain to Cloudflare; at the same time, move my redundant on-cloud data center to an overseas VPS. So the first problem to face: where to move?

2. How to choose a VPS

2.1 Purchase criteria

There are plenty of overseas VPS providers, so in theory the choice should be wide. In practice, once I apply a few criteria, the field narrows dramatically. I mainly considered the following aspects:

2.1.1 Price

When Tencent Cloud's renewal hit roughly 280 yuan per year, my tolerance was already near its limit, so I capped the target VPS budget there: about 40 USD per year.

Generally speaking, VPS plans are billed monthly or yearly. My $40/year cap (about $3.3/month if paid monthly) already eliminates most options. Adding the further requirement of "purchasable at any time" (rather than only during special sales periods such as Black Friday) leaves essentially one option: RackNerd.

| Provider | Characteristics | Cheap plan available year-round? |
|---|---|---|
| RackNerd | Many discounts and low-priced plans | Often available (e.g. $10.99/year) |
| HostHatch | Good promotions | Expensive at regular prices |
| GreenCloud | Good routes, CN2 available | Entry plans > $40 |
| VirMach | Very cheap but failure-prone | Hit or miss; not recommended for anything important |
| CloudCone | Easy-to-use panel, average speed | Poor reputation for stability |
| BuyVM | Cheap and high quality, but sells out in seconds | Practically impossible to grab |
| LiteServer (Netherlands) | Average | Sometimes available at ~EUR 30/year |
| HostHatch / Netcup / Contabo | Big resources, low prices | Usually over budget at regular prices, or heavily restricted |

2.1.2 Performance

The lowest-end Tencent Cloud lightweight server I bought in China had this configuration: 2 cores, 2 GB RAM, a 40-50 GB SSD system disk, at discounted prices of 459 and 510 yuan/year (I bought the 510 yuan version, lured in by the 99-yuan first-year offer):

image.png

Relatively speaking, choosing Hong Kong or another overseas region would have been cheaper from the start. Unfortunately, I needed the ICP filing back then, so I had to bite the bullet and pick a mainland region (the 99 yuan/year first-year price was, of course, the other half of the reason):

image.png

If you want to run some Docker workloads properly (including applications like WordPress), 2 cores, 2 GB RAM, and a 40 GB disk are the practical minimum. In fact, 40 GB is a bit tight: you need to monitor and clean the disk frequently or problems follow, and if no log policy is configured in advance, the disk fills up easily. Disk usage on my Tencent Cloud lightweight server often sat above 80-90%:
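One way to keep container logs from filling the disk is to cap Docker's json-file logging globally; a sketch (the sizes are my own guess at sane defaults, adjust to taste):

```shell
# /etc/docker/daemon.json: cap each container's log at 3 rotated files of 10 MB.
cat >/etc/docker/daemon.json <<'EOF'
{
  "log-driver": "json-file",
  "log-opts": { "max-size": "10m", "max-file": "3" }
}
EOF
systemctl restart docker   # existing containers must be recreated to pick this up
```

Note this only applies to containers created after the restart; long-lived containers keep their old logging settings until recreated.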

image.png

So in terms of performance, my requirements are 2 cores, 2G memory, and 40G hard drive as a minimum, with no upper limit.

2.1.3 Traffic

2.1.3.1 Metering methods


I don't really have requirements here, because I don't publish websites directly from the VPS's public IPv4 or use it for "scientific internet access" (even where I have that set up, it's only a backup). I always go through Cloudflare CDN, so traffic is consumed only by Cloudflare's back-to-origin pulls for the applications, which is basically negligible. But if you have special needs, such as seeding PT traffic or running circumvention tools, then the more traffic, the better.


When shopping for a VPS, how traffic is metered is an easily overlooked but crucial detail. Especially when you need to move large volumes in both directions, the billing logic differs between providers and directly affects your cost and usability.

Here is a comparison of how mainstream domestic cloud providers and overseas VPS providers meter traffic:

| Item | Domestic VPS (e.g. Alibaba Cloud, Tencent Cloud) | Overseas VPS (varies by category) |
|---|---|---|
| Ingress (download) | Usually free or unlimited (pull as much as you like) | One-way billing providers (Vultr, DO, Linode): free; two-way billing providers (RackNerd, Hetzner): billed |
| Egress (upload) | Pay-as-you-go or limited (e.g. some plans include 1 Mbps free) | Billed by almost all overseas providers, whether one-way or two-way |
| What is counted | Only outbound (upload) traffic | One-way: outbound only; two-way: inbound + outbound combined |
| How limits manifest | Typically upload speed caps, or metered billing beyond the quota | Counted against a monthly quota (e.g. 1500 GB/month); overage means throttling or cutoff |
| Recommended use cases | PT, mirror/cache sites, low-frequency tasks, personal always-on scripts | Website hosting, reverse-proxy relays, proxy services, cross-border access |

2.1.3.2 Worked examples

Domestic VPS example:

  • Download 100 GB from the public internet → not counted
  • Upload 30 GB of seed files or data → counted
  • Total monthly traffic consumed: 30 GB

Overseas VPS example:

  • Vultr / DigitalOcean (one-way):
    • Download 100 GB → not billed
    • Upload 30 GB → counted against the quota
    • Total consumed: 30 GB
  • RackNerd / Hetzner (two-way):
    • Download 100 GB + upload 30 GB → 130 GB counted
    • On a 1500 GB/month plan, 1370 GB remains

Practical advice:

  • For download-heavy tasks (mirroring, caching, site scraping), a domestic VPS fits well: the quota is small, but only uploads count.
  • For hosting sites, running proxy services, or serving as a CDN origin, prefer an overseas VPS, ideally a one-way-billing provider such as Vultr.
  • With cheap annual plans like RackNerd's, remember that traffic counts in both directions; budget and rate-limit accordingly if you relay heavy traffic.
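The arithmetic above can be double-checked in one awk call (same example numbers: 100 GB down, 30 GB up, a 1500 GB monthly quota):

```shell
# Compare monthly billable traffic under one-way vs two-way metering.
awk -v down=100 -v up=30 -v plan=1500 'BEGIN {
  printf "one-way (egress only): %d GB billed, %d GB left\n", up, plan - up
  printf "two-way (both directions): %d GB billed, %d GB left\n", down + up, plan - down - up
}'
```

With these inputs, one-way metering bills 30 GB while two-way metering bills 130 GB, leaving 1370 GB of the plan.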

2.1.4 Geographical location

Generally speaking, friends who care about a VPS's geographical location don't actually care about the physical location itself; what they need is low direct-connection latency from China. That latency is limited mainly by the line quality of the data center hosting the VPS and its peering with Chinese carriers, not by geographical distance.

Sometimes a physically closer node performs worse, while a physically farther node has lower latency and higher speed. This is because:

  • China's international egress bandwidth is limited;
  • most budget VPS routes detour across third-party backbones (for example, Hong Kong to the US, then Singapore, then back) or return via international transit;
  • truly low-latency, high-quality access depends on lines with direct-connection properties such as CN2 GIA, CUVIP, and CTG.

Here is a common example:

  • Los Angeles (especially the Cera/Krypt data centers) is physically far from China, but because some VPS providers there deploy CN2 GIA return routes, actual access latency can stay within 140 ms: faster than some "non-direct" data centers in Hong Kong, Tokyo, and Seoul.
  • Conversely, some Hong Kong and Singapore data centers without high-quality lines can see latency of 250 ms or more because of return-path detours.

So when choosing a VPS location, don't just look at the map; pay attention to the return-path line quality. Many people pick Los Angeles or San Jose not because of the cities themselves, but because those areas concentrate data centers that are friendly to China and take high-quality routes on the return trip.

Note 1: If you care about direct-connection latency from China and your VPS must be in the United States, prefer Los Angeles or San Jose whenever they are on offer.

Note 2: A VPS in Los Angeles or San Jose does not necessarily come with premium lines such as CN2 GIA; that depends on the plan you buy. Don't expect it from dirt-cheap plans (e.g. $10.99/year). Even without premium routes, though, latency from those locations is relatively low. Take my two VPSs as examples: ping latency to Chicago was over 220 ms (ping is disabled now, so it's hard to screenshot, and I'm too lazy to change the configuration), and San Jose's was over 180 ms:

image.png

On premium routes such as CN2 GIA, direct-connection latency from China can drop below 140 ms, or even below 100 ms. Of course, at a few dozen dollars a year, you are unlikely to get premium routes.


One thing everyone needs to be aware of: ICMP ping latency ≠ actual TCP application quality. ICMP is one of the least reliable "quality of service" indicators:

  • ICMP (the ping protocol) is widely rate-limited, dropped, and given very low QoS priority;
  • many VPS hosts apply policy restrictions directly to ICMP, producing inflated latency and phantom packet loss;
  • under the GFW, ICMP and UDP get no favors at all; only TCP 443 (HTTPS) is treated "kindly", sometimes even giving a false sense of stability.

The end result: a stable, low-latency ping does not mean TCP will necessarily work well, and the reverse also happens: a bad ping alongside stable TCP.

A more reliable way to evaluate direct-connection quality:

| Protocol | Transmission stability | GFW interference | Notes |
|---|---|---|---|
| TCP port 443 | Very high | Lowest | Ideal as a primary tunnel (what Cloudflare Tunnel uses) |
| TCP, other ports | Fairly high | Mild to moderate | e.g. 22/80/8080 may be throttled or flagged |
| UDP, any port | Unstable | Medium to high | Easily throttled, QoS-downgraded, or even poisoned (especially DNS) |
| ICMP (ping) | Very poor | High | Only a rough connectivity reference; not a basis for performance evaluation |

One-sentence summary: ping is only a reachability reference and cannot represent actual network quality, especially in cross-border or cross-cloud environments. Real communication quality ranks: TCP port 443 > other TCP ports > UDP > ICMP.
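For a quick check that is more representative than ping, curl's built-in timers report DNS, TCP, and TLS handshake times against a real HTTPS endpoint (network-dependent; example.com stands in for the host you actually care about):

```shell
# time_connect is the TCP handshake; time_appconnect additionally includes TLS on 443.
curl -o /dev/null -s --max-time 10 \
  -w 'dns=%{time_namelookup}s tcp=%{time_connect}s tls=%{time_appconnect}s total=%{time_total}s\n' \
  https://example.com/
```

Run it a few times at different hours; the spread of `time_connect` over a day says far more about a route than a single ping average.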


Note 3: In fact, if you do not use the public IP of the VPS to directly publish the website (for example, directly open port 443 for users to access), and do not plan to deploy scientific Internet tools that require "domestic direct connection" on it, then the "domestic direct connection access delay" indicator becomes less important - especially when you use Cloudflare Tunnel to publish the application architecture.

In this architecture, user traffic never hits the VPS directly: it first enters Cloudflare's global edge nodes, and Cloudflare's own internal network then "pulls" the data from the origin. What really matters, then, is the connectivity, stability, and back-to-origin response time between the origin server and the Cloudflare network.

From this perspective, Chicago is actually a more suitable origin candidate than West Coast nodes such as Los Angeles and San Jose. On one hand, although the West Coast is physically closer to China, cheap VPSs are crowded into that area, and congestion with spiking packet loss and latency during the evening peak has long been the norm. On the other hand, Chicago, as a central hub, sits a little farther from China but maintains steadier, less volatile network quality year-round, so back-to-origin performance is more predictable.

More importantly, Cloudflare's nodes in the eastern and central US often have stronger backbone connectivity, with shorter and faster return paths. Under the Tunnel architecture, what we actually depend on is the stable "pull" capability between Cloudflare and the origin, not the direct-connection speed between domestic users and the origin. By comparison, Chicago is more like a quiet but reliable rear base, well suited to hosting Class C core services: the WordPress blog, personal homepage, and other businesses that are sensitive to access quality but have no reliance on direct connections.

So, if I initially felt some "premium route anxiety" for not picking San Jose or Los Angeles, I can now say with peace of mind: the Chicago node is the best origin choice under my current architecture. It is stable, cheap, and also gives fast back-to-origin for users across the United States and Europe.

2.2 Final choice: Racknerd's 882

In fact, once price is factored in, RackNerd is my only choice. Of course, if the performance didn't meet my needs I wouldn't buy it at any price, because this box serves as the cloud-side redundant data center for my home data center and quite a few applications have to run on it (geographic location and traffic don't matter to me, though they genuinely matter to friends who need a direct-connection experience). As for traffic and location, Cloudflare's "equality for all" means I have no requirements at all.

Finally, I chose the 882 (2024 Black Friday promotion package: Racknerd_High Configuration 1_882), the most cost-effective of the higher-end plans:

image.png

The performance fully meets my requirements, there is more traffic than I can ever use, and it comes with a SolusVM control panel. Most importantly, it costs about 40 US dollars, right at my ceiling of roughly 280 yuan. Perfect!

Note 1: RackNerd offers a range of high-, mid-, and low-end plans, the cheapest at $10.99/year; that configuration is very low and only good as a jump host. For more configurations, see the article: Unbeatable recommendation.

Note 2: If you care about direct-connection latency from China, don't rush the order: choose the location first, e.g. San Jose (the DC02 data center). The default is Chicago (DC03), which adds at least 40 ms of direct-connection latency. Don't ask me how I know (the difference isn't actually big, and I have no direct-connection needs, but I was still unhappy for several days~):

image.png

Note 3: After purchase, the login method and account password will be sent to the email address used during registration.

3. Transferring Domains

After settling on the VPS, I migrated the previously registered domain from DNSPod to Cloudflare. This way, besides the main ".com" domain, the domain previously used for the ICP filing can serve as a backup for comparative tests, while the ".xyz" domain will simply be dropped when it expires (too many adult sites end in .xyz, which drags down the whole suffix~).

Below is a quick record of migrating a domain from Tencent Cloud DNSPod to Cloudflare. I migrated one before but forgot to document the process, so I'm making up for that this time:

image.png

image.png

Then the DNS record will be added according to the selected method, followed by the final step:
image.png

Then click Continue at the bottom of the page:
image.png

Subsequent operations are performed on Tencent Cloud's dnspod.

Confirm that the DNSSEC function is not enabled:

image.png

If the safety lock is turned on, it needs to be closed in advance:
image.png

image.png

image.png

image.png

image.png

Then just wait for the email from cloudflare:
image.png

4. Upgrade of key application disaster recovery methods

4.1 Disaster Recovery Mode in the Tencent Cloud VPS Era

Previously, I divided applications into 3 categories: A, B, and C:

  • Category A: applications that consume lots of resources or depend on the home data center's intranet data or compute, such as Emby, lobechat-database, and Changting Leichi WAF. These run only inside the home data center (Emby needs to periodically scan video resources on the local NAS; lobechat-database is resource-hungry and sometimes calls the local Ollama model running on the M4 Pro Mac mini; Changting Leichi eats so many resources that, had it been deployed on Tencent Cloud's poor 2-core/2 GB lightweight server, the VPS could basically have done nothing else~).
  • Category B: applications suited to the Tencent Cloud lightweight server, such as Bitwarden, a Tailscale DERP relay, and other relatively lightweight services. These run on the cloud server whenever possible (though 2 cores/2 GB is entry-level at best and can't take much more).
  • Category C: my most core applications. Back then there was only one: the blog. For disaster recovery, it was deployed both in the home data center and on the Tencent Cloud server. With a database-sync mechanism (export at the home data center, Syncthing transfer to the Tencent Cloud node, automatic import there) plus the Tencent Cloud node monitoring the home data center's reachability, the blog could be taken over automatically whenever the home data center lost power or connectivity (reference article: Home Data Center Series: using Cloudflare Tunnel to automatically take over with a disaster-recovery site when the WordPress main site fails).
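The database-sync leg for the Class C blog might look roughly like this (a sketch only; the database name, user, paths, and the Syncthing-shared folder are all placeholders, not my actual setup):

```shell
# On the primary (home data center): dump the WordPress DB into a Syncthing-shared folder.
mysqldump --single-transaction -u wp -p"$WP_DB_PASS" wordpress \
  > /srv/syncthing/wp-sync/wordpress.sql

# On the standby (cloud node, e.g. via cron): import whenever Syncthing delivers a newer dump.
if [ /srv/syncthing/wp-sync/wordpress.sql -nt /var/lib/wp-sync/.last-import ]; then
  mysql -u wp -p"$WP_DB_PASS" wordpress < /srv/syncthing/wp-sync/wordpress.sql
  touch /var/lib/wp-sync/.last-import
fi
```

The `-nt` file-timestamp test keeps the standby from re-importing the same dump on every cron tick.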

Therefore, the inbound access traffic of the home data center when it is normal is as follows:

image.png

When the home data center loses power or network, the inbound access traffic of the application is as follows:

image.png

Therefore, the previous disaster recovery method was only intended to ensure that when there was a problem with the home data center, the Class C core applications (only blogs at the time) would not be interrupted.

However, the original disaster recovery switching logic is not perfect. The root of the problem lies in how Tencent Cloud nodes detect the health status of home data centers:

Since my home broadband uses a dynamic public IP, it usually changes every three days or so, which makes it impossible to do reliable detection through a fixed IP (not to mention that most people’s broadband doesn’t even have a public IP, and only commercial networks have fixed IPs).

In theory, dynamic domain names can be used to solve the problem of IP changes, but in reality the refresh time of the DNS cache is uncertain. Once the public IP has changed and the DNS has not been updated, there will be a "dual active" state for a short period of time, that is, Tencent Cloud and the home data center are online at the same time, resulting in Class C applications (such as blogs) being unable to guarantee access consistency, which is unacceptable to me.

Finally, I chose the Tailscale IP of the MacMini where the blog was deployed as the detection target (Tencent Cloud lightweight server also deployed Tailscale). The principle of this solution is: even if the public IP changes, as long as Tailscale can quickly restore the connection after the change, the Tailscale communication status can be used to determine whether the home data center is available, thereby controlling whether to switch the blog master node.
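That detection logic, in sketch form (the Tailscale IP and the takeover service name are placeholders; the real script ran on the Tencent Cloud node):

```shell
#!/bin/sh
# Probe the primary over Tailscale; after 3 consecutive misses, start the local
# takeover connector. Stand down again as soon as the primary answers.
PRIMARY_TS_IP="100.64.0.10"      # Tailscale IP of the home Mac mini (placeholder)
FAILS=0
while true; do
  if ping -c 1 -W 2 "$PRIMARY_TS_IP" >/dev/null 2>&1; then
    FAILS=0
    systemctl stop cloudflared-dr 2>/dev/null    # primary is up: do not serve
  else
    FAILS=$((FAILS + 1))
    [ "$FAILS" -ge 3 ] && systemctl start cloudflared-dr   # take over
  fi
  sleep 30
done
```

The consecutive-failure counter is what keeps a single dropped ping (easy over Tailscale during an IP change) from flapping the takeover.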

However, even this has a hidden danger: what if Tailscale on the Mac mini itself stops working properly? (A lesson learned in blood and tears; see the article: Home Data Center Series: a bloody case caused by a "steamed bun": recording the abnormal blog access caused by a Tailscale upgrade over the past few days.)

So I was never satisfied with the old disaster-recovery approach, but couldn't think of anything better.


Why not active-active from the start? Mainly because back when my Cloudflare subscription was the free plan, the so-called free managed WAF rules were essentially useless, so security rested on the "Changting Leichi WAF Community Edition" deployed in my home data center. With active-active enabled, nearly half or more of the requests could be sent straight to the Tencent Cloud node, whose performance was nowhere near enough to run Changting Leichi WAF. It would have been like an unguarded "beauty"; the free version of Wordfence alone couldn't be expected to hold the line, right?

Therefore, under the conditions at the time, the Tencent Cloud node could at best serve as the blog's "true disaster-recovery center": temporarily holding the fort, then handing Class C applications back as soon as the home data center was reachable again.


4.2 Active-Active Mode in the Racknerd VPS Era

Since becoming a noble Cloudflare Pro subscriber in early March, I have observed for two months that the "Changting Leichi WAF" inside my home data center has raised essentially no security alerts, which fully proves the WAF managed rulesets really do pull their weight (paid treatment is, it turns out, completely different from free):

image.png

At this point, "Changting Leichi WAF" has completed its historical mission, which also means that "active-active" for Class C applications can officially take the stage.

Taking advantage of the VPS move, I reorganized my applications. Since the RackNerd 882's configuration is decent (3 cores, 4.5 GB RAM, 100 GB pure SSD), I moved more applications into Class C. Normal inbound traffic now flows like this:

image.png

When the home data center loses power or network, the inbound access traffic of the application is as follows:
image.png

The failover when the home data center loses power or network looks the same as before, but the internal logic is completely different: the switch is no longer controlled by me (previously a detection script did it) but driven by Cloudflare Tunnel's default mechanism: when one of a tunnel's multiple connectors is detected as unhealthy, it is removed from the set of traffic-distribution targets, and when it is detected as recovered, it is automatically added back. The current switching logic is, at last, clean.

Similarly, when the VPS of the Racknerd node fails, all back-to-origin requests for class C applications will only be sent to the home data center:

image.png

Of course, Class A and Class B applications get no such treatment: if their node fails, the corresponding service is interrupted. But since I'm basically the only user of those applications, no harm done.


This mechanism comes with Cloudflare Tunnel by default, which supports up to 4 connectors deployed in the same tunnel:

image.png

For detailed description of the connector, see the article:Cloudflare tutorial series for home data centers (Part 9) Introduction to common Zero Trust functions and multi-scenario usage tutorials.


Currently, I have 3 Cloudflare tunnels: wudihome (Class A applications), racknerd-chicago (Class B applications), and loadbalance_w_r (Class C applications):

image.png

Among them, the Class C tunnel "loadbalance_w_r" has two connectors, one in the home data center and one on RackNerd's Chicago node:

image.png

Meanwhile, Class C has grown from just the blog to 9 applications:
image.png

Class C core services, including the blog, combine dual deployment (home data center + RackNerd node), automatic database synchronization, and the single-tunnel multi-connector failover logic to reach a theoretical availability of about 99.9997%: total service interruption across a year should stay under 3 minutes. Whichever side fails, the other automatically takes over, keeping the service online.


That availability figure is not a casual claim. In my experience, total power-plus-network outage time at home in a year is under 1 day, i.e. 1440 minutes (mostly routine power-system maintenance; the internet almost never drops, apart from that time the Communications Administration blocked me outright for 3 days~). For RackNerd, I use the industry's 99.9% high-availability reference, roughly 8 hours or 480 minutes. That yields the following:

Basis for the Class C high-availability estimate (dual-node deployment)

| Item | Home data center | RackNerd Chicago node |
|---|---|---|
| Estimated max annual downtime | 1 day = 1440 minutes | 8 hours = 480 minutes |
| Annual availability | 99.73% | 99.9% |
| Runs Class C services | Yes | Yes |
| Common failure causes | Power outage, network loss, local device faults | VPS network fluctuations, carrier maintenance, data-center failures |
| Failover behavior | RackNerd takes over automatically once the connection drops | Home data center takes over automatically once the connection drops |

Overall service availability estimate (active-active architecture)

| Item | Notes |
|---|---|
| Failure-independence assumption | The two nodes' downtime is independent of each other |
| Probability both nodes are down at once | ≈ (1 ÷ 365) × (0.33 ÷ 365) ≈ 0.0000025 |
| Theoretical annual interruption | 365 × 24 × 60 × 0.0000025 ≈ 1.3 minutes |
| Theoretical blog availability | ≈ 99.9997% (the "five nines" ballpark) |
| HA mechanisms | Data sync (Syncthing + auto-import) + Cloudflare Tunnel multi-connector automatic failover |
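The estimate can be re-derived in one awk call (same assumptions: at most 1440 and 480 minutes of independent annual downtime):

```shell
# Recompute the active-active availability estimate from the assumed downtimes.
awk 'BEGIN {
  m      = 365 * 24 * 60          # minutes in a year (525600)
  p_home = 1440 / m               # home DC: at most 1 day down per year
  p_rn   = 480  / m               # RackNerd: at most 8 hours down per year
  p_both = p_home * p_rn          # independence assumption
  printf "P(both down)   : %.7f\n", p_both
  printf "joint downtime : %.2f min/year\n", p_both * m
  printf "availability   : %.5f%%\n", (1 - p_both) * 100
}'
```

Note the estimate only covers simultaneous node outages; a bug that breaks both WordPress copies at once (shared config, bad sync) is a correlated failure the independence assumption ignores.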

Note 1: Cloudflare Tunnel is capable of more than just active-active multi-connector failover. Eh, sounds familiar:

image.png

At the same time, it can also pick the origin nearest to the visitor and make a "best-effort" attempt at proximity-based access (if it works, great; if not, so be it: very zen). Crucially, it offers no per-visitor origin stickiness (this request may go back to origin via the Chicago node, the next via the home data center). Even so, I am very satisfied: getting this much for free is plenty. In practice, my RackNerd Chicago node is now the main origin for Class C back-to-origin requests (domestic visitors are mostly routed to Cloudflare data centers in the western United States, and from there the Chicago origin is of course far closer than a home data center in China~). Most requests hit cache, so origin pulls are few, but for some reason Class C applications feel snappier to open; maybe it's my imagination~.

Note 2: Those who are familiar with active-active or multi-active data center architecture should know that one of the prerequisites for implementing active-active or multi-active deployment of applications in multiple data centers is that the service content provided by each node must be consistent. In other words, applications between multiple nodes must be synchronized, and there must be no differences in static resources or dynamic data. Especially in scenarios involving databases, the requirements are more stringent - the data of each node must be highly consistent to avoid data confusion or state loss when user requests switch between different nodes.

Therefore, although Cloudflare Tunnel provides multi-connector support, and can achieve automatic request distribution and disaster recovery switching by running the same tunnel configuration on multiple nodes, this does not mean that all types of applications can directly achieve "multi-active". If your service contains state information, database write operations, or user sessions, you must ensure that these states can be synchronized between multiple nodes, or design them as stateless services, in order to truly take advantage of the redundancy and high availability brought by multiple connectors (see the article for my data synchronization solution between multiple WordPress sites:Home Data Center Series WordPress Multi-node "semi-automatic" and "nearly" real-time synchronization solution).

Note 3: A Cloudflare Tunnel created through the Zero Trust dashboard can be deployed two ways: service mode and Docker mode. Service mode is the officially recommended deployment: the command "cloudflared service install <token>" creates a system service. However, since a machine cannot have two systemd services with the same name, this method deploys only one tunnel by default. To deploy multiple tunnels as services, you can manually adjust the systemd configuration and run the extra connectors under different service names.

In contrast, the Docker mode is more flexible: since each container is an isolated operating environment, multiple connector instances can be easily deployed, and even if they correspond to different tunnels, they will not conflict with each other. Therefore, from the perspective of deploying multiple tunnels, Docker is a more suitable way to run multiple connectors on the same node (of course, it is not as stable as the service mode).
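For example, two connectors for two different tunnels can coexist on one host under Docker roughly like this (container names and token variables are placeholders; each token comes from the Zero Trust dashboard for its tunnel):

```shell
# One container per cloudflared connector; isolation keeps the tunnels independent.
docker run -d --name cloudflared-class-b --restart unless-stopped \
  cloudflare/cloudflared:latest tunnel --no-autoupdate run --token "$TOKEN_CLASS_B"

docker run -d --name cloudflared-class-c --restart unless-stopped \
  cloudflare/cloudflared:latest tunnel --no-autoupdate run --token "$TOKEN_CLASS_C"
```

`--restart unless-stopped` is what substitutes for systemd's supervision in this mode.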

Note 4: Active-active deployment has one more key problem to solve: read-write separation. As long as the two WordPress nodes' content stays consistent, "read" requests (strictly, the back-to-origin requests triggered by user visits) can be served by either node, which shares load and improves access speed. "Write" requests are another matter entirely: operations that write to the database, such as updating site content or posting comments, must be strictly confined to a single master node, otherwise data confusion or even lost comments follow all too easily.

In my architecture, that master node is the WordPress instance on the Mac mini in my home data center; it is also the node with the Cloudflare plugin installed, responsible for automatically purging the APO cache. Updating site content is easy, since I control the write node simply by logging into the master myself. Comments are harder, because which node receives them is decided by Cloudflare's multi-connector mechanism. To force comment writes onto the master node I tried plenty of tricks before finding the simplest workable approach, stepping on quite a few pits along the way; it genuinely tormented me.

With the active-active architecture enabled, many configuration details now differ from what was written in "Home Data Center Series: using Cloudflare Tunnel to automatically take over with a disaster-recovery site when the WordPress main site fails". I will cover the whole idea, the configuration plan, and the pitfall log, including this round's "read-write separation", in a separate article later.

5. RackNerd VPS's "Fully Closed" Security Strategy

5.1 What is the "fully closed" security strategy?

Regarding the security strategy of VPS, I have always been pursuing a "fully closed" strategy. What does "fully closed" mean? Simply put, it means that no ports are opened by default, and no active connection requests from the public network are accepted. Even for basic services like SSH, I try not to expose them to the public network. If they can be accessed through the intranet, they will never go through the public network; if they can communicate through point-to-point encrypted tunnels, they will never be exposed using plain text or traditional ports.

In practice, this means I usually set the firewall's inbound policy to "deny by default", i.e. UFW is configured to deny all incoming TCP and UDP packets:

image.png
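As a reference, a minimal sketch of such a default-deny UFW setup might look like this (assumptions: an Ubuntu/Debian host with ufw installed, and `tailscale0` as the Tailscale interface name):

```shell
# Deny all inbound traffic by default; allow all outbound.
ufw default deny incoming
ufw default allow outgoing

# Optionally keep SSH reachable, but only over the Tailscale
# interface (tailscale0), never over the public interface.
ufw allow in on tailscale0 to any port 22 proto tcp

ufw enable
ufw status verbose
```

With this in place, any packet arriving on the public interface is rejected unless a later rule explicitly allows it.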


In fact, the ideal state is for the entire VPS to look "invisible": no ports open to the outside, nothing for a scanner to find, no foothold for an attacker. However, "default deny" alone does not achieve true "invisibility". To be completely "invisible" you also have to drop inbound ICMP echo requests in addition to dropping all inbound TCP and UDP packets, and that cannot be done with UFW's regular configuration rules alone; you need iptables or nftables.
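A rough nftables sketch of this "invisible" posture might look like the following (assumptions: run as root on a host with nftables installed; before relying on it, you would still carve out exceptions for whatever Tailscale or tunnel traffic you need):

```shell
# Drop everything inbound, including ICMP echo requests, while still
# allowing replies to connections this host initiated itself.
nft add table inet filter
nft add chain inet filter input '{ type filter hook input priority 0; policy drop; }'
# Loopback traffic and established/related flows (replies to outbound
# connections such as the Cloudflare Tunnel) are allowed back in.
nft add rule inet filter input iif lo accept
nft add rule inet filter input ct state established,related accept
# Anything else, including TCP SYNs, UDP packets and pings, hits the drop policy.
```

Because the policy is `drop` rather than `reject`, scanners get no response at all, which is what produces the "invisible" effect.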


Of course, this strategy sounds radical, but it is not actually complicated to operate. The key is to lean on some modern tools: for example, I use Tailscale to build a point-to-point private network and use the Tailscale IPs for routine operations and maintenance, achieving "communication only within the private network". From the carrier's perspective, all they can see is that I am using the WireGuard protocol; they cannot see the actual content of the communication (of course, even if they can't see it, they can still interfere: actively drop packets or simply block the protocol so you can't use it~).
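As a concrete instance of "communication only within the private network", SSH can be bound to the Tailscale address alone. This is a sketch; the `100.x` address below is a placeholder for the node's actual Tailscale IP, which `tailscale ip -4` prints:

```shell
# Join the tailnet (one-time; prints an auth URL to approve the device).
tailscale up

# In /etc/ssh/sshd_config, listen only on the Tailscale address so SSH
# stays unreachable from the public interface even if the firewall slips:
#   ListenAddress 100.101.102.103   # placeholder Tailscale IP
systemctl restart ssh   # the service may be named sshd on some distros
```

Combined with the default-deny firewall, SSH then exists only inside the WireGuard-encrypted tailnet.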

5.2 Core Technology of “Fully Closed” Security Strategy: Cloudflare Tunnel

Many people understand "fully closed" as a one-directional defensive posture: close the ports, accept no connections, expose no SSH. But then the question arises: if you really do this, how do you use the services running on the VPS? You can't even log in via SSH? The services just talk to themselves? How do visitors reach the website if port 443 isn't open? At that point "fully closed" is no longer a security strategy but pointless self-amusement.

Therefore, full closure must be paired with a "tunnel-based reverse access" capability, otherwise it becomes self-imprisonment. Cloudflare Tunnel is one of the core technologies that resolves this contradiction: it can expose intranet services securely without opening ports or setting up a complex reverse proxy. As long as the local network can reach Cloudflare, a secure channel can be established. This is very friendly to scenarios that need publicly accessible services without opening any external ports.

Its principle is essentially to have the VPS actively initiate a connection to Cloudflare and then expose the local service through Cloudflare's network. Because the connection is outbound, no port needs to be opened on the VPS side; like any client, as long as it can reach the Internet, the tunnel can be established.
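A minimal cloudflared setup along these lines might look like the following sketch (the tunnel name `my-vps`, the hostname `blog.example.com`, and the local port are placeholders, not my actual values):

```shell
# One-time: authenticate against your Cloudflare account and create a
# named tunnel. cloudflared dials OUT to Cloudflare, so no inbound
# port is ever opened on the VPS.
cloudflared tunnel login
cloudflared tunnel create my-vps

# ~/.cloudflared/config.yml then maps a public hostname to a local service:
#   tunnel: my-vps
#   credentials-file: /root/.cloudflared/<tunnel-id>.json
#   ingress:
#     - hostname: blog.example.com
#       service: http://localhost:8080
#     - service: http_status:404

# Publish the DNS record and run the tunnel.
cloudflared tunnel route dns my-vps blog.example.com
cloudflared tunnel run my-vps
```

From the outside, only `blog.example.com` behind Cloudflare is visible; the VPS itself still accepts no inbound connections.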

It has several benefits:

  1. No need to expose any ports: whether the VPS runs a web service, a backend management page, or a small program you wrote yourself, it can all be served through the tunnel. Public access goes only through Cloudflare, and the VPS remains "fully closed" to the outside.
  2. Access control support: Cloudflare Access lets you do identity-based access control (such as a Google account) and even MFA. It is equivalent to a built-in authentication gateway, saving you from reinventing the wheel in every service.
  3. Works around port restrictions and carrier blocking: some VPS providers, especially domestic cloud providers, impose all sorts of "inexplicable" restrictions on ports, especially 80/443. Cloudflare Tunnel does not rely on those ports at all and punches straight through, which makes it a powerful weapon against such restrictions.
  4. Flexible routing + high availability: combined with Cloudflare's load balancing and multi-connector tunnel capabilities, you can achieve very flexible backend switching. For example, I currently run one tunnel instance in my home data center and one on the overseas VPS, and let Cloudflare's load balancing mechanism automatically determine who is online and switch traffic accordingly. All of this happens while staying "fully closed".

It can be said that Cloudflare Tunnel is essentially a form of "hidden service exposure", which fits my admittedly extreme philosophy of "never open a port, yet still publish applications to the outside world".

For detailed Cloudflare Tunnel configuration steps, please refer to the article "The home data center series uses tunnel technology to allow home broadband without public IP to use Cloudflare for free to quickly build a website" (recommended).

5.3 "Fully Closed" Is Not a Panacea for All Security Problems

Although a "fully closed" security strategy does solve many everyday annoyances, such as blocking port scans, mitigating brute-force attacks, sidestepping cloud providers' strange rules, and even largely removing the need for complex firewall policy configuration, this does not mean the problem is solved once and for all, let alone that you can rest easy.

First, fully closed does not mean absolutely safe. From a network perspective, even with all inbound connections blocked, the machine still actively initiates outbound connections: running Cloudflare Tunnel, updating software, syncing data, and so on. That outbound traffic can still become an attack surface, especially if a local service is hijacked or DNS is spoofed.

Secondly, attackers don't always come from outside. Often the real risk comes from the services you deploy yourself, especially the various applications (WordPress included) spun up quickly via containers or scripts. Even if the external access path is "fully closed", a misconfigured service, a runaway shell command, or a low-quality pulled image can do lasting internal damage, and these are precisely the things a "fully closed" strategy cannot prevent.

Another very practical point: security policies are never static. Today you block every port; tomorrow a project requirement changes and you temporarily expose a service. Today you run one tunnel; tomorrow you add a bypass proxy or an admin panel... Each such change adds risk, and the slightest loosening can open gaps in the whole "fully closed" structure. People tend to grow lax once they have tasted convenience, and "fully closed" can quietly degrade into "a small leak is no big deal".

So my consistent view is: "full closure" is a baseline, not the ultimate solution. It greatly reduces the risk of passive exposure, but you still need to stay security-conscious, including:

  • Regularly audit what services you have running;
  • Check the logic behind every outbound connection;
  • Try to run services with minimal permissions;
  • Be alert to all containers, scripts, and dependencies;
  • Preferably keep logs and monitor for anomalies on key applications.

Just as a house with a tightly shut door is not necessarily safe, the wiring, gas, and sockets inside still need daily attention. Network security has never been something you can solve simply by "closing the door".

5.4 Disadvantages of the "Fully Closed" Security Strategy

Although the "fully closed" security strategy brings significant protection benefits, it is not free. The most direct problem is that many services that used to be trivially reachable through direct connections become extremely complicated, or even unusable, under the "fully closed" strategy. For example:

  1. Proxy services built on trojan-go, xray, and the like normally just need one open port and a bound TLS certificate. In a fully closed state, however, all inbound ports are rejected, so this class of direct-connection solutions cannot be deployed at all and you have to find another way.
  2. If you access services on the VPS via its Tailscale address, you will find that third-party proxy/acceleration tools (such as Surge, Clash, Quantumult) cannot take over that traffic directly, because they typically only proxy public IPv4 addresses, not virtual intranet addresses like Tailscale's. Accessing Tailscale-only services from China can therefore suffer high latency and instability, degrading the experience.
  3. File transfer becomes cumbersome: services such as rsync, SFTP, and WebDAV often need reverse connections or a relay when inbound connections cannot be established; otherwise you can only fall back to cloud sync solutions such as Cloudflare R2, OneDrive, and iCloud, which greatly reduces flexibility.
  4. Some remote operations tools, such as mosh, VS Code Remote, X11 over SSH, and frp/nps, rely on bidirectional connections or a specific connection order. Direct connections are impossible in a fully closed environment, which means the whole operations path has to be redesigned.
  5. A fully closed inbound policy can also break normal communication between containers and the host, especially under Docker's default bridge network mode. Since UFW intercepts all inbound connections by default, even connections originating from containers on the same host may be rejected as "external inbound traffic" as long as the packets pass through a host network interface (such as docker0), leaving containers unable to reach services on the host. For example, when deploying Cloudflare Tunnel with Docker, if --network=host is not used the tunnel container may fail to connect to services on the host; likewise, even when the Nginx Proxy Manager (NPM) container's local listening port is fine, its reverse proxy may not reach the host's WordPress service. This breaks many components that "should just connect locally" and increases the complexity of debugging and deployment.
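For the Docker case in point 5, two common workarounds are sketched below (assumptions: Docker's default bridge subnet 172.17.0.0/16, which may differ on your host, and a tunnel named `my-vps` whose credentials are already configured; check the actual subnet with `docker network inspect bridge`):

```shell
# Option 1: run the tunnel container on the host network, so its
# traffic never crosses docker0 and UFW never treats it as inbound.
docker run -d --network=host cloudflare/cloudflared:latest tunnel run my-vps

# Option 2: keep bridge networking but explicitly let the Docker
# bridge subnet through UFW.
ufw allow from 172.17.0.0/16
```

Option 1 is simpler but gives the container the host's full network namespace; Option 2 keeps isolation at the cost of a broader firewall exception.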

Therefore, "full closure" is not a strategy you can apply blindly; it is more of a baseline posture. You need to pick a suitable supplement for your own scenario: for example, use circumvention tools such as AnyTLS that support embedded reverse connections and work with Cloudflare Tunnel, or build a relay layer yourself (such as a unified entry point based on Cloudflare Tunnel). There are trade-offs either way; how cleverly you apply it is up to you.

Note: In fact, most of these problems can be solved by buying another cheap VPS (such as the $10.99/year one) to serve as an SSH jump box or a self-hosted circumvention/proxy machine.

6. Afterword

The subsequent VPS initialization and setup of the various applications was all routine grunt work with little technical content, so I won't write about it.

After a week of work, Racknerd's Chicago node has officially gone into operation. During this period I also ran several downtime tests against the Class C applications in my home data center and on the Chicago node and found no problems, so it has passed the initial test. Whether it passes the final test will only be known after it has run stably for a while.

The content of this blog is original; please credit the source when reprinting! For more posts, see the Site Map. The blog's RSS feed is https://blog.tangwudi.com/feed; subscriptions are welcome. If needed, you can also join the Telegram Group to discuss issues together.

Comments

  1. Windows Edge 136.0.0.0
    3 weeks ago
    2025-5-21 23:29:02

Just from the text alone I can pretty much feel how satisfied you must be after all that tinkering. Haha

    • Owner
      Yang
      Macintosh Chrome 136.0.0.0
      3 weeks ago
      2025-5-22 8:10:39

You really get me; the sense of satisfaction is through the roof~~

  2. asi
    Windows Chrome 136.0.0.0
    4 weeks ago
    2025-5-17 9:32:50

How's RN's performance? Aren't they the king of overselling, with 3 of their cores weaker than 1 core elsewhere?

    • Owner
      asi
      Macintosh Chrome 136.0.0.0
      4 weeks ago
      2025-5-17 22:13:25

I don't really have anything to compare it against; at this price point I didn't have much choice anyway~~~. From my own usage, whether it's the 1-core/1GB San Jose jump box or the 3-core/4.5GB Chicago VPS I use as my redundancy center, both fully meet my expectations.

  3. Linux Chrome 135.0.0.0
    1 month ago
    2025-5-12 10:34:55

    Tencent Cloud is okay, mainly because your customer service manager doesn't care. In fact, there are very cheap old user activities every year, such as 30% off for 3-year old user renewal, 20% off for 4-year user renewal, 10% off for 5-year user renewal (99 in the first year is about 15% off), which means 298 for a one-time purchase of 3 years, 200 in the fourth year, 150 in the fifth year, 70 in the sixth year... It's much cheaper this way (what I like about the two domestic AT companies is that they come with free snapshots. CC, RN, and BugVM hosts are cheap, but they become expensive immediately after taking into account the snapshots)

    • Owner
      Autumn Wind on Weishui River
      Macintosh Chrome 136.0.0.0
      1 month ago
      2025-5-12 10:38:06

      Is there still such a treatment as customer service manager service? No one has ever contacted me, and I didn’t even know that individual users could have this treatment?

      • tangwudi
        Linux Chrome 135.0.0.0
        1 month ago
        2025-5-12 11:18:11

        Yes, it can be called harassment... Tencent Cloud basically calls me once a month, and Alibaba Cloud calls me every week... I guess because the company also uses their services, they think I am IT operation and maintenance, but I am only responsible for project testing. I bought it for project testing...

        • Owner
          Autumn Wind on Weishui River
          Macintosh Chrome 136.0.0.0
          1 month ago
          2025-5-12 11:20:08

          I see. The conclusion is: he looks down on me.
