Manjusaka

Several explosive statements in the era of cloud-native

Since switching to cloud native work last year, I've accumulated some thoughts, and I want to share a few explosive takes on cloud native as my last technical blog post of the year. This article is purely personal and does not represent my employer's position.

Overview#

The concept of cloud native was formally proposed around 2014-2015. In 2015, Google led the creation of the Cloud Native Computing Foundation (CNCF), and in 2018 the CNCF published its first formal definition of the term in CNCF Cloud Native Definition v1.0:

Cloud native technologies empower organizations to build and run scalable applications in modern, dynamic environments such as public, private, and hybrid clouds. Containers, service meshes, microservices, immutable infrastructure, and declarative APIs exemplify this approach.
These techniques enable loosely coupled systems that are resilient, manageable, and observable. Combined with robust automation, they allow engineers to make high-impact changes frequently and predictably with minimal toil.

Reading the official text, I'd rather call it a vision than a definition: it does not clearly delineate the scope and boundaries of this new concept, nor does it spell out what actually separates Cloud Native from Non-Cloud Native.

From a personal perspective, a cloud native application possesses the following characteristics:

  1. Containerization
  2. Service-oriented

An organization practicing cloud native should have the following characteristics:

  1. Heavy use of Kubernetes or another container orchestration platform (such as Shopee's self-developed eru2)
  2. A complete monitoring system
  3. A complete CI/CD system

Based on this, and having seen many people discussing the new concept of cloud native recently, I want to share my three explosive theories (the percentages are purely my own subjective estimates, so please be gentle):

  1. Over 95% of companies have not finished building a CI/CD system, nor have they consolidated how their online service processes are managed.
  2. Over 90% of companies do not have the technical reserves to implement microservices.
  3. Over 90% of companies do not have the technical reserves to support containerization.

Let's Start the Theories#

1. Over 95% of companies have not finished building a CI/CD system, nor consolidated how their online service processes are managed#

CI stands for Continuous Integration, and CD for Continuous Delivery or Continuous Deployment. Generally speaking, they are defined as follows (quoting Brent Laster's definitions in What is CI/CD?):

Continuous integration (CI) is the process of automatically detecting, pulling, building, and (in most cases) doing unit testing as source code is changed for a product. CI is the activity that starts the pipeline (although certain pre-validations—often called "pre-flight checks"—are sometimes incorporated ahead of CI).
The goal of CI is to quickly make sure a new change from a developer is "good" and suitable for further use in the code base.
Continuous deployment (CD) refers to the idea of being able to automatically take a release of code that has come out of the CD pipeline and make it available for end users. Depending on the way the code is "installed" by users, that may mean automatically deploying something in a cloud, making an update available (such as for an app on a phone), updating a website, or simply updating the list of available releases.

In practice, the boundary between CI and CD is rarely clean. Taking a common Jenkins-based setup as an example, the typical path looks like this (a minimal sketch follows the two lists below):

  1. Create a Jenkins project, set a Pipeline (which includes tasks like code pulling, building, unit testing, etc.), and set trigger conditions.
  2. When there are operations like merging code into the main branch of the specified code repository, execute the Pipeline to generate artifacts.

After generating artifacts, there are two common approaches:

  1. Trigger an automatic deployment process in the next stage of artifact generation, deploying the generated artifact/image directly to the target server according to the deployment script.
  2. Upload the generated artifact to an intermediate platform, where a person manually triggers the deployment task through the deployment platform.
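To make the above concrete, here is a minimal sketch of what such a pipeline boils down to once the Jenkins specifics are stripped away, written as a plain Python driver script; the repository URL, image name, and test layout are hypothetical placeholders, not a real setup.

```python
import subprocess
import sys

# Hypothetical placeholders -- substitute your own repository URL and image name.
REPO = "git@example.com:team/shop.git"
IMAGE = "registry.example.com/team/shop:latest"
WORKDIR = "shop"

def run(stage, cmd, cwd=None):
    """Run one pipeline stage, failing fast on a non-zero exit code."""
    print(f"==> {stage}: {' '.join(cmd)}")
    if subprocess.run(cmd, cwd=cwd).returncode != 0:
        sys.exit(f"stage '{stage}' failed")

if __name__ == "__main__":
    run("checkout", ["git", "clone", "--depth=1", REPO, WORKDIR])
    run("test", ["python", "-m", "pytest", "tests"], cwd=WORKDIR)      # the CI half
    run("build", ["docker", "build", "-t", IMAGE, "."], cwd=WORKDIR)   # produce the artifact
    run("publish", ["docker", "push", IMAGE])
    # The CD half: either trigger an automated deploy here, or stop and let a
    # deployment platform pick up the pushed image for a manually approved release.
```

A real Jenkins Pipeline (or GitLab CI, GitHub Actions, etc.) adds triggers, credentials, and artifact bookkeeping on top, but the stages are essentially these.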

On top of this, companies with a mature process will also run auxiliary workflows (CI checks on PRs/MRs, code review gates, and so on).

When it comes to deploying to the target platform, my personal view is that most companies have not brought their online service processes under any consolidated management. Here's the old joke:

Q: How do you deploy online services? A: nohup, tmux, screen.

A standardized CI/CD process, together with proper management of online service processes, brings several foreseeable benefits:

  1. Minimize the risks associated with manual changes.
  2. Better manage the configuration of basic operational dependencies.
  3. Relying on mainstream process managers such as systemd, supervisor, or pm2 provides basic HA guarantees for processes (health checks, automatic restarts, and so on; a toy restart loop follows this list).
  4. Lay the foundation for subsequent service-oriented and containerized steps.
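As a toy illustration of the basic HA mentioned in point 3, here is a stripped-down restart-on-crash loop in Python. In practice you would write a systemd unit or a supervisor/pm2 config instead of rolling your own; the service command below is purely hypothetical.

```python
import subprocess
import time

# Hypothetical service command; a real setup would let systemd/supervisor/pm2 do this.
SERVICE_CMD = ["python", "-m", "my_service"]

def supervise():
    """Keep the service alive: start it, wait for it to exit, restart it."""
    while True:
        proc = subprocess.Popen(SERVICE_CMD)
        code = proc.wait()              # block until the process dies
        print(f"service exited with code {code}, restarting in 3s")
        time.sleep(3)                   # brief backoff to avoid a tight crash loop

if __name__ == "__main__":
    supervise()
```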

2. Over 90% of companies do not have the technical reserves to implement microservices.#

Whereas the CI/CD practices in theory 1 strike me as blocked more by organizational inertia than by technology, the theories that follow are better described as a lack of technical reserves.

Let's talk about theory 2: Over 90% of companies do not have the technical reserves to implement microservices.

First, let's discuss the concept itself. The term "microservices" has meant different things at different points in computing history; in 2014, Martin Fowler and James Lewis formally defined it in Microservices: a definition of this new architectural term.

Here is a brief overview from Wikipedia:

Microservices are small services that together make up a single application. Each service runs in its own process and communicates with the others through lightweight mechanisms such as HTTP APIs, is designed around a business capability, and is deployed through automated machinery. Services require only minimal centralized management (for example, Docker), and different services can be written in different programming languages and use different data stores.

Now, let's try to describe the significant differences between microservices and traditional monolithic services from a development point of view (a minimal single-purpose service sketch follows the list):

  1. The scope of microservices is smaller, focusing more on a specific function or a category of functions.
  2. Due to their smaller scope, changes and crashes have a smaller impact compared to traditional monoliths.
  3. More friendly for teams with multiple languages and technology stacks.
  4. Aligns with the current internet demand for small, rapid iterations and quick development.
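To make the "smaller scope" point concrete, here is a minimal sketch of a single-purpose service using only the Python standard library: one process, one business capability, an HTTP API, plus a health endpoint for whatever supervises it. The inventory domain, paths, and port are all made up for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class InventoryHandler(BaseHTTPRequestHandler):
    """A tiny service owning exactly one capability: inventory lookups."""

    def do_GET(self):
        if self.path == "/healthz":                       # liveness/health probe
            self._reply(200, {"status": "ok"})
        elif self.path.startswith("/inventory/"):         # e.g. GET /inventory/sku-42
            sku = self.path.rsplit("/", 1)[-1]
            self._reply(200, {"sku": sku, "stock": 10})   # stubbed business logic
        else:
            self._reply(404, {"error": "not found"})

    def _reply(self, status, payload):
        body = json.dumps(payload).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), InventoryHandler).serve_forever()
```

The code itself is the easy part; deciding where such a boundary should sit is where the real trouble starts, as discussed below.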

Now we need to consider what technical reserves are needed to implement and practice microservices. I believe there are mainly two aspects: architecture and governance.

First, architecture. I think the most troublesome issue with microservices is splitting them out of a traditional monolithic application (systems built on microservices from day one are a separate story, with problems of their own).

As mentioned earlier, microservices, compared to traditional monolithic applications, have a smaller scope and focus more on a specific function or a category of functions. The biggest problem in implementing microservices is reasonably defining functional boundaries and splitting them.

If the split is unreasonable, it leads to tight coupling between services. For example, if user authentication lives inside the e-commerce service, my forum service ends up depending on an e-commerce service it has no business knowing about. If the split is too fine-grained, you get the amusing phenomenon of a small business sprawling across 100+ service repositories (we jokingly call these people microservice refugees, lol).

We adopt microservices because, as the business and the team grow, the diversity of requirements and of team members' technology stacks makes a traditional monolith increasingly expensive to maintain; we hope microservices will cut that maintenance cost and risk. An unreasonable split, however, drives the maintenance cost far above what staying with the monolith would have been.

Another issue hindering the practice of microservices is governance. Let's look at some problems we face after adopting microservices:

  1. Observability. As mentioned earlier, each service after the split has a smaller scope and focuses on one function or family of functions, which can make the call chain behind a single business request much longer. By conventional wisdom, longer chains mean greater risk, so when something goes wrong (say, a sudden spike in response time), how do we pin the problem down to a specific service?
  2. Framework and configuration consolidation. In a microservices setup we usually push basic capabilities (service registration, discovery, routing, and so on) down into an internal framework, which means maintaining that framework while also consolidating its configuration.
  3. The usual service governance issues: registration, discovery, circuit breaking, and so on (a minimal circuit-breaker sketch follows this list).
  4. The demand for a complete CI/CD mechanism becomes more urgent after adopting microservices. If the situation in theory 1 exists, it will become an obstacle to practicing the microservices concept.
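For the circuit breaking mentioned in point 3, here is a deliberately minimal illustrative breaker; the thresholds, the blanket treatment of any exception as a failure, and the single-probe half-open behavior are all simplifying assumptions, not how a production library behaves.

```python
import time

class CircuitBreaker:
    """Fail fast after N consecutive failures; allow a probe call after a cooldown."""

    def __init__(self, max_failures=5, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")   # don't hammer a sick service
            # Cooldown elapsed: go half-open and let one probe call through.
            self.opened_at = None
            self.failures = self.max_failures - 1                  # one more failure re-opens it
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()                  # trip the breaker
            raise
        self.failures = 0                                          # success resets the count
        return result
```

Real breakers (in RPC frameworks, Hystrix-style libraries, or a service mesh) add sliding windows, per-endpoint state, and metrics, which is exactly the kind of machinery someone on the team has to understand and operate.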

It's true that both the open-source community (Spring Cloud, Go-Micro, and friends) and the major cloud vendors (AWS, Azure, Alibaba Cloud, GCP) are trying to offer out-of-the-box microservices solutions. But besides not really solving the architectural issues described above, these solutions come with problems of their own:

  1. Whether relying on open-source community solutions or cloud vendor solutions, users need a certain level of technical literacy to locate specific issues within the framework.
  2. Vendor Lock-in. Currently, there is no universal open-source standard for out-of-the-box microservices solutions. Therefore, relying on a specific open-source community or cloud vendor's solution will have vendor lock-in issues.
  3. Both the open-source and the cloud-vendor solutions struggle with multi-language support (the ecosystem seems to favor Java these days; Python has no rights.jpg).

So the core point that theory 2 aims to convey is: microservices are not a cost-free endeavor; on the contrary, they require significant technical reserves and human investment. Therefore, please do not think of microservices as a panacea. Use them as needed.

3. Over 90% of companies do not have the technical reserves to support containerization.#

A mainstream viewpoint today is to containerize everything you can. To be fair, there is some merit to the idea; to evaluate it, we need to look at what containers actually change for us.

Containers undoubtedly bring us many benefits:

  1. Keeping development and production environments consistent becomes easy. In other words, when a developer says "it works on my machine," the statement finally means something.
  2. It makes deploying some services easier, whether for distribution or deployment.
  3. It can achieve a certain level of resource isolation and allocation.

So, can we use containers without thinking? No, we need to review some potential downsides we may face after containerization:

  1. Container security. The most mainstream container implementation (naming Docker here) is essentially built on cgroups and namespaces for resource and process isolation, so its security deserves serious attention; Docker ships privilege-escalation and escape vulnerabilities more or less every year. That means we need a systematic mechanism governing how containers are used, so that the related escalation points stay within a controllable range. The other direction is image security. Since we are all Baidu/CSDN/Google/Stack Overflow-driven programming experts, sooner or later someone hits a problem, searches, and copies a Dockerfile verbatim, and at that point nobody knows what has been baked into the base image, which is a real risk.
  2. Container networking. Once several containers are running, how do they talk to each other? In production there is certainly more than one machine, so how do we guarantee stable container-to-container networking across hosts?
  3. Container scheduling and operations. When one machine is under heavy load, how do we move some of its containers to other machines? How do we check whether a container is alive, and restart it if it has crashed? (A minimal liveness-check sketch follows this list.)
  4. Container plumbing details: how do we build and package images, and push them to a registry (back to theory 1)? How do we troubleshoot the corner cases?
  5. For particularly large images (such as the official CUDA images beloved by machine-learning colleagues, which bundle large dictionaries and model files), how do we download and roll them out quickly?
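For the "is it alive, and who restarts it" question in point 3, here is roughly what the DIY version looks like when scripted against the Docker CLI; the container name is hypothetical, and this single-host polling loop is precisely the glue an orchestrator is supposed to replace.

```python
import subprocess
import time

CONTAINER = "shop-api"   # hypothetical container name

def is_running(name):
    """Ask the Docker daemon whether the container's state is 'running'."""
    result = subprocess.run(
        ["docker", "inspect", "--format", "{{.State.Running}}", name],
        capture_output=True, text=True,
    )
    return result.returncode == 0 and result.stdout.strip() == "true"

while True:
    if not is_running(CONTAINER):
        print(f"{CONTAINER} is not running, restarting it")
        subprocess.run(["docker", "restart", CONTAINER])
    time.sleep(10)   # naive polling on one host; scale this to a fleet and you have reinvented a scheduler
```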

There might be a viewpoint here: no worries, we can just use Kubernetes, and many of these problems will be solved! Well, let's talk about this issue.

First, I'll skip the scenario of building a self-hosted Kubernetes cluster entirely, because that is not something an average team can handle, and look at the managed public-cloud route instead. Taking Alibaba Cloud as an example, we open the cluster creation page and are greeted with something like this:

(screenshots of the Alibaba Cloud Kubernetes cluster creation form)

Now, let me ask:

  1. What is VPC?
  2. What is the difference between Kubernetes 1.16.9 and 1.14.8?
  3. What are Docker 19.03.5 and Alibaba Cloud Security Sandbox 1.1.0, and what are the differences?
  4. What is a private network?
  5. What is a virtual switch?
  6. What is a network plugin? What are Flannel and Terway? What are the differences? When you flip through the documentation, it tells you that Terway is Alibaba Cloud's modified CNI plugin based on Calico. So what is a CNI plugin? What is Calico?
  7. What is Pod CIDR and how to set it?
  8. What is Service CIDR and how to set it?
  9. What is SNAT and how to set it?
  10. How to configure security groups?
  11. What is Kube-Proxy? What is the difference between iptables and IPVS? How to choose?

You can see that the above questions cover several aspects:

  1. In-depth understanding of Kubernetes itself (CNI, runtime, kube-proxy, etc.).
  2. A reasonable network plan.
  3. Familiarity with specific functions of cloud vendors.

In my view, any one of these three demands that the technical team have substantial technical reserves and understanding of the business (technical reserves in the broad sense).

Let me ramble a bit further: in reality, even managed Kubernetes costs real money (slightly off topic, but bear with me):

  1. You need an image registry, right? Not expensive: the basic tier in China is 780 RMB per month.
  2. The services in your cluster need to be exposed? Fine, buy the lowest-spec, stripped-down SLB: 200 RMB per month.
  3. You'll pay for logs every month too, right? Say 20 GB of logs a month, nothing excessive: 39.1 RMB.
  4. And the cluster needs monitoring? Sure, buy it; say 500,000 reported entries a day? Not expensive either: 975 RMB per month.

Adding it up for one cluster: (780 + 200 + 39.1 + 975) × 12 = 23,929.2 RMB per year, before counting ENI, ECS, and the other baseline costs. How delightful.
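For the skeptical, the arithmetic in one place (all figures are the RMB list prices quoted above, purely illustrative):

```python
# Monthly line items from the list above, in RMB (illustrative list prices).
monthly = {
    "image registry (basic tier)": 780,
    "entry-level SLB": 200,
    "log service (~20 GB/month)": 39.1,
    "monitoring": 975,
}

per_month = sum(monthly.values())
print(round(per_month, 1))       # 1994.1 RMB per month
print(round(per_month * 12, 1))  # 23929.2 RMB per year, before ECS, ENI, and other base costs
```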

Moreover, Kubernetes has plenty of esoteric failure modes that take real technical depth to troubleshoot (off the top of my head, I've hit a CNI daemon crashing and never restarting, cgroup leaks on particular kernel versions, ingress controller OOMs, and so on). Browse the Kubernetes issue tracker and you'll get a feel for it (so many stories, so many tears).

Conclusion#

I know this article will stir up plenty of controversy. The point I want to make, though, is that adopting the cloud native stack (which is really an extension of traditional technologies) is not free. For companies with enough scale and real pain points, that cost pays for itself by supporting business growth; for most small and medium-sized companies, the same stack may bring very little improvement, or even a net negative.

I hope that when we, as technical people, make technology decisions, we evaluate our team's technical reserves and the benefit to the business before introducing a technology or concept, rather than adopting it just because it looks advanced, sounds impressive, or pads a résumé.

Finally, let me conclude this article with a quote I shared before:

A company pursuing technological advancement for the sake of technology is doomed.
