Introduction to DeepFlow

This document was translated by GPT-4

# 1. What is DeepFlow

DeepFlow is an observability product developed by Yunshan Networks, designed to provide deep observability for complex cloud infrastructure and cloud-native applications. Based on eBPF, DeepFlow collects observation signals such as application performance metrics, distributed traces, and continuous profiling data with zero disturbance (Zero Code), and combines them with intelligent tagging (SmartEncoding) technology to achieve full-stack (Full Stack) correlation. With DeepFlow, cloud-native applications automatically gain deep observability, relieving the burden on developers and giving DevOps/SRE teams monitoring and diagnostic capabilities that span from code to infrastructure.

To encourage innovation and contribution from developers and researchers in the global observability community, DeepFlow's core modules are open source under the Apache 2.0 License, and the academic paper "Network-Centric Distributed Tracing with DeepFlow: Troubleshooting Your Microservices in Zero Code" was published at ACM SIGCOMM 2023, a top international networking conference.

# 2. Core Features

  • Universal map of any Service: Based on the leading AutoMetrics mechanism, DeepFlow uses eBPF to capture a universal service map of the production environment without disruption, covering services written in any language, unknown third-party services, and all cloud-native infrastructure services. It can parse a large number of application protocols and offers a Wasm plug-in mechanism to extend parsing to any private protocol. It calculates full-stack golden metrics for every call across applications and infrastructure with zero disturbance, quickly pinpointing performance bottlenecks.
  • Distributed tracing of any Request: Based on the leading AutoTracing mechanism, DeepFlow uses eBPF and Wasm to achieve distributed tracing without disruption. It supports applications in any language and fully covers gateways, service meshes, databases, message queues, DNS, network cards, and other infrastructure, leaving no blind spots. Full stack: it automatically collects the network performance metrics and file read/write events associated with each Span. From here, distributed tracing enters a new era of zero instrumentation.
  • Continuous profiling of any Function: Based on the leading AutoProfiling mechanism, DeepFlow leverages eBPF to collect profiling data from production processes with zero disturbance and less than 1% overhead, generates function-level OnCPU and OffCPU flame graphs, and quickly locates full-stack performance bottlenecks in application, library, and kernel functions, automatically correlating them with distributed tracing data. Even on kernel versions as old as 2.6+, network performance profiling is still available to provide insight into code performance bottlenecks.
  • Seamless integration with popular observability technology stacks: DeepFlow can serve as a storage backend for Prometheus, OpenTelemetry, SkyWalking, and Pyroscope, and can also expose SQL, PromQL, and OTLP data interfaces as a data source for popular technology stacks (see the sketch after this list). Based on the leading AutoTagging mechanism, it automatically injects unified tags into all observation signals, including cloud resources, K8s container resources, K8s Labels/Annotations, and business attributes from the CMDB, eliminating data silos.
  • Storage performance 10x ClickHouse: Based on the leading SmartEncoding mechanism, DeepFlow injects standardized, pre-encoded meta tags into all observation signals, reducing storage overhead by 10x compared with ClickHouse's String or LowCardinality approach. Custom tags are stored separately from observation data, so you can safely inject tags of nearly unlimited dimension and cardinality while enjoying a BigTable-like ease of querying.
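
As a hedged illustration of the OTLP integration mentioned above, the sketch below sends spans from an application already instrumented with the OpenTelemetry Python SDK to a deepflow-agent. The endpoint address, port, and path are assumptions made for illustration; consult your own DeepFlow deployment for the actual OTLP receiver address.

```python
# Minimal sketch: exporting OpenTelemetry spans to a DeepFlow agent over OTLP/HTTP.
# The endpoint below is an assumption; substitute the address of your deepflow-agent.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

exporter = OTLPSpanExporter(
    endpoint="http://deepflow-agent.deepflow.svc:38086/api/v1/otel/trace"  # hypothetical endpoint
)
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("demo-service")
with tracer.start_as_current_span("checkout"):
    pass  # application work goes here; the span is batched and shipped to DeepFlow
```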

# 3. Solving Two Major Pain Points

In traditional solutions, APM aims to achieve application observability through code instrumentation. Instrumentation lets applications expose a wealth of observation signals, including metrics, traces, logs, and function-level profiling data. However, the act of instrumentation actually changes the internal state of the original program, which logically contradicts observability's requirement to "determine internal state from external data". In key business systems of important industries such as finance and telecommunications, deploying an APM Agent is very difficult, and with the advent of the cloud-native era this traditional approach faces even more severe challenges. In general, the problems with APM come down to two aspects: the intrusiveness of agents makes them hard to deploy, and observation blind spots make it impossible to delineate fault boundaries.

First, probe intrusiveness makes deployment difficult. Instrumentation requires modifying application source code and redeploying the application. Even bytecode-enhancement technologies such as Java Agent still require changing the application's startup parameters and redeploying it. Yet modifying the application code is only the first barrier; many other problems usually surface during roll-out:

  1. Code conflict: Do you often encounter runtime conflicts between different Agents when you inject multiple Java Agents for distributed tracing, performance profiling, logging, or even the service mesh? Have you ever hit a dependency version conflict that blocked compilation when introducing an observability SDK? The more business teams there are, the more pronounced such conflicts become.
  2. Difficult to maintain: If you are responsible for maintaining the company's Java Agent or SDK, how frequently can you update it? Right now, how many versions of the probe program are in your company's production environment? How long would it take to update them all to the same version? How many different languages of probe programs do you need to maintain at the same time? When the microservice framework and RPC framework of a company cannot be unified, these maintenance problems will become more serious.
  3. Blurred boundaries: Instrumentation code is woven directly into the execution logic of the business code, with no clear boundary between the two and no way to control it independently. When performance degrades or runtime errors occur, it is hard for the probe to escape blame; even a probe hardened by long production use will inevitably come under suspicion when problems arise.

This is also why intrusive instrumentation solutions are rarely seen in successful commercial products and are more often found in active open source communities; the activity of communities such as OpenTelemetry and SkyWalking proves this. In large corporations with a clear division of labor, the collaboration hurdles are an obstacle that no purely technical solution can bypass on its way to successful adoption. Especially in key industries such as finance, telecommunications, and power, which underpin the national economy and people's livelihood, the separation of responsibilities and the conflicts of interest between departments often make an intrusive solution "impossible" to implement. Even in Internet companies with open collaboration, there are still problems such as developers' reluctance to instrument their code and the blame directed at them by operations staff when performance failures occur. After long and painful efforts, people have come to realize that intrusive solutions are suitable only where each business development team voluntarily adopts them, maintains its own versions of the various Agents and SDKs, and owns the risks of hidden performance costs and runtime failures.

Second, observation blind spots make it impossible to delineate fault boundaries. Even when APM has been deployed in an enterprise, we still find it hard to define fault boundaries, especially in cloud-native infrastructure. This is because development and operations often speak different languages: when call latency is too high, developers suspect a slow network, a slow gateway, a slow database, or a slow server, but for lack of full-stack observability the answers come back as "the network card dropped no packets", "the gateway process CPU is not high", "the DB has no slow logs", "the server latency is very low", and a pile of unrelated indicators still cannot solve the problem. Triage is the most critical link in the fault-handling process, and its efficiency is extremely important.

If you are a business development engineer, you should care not only about the business itself but also about system calls and network transmission; if you are a Serverless tenant, you may also need to pay attention to the service mesh sidecar and its network transmission; if you use virtual machines directly or run your own K8s cluster, container networking is a critical concern, especially basic K8s services such as CoreDNS and the Ingress Gateway; if you administer a private cloud's compute service, you should care about network performance on the KVM hosts; if you are on a private cloud's gateway, storage, or security team, you also need to pay attention to system calls and network transmission performance on the service nodes. More importantly, the data used for fault triage should be described in a common language: how much time each hop along the full-stack path takes for a single application call. We found that the observability data developers can provide through instrumentation may cover only about a quarter of the full-stack path. In the cloud-native era, relying solely on APM to delineate fault boundaries is itself a false proposition.

# 4. eBPF Technology

Assuming you have a basic understanding of eBPF: it is a secure and efficient technology that extends kernel functionality by running programs in a sandbox, a revolutionary alternative to the traditional approaches of modifying kernel source code or writing kernel modules. eBPF programs are event-driven: when the kernel or a user program passes an eBPF hook point, the eBPF program loaded on that hook point is executed. The Linux kernel predefines a series of common hook points, and you can also dynamically add custom hook points in the kernel and in application programs using kprobe and uprobe. Thanks to Just-in-Time (JIT) compilation, eBPF code can run as efficiently as native kernel code and kernel modules; thanks to the verification mechanism, eBPF code runs safely and will not cause kernel crashes or fall into infinite loops.
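
To make the hook mechanism concrete, here is a minimal, generic sketch (not DeepFlow's own code) that uses the BCC toolkit to attach an eBPF program to a kprobe on the kernel's tcp_connect function and print a line each time the hook fires, without modifying or restarting any application. BCC and root privileges are assumed.

```python
# Minimal sketch, not DeepFlow code: attach an eBPF program to a kernel hook
# point (a kprobe on tcp_connect) using the BCC toolkit. Requires root and BCC.
from bcc import BPF

prog = r"""
#include <uapi/linux/ptrace.h>

int trace_tcp_connect(struct pt_regs *ctx) {
    // Runs inside the kernel sandbox every time tcp_connect() is entered;
    // the verifier has already proven it cannot crash the kernel or loop forever.
    bpf_trace_printk("tcp_connect() called\n");
    return 0;
}
"""

b = BPF(text=prog)                                   # compile and load via JIT
b.attach_kprobe(event="tcp_connect", fn_name="trace_tcp_connect")
print("Tracing tcp_connect()... hit Ctrl-C to stop.")
b.trace_print()                                      # stream the kernel trace pipe output
```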

Figure: eBPF hook overview (source: https://ebpf.io/what-is-ebpf/#hook-overview)

The sandbox mechanism is where eBPF differs from APM instrumentation. The "sandbox" draws a clear boundary between eBPF code and application code, enabling us to determine an application's internal state from external data without making any modification to the application itself. Let's analyze why eBPF is an excellent solution to the defects of APM code instrumentation:

First, zero disturbance solves the deployment problem. Since eBPF programs require no changes to application code, there are none of the runtime conflicts of Java Agents or the compile-time conflicts of SDKs, which solves the code-conflict problem. Since running eBPF programs requires neither modifying and restarting application processes nor redeploying applications, there is no pain of maintaining Java Agent and SDK versions, which solves the maintenance problem. And since JIT compilation and the verification mechanism guarantee that eBPF runs efficiently and safely, you need not worry about triggering unexpected performance degradation or runtime errors in application processes, which solves the blurred-boundary problem. In addition, from a management perspective, only one independent eBPF Agent process needs to run on each host, so its CPU and other resource consumption can be controlled separately and precisely.

Second, full-stack capabilities solve the problem of defining fault boundaries. eBPF's capabilities cover every layer from the kernel to user programs, so we can trace the full-stack path of a request from the application, through system calls, network transmission, gateway services, and security services, all the way to the database service or a peer microservice, **providing sufficient neutral observation data to quickly delineate fault boundaries**.

For a more detailed analysis of this topic, please refer to our article "eBPF is the key technology for implementing observability".

It is important to emphasize that this does not mean DeepFlow uses only eBPF technology. DeepFlow also integrates seamlessly with popular observability technology stacks; for example, it can serve as the storage backend for observability signals from Prometheus, OpenTelemetry, SkyWalking, Pyroscope, and others.

# 5. Mission and Vision

  • Mission: Make observation simpler.
  • Vision: To be the first choice for realizing observability in cloud-native applications.