Calculation Logic of Metrics and Operators
This document was translated by GPT-4
This article introduces the different types of metrics and the calculation logic that each operator applies to them.
# 1. Metrics
Metrics are divided into two main categories: application performance metrics and network performance metrics.
# 1.1 Application Performance Metrics
Application metrics are used to measure the performance of services in actual operation, focusing mainly on service throughput, response latency, and exceptions. By quantifying these metrics, operations personnel and developers can better understand the performance of the application during actual use and identify potential performance issues, thereby taking appropriate measures for optimization and improvement.
Each metric described below records one metric quantity per statistical period. Users can customize the statistical period; the system currently supports 1m (one minute) and 1s (one second), and the DeepFlow platform collectively refers to this data as the original data source. If multiple metric quantities are calculated within one statistical period, they are aggregated and recorded as a single metric quantity. The aggregation logic depends on the metric type and is described later for each type.
# 1.1.1 Throughput
Field | DisplayName | Unit | Type | Description |
---|---|---|---|---|
request | Requests | count | counter | Total number of requests |
response | Responses | count | counter | Total number of responses |
# 1.1.2 Delay
Field | DisplayName | Unit | Type | Description |
---|---|---|---|---|
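rrt | Average delay | μs | delay | Average of all application delays within the statistical period; one application delay is the time difference between the response and the request |
rrt_max | Maximum delay | μs | delay | Maximum of all application delays within the statistical period; one application delay is the time difference between the response and the request |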
# 1.1.3 Error
Field | DisplayName | Unit | Type | Description |
---|---|---|---|---|
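error | Errors | count | counter | Client errors + server errors. Errors are determined from the response code of the specific application protocol; see the description of the response_status field in l7_flow_log for each protocol's definition |
client_error | Client errors | count | counter | Determined from the response code of the specific application protocol; see the description of the response_status field in l7_flow_log for each protocol's definition |
server_error | Server errors | count | counter | Determined from the response code of the specific application protocol; see the description of the response_status field in l7_flow_log for each protocol's definition |
timeout | Timeouts | count | counter | Total number of requests for which no response was collected; the default timeout is 1800s for TCP and 150s for UDP |
error_ratio | Error ratio | % | percentage | Percentage of failed requests, computed as errors / responses, i.e. error / response |
client_error_ratio | Client error ratio | % | percentage | Percentage of requests with client errors, computed as client errors / responses, i.e. client_error / response |
server_error_ratio | Server error ratio | % | percentage | Percentage of requests with server errors, computed as server errors / responses, i.e. server_error / response |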
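The three ratio fields in the table above are plain quotients of counters from the same statistical period. The following is a minimal Python sketch of that arithmetic, not DeepFlow's implementation: the counter values are made up, `ratio_percent` is a hypothetical helper, and returning `None` when there are no responses is an assumption about how an empty period might be handled.

```python
from typing import Optional


def ratio_percent(numerator: int, denominator: int) -> Optional[float]:
    """Return numerator / denominator as a percentage, or None if the denominator is 0."""
    if denominator == 0:
        # Assumption: the ratio is treated as undefined when there are no responses.
        return None
    return 100.0 * numerator / denominator


# Hypothetical counters recorded for one statistical period.
period = {"response": 200, "client_error": 6, "server_error": 4}
# error = client errors + server errors, as stated in the table.
period["error"] = period["client_error"] + period["server_error"]

error_ratio = ratio_percent(period["error"], period["response"])
client_error_ratio = ratio_percent(period["client_error"], period["response"])
server_error_ratio = ratio_percent(period["server_error"], period["response"])
```

Note that the denominator is the number of responses, not requests, so requests that timed out without a response do not lower the ratio.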
# 1.2 Network Performance Metrics
Network metrics are quantitative indicators used to assess network performance, covering the network layer, transport layer, and application layer. These metrics include throughput, latency, performance, and types of exceptions.
# 1.2.1 L3 Throughput
Field | DisplayName | Unit | Type | Description |
---|---|---|---|---|
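byte | Bytes | byte | counter | Sum of bytes sent and received, counting everything from the Ethernet header onward |
byte_tx | Bytes sent | byte | counter | Sum of bytes sent, counting everything from the Ethernet header onward |
byte_rx | Bytes received | byte | counter | Sum of bytes received, counting everything from the Ethernet header onward |
packet | Packets | packet | counter | Sum of packets sent and received |
packet_tx | Packets sent | packet | counter | Sum of packets sent |
packet_rx | Packets received | packet | counter | Sum of packets received |
l3_byte | Network-layer payload | byte | counter | Sum of bytes sent and received, counting the bytes after the IP header |
l3_byte_tx | Network-layer payload sent | byte | counter | Sum of bytes sent, counting the bytes after the IP header |
l3_byte_rx | Network-layer payload received | byte | counter | Sum of bytes received, counting the bytes after the IP header |
bpp | Average packet length | byte | quotient | Average packet length, computed as bytes / packets, i.e. byte / packet |
bpp_tx | Average sent packet length | byte | quotient | Average length of sent packets, computed as bytes sent / packets sent, i.e. byte_tx / packet_tx |
bpp_rx | Average received packet length | byte | quotient | Average length of received packets, computed as bytes received / packets received, i.e. byte_rx / packet_rx |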
# 1.2.2 L4 Throughput
Field | DisplayName | Unit | Type | Description |
---|---|---|---|---|
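new_flow | New connections | connection | counter | Total number of connections that newly appeared compared with the previous statistical period; see the documentation for the definition of a connection |
closed_flow | Closed connections | connection | counter | Total number of connections closed within the current statistical period; see the documentation for the definition of a connection |
flow_load | Active connections | connection | gauge | Number of active connections within the statistical period; see the documentation for the definition of a connection |
syn_count | SYN packets | packet | counter | Total number of SYN packets in the TCP three-way handshake |
synack_count | SYN-ACK packets | packet | counter | Total number of SYN-ACK packets in the TCP three-way handshake |
l4_byte | Transport-layer payload | byte | counter | Sum of bytes sent and received, counting the length of the TCP/UDP payload |
l4_byte_tx | Transport-layer payload sent | byte | counter | Sum of bytes sent, counting the length of the TCP/UDP payload |
l4_byte_rx | Transport-layer payload received | byte | counter | Sum of bytes received, counting the length of the TCP/UDP payload |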
Active connection calculation logic:
- The collector counts the original active connections per four-tuple (client IP, server IP, protocol, server port), and then derives the active connections for the corresponding resources and paths.
- A connection is counted as active if traffic is collected within the time interval of the data source, with some special cases:
  - 1s data source: describes the number of active connections counted per second.
    - The first second of each minute: includes connections that have no traffic within this second but have not ended. This is generally used to evaluate concurrent connections (many non-overlapping connections lasting less than a second can cause some error).
    - The remaining 59 seconds of each minute: if multiple flows with the same four-tuple have no traffic within a given second, the connections of that four-tuple are not counted for that second. This is generally used to evaluate the lower limit of concurrent connections.
  - 1m data source: describes the number of active connections counted per minute.
    - Includes connections that have no traffic but have not ended. This is generally used to evaluate the upper limit of concurrent connections.
  - Custom data sources: derived from the 1s/1m data sources through Avg/Max/Min calculation. The meaning is the same as using the 1s/1m data source directly and choosing the Avg/Max/Min operator.
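To see why the per-second count serves as a lower limit and the per-minute count as an upper limit of concurrent connections, here is a deliberately simplified Python sketch. It is not the collector's algorithm: the `Flow` record and both functions are hypothetical, and the sketch ignores the special rule for the first second of each minute and the grouping by four-tuple.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple

# (client IP, server IP, protocol, server port)
FourTuple = Tuple[str, str, str, int]


@dataclass(frozen=True)
class Flow:
    key: FourTuple
    start: int               # first second of the connection (inclusive)
    end: int                 # last second of the connection (inclusive)
    traffic: FrozenSet[int]  # seconds in which packets were observed


def active_with_traffic(flows: List[Flow], second: int) -> int:
    """Per-second view: only connections that carried traffic in this second."""
    return sum(1 for f in flows if second in f.traffic)


def active_including_idle(flows: List[Flow], first: int, last: int) -> int:
    """Per-minute view: every connection open at some point in [first, last]."""
    return sum(1 for f in flows if f.start <= last and f.end >= first)


key: FourTuple = ("10.0.0.1", "10.0.0.2", "TCP", 80)
flows = [
    Flow(key, 0, 59, frozenset({0, 30})),  # long-lived, mostly idle
    Flow(key, 10, 10, frozenset({10})),    # short connection
    Flow(key, 40, 40, frozenset({40})),    # short connection
]

per_second = [active_with_traffic(flows, s) for s in range(60)]
per_minute = active_including_idle(flows, 0, 59)
```

In this made-up minute the real peak concurrency is 2 (at second 10 the long-lived and the first short connection are both open). The per-second counts never exceed 1 because the idle long-lived connection is not seen, while the per-minute count is 3 because it also adds up connections that never overlapped.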
# 1.2.3 TCP Performance (TCP Slow)
Field | DisplayName | Unit | Type | Description |
---|---|---|---|---|
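retrans_syn | SYN retransmissions | packet | counter | Number of SYN packet retransmissions in the TCP three-way handshake |
retrans_synack | SYN-ACK retransmissions | packet | counter | Number of SYN-ACK packet retransmissions in the TCP three-way handshake |
retrans | TCP retransmissions | packet | counter | Number of TCP packet retransmissions, including both client and server retransmissions |
retrans_tx | TCP client retransmissions | packet | counter | Number of retransmitted TCP packets sent from the client to the server |
retrans_rx | TCP server retransmissions | packet | counter | Number of retransmitted TCP packets the client received from the server |
zero_win | TCP zero windows | packet | counter | Number of TCP zero-window packets |
zero_win_tx | TCP client zero windows | packet | counter | Number of TCP zero-window packets sent from the client to the server |
zero_win_rx | TCP server zero windows | packet | counter | Number of TCP zero-window packets the client received from the server |
retrans_syn_ratio | SYN retransmission ratio | % | percentage | SYN retransmission ratio, computed as TCP SYN retransmissions / TCP SYN packets, i.e. retrans_syn / syn_count |
retrans_synack_ratio | SYN-ACK retransmission ratio | % | percentage | SYN-ACK retransmission ratio, computed as TCP SYN-ACK retransmissions / TCP SYN-ACK packets, i.e. retrans_synack / synack_count |
retrans_ratio | TCP retransmission ratio | % | percentage | TCP retransmission ratio, computed as TCP retransmissions / all packets, i.e. retrans / packet |
retrans_tx_ratio | TCP client retransmission ratio | % | percentage | TCP client retransmission ratio, computed as TCP client retransmissions / all packets sent, i.e. retrans_tx / packet_tx |
retrans_rx_ratio | TCP server retransmission ratio | % | percentage | TCP server retransmission ratio, computed as TCP server retransmissions / all packets received, i.e. retrans_rx / packet_rx |
zero_win_ratio | TCP zero-window ratio | % | percentage | TCP zero-window ratio, computed as TCP zero windows / all packets, i.e. zero_win / packet |
zero_win_tx_ratio | TCP client zero-window ratio | % | percentage | TCP client zero-window ratio, computed as TCP client zero windows / all packets sent, i.e. zero_win_tx / packet_tx |
zero_win_rx_ratio | TCP server zero-window ratio | % | percentage | TCP server zero-window ratio, computed as TCP server zero windows / all packets received, i.e. zero_win_rx / packet_rx |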
# 1.2.4 TCP Exceptions (TCP Error)
Field | DisplayName | Unit | Type | Description |
---|---|---|---|---|
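tcp_establish_fail | Establishment: failures | times | counter | Number of TCP connection establishment failures; see the documentation for the establishment-failure scenarios |
client_establish_fail | Establishment: client failures | times | counter | Number of failures caused by the client during TCP connection establishment |
server_establish_fail | Establishment: server failures | times | counter | Number of failures caused by the server during TCP connection establishment |
tcp_establish_fail_ratio | Establishment: failure ratio | % | percentage | Establishment failure ratio, computed as TCP establishment failures / all closed connections, i.e. tcp_establish_fail / closed_flow |
client_establish_fail_ratio | Establishment: client failure ratio | % | percentage | Establishment client failure ratio, computed as TCP establishment client failures / all closed connections, i.e. client_establish_fail / closed_flow |
server_establish_fail_ratio | Establishment: server failure ratio | % | percentage | Establishment server failure ratio, computed as TCP establishment server failures / all closed connections, i.e. server_establish_fail / closed_flow |
tcp_transfer_fail | Transfer: failures | times | counter | Number of failures during TCP transfer, covering all transfer errors; see the documentation for the transfer-failure scenarios |
tcp_transfer_fail_ratio | Transfer: failure ratio | % | percentage | Transfer failure ratio, computed as TCP transfer failures / all closed connections, i.e. tcp_transfer_fail / closed_flow |
tcp_rst_fail | Resets | connection | counter | Number of TCP connections that were reset (RST), including RSTs during establishment, transfer, and disconnection |
tcp_rst_fail_ratio | Reset ratio | % | percentage | Reset ratio, computed as TCP resets / all closed connections, i.e. tcp_rst_fail / closed_flow |
client_source_port_reuse | Establishment: client port reuse | connection | counter | One of the TCP establishment-failure scenarios; see the documentation |
server_syn_miss | Establishment: server SYN missing | connection | counter | One of the TCP establishment-failure scenarios; see the documentation |
client_establish_other_rst | Establishment: client other resets | connection | counter | One of the TCP establishment-failure scenarios; see the documentation |
client_ack_miss | Establishment: client ACK missing | connection | counter | One of the TCP establishment-failure scenarios; see the documentation |
server_reset | Establishment: server direct reset | connection | counter | One of the TCP establishment-failure scenarios; see the documentation |
server_establish_other_rst | Establishment: server other resets | connection | counter | One of the TCP establishment-failure scenarios; see the documentation |
client_rst_flow | Transfer: client reset | connection | counter | One of the TCP transfer-failure scenarios; see the documentation |
server_rst_flow | Transfer: server reset | connection | counter | One of the TCP transfer-failure scenarios; see the documentation |
server_queue_lack | Transfer: server queue overflow | connection | counter | One of the TCP transfer-failure scenarios; see the documentation |
tcp_timeout | Transfer: TCP connection timeout | connection | counter | One of the TCP transfer-failure scenarios; see the documentation |
client_half_close_flow | Disconnection: client half-close | connection | counter | One of the TCP disconnection-exception scenarios; see the documentation |
server_half_close_flow | Disconnection: server half-close | connection | counter | One of the TCP disconnection-exception scenarios; see the documentation |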
Figure: TCP client connection exceptions
Figure: TCP server connection exceptions
Figure: TCP transfer exceptions
Figure: TCP disconnection exceptions
Figure: TCP connection timeouts
# 1.2.5 Transport Layer Delay (Delay)
Field | DisplayName | Unit | Type | Description |
---|---|---|---|---|
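rtt | Average TCP connection establishment delay | μs | delay | Average of all TCP connection establishment delays within the statistical period; see the documentation for how a single delay is calculated |
rtt_client | Average TCP connection establishment client delay | μs | delay | Average of all TCP connection establishment client delays within the statistical period; see the documentation for how a single delay is calculated |
rtt_server | Average TCP connection establishment server delay | μs | delay | Average of all TCP connection establishment server delays within the statistical period; see the documentation for how a single delay is calculated |
srt | Average TCP/ICMP system delay | μs | delay | Average of all TCP/ICMP system delays within the statistical period; see the documentation for how a single delay is calculated |
art | Average data delay | μs | delay | Average of all data delays within the statistical period; the data delay covers TCP/UDP; see the documentation for how a single delay is calculated |
cit | Average client wait delay | μs | delay | Average of all client wait delays within the statistical period; this delay covers TCP only; see the documentation for how a single delay is calculated |
rtt_max | Maximum TCP connection establishment delay | μs | delay | Maximum of all TCP connection establishment delays within the statistical period; see the documentation for how a single delay is calculated |
rtt_client_max | Maximum TCP connection establishment client delay | μs | delay | Maximum of all TCP connection establishment client delays within the statistical period; see the documentation for how a single delay is calculated |
rtt_server_max | Maximum TCP connection establishment server delay | μs | delay | Maximum of all TCP connection establishment server delays within the statistical period; see the documentation for how a single delay is calculated |
srt_max | Maximum TCP/ICMP system delay | μs | delay | Maximum of all TCP/ICMP system delays within the statistical period; see the documentation for how a single delay is calculated |
art_max | Maximum data delay | μs | delay | Maximum of all data delays within the statistical period; the data delay covers TCP/UDP; see the documentation for how a single delay is calculated |
cit_max | Maximum client wait delay | μs | delay | Maximum of all client wait delays within the statistical period; this delay covers TCP only; see the documentation for how a single delay is calculated |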
TCP network delay dissection
- Delay during connection establishment
  - [1] The complete connection establishment delay covers the entire time from the client sending a SYN packet, through receiving the SYN+ACK packet replied by the server, to the client replying with an ACK packet. The connection establishment delay can be further divided into the client connection establishment delay and the server connection establishment delay.
  - [2] The client connection establishment delay is the time the client takes to reply with an ACK packet after receiving the SYN+ACK packet.
  - [3] The server connection establishment delay is the time the server takes to reply with a SYN+ACK packet after receiving the SYN packet.
- Delay during data communication, which can be broken down into client wait delay + data transfer delay.
  - [4] The client wait delay is the time the client takes to send its first request after the connection is established; it is also the time the client takes to send another data packet after receiving a data packet from the server.
  - [5] The data transfer delay is the time from the client sending a data packet to receiving a reply data packet from the server.
  - [6] Within the data transfer delay there is also the processing delay of the system protocol stack, called the system delay, which is the time from a data packet to the ACK packet that acknowledges it.
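The delays [1] to [6] above are all differences between packet timestamps. The following Python sketch shows that arithmetic for one handshake and one request/response exchange. It is an illustration only: the `Handshake` and `Exchange` records and the microsecond timestamps are hypothetical, all timestamps are assumed to be taken at a single capture point, and the dictionary keys for the data phase are descriptive labels rather than DeepFlow field names.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Handshake:
    syn_us: int     # client sends SYN
    synack_us: int  # server replies with SYN+ACK
    ack_us: int     # client replies with ACK


@dataclass
class Exchange:
    request_us: int       # client sends a data packet
    ack_us: int           # ACK for that data packet
    response_us: int      # server's reply data packet
    next_request_us: int  # client's next data packet


def handshake_delays(h: Handshake) -> Dict[str, int]:
    return {
        "rtt": h.ack_us - h.syn_us,            # [1] complete establishment delay
        "rtt_client": h.ack_us - h.synack_us,  # [2] client part
        "rtt_server": h.synack_us - h.syn_us,  # [3] server part
    }


def data_delays(e: Exchange) -> Dict[str, int]:
    return {
        "client_wait": e.next_request_us - e.response_us,  # [4]
        "data_transfer": e.response_us - e.request_us,     # [5]
        "system": e.ack_us - e.request_us,                 # [6]
    }


hs = handshake_delays(Handshake(syn_us=0, synack_us=300, ack_us=350))
dd = data_delays(Exchange(request_us=1000, ack_us=1040,
                          response_us=1900, next_request_us=2500))
```

With these made-up timestamps the complete establishment delay (350 μs) is exactly the sum of the server part (300 μs) and the client part (50 μs), which matches the decomposition described in [1].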
# 1.2.6 Application Layer Metrics (Application)
Field | DisplayName | Unit | Type | Description |
---|---|---|---|---|
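l7_request | Application requests | count | counter | Number of requests |
l7_response | Application responses | count | counter | Number of responses |
rrt | Average application delay | μs | delay | Average of all application delays within the statistical period; one application delay is the time difference between the response and the request |
rrt_max | Maximum application delay | μs | delay | Maximum of all application delays within the statistical period; one application delay is the time difference between the response and the request |
l7_error | Application errors | count | counter | Client errors + server errors. Errors are determined from the response code of the specific application protocol; see the description of the response_status field in l7_flow_log for each protocol's definition |
l7_client_error | Application client errors | count | counter | Determined from the response code of the specific application protocol; see the description of the response_status field in l7_flow_log for each protocol's definition |
l7_server_error | Application server errors | count | counter | Determined from the response code of the specific application protocol; see the description of the response_status field in l7_flow_log for each protocol's definition |
l7_timeout | Application timeouts | count | counter | Total number of requests for which no response was collected; the default timeout is 1800s for TCP and 150s for UDP |
l7_error_ratio | Application error ratio | % | percentage | Percentage of failed requests, computed as errors / responses, i.e. l7_error / l7_response |
l7_client_error_ratio | Application client error ratio | % | percentage | Percentage of requests with client errors, computed as client errors / responses, i.e. l7_client_error / l7_response |
l7_server_error_ratio | Application server error ratio | % | percentage | Percentage of requests with server errors, computed as server errors / responses, i.e. l7_server_error / l7_response |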
# 1.2.7 Cardinality Statistics (Cardinality)
The number of distinct tag values counted within the statistical period. For example, querying the cardinality of "client IP address (ip_0)" over all traffic accessing pod_1 returns how many distinct client IP addresses appear in that traffic.
Field | DisplayName | Unit | Type | Description |
---|---|---|---|---|
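The pod_1 example amounts to counting distinct values of one tag over a filtered set of records. The Python sketch below shows an exact count with a set. The record layout and the `server_pod` key are hypothetical, chosen only for illustration; `ip_0` is the client IP tag named in the text.

```python
from typing import Dict, Iterable


def distinct_clients(records: Iterable[Dict[str, str]], server_pod: str) -> int:
    """Count distinct client IPs (ip_0) among records whose server is the given pod."""
    return len({r["ip_0"] for r in records if r["server_pod"] == server_pod})


# Hypothetical flow records within one statistical period.
records = [
    {"ip_0": "10.0.0.1", "server_pod": "pod_1"},
    {"ip_0": "10.0.0.2", "server_pod": "pod_1"},
    {"ip_0": "10.0.0.1", "server_pod": "pod_1"},  # repeated client, counted once
    {"ip_0": "10.0.0.3", "server_pod": "pod_2"},  # different pod, ignored
]

clients_of_pod_1 = distinct_clients(records, "pod_1")
```

A set gives the exact answer but keeps every distinct value in memory, which is the usual reason for offering an estimated count alongside an exact one.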
# 2. Operators
Operators compute over the data in the original data source according to the selected time range and time interval. For example, suppose a line chart shows the Avg of the 1s original data source over the latest 5 minutes at a 20s time interval. The point at 14:43:00 on that chart is obtained by reading all data in the original data source within the time range 14:42:40 - 14:43:00 and calculating its average.
Operators can be nested, with the exception that aggregation operators cannot be nested inside one another. For example, the expression PerSecond(Avg(byte)) first calculates Avg(byte) and then applies PerSecond to the resulting value.
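The 14:43:00 example and the nesting order can be sketched in a few lines of Python. This is a hedged illustration, not DeepFlow's query engine: `avg_point` and `per_second` are hypothetical helpers, the date and sample values are made up, the window is assumed to be half-open (start included, end excluded), and dividing by the 20s query interval is one reading of "the time interval" used by PerSecond.

```python
from datetime import datetime, timedelta
from statistics import mean
from typing import List, Optional, Tuple

Sample = Tuple[datetime, float]  # (timestamp of a 1s record, metric quantity)


def avg_point(samples: List[Sample], point: datetime, interval_s: int) -> Optional[float]:
    """Avg of all samples in [point - interval, point); None if the window is empty."""
    start = point - timedelta(seconds=interval_s)
    window = [value for ts, value in samples if start <= ts < point]
    return mean(window) if window else None


def per_second(value: float, seconds: int) -> float:
    """Secondary operator: divide an already aggregated value by a duration in seconds."""
    return value / seconds


# Hypothetical 1s data source: 30 samples starting at 14:42:40.
base = datetime(2024, 1, 1, 14, 42, 40)
samples = [(base + timedelta(seconds=i), 100.0 + 10 * i) for i in range(30)]

point = datetime(2024, 1, 1, 14, 43, 0)
avg_byte = avg_point(samples, point, 20)  # inner operator: Avg(byte)
rate = per_second(avg_byte, 20)           # outer operator: PerSecond(Avg(byte))
```

The inner aggregation runs first and reduces the window to one number; the secondary operator then only sees that number, which is why the order of nesting matters.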
# 2.1 Aggregation Operators
Operator | Applicable Metric Types | Description |
---|---|---|
Avg | All types | Average |
Sum | All types except Percentage | Sum |
Max | All types | Maximum |
Min | All types | Minimum |
Percentile | All types | Estimated Percentile |
PercentileExact | All types | Exact Percentile |
Spread | All types | Absolute spread: Max minus Min within the statistical period |
Rspread | All types | Relative spread: Max divided by Min within the statistical period |
Stddev | All types | Standard deviation |
Apdex | Delay type | Latency Satisfaction Index |
Last | All types | Most recent value |
Uniq | Cardinality type | Estimated cardinality statistics |
UniqExact | Cardinality type | Accurate cardinality statistics |
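Spread and Rspread are the two operators in the table whose definitions are given as formulas, so they are easy to check by hand. A minimal Python sketch follows; the sample values are made up, and returning `None` when Min is 0 is an assumption, since the table does not say how that case is handled.

```python
from typing import Optional, Sequence


def spread(values: Sequence[float]) -> float:
    """Absolute spread: Max minus Min."""
    return max(values) - min(values)


def rspread(values: Sequence[float]) -> Optional[float]:
    """Relative spread: Max divided by Min; None when Min is 0 (assumption)."""
    lowest = min(values)
    if lowest == 0:
        return None
    return max(values) / lowest


# Hypothetical delay samples (in μs) within one statistical period.
delays = [120, 300, 150, 600]

abs_span = spread(delays)
rel_span = rspread(delays)
```

Spread keeps the unit of the metric (here μs), while Rspread is a unitless ratio, so the two answer different questions about the same window.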
# 2.2 Secondary Operators
Operator | Description |
---|---|
PerSecond | Calculates the rate, dividing the metric quantity by the time interval (in seconds) |
Math | Arithmetic operations, supports +, -, *, / |
Percentage | Converts the unit to a percentage (%) |
# 3. Operator Calculation Logic for Different Metrics
# 3.1 Counter/Gauge Type Metrics
- Flow_metric data table:
  - First use Sum to aggregate according to data_precision.
  - Then call the ClickHouse function that corresponds to the selected operator to calculate the result.
- Flow_log data table:
  - Call the ClickHouse function that corresponds to the selected operator to calculate the result.
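The two-step logic for the Flow_metric table can be sketched as follows. This Python code only mimics the order of operations and does not call ClickHouse: the row layout is hypothetical, and treating data_precision as the length in seconds of the data source period (for example 1 for the 1s data source) is an assumption.

```python
from collections import defaultdict
from statistics import mean
from typing import Callable, Dict, Iterable, List, Tuple

Row = Tuple[int, int]  # (timestamp in seconds, counter value)


def sum_by_precision(rows: Iterable[Row], precision_s: int) -> Dict[int, int]:
    """Step 1: Sum all rows that fall into the same data_precision bucket."""
    buckets: Dict[int, int] = defaultdict(int)
    for ts, value in rows:
        buckets[ts - ts % precision_s] += value
    return dict(buckets)


def aggregate_counter(rows: Iterable[Row], precision_s: int,
                      operator: Callable[[List[int]], float]) -> float:
    """Step 2: apply the selected operator to the per-bucket sums."""
    per_bucket = sum_by_precision(rows, precision_s)
    return operator(list(per_bucket.values()))


# Hypothetical rows: several rows share a timestamp, e.g. different tag combinations.
rows = [(0, 100), (0, 50), (1, 200), (2, 30), (2, 70)]

avg_value = aggregate_counter(rows, 1, mean)  # Avg over the sums 150, 200, 100
max_value = aggregate_counter(rows, 1, max)   # Max over the same sums
naive_avg = mean(value for _, value in rows)  # Avg without step 1, for comparison
```

Skipping the Sum step would average the five raw rows (90) instead of the three per-period totals (150), so the first step is what makes the result describe the period rather than the individual rows.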
# 3.2 Quotient/Percentage Type Metrics
- Flow_metric data table:
  - First use Sum(x)/Sum(y) to aggregate according to data_precision.
  - Then call the ClickHouse function that corresponds to the selected operator to calculate the result.
- Flow_log data table:
  - Call the ClickHouse function that corresponds to the selected operator as func(x/y) to calculate the result.
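The difference between the two tables is whether the quotient is taken after summing (Flow_metric) or per row (Flow_log). The Python sketch below contrasts the two using bpp = byte / packet as the example quotient. It is an illustration under assumptions: the row layout is hypothetical, data_precision is treated as a period length in seconds, and rows or buckets with a zero denominator are simply skipped.

```python
from collections import defaultdict
from statistics import mean
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Row = Tuple[int, int, int]  # (timestamp in seconds, x = byte, y = packet)


def quotient_flow_metric(rows: Iterable[Row], precision_s: int,
                         operator: Callable[[List[float]], float]) -> Optional[float]:
    """Sum(x)/Sum(y) per data_precision bucket, then the selected operator."""
    sums: Dict[int, List[int]] = defaultdict(lambda: [0, 0])
    for ts, x, y in rows:
        bucket = sums[ts - ts % precision_s]
        bucket[0] += x
        bucket[1] += y
    ratios = [x / y for x, y in sums.values() if y != 0]
    return operator(ratios) if ratios else None


def quotient_flow_log(rows: Iterable[Row],
                      operator: Callable[[List[float]], float]) -> Optional[float]:
    """func(x/y): the quotient is taken per row, then the selected operator."""
    ratios = [x / y for _, x, y in rows if y != 0]
    return operator(ratios) if ratios else None


# Hypothetical rows: two rows in second 0, one row in second 1.
rows = [(0, 1000, 10), (0, 3000, 10), (1, 600, 10)]

metric_avg = quotient_flow_metric(rows, 1, mean)  # mean of 4000/20 and 600/10
metric_max = quotient_flow_metric(rows, 1, max)
log_max = quotient_flow_log(rows, max)            # max of 100, 300, 60
```

On the same made-up rows, Max gives 200 for the Flow_metric path and 300 for the Flow_log path, because summing within a period first smooths out the single row with the largest per-row quotient.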
# 3.3 Delay Type Metrics
- Flow_metric data table:
  - Call the ClickHouse function that corresponds to the selected operator to calculate the result.
- Flow_log data table:
  - Call the ClickHouse function that corresponds to the selected operator to calculate the result.