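1 What is black-box monitoring
We monitor the resource usage of hosts, the running state of containers, and the runtime data of databases and middleware. All of these are infrastructure that supports the business and its services. White-box monitoring exposes their actual internal state, and by watching the metrics we can anticipate likely problems and deal with potential sources of instability before they cause trouble.
A complete monitoring setup, however, should complement the bulk of white-box application monitoring with an appropriate amount of black-box monitoring. Black-box monitoring tests the externally visible behavior of a service from the user's point of view. Common examples are HTTP probes and TCP probes, which check whether a site or service is reachable and how quickly it responds.
The biggest difference between the two is that black-box monitoring is failure-oriented: when a failure occurs, it detects it quickly, whereas white-box monitoring focuses on proactively finding or predicting potential problems. A well-rounded monitoring setup should find potential problems from the white-box side and quickly detect problems that have already happened from the black-box side.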
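2 Introduction
Blackbox Exporter is the official black-box monitoring solution provided by the Prometheus community. It lets users probe the network over HTTP, HTTPS, DNS, TCP, and ICMP.
Use cases
- HTTP tests
  Define request headers
  Check the HTTP status, HTTP response headers, and HTTP body content
- TCP tests
  Monitor the port status of business components
  Define and monitor application-layer protocols
- ICMP tests
  Host liveness checks
- POST tests
  API connectivity
- SSL certificate expiry time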
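To run Blackbox Exporter, the user has to supply the probe configuration. This may include custom HTTP headers, TLS settings needed for probing, or the validation behavior of the probe itself. In Blackbox Exporter each probe configuration is called a module, and modules are provided to Blackbox Exporter in a YAML configuration file. Each module consists mainly of the probe type (prober), the probe timeout (timeout), and the settings specific to that probe type:
# Probe type: http, tcp, dns, icmp.
prober: <prober_string>
# Timeout
[ timeout: <duration> ]
# Probe-specific settings; at most one of these may be configured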
[ http: <http_probe> ]
[ tcp: <tcp_probe> ]
[ dns: <dns_probe> ]
[ icmp: <icmp_probe> ]
Below is a simplified probe configuration file, blackbox.yml, containing two HTTP probe modules:
modules:
  http_2xx:
    prober: http
    http:
      method: GET
  http_post_2xx:
    prober: http
    http:
      method: POST
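The dns and icmp probers named in the schema use the same module layout as the HTTP modules. The following is a minimal sketch; the module names dns_example and icmp_example and the query name prometheus.io are chosen here purely for illustration.

```yaml
modules:
  # Hypothetical module name: resolves an A record and expects NOERROR.
  dns_example:
    prober: dns
    timeout: 5s
    dns:
      query_name: "prometheus.io"   # illustrative name to look up
      query_type: "A"
      valid_rcodes:
        - NOERROR
  # Hypothetical module name: sends an ICMP echo request to the target.
  icmp_example:
    prober: icmp
    timeout: 5s
    icmp:
      preferred_ip_protocol: "ip4"
```

With the dns prober, the probe target is the DNS server to query, while query_name is the record being looked up. ICMP probing typically needs extra privileges, for example the NET_RAW capability when running in a container.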
3 Installation
3.1 Docker
Deploy with Docker:
docker run -it --name blackbox_exporter -v /data/config/blackbox/:/config -p 9115:9115 -d prom/blackbox-exporter --config.file=/config/blackbox.yml
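The same container can also be described declaratively. The following is a sketch of an equivalent Docker Compose file, assuming Docker Compose is available. The file name, the service name, and the restart policy are additions for illustration and are not part of the command above.

```yaml
# docker-compose.yml (file name assumed)
services:
  blackbox_exporter:
    image: prom/blackbox-exporter
    container_name: blackbox_exporter
    # Same flag as in the docker run command.
    command:
      - "--config.file=/config/blackbox.yml"
    volumes:
      # Host directory that holds blackbox.yml, mounted at /config.
      - /data/config/blackbox/:/config
    ports:
      - "9115:9115"
    restart: unless-stopped   # assumption: not part of the original command
```

Running `docker compose up -d` in the directory that contains this file should start the exporter in the background.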
3.2 Configuration file
modules:
  http_2xx:
    prober: http
  http_post_2xx:
    prober: http
    http:
      method: POST
  http_4xx:
    prober: http
  tcp_connect:
    prober: tcp
  pop3s_banner:
    prober: tcp
    tcp:
      query_response:
        - expect: "^+OK"
      tls: true
      tls_config:
        insecure_skip_verify: false
  ssh_banner:
    prober: tcp
    tcp:
      query_response:
        - expect: "^SSH-2.0-"
  irc_banner:
    prober: tcp
    tcp:
      query_response:
        - send: "NICK prober"
        - send: "USER prober prober prober :prober"
        - expect: "PING :([^ ]+)"
          send: "PONG ${1}"
        - expect: "^:[^ ]+ 001"
  icmp:
    prober: icmp
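3.3 Accessing the monitoring data
Once the container has started, you can probe baidu.com by requesting http://127.0.0.1:9115/probe?module=http_2xx&target=baidu.com. The module parameter in the URL selects the probe module to use and the target parameter specifies the target to probe. The probe result is returned as metrics: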
[root@VM-10-48-centos blackbox]# curl -s "http://119.91.110.226:9115/probe?module=http_2xx&target=www.baidu.com"
# HELP probe_dns_lookup_time_seconds Returns the time taken for probe dns lookup in seconds
# TYPE probe_dns_lookup_time_seconds gauge
probe_dns_lookup_time_seconds 0.001789553
# HELP probe_duration_seconds Returns how long the probe took to complete in seconds
# TYPE probe_duration_seconds gauge
probe_duration_seconds 0.020884147
# HELP probe_failed_due_to_regex Indicates if probe failed due to regex
# TYPE probe_failed_due_to_regex gauge
probe_failed_due_to_regex 0
# HELP probe_http_content_length Length of http content response
# TYPE probe_http_content_length gauge
probe_http_content_length -1
# HELP probe_http_duration_seconds Duration of http request by phase, summed over all redirects
# TYPE probe_http_duration_seconds gauge
probe_http_duration_seconds{phase="connect"} 0.002984163
probe_http_duration_seconds{phase="processing"} 0.005368911
probe_http_duration_seconds{phase="resolve"} 0.001789553
probe_http_duration_seconds{phase="tls"} 0
probe_http_duration_seconds{phase="transfer"} 0.010486688
# HELP probe_http_redirects The number of redirects
# TYPE probe_http_redirects gauge
probe_http_redirects 0
# HELP probe_http_ssl Indicates if SSL was used for the final redirect
# TYPE probe_http_ssl gauge
probe_http_ssl 0
# HELP probe_http_status_code Response HTTP status code
# TYPE probe_http_status_code gauge
probe_http_status_code 200
# HELP probe_http_uncompressed_body_length Length of uncompressed response body
# TYPE probe_http_uncompressed_body_length gauge
probe_http_uncompressed_body_length 299300
# HELP probe_http_version Returns the version of HTTP of the probe response
# TYPE probe_http_version gauge
probe_http_version 1.1
# HELP probe_ip_addr_hash Specifies the hash of IP address. It's useful to detect if the IP address changes.
# TYPE probe_ip_addr_hash gauge
probe_ip_addr_hash 2.882632397e+09
# HELP probe_ip_protocol Specifies whether probe ip protocol is IP4 or IP6
# TYPE probe_ip_protocol gauge
probe_ip_protocol 4
# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 1
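From the returned samples you can read the site's DNS lookup time, response time, HTTP status code, and other metrics related to access quality, which helps administrators proactively discover failures and problems.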
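These metrics can feed Prometheus alerting rules. The sketch below assumes the probe results are already being scraped by Prometheus. The group name, alert names, thresholds, and durations are illustrative choices, not recommended values. probe_ssl_earliest_cert_expiry is absent from the sample above because that probe used plain HTTP; the exporter reports it when the target is reached over TLS.

```yaml
groups:
  - name: blackbox-example        # hypothetical group name
    rules:
      # The probe has been failing for 5 minutes.
      - alert: ProbeFailed
        expr: probe_success == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Probe failed for {{ $labels.instance }}"
      # The whole probe takes longer than 2 seconds (assumed threshold).
      - alert: ProbeSlow
        expr: probe_duration_seconds > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Probe of {{ $labels.instance }} is slow"
      # The earliest-expiring certificate in the chain expires within 30 days.
      - alert: CertificateExpiringSoon
        expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Certificate for {{ $labels.instance }} expires soon"
```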
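4 Integrating with Prometheus
Next, all that remains is to configure a scrape job in Prometheus for the Blackbox Exporter instance. The most straightforward configuration is: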
- job_name: baidu_http2xx_probe
  params:
    module:
      - http_2xx
    target:
      - baidu.com
  metrics_path: /probe
  static_configs:
    - targets:
        - 127.0.0.1:9115
- job_name: prometheus_http2xx_probe
  params:
    module:
      - http_2xx
    target:
      - prometheus.io
  metrics_path: /probe
  static_configs:
    - targets:
        - 127.0.0.1:9115
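Here two scrape jobs named baidu_http2xx_probe and prometheus_http2xx_probe are configured, and params specifies the probe module (module) and the probe target (target) for each.
This raises a problem: with N target sites that each need M kinds of probes, Prometheus would end up with N * M scrape jobs, which is clearly unacceptable from a configuration-management point of view. Relabeling can simplify this configuration, as follows: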
scrape_configs:
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - http://prometheus.io    # Target to probe with http.
          - https://prometheus.io   # Target to probe with https.
          - http://example.com:8080 # Target to probe with http on port 8080.
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115
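Here one scrape job is defined per probe module (such as http_2xx), and the job's targets are set directly to the sites to be probed. Before samples are scraped, relabel_configs rewrites the job's targets dynamically.
- Step 1: write the address of the target instance into the __param_target label. A label of the form __param_<name> adds a <name> parameter to the request URL when the target is scraped, which is equivalent to setting it in params;
- Step 2: read the value of __param_target and write it to the instance label;
- Step 3: overwrite the target's __address__ label with the address of the Blackbox Exporter instance.
These three relabel steps greatly reduce the complexity of the Prometheus job configuration.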
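The same mechanism can select the module per target, so that a single job covers several probe types. The following is a sketch, assuming the modules http_2xx and ssh_banner exist in blackbox.yml. The custom target label module is a name chosen here for illustration.

```yaml
scrape_configs:
  - job_name: 'blackbox'
    metrics_path: /probe
    static_configs:
      - targets:
          - https://prometheus.io
        labels:
          module: http_2xx      # custom label, read by a relabel rule below
      - targets:
          - example.com:22
        labels:
          module: ssh_banner
    relabel_configs:
      # Copy the probed address into the ?target= URL parameter.
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed address as the instance label.
      - source_labels: [__param_target]
        target_label: instance
      # Copy the custom label into the ?module= URL parameter.
      - source_labels: [module]
        target_label: __param_module
      # Send the actual scrape to the Blackbox Exporter.
      - target_label: __address__
        replacement: 127.0.0.1:9115
```

Because module is an ordinary label, it also stays on the resulting series, which makes it possible to filter queries by probe type.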
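5 HTTP probe
The HTTP probe is one of the most commonly used probes in black-box monitoring. With it you can set up effective monitoring for a website or HTTP service, covering both its availability and user-experience aspects such as response time. Besides raising alerts promptly when the service misbehaves, it also helps system administrators analyze and optimize the site experience.
As described in the previous section, every probe in Blackbox Exporter is configured as a module. The following configures the simplest possible HTTP probe:
modules:
  http_2xx_example:
    prober: http
    http:
The prober setting specifies the probe type. The http setting customizes how the probe behaves. Here nothing is configured under http, so the HTTP probe's defaults apply: the probe sends an HTTP GET request to the target and checks whether the returned status code is 2XX. If it is, the probe succeeds; otherwise it fails.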
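5.1 Customizing the HTTP request
HTTP services are exposed in different forms: some are simple web pages, while others are REST-based API services. Probing these different kinds of HTTP services requires that administrators can customize the behavior of the HTTP probe further, including the HTTP request method, HTTP headers, and request parameters. Services with authentication enabled need the HTTP probe to support the corresponding auth settings, and HTTPS services need custom certificate settings.
As shown below, method defines the request method used for probing. For services that need request parameters, headers defines the request headers and body defines the request content:
http_post_2xx:
  prober: http
  timeout: 5s
  http:
    method: POST
    headers:
      Content-Type: application/json
    body: '{}'
If the HTTP service has authentication enabled, Blackbox Exporter has built-in support for basic_auth, so the credentials can be set directly:
http_basic_auth_example:
  prober: http
  timeout: 5s
  http:
    method: POST
    headers:
      Host: "login.example.com"
    basic_auth:
      username: "username"
      password: "mysecret"
For services that use a Bearer Token, the token string can be given directly with the bearer_token setting, or a token file can be specified with bearer_token_file.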
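A bearer-token module might look like the following sketch. The module name and the token path are hypothetical. Reading the token from a file keeps the secret out of the module configuration.

```yaml
modules:
  http_bearer_token_example:      # hypothetical module name
    prober: http
    timeout: 5s
    http:
      method: GET
      # Read the token from a file instead of embedding it in the config.
      bearer_token_file: "/config/token"   # hypothetical path
```

To embed the token inline instead, replace bearer_token_file with bearer_token and the token string.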
For HTTPS services that need a custom certificate, tls_config specifies the certificate settings:
http_custom_ca_example:
  prober: http
  http:
    method: GET
    tls_config:
      ca_file: "/certs/my_cert.crt"
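5.2 Customizing probe behavior
By default the HTTP probe validates only the HTTP status code: if the status code is 2XX (200 <= StatusCode < 300), the probe succeeds and the probe_success metric it returns is 1.
If you need to accept specific HTTP status codes or have requirements on the HTTP version, use valid_http_versions and valid_status_codes, as shown below:
http_2xx_example:
  prober: http
  timeout: 5s
  http:
    # HTTP/2 responses are reported as "HTTP/2.0"
    valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
    # An empty list keeps the default (2xx)
    valid_status_codes: []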
By default the samples returned by Blackbox Exporter also include the probe_http_ssl metric, which indicates whether the probe used SSL:
# HELP probe_http_ssl Indicates if SSL was used for the final redirect
# TYPE probe_http_ssl gauge
probe_http_ssl 0
If you have a mandatory standard on whether an HTTP service uses SSL, configure fail_if_ssl and fail_if_not_ssl. When fail_if_ssl is true, the probe fails if the site uses SSL and succeeds otherwise. fail_if_not_ssl is the exact opposite. The tables show the resulting probe_success value, assuming every other check passes.
fail_if_ssl | SSL | No SSL |
---|---|---|
true | probe_success = 0 | probe_success = 1 |
false | probe_success = 1 | probe_success = 1 |

fail_if_not_ssl | SSL | No SSL |
---|---|---|
true | probe_success = 1 | probe_success = 0 |
false | probe_success = 1 | probe_success = 1 |
http_2xx_example:
  prober: http
  timeout: 5s
  http:
    valid_status_codes: []
    method: GET
    no_follow_redirects: false
    fail_if_ssl: false
    fail_if_not_ssl: false
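The module above leaves both switches off. A module that enforces HTTPS might look like the following sketch, in which the module name is hypothetical.

```yaml
modules:
  https_required_example:         # hypothetical module name
    prober: http
    timeout: 5s
    http:
      method: GET
      # Fail the probe when the final response was not served over SSL/TLS.
      fail_if_not_ssl: true
```

Since probe_http_ssl describes the final response, a plain-HTTP target that redirects to an HTTPS URL would still be expected to pass this check as long as redirects are followed.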
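Besides the HTTP status code, HTTP protocol version, and SSL usage, probe success can also be decided by matching the response content of the HTTP service. With fail_if_matches_regexp and fail_if_not_matches_regexp you define lists of regular expressions: the probe fails if the response body matches any expression in the first list, or if any expression in the second list does not match. (Recent Blackbox Exporter releases name these options fail_if_body_matches_regexp and fail_if_body_not_matches_regexp.)
http_2xx_example:
  prober: http
  timeout: 5s
  http:
    method: GET
    fail_if_matches_regexp:
      - "Could not connect to database"
    fail_if_not_matches_regexp:
      - "Download the latest version here"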
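Finally, note that by default the HTTP probe uses IPv6. In most cases you can set preferred_ip_protocol to ip4 to force probing over IPv4. The samples returned by Blackbox Exporter also report the protocol in use through the probe_ip_protocol metric: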
# HELP probe_ip_protocol Specifies whether probe ip protocol is IP4 or IP6
# TYPE probe_ip_protocol gauge
probe_ip_protocol 6
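A module that forces IPv4 might look like the following sketch. The module name is hypothetical. ip_protocol_fallback is documented for current Blackbox Exporter releases, so its availability depends on the version in use.

```yaml
modules:
  http_2xx_ipv4_example:          # hypothetical module name
    prober: http
    timeout: 5s
    http:
      # Prefer IPv4 when resolving the target.
      preferred_ip_protocol: "ip4"
      # Do not fall back to IPv6 when no IPv4 address is found.
      ip_protocol_fallback: false
```

A target probed through such a module should report probe_ip_protocol 4.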
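6 Grafana configuration
6.1 Viewing data in Explore
6.2 Configuring the dashboard
This dashboard is template number 9965; choose Prometheus as the data source. Download it from https://grafana.com/grafana/dashboards/9965
The dashboard requires the pie chart plugin, available from https://grafana.com/grafana/plugins/grafana-piechart-panel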
grafana-cli plugins install grafana-piechart-panel
7 Helm
Deploy with Helm:
[Kubernetes] helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
[Kubernetes] helm repo update
[Kubernetes] helm pull prometheus-community/prometheus-blackbox-exporter
[Kubernetes] tar zxf prometheus-blackbox-exporter-4.14.0.tgz
[Kubernetes] cd prometheus-blackbox-exporter
[prometheus-blackbox-exporter] ls
Chart.yaml README.md ci templates values.yaml
7.1 Modifying the configuration
# cat values.yaml
restartPolicy: Always
kind: Deployment
podDisruptionBudget: {}
  # maxUnavailable: 0
## Additional blackbox-exporter container environment variables
## For instance to add a http_proxy
##
## extraEnv:
##   HTTP_PROXY: "http://superproxy.com:3128"
##   NO_PROXY: "localhost,127.0.0.1"
extraEnv: {}
extraVolumes: []
  # - name: secret-blackbox-oauth-htpasswd
  #   secret:
  #     defaultMode: 420
  #     secretName: blackbox-oauth-htpasswd
  # - name: storage-volume
  #   persistentVolumeClaim:
  #     claimName: example
## Additional volumes that will be attached to the blackbox-exporter container
extraVolumeMounts:
  # - name: ca-certs
  #   mountPath: /etc/ssl/certs/ca-certificates.crt
extraContainers: []
  # - name: oAuth2-proxy
  #   args:
  #     - -https-address=:9116
  #     - -upstream=http://localhost:9115
  #     - -skip-auth-regex=^/metrics
  #     - -openshift-delegate-urls={"/":{"group":"monitoring.coreos.com","resource":"prometheuses","verb":"get"}}
  #   image: openshift/oauth-proxy:v1.1.0
  #   ports:
  #     - containerPort: 9116
  #       name: proxy
  #   resources:
  #     limits:
  #       memory: 16Mi
  #     requests:
  #       memory: 4Mi
  #       cpu: 20m
  #   volumeMounts:
  #     - mountPath: /etc/prometheus/secrets/blackbox-tls
  #       name: secret-blackbox-tls
## Enable pod security policy
pspEnabled: true
hostNetwork: false
strategy:
  rollingUpdate:
    maxSurge: 1
    maxUnavailable: 0
  type: RollingUpdate
image:
  repository: prom/blackbox-exporter
  tag: v0.19.0
  pullPolicy: IfNotPresent
  ## Optionally specify an array of imagePullSecrets.
  ## Secrets must be manually created in the namespace.
  ## ref: https://kubernetes.io/docs/tasks/configure-pod-container/pull-image-private-registry/
  ##
  # pullSecrets:
  #   - myRegistrKeySecretName
## User to run blackbox-exporter container as
runAsUser: 1000
readOnlyRootFilesystem: true
runAsNonRoot: true
livenessProbe:
  httpGet:
    path: /health
    port: http
readinessProbe:
  httpGet:
    path: /health
    port: http
nodeSelector: {}
tolerations: []
affinity: {}
# if the configuration is managed as secret outside the chart, using SealedSecret for example,
# provide the name of the secret here. If secretConfig is set to true, configExistingSecretName will be ignored
# in favor of the config value.
configExistingSecretName: ""
# Store the configuration as a `Secret` instead of a `ConfigMap`, useful in case it contains sensitive data
secretConfig: false
config:
  modules:
    http_2xx:
      prober: http
      timeout: 5s
      http:
        valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
        no_follow_redirects: false
        preferred_ip_protocol: "ip4"
    http_4xx:
      prober: http
      timeout: 5s
      http:
        valid_status_codes: [200, 201, 202, 300, 301, 302, 303, 400, 401, 402, 403, 404]
        valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
        no_follow_redirects: false
        preferred_ip_protocol: "ip4"
    tcp_connect:
      prober: tcp
      timeout: 5s
      tcp:
        preferred_ip_protocol: "ip4"
# Set custom config path, other than default /config/blackbox.yaml. If let empty, path will be "/config/blackbox.yaml"
# configPath: "/foo/bar"
extraConfigmapMounts: []
  # - name: certs-configmap
  #   mountPath: /etc/secrets/ssl/
  #   subPath: certificates.crt # (optional)
  #   configMap: certs-configmap
  #   readOnly: true
  #   defaultMode: 420
## Additional secret mounts
# Defines additional mounts with secrets. Secrets must be manually created in the namespace.
extraSecretMounts: []
  # - name: secret-files
  #   mountPath: /etc/secrets
  #   secretName: blackbox-secret-files
  #   readOnly: true
  #   defaultMode: 420
allowIcmp: false
resources: {}
  # limits:
  #   memory: 300Mi
  # requests:
  #   memory: 50Mi
priorityClassName: ""
service:
  annotations: {}
  labels: {}
  type: ClusterIP
  port: 9115
# Only changes container port. Application port can be changed with extraArgs (--web.listen-address=:9115)
# https://github.com/prometheus/blackbox_exporter/blob/998037b5b40c1de5fee348ffdea8820509d85171/main.go#L55
containerPort: 9115
serviceAccount:
  # Specifies whether a ServiceAccount should be created
  create: true
  # The name of the ServiceAccount to use.
  # If not set and create is true, a name is generated using the fullname template
  name:
  annotations: {}
## An Ingress resource can provide name-based virtual hosting and TLS
## termination among other things for CouchDB deployments which are accessed
## from outside the Kubernetes cluster.
## ref: https://kubernetes.io/docs/concepts/services-networking/ingress/
ingress:
  enabled: false
  hosts: []
  # - chart-example.local
  path: '/'
  annotations: {}
    # kubernetes.io/ingress.class: nginx
    # kubernetes.io/tls-acme: "true"
  tls: []
  # Secrets must be manually created in the namespace.
  # - secretName: chart-example-tls
  #   hosts:
  #     - chart-example.local
podAnnotations: {}
# Hostaliases allow to add additional DNS entries to be injected directly into pods.
# This will take precedence over your implemented DNS solution
hostAliases: []
#  - ip: 192.168.1.1
#    hostNames:
#      - test.example.com
#      - another.example.net
pod:
  labels: {}
extraArgs: []
#  --history.limit=1000
replicas: 1
serviceMonitor:
  ## If true, a ServiceMonitor CRD is created for a prometheus operator
  ## https://github.com/coreos/prometheus-operator
  ##
  enabled: true
  # Default values that will be used for all ServiceMonitors created by `targets`
  defaults:
    additionalMetricsRelabels:
      pod: prometheus-blackbox-exporter
    labels: {}
    interval: 20s
    scrapeTimeout: 5s
    module: http_2xx
  ## scheme: HTTP scheme to use for scraping. Can be used with `tlsConfig` for example if using istio mTLS.
  scheme: http
  ## tlsConfig: TLS configuration to use when scraping the endpoint. For example if using istio mTLS.
  ## Of type: https://github.com/coreos/prometheus-operator/blob/master/Documentation/api.md#tlsconfig
  tlsConfig: {}
  bearerTokenFile:
  targets:
    - name: boke
      url: https://www.srelife.cn
      labels: {}
      interval: 60s
      scrapeTimeout: 60s
      module: http_2xx
      additionalMetricsRelabels: {}
    - name: baidu
      url: https://www.baidu.com
      labels: {}
      interval: 60s
      scrapeTimeout: 60s
      module: http_2xx
      additionalMetricsRelabels: {}
    - name: tencent
      url: https://www.tencent.com
      labels: {}
      interval: 60s
      scrapeTimeout: 60s
      module: http_2xx
      additionalMetricsRelabels: {}
    - name: icmp-114
      url: 114.114.114.114
      labels: {}
      interval: 60s
      scrapeTimeout: 60s
      module: tcp_connect
      additionalMetricsRelabels: {}
## Custom PrometheusRules to be defined
## ref: https://github.com/coreos/prometheus-operator#customresourcedefinitions
prometheusRule:
  enabled: false
  additionalLabels: {}
  namespace: ""
  rules: []
## Network policy for chart
networkPolicy:
  # Enable network policy and allow access from anywhere
  enabled: false
  # Limit access only from monitoring namespace
  # Before setting this value to true, you must add the name=monitoring label to the monitoring namespace
  # Network Policy uses label filtering
  allowMonitoringNamespace: false
## dnsPolicy and dnsConfig for Deployments and Daemonsets if you want non-default settings.
## These will be passed directly to the PodSpec of same.
dnsPolicy:
dnsConfig:
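Instead of editing the full values.yaml, Helm also accepts a small override file that contains only the keys being changed. The following sketch uses keys that appear in the listing above. The file name my-values.yaml and the example target are assumptions for illustration.

```yaml
# my-values.yaml (hypothetical file name)
config:
  modules:
    http_2xx:
      prober: http
      timeout: 5s
      http:
        preferred_ip_protocol: "ip4"
serviceMonitor:
  # Create ServiceMonitor objects for the targets listed below.
  enabled: true
  targets:
    - name: example               # illustrative target
      url: https://example.com
      module: http_2xx
```

Such a file is passed to Helm with the -f flag, for example `helm install -n monitoring blackbox-exporter-1 . -f my-values.yaml`. Keys that are not listed keep the values from the chart's own values.yaml.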
7.2 Deploying to the cluster
[prometheus-blackbox-exporter] helm install -n monitoring blackbox-exporter-1 .
NAME: blackbox-exporter-1
LAST DEPLOYED: Thu Jun 10 11:51:44 2021
NAMESPACE: monitoring
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
See https://github.com/prometheus/blackbox_exporter/ for how to configure Prometheus and the Blackbox Exporter.
[prometheus-blackbox-exporter] kubectl get all -n monitoring
NAME READY STATUS RESTARTS AGE
pod/blackbox-exporter-1-prometheus-blackbox-exporter-84f894b89wpdtc 1/1 Running 0 7s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/blackbox-exporter-1-prometheus-blackbox-exporter ClusterIP 10.255.254.215 <none> 9115/TCP 9s
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/blackbox-exporter-1-prometheus-blackbox-exporter 1/1 1 1 8s
NAME DESIRED CURRENT READY AGE
replicaset.apps/blackbox-exporter-1-prometheus-blackbox-exporter-84f894b894 1 1 1 8s