Prometheus黑盒监测之blackbox_exporter

2,041次阅读

共计 15182 个字符，预计需要花费 38 分钟才能阅读完成。

我们监控主机的资源用量、容器的运行状态、数据库中间件的运行数据。这些都是支持业务和服务的基础设施，通过白盒能够了解其内部的实际运行状态，通过对监控指标的观察能够预判可能出现的问题，从而对潜在的不确定因素进行优化。

而从完整的监控逻辑的角度，除了大量的应用白盒监控以外，还应该添加适当的黑盒监控。黑盒监控即以用户的身份测试服务的外部可见性，常见的黑盒监控包括HTTP探针、TCP探针等用于检测站点或者服务的可访问性，以及访问效率等。

黑盒监控相较于白盒监控最大的不同在于黑盒监控是以故障为导向当故障发生时，黑盒监控能快速发现故障，而白盒监控则侧重于主动发现或者预测潜在的问题。一个完善的监控目标是要能够从白盒的角度发现潜在问题，能够在黑盒的角度快速发现已经发生的问题。

Blackbox Exporter是Prometheus社区提供的官方黑盒监控解决方案，其允许用户通过：HTTP、HTTPS、DNS、TCP以及ICMP的方式对网络进行探测。

应用场景

HTTP 测试

定义 Request Header 信息
判断 Http status / Http Respones Header / Http Body 内容

TCP 测试

业务组件端口状态监听
应用层协议定义与监听

ICMP 测试

主机探活机制

POST 测试

接口联通性

SSL 证书过期时间

运行Blackbox Exporter时，需要用户提供探针的配置信息，这些配置信息可能是一些自定义的HTTP头信息，也可能是探测时需要的一些TSL配置，也可能是探针本身的验证行为。在Blackbox Exporter每一个探针配置称为一个module，并且以YAML配置文件的形式提供给Blackbox Exporter。每一个module主要包含以下配置内容，包括探针类型（prober）、验证访问超时时间（timeout）、以及当前探针的具体配置项：

  # 探针类型：http、 tcp、 dns、 icmp.
  prober: <prober_string>

  # 超时时间
  [ timeout: <duration> ]

  # 探针的详细配置，最多只能配置其中的一个
  [ http: <http_probe> ]
  [ tcp: <tcp_probe> ]
  [ dns: <dns_probe> ]
  [ icmp: <icmp_probe> ]

下面是一个简化的探针配置文件blockbox.yml，包含两个HTTP探针配置项：

modules:
  http_2xx:
    prober: http
    http:
      method: GET
  http_post_2xx:
    prober: http
    http:
      method: POST

采用docker方式部署

docker run -it --name blackbox_exporter -v /data/config/blackbox/:/config -p 9115:9115 -d prom/blackbox-exporter --config.file=/config/blackbox.yml

modules:
  http_2xx:
    prober: http
  http_post_2xx:
    prober: http
    http:
      method: POST
  http_4xx:
    prober: http
  tcp_connect:
    prober: tcp
  pop3s_banner:
    prober: tcp
    tcp:
      query_response:
      - expect: "^+OK"
      tls: true
      tls_config:
        insecure_skip_verify: false
  ssh_banner:
    prober: tcp
    tcp:
      query_response:
      - expect: "^SSH-2.0-"
  irc_banner:
    prober: tcp
    tcp:
      query_response:
      - send: "NICK prober"
      - send: "USER prober prober prober :prober"
      - expect: "PING :([^ ]+)"
        send: "PONG ${1}"
      - expect: "^:[^ ]+ 001"
  icmp:
    prober: icmp

容器启动成功后，就可以通过访问http://127.0.0.1:9115/probe?module=http_2xx&target=baidu.com对baidu.com进行探测。这里通过在URL中提供module参数指定了当前使用的探针，target参数指定探测目标，探针的探测结果通过Metrics的形式返回：

[root@VM-10-48-centos blackbox]# curl -s "http://119.91.110.226:9115/probe?module=http_2xx&target=www.baidu.com"
# HELP probe_dns_lookup_time_seconds Returns the time taken for probe dns lookup in seconds
# TYPE probe_dns_lookup_time_seconds gauge
probe_dns_lookup_time_seconds 0.001789553
# HELP probe_duration_seconds Returns how long the probe took to complete in seconds
# TYPE probe_duration_seconds gauge
probe_duration_seconds 0.020884147
# HELP probe_failed_due_to_regex Indicates if probe failed due to regex
# TYPE probe_failed_due_to_regex gauge
probe_failed_due_to_regex 0
# HELP probe_http_content_length Length of http content response
# TYPE probe_http_content_length gauge
probe_http_content_length -1
# HELP probe_http_duration_seconds Duration of http request by phase, summed over all redirects
# TYPE probe_http_duration_seconds gauge
probe_http_duration_seconds{phase="connect"} 0.002984163
probe_http_duration_seconds{phase="processing"} 0.005368911
probe_http_duration_seconds{phase="resolve"} 0.001789553
probe_http_duration_seconds{phase="tls"} 0
probe_http_duration_seconds{phase="transfer"} 0.010486688
# HELP probe_http_redirects The number of redirects
# TYPE probe_http_redirects gauge
probe_http_redirects 0
# HELP probe_http_ssl Indicates if SSL was used for the final redirect
# TYPE probe_http_ssl gauge
probe_http_ssl 0
# HELP probe_http_status_code Response HTTP status code
# TYPE probe_http_status_code gauge
probe_http_status_code 200
# HELP probe_http_uncompressed_body_length Length of uncompressed response body
# TYPE probe_http_uncompressed_body_length gauge
probe_http_uncompressed_body_length 299300
# HELP probe_http_version Returns the version of HTTP of the probe response
# TYPE probe_http_version gauge
probe_http_version 1.1
# HELP probe_ip_addr_hash Specifies the hash of IP address. It's useful to detect if the IP address changes.
# TYPE probe_ip_addr_hash gauge
probe_ip_addr_hash 2.882632397e+09
# HELP probe_ip_protocol Specifies whether probe ip protocol is IP4 or IP6
# TYPE probe_ip_protocol gauge
probe_ip_protocol 4
# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 1

Prometheus黑盒监测之blackbox_exporter

从返回的样本中，用户可以获取站点的DNS解析耗时、站点响应时间、HTTP响应状态码等等和站点访问质量相关的监控指标，从而帮助管理员主动的发现故障和问题。

接下来，只需要在Prometheus下配置对Blockbox Exporter实例的采集任务即可。最直观的配置方式：

- job_name: baidu_http2xx_probe
  params:
    module:
    - http_2xx
    target:
    - baidu.com
  metrics_path: /probe
  static_configs:
  - targets:
    - 127.0.0.1:9115
- job_name: prometheus_http2xx_probe
  params:
    module:
    - http_2xx
    target:
    - prometheus.io
  metrics_path: /probe
  static_configs:
  - targets:
    - 127.0.0.1:9115

这里分别配置了名为baidu_http2x_probe和prometheus_http2xx_probe的采集任务，并且通过params指定使用的探针（module）以及探测目标（target）。

那问题就来了，假如我们有N个目标站点且都需要M种探测方式，那么Prometheus中将包含N * M个采集任务，从配置管理的角度来说显然是不可接受的，这里我们也可以采用Relabling的方式对这些配置进行简化，配置方式如下：

scrape_configs:
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
        - http://prometheus.io    # Target to probe with http.
        - https://prometheus.io   # Target to probe with https.
        - http://example.com:8080 # Target to probe with http on port 8080.
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115

这里针对每一个探针服务（如http_2xx）定义一个采集任务，并且直接将任务的采集目标定义为我们需要探测的站点。在采集样本数据之前通过relabel_configs对采集任务进行动态设置。

第1步，根据Target实例的地址，写入__param_target标签中。__param_<name>形式的标签表示，在采集任务时会在请求目标地址中添加<name>参数，等同于params的设置；
第2步，获取__param_target的值，并覆写到instance标签中；
第3步，覆写Target实例的__address__标签值为BlockBox Exporter实例的访问地址。

通过以上3个relabel步骤，即可大大简化Prometheus任务配置的复杂度

HTTP探针是进行黑盒监控时最常用的探针之一，通过HTTP探针能够网站或者HTTP服务建立有效的监控，包括其本身的可用性，以及用户体验相关的如响应时间等等。除了能够在服务出现异常的时候及时报警，还能帮助系统管理员分析和优化网站体验。

在上一小节讲过，Blockbox Exporter中所有的探针均是以Module的信息进行配置。如下所示，配置了一个最简单的HTTP探针：

modules:
  http_2xx_example:
    prober: http
    http:

通过prober配置项指定探针类型。配置项http用于自定义探针的探测方式，这里有没对http配置项添加任何配置，表示完全使用HTTP探针的默认配置，该探针将使用HTTP GET的方式对目标服务进行探测，并且验证返回状态码是否为2XX，是则表示验证成功，否则失败。

HTTP服务通常会以不同的形式对外展现，有些可能就是一些简单的网页，而有些则可能是一些基于REST的API服务。对于不同类型的HTTP的探测需要管理员能够对HTTP探针的行为进行更多的自定义设置，包括：HTTP请求方法、HTTP头信息、请求参数等。对于某些启用了安全认证的服务还需要能够对HTTP探测设置相应的Auth支持。对于HTTPS类型的服务还需要能够对证书进行自定义设置。

如下所示，这里通过method定义了探测时使用的请求方法，对于一些需要请求参数的服务，还可以通过headers定义相关的请求头信息，使用body定义请求内容：

http_post_2xx:
    prober: http
    timeout: 5s
    http:
      method: POST
      headers:
        Content-Type: application/json
      body: '{}'

如果HTTP服务启用了安全认证，Blockbox Exporter内置了对basic_auth的支持，可以直接设置相关的认证信息即可：

http_basic_auth_example:
    prober: http
    timeout: 5s
    http:
      method: POST
      headers:
        Host: "login.example.com"
      basic_auth:
        username: "username"
        password: "mysecret"

对于使用了Bear Token的服务也可以通过bearer_token配置项直接指定令牌字符串，或者通过bearer_token_file指定令牌文件。

对于一些启用了HTTPS的服务，但是需要自定义证书的服务，可以通过tls_config指定相关的证书信息：

 http_custom_ca_example:
    prober: http
    http:
      method: GET
      tls_config:
        ca_file: "/certs/my_cert.crt"

在默认情况下HTTP探针只会对HTTP返回状态码进行校验，如果状态码为2XX（200 <= StatusCode < 300）则表示探测成功，并且探针返回的指标probe_success值为1。

如果用户需要指定HTTP返回状态码，或者对HTTP版本有特殊要求，如下所示，可以使用valid_http_versions和valid_status_codes进行定义：

  http_2xx_example:
    prober: http
    timeout: 5s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2"]
      valid_status_codes: []

默认情况下，Blockbox返回的样本数据中也会包含指标probe_http_ssl，用于表明当前探针是否使用了SSL：

# HELP probe_http_ssl Indicates if SSL was used for the final redirect
# TYPE probe_http_ssl gauge
probe_http_ssl 0

而如果用户对于HTTP服务是否启用SSL有强制的标准。则可以使用fail_if_ssl和fail_if_not_ssl进行配置。fail_if_ssl为true时，表示如果站点启用了SSL则探针失败，反之成功。fail_if_not_ssl刚好相反。

fail_if_ssl	Ssl	No ssl
True	probe_http_ssl = 0	probe_http_ssl = 1
False	probe_http_ssl = 1	probe_http_ssl = 0

fail_if_not_ssl	Ssl	No ssl
True	probe_http_ssl = 1	probe_http_ssl = 0
False	probe_http_ssl = 0	probe_http_ssl = 1

  http_2xx_example:
    prober: http
    timeout: 5s
    http:
      valid_status_codes: []
      method: GET
      no_follow_redirects: false
      fail_if_ssl: false
      fail_if_not_ssl: false

除了基于HTTP状态码，HTTP协议版本以及是否启用SSL作为控制探针探测行为成功与否的标准以外，还可以匹配HTTP服务的响应内容。使用fail_if_matches_regexp和fail_if_not_matches_regexp用户可以定义一组正则表达式，用于验证HTTP返回内容是否符合或者不符合正则表达式的内容。

  http_2xx_example:
    prober: http
    timeout: 5s
    http:
      method: GET
      fail_if_matches_regexp:
        - "Could not connect to database"
      fail_if_not_matches_regexp:
        - "Download the latest version here"

最后需要提醒的时，默认情况下HTTP探针会走IPV6的协议。在大多数情况下，可以使用preferred_ip_protocol=ip4强制通过IPV4的方式进行探测。在Bloackbox响应的监控样本中，也会通过指标probe_ip_protocol，表明当前的协议使用情况：

# HELP probe_ip_protocol Specifies whether probe ip protocol is IP4 or IP6
# TYPE probe_ip_protocol gauge
probe_ip_protocol 6

Prometheus黑盒监测之blackbox_exporter

此模板为9965号模板，数据源选择Prometheus 模板下载地址 https://grafana.com/grafana/dashboards/9965

此模板需要安装饼状图插件下载地址 https://grafana.com/grafana/plugins/grafana-piechart-panel

grafana-cli plugins install grafana-piechart-panel

Prometheus黑盒监测之blackbox_exporter

采用helm方式部署

[Kubernetes] helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
[Kubernetes] helm repo update
[Kubernetes] helm pull prometheus-community/prometheus-blackbox-exporter
[Kubernetes] tar zxf prometheus-blackbox-exporter-4.14.0.tgz
[Kubernetes] cd prometheus-blackbox-exporter
[prometheus-blackbox-exporter] ls
Chart.yaml  README.md   ci          templates   values.yaml

# cat values.yaml
restartPolicy: Always

kind: Deployment

podDisruptionBudget: {}
  # maxUnavailable: 0

## Additional blackbox-exporter container environment variables
## For instance to add a http_proxy
##
## extraEnv:
##   HTTP_PROXY: "http://superproxy.com:3128"
##   NO_PROXY: "localhost,127.0.0.1"
extraEnv: {}

extraVolumes: []
  # - name: secret-blackbox-oauth-htpasswd
  #   secret:
  #     defaultMode: 420
  #     secretName: blackbox-oauth-htpasswd
  # - name: storage-volume
  #   persistentVolumeClaim:
  #     claimName: example

## Additional volumes that will be attached to the blackbox-exporter container
extraVolumeMounts:
  # - name: ca-certs
  #   mountPath: /etc/ssl/certs/ca-certificates.crt

extraContainers: []
  # - name: oAuth2-proxy
  #   args:
  #     - -https-address=:9116
  #     - -upstream=http://localhost:9115
  #     - -skip-auth-regex=^/metrics
  #     - -openshift-delegate-urls={"/":{"group":"monitoring.coreos.com","resource":"prometheuses","verb":"get"}}
  #   image: openshift/oauth-proxy:v1.1.0
  #   ports:
  #       - containerPort: 9116
  #         name: proxy
  #   resources:
  #       limits:
  #         memory: 16Mi
  #       requests:
  #         memory: 4Mi
  #         cpu: 20m
  #   volumeMounts:
  #     - mountPath: /etc/prometheus/secrets/blackbox-tls
  #       name: secret-blackbox-tls

## Enable pod security policy
pspEnabled: true

hostNetwork: false

strategy:
  rollingUpdate:
    maxSurge: 1
    maxUnavailable: 0
  type: RollingUpdate

image:
  repository: prom/blackbox-exporter
  tag: v0.19.0
  pullPolicy: IfNotPresent

  ## Optionally specify an array of imagePullSecrets.
  ## Secrets must be manually created in the namespace.
  ## ref: https://kubernetes.io/docs/tasks/configure-pod-container/pull-image-private-registry/
  ##
  # pullSecrets:
  #   - myRegistrKeySecretName

## User to run blackbox-exporter container as
runAsUser: 1000
readOnlyRootFilesystem: true
runAsNonRoot: true

livenessProbe:
  httpGet:
    path: /health
    port: http

readinessProbe:
  httpGet:
    path: /health
    port: http

nodeSelector: {}
tolerations: []
affinity: {}

# if the configuration is managed as secret outside the chart, using SealedSecret for example,
# provide the name of the secret here. If secretConfig is set to true, configExistingSecretName will be ignored
# in favor of the config value.
configExistingSecretName: ""
# Store the configuration as a `Secret` instead of a `ConfigMap`, useful in case it contains sensitive data
secretConfig: false
config:
  modules:
    http_2xx:
      prober: http
      timeout: 5s
      http:
        valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
        no_follow_redirects: false
        preferred_ip_protocol: "ip4"

    http_4xx:
      prober: http
      timeout: 5s
      http:
        valid_status_codes: [200, 201, 202, 300, 301, 302, 303, 400, 401, 402, 403, 404]
        valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
        no_follow_redirects: false
        preferred_ip_protocol: "ip4"

    tcp_connect:
      prober: tcp
      timeout: 5s
      tcp:
        preferred_ip_protocol: "ip4"

# Set custom config path, other than default /config/blackbox.yaml. If let empty, path will be "/config/blackbox.yaml"
# configPath: "/foo/bar"

extraConfigmapMounts: []
  # - name: certs-configmap
  #   mountPath: /etc/secrets/ssl/
  #   subPath: certificates.crt # (optional)
  #   configMap: certs-configmap
  #   readOnly: true
  #   defaultMode: 420

## Additional secret mounts
# Defines additional mounts with secrets. Secrets must be manually created in the namespace.
extraSecretMounts: []
  # - name: secret-files
  #   mountPath: /etc/secrets
  #   secretName: blackbox-secret-files
  #   readOnly: true
  #   defaultMode: 420

allowIcmp: false

resources: {}
  # limits:
  #   memory: 300Mi
  # requests:
  #   memory: 50Mi

priorityClassName: ""

service:
  annotations: {}
  labels: {}
  type: ClusterIP
  port: 9115

# Only changes container port. Application port can be changed with extraArgs (--web.listen-address=:9115)
# https://github.com/prometheus/blackbox_exporter/blob/998037b5b40c1de5fee348ffdea8820509d85171/main.go#L55
containerPort: 9115

serviceAccount:
  # Specifies whether a ServiceAccount should be created
  create: true
  # The name of the ServiceAccount to use.
  # If not set and create is true, a name is generated using the fullname template
  name:
  annotations: {}

## An Ingress resource can provide name-based virtual hosting and TLS
## termination among other things for CouchDB deployments which are accessed
## from outside the Kubernetes cluster.
## ref: https://kubernetes.io/docs/concepts/services-networking/ingress/
ingress:
  enabled: false
  hosts: []
     # - chart-example.local
  path: '/'
  annotations: {}
    # kubernetes.io/ingress.class: nginx
    # kubernetes.io/tls-acme: "true"
  tls: []
    # Secrets must be manually created in the namespace.
    # - secretName: chart-example-tls
    #   hosts:
    #     - chart-example.local

podAnnotations: {}

# Hostaliases allow to add additional DNS entries to be injected directly into pods.
# This will take precedence over your implemented DNS solution
hostAliases: []
#  - ip: 192.168.1.1
#    hostNames:
#      - test.example.com
#      - another.example.net

pod:
  labels: {}

extraArgs: []
#  --history.limit=1000

replicas: 1

serviceMonitor:
  ## If true, a ServiceMonitor CRD is created for a prometheus operator
  ## https://github.com/coreos/prometheus-operator
  ##
  enabled: true

  # Default values that will be used for all ServiceMonitors created by `targets`
  defaults:
    additionalMetricsRelabels:
      pod: prometheus-blackbox-exporter
    labels: {}
    interval: 20s
    scrapeTimeout: 5s
    module: http_2xx
  ## scheme: HTTP scheme to use for scraping. Can be used with `tlsConfig` for example if using istio mTLS.
  scheme: http
  ## tlsConfig: TLS configuration to use when scraping the endpoint. For example if using istio mTLS.
  ## Of type: https://github.com/coreos/prometheus-operator/blob/master/Documentation/api.md#tlsconfig
  tlsConfig: {}
  bearerTokenFile:

  targets:
   - name: boke
     url: https://www.srelife.cn
     labels: {}
     interval: 60s
     scrapeTimeout: 60s
     module: http_2xx
     additionalMetricsRelabels: {}

   - name: baidu
     url: https://www.baidu.com
     labels: {}
     interval: 60s
     scrapeTimeout: 60s
     module: http_2xx
     additionalMetricsRelabels: {}

   - name: tencent
     url: https://www.tencent.com
     labels: {}
     interval: 60s
     scrapeTimeout: 60s
     module: http_2xx
     additionalMetricsRelabels: {}

   - name: icmp-114
     url: 114.114.114.114
     labels: {}
     interval: 60s
     scrapeTimeout: 60s
     module: tcp_connect
     additionalMetricsRelabels: {}
## Custom PrometheusRules to be defined
## ref: https://github.com/coreos/prometheus-operator#customresourcedefinitions
prometheusRule:
  enabled: false
  additionalLabels: {}
  namespace: ""
  rules: []

## Network policy for chart
networkPolicy:
  # Enable network policy and allow access from anywhere
  enabled: false
  # Limit access only from monitoring namespace
  # Before setting this value to true, you must add the name=monitoring label to the monitoring namespace
  # Network Policy uses label filtering
  allowMonitoringNamespace: false

## dnsPolicy and dnsConfig for Deployments and Daemonsets if you want non-default settings.
## These will be passed directly to the PodSpec of same.
dnsPolicy:
dnsConfig:

[prometheus-blackbox-exporter] helm install -n monitoring blackbox-exporter-1 .
NAME: blackbox-exporter-1
LAST DEPLOYED: Thu Jun 10 11:51:44 2021
NAMESPACE: monitoring
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
See https://github.com/prometheus/blackbox_exporter/ for how to configure Prometheus and the Blackbox Exporter.
[prometheus-blackbox-exporter] kubectl get all -n monitoring
NAME                                                                  READY   STATUS    RESTARTS   AGE
pod/blackbox-exporter-1-prometheus-blackbox-exporter-84f894b89wpdtc   1/1     Running   0          7s

NAME                                                       TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
service/blackbox-exporter-1-prometheus-blackbox-exporter   ClusterIP   10.255.254.215   <none>        9115/TCP   9s

NAME                                                               READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/blackbox-exporter-1-prometheus-blackbox-exporter   1/1     1            1           8s

NAME                                                                          DESIRED   CURRENT   READY   AGE
replicaset.apps/blackbox-exporter-1-prometheus-blackbox-exporter-84f894b894   1         1         1       8s

Prometheus黑盒监测之blackbox_exporter