Implementing a Global View and High Availability for Prometheus

Author: Ricardo Castro  Chinese translation: 沈建苗

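Keeping systems running reliably is a key responsibility of site reliability engineers, and much of that work consists of collecting metrics, creating alerts, and graphing data. It is essential to collect system metrics from multiple locations and services and correlate them, both to understand how the system behaves and to support troubleshooting.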

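Prometheus, a Cloud Native Computing Foundation (CNCF) project, has become one of the most popular open source solutions for monitoring applications and systems. A single instance can handle millions of time series, but as systems grow, Prometheus must scale to handle the increased load. Because vertical scaling eventually hits a limit, you need another solution.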

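This article shows, step by step, how to turn a simple Prometheus setup into a Thanos deployment. The result lets you run reliable queries against multiple Prometheus instances from a single endpoint and seamlessly handle a highly available Prometheus setup.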

Implementing a Global View and High Availability

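Thanos provides a set of components that together form a highly available metrics system with virtually unlimited storage capacity. It can be added on top of existing Prometheus deployments to provide features such as a global query view, data backup, and access to historical data. These features can also be used independently of one another, so you can adopt each Thanos feature only when you need it.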

Initial Cluster Setup

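You will deploy Prometheus in a Kubernetes cluster and simulate the desired scenario there. The kind tool is a good way to start a Kubernetes cluster locally. You will use the following configuration.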

# config.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
name: thanos-demo
nodes:
  - role: control-plane
    image: kindest/node:v1.23.0@sha256:2f93d3c7b12a3e93e6c1f34f331415e105979961fcddbe69a4e3ab5a93ccbb35
  - role: worker
    image: kindest/node:v1.23.0@sha256:2f93d3c7b12a3e93e6c1f34f331415e105979961fcddbe69a4e3ab5a93ccbb35
  - role: worker
    image: kindest/node:v1.23.0@sha256:2f93d3c7b12a3e93e6c1f34f331415e105979961fcddbe69a4e3ab5a93ccbb35


With this configuration, you can create the cluster.

~ kind create cluster --config config.yaml

Creating cluster "thanos-demo" ...

 ✓ Ensuring node image (kindest/node:v1.23.0)

 ✓ Preparing nodes

 ✓ Writing configuration

 ✓ Starting control-plane

 ✓ Installing CNI

 ✓ Installing StorageClass

 ✓ Joining worker nodes 

Set kubectl context to "kind-thanos-demo"

You can now use your cluster with:

kubectl cluster-info --context kind-thanos-demo

Have a nice day!

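Once the cluster is up and running, check the installation to make sure you are ready to deploy Prometheus. You need kubectl to interact with the Kubernetes cluster.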

~ kind get clusters

thanos-demo

~ kubectl get nodes

NAME                        STATUS   ROLES                  AGE    VERSION

thanos-demo-control-plane   Ready    control-plane,master   119s   v1.23.0

thanos-demo-worker          Ready    <none>                 88s    v1.23.0

thanos-demo-worker2         Ready    <none>                 88s    v1.23.0

~ kubectl get pods -o name -A

pod/coredns-64897985d-mz8bv

pod/coredns-64897985d-pxzkq

pod/etcd-thanos-demo-control-plane

pod/kindnet-27cdw

pod/kindnet-42kcv

pod/kindnet-5rlcj

pod/kube-apiserver-thanos-demo-control-plane

pod/kube-controller-manager-thanos-demo-control-plane

pod/kube-proxy-49mgg

pod/kube-proxy-nhvkm

pod/kube-proxy-z4fpn

pod/kube-scheduler-thanos-demo-control-plane

pod/local-path-provisioner-5bb5788f44-hj5c4

Initial Prometheus Setup

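Your goal is to deploy Thanos on top of an existing Prometheus installation and extend its functionality. With that in mind, start by launching three Prometheus servers. There are several reasons to run multiple Prometheus instances, such as sharding, high availability, or aggregating queries across multiple locations.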

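For this scenario, imagine the following setup: you have one Prometheus server in your United States cluster, and two replicas of a Prometheus server in Europe that scrape the same targets.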

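To deploy Prometheus, you will use the kube-prometheus-stack Helm chart, so you also need Helm. Once Helm is installed, add the prometheus-community chart repository, which hosts kube-prometheus-stack.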

~ helm repo add prometheus-community https://prometheus-community.github.io/helm-charts

"prometheus-community" has been added to your repositories

~ helm repo update

Hang tight while we grab the latest from your chart repositories...

...Successfully got an update from the "prometheus-community" chart repository

Update Complete. ⎈Happy Helming!⎈

Since you actually have only one Kubernetes cluster, you will simulate multiple regions by deploying Prometheus into different namespaces: one namespace for europe and another for united-states.

~ kubectl create namespace europe

namespace/europe created

~ kubectl create namespace united-states

namespace/united-states created

With the regions in place, you can deploy Prometheus.

# prometheus-europe.yaml

nameOverride: "eu"

namespaceOverride: "europe"

nodeExporter:

  enabled: false

grafana:

  enabled: false

alertmanager:

  enabled: false

kubeStateMetrics:

  enabled: false

prometheus:

  prometheusSpec:

    replicas: 2

    replicaExternalLabelName: "replica"

    prometheusExternalLabelName: "cluster"



# prometheus-united-states.yaml

nameOverride: "us"

namespaceOverride: "united-states"

nodeExporter:

  enabled: false

grafana:

  enabled: false

alertmanager:

  enabled: false

kubeStateMetrics:

  enabled: false

prometheus:

  prometheusSpec:

    replicaExternalLabelName: "replica"

    prometheusExternalLabelName: "cluster"
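These two settings only choose label names. Assuming the Prometheus Operator's usual behavior, the cluster label is filled with <namespace>/<Prometheus resource name> and the replica label with the pod name. The following Python sketch models the resulting external labels under those assumptions. The helper external_labels is hypothetical, and the resource names (for example prometheus-europe-prometheus) are inferred from the pod names shown further down rather than read from the chart.

```python
Labels = dict[str, str]


def external_labels(namespace: str, prometheus_name: str, ordinal: int) -> Labels:
    """Hypothetical model of the external labels a Prometheus pod attaches.

    Assumes the operator sets:
      cluster = "<namespace>/<Prometheus resource name>"  (prometheusExternalLabelName)
      replica = "<pod name>"                              (replicaExternalLabelName)
    and that StatefulSet pods are named "prometheus-<resource name>-<ordinal>".
    """
    pod_name = f"prometheus-{prometheus_name}-{ordinal}"
    return {"cluster": f"{namespace}/{prometheus_name}", "replica": pod_name}


# Two replicas in europe, one instance in united-states (resource names assumed).
instances = [
    external_labels("europe", "prometheus-europe-prometheus", 0),
    external_labels("europe", "prometheus-europe-prometheus", 1),
    external_labels("united-states", "prometheus-united-states-prometheus", 0),
]

for labels in instances:
    print(labels)
```

The two europe entries differ only in replica, which is exactly what the Querier's deduplication relies on later.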

Using the configurations above, deploy a Prometheus instance in each region.

~ helm -n europe upgrade -i prometheus-europe prometheus-community/kube-prometheus-stack -f prometheus-europe.yaml

Release "prometheus-europe" does not exist. Installing it now.

NAME: prometheus-europe

LAST DEPLOYED: Sat Jan 22 18:26:22 2022

NAMESPACE: europe

STATUS: deployed

REVISION: 1

TEST SUITE: None

NOTES:

kube-prometheus-stack has been installed. Check its status by running:

  kubectl --namespace europe get pods -l "release=prometheus-europe"





~ helm -n united-states upgrade -i prometheus-united-states prometheus-community/kube-prometheus-stack -f prometheus-united-states.yaml

Release "prometheus-united-states" does not exist. Installing it now.

NAME: prometheus-united-states

LAST DEPLOYED: Sat Jan 22 18:26:48 2022

NAMESPACE: united-states

STATUS: deployed

REVISION: 1

TEST SUITE: None

NOTES:

kube-prometheus-stack has been installed. Check its status by running:

  kubectl --namespace united-states get pods -l "release=prometheus-united-states"


Visit https://github.com/prometheus-operator/kube-prometheus for instructions on how to create & configure Alertmanager and Prometheus instances using the Operator.

Now make sure your Prometheus instances are running as expected.

~ kubectl -n europe get pods -l app.kubernetes.io/name=prometheus

NAME                                        READY   STATUS    RESTARTS   AGE

prometheus-prometheus-europe-prometheus-0   2/2     Running   0          18s

prometheus-prometheus-europe-prometheus-1   2/2     Running   0          18s

~ kubectl -n united-states get pods -l app.kubernetes.io/name=prometheus

NAME                                               READY   STATUS    RESTARTS   AGE

prometheus-prometheus-united-states-prometheus-0   2/2     Running   0          39s

You can now query any metric on each individual instance, but you cannot yet run multi-cluster queries.

Deploying the Thanos Sidecar

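kube-prometheus-stack supports deploying Thanos as a sidecar, which means it runs alongside Prometheus itself in the same pod. The Thanos sidecar exposes Prometheus through the StoreAPI, a generic gRPC API that allows Thanos components to fetch metrics from a variety of systems.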

# prometheus-europe.yaml
nameOverride: "eu"
namespaceOverride: "europe"
nodeExporter:
  enabled: false
grafana:
  enabled: false
alertmanager:
  enabled: false
kubeStateMetrics:
  enabled: false
prometheus:
  prometheusSpec:
    replicas: 2
    replicaExternalLabelName: "replica"
    prometheusExternalLabelName: "cluster"
    thanos:
      baseImage: quay.io/thanos/thanos
      version: v0.24.0



# prometheus-united-states.yaml
nameOverride: "us"
namespaceOverride: "united-states"
nodeExporter:
  enabled: false
grafana:
  enabled: false
alertmanager:
  enabled: false
kubeStateMetrics:
  enabled: false
prometheus:
  prometheusSpec:
    replicaExternalLabelName: "replica"
    prometheusExternalLabelName: "cluster"
    thanos:
      baseImage: quay.io/thanos/thanos
      version: v0.24.0

With the updated configuration, upgrade the Prometheus releases.

~ helm -n europe upgrade -i prometheus-europe prometheus-community/kube-prometheus-stack -f prometheus-europe.yaml

Release "prometheus-europe" has been upgraded. Happy Helming!

NAME: prometheus-europe

LAST DEPLOYED: Sat Jan 22 18:42:24 2022

NAMESPACE: europe

STATUS: deployed

REVISION: 2

TEST SUITE: None

NOTES:

kube-prometheus-stack has been installed. Check its status by running:

  kubectl --namespace europe get pods -l "release=prometheus-europe"



~ helm -n united-states upgrade -i prometheus-united-states prometheus-community/kube-prometheus-stack -f prometheus-united-states.yaml

Release "prometheus-united-states" has been upgraded. Happy Helming!

NAME: prometheus-united-states

LAST DEPLOYED: Sat Jan 22 18:43:06 2022

NAMESPACE: united-states

STATUS: deployed

REVISION: 2

TEST SUITE: None

NOTES:

kube-prometheus-stack has been installed. Check its status by running:

  kubectl --namespace united-states get pods -l "release=prometheus-united-states"



Visit https://github.com/prometheus-operator/kube-prometheus for instructions on how to create & configure Alertmanager and Prometheus instances using the Operator.

Verify that each Prometheus pod now has an additional container running alongside it: the READY column shows 3/3 instead of 2/2.

~ kubectl -n europe get pods -l app.kubernetes.io/name=prometheus

NAME                                        READY   STATUS    RESTARTS   AGE

prometheus-prometheus-europe-prometheus-0   3/3     Running   0          48s

prometheus-prometheus-europe-prometheus-1   3/3     Running   0          65s

~ kubectl -n united-states get pods -l app.kubernetes.io/name=prometheus

NAME                                               READY   STATUS    RESTARTS   AGE

prometheus-prometheus-united-states-prometheus-0   3/3     Running   0          44s
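If you want to script this check, for example in a CI job, you could parse the READY column. This is a minimal sketch that assumes the default kubectl get pods table layout shown above; all_pods_ready is a hypothetical helper, not part of kubectl.

```python
def all_pods_ready(kubectl_output: str, expected_containers: int) -> bool:
    """Return True if every pod row reports READY as N/N with N == expected_containers."""
    # Skip blank lines and split each row on whitespace.
    rows = [line.split() for line in kubectl_output.splitlines() if line.strip()]
    if len(rows) < 2:
        return False  # header only, or no output at all
    header, pods = rows[0], rows[1:]
    ready_col = header.index("READY")
    for pod in pods:
        ready, total = (int(n) for n in pod[ready_col].split("/"))
        if ready != total or total != expected_containers:
            return False
    return True


europe_output = """
NAME                                        READY   STATUS    RESTARTS   AGE
prometheus-prometheus-europe-prometheus-0   3/3     Running   0          48s
prometheus-prometheus-europe-prometheus-1   3/3     Running   0          65s
"""

# Prometheus, config-reloader, and the new Thanos sidecar: 3 containers per pod.
print(all_pods_ready(europe_output, expected_containers=3))
```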

Deploying Thanos Querier for a Global View

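The Querier implements the Prometheus HTTP v1 API, so you can query data in a Thanos cluster with PromQL and fetch metrics from a single endpoint. To answer a query, it first gathers the data needed for evaluation from the underlying StoreAPIs, then evaluates the query, and finally returns the result.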
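Because the Querier speaks the Prometheus HTTP v1 API, any Prometheus client can talk to it. The Python sketch below builds an instant-query URL and parses a response body. It assumes the Querier is reachable at http://localhost:10902 (the port-forward used later in this article); query is the standard Prometheus parameter and dedup is the Thanos Querier's parameter for toggling deduplication. The functions build_query_url and parse_vector and the sample response are hypothetical.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url: str, promql: str, dedup: bool = True) -> str:
    """Build an instant-query URL for the Prometheus HTTP v1 API."""
    params = {"query": promql, "dedup": "true" if dedup else "false"}
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode(params)}"


def parse_vector(body: str) -> list[tuple[dict[str, str], float]]:
    """Extract (labels, value) pairs from an instant-vector response body."""
    payload = json.loads(body)
    if payload["status"] != "success":
        raise RuntimeError(payload.get("error", "query failed"))
    return [(r["metric"], float(r["value"][1])) for r in payload["data"]["result"]]


url = build_query_url("http://localhost:10902", 'up{job="prometheus"}')
print(url)

# Hypothetical body, shaped like a Prometheus instant-vector response.
sample = json.dumps({
    "status": "success",
    "data": {"resultType": "vector", "result": [
        {"metric": {"__name__": "up", "job": "prometheus"}, "value": [1642874400.0, "1"]},
    ]},
})
print(parse_vector(sample))
```

With a port-forward running, urllib.request.urlopen(url).read().decode() would return a real body to pass to parse_vector.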

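You used kube-prometheus-stack to deploy the Thanos sidecar. Unfortunately, that chart does not support the other Thanos components, so you will use the Banzai Cloud Helm Charts repository for them. As before, start by adding the repository.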

~ helm repo add banzaicloud https://kubernetes-charts.banzaicloud.com

"banzaicloud" has been added to your repositories

~ helm repo update

Hang tight while we grab the latest from your chart repositories...

...Successfully got an update from the "prometheus-community" chart repository

...Successfully got an update from the "banzaicloud" chart repository

Update Complete. ⎈Happy Helming!⎈

To simulate a centralized monitoring solution, create a monitoring namespace.

~ kubectl create namespace monitoring

namespace/monitoring created

The following configuration enables the Thanos Querier and points it at the Prometheus instances.

# query.yaml
store: # https://thanos.io/tip/components/store/
  enabled: false
compact: # https://thanos.io/tip/components/compact.md/
  enabled: false
bucket: # https://thanos.io/v0.8/components/bucket/
  enabled: false
rule: # https://thanos.io/tip/components/rule/
  enabled: false
sidecar: # https://thanos.io/tip/components/sidecar/
  enabled: false
queryFrontend: # https://thanos.io/tip/components/query-frontend.md/
  enabled: false
query: # https://thanos.io/tip/components/query/
  enabled: true
  replicaLabels:
    - replica
  stores:
    - "dnssrv+_grpc._tcp.prometheus-operated.europe.svc.cluster.local"
    - "dnssrv+_grpc._tcp.prometheus-operated.united-states.svc.cluster.local"
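The dnssrv+ prefix tells Thanos to resolve the rest of the address with a DNS SRV lookup and use the host and port of every returned record as a store, so no port is written in the address. The name itself follows the Kubernetes SRV convention for named service ports, _<port-name>._<protocol>.<service>.<namespace>.svc.<cluster domain>; here it is assumed to target the grpc port (the sidecar's StoreAPI) on the prometheus-operated service. The Python sketch below only decomposes such an address and performs no DNS lookup; describe_store is a hypothetical helper.

```python
def describe_store(address: str) -> dict[str, str]:
    """Split a 'dnssrv+' store address into its SRV name components.

    Assumes the Kubernetes naming scheme for named service ports:
    _<port-name>._<protocol>.<service>.<namespace>.<suffix>
    """
    scheme, sep, name = address.partition("+")
    if scheme != "dnssrv" or not sep:
        raise ValueError(f"not a dnssrv+ address: {address!r}")
    port, proto, service, namespace, *suffix = name.split(".")
    return {
        "port_name": port.lstrip("_"),
        "protocol": proto.lstrip("_"),
        "service": service,
        "namespace": namespace,
        "suffix": ".".join(suffix),
    }


for addr in (
    "dnssrv+_grpc._tcp.prometheus-operated.europe.svc.cluster.local",
    "dnssrv+_grpc._tcp.prometheus-operated.united-states.svc.cluster.local",
):
    print(describe_store(addr))
```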

With this configuration, you can deploy the Querier.

~ helm -n monitoring upgrade -i thanos banzaicloud/thanos -f query.yaml

Release "thanos" does not exist. Installing it now.

NAME: thanos

LAST DEPLOYED: Sat Jan 22 18:48:03 2022

NAMESPACE: monitoring

STATUS: deployed

REVISION: 1

TEST SUITE: None



~ kubectl -n monitoring port-forward svc/thanos-query-http 10902:10902

Forwarding from 127.0.0.1:10902 -> 10902

Forwarding from [::1]:10902 -> 10902

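With the port-forward in place, you can reach the Querier at http://localhost:10902 and confirm that you can run multi-cluster queries. When you deployed Prometheus, you set replicaExternalLabelName: "replica" and prometheusExternalLabelName: "cluster". Deduplication relies on these settings, together with the Querier's replicaLabels: [replica] option. With deduplication enabled, metrics from the two europe replicas are deduplicated, because Thanos assumes they belong to the same high-availability group: their label sets are identical except for the replica label.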
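To make the grouping rule concrete, here is a simplified Python model of deduplication: series whose labels match once the replica label is dropped are merged, and for each timestamp the first replica that reported it wins. This is not Thanos's actual algorithm, which is more involved (by default it follows one replica and switches only when it finds gaps); deduplicate and the sample series are hypothetical.

```python
Sample = tuple[float, float]  # (timestamp, value)
Labels = dict[str, str]


def deduplicate(series: list[tuple[Labels, list[Sample]]],
                replica_label: str = "replica") -> list[tuple[Labels, list[Sample]]]:
    """Merge series that are identical except for replica_label."""
    merged: dict[frozenset, dict[float, float]] = {}
    for labels, samples in series:
        # Identity of a series = all labels except the replica label.
        key = frozenset((k, v) for k, v in labels.items() if k != replica_label)
        bucket = merged.setdefault(key, {})
        for ts, value in samples:
            bucket.setdefault(ts, value)  # first replica to report a timestamp wins
    return [(dict(key), sorted(bucket.items())) for key, bucket in merged.items()]


def up_series(cluster: str, replica: str) -> Labels:
    return {"__name__": "up", "cluster": cluster, "replica": replica}


EU = "europe/prometheus-europe-prometheus"
US = "united-states/prometheus-united-states-prometheus"
series = [
    (up_series(EU, "prometheus-prometheus-europe-prometheus-0"), [(0.0, 1.0), (30.0, 1.0)]),
    # Replica 1 has a sample (t=60) that replica 0 missed.
    (up_series(EU, "prometheus-prometheus-europe-prometheus-1"), [(30.0, 1.0), (60.0, 1.0)]),
    (up_series(US, "prometheus-prometheus-united-states-prometheus-0"), [(0.0, 1.0)]),
]

for lbls, samples in deduplicate(series):
    print(lbls["cluster"], samples)
```

Three input series collapse to two: one per cluster, with the europe series filled in from both replicas.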

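Deploying Thanos Query Frontend to Improve the Read Path

The final piece is the Query Frontend, a service that can be placed in front of the Querier to improve the read path. It is based on the Cortex Query Frontend component and supports features such as query splitting, retries, caching, and slow-query logging.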
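Splitting means the Query Frontend breaks a long range query into shorter sub-queries that can run in parallel and be cached separately. In Thanos this is controlled by the --query-range.split-interval flag, which defaults to 24h. The Python sketch below shows only the boundary arithmetic for interval-aligned splitting under that assumption; split_range is hypothetical and ignores step alignment, retries, and caching.

```python
from datetime import datetime, timedelta, timezone


def split_range(start: datetime, end: datetime,
                interval: timedelta = timedelta(hours=24)) -> list[tuple[datetime, datetime]]:
    """Split [start, end) into sub-ranges that end on multiples of interval.

    Boundaries are counted from the Unix epoch, so with the 24h default they
    fall on midnight UTC. start and end must be timezone-aware.
    """
    step = interval.total_seconds()
    ranges = []
    cursor = start
    while cursor < end:
        # Next interval boundary strictly after the cursor.
        next_boundary = (cursor.timestamp() // step + 1) * step
        sub_end = min(datetime.fromtimestamp(next_boundary, tz=timezone.utc), end)
        ranges.append((cursor, sub_end))
        cursor = sub_end
    return ranges


start = datetime(2022, 1, 21, 18, 0, tzinfo=timezone.utc)
end = datetime(2022, 1, 23, 6, 0, tzinfo=timezone.utc)
for sub_start, sub_stop in split_range(start, end):
    print(sub_start.isoformat(), "->", sub_stop.isoformat())
```

A 36-hour query therefore becomes three sub-queries: a partial first day, one full day, and a partial last day.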

# query.yaml
store:
  enabled: false
compact:
  enabled: false
bucket:
  enabled: false
rule:
  enabled: false
sidecar:
  enabled: false
queryFrontend:
  enabled: true
query:
  enabled: true
  replicaLabels:
    - replica
  stores:
    - "dnssrv+_grpc._tcp.prometheus-operated.europe.svc.cluster.local"
    - "dnssrv+_grpc._tcp.prometheus-operated.united-states.svc.cluster.local"

With the configuration updated to enable the Query Frontend, upgrade the release.

~ helm -n monitoring upgrade -i thanos banzaicloud/thanos -f query.yaml

Release "thanos" has been upgraded. Happy Helming!

NAME: thanos

LAST DEPLOYED: Sat Jan 22 18:56:29 2022

NAMESPACE: monitoring

STATUS: deployed

REVISION: 2

TEST SUITE: None



~ kubectl -n monitoring port-forward svc/thanos-query-frontend-http 10902:10902

Forwarding from 127.0.0.1:10902 -> 10902

Forwarding from [::1]:10902 -> 10902

Using port-forward again, you can now access the Query Frontend.

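The Query Frontend is now the entry point for queries that span multiple Prometheus instances. Services that issue such queries, such as Grafana, should send them through the Query Frontend.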

Reference:

https://thenewstack.io/implement-global-view-and-high-availability-for-prometheus/
