Author: Ricardo Castro. Compiled by: 沈建苗
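Keeping systems running reliably is a key task for site reliability engineers, and it largely comes down to collecting metrics, creating alerts, and graphing data. One part of that work is critical: collecting system metrics from multiple locations and services and correlating them, so that you can understand how the system behaves and troubleshoot it.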
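Prometheus, a Cloud Native Computing Foundation (CNCF) project, has become one of the most popular open source solutions for application and system monitoring. A single instance can handle millions of time series, but as a system grows, Prometheus must be able to scale and absorb the additional load. Because vertical scaling eventually hits a limit, you need another solution.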
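This article walks step by step through converting a simple Prometheus setup into a Thanos deployment. With it, you can run reliable queries against multiple Prometheus instances from a single endpoint and handle a highly available Prometheus setup seamlessly.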
Implementing a Global View and High Availability
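Thanos provides a set of components that together form a highly available metrics system with virtually unlimited storage capacity. It can be added on top of an existing Prometheus deployment to provide a global query view, data backup, and access to historical data. These features can also be used independently of one another, so you can introduce Thanos capabilities only when you need them.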
Initial Cluster Setup
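You will deploy Prometheus in a Kubernetes cluster and simulate the desired scenario there. The kind tool is a good way to start a Kubernetes cluster locally. You will use the following configuration.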
# config.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
name: thanos-demo
nodes:
- role: control-plane
  image: kindest/node:v1.23.0@sha256:2f93d3c7b12a3e93e6c1f34f331415e105979961fcddbe69a4e3ab5a93ccbb35
- role: worker
  image: kindest/node:v1.23.0@sha256:2f93d3c7b12a3e93e6c1f34f331415e105979961fcddbe69a4e3ab5a93ccbb35
- role: worker
  image: kindest/node:v1.23.0@sha256:2f93d3c7b12a3e93e6c1f34f331415e105979961fcddbe69a4e3ab5a93ccbb35
With this configuration, you are ready to start the cluster.
~ kind create cluster --config config.yaml
Creating cluster "thanos-demo" ...
 ✓ Ensuring node image (kindest/node:v1.23.0)
 ✓ Preparing nodes
 ✓ Writing configuration
 ✓ Starting control-plane
 ✓ Installing CNI
 ✓ Installing StorageClass
 ✓ Joining worker nodes
Set kubectl context to "kind-thanos-demo"
You can now use your cluster with:
kubectl cluster-info --context kind-thanos-demo
Have a nice day!
Once the cluster is up and running, check the installation to make sure it is ready for Prometheus. You need kubectl to interact with the Kubernetes cluster.
~ kind get clusters
thanos-demo
~ kubectl get nodes
NAME                        STATUS   ROLES                  AGE    VERSION
thanos-demo-control-plane   Ready    control-plane,master   119s   v1.23.0
thanos-demo-worker          Ready    <none>                 88s    v1.23.0
thanos-demo-worker2         Ready    <none>                 88s    v1.23.0
~ kubectl get pods -o name -A
pod/coredns-64897985d-mz8bv
pod/coredns-64897985d-pxzkq
pod/etcd-thanos-demo-control-plane
pod/kindnet-27cdw
pod/kindnet-42kcv
pod/kindnet-5rlcj
pod/kube-apiserver-thanos-demo-control-plane
pod/kube-controller-manager-thanos-demo-control-plane
pod/kube-proxy-49mgg
pod/kube-proxy-nhvkm
pod/kube-proxy-z4fpn
pod/kube-scheduler-thanos-demo-control-plane
pod/local-path-provisioner-5bb5788f44-hj5c4
Initial Prometheus Setup
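Your goal is to deploy Thanos on top of an existing Prometheus installation and extend its capabilities. With that in mind, you start by bringing up three Prometheus servers. There are several reasons to run multiple Prometheus instances, such as sharding, high availability, or aggregating queries across multiple locations.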
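For this scenario, imagine the following setup: a cluster in the United States with one Prometheus server, and a cluster in Europe with two replicas of a Prometheus server that scrape the same targets.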
To deploy Prometheus, you will use the kube-prometheus-stack chart, which requires Helm. After installing Helm, add the kube-prometheus-stack repository.
~ helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
"prometheus-community" has been added to your repositories
~ helm repo update
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "prometheus-community" chart repository
Update Complete. ⎈Happy Helming!⎈
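Because you actually have only one Kubernetes cluster, you will simulate multiple regions by deploying Prometheus in different namespaces. Create one namespace named europe and another named united-states.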
~ kubectl create namespace europe
namespace/europe created
~ kubectl create namespace united-states
namespace/united-states created
With the regions in place, you are ready to deploy Prometheus.
# prometheus-europe.yaml
nameOverride: "eu"
namespaceOverride: "europe"
nodeExporter:
  enabled: false
grafana:
  enabled: false
alertmanager:
  enabled: false
kubeStateMetrics:
  enabled: false
prometheus:
  prometheusSpec:
    replicas: 2
    replicaExternalLabelName: "replica"
    prometheusExternalLabelName: "cluster"
# prometheus-united-states.yaml
nameOverride: "us"
namespaceOverride: "united-states"
nodeExporter:
  enabled: false
grafana:
  enabled: false
alertmanager:
  enabled: false
kubeStateMetrics:
  enabled: false
prometheus:
  prometheusSpec:
    replicaExternalLabelName: "replica"
    prometheusExternalLabelName: "cluster"
Using the configuration above, deploy a Prometheus instance in each region.
~ helm -n europe upgrade -i prometheus-europe prometheus-community/kube-prometheus-stack -f prometheus-europe.yaml
Release "prometheus-europe" does not exist. Installing it now.
NAME: prometheus-europe
LAST DEPLOYED: Sat Jan 22 18:26:22 2022
NAMESPACE: europe
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
kube-prometheus-stack has been installed. Check its status by running:
kubectl --namespace europe get pods -l "release=prometheus-europe"
~ helm -n united-states upgrade -i prometheus-united-states prometheus-community/kube-prometheus-stack -f prometheus-united-states.yaml
Release "prometheus-united-states" does not exist. Installing it now.
NAME: prometheus-united-states
LAST DEPLOYED: Sat Jan 22 18:26:48 2022
NAMESPACE: united-states
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
kube-prometheus-stack has been installed. Check its status by running:
kubectl --namespace united-states get pods -l "release=prometheus-united-states"
Visit https://github.com/prometheus-operator/kube-prometheus for instructions on how to create & configure Alertmanager and Prometheus instances using the Operator.
Now verify that your Prometheus instances are running as expected.
~ kubectl -n europe get pods -l app.kubernetes.io/name=prometheus
NAME                                        READY   STATUS    RESTARTS   AGE
prometheus-prometheus-europe-prometheus-0   2/2     Running   0          18s
prometheus-prometheus-europe-prometheus-1   2/2     Running   0          18s
~ kubectl -n united-states get pods -l app.kubernetes.io/name=prometheus
NAME                                               READY   STATUS    RESTARTS   AGE
prometheus-prometheus-united-states-prometheus-0   2/2     Running   0          39s
You can now query any metric on each individual instance, but you cannot yet run multi-cluster queries.
Deploying the Thanos Sidecar
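kube-prometheus-stack supports deploying Thanos as a sidecar, which means it runs alongside Prometheus itself. The Thanos sidecar exposes Prometheus through the StoreAPI, a generic gRPC API that lets Thanos components fetch metrics from a variety of systems.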
# prometheus-europe.yaml
nameOverride: "eu"
namespaceOverride: "europe"
nodeExporter:
  enabled: false
grafana:
  enabled: false
alertmanager:
  enabled: false
kubeStateMetrics:
  enabled: false
prometheus:
  prometheusSpec:
    replicas: 2
    replicaExternalLabelName: "replica"
    prometheusExternalLabelName: "cluster"
    thanos:
      baseImage: quay.io/thanos/thanos
      version: v0.24.0
# prometheus-united-states.yaml
nameOverride: "us"
namespaceOverride: "united-states"
nodeExporter:
  enabled: false
grafana:
  enabled: false
alertmanager:
  enabled: false
kubeStateMetrics:
  enabled: false
prometheus:
  prometheusSpec:
    replicaExternalLabelName: "replica"
    prometheusExternalLabelName: "cluster"
    thanos:
      baseImage: quay.io/thanos/thanos
      version: v0.24.0
With the updated configuration, you are ready to upgrade Prometheus.
~ helm -n europe upgrade -i prometheus-europe prometheus-community/kube-prometheus-stack -f 2/prometheus-europe.yaml
Release "prometheus-europe" has been upgraded. Happy Helming!
NAME: prometheus-europe
LAST DEPLOYED: Sat Jan 22 18:42:24 2022
NAMESPACE: europe
STATUS: deployed
REVISION: 2
TEST SUITE: None
NOTES:
kube-prometheus-stack has been installed. Check its status by running:
kubectl --namespace europe get pods -l "release=prometheus-europe"
~ helm -n united-states upgrade -i prometheus-united-states prometheus-community/kube-prometheus-stack -f 2/prometheus-united-states.yaml
Release "prometheus-united-states" has been upgraded. Happy Helming!
NAME: prometheus-united-states
LAST DEPLOYED: Sat Jan 22 18:43:06 2022
NAMESPACE: united-states
STATUS: deployed
REVISION: 2
TEST SUITE: None
NOTES:
kube-prometheus-stack has been installed. Check its status by running:
kubectl --namespace united-states get pods -l "release=prometheus-united-states"
Visit https://github.com/prometheus-operator/kube-prometheus for instructions on how to create & configure Alertmanager and Prometheus instances using the Operator.
Verify that each Prometheus pod now has one additional container running alongside it.
~ kubectl -n europe get pods -l app.kubernetes.io/name=prometheus
NAME                                        READY   STATUS    RESTARTS   AGE
prometheus-prometheus-europe-prometheus-0   3/3     Running   0          48s
prometheus-prometheus-europe-prometheus-1   3/3     Running   0          65s
~ kubectl -n united-states get pods -l app.kubernetes.io/name=prometheus
NAME                                               READY   STATUS    RESTARTS   AGE
prometheus-prometheus-united-states-prometheus-0   3/3     Running   0          44s
Deploying Thanos Querier for a Global View
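Querier implements the Prometheus HTTP v1 API so that data in a Thanos cluster can be queried with PromQL. It lets you fetch metrics from a single endpoint. It first gathers the data needed to evaluate a query from the underlying StoreAPIs, then evaluates the query, and finally returns the result.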
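You used kube-prometheus-stack to deploy the Thanos sidecar. Unfortunately, that chart does not support the other Thanos components. For those, you will use the Banzai Cloud Helm Charts repository. As before, start by adding the repository.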
~ helm repo add banzaicloud https://kubernetes-charts.banzaicloud.com
"banzaicloud" has been added to your repositories
~ helm repo update
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "prometheus-community" chart repository
...Successfully got an update from the "banzaicloud" chart repository
Update Complete. ⎈Happy Helming!⎈
To simulate a centralized monitoring solution, create a monitoring namespace.
~ kubectl create namespace monitoring
namespace/monitoring created
The following configuration sets up Thanos Querier and points it at the Prometheus instances.
# query.yaml
store: # https://thanos.io/tip/components/store/
  enabled: false
compact: # https://thanos.io/tip/components/compact.md/
  enabled: false
bucket: # https://thanos.io/v0.8/components/bucket/
  enabled: false
rule: # https://thanos.io/tip/components/rule/
  enabled: false
sidecar: # https://thanos.io/tip/components/sidecar/
  enabled: false
queryFrontend: # https://thanos.io/tip/components/query-frontend.md/
  enabled: false
query: # https://thanos.io/tip/components/query/
  enabled: true
  replicaLabels:
    - replica
  stores:
    - "dnssrv+_grpc._tcp.prometheus-operated.europe.svc.cluster.local"
    - "dnssrv+_grpc._tcp.prometheus-operated.united-states.svc.cluster.local"
With the configuration above, you are ready to deploy Querier.
~ helm -n monitoring upgrade -i thanos banzaicloud/thanos -f query.yaml
Release "thanos" does not exist. Installing it now.
NAME: thanos
LAST DEPLOYED: Sat Jan 22 18:48:03 2022
NAMESPACE: monitoring
STATUS: deployed
REVISION: 1
TEST SUITE: None
~ kubectl -n monitoring port-forward svc/thanos-query-http 10902:10902
Forwarding from 127.0.0.1:10902 -> 10902
Forwarding from [::1]:10902 -> 10902
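With the port-forward in place, you can connect to the Querier running in the cluster and confirm that multi-cluster queries work. When you deployed Prometheus, you set replicaExternalLabelName: "replica" and prometheusExternalLabelName: "cluster". Deduplication relies on these settings. With deduplication enabled, metrics from the europe cluster are deduplicated, because Thanos assumes that they come from the same high-availability group: the series carry identical labels except for the replica label.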
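The grouping idea behind deduplication can be illustrated with a small Python sketch. It is a simplification written for this article, not the algorithm Thanos actually uses: series are grouped by every label except the replica label, and for each timestamp the first replica that has a sample wins. The label values and sample data are made up for illustration.

```python
from collections import defaultdict


def deduplicate(series, replica_label="replica"):
    """Merge series whose labels differ only in the replica label.

    Each series is a (labels, samples) pair: labels is a dict and samples
    maps timestamp -> value. Simplification: for each timestamp, the first
    replica that reported a sample wins.
    """
    groups = defaultdict(dict)
    for labels, samples in series:
        # Group on every label except the replica label.
        key = tuple(sorted((k, v) for k, v in labels.items() if k != replica_label))
        merged = groups[key]
        for ts, value in samples.items():
            merged.setdefault(ts, value)
    return [(dict(key), dict(sorted(merged.items()))) for key, merged in groups.items()]


# Hypothetical data: two europe replicas with overlapping gaps, one US instance.
series = [
    ({"__name__": "up", "cluster": "europe", "replica": "replica-0"}, {10: 1.0, 20: 1.0}),
    ({"__name__": "up", "cluster": "europe", "replica": "replica-1"}, {20: 1.0, 30: 1.0}),
    ({"__name__": "up", "cluster": "united-states", "replica": "replica-0"}, {10: 1.0}),
]

for labels, samples in deduplicate(series):
    print(labels, samples)
```

The two europe series collapse into one that covers all three timestamps, while the united-states series is left alone because its cluster label differs.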
Deploying Thanos Query Frontend to Improve the Read Path
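The last piece is to deploy Query Frontend, a service that can be placed in front of Querier to improve the read path. It is based on the Cortex Query Frontend component and supports features such as splitting, retries, caching, and slow query logging.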
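As a rough illustration of the splitting feature, the sketch below cuts a long range query into shorter sub-ranges that could be executed and cached independently. It is not Query Frontend's implementation; the function name and the 24-hour interval are assumptions chosen for the example.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def split_range(start, end, interval=timedelta(hours=24)):
    """Split [start, end) into pieces that end on multiples of `interval`.

    Boundaries are aligned to whole multiples of `interval` since the Unix
    epoch, so the same sub-range is produced no matter where a query starts.
    """
    pieces = []
    cursor = start
    while cursor < end:
        # Next aligned boundary strictly after the cursor.
        next_boundary = EPOCH + ((cursor - EPOCH) // interval + 1) * interval
        piece_end = min(next_boundary, end)
        pieces.append((cursor, piece_end))
        cursor = piece_end
    return pieces


start = datetime(2022, 1, 21, 18, 0, tzinfo=timezone.utc)
end = datetime(2022, 1, 23, 6, 0, tzinfo=timezone.utc)
for piece_start, piece_end in split_range(start, end):
    print(piece_start.isoformat(), "->", piece_end.isoformat())
```

Aligning the boundaries matters for caching: two dashboards asking for overlapping ranges produce the same middle piece, so its result can be reused.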
# query.yaml
store:
  enabled: false
compact:
  enabled: false
bucket:
  enabled: false
rule:
  enabled: false
sidecar:
  enabled: false
queryFrontend:
  enabled: true
query:
  enabled: true
  replicaLabels:
    - replica
  stores:
    - "dnssrv+_grpc._tcp.prometheus-operated.europe.svc.cluster.local"
    - "dnssrv+_grpc._tcp.prometheus-operated.united-states.svc.cluster.local"
With the earlier configuration updated to enable Query Frontend, you can now upgrade the release.
~ helm -n monitoring upgrade -i thanos banzaicloud/thanos -f query.yaml
Release "thanos" has been upgraded. Happy Helming!
NAME: thanos
LAST DEPLOYED: Sat Jan 22 18:56:29 2022
NAMESPACE: monitoring
STATUS: deployed
REVISION: 2
TEST SUITE: None
~ kubectl -n monitoring port-forward svc/thanos-query-frontend-http 10902:10902
Forwarding from 127.0.0.1:10902 -> 10902
Forwarding from [::1]:10902 -> 10902
Using port-forward again, you can now access Query Frontend.
Query Frontend is the entry point for sending queries to multiple Prometheus instances. Services that run such queries, such as Grafana, should go through Query Frontend.
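For a client other than Grafana, a query against this entry point might look like the following Python sketch. It assumes that the port-forward shown earlier is still running on localhost:10902 and that the endpoint accepts the Prometheus-style /api/v1/query path together with a dedup parameter; the helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, promql, dedup=True):
    """Build an instant-query URL for a Prometheus-compatible HTTP API."""
    params = {"query": promql, "dedup": "true" if dedup else "false"}
    return f"{base_url.rstrip('/')}/api/v1/query?{urllib.parse.urlencode(params)}"


def instant_query(base_url, promql, dedup=True, timeout=10):
    """Run an instant query and return the list of result series."""
    url = build_query_url(base_url, promql, dedup)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)["data"]["result"]
```

Calling instant_query("http://localhost:10902", "up") while the port-forward is active should return series from both regions; passing dedup=False is a quick way to see the two europe replicas separately.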
Reference:
https://thenewstack.io/implement-global-view-and-high-availability-for-prometheus/