- alert: NodeMemorySpaceFillingUp
  expr: ((1 - (node_memory_MemAvailable_bytes{job="node-exporter"} / node_memory_MemTotal_bytes{job="node-exporter"}) * on(instance) group_left(nodename) (node_uname_info) > 0.8) * 100)
  for: 5m
  labels:
    cluster: critical
    type: node
  annotations:
    description: 'Memory usage on `{{$labels.nodename}}`({{ $labels.instance }}) up to {{ printf "%.2f" $value }}%.'
    summary: Node memory will be exhausted.
2. NodeCpuUtilisationHigh
Monitors node CPU usage; alerts when it exceeds 80%.
- alert: NodeFilesystemAlmostOutOfSpace
  expr: ((node_filesystem_avail_bytes{fstype!="",job="node-exporter"} / node_filesystem_size_bytes{fstype!="",job="node-exporter"} * 100 < 20 and node_filesystem_readonly{fstype!="",job="node-exporter"} == 0) * on(instance) group_left(nodename) (node_uname_info))
  for: 5m
  labels:
    cluster: critical
    type: node
  annotations:
    description: 'Filesystem on `{{ $labels.device }}` at `{{$labels.nodename}}`({{ $labels.instance }}) has only {{ printf "%.2f" $value }}% available space left.'
    summary: Node filesystem has less than 20% space left.
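For the CPU check described under this heading (alert above 80%), a minimal sketch might look like the following. It is not taken from the original: it assumes node-exporter's standard `node_cpu_seconds_total` counter, a 5-minute rate window, and the same labels and `node_uname_info` join used by the other node rules.

```yaml
# Hypothetical sketch: per-node CPU utilisation above 80%.
# Utilisation is computed as 100 * (1 - average idle fraction across all cores).
- alert: NodeCpuUtilisationHigh
  expr: ((1 - avg by(instance) (rate(node_cpu_seconds_total{job="node-exporter",mode="idle"}[5m]))) * 100 > 80) * on(instance) group_left(nodename) (node_uname_info)
  for: 5m
  labels:
    cluster: critical
    type: node
  annotations:
    description: 'CPU utilisation on `{{$labels.nodename}}`({{ $labels.instance }}) up to {{ printf "%.2f" $value }}%.'
    summary: Node CPU utilisation high.
```

Because `node_uname_info` always has the value 1, multiplying by it leaves the percentage unchanged and only attaches the `nodename` label.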
3. NodeFilesystemAlmostOutOfSpace
Monitors node disk usage; alerts when less than 10% of the space remains.
- alert: NodeFilesystemAlmostOutOfSpace
  expr: ((node_filesystem_avail_bytes{fstype!="",job="node-exporter"} / node_filesystem_size_bytes{fstype!="",job="node-exporter"} * 100 < 10 and node_filesystem_readonly{fstype!="",job="node-exporter"} == 0) * on(instance) group_left(nodename) (node_uname_info))
  for: 5m
  labels:
    cluster: critical
    type: node
  annotations:
    description: 'Filesystem on `{{ $labels.device }}` at `{{$labels.nodename}}`({{ $labels.instance }}) has only {{ printf "%.2f" $value }}% available space left.'
    summary: Node filesystem has less than 10% space left.
4. NodeFilesystemAlmostOutOfFiles
Monitors node inode usage; alerts when fewer than 10% of the inodes remain free.
- alert: NodeFilesystemAlmostOutOfFiles
  expr: ((node_filesystem_files_free{fstype!="",job="node-exporter"} / node_filesystem_files{fstype!="",job="node-exporter"} * 100 < 10 and node_filesystem_readonly{fstype!="",job="node-exporter"} == 0) * on(instance) group_left(nodename) (node_uname_info))
  for: 5m
  labels:
    cluster: critical
    type: node
  annotations:
    description: 'Filesystem on `{{ $labels.device }}` at `{{$labels.nodename}}`({{ $labels.instance }}) has only {{ printf "%.2f" $value }}% available inodes left.'
    summary: Node filesystem has less than 10% inodes left.
5. KubeNodeNotReady
Monitors node status; alerts when any node is NotReady.
- alert: KubeNodeNotReady
  expr: (kube_node_status_condition{condition="Ready",job="kube-state-metrics",status="true"} == 0)
  for: 15m
  labels:
    cluster: critical
    type: node
  annotations:
    description: '{{ $labels.node }} has been unready for more than 15 minutes.'
    summary: Node is not ready.
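A rule like KubeNodeNotReady can be checked offline with `promtool test rules`. The sketch below is illustrative only: the file names `node-alerts.yml` and `node-alerts_test.yml` and the node name `node-1` are assumptions, and the rule is assumed to be wrapped in a rule group inside `node-alerts.yml`.

```yaml
# node-alerts_test.yml (hypothetical); run with: promtool test rules node-alerts_test.yml
rule_files:
  - node-alerts.yml          # assumed to contain the KubeNodeNotReady rule
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # The Ready condition reports 0 for 21 samples, covering 0m through 20m.
      - series: 'kube_node_status_condition{condition="Ready",job="kube-state-metrics",status="true",node="node-1"}'
        values: '0+0x20'
    alert_rule_test:
      # After 20 minutes of continuous NotReady the alert is expected to be firing.
      - eval_time: 20m
        alertname: KubeNodeNotReady
        exp_alerts:
          - exp_labels:
              cluster: critical
              type: node
              condition: Ready
              job: kube-state-metrics
              status: "true"
              node: node-1
            exp_annotations:
              description: node-1 has been unready for more than 15 minutes.
              summary: Node is not ready.
```

The expected labels list both the labels set by the rule (`cluster`, `type`) and those inherited from the input series, since promtool compares the full label set of the firing alert.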
1. NamespaceCpuUtilisationHigh
Monitors namespace CPU utilisation; alerts when it rises above 90%.
- alert: NamespaceCpuUtilisationHigh
  expr: (sum(node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate) by (namespace) / sum(namespace_cpu:kube_pod_container_resource_limits:sum) by (namespace) * 100 > 90)
  for: 5m
  labels:
    cluster: critical
    type: namespace
  annotations:
    description: 'CPU utilisation on `{{$labels.namespace}}` up to {{ printf "%.2f" $value }}%.'
    summary: Namespace CPU utilisation high.
2. NamespaceCpuUtilisationLow
Monitors namespace CPU utilisation; alerts when it falls below 10%.
- alert: NamespaceCpuUtilisationLow
  expr: (sum(node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate) by (namespace) / sum(namespace_cpu:kube_pod_container_resource_limits:sum) by (namespace) * 100 < 10)
  for: 5m
  labels:
    cluster: critical
    type: namespace
  annotations:
    description: 'CPU utilisation on `{{$labels.namespace}}` as low as {{ printf "%.2f" $value }}%.'
    summary: Namespace CPU underutilization.
3. NamespaceMemorySpaceFillingUp
Monitors namespace memory usage; alerts when it rises above 90%.
- alert: NamespaceMemorySpaceFillingUp
  expr: (sum(node_namespace_pod_container:container_memory_working_set_bytes) by (namespace) / sum(namespace_memory:kube_pod_container_resource_limits:sum) by (namespace) * 100 > 90)
  for: 5m
  labels:
    cluster: critical
    type: namespace
  annotations:
    description: 'Memory usage on `{{$labels.namespace}}` up to {{ printf "%.2f" $value }}%.'
    summary: Namespace memory will be exhausted.
4. NamespaceMemorySpaceLow
Monitors namespace memory usage; alerts when it falls below 10%.
- alert: NamespaceMemorySpaceLow
  expr: (sum(node_namespace_pod_container:container_memory_working_set_bytes) by (namespace) / sum(namespace_memory:kube_pod_container_resource_limits:sum) by (namespace) * 100 < 10)
  for: 5m
  labels:
    cluster: critical
    type: namespace
  annotations:
    description: 'Memory usage on `{{$labels.namespace}}` as low as {{ printf "%.2f" $value }}%.'
    summary: Under-utilized namespace memory.
5. KubePodNotReady
Monitors pod status; alerts when a pod stays not-ready for fifteen minutes.
- alert: KubePodNotReady
  expr: (sum by(namespace, pod) (max by(namespace, pod) (kube_pod_status_phase{job="kube-state-metrics",namespace=~".*",phase=~"Pending|Unknown"}) * on(namespace, pod) group_left(owner_kind) topk by(namespace, pod) (1, max by(namespace, pod, owner_kind) (kube_pod_owner{owner_kind!="Job"}))) > 0)
  for: 15m
  labels:
    cluster: critical
    type: namespace
  annotations:
    description: 'Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-ready state for longer than 15 minutes.'
    summary: Pod has been in a non-ready state for more than 15 minutes.
6. KubeContainerWaiting
Monitors pod status; alerts when a pod container stays in the waiting state for fifteen minutes.
- alert: KubeContainerWaiting
  expr: (sum by(namespace, pod, container) (kube_pod_container_status_waiting_reason{job="kube-state-metrics",namespace=~".*"}) > 0)
  for: 15m
  labels:
    cluster: critical
    type: namespace
  annotations:
    description: 'Pod {{ $labels.namespace }}/{{ $labels.pod }} container {{$labels.container}} has been in waiting state for longer than 15 minutes.'
    summary: Pod container waiting longer than 15 minutes.
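Prometheus loads alert rules from rule groups, so rules like the ones listed here need to sit under `groups` / `rules` in a rules file. A minimal sketch of that wrapper follows; the file name `node-alerts.yml` and the group names `node` and `namespace` are illustrative assumptions, and only one rule per group is shown.

```yaml
# node-alerts.yml (hypothetical file name)
groups:
  - name: node                 # illustrative group name
    rules:
      - alert: NodeFilesystemAlmostOutOfFiles
        expr: ((node_filesystem_files_free{fstype!="",job="node-exporter"} / node_filesystem_files{fstype!="",job="node-exporter"} * 100 < 10 and node_filesystem_readonly{fstype!="",job="node-exporter"} == 0) * on(instance) group_left(nodename) (node_uname_info))
        for: 5m
        labels:
          cluster: critical
          type: node
        annotations:
          description: 'Filesystem on `{{ $labels.device }}` at `{{$labels.nodename}}`({{ $labels.instance }}) has only {{ printf "%.2f" $value }}% available inodes left.'
          summary: Node filesystem has less than 10% inodes left.
      # The remaining node rules would follow here as further list items.
  - name: namespace            # illustrative group name
    rules:
      - alert: NamespaceCpuUtilisationLow
        expr: (sum(node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate) by (namespace) / sum(namespace_cpu:kube_pod_container_resource_limits:sum) by (namespace) * 100 < 10)
        for: 5m
        labels:
          cluster: critical
          type: namespace
        annotations:
          description: 'CPU utilisation on `{{$labels.namespace}}` as low as {{ printf "%.2f" $value }}%.'
          summary: Namespace CPU underutilization.
      # The remaining namespace rules would follow here as further list items.
```

A file in this shape can be validated with `promtool check rules node-alerts.yml` before it is listed under `rule_files` in the Prometheus configuration.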