SNMP Poller¶
Overview¶
The SNMP Poller microservice can perform SNMP device discovery requested through the discovery-service's discovery RESTful API, or execute periodic SNMP polling once devices have been provisioned in Unified Assurance through the discovery-service's inventory RESTful API.
The microservice uses a Controller-Worker architecture composed of two components: a single coordinator and multiple worker instances. The coordinator manages the workers, publishes metrics, and calculates and coordinates SNMP workloads between them, whereas the workers are only responsible for executing those workloads and publishing their results to the appropriate microservice pipelines through the Apache Pulsar bus.
The microservice is expected to run in a separate microservices cluster for each Device Zone alongside other mandatory microservices. See Part 3 of the Prerequisites section below for more details.
Prerequisites¶
- A microservices cluster must be set up. Refer to Microservice Cluster Setup.
- Apache Pulsar must be installed. Refer to Apache Pulsar microservice.
- The following core microservices must be installed as per the requirement
Setup¶
su - assure1
export NAMESPACE=a1-zone1-pri
export WEBFQDN=<Primary Presentation Web FQDN>
a1helm install snmp-poller assure1/snmp-poller -n $NAMESPACE --set global.imageRegistry=$WEBFQDN
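After installation, you can confirm that the coordinator and worker pods are running with kubectl (the pod name filter below is an assumption based on the release name used above):

```shell
kubectl get pods -n $NAMESPACE | grep snmp-poller
```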
Default Global Configuration¶
Name | Value | Possible Values | Notes |
---|---|---|---|
LOG_LEVEL | INFO | FATAL, ERROR, WARN, INFO, DEBUG | Global logging level between coordinator and workers. |
PORT_COORDINATOR | 38890 | Integer | Internal port used by the coordinator service. |
PORT_WORKER | 38891 | Integer | Internal port used by the worker service. |
GRPC_CLIENT_KEEPALIVE | false | Text (true/false) | Whether to use client-side keep-alive pings. See Keep-alive. |
GRPC_CLIENT_KEEPALIVE_TIME | 30s | Integer + Text ("ns", "us" (or "µs"), "ms", "s", "m", "h".) | Duration after no activity to ping server. |
GRPC_CLIENT_KEEPALIVE_TIMEOUT | 5s | Integer + Text ("ns", "us" (or "µs"), "ms", "s", "m", "h".) | Duration after no ping ack to consider server connection dead. |
GRPC_SERVER_KEEPALIVE | false | Text (true/false) | Whether to use server-side keep-alive pings. See Keep-alive. |
GRPC_SERVER_KEEPALIVE_TIME | 30s | Integer + Text ("ns", "us" (or "µs"), "ms", "s", "m", "h".) | Duration after no activity to ping client. |
GRPC_SERVER_KEEPALIVE_TIMEOUT | 5s | Integer + Text ("ns", "us" (or "µs"), "ms", "s", "m", "h".) | Duration after no ping ack to consider client connection dead. |
The above configurations can be changed by passing values to the a1helm install command, prefixed with configData.
Example of setting the logging level to DEBUG for both coordinator and worker¶
a1helm install ... --set configData.LOG_LEVEL=DEBUG
Default Coordinator-Only Configuration¶
Name | Value | Possible Values | Notes |
---|---|---|---|
LOG_LEVEL | INFO | FATAL, ERROR, WARN, INFO, DEBUG | Coordinator logging level. This overrides the global configuration. |
GRPC_FALLBACK_USE_IP | true | Text (true/false) | Whether the coordinator communicates with workers using IP addresses instead of hostnames. |
WORKER_CONCURRENCY | 2000 | Integer (0 < value) | The number of concurrent SNMP workloads a single worker instance can perform. |
DISCOVERY_WORKER_PERCENTAGE | 25 | Integer (0 <= value <= 100) | The percentage of workers allocated exclusively to discovery workloads. |
POLLER_RESYNC_PERIOD | 15m | Integer + Text ("ns", "us" (or "µs"), "ms", "s", "m", "h".) | How frequently the coordinator re-synchronizes with the Unified Assurance database. |
PULSAR_SNMP_DISCOVERY_TOPIC_OVERRIDE | "" | Text | Override for the topic from which the coordinator listens for discovery workload requests. |
REDUNDANCY_INIT_DELAY | 20s | Integer + Text ("ns", "us" (or "µs"), "ms", "s", "m", "h".) | How long the secondary waits during startup for the primary to report an up status before taking over as active. |
REDUNDANCY_POLL_PERIOD | 5s | Integer + Text ("ns", "us" (or "µs"), "ms", "s", "m", "h".) | How frequently the secondary microservice polls for primary microservice failure. |
REDUNDANCY_FAILOVER_THRESHOLD | 4 | Integer (0 < value) | The number of failed checks after which the secondary microservice becomes active. |
REDUNDANCY_FALLBACK_THRESHOLD | 1 | Integer (0 < value) | The number of successful checks after which the secondary microservice returns to standby. |
Coordinator-only configurations can be changed by passing values to the a1helm install command, prefixed with coordinator.configData.
Example of setting the logging level to DEBUG only for the coordinator¶
a1helm install ... --set coordinator.configData.LOG_LEVEL=DEBUG
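Redundancy timing can be tuned the same way. With the defaults above, failover occurs after roughly REDUNDANCY_POLL_PERIOD × REDUNDANCY_FAILOVER_THRESHOLD = 5s × 4 = 20 seconds of primary failure; the values below (illustrative only) stretch that window to 60 seconds:

```shell
a1helm install ... --set coordinator.configData.REDUNDANCY_POLL_PERIOD=10s \
  --set coordinator.configData.REDUNDANCY_FAILOVER_THRESHOLD=6
```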
Default Worker-Only Configuration¶
Name | Value | Possible Values | Notes |
---|---|---|---|
LOG_LEVEL | INFO | FATAL, ERROR, WARN, INFO, DEBUG | Worker logging level. This overrides the global configuration. |
GRPC_GRACEFUL_CONN_TIME | 60s | Integer + Text ("ns", "us" (or "µs"), "ms", "s", "m", "h".) | The maximum time workers attempt to connect to the coordinator before failing. |
STREAM_OUTPUT_METRIC | "" | Text | Override for the topic where performance polling workload results are published. |
STREAM_OUTPUT_AVAILABILITY | "" | Text | Override for the topic where availability polling workload results are published. |
PULSAR_DISCOVERY_CALLBACK_OVERRIDE | "" | Text | Override for the topic where discovery workload results are published. |
Worker-only configurations can be changed by passing values to the a1helm install command, prefixed with worker.configData.
Example of setting the logging level to DEBUG only for workers¶
a1helm install ... --set worker.configData.LOG_LEVEL=DEBUG
Keep-alive¶
Both the coordinator and worker instances support individual server-side and client-side keep-alive pings to help ensure constant connectivity between each other.
Info
Both server-side and client-side keep-alive pings are disabled by default.
Warn
Client-side keep-alive pings are subject to enforcement policies on the receiving server; if the client pings too frequently, the connection is dropped with an ENHANCE_YOUR_CALM (too_many_pings) error.
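As an example, client-side keep-alive could be enabled for both components at install time using the global configData prefix (the timings below are illustrative; keep GRPC_CLIENT_KEEPALIVE_TIME well above the server's minimum ping interval to avoid the enforcement error above):

```shell
a1helm install ... --set configData.GRPC_CLIENT_KEEPALIVE=true \
  --set configData.GRPC_CLIENT_KEEPALIVE_TIME=60s \
  --set configData.GRPC_CLIENT_KEEPALIVE_TIMEOUT=10s
```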
Autoscaling¶
The SNMP Poller Microservice uses the formulae below to determine the number of workers required to perform SNMP workloads:
polling workers = round up(unique devices being polled / worker concurrency)
discovery workers = round up(polling workers * discovery worker percentage / 100)
total workers required = polling workers + discovery workers
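The formulae can be reproduced with plain shell integer arithmetic; this sketch (an illustration, not part of the microservice) computes the first worked example below — 100,000 devices, worker concurrency 2000, 25% discovery workers — using the add-then-divide idiom for round-up:

```shell
devices=100000        # unique devices being polled
concurrency=2000      # WORKER_CONCURRENCY
disc_pct=25           # DISCOVERY_WORKER_PERCENTAGE

# polling workers = round up(devices / concurrency)
polling=$(( (devices + concurrency - 1) / concurrency ))
# discovery workers = round up(polling * disc_pct / 100)
discovery=$(( (polling * disc_pct + 99) / 100 ))
# total workers required
total=$(( polling + discovery ))

echo "polling=$polling discovery=$discovery total=$total"
# → polling=50 discovery=13 total=63
```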
During re-synchronisation with the Unified Assurance database, the coordinator determines the number of unique polled devices and performs the above calculations.
The result is then exposed as the snmp_coordinator_metric_required_total_workers Prometheus metric, which KEDA ingests to make the scaling decision.
Info
Autoscaling is disabled by default.
Warn
While the provided autoscaling works almost out of the box, you must manually configure the upper-bound autoscaling limit during installation.
Using the expected number of devices to be polled in each Device Zone, decide the percentage of discovery workers (or leave the default), apply the formulae, and configure the upper-bound limit accordingly.
For common microservice scaling configuration options, please refer to the autoscaling docs.
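Assuming the chart follows the common autoscaling conventions described in those docs, the upper bound for a zone expecting 63 required workers might be set at install time as follows (the key names below are assumptions; confirm the exact names in the autoscaling docs):

```shell
a1helm install ... --set worker.autoscaling.enabled=true \
  --set worker.autoscaling.maxReplicas=63
```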
Examples¶
- For 100,000 polled devices with a worker concurrency of 2000 and 25% discovery workers, total required workers = 63:
  - 50 workers will be assigned to perform polling workloads
  - 13 workers will be assigned to perform discovery workloads
- For 250,000 polled devices with a worker concurrency of 3000 and 33% discovery workers, total required workers = 112:
  - 84 workers will be assigned to perform polling workloads
  - 28 workers will be assigned to perform discovery workloads
Modifying scaling triggers¶
By default, only a single autoscaling trigger is defined. You can define additional triggers during installation alongside the common configuration options.
autoscaling:
  ...
  triggers:
    - type: prometheus
      metadata:
        metricName: required_total_workers
        serverAddress: http://prometheus-kube-prometheus-prometheus.a1-monitoring.svc.cluster.local:9090
        query: snmp_coordinator_metric_required_total_workers
        threshold: '1'
        metricType: Value
Microservice self-metrics¶
The SNMP Poller Microservice exposes the following self-metrics to Prometheus.
Coordinator metrics table¶
Note
Each of the below metrics is prefixed with snmp_coordinator. Example of a full metric name: snmp_coordinator_metric_worker_count
Note
Metrics from the table below that are suffixed with an asterisk (*) are not available/exposed if autoscaling is disabled.
Metric Name | Type | Labels | Description |
---|---|---|---|
metric_worker_count | Gauge | N/A | The number of workers currently enrolled with the coordinator. |
metric_workforce_count | Gauge | N/A | The number of workers multiplied by worker concurrency. |
metric_discovery_worker_count | Gauge | N/A | The number of discovery workers currently enrolled with the coordinator. |
metric_polling_worker_count | Gauge | N/A | The number of polling workers currently enrolled with the coordinator. |
metric_required_discovery_workers* | Gauge | N/A | The number of workers required for discovery when using autoscaling. |
metric_required_polling_workers* | Gauge | N/A | The number of workers required for polling when using autoscaling. |
metric_required_total_workers* | Gauge | N/A | The number of workers required for polling and discovery when using autoscaling. |
metric_discovery_requests_queued | Gauge | N/A | The number of discovery requests currently queued (realtime). |
metric_discovery_requests_processing | Gauge | N/A | The number of discovery requests currently processing (realtime). |
metric_polling_requests_queued | Gauge | N/A | The number of polling requests currently queued (realtime). |
metric_polling_requests_processing | Gauge | N/A | The number of polling requests currently processing (realtime). |
metric_polled_devices_count | GaugeVec | domain, cycle | The number of polled devices per domain and cycle. |
metric_polled_objects_count | GaugeVec | domain, cycle | The number of polled objects per domain and per cycle. |
metric_polling_duration | GaugeVec | domain, cycle | The total polling duration in seconds for the last cycle, per domain and cycle. |
metric_polling_average | GaugeVec | domain, cycle | The average polling duration in seconds for the last cycle, per domain and cycle. |
metric_polling_average95 | GaugeVec | domain, cycle | The 95th percentile polling duration in seconds for the last cycle, per domain and cycle. |
metric_polling_utilisation | GaugeVec | domain, cycle | The polling utilisation in percent for the last cycle, per domain and cycle. |
metric_polling_utilisation95 | GaugeVec | domain, cycle | The 95th percentile polling utilisation in percent for the last cycle, per domain and cycle. |
Microservice redundancy¶
Redundancy in the SNMP Poller Microservice controls which of the two microservices in a redundant pair is considered active to run periodic device polling.
Info
Redundancy is disabled by default.
Example of enabling redundancy¶
a1helm install ... --set redundancy.enabled=true