SNMP Poller¶
Overview¶
The SNMP Poller microservice can perform SNMP device discovery requested through the discovery-service's discovery RESTful API, or execute periodic SNMP polling once devices have been provisioned in Unified Assurance through the discovery-service's inventory RESTful API.
The microservice uses a Controller-Worker architecture composed of two components: a single coordinator and multiple worker instances. The coordinator manages the workers, publishes metrics, and calculates and coordinates SNMP workloads between them, whereas the workers are only responsible for executing those workloads and publishing their results to the appropriate microservice pipelines through the Apache Pulsar bus.
The microservice is expected to run in a separate microservices cluster for each Device Zone alongside other mandatory microservices. See Part 3 of the Prerequisites section below for more details.
Prerequisites¶
- A microservices cluster must be set up. Refer to Microservice Cluster Setup.
- Apache Pulsar must be installed. Refer to Apache Pulsar microservice.
- The following core microservices must be installed as per the requirement
Setup¶
su - assure1
export NAMESPACE=a1-zone1-pri
export WEBFQDN=<Primary Presentation Web FQDN>
a1helm install snmp-poller assure1/snmp-poller -n $NAMESPACE --set global.imageRegistry=$WEBFQDN
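After installation, you can confirm that the coordinator and worker pods are running with kubectl (the pod name filter below is an assumption based on the release name used above):

```shell
kubectl get pods -n $NAMESPACE | grep snmp-poller
```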
Default Global Configuration¶
Name | Value | Possible Values | Notes |
---|---|---|---|
LOG_LEVEL | INFO | FATAL, ERROR, WARN, INFO, DEBUG | Global logging level between coordinator and workers. |
PORT_COORDINATOR | 38890 | Integer | Internal port used by the coordinator service. |
PORT_WORKER | 38891 | Integer | Internal port used by the worker service. |
GRPC_CLIENT_KEEPALIVE | false | Text (true/false) | Whether to use client-side keep-alive pings. See Keep-alive. |
GRPC_CLIENT_KEEPALIVE_TIME | 30s | Integer + Text ("ns", "us" (or "µs"), "ms", "s", "m", "h".) | Duration after no activity to ping server. |
GRPC_CLIENT_KEEPALIVE_TIMEOUT | 5s | Integer + Text ("ns", "us" (or "µs"), "ms", "s", "m", "h".) | Duration after no ping ack to consider server connection dead. |
GRPC_SERVER_KEEPALIVE | false | Text (true/false) | Whether to use server-side keep-alive pings. See Keep-alive. |
GRPC_SERVER_KEEPALIVE_TIME | 30s | Integer + Text ("ns", "us" (or "µs"), "ms", "s", "m", "h".) | Duration after no activity to ping client. |
GRPC_SERVER_KEEPALIVE_TIMEOUT | 5s | Integer + Text ("ns", "us" (or "µs"), "ms", "s", "m", "h".) | Duration after no ping ack to consider client connection dead. |
The above configurations can be changed by passing values to the a1helm install command, prefixed with configData.
Example of setting the logging level to DEBUG for both coordinator and worker¶
a1helm install ... --set configData.LOG_LEVEL=DEBUG
Default Coordinator-Only Configuration¶
Name | Value | Possible Values | Notes |
---|---|---|---|
LOG_LEVEL | INFO | FATAL, ERROR, WARN, INFO, DEBUG | Coordinator logging level. This overrides the global configuration. |
GRPC_FALLBACK_USE_IP | true | Text (true/false) | Whether the coordinator communicates with workers using IP addresses instead of hostnames. |
WORKER_CONCURRENCY | 2000 | Integer (0 < value) | The number of concurrent SNMP workloads a single worker instance can perform. |
DISCOVERY_WORKER_PERCENTAGE | 25 | Integer (0 <= value <= 100) | The percentage of workers allocated exclusively to discovery workloads. |
POLLER_RESYNC_PERIOD | 15m | Integer + Text ("ns", "us" (or "µs"), "ms", "s", "m", "h".) | How frequently the coordinator re-synchronizes with the Unified Assurance database. |
PULSAR_SNMP_DISCOVERY_TOPIC_OVERRIDE | "" | Text | Override for the topic from which the coordinator listens for discovery workload requests. |
REDUNDANCY_INIT_DELAY | 20s | Integer + Text ("ns", "us" (or "µs"), "ms", "s", "m", "h".) | How long the secondary waits during startup for the primary to report an up status before taking over as active. |
REDUNDANCY_POLL_PERIOD | 5s | Integer + Text ("ns", "us" (or "µs"), "ms", "s", "m", "h".) | How frequently the secondary microservice polls for primary microservice failure. |
REDUNDANCY_FAILOVER_THRESHOLD | 4 | Integer (0 < value) | The number of failed checks after which the secondary microservice becomes active. |
REDUNDANCY_FALLBACK_THRESHOLD | 1 | Integer (0 < value) | The number of successful checks after which the secondary microservice returns to standby. |
Coordinator-only configurations can be changed by passing values to the a1helm install command, prefixed with coordinator.configData.
Example of setting the logging level to DEBUG only for the coordinator¶
a1helm install ... --set coordinator.configData.LOG_LEVEL=DEBUG
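Redundancy timing can be tuned the same way. With the defaults above, failover occurs after roughly REDUNDANCY_POLL_PERIOD × REDUNDANCY_FAILOVER_THRESHOLD = 5s × 4 = 20 seconds of primary failure; the values below (illustrative only) stretch that window to 60 seconds:

```shell
a1helm install ... --set coordinator.configData.REDUNDANCY_POLL_PERIOD=10s \
  --set coordinator.configData.REDUNDANCY_FAILOVER_THRESHOLD=6
```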
Default Worker-Only Configuration¶
Name | Value | Possible Values | Notes |
---|---|---|---|
LOG_LEVEL | INFO | FATAL, ERROR, WARN, INFO, DEBUG | Worker logging level. This overrides the global configuration. |
GRPC_GRACEFUL_CONN_TIME | 60s | Integer + Text ("ns", "us" (or "µs"), "ms", "s", "m", "h".) | The maximum time workers attempt to connect to the coordinator before failing. |
STREAM_OUTPUT_METRIC | "" | Text | Override for the topic where performance polling workload results are published. |
STREAM_OUTPUT_AVAILABILITY | "" | Text | Override for the topic where availability polling workload results are published. |
PULSAR_DISCOVERY_CALLBACK_OVERRIDE | "" | Text | Override for the topic where discovery workload results are published. |
Worker-only configurations can be changed by passing values to the a1helm install command, prefixed with worker.configData.
Example of setting the logging level to DEBUG only for workers¶
a1helm install ... --set worker.configData.LOG_LEVEL=DEBUG
Keep-alive¶
Both the coordinator and worker instances support individual server-side and client-side keep-alive pings to help ensure constant connectivity between each other.
Info
Both server-side and client-side keep-alive pings are disabled by default.
Warn
Client-side keep-alive pings are subject to enforcement policies on the receiving server; if the client pings too frequently, the connection is dropped with an ENHANCE_YOUR_CALM (too_many_pings) error.
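As an example, client-side keep-alive could be enabled for both components at install time using the global configData prefix (the timings below are illustrative; keep GRPC_CLIENT_KEEPALIVE_TIME well above the server's minimum ping interval to avoid the enforcement error above):

```shell
a1helm install ... --set configData.GRPC_CLIENT_KEEPALIVE=true \
  --set configData.GRPC_CLIENT_KEEPALIVE_TIME=60s \
  --set configData.GRPC_CLIENT_KEEPALIVE_TIMEOUT=10s
```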
Autoscaling¶
The SNMP Poller Microservice uses the formulae below to determine the number of workers required to perform SNMP workloads:
polling workers = round up(unique devices being polled / worker concurrency)
discovery workers = round up(polling workers * discovery worker percentage / 100)
total workers required = polling workers + discovery workers
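The formulae can be reproduced with plain shell integer arithmetic; this sketch (an illustration, not part of the microservice) computes the first worked example below — 100,000 devices, worker concurrency 2000, 25% discovery workers — using the add-then-divide idiom for round-up:

```shell
devices=100000        # unique devices being polled
concurrency=2000      # WORKER_CONCURRENCY
disc_pct=25           # DISCOVERY_WORKER_PERCENTAGE

# polling workers = round up(devices / concurrency)
polling=$(( (devices + concurrency - 1) / concurrency ))
# discovery workers = round up(polling * disc_pct / 100)
discovery=$(( (polling * disc_pct + 99) / 100 ))
# total workers required
total=$(( polling + discovery ))

echo "polling=$polling discovery=$discovery total=$total"
# → polling=50 discovery=13 total=63
```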
During re-synchronisation with the Unified Assurance database, the coordinator determines the number of unique polled devices and performs the above calculations.
The result is then exposed as the snmp_coordinator_metric_required_total_workers Prometheus metric, which KEDA ingests to make the scaling decision.
Info
Autoscaling is disabled by default.
Warn
While the provided autoscaling works almost out of the box, you must manually configure the upper-bound autoscaling limit during installation.
Using the expected number of devices to be polled in each Device Zone, decide the percentage of discovery workers (or leave the default), apply the formulae, and configure the upper-bound limit accordingly.
For common microservice scaling configuration options, please refer to the autoscaling docs.
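Assuming the chart follows the common autoscaling conventions described in those docs, the upper bound for a zone expecting 63 required workers might be set at install time as follows (the key names below are assumptions; confirm the exact names in the autoscaling docs):

```shell
a1helm install ... --set worker.autoscaling.enabled=true \
  --set worker.autoscaling.maxReplicas=63
```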
Examples¶
- For 100,000 polled devices with a worker concurrency of 2000 and 25% discovery workers, total required workers = 63:
  - 50 workers will be assigned to perform polling workloads
  - 13 workers will be assigned to perform discovery workloads
- For 250,000 polled devices with a worker concurrency of 3000 and 33% discovery workers, total required workers = 112:
  - 84 workers will be assigned to perform polling workloads
  - 28 workers will be assigned to perform discovery workloads
Modifying scaling triggers¶
By default, only a single autoscaling trigger is defined. You can define additional triggers during installation alongside the common configuration options.
autoscaling:
  ...
  triggers:
    - type: prometheus
      metadata:
        metricName: required_total_workers
        serverAddress: http://prometheus-kube-prometheus-prometheus.a1-monitoring.svc.cluster.local:9090
        query: snmp_coordinator_metric_required_total_workers
        threshold: '1'
        metricType: Value
Microservice self-metrics¶
The SNMP Poller Microservice exposes the following self-metrics to Prometheus.
Coordinator metrics table¶
Note
Each of the below metrics is prefixed with snmp_coordinator. Example of a full metric name: snmp_coordinator_metric_worker_count
Note
Metrics from the table below that are suffixed with an asterisk (*) are not available/exposed if autoscaling is disabled.
Metric Name | Type | Labels | Description |
---|---|---|---|
metric_worker_count | Gauge | N/A | The number of workers currently enrolled with the coordinator. |
metric_workforce_count | Gauge | N/A | The number of workers multiplied by worker concurrency. |
metric_discovery_worker_count | Gauge | N/A | The number of discovery workers currently enrolled with the coordinator. |
metric_polling_worker_count | Gauge | N/A | The number of polling workers currently enrolled with the coordinator. |
metric_required_discovery_workers* | Gauge | N/A | The number of workers required for discovery when using autoscaling. |
metric_required_polling_workers* | Gauge | N/A | The number of workers required for polling when using autoscaling. |
metric_required_total_workers* | Gauge | N/A | The number of workers required for polling and discovery when using autoscaling. |
metric_discovery_requests_queued | Gauge | N/A | The number of discovery requests currently queued (realtime). |
metric_discovery_requests_processing | Gauge | N/A | The number of discovery requests currently processing (realtime). |
metric_polling_requests_queued | Gauge | N/A | The number of polling requests currently queued (realtime). |
metric_polling_requests_processing | Gauge | N/A | The number of polling requests currently processing (realtime). |
metric_polled_devices_count | GaugeVec | domain, cycle | The number of polled devices per domain and cycle. |
metric_polled_objects_count | GaugeVec | domain, cycle | The number of polled objects per domain and per cycle. |
metric_polling_duration | GaugeVec | domain, cycle | The total polling duration in seconds for the last cycle, per domain and cycle. |
metric_polling_average | GaugeVec | domain, cycle | The average polling duration in seconds for the last cycle, per domain and cycle. |
metric_polling_average95 | GaugeVec | domain, cycle | The 95th percentile polling duration in seconds for the last cycle, per domain and cycle. |
metric_polling_utilisation | GaugeVec | domain, cycle | The polling utilisation in percent for the last cycle, per domain and cycle. |
metric_polling_utilisation95 | GaugeVec | domain, cycle | The 95th percentile polling utilisation in percent for the last cycle, per domain and cycle. |
Microservice redundancy¶
Redundancy in the SNMP Poller Microservice controls which of the two microservices in a redundant pair is considered active to run periodic device polling.
Info
Redundancy is disabled by default.
Example of enabling redundancy¶
a1helm install ... --set redundancy.enabled=true