Monitoring Trident

Trident provides a set of Prometheus metrics endpoints that you can use to monitor Trident’s performance and understand how it functions. The metrics exposed by Trident provide a convenient way of:

  1. Keeping tabs on Trident’s health and configuration. You can examine how successful Trident operations are and whether it can communicate with the backends as expected.
  2. Examining backend usage information, such as how many volumes are provisioned on a backend and how much space they consume.
  3. Maintaining a mapping of the number of volumes provisioned on the available backends.
  4. Tracking Trident’s performance metrics. You can take a look at how long it takes Trident to communicate with backends and perform operations.

Requirements

  1. A Kubernetes cluster with Trident installed. By default, Trident’s metrics are exposed on target port 8001 at the /metrics endpoint. These metrics are enabled by default when Trident is installed; a quick way to verify the endpoint is shown after this list.
  2. A Prometheus instance. This can be a containerized Prometheus deployment, such as the Prometheus Operator, or you can run Prometheus as a native application.
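If you want to confirm that the metrics endpoint is reachable before wiring up Prometheus, you can port-forward to the Trident controller and query the endpoint directly. This is a minimal sketch; it assumes Trident runs in the trident namespace and that the controller deployment is named trident-csi (the name may differ in your release).

# In one terminal, forward local port 8001 to the Trident controller's metrics port
kubectl port-forward -n trident deploy/trident-csi 8001:8001

# In another terminal, fetch the raw metrics and list a few Trident-specific series
curl -s http://localhost:8001/metrics | grep '^trident_' | head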

Once these requirements are satisfied, you can define a Prometheus target to gather the metrics exposed by Trident and obtain information on the backends it manages, the volumes it creates, and so on. As stated above, Trident reports its metrics by default; to disable them, you will have to generate custom YAMLs (using the --generate-custom-yaml flag) and edit them to remove the --metrics argument from the trident-main container.
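A rough sketch of that workflow follows. It assumes tridentctl is on your PATH and Trident is being installed into the trident namespace; the generated file names and the --use-custom-yaml flag can vary by Trident release, so check tridentctl install --help first.

# Write the installation manifests to disk instead of installing directly
tridentctl install --generate-custom-yaml -n trident

# Edit the generated deployment manifest and delete the --metrics argument
# from the trident-main container's args, then install the edited manifests
tridentctl install --use-custom-yaml -n trident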

This blog is a great place to start. It explains how Prometheus and Grafana can be used with Trident to retrieve metrics, how you can run Prometheus as an operator in your Kubernetes cluster, and how to create a ServiceMonitor to obtain Trident’s metrics.

Note

Metrics that are part of the Trident core subsystem are deprecated and marked for removal in a later release.

Scraping Trident metrics

After installing Trident with the --metrics flag (enabled by default), Trident returns its Prometheus metrics on the metrics port defined in the trident-csi service. To consume them, create a Prometheus ServiceMonitor that watches the trident-csi service and listens on the metrics port. A sample ServiceMonitor looks like this.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: trident-sm
  namespace: monitoring
  labels:
    release: prom-operator
spec:
  jobLabel: trident
  selector:
    matchLabels:
      app: controller.csi.trident.netapp.io
  namespaceSelector:
    matchNames:
    - trident
  endpoints:
  - port: metrics
    interval: 15s

This ServiceMonitor definition retrieves metrics returned by the trident-csi service, specifically from the service’s metrics endpoint. As a result, Prometheus is now configured to scrape Trident’s metrics. You can then work with the collected metrics by writing PromQL queries or building custom Grafana dashboards: PromQL is well suited to expressions that return time-series or tabular data, while Grafana makes it easier to build visually descriptive, fully customizable dashboards.
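For example, assuming the manifest above is saved as servicemonitor.yaml (an arbitrary file name) and the Prometheus Operator CRDs are installed, you can create it with kubectl and confirm it exists; the new target should then appear on the Prometheus targets page.

# Create the ServiceMonitor in the monitoring namespace
kubectl apply -f servicemonitor.yaml

# Verify that the ServiceMonitor object was created
kubectl get servicemonitor trident-sm -n monitoring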

In addition to the metrics available directly from Trident, the kubelet exposes many kubelet_volume_* metrics via its own metrics endpoint. The kubelet can provide information about the volumes that are attached, the pods it runs, and other internal operations it handles. You can find more information on it here.

Querying Trident metrics with PromQL

Here are some PromQL queries that you can use:

Trident Health Information

Percentage of HTTP 2XX responses from Trident

(sum (trident_rest_ops_seconds_total_count{status_code=~"2.."} OR on() vector(0)) / sum (trident_rest_ops_seconds_total_count)) * 100

Percentage of REST responses from Trident via status code

(sum (trident_rest_ops_seconds_total_count) by (status_code)  / scalar (sum (trident_rest_ops_seconds_total_count))) * 100

Average duration in ms of operations performed by Trident

sum by (operation) (trident_operation_duration_milliseconds_sum{success="true"}) / sum by (operation) (trident_operation_duration_milliseconds_count{success="true"})

Trident Usage Information

Average volume size

trident_volume_allocated_bytes/trident_volume_count

Total volume space provisioned by each backend

sum (trident_volume_allocated_bytes) by (backend_uuid)

Individual volume usage

Note

This is only enabled if kubelet metrics are also gathered.

Percentage of used space for each volume

kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes * 100

Trident Autosupport Telemetry

By default, Trident sends Prometheus metrics and basic backend information to NetApp on a daily cadence. This behavior can be disabled during Trident installation by passing the --silence-autosupport flag. In addition, this information can be sent to NetApp support on demand via tridentctl send autosupport, as shown below.
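Both behaviors are controlled from the command line. A minimal sketch, assuming tridentctl is installed and Trident runs in the trident namespace:

# Install Trident with daily AutoSupport uploads disabled
tridentctl install --silence-autosupport -n trident

# Send an AutoSupport bundle to NetApp support on demand
tridentctl send autosupport -n trident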

An example payload sent by Trident looks like this:

{
  "items": [
    {
      "backendUUID": "ff3852e1-18a5-4df4-b2d3-f59f829627ed",
      "protocol": "file",
      "config": {
        "version": 1,
        "storageDriverName": "ontap-nas",
        "debug": false,
        "debugTraceFlags": null,
        "disableDelete": false,
        "serialNumbers": [
          "nwkvzfanek_SN"
        ],
        "limitVolumeSize": ""
      },
      "state": "online",
      "online": true
    }
  ]
}

This information is collected and sent via a trident-autosupport container that is installed alongside Trident. You can obtain the container image at netapp/trident-autosupport.
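If you want to see which AutoSupport image your cluster is running, one option (assuming the Trident controller pods carry the app=controller.csi.trident.netapp.io label used by the service selector above) is to list the container images in those pods; the image itself can also be pulled from Docker Hub.

# List every container image used by the Trident controller pod(s),
# including the trident-autosupport sidecar
kubectl get pods -n trident -l app=controller.csi.trident.netapp.io \
  -o jsonpath='{range .items[*]}{range .spec.containers[*]}{.image}{"\n"}{end}{end}'

# Pull the AutoSupport image directly
docker pull netapp/trident-autosupport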