Writing your own glue services

Writing a glue service has a few requirements:

  1. Knowing how to integrate with your monitoring system so that your glue service will be notified of all alerts.

  2. Knowing how to use the Argus API.

We can only provide hints on how to approach the first requirement. This guide will therefore mostly focus on the second.

How to get alerts from your monitoring system

This depends entirely on your source system and what mechanisms it provides for this. Some systems, like NAV, will provide scripting hooks for receiving raw streams of all alerts and state changes that occur. Others will provide scripting hooks as part of their notification systems. Some systems may even be as simple as cron jobs that only send e-mails on errors and keep no internal state.

Things you need to consider:

  • Does your system keep track of its own alert states, or is every notification stateless? If your system keeps state, what is the state identifier? This identifier should be part of the incident posted to Argus, as a means of keeping Argus in sync with your source system.

  • Do you need or want to backfill a history of incidents from your source system, from before your integration with Argus, or do you only care about new incidents?

  • What happens if the communication between your glue service and Argus temporarily breaks down? How will you synchronize the Argus incident database with everything that happened in your source system during this outage? Strategies may include:

    • Your glue service maintains its own persistent queue of unsynchronized incidents and makes sure to flush this once communication with Argus is restored.

    • Your source system maintains its own logs of state changes, which can be used as a data source to synchronize Argus once communication is restored.
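The first strategy above can be sketched as follows. This is a minimal illustration, not part of Argus or PyArgus: the PendingQueue class, its table layout and the send callback are all hypothetical names, and SQLite is only one of many ways to persist the queue.

```python
import json
import sqlite3


class PendingQueue:
    """A small persistent FIFO of JSON payloads awaiting delivery to Argus.

    Hypothetical helper for illustration; not part of any Argus library.
    """

    def __init__(self, path):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS pending ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT NOT NULL)"
        )
        self.db.commit()

    def enqueue(self, payload):
        """Store a payload on disk until it can be delivered."""
        self.db.execute(
            "INSERT INTO pending (payload) VALUES (?)", (json.dumps(payload),)
        )
        self.db.commit()

    def flush(self, send):
        """Deliver queued payloads oldest first, stopping at the first failure.

        `send` is any callable that raises OSError (which includes
        urllib.error.URLError) when Argus cannot be reached.
        Returns the number of payloads delivered.
        """
        delivered = 0
        rows = self.db.execute(
            "SELECT id, payload FROM pending ORDER BY id"
        ).fetchall()
        for row_id, raw in rows:
            try:
                send(json.loads(raw))
            except OSError:
                break  # still offline: keep this and all later payloads
            self.db.execute("DELETE FROM pending WHERE id = ?", (row_id,))
            self.db.commit()
            delivered += 1
        return delivered

    def __len__(self):
        return self.db.execute("SELECT COUNT(*) FROM pending").fetchone()[0]
```

A glue service built this way would enqueue every incident or event first and call flush() after each enqueue (and periodically), so ordering is preserved across an outage.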

Argus API access libraries

If you prefer to work with Python, you are in luck: There is already a Python library to help you access and consume the Argus API, called PyArgus (available on PyPI as argus-api-client).

Incidents and the Argus API

Argus models Incidents. An incident will, in most cases, be stateful, meaning it’s either open or closed, and has both a start timestamp and an end timestamp. An open, stateful incident is represented by a value of infinity in its end timestamp. Stateless incidents are also supported by Argus, but are not the main focus of the API and UI (See On stateful incidents vs. stateless incidents for details).

An incident has a description, which is a string of text, usually derived from the source system, to describe a problem that happened.

An incident can have any number of tag values. These are useful metadata for categorizing an incident, and thereby for filtering it in either the Argus frontend dashboard or in a user’s notification profile.

A glue service mainly concerns itself with:

  • Creating new incidents in Argus whenever a source system reports a problem.

  • Describing and tagging created incidents in an expressive, meaningful way for the users’ consumption.

  • Closing existing Argus incidents it has created, when the source system reports that a problem has been resolved.

Creating a new incident

Creating a new incident in Argus is done by POST-ing a JSON payload to the REST API endpoint /api/v1/incidents/ (See the Incident endpoints documentation).

At minimum, you need to provide these incident attributes:

  • description

  • start_time

  • tags

  • end_time, if the incident is stateful (meaning it is waiting to be resolved): you must give this attribute the value "infinity". If you don’t, end_time will default to a null value, which means the incident is stateless and does not need to be resolved.

Optionally, you may want to provide these attributes as well:

  • source_incident_id: An identifier that can be used to match up this incident with some state/alert/incident in the source system in the future. If provided, Argus will enforce uniqueness of source incident identifiers.

  • details_url: A relative (or fully-qualified) URL to a page in the source system’s web-based user interface, which will give more details about this specific incident.

Complete example of an incident JSON payload
{
    "description": "foobar-sw.example.org stopped responding to ping requests",
    "source_incident_id": "42",
    "details_url": "/alerts/detail/42/",
    "start_time": "2020-12-11 15:50:42",
    "end_time": "infinity",
    "tags": [
      {"tag": "host=foobar-sw.example.org"},
      {"tag": "location=Campus Rotvoll"},
      {"tag": "customer=himunkholmen.no"}
    ]
}
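
As a rough sketch, the payload above could be POST-ed using only the Python standard library. The server address and token below are placeholders, and the "Token ..." form of the Authorization header is an assumption; check your Argus server's API documentation for the authentication scheme it actually expects. The function only builds the request, so nothing is sent until you call urlopen yourself.

```python
import json
import urllib.request

ARGUS_URL = "https://argus.example.org"  # placeholder server address
API_TOKEN = "replace-with-your-token"    # placeholder source system token


def build_incident_request(payload, base_url=ARGUS_URL, token=API_TOKEN):
    """Build (but do not send) a POST request for /api/v1/incidents/."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/v1/incidents/",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumption: token-style authorization header.
            "Authorization": "Token " + token,
        },
    )


incident = {
    "description": "foobar-sw.example.org stopped responding to ping requests",
    "source_incident_id": "42",
    "details_url": "/alerts/detail/42/",
    "start_time": "2020-12-11 15:50:42",
    "end_time": "infinity",  # stateful: waiting to be resolved
    "tags": [
        {"tag": "host=foobar-sw.example.org"},
        {"tag": "location=Campus Rotvoll"},
        {"tag": "customer=himunkholmen.no"},
    ],
}

request = build_incident_request(incident)
# To actually send it:
#     with urllib.request.urlopen(request) as response:
#         created = json.load(response)
```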

Describing an incident through tags

Using tags to describe your incidents is a good idea. Tags can be used to describe almost any structured or unstructured incident metadata not covered by the standard incident attributes.

Examples include:

  • The hostname of an affected device.

  • A reference to an affected customer.

  • A geographical location where the problem occurred.

  • A reference to an affected service instance.

  • A URL to a relevant section of the affected service’s operating instructions.

This kind of metadata will enable:

  • Your first line of support to correctly assess, prioritize and react to incidents.

  • Once Argus gains proper integration with ticketing systems, the metadata in tags can also be carried forward automatically to tickets.

  • Your devops teams can create notification filters specifically for the services, devices or customers they care about.

  • Generating reports and statistics on the number and duration of incidents per service, per customer, per device and so forth.

Tag syntax

Incident tags are defined syntactically as key=value. This syntax is employed by the API both when posting and retrieving incident tags. Any alphanumeric string (excluding spaces and the = character itself) can be used as a tag key, whereas the value can be any string.
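
The key=value syntax can be handled with two small helpers, sketched here. The names make_tag and split_tag are hypothetical. The key check only enforces the exclusions stated above (no spaces, no = character); the server may be stricter about which characters it accepts in a key.

```python
def make_tag(key, value):
    """Return a tag object in the {"tag": "key=value"} form used by the API."""
    if not key or "=" in key or any(ch.isspace() for ch in key):
        raise ValueError(f"invalid tag key: {key!r}")
    return {"tag": f"{key}={value}"}


def split_tag(tag):
    """Split a "key=value" string at the first "=" only.

    Splitting once keeps any further "=" characters inside the value.
    """
    key, separator, value = tag.partition("=")
    if not separator:
        raise ValueError(f"not a key=value tag: {tag!r}")
    return key, value
```

Splitting at the first = matters because the value can be any string, including one that itself contains =.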

On the importance of tag conventions

When integrating multiple types of source systems into Argus, it is important to implement a convention for which tag keys to use, so that the incidents reported by your monitoring systems are consistent.

You may, for example, have two separate monitoring systems that monitor different aspects of the device foobar-sw.example.org. If one system reports incidents with the tag host=foobar-sw.example.org, and the other uses fqdn=foobar-sw.example.org, then no single filter or report will match the incidents from both systems, and you will have a mess on your hands.

Closing incidents that have been resolved

Once the source system reports an incident as resolved, the glue service needs to close the corresponding Argus incident. But, how can it keep track of which Argus incident maps to the resolved problem?

There could be a multitude of approaches to this, but in essence, there are two distinct scenarios that come into play:

  • The source system already keeps track of its own state.

  • The source system does not keep track of state.

When the source system already tracks state

In this scenario, the source system should already have some identifier for the resolved state, and you should already have posted this value in the source_incident_id when you first created the Argus incident.

The API endpoint /api/v1/incidents/mine/ is useful in this regard. It functions mostly the same as the /api/v1/incidents/ endpoint, but will only ever look at incidents reported from the source system whose API token you are currently using to access the API.

If your source system reports that it resolved a problem whose identifier was 42, you can simply find the corresponding Argus incident by issuing a GET request for /api/v1/incidents/mine/?source_incident_id=42.
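
A small sketch of building such a lookup URL, assuming a placeholder server address. The mine_url helper is a hypothetical name; it only constructs the URL, and the shape of the response is not described here, so consult the Incident endpoints documentation before parsing it.

```python
from urllib.parse import urlencode


def mine_url(base_url, **filters):
    """Build a URL for /api/v1/incidents/mine/ with optional query filters."""
    url = base_url.rstrip("/") + "/api/v1/incidents/mine/"
    query = urlencode(filters)  # percent-encodes values where needed
    return url + "?" + query if query else url


lookup = mine_url("https://argus.example.org", source_incident_id="42")
```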

When the source system does not track internal state

In this case, things immediately become more involved. Your glue service needs a strategy to track state itself. Suggested strategies may be:

  • The glue service can track state in its own database.

  • The glue service can potentially calculate a hash value of incident attributes that will be the same for events that close an incident as for events that open an incident. This hash value can be used as the Argus incident’s source_incident_id, after which the same strategy applies as for state-tracking source systems.

  • The glue service can fetch the list of open Argus incidents posted by itself (from /api/v1/incidents/mine/?open=true), then use as complicated a custom algorithm as necessary to determine which of these Incidents match up with the resolving event it is currently processing.
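
The hashing strategy above might look like the following sketch. Which attributes identify a problem is entirely specific to your source system; the host and check-name parts used in the usage note are hypothetical. Truncating the digest to 32 characters is also an assumption, made only to keep the identifier short.

```python
import hashlib


def stable_incident_id(*identifying_parts):
    """Derive a repeatable identifier from attributes shared by the
    opening and the closing event of the same problem.

    Volatile attributes (timestamps, severity, free-text messages)
    must be left out, or the two events will hash differently.
    """
    # Join with a separator that is unlikely to occur in the parts, so
    # ("ab", "c") and ("a", "bc") do not produce the same input.
    joined = "\x1f".join(str(part) for part in identifying_parts)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()[:32]
```

For example, stable_incident_id("foobar-sw.example.org", "ping") returns the same value when the problem is reported and when it is resolved, so the glue service can use it as source_incident_id without storing anything itself.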

Performing the close operation

Closing an open Argus incident normally entails changing the incident’s end_time attribute to a proper timestamp (representing the time the source system detected that the incident had been resolved). However, Argus will not simply allow you to set this value on an existing incident.

Instead, Argus keeps a log of events for each incident it tracks. When you created the original incident, a creation event was implicitly logged alongside it. An Argus incident is closed by posting a closing event to the incident’s event log. The closing event can contain its own description, if need be.

An incident with the id 27 can be closed by POST-ing a new event to /api/v1/incidents/27/events/:

{
  "timestamp": "2020-12-11 15:57:00",
  "type": "END",
  "description": "Foobar was resolved somehow"
}

Use the END event type only to indicate that the source system reported the incident as resolved. The available types are documented in the API endpoint documentation.
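
The closing event shown above can be assembled with a couple of small helpers, sketched here. The function names are hypothetical, and the timestamp format simply mirrors the examples in this guide; time zone handling is not covered.

```python
from datetime import datetime


def events_path(incident_id):
    """Return the API path of an incident's event log."""
    return f"/api/v1/incidents/{incident_id}/events/"


def end_event(resolved_at, description=None):
    """Build the JSON-serializable body of a closing (END) event.

    `resolved_at` is the time the source system detected the resolution.
    """
    event = {
        "timestamp": resolved_at.strftime("%Y-%m-%d %H:%M:%S"),
        "type": "END",
    }
    if description:  # the description is optional
        event["description"] = description
    return event
```

The resulting dictionary would then be serialized to JSON and POST-ed to the path returned by events_path(27), in the same way as when creating an incident.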

On stateful incidents vs. stateless incidents

Argus incidents are primarily stateful, but the concept of stateless incidents is also supported. The difference may not be immediately obvious, and depending on your needs, stateless incidents may seem useless.

Stateful incidents

A stateful incident, by definition, has an extent in time. The incident began at some point in time, and ended (or will end) at a later point in time. The state of such an incident is therefore either open or closed.

An incident must always have a start_time value. If a definite end_time value has not been set for it yet, its state is considered open. Once a definite end_time value is set, it is considered closed.

Stateless incidents

A stateless incident only represents something that happened at a single point in time, and otherwise has no extent in time. It can never be considered either open or closed.

Whether stateless incidents are useful to you depends on your needs and the source systems you want to integrate with Argus. Some source systems will generate alerts that are just one-off notifications, and are not considered to represent a state or an ongoing problem.

One such example is from Network Administration Visualized (NAV), which will send one-off early warnings that devices have stopped responding to ICMP ping requests. These are sent a few minutes in advance of NAV actually declaring the device to be down/non-responsive. If several such warning messages are sent, while the device is never actually declared to be down, this may indicate that there is a problem with “flapping”, even though the device appears to be responding most of the time.

Stateless incidents can be matched by notification profiles, if so desired. The Argus API incidents endpoint (and the Argus UI) will, by default, only show open/stateful incidents unless explicitly instructed to also include stateless incidents. Normally, open stateful incidents are the ones you want to act on.

Representation

Internally, to represent a stateful incident that is still open, the special value infinity is used as the value of end_time. This signals that the incident is expected to end at some unknown point in the future, and is quite useful when doing time-based queries on stateful incidents.

Conversely, stateless incidents never have an end_time value: for these incidents, end_time is explicitly set to a null value.

These special values are also exposed through the API.
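
Given these conventions, a glue service could classify an incident it has fetched roughly as follows. This is a sketch that assumes the incident has been decoded from JSON into a dictionary, with the special values appearing as the string "infinity" and as None (JSON null); incident_state is a hypothetical helper name.

```python
def incident_state(incident):
    """Classify an incident dictionary by its end_time value."""
    end_time = incident.get("end_time")
    if end_time is None:
        return "stateless"  # null end_time: a single point in time
    if end_time == "infinity":
        return "open"       # stateful, not yet resolved
    return "closed"         # stateful, with a definite end timestamp
```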