Writing your own glue services¶
Writing a glue service has a few requirements:
Knowing how to integrate with your monitoring system so that your glue service will be notified of all alerts.
Knowing how to use the Argus API.
We can only provide hints for how to think about number 1. This guide will therefore mostly focus on the second requirement.
How to get alerts from your monitoring system¶
This depends entirely on your source system and what mechanisms it provides for this. Some systems, like NAV, will provide scripting hooks for receiving raw streams of all alerts and state changes that occur. Others will provide scripting hooks as part of its notification systems. Some systems may even be as simple as cron jobs that only send e-mails on errors and keep no internal state.
Things you need to consider:
Does your system keep track of its own alert states, or is every notification stateless? If your system keeps state, what is the state identifier? This identifier should be part of the incident posted to Argus, as means to ensure you are keeping Argus in sync with your source system.
Do you need or want to backfill a history of incidents from your source system, from before your integration with Argus, or do you only care about new incidents?
What happens if the communication between your glue service and Argus temporarily breaks down? How will you synchronize the Argus incident database with everything that happened in your source system during this outage? Strategies may include:
Your glue service maintains its own persistent queue of unsynchronized incidents and makes sure to flush this once communication with Argus is restored.
Your source system maintains its own logs of state changes, which can be used as a data source to synchronize Argus once communication is restored.
Argus API access libraries¶
If you prefer to work with Python, you are in luck: There is already a Python library to help you access and consume the Argus API, called PyArgus (available on PyPI as argus-api-client).
Incidents and the Argus API¶
Argus models Incidents. An incident will, in most cases, be stateful,
meaning it’s either open or closed, and has both a start timestamp and an
end timestamp. An open, stateful incident is represented by a value of
infinity
in its end timestamp. Stateless incidents are also supported by
Argus, but are not the main focus of the API and UI (See On stateful
incidents vs. stateless incidents for details).
An incident has a description, which is a string of text, usually derived from the source system, to describe a problem that happened.
An incident can have any number of tag values. These are useful metadata to categorize an incident, and thereby filtering it in either the Argus frontend dashboard or in a user’s notification profile.
A glue service mainly concerns itself with:
Creating new incidents in Argus whenever a source system reports a problem.
Describing and tagging created incidents in an expressive, meaningful way for the users’ consumption.
Closing existing Argus incidents it has created when the source system reports that a problem has been resolved.
Creating a new incident¶
Creating a new incident in Argus is done by POST-ing a JSON payload to the
REST API endpoint /api/v1/incidents/
(See the Incident endpoints
documentation).
At minimum, you need to provide these incident attributes:
decription
start_time
tags
To indicate that this incident is stateful (meaning it is waiting to be resolved), you must also give the
end_time
attribute the value"infinity"
. If you don’t do this,end_time
will default to anull
value, which means this incident is stateless, and does not need to be resolved.
Optionally, you may want to provide these attributes as well:
source_incident_id
: An identifier that can be used to match up this incident with some state/alert/incident in the source system in the future. If provided, Argus will enforce uniqueness of source incident identifiers.details_url
: A relative (or fully-qualified) URL to a page in the source system’s web-based user interface, which will give more details about this specific incident.
{
"description": "foobar-sw.example.org stopped responding to ping requests",
"source_incident_id": "42",
"details_url": "/alerts/detail/42/",
"start_time": "2020-12-11 15:50:42",
"end_time": "infinity",
"tags": [
{"tag": "host=foobar-sw.example.org"},
{"tag": "location=Campus Rotvoll"},
{"tag": "customer=himunkholmen.no"}
]
}
Closing incidents that have been resolved¶
Once the source system reports an incident as resolved, the glue service needs to close the corresponding Argus incident. But how can it keep track of which Argus incident maps to the resolved problem?
There could be a multitude of approaches to this, but in essence there are two distinct scenarios that come into play:
The source system already keeps track of its own state.
The source system does not keep track of state.
When the source system already tracks state¶
In this scenario, the source system should already have some identifier for the
resolved state, and you should already have posted this value in the
source_incident_id
when you first created the Argus incident.
The API endpoint /api/v1/incidents/mine/
is useful in this regard. It
functions mostly the same as the /api/v1/incidents/
endpoint, but will only
ever look at incidents reported from the source system whose API token you are
currently using to access the API.
If your source system reports that it resolved a problem whose identifier was
42
, you can simply find the corresponding Argus incident by issuing a
GET request for /api/v1/incidents/mine/?source_incident_id=42
.
When the source system does not track internal state¶
In this case, things immediately become more involved. Your glue service needs a strategy to track state itself. Suggested strategies may be:
The glue service needs to track state in its own database.
The glue service can potentially calculate a hash value of incident attributes that will be the same for events that close an incident as for events that open an incident. This hash value can be used as the Argus incident’s
source_incident_id
, and then use the same strategy as for state-tracking source systems.The glue service can fetch the list of open Argus incidents posted by itself (from
/api/v1/incidents/mine/?open=true
), then use as complicated a custom algorithm as necessary to determine which of these Incidents match up with the resolving event it is currently processing.
Performing the close operation¶
Closing an open Argus incident normally entails changing the incident’s
end_time
attribute to a proper timestamp (representing the time the source
system detected that the incident had been resolved). However, Argus will not
simply allow you to set this value on an existing incident.
Instead, Argus keeps a log of events for each incident it tracks. When you created the original incident, a creation event was implicitly logged alongside it. An Argus incident is closed by posting a closing event to the incident’s event log. The closing event can contain its own description, if need be.
An incident with the id 27
can be closed by POST-ing a new event to
/api/v1/incidents/27/events/
:
{
"timestamp": "2020-12-11 15:57:00",
"type": "END",
"description": "Foobar was resolved somehow"
}
You should only ever use the END
event type to indicate that the incident
was resolved from the source system. The available types of events are
documented in Incident endpoints.
On stateful incidents vs. stateless incidents¶
Argus incidents are primarily stateful, but the concept of stateless incidents is also supported. The difference may not be immediately obvious, and depending on your needs, stateless incidents may seem useless.
Stateful incidents¶
A stateful incident, by definition, has an extent in time. The incident began at some point in time, and ended (or will end) at a later point in time. The state of an such an incident is therefore either open or closed.
An incident must always have a start_time
value. If a definite end_time
value has not been set for it yet, its state is considered open. Once a
definite end_time
value is set, it is considered closed.
Stateless incidents¶
A stateless incident only represents something that happened at a single point in time, and otherwise has no extent in time. It can never be considered to be open nor closed.
Whether stateless incidents are useful to you depends on your needs and the source systems you want to integrate with Argus. Some source systems will generate alerts that are just one-off notifications, and are not considered to represent a state or an ongoing problem.
One such example is from Network Administration Visualized (NAV), which will send one-off early warnings that devices have stopped responding to ICMP ping requests. These are sent a few minutes in advance of NAV actually declaring the device to be down/non-responsive. If several such warnings messages are sent, while the device is never actually declared to be down, this may indicate that there is a problem with “flapping”, even though the device appears to be responding most of the time.
Stateless incidents can be matched by notification profiles, if so desired. The Argus API incidents endpoint (and the Argus UI) will, by default, only show open/stateful incidents unless explicitly instructed to also include stateless incidents. Normally, open stateful incidents are the ones you want to act on.
Representation¶
Internally, to represent a stateful incident that is still open, the special
value infinity
is used as the value of end_time
. This signals that the
incident is expected to end at some unknown point in the future, and is quite
useful when doing time-based queries on stateful incidents.
Conversely, stateless incidents will never have an end_time
value, which
means that these incidents explicitly set this to a null
value.
These special values are also exposed through the API.