Writing your own glue services¶
Writing a glue service has a few requirements:
Knowing how to integrate with your monitoring system so that your glue service will be notified of all alerts.
Knowing how to use the Argus API.
We can only provide hints on how to approach the first requirement. This guide will therefore mostly focus on the second.
How to get alerts from your monitoring system¶
This depends entirely on your source system and what mechanisms it provides for this. Some systems, like NAV, will provide scripting hooks for receiving raw streams of all alerts and state changes that occur. Others will provide scripting hooks as part of their notification systems. Some systems may even be as simple as cron jobs that only send e-mails on errors and keep no internal state.
Things you need to consider:
Does your system keep track of its own alert states, or is every notification stateless? If your system keeps state, what is the state identifier? This identifier should be part of the incident posted to Argus, as a means of keeping Argus in sync with your source system.
Do you need or want to backfill a history of incidents from your source system, from before your integration with Argus, or do you only care about new incidents?
What happens if the communication between your glue service and Argus temporarily breaks down? How will you synchronize the Argus incident database with everything that happened in your source system during this outage? Strategies may include:
Your glue service maintains its own persistent queue of unsynchronized incidents and makes sure to flush this once communication with Argus is restored (a sketch of this follows after this list).
Your source system maintains its own logs of state changes, which can be used as a data source to synchronize Argus once communication is restored.
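As an illustration of the first strategy, here is a minimal sketch that keeps undelivered incident payloads in a local SQLite-backed queue and flushes them whenever Argus is reachable again. It uses Python with the third-party requests library; the base URL, the token header and all helper names are hypothetical placeholders rather than anything defined by Argus or PyArgus.

import json
import sqlite3

import requests

ARGUS_API = "https://argus.example.org/api/v1"  # placeholder for your own deployment
HEADERS = {"Authorization": "Token <source-system-token>"}  # assumed token-based auth

db = sqlite3.connect("glue_queue.db")
db.execute("CREATE TABLE IF NOT EXISTS queue (id INTEGER PRIMARY KEY, payload TEXT)")


def enqueue(payload: dict):
    """Store an incident payload locally until it can be delivered to Argus."""
    db.execute("INSERT INTO queue (payload) VALUES (?)", (json.dumps(payload),))
    db.commit()


def flush():
    """Try to deliver every queued payload; keep the ones that still fail."""
    for row_id, payload in db.execute("SELECT id, payload FROM queue").fetchall():
        try:
            response = requests.post(
                f"{ARGUS_API}/incidents/",
                headers=HEADERS,
                json=json.loads(payload),
                timeout=10,
            )
            response.raise_for_status()
        except requests.RequestException:
            continue  # Argus still unreachable; retry this payload on the next flush
        db.execute("DELETE FROM queue WHERE id = ?", (row_id,))
        db.commit()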
Argus API access libraries¶
If you prefer to work with Python, you are in luck: There is already a Python library to help you access and consume the Argus API, called PyArgus (available on PyPI as argus-api-client).
Incidents and the Argus API¶
Argus models Incidents. An incident will, in most cases, be stateful, meaning it's either open or closed, and has both a start timestamp and an end timestamp. An open, stateful incident is represented by a value of infinity in its end timestamp. Stateless incidents are also supported by Argus, but are not the main focus of the API and UI (see On stateful incidents vs. stateless incidents for details).
An incident has a description, which is a string of text, usually derived from the source system, to describe a problem that happened.
An incident can have any number of tag values. These are useful metadata for categorizing an incident, and thereby for filtering it in either the Argus frontend dashboard or in a user's notification profile.
A glue service mainly concerns itself with:
Creating new incidents in Argus whenever a source system reports a problem.
Describing and tagging created incidents in an expressive, meaningful way for the users’ consumption.
Closing existing Argus incidents it has created, when the source system reports that a problem has been resolved.
Creating a new incident¶
Creating a new incident in Argus is done by POST-ing a JSON payload to the REST API endpoint /api/v1/incidents/ (see the Incident endpoints documentation).
At minimum, you need to provide these incident attributes:
description
start_time
tags
To indicate that this incident is stateful (meaning it is waiting to be resolved), you must also give the end_time attribute the value "infinity". If you don't do this, end_time will default to a null value, which means this incident is stateless and does not need to be resolved.
Optionally, you may want to provide these attributes as well:
source_incident_id: An identifier that can be used to match up this incident with some state/alert/incident in the source system in the future. If provided, Argus will enforce uniqueness of source incident identifiers.
details_url: A relative (or fully-qualified) URL to a page in the source system's web-based user interface, which will give more details about this specific incident.
{
    "description": "foobar-sw.example.org stopped responding to ping requests",
    "source_incident_id": "42",
    "details_url": "/alerts/detail/42/",
    "start_time": "2020-12-11 15:50:42",
    "end_time": "infinity",
    "tags": [
        {"tag": "host=foobar-sw.example.org"},
        {"tag": "location=Campus Rotvoll"},
        {"tag": "customer=himunkholmen.no"}
    ]
}
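As a sketch, the same payload could be posted with Python and the third-party requests library. The base URL and token are placeholders for your own deployment, the API is assumed to use token-based authentication, and the pk field read from the response is assumed to be how the created incident is identified in later requests.

import requests

ARGUS_API = "https://argus.example.org/api/v1"  # placeholder for your own deployment
HEADERS = {"Authorization": "Token <source-system-token>"}  # assumed token-based auth

incident = {
    "description": "foobar-sw.example.org stopped responding to ping requests",
    "source_incident_id": "42",
    "details_url": "/alerts/detail/42/",
    "start_time": "2020-12-11 15:50:42",
    "end_time": "infinity",
    "tags": [
        {"tag": "host=foobar-sw.example.org"},
        {"tag": "location=Campus Rotvoll"},
        {"tag": "customer=himunkholmen.no"},
    ],
}

response = requests.post(f"{ARGUS_API}/incidents/", headers=HEADERS, json=incident, timeout=10)
response.raise_for_status()
created = response.json()
argus_pk = created["pk"]  # assumed response field; keep it if you need to update the incident later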
Describing an incident through tags¶
Using tags to describe your incidents is a good idea. Tags can be used to describe almost any structured or unstructured incident metadata not covered by the standard incident attributes.
Examples include:
The hostname of an affected device.
A reference to an affected customer.
A geographical location where the problem occurred.
A reference to an affected service instance.
A URL to a relevant section of the affected service’s operating instructions.
This kind of metadata will enable:
Your first line of support to correctly assess, prioritize and react to incidents.
Tag metadata to be carried forward automatically to tickets, once Argus gains proper integration with ticketing systems.
Your devops teams to create notification filters specifically for the services, devices or customers they care about.
Generation of reports and statistics on the number and duration of incidents per service, per customer, per device and so forth.
Tag syntax¶
Incident tags are defined syntactically as key=value. This syntax is employed by the API both when posting and retrieving incident tags. Any alphanumeric string (excluding spaces and the = character itself) can be used as a tag key, whereas the value can be any string.
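For illustration only, a glue service might assemble such a tag list from key/value pairs; the helper name and the keys shown here are merely examples, not a required convention.

def make_tags(**values):
    """Build an Argus tag list using the key=value syntax."""
    return [{"tag": f"{key}={value}"} for key, value in values.items()]


tags = make_tags(host="foobar-sw.example.org", location="Campus Rotvoll")
# -> [{"tag": "host=foobar-sw.example.org"}, {"tag": "location=Campus Rotvoll"}]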
On the importance of tag conventions¶
When integrating multiple types of source systems into Argus, it is important to implement a convention for which tag keys to use, so that the incidents reported by your monitoring systems are consistent.
You may, for example, have two separate monitoring systems that monitor different aspects of the device foobar-sw.example.org. If one system reports incidents with the tag host=foobar-sw.example.org, and the other uses fqdn=foobar-sw.example.org, then you will just have a mess on your hands.
Closing incidents that have been resolved¶
Once the source system reports an incident as resolved, the glue service needs to close the corresponding Argus incident. But, how can it keep track of which Argus incident maps to the resolved problem?
There could be a multitude of approaches to this, but in essence, there are two distinct scenarios that come into play:
The source system already keeps track of its own state.
The source system does not keep track of state.
When the source system already tracks state¶
In this scenario, the source system should already have some identifier for the resolved state, and you should already have posted this value in the source_incident_id attribute when you first created the Argus incident.
The API endpoint /api/v1/incidents/mine/ is useful in this regard. It functions mostly the same as the /api/v1/incidents/ endpoint, but will only ever look at incidents reported from the source system whose API token you are currently using to access the API.
If your source system reports that it resolved a problem whose identifier was 42, you can simply find the corresponding Argus incident by issuing a GET request for /api/v1/incidents/mine/?source_incident_id=42.
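A minimal sketch of such a lookup with the requests library follows; the base URL and token are placeholders, and the listing is assumed to be paginated under a results key (adjust to what your Argus version actually returns).

import requests

ARGUS_API = "https://argus.example.org/api/v1"  # placeholder for your own deployment
HEADERS = {"Authorization": "Token <source-system-token>"}  # assumed token-based auth


def find_my_incident(source_incident_id: str):
    """Find the Argus incident this source system created for the given identifier."""
    response = requests.get(
        f"{ARGUS_API}/incidents/mine/",
        headers=HEADERS,
        params={"source_incident_id": source_incident_id},
        timeout=10,
    )
    response.raise_for_status()
    data = response.json()
    # Depending on the API version, the listing may be paginated under "results"
    incidents = data["results"] if isinstance(data, dict) else data
    return incidents[0] if incidents else None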
When the source system does not track internal state¶
In this case, things immediately become more involved. Your glue service needs a strategy to track state itself. Suggested strategies include:
The glue service tracks state in its own database.
The glue service calculates a hash value of incident attributes that will be the same for the event that closes an incident as for the event that opened it. This hash value can be used as the Argus incident's source_incident_id, and the glue service can then use the same strategy as for state-tracking source systems (a sketch of this follows after this list).
The glue service fetches the list of open Argus incidents posted by itself (from /api/v1/incidents/mine/?open=true), then uses as complicated a custom algorithm as necessary to determine which of these incidents match up with the resolving event it is currently processing.
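A minimal sketch of the hashing strategy: derive a deterministic identifier from whichever attributes are identical in the opening and the resolving event (exactly which attributes those are depends entirely on your source system).

import hashlib


def make_source_incident_id(*stable_attributes: str) -> str:
    """Derive a deterministic identifier from attributes shared by the opening
    and the resolving event, e.g. hostname and check name."""
    key = "\x00".join(stable_attributes)
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


# The same value is produced both when the problem opens and when it resolves,
# so it can serve as source_incident_id in both cases:
incident_id = make_source_incident_id("foobar-sw.example.org", "ping")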
Performing the close operation¶
Closing an open Argus incident normally entails changing the incident's end_time attribute to a proper timestamp (representing the time the source system detected that the incident had been resolved). However, Argus will not simply allow you to set this value on an existing incident.
Instead, Argus keeps a log of events for each incident it tracks. When you created the original incident, a creation event was implicitly logged alongside it. An Argus incident is closed by posting a closing event to the incident’s event log. The closing event can contain its own description, if need be.
An incident with the id 27 can be closed by POST-ing a new event to /api/v1/incidents/27/events/:
{
    "timestamp": "2020-12-11 15:57:00",
    "type": "END",
    "description": "Foobar was resolved somehow"
}
You should only ever use the END event type to indicate that the incident was resolved in the source system. The available event types are documented in the API endpoint documentation.
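A sketch of the close operation with the requests library (base URL and token are placeholders as before); the incident primary key is the one Argus returned when the incident was created.

import requests

ARGUS_API = "https://argus.example.org/api/v1"  # placeholder for your own deployment
HEADERS = {"Authorization": "Token <source-system-token>"}  # assumed token-based auth


def close_incident(incident_pk: int, timestamp: str, description: str = ""):
    """Close an open incident by posting an END event to its event log."""
    event = {"timestamp": timestamp, "type": "END", "description": description}
    response = requests.post(
        f"{ARGUS_API}/incidents/{incident_pk}/events/",
        headers=HEADERS,
        json=event,
        timeout=10,
    )
    response.raise_for_status()
    return response.json()


close_incident(27, "2020-12-11 15:57:00", "Foobar was resolved somehow")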
On stateful incidents vs. stateless incidents¶
Argus incidents are primarily stateful, but the concept of stateless incidents is also supported. The difference may not be immediately obvious, and depending on your needs, stateless incidents may seem useless.
Stateful incidents¶
A stateful incident, by definition, has an extent in time. The incident began at some point in time, and ended (or will end) at a later point in time. The state of such an incident is therefore either open or closed.
An incident must always have a start_time value. If a definite end_time value has not been set for it yet, its state is considered open. Once a definite end_time value is set, it is considered closed.
Stateless incidents¶
A stateless incident only represents something that happened at a single point in time, and otherwise has no extent in time. It can never be considered either open or closed.
Whether stateless incidents are useful to you depends on your needs and the source systems you want to integrate with Argus. Some source systems will generate alerts that are just one-off notifications, and are not considered to represent a state or an ongoing problem.
One such example is from Network Administration Visualized (NAV), which will send one-off early warnings that devices have stopped responding to ICMP ping requests. These are sent a few minutes in advance of NAV actually declaring the device to be down/non-responsive. If several such warning messages are sent while the device is never actually declared to be down, this may indicate a problem with “flapping”, even though the device appears to be responding most of the time.
Stateless incidents can be matched by notification profiles, if so desired. The Argus API incidents endpoint (and the Argus UI) will, by default, only show open/stateful incidents unless explicitly instructed to also include stateless incidents. Normally, open stateful incidents are the ones you want to act on.
Representation¶
Internally, to represent a stateful incident that is still open, the special value infinity is used as the value of end_time. This signals that the incident is expected to end at some unknown point in the future, and is quite useful when doing time-based queries on stateful incidents.
Conversely, stateless incidents never have an end_time; this is represented by explicitly setting end_time to a null value.
These special values are also exposed through the API.
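For illustration, a client consuming the API could classify incidents from their end_time values like this (a sketch, assuming end_time is serialized as the string "infinity", a timestamp string, or null/None).

def incident_state(incident: dict) -> str:
    """Classify an incident from the end_time value exposed by the API."""
    end_time = incident.get("end_time")
    if end_time is None:
        return "stateless"  # no end_time at all: a point-in-time incident
    if end_time == "infinity":
        return "open"  # stateful, not yet resolved
    return "closed"  # stateful, resolved at the given timestamp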