Best-practice Handling of Errors
Motivation
Historically, errors produced log messages (on the terminal, or in
fgfs.log). Regular users typically don’t see these messages for
desktop software, and may regard using the terminal as difficult. However,
aircraft and core developers typically need the log information in order to
diagnose problems.
Furthermore, when problems are only reported at lower log levels, users may not even be aware that there is a problem, or may perceive the entire simulator as broken when they are simply facing a problem with an unmaintained aircraft or their local configuration.
The aim of the error-reporting system is to collect sufficient information internally about errors for a developer to diagnose them, while surfacing a summary to the user so that they understand the general problem and can take remedial steps. This might mean opening a bug report, checking for an updated version of the aircraft, or adjusting their local setup.
Approach
The error reporting system defines a low-level callback in SimGear, which can
be called from arbitrary code when an error occurs. Information on the kind
of error, a detailed message, and the location (for instance a file path or
property XML node) can be provided. The callback is
simgear::reportFailure().
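As a rough illustration of this kind of callback hook, the following self-contained sketch models the idea with stand-in types. Everything in the `sketch` namespace (`FailureKind`, `Report`, `setFailureCallback`, and this `reportFailure`) is hypothetical and only mirrors the concepts described above. The real declarations and signatures live in the SimGear headers and should be checked there.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins that mirror the concepts in the text.
// These are NOT the SimGear declarations.
namespace sketch {

enum class FailureKind { NotFound, BadData, IOError };

struct Report {
    FailureKind kind;
    std::string message;   // detailed message, aimed at developers
    std::string location;  // e.g. a file path or a description of an XML node
};

using FailureCallback = std::function<void(const Report&)>;

// The single registered callback; an empty function means nobody is listening.
// (Not thread-safe; a real implementation would need to consider that.)
inline FailureCallback& callbackSlot()
{
    static FailureCallback cb;
    return cb;
}

inline void setFailureCallback(FailureCallback cb)
{
    callbackSlot() = std::move(cb);
}

// Callable from arbitrary code: without a registered callback the report is dropped.
inline void reportFailure(FailureKind kind, const std::string& message,
                          const std::string& location)
{
    const FailureCallback& cb = callbackSlot();
    if (cb) {
        cb(Report{kind, message, location});
    }
}

} // namespace sketch

// Usage: returns how many reports reached the collector.
inline int demoCallback()
{
    std::vector<sketch::Report> collected;
    // No callback registered yet, so this report is silently ignored.
    sketch::reportFailure(sketch::FailureKind::NotFound, "ignored", "a.xml");
    sketch::setFailureCallback(
        [&collected](const sketch::Report& r) { collected.push_back(r); });
    sketch::reportFailure(sketch::FailureKind::BadData, "unexpected element", "b.xml");
    sketch::setFailureCallback(nullptr);  // unregister before 'collected' goes away
    return static_cast<int>(collected.size());
}
```

In this sketch the low-level library knows nothing about the collector. It only forwards to whatever callback the application installed, which is what allows the reporting call to sit in library code.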
If error reporting is active, a dedicated subsystem in FlightGear
(flightgear::ErrorReporter) collects error reports, and combines
them. This is important to avoid spamming the user with a lot of information
if many related errors occur, which is very likely. When one or more errors
occur in an aggregation category (such as “the current aircraft”, or a
particular scenery path), a message is shown to the user after a timeout. The
timeout is scaled so that all the relevant error occurrences have had time to
trigger before any user interface is shown.
If no callback is registered, or the subsystem is disabled, error reports are ignored.
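The aggregation step described above can be sketched as follows. This is a simplified model and not the flightgear::ErrorReporter implementation: the `Aggregator` class, the category strings, and the fixed delay are assumptions made for illustration, whereas the text says the real timeout is scaled.

```cpp
#include <chrono>
#include <map>
#include <string>
#include <vector>

namespace sketch {

using Clock = std::chrono::steady_clock;

// Hypothetical occurrence record; the category is the aggregation key.
struct Occurrence {
    std::string category;  // e.g. "current-aircraft" or a scenery path
    std::string message;
};

class Aggregator {
public:
    explicit Aggregator(Clock::duration delay) : _delay(delay) {}

    // Record an occurrence. The first one in a category starts that category's timer.
    void add(const Occurrence& occ, Clock::time_point now)
    {
        auto it = _pending.find(occ.category);
        if (it == _pending.end()) {
            it = _pending.emplace(occ.category, Group{now, {}}).first;
        }
        it->second.messages.push_back(occ.message);
    }

    // Return one summary line per category whose delay has elapsed,
    // and stop tracking those categories.
    std::vector<std::string> takeReady(Clock::time_point now)
    {
        std::vector<std::string> ready;
        for (auto it = _pending.begin(); it != _pending.end();) {
            if (now - it->second.firstSeen >= _delay) {
                const auto count = it->second.messages.size();
                ready.push_back(it->first + " (" + std::to_string(count) + " errors)");
                it = _pending.erase(it);
            } else {
                ++it;
            }
        }
        return ready;
    }

private:
    struct Group {
        Clock::time_point firstSeen;
        std::vector<std::string> messages;
    };

    Clock::duration _delay;
    std::map<std::string, Group> _pending;
};

} // namespace sketch

// Usage: two aircraft errors collapse into one summary; the scenery error is still waiting.
inline std::vector<std::string> demoAggregate()
{
    using std::chrono::seconds;
    sketch::Aggregator agg(seconds(5));
    const sketch::Clock::time_point t0{};
    agg.add({"current-aircraft", "missing texture"}, t0);
    agg.add({"current-aircraft", "bad animation"}, t0 + seconds(1));
    agg.add({"scenery", "corrupt tile"}, t0 + seconds(4));
    return agg.takeReady(t0 + seconds(6));
}
```

Passing the current time in explicitly keeps the sketch deterministic. A subsystem would more likely poll from its update loop.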
Usage
Errors should be reported at whichever place runs a particular semantic
operation, such as loading a configuration file or model, or triggering some
command. Typically, this interacts with C++ exception handling: the relevant
code is wrapped in a try ... catch block, and if an exception is caught,
an error is reported.
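A minimal sketch of that pattern, using stand-in functions instead of the real SimGear calls, might look like this. `reportLoadFailure`, `parseConfig`, and `loadConfig` are hypothetical names introduced only for the example.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

namespace sketch {

// Hypothetical sink standing in for the real reporting call.
inline std::vector<std::string>& reportedFailures()
{
    static std::vector<std::string> reports;
    return reports;
}

inline void reportLoadFailure(const std::string& message, const std::string& path)
{
    reportedFailures().push_back(path + ": " + message);
}

// Hypothetical low-level parser that signals problems by throwing.
inline int parseConfig(const std::string& path)
{
    if (path.empty()) {
        throw std::runtime_error("empty path");
    }
    return static_cast<int>(path.size());  // placeholder for real parsed data
}

// The semantic operation ("load this configuration file") owns the try/catch,
// so it is the place that reports the error.
inline bool loadConfig(const std::string& path, int& valueOut)
{
    try {
        valueOut = parseConfig(path);
        return true;
    } catch (const std::exception& e) {
        reportLoadFailure(e.what(), path);
        return false;
    }
}

} // namespace sketch
```

Reporting at this level, and not deep inside the parser, means the report can name the file the user would recognise, while the exception message carries the low-level detail.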
Most existing places which report a log message at SG_ALERT level are
good candidates for reporting an error instead.
The error reporter is enabled or disabled using the
/sim/error-reporter/enabled property. Its default value is defined
in defaults.xml.
Context
To assist in understanding and attributing the error, context values are tracked on a per-thread basis. A helper class exists to push and pop these values based on C++ object scope. The context mechanism allows higher-level code to add information which is collected when an error occurs at a lower level. For example, when parsing an autopilot XML file, the report for an error in a condition or expression could include the name of the autopilot component as context.
Use simgear::ErrorReportContext to push and pop context values.
Typically, the aggregation code in
flightgear::ErrorReporter::getAggregateForOccurence() uses the
context information to make the best guess about the underlying origin of the
error, using some rules and heuristics.
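The scope-based, per-thread context idea can be modelled with a small RAII helper. The sketch below is an illustration under assumptions, not the simgear::ErrorReportContext implementation: `ScopedContext`, `contextStack`, `describeContext`, and the key names are all hypothetical.

```cpp
#include <string>
#include <utility>
#include <vector>

namespace sketch {

using ContextEntry = std::pair<std::string, std::string>;

// Each thread gets its own stack of key/value context entries.
inline std::vector<ContextEntry>& contextStack()
{
    thread_local std::vector<ContextEntry> stack;
    return stack;
}

// Pushes on construction and pops on destruction, so a value is visible
// for exactly as long as the enclosing C++ scope is active.
class ScopedContext {
public:
    ScopedContext(std::string key, std::string value)
    {
        contextStack().emplace_back(std::move(key), std::move(value));
    }
    ~ScopedContext() { contextStack().pop_back(); }

    ScopedContext(const ScopedContext&) = delete;
    ScopedContext& operator=(const ScopedContext&) = delete;
};

// What a low-level reporter might capture at the moment of failure.
inline std::string describeContext()
{
    std::string out;
    for (const auto& entry : contextStack()) {
        if (!out.empty()) {
            out += ", ";
        }
        out += entry.first + "=" + entry.second;
    }
    return out;
}

} // namespace sketch

// Usage: higher-level code adds context; lower-level code reads it when an error occurs.
inline std::string demoContext()
{
    std::string seen;
    sketch::ScopedContext file("autopilot-file", "ap.xml");
    {
        sketch::ScopedContext component("component", "altitude-hold");
        seen = sketch::describeContext();  // both entries are visible here
    }  // "component" is popped at this point
    return seen;
}
```

Because the stack is thread-local, code loading a model on a worker thread does not see, or disturb, the context pushed by the main thread.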
Overhead
Reporting an error is moderately expensive, but it is assumed that this never happens on a critical path. Code which reports errors should typically disable itself after the error is reported, since repeating an error on each frame or update is unhelpful. This often means adding some tracking state so that failed components know they should skip further processing. The error collection code does attempt to suppress duplicates so that only a single report is surfaced to the UI/Sentry, even if multiple related occurrences exist.
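One way to add such tracking state is a simple failed flag, sketched below together with a basic duplicate filter. Both classes are hypothetical illustrations. The counter stands in for the real reporting call, and the filter is not how the error collection code suppresses duplicates in practice.

```cpp
#include <set>
#include <stdexcept>
#include <string>

namespace sketch {

// Hypothetical component that is updated every frame.
class FrameComponent {
public:
    void update(double input)
    {
        if (_failed) {
            return;  // already reported once; skip all further processing
        }
        try {
            step(input);
        } catch (const std::exception&) {
            _failed = true;   // tracking state: never retry, never re-report
            ++_reportsSent;   // stands in for the real reporting call
        }
    }

    int reportsSent() const { return _reportsSent; }
    double total() const { return _total; }

private:
    void step(double input)
    {
        if (input < 0.0) {
            throw std::domain_error("negative input");
        }
        _total += input;
    }

    bool _failed = false;
    int _reportsSent = 0;
    double _total = 0.0;
};

// Collector-side safety net: surface each distinct key only once.
class DuplicateFilter {
public:
    // Returns true only the first time a given key is seen.
    bool shouldSurface(const std::string& key) { return _seen.insert(key).second; }

private:
    std::set<std::string> _seen;
};

} // namespace sketch

// Usage: the component fails on the second update and stays disabled afterwards.
inline sketch::FrameComponent demoDisable()
{
    sketch::FrameComponent c;
    c.update(1.0);
    c.update(-1.0);  // fails and reports once
    c.update(-1.0);  // skipped
    c.update(2.0);   // skipped, even though the input is valid again
    return c;
}
```

Whether a component should stay disabled permanently or recover after a reload is a design choice for each component. The sketch shows only the simplest case.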