Error Logging, Monitoring, and Resolution

One of the most powerful and distinctive aspects of the Redpoint infrastructure is its comprehensive, full-stack process for error logging, monitoring, and resolution. We believe this process is crucial to supporting a continuous deployment (CD) model in place of manual QA, because it automates the aggregation, prioritization, and alerting of errors directly to the development team that is responsible for fixing them.

Additionally, it greatly facilitates the troubleshooting of those errors and automatically closes the associated work ticket(s) and alert(s) when the errors are no longer occurring.

Architecture

Logging

It all starts with logging, which works out of the box for both the frontend and the backend. As you develop your application, simply keep throwing useful exceptions in your code and they will automatically be reported to New Relic (or an alternative).

We highly recommend the following practices for throwing exceptions in order to maximize the value of error aggregation:

  • Always use static strings in the exception message and do not insert dynamic values into the string. For example, prefer “Failed to insert record into users table” over “Failed to insert record into users table: ${dbErrorMessage}”.

  • Provide the dynamic values instead as metadata in the exception. Such values should be relevant to the failure, such as the DB error message, an input value that was invalid, the resource ID(s) that were being operated on, etc.

This approach aggregates errors on the static message string and makes valuable metadata analytics available for each aggregated error within New Relic. For example, if an API endpoint is regularly failing because it cannot save a resource, and 90% of the failures relate to a single resource ID, that will be immediately apparent within New Relic analytics for that aggregated error. That gives the development team valuable insight into how to reproduce the problem and investigate the root cause.
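The two practices above can be illustrated with a minimal sketch. The `AppError` class, the `insertUser` stand-in, and the metadata keys are hypothetical names chosen for illustration; they are not part of the Redpoint tooling or the New Relic API.

```typescript
// Hypothetical error class: a static message for aggregation,
// with dynamic values carried separately as metadata.
type ErrorMetadata = Record<string, string | number | boolean | null>;

class AppError extends Error {
  readonly metadata: ErrorMetadata;

  constructor(message: string, metadata: ErrorMetadata = {}) {
    super(message);
    this.name = "AppError";
    this.metadata = metadata;
  }
}

// Illustrative stand-in for a real database call; it always fails here.
function insertUser(userId: string): void {
  throw new Error("duplicate key value violates unique constraint (id=" + userId + ")");
}

function createUser(userId: string): void {
  try {
    insertUser(userId);
  } catch (err) {
    // The message stays static so every occurrence aggregates together.
    // The DB error and the resource ID travel as metadata instead.
    throw new AppError("Failed to insert record into users table", {
      dbErrorMessage: err instanceof Error ? err.message : String(err),
      userId,
    });
  }
}
```

Because the message never varies, every failure of this insert groups under one aggregated error, while `dbErrorMessage` and `userId` remain available for per-error analysis.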

User Context

The global error handler we provide out of the box always includes the user ID (but never any PII) for any error that occurs within a user’s session. This unlocks several valuable features:

  • For each unique error, we know exactly how many unique users it affects. We use that information to automatically prioritize the associated OpsGenie alert and Jira ticket such that errors which are causing widespread problems have a higher priority.

  • Within New Relic analytics, the user ID is a piece of metadata like any other, so it is immediately apparent whether a particular error predominantly affects a particular user or set of users, or whether it affects users indiscriminately. This is highly valuable information for the development team when troubleshooting the root cause.
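As a rough sketch of how the unique-user count per error could be derived from logged events, consider the following. The `LoggedError` shape and the `countAffectedUsers` function are assumptions made for illustration, not the actual handler or query used by the platform.

```typescript
// Hypothetical shape of a logged error event.
interface LoggedError {
  message: string; // static exception message used for aggregation
  userId?: string; // opaque user ID (never PII); absent outside a user session
}

// Counts the distinct users affected by each aggregated error message.
function countAffectedUsers(events: LoggedError[]): Map<string, number> {
  const usersByMessage = new Map<string, Set<string>>();
  for (const event of events) {
    const users = usersByMessage.get(event.message) ?? new Set<string>();
    if (event.userId !== undefined) {
      users.add(event.userId);
    }
    usersByMessage.set(event.message, users);
  }

  const counts = new Map<string, number>();
  usersByMessage.forEach((users, message) => counts.set(message, users.size));
  return counts;
}
```

Repeated occurrences for the same user count once, so an error hitting one user a thousand times ranks below an error hitting a thousand users once.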

Monitoring and Alerting

Errors are monitored and alerted on automatically by a synchronization process that we have built as an AWS Lambda function. It runs on a 1-minute schedule, providing near real-time feedback to the development team. The function queries the error log data source (e.g. New Relic) for all errors that have occurred over the last 24 hours, normalizes and aggregates them, automatically prioritizes them based on the affected user percentage (affected users / total active users), and then synchronizes them to both OpsGenie and Jira.
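The prioritization step can be sketched as follows. The ratio matches the formula above, expressed as a percentage; the threshold values and the mapping to P1 through P5 levels are assumptions chosen for illustration, since the actual cutoffs are not stated here.

```typescript
type Priority = "P1" | "P2" | "P3" | "P4" | "P5";

// Affected user percentage = affected users / total active users, as a percentage.
function affectedUserPercentage(affectedUsers: number, totalActiveUsers: number): number {
  if (totalActiveUsers <= 0) {
    return 0; // avoid dividing by zero when there are no active users
  }
  return (affectedUsers / totalActiveUsers) * 100;
}

// Hypothetical thresholds; the real cutoffs would be tuned per project.
function priorityFor(percentage: number): Priority {
  if (percentage >= 25) return "P1";
  if (percentage >= 10) return "P2";
  if (percentage >= 1) return "P3";
  if (percentage >= 0.1) return "P4";
  return "P5";
}
```

For instance, under these assumed thresholds an error affecting 50 of 1,000 active users (5%) would map to P3.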

The OpsGenie alert serves as the call to action, as in any alerting system. The associated Jira ticket tracks the actual work done to resolve the error. The Resolution section below covers this in more detail.

As each error may increase or decrease in frequency and change in other ways, it is continuously kept in sync with its associated OpsGenie alert and Jira ticket. For example, if a low-frequency error suddenly becomes much more common, then the prioritization of the alert and ticket will increase accordingly, thus automatically driving more attention to the problem.
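One way to picture this ongoing synchronization is as a planning step that compares the freshly aggregated errors with what is already tracked. The sketch below assumes errors are keyed by their static message and that only priority changes are synced; `TrackedError`, `SyncOp`, and `planSync` are hypothetical names, and no OpsGenie or Jira API calls are shown.

```typescript
interface TrackedError {
  message: string;  // static message, used as the aggregation key
  priority: string; // priority computed during the current run
}

type SyncOp =
  | { kind: "create"; message: string; priority: string }
  | { kind: "updatePriority"; message: string; from: string; to: string };

// `existing` maps each already-tracked error message to its current priority.
function planSync(current: TrackedError[], existing: Map<string, string>): SyncOp[] {
  const ops: SyncOp[] = [];
  for (const err of current) {
    const known = existing.get(err.message);
    if (known === undefined) {
      // First time this error is seen: create the alert and ticket.
      ops.push({ kind: "create", message: err.message, priority: err.priority });
    } else if (known !== err.priority) {
      // Frequency or reach changed: bring alert and ticket in line.
      ops.push({ kind: "updatePriority", message: err.message, from: known, to: err.priority });
    }
  }
  return ops;
}
```

Errors whose priority is unchanged produce no operation, which keeps a once-a-minute run cheap when nothing has moved.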

Resolution

The resolution of errors is the final piece of the process, in which a development team is responsible for responding to and fixing errors. Due to the automatic alerting described above, the development team knows when a new problem is introduced (via OpsGenie alert), and can always see the full set of active problems by looking at a specific Jira board (provisioned by us). This Jira board is also particularly useful for PMs and other stakeholders to see what problems are happening in production and to help prioritize them further beyond the automatic prioritization they are given.

Once a developer is assigned a Jira ticket that represents an error, they can navigate to the relevant New Relic analytics data for that error by clicking a link provided within the ticket. From there, they can troubleshoot the problem based on unique aspects of the error, such as which resource(s) it affects, whether it disproportionately affects certain users, whether it only occurs at certain times of day, etc.

Once the developer has identified the problem and created a code change in a git branch, they should open a PR in GitHub for review. The CI/CD process automatically generates a preview environment for the branch, allowing anyone in the company to test it live and help validate that the problem has been solved. We recommend adding a regression test as part of the code change, so that the CI/CD process also runs that test and confirms it passes.

Lastly, once the PR is reviewed and the fix is sufficiently validated, it should be merged. The merge will automatically trigger the CI/CD process, and the last step of that process deploys it to production!

No other action is needed, though it is advised that the same developer briefly monitor New Relic for that specific error to confirm it was in fact completely resolved. In any case, the synchronization process will automatically resolve the Jira ticket and OpsGenie alert once it recognizes that the error has not been seen for a sufficient period of time. The developer may optionally close the Jira ticket before that (just as they would for any piece of work), and it will remain closed unless the error resurfaces, in which case it will be re-opened (as opposed to creating a new ticket).
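The resolve and re-open behavior described above could be sketched as a small decision function. The 24-hour quiet period is an assumed value (the text only says "a sufficient period of time"), and `TicketState` and `decideSyncAction` are hypothetical names rather than the real implementation.

```typescript
type TicketStatus = "open" | "closed";
type SyncAction = "none" | "resolve" | "reopen";

interface TicketState {
  status: TicketStatus;
  lastSeenAt: number;      // epoch ms of the most recent occurrence of the error
  closedAt: number | null; // epoch ms when the ticket was closed; null while open
}

// Assumed quiet period; the actual duration is not specified.
const QUIET_PERIOD_MS = 24 * 60 * 60 * 1000;

function decideSyncAction(ticket: TicketState, now: number): SyncAction {
  if (ticket.status === "open") {
    // Resolve once the error has been silent for the whole quiet period.
    return now - ticket.lastSeenAt >= QUIET_PERIOD_MS ? "resolve" : "none";
  }
  // A closed ticket is re-opened, not duplicated, if the error resurfaces.
  const resurfaced = ticket.closedAt !== null && ticket.lastSeenAt > ticket.closedAt;
  return resurfaced ? "reopen" : "none";
}
```

Keying the decision on the last occurrence time means a manually closed ticket stays closed only as long as the error stays quiet.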