
How DataDome Automated Post-Mortem Creation with DomeScribe AI Agent

Writing post-mortems is a necessary but often frustrating task for on-call engineers. While these reports help teams learn and improve, they can also be time-consuming—especially when the process feels more like paperwork than problem-solving.

During a recent internal hackathon, we set out to change that. Our goal? To make post-mortem documentation faster and easier without sacrificing the insights that drive improvement. The result: DomeScribe, a Slackbot and AI agent that automates post-mortem creation in Notion.

In this blog, we’ll take you through the journey of building DomeScribe—how it streamlines our incident management workflow, the AWS services we used to bring it to life, and how you can replicate it to automate your own post-mortem process.

How we handle incident management at DataDome

At DataDome, incident management involves minimizing impact, resolving issues swiftly to maintain our Service Level Agreements (SLAs), and ensuring seamless customer experiences. When an incident occurs, it’s declared in a dedicated Slack channel, where the response is coordinated.

The first goal in any incident is to mitigate the impact. The team documents each action taken to create a clear timeline of what has been done and when, helping maintain transparency throughout the process.

Incidents are primarily raised by our internal monitoring systems, but can also be reported by the support team or our customers. Internal monitoring is key to detecting issues before they affect users, enabling proactive management and reducing the likelihood of customer-facing impact.

How we approach post-mortems

Post-mortems at DataDome follow a blameless approach, focusing on learning and improving rather than attributing fault. This fosters a collaborative culture where team members feel safe to share insights. Post-mortems serve as a valuable tool for understanding what went wrong and ensuring that follow-up actions are clearly outlined and communicated to customers.

Every post-mortem includes:

  • High-level summary & impact: A brief overview of the incident and its effects.
  • Incident timeline: A detailed sequence of events from detection to resolution.
  • Root-cause analysis: Identifying the underlying cause of the incident.
  • Actions tracking: Documenting corrective actions and preventive measures.
  • Lessons learned: Highlighting key takeaways to avoid similar incidents in the future.

By consistently conducting post-mortems, DataDome aims to learn from every incident, support follow-up action plans, and improve overall response quality. For a deeper dive into DataDome’s post-mortem culture, check out our article on how post-mortems help the team learn from incidents.

Post-mortems are more than documentation: they drive follow-up actions and strong customer communication. But creating them is tedious, from building timelines to pulling shared links. That’s why we built DomeScribe, an AI agent that automates the first draft, freeing engineers to focus on analysis rather than admin work. It provides a solid starting point that teams can refine and enrich with their own insights.

How to build an incident management AI Agent

Now that we’ve explored DataDome’s incident management processes, it’s time to get technical. In the next section, you’ll learn how to build a custom incident management AI agent—step by step.

DomeScribe’s anatomy & tech

Let’s explore how DomeScribe is hosted and the AWS services powering post-mortem generation.

DomeScribe is encapsulated in a Docker container stored in AWS Elastic Container Registry (ECR), with images built and pushed by a GitHub Actions workflow. From a scheduling perspective, it runs as a pod in an AWS Elastic Kubernetes Service (EKS) cluster.

DomeScribe does not require any heavy scaling, but running it on EKS gives us high availability while keeping it easy to deploy and maintain with ArgoCD.

An interesting advantage of Slack’s WebSocket Secure (WSS) communication is that it eliminates the need for a DNS record, reducing the Kubernetes resources required for deployment.

All it takes is a deployment, a few external secrets, and a service account that leverages IAM Roles for Service Accounts (IRSA) to enable communication with AWS Bedrock.
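To make this concrete, here is a minimal sketch of that setup, assuming a Python implementation with the slack_bolt SDK and boto3 (the actual DomeScribe codebase may differ). It starts a Socket Mode bot, which only needs an outbound WSS connection, and creates a Bedrock client whose credentials come from the IRSA role:

```python
import os

import boto3
from slack_bolt import App
from slack_bolt.adapter.socket_mode import SocketModeHandler

# Bot and app-level tokens are provided through the external secrets mounted into the pod.
app = App(token=os.environ["SLACK_BOT_TOKEN"])

# No explicit credentials: boto3's default chain picks up the IAM role
# injected by IRSA through the pod's service account. The region is illustrative.
bedrock_runtime = boto3.client(
    "bedrock-runtime", region_name=os.environ.get("AWS_REGION", "eu-west-1")
)

if __name__ == "__main__":
    # Socket Mode opens an outbound WSS connection to Slack, so the deployment
    # needs no Ingress, LoadBalancer Service, or DNS record.
    SocketModeHandler(app, os.environ["SLACK_APP_TOKEN"]).start()
```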

The backbone of DomeScribe is AWS Bedrock. As enterprises increasingly look to leverage Large Language Models (LLMs), Bedrock exposes a wide variety of Foundation Models (FMs) and acts as an alternative to OpenAI. The most common are Claude from Anthropic, Titan from Amazon, Mistral from Mistral AI, and Llama from Meta. After multiple tests, weighing both model availability and cost, we decided to deploy Llama 3.1 (the 405B version).
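As a side note, Bedrock’s catalog can be browsed programmatically, which is handy when comparing candidate models; here is a small boto3 sketch (not part of DomeScribe itself):

```python
import boto3

# The "bedrock" control-plane client lists the catalog;
# "bedrock-runtime" is the client used for inference.
bedrock = boto3.client("bedrock")

# Filter the catalog by provider to compare the available Llama variants.
for summary in bedrock.list_foundation_models(byProvider="Meta")["modelSummaries"]:
    print(summary["modelId"], summary.get("inferenceTypesSupported"))
```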

Finally, let’s talk about cost. DomeScribe’s expenses are minimal due to:

  • Its pod requires almost no computing resources.
  • With Bedrock’s pay-per-token pricing, the low volume of post-mortem requests (even during development) adds only a few cents to the bill.

How the different components interact

Now, let’s focus on how the components behind DomeScribe interact. DomeScribe acts as an intermediary between Slack, Bedrock, and Notion.


1.) Selection of meaningful messages: After an incident is resolved, the on-call engineer invokes DomeScribe with the /unroll command in Slack. Once the bot receives the call, it asks the on-call engineer to select the list of relevant Slack threads. In most cases, a single Slack thread will contain all the information related to a particular incident, but DomeScribe does allow the user to select multiple threads or messages.
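Building on the Socket Mode skeleton above, the entry point could look like this minimal sketch (the slash-command name comes from the article; the reply wording is illustrative):

```python
@app.command("/unroll")
def handle_unroll(ack, respond, command):
    # Acknowledge within 3 seconds so Slack does not display a timeout error.
    ack()
    # Ask the on-call engineer which thread(s) the post-mortem should be built from.
    respond(
        "Which thread(s) should I unroll? "
        "Reply with the permalink(s) of the relevant incident thread(s)."
    )
```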


2.) Thread message retrieval & refinement: DomeScribe then dives into each Slack thread to retrieve and format the content of the conversation. Raw Slack messages can lack meaningful context for post-mortem creation. Take Slack’s mention functionality as an example: @someone. When retrieved through the API, the raw messages do not contain the name of “someone”, but rather an internal identifier such as <@RAND_UUID>. This identifier provides no useful information to the LLM and may, in some cases, cause the model to hallucinate, so it is cleaned up during this refinement step.
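As an illustration of that refinement, the sketch below fetches a thread with the standard Slack Web API and swaps raw <@USER_ID> mentions for display names (the exact cleanup DomeScribe performs may go further):

```python
import re

MENTION_RE = re.compile(r"<@([A-Z0-9]+)>")

def fetch_thread_text(channel_id: str, thread_ts: str) -> str:
    """Retrieve a Slack thread and replace mention identifiers with readable names."""
    replies = app.client.conversations_replies(channel=channel_id, ts=thread_ts)

    def resolve(match: re.Match) -> str:
        # Look up the mentioned user and fall back gracefully if no display name is set.
        user = app.client.users_info(user=match.group(1))["user"]
        return "@" + (user["profile"].get("display_name") or user.get("real_name", "someone"))

    return "\n".join(
        MENTION_RE.sub(resolve, message.get("text", ""))
        for message in replies["messages"]
    )
```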

3.) Prompting AWS Bedrock: Once the messages are refined, we prepend a predefined prompt before sending them to AWS Bedrock. We’ll discuss the prompt later, but it ensures the LLM generates a comprehensive, structured post-mortem that is aligned with our template.
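Here is a hedged sketch of that call using Bedrock’s Converse API, reusing the bedrock_runtime client from the earlier sketch (the model ID and inference settings are illustrative and may differ from what we run; POSTMORTEM_PROMPT is the predefined prompt, sketched later in this post):

```python
def generate_postmortem(thread_text: str) -> str:
    """Send the refined Slack messages to Llama 3.1 405B on Bedrock and return the draft."""
    response = bedrock_runtime.converse(
        modelId="meta.llama3-1-405b-instruct-v1:0",  # illustrative; check the model IDs available in your region
        system=[{"text": POSTMORTEM_PROMPT}],        # the predefined prompt prepended to the messages
        messages=[{"role": "user", "content": [{"text": thread_text}]}],
        inferenceConfig={"maxTokens": 2048, "temperature": 0.2},
    )
    return response["output"]["message"]["content"][0]["text"]
```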

4.) Creation of the Notion page: Last but not least, the AWS Bedrock result is formatted and used to create an entry in our Notion post-mortem database.
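A minimal sketch with the official notion-client Python SDK; the database ID, the “Name” title property, and the single-paragraph body are placeholders for however your post-mortem database is structured:

```python
import os

from notion_client import Client

notion = Client(auth=os.environ["NOTION_TOKEN"])

def create_postmortem_page(title: str, markdown_body: str) -> None:
    """Create an entry in the Notion post-mortem database from the Bedrock output."""
    notion.pages.create(
        parent={"database_id": os.environ["NOTION_POSTMORTEM_DB_ID"]},
        properties={"Name": {"title": [{"text": {"content": title}}]}},
        # For brevity the Markdown is stored as a single paragraph block
        # (truncated to Notion's 2,000-character rich_text limit); a real
        # implementation would convert it into proper Notion blocks.
        children=[
            {
                "object": "block",
                "type": "paragraph",
                "paragraph": {"rich_text": [{"text": {"content": markdown_body[:2000]}}]},
            }
        ],
    )
```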


Prompt customization & LLM selection

Surprisingly, one of the biggest challenges in building DomeScribe was refining the prompt and selecting the right LLM. In this section, we’ll walk through the process that led to a well-structured and coherent post-mortem.

First, we collected Slack messages from past incidents to test and validate different LLMs and prompts with relevant examples. We selected incidents with varying complexities—some straightforward, others requiring a deeper knowledge of our infrastructure and business to understand.

Next, we set up a simple prompt asking the LLM to create a post-mortem with specific sections (introduction, timeline, root-cause analysis, etc.). We then refined the prompt to give the model more direction: it should act as an SRE engineer responsible for writing a post-mortem, it would receive a list of Slack messages related to an incident, and it should format the output in Markdown.

We paired our incident examples with an initial draft of the prompt and tested it across multiple Foundation Models on AWS Bedrock. After evaluating the results, we chose Llama 3.1 405B for its consistency, ability to handle complex information, and price.

We soon realized the model struggled to generate certain post-mortem sections. During incidents, engineers often take shortcuts in communication, implying details about our infrastructure and business. Since the LLM relies solely on Slack messages, this lack of context led to misunderstandings and occasional hallucinations.

As a consequence, we updated the prompt to explicitly ask the model to write three sections (a simplified sketch of the resulting prompt follows the list):

  • A summary of the incident with a maximum of 5 sentences.
  • A customer impact section that only includes details mentioned in the messages, such as the number of affected customers or the duration of the impact.
  • A timeline that covers the details of the incident and the actions taken by the on-call engineer, while ignoring irrelevant questions or messages.
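A simplified sketch of the kind of prompt this converged on (the exact production wording differs):

```python
POSTMORTEM_PROMPT = """You are an SRE engineer responsible for writing a blameless post-mortem.
You will receive a list of Slack messages related to an incident.
Write the post-mortem in Markdown with the following sections:

1. Summary: at most 5 sentences describing the incident.
2. Customer impact: only include details explicitly mentioned in the messages,
   such as the number of affected customers or the duration of the impact.
3. Timeline: cover the details of the incident and the actions taken by the
   on-call engineer; ignore questions or messages irrelevant to the incident.

Do not invent details that are not present in the messages."""
```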

By directing the LLM to focus on these three sections, we obtained more meaningful and useful post-mortems.

While not perfect, given the model’s limited knowledge of our business, the result serves as a strong starting point: it quickly drafts post-mortems and handles the most time-consuming sections.

Below is an example of one of the post-mortems it created for us.

[Example post-mortem generated by DomeScribe]

Save time with your own incident response AI agent

Building DomeScribe was faster and easier than expected. We had a production-ready solution in under two days. Powered by EKS, ECR, and Bedrock, the infrastructure was quickly set up, allowing us to focus on prompt engineering and refining the model with real-world testing.

The time savings for our engineers are difficult to quantify, but they can now focus on the post-mortem sections where they add the most value.

Best of all, the cost is negligible, making DomeScribe an efficient and practical solution for streamlining post-mortem writing.

*** This is a Security Bloggers Network syndicated blog from Blog – DataDome authored by Nicolas Spenlihauer. Read the original post at: https://datadome.co/engineering/how-datadome-automated-post-mortem-creation-with-domescribe-ai-agent/