Notifications are an essential part of the Slack user experience. Staying informed and engaged requires timely alerts for mentions and direct messages, and any hiccup in notification delivery erodes users' trust in Slack's reliability.
A notification passes through almost all of the systems in Slack's infrastructure on its way to its destination. A notification request travels through the web application, including the application logic and the monorepo shared by browser and desktop clients, then through the job queue and push service, interfacing with various third-party services before it reaches the many clients: iOS, Android, Desktop, and web.
Why Trace the Flow of Notifications?
Slack notification workflow has become increasingly complex with the introduction of features like Huddles and Canvas. Consequently, troubleshooting notification-related issues involves extensive, multi-day debugging efforts across multiple teams. Additionally, customer inquiries regarding notifications received the lowest NPS scores and required more resolution time than other concerns.
Addressing notification problems within Slack's systems proved challenging due to disparate logging pipelines and data formats across each system, which forced engineers to navigate various data formats and backends. The debugging process demanded extensive technical proficiency and spanned several days. Furthermore, the context in which events were logged differed among systems, further extending investigations and requiring expertise across the entire technology stack just to comprehend the sequence of events.
Tracing the flow of notifications aims to establish a uniform data format and shared semantics for events, making notification data easier to understand and troubleshoot.
Slack Notification Flow
The Slack team created a notification flow to capture all the events in a notification trace. This involved identifying every event in a trace, defining an idealized funnel, and fixing the context in which each event would be logged. They also had to agree on the semantics of a span and on the names of the events, which was challenging across different platforms.
Slack Notification flow is divided into two parts: Server Side and Client Side.
In the server-side notification flow:
- Notification Trigger: An event in the system warrants a notification, such as a new message or a mention.
- Event detection: The server's monitoring system detects and identifies the event as notification-worthy.
- User Notification: The server gets ready to notify the user connected to the event.
- Notification Sent: The server notifies the user via the chosen channel, such as email or a push notification.
In the client-side notification flow:
- Notification Received: The app delivers a notice to the user's device, alerting them to an event or message in the workspace.
- Notification Opened: The user responds to the notification by launching the app to read the event or message's specifics.
- Notification Read in App: The user views the notification's content within the app to get the complete context of the message or occurrence.
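The server-side and client-side funnels above can be sketched as span events that share one trace ID. This is a minimal illustration under assumed names: the `Span` fields, event names, and service names are hypothetical, not Slack's actual SlackTrace schema.

```python
from dataclasses import dataclass, field
import time
import uuid

# Hypothetical span-event record; field names are illustrative only.
@dataclass
class Span:
    trace_id: str          # the notification's ID doubles as the trace ID
    name: str              # e.g. "notification:sent"
    service: str           # which system emitted the event
    ts: float = field(default_factory=time.time)

# Idealized funnel stages, in order (names assumed for illustration).
SERVER_FUNNEL = ["notification:triggered", "notification:detected",
                 "notification:user_notified", "notification:sent"]
CLIENT_FUNNEL = ["notification:received", "notification:opened",
                 "notification:read_in_app"]

def ideal_funnel(notification_id: str, client: str = "iOS") -> list[Span]:
    """Emit the idealized end-to-end funnel for one notification."""
    server = [Span(notification_id, n, "webapp") for n in SERVER_FUNNEL]
    client_spans = [Span(notification_id, n, client) for n in CLIENT_FUNNEL]
    return server + client_spans

spans = ideal_funnel(uuid.uuid4().hex)
# Every event in the funnel belongs to the same trace.
assert len({s.trace_id for s in spans}) == 1
```

In a real trace some stages may be missing; comparing an actual trace against this idealized funnel is what reveals where a notification was dropped.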
Mapping Notification Flow to a Trace
After mapping out their system's notification flow, the Slack team needed a way to record this information. They chose SlackTrace to represent the flow, since all parts of their system could already emit data in the span event format.
The major challenges overcome while modeling notification flows as a trace were:
100% sampling for notification flows:
For notification flows, the Slack team decided to implement a 100% sampling rate, in contrast to backend requests, which were sampled at 1%. This choice stems from the Customer Experience (CE) team's need for complete accuracy when handling customer requests. When messages are broadcast widely, such as with @channel, a single trace for a Slack message could generate an enormous number of spans, potentially reaching billions across numerous users and devices; this volume would put significant strain on the trace ingestion pipeline and storage systems. Opting out of sampling also means capturing traces for every Slack message sent.
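A head-based sampling decision of this kind can be sketched as follows. The rates come from the article; the function name and flow-type labels are assumptions for illustration, not Slack's actual sampler.

```python
import random

NOTIFICATION_SAMPLE_RATE = 1.0   # keep every notification trace
BACKEND_SAMPLE_RATE = 0.01       # ordinary backend requests: 1%

def should_sample(flow_type: str, rng: random.Random) -> bool:
    """Decide at trace start whether to keep this trace.

    Notification flows are always kept (100% sampling); everything
    else is sampled probabilistically at the backend rate.
    """
    rate = NOTIFICATION_SAMPLE_RATE if flow_type == "notification" else BACKEND_SAMPLE_RATE
    return rate >= 1.0 or rng.random() < rate

rng = random.Random(0)
assert should_sample("notification", rng)  # always kept
```

Making the decision once at the head of the trace keeps downstream systems consistent: every span of a kept notification trace is recorded, and no partial traces appear.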
Tracing Notifications as a Distinct Flow:
They trace notifications separately from the trace of the original message. The OpenTelemetry (OpenTracing) instrumentation tightly couples tracing with a request context, but in a notification flow this coupling breaks down: the flow operates in multiple contexts and doesn't map neatly to a single request context. Integrating tracing across the codebase was also problematic because of the complexity of managing multiple trace contexts simultaneously.
Solution to Overcome Tracing Notification Challenge
To address these challenges, the Slack team treated each notification sent as an independent trace. They used span links to causally connect the sender's trace to the notifications sent. Each notification received a unique notification_id, which serves as the trace_id for the notification flow.
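The notification_id-as-trace_id pattern with a span link back to the sender can be sketched like this. The class and field names are assumptions for illustration; they mirror the span-link concept from OpenTelemetry rather than Slack's internal types.

```python
from dataclasses import dataclass, field
import uuid

@dataclass
class SpanLink:
    """A causal reference to a span in another trace (no parent/child edge)."""
    trace_id: str
    span_id: str

@dataclass
class Span:
    trace_id: str
    name: str
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    links: list[SpanLink] = field(default_factory=list)

def start_notification_trace(notification_id: str,
                             sender_trace_id: str,
                             sender_span_id: str) -> Span:
    """Root span of an independent notification trace.

    The notification_id becomes the trace_id, and a span link records
    causality back to the sender's (separately sampled) trace.
    """
    root = Span(trace_id=notification_id, name="notification:send")
    root.links.append(SpanLink(sender_trace_id, sender_span_id))
    return root
```

Because the link is a reference rather than a parent/child relationship, the sender's trace can be sampled at 1% while every notification trace is kept; the link is simply dangling when the sender's trace was not sampled.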
The advantages of this approach are:
- Modeling these flows significantly simplifies SlackTrace's instrumentation, since trace context propagation is not tightly bound to request context propagation.
- Treating each notification as its own trace results in smaller, more manageable traces that are easier to store and query.
- It enables the implementation of a 100% sampling rate for notification traces while maintaining a 1% sampling rate for senders.
- Using span linking allows the maintenance of causality in the trace data.
Benefits of Modeling a Notification Flow as a Trace
Some of the benefits of Modeling a Notification Flow as a Trace are:
- Uniform Data Format: All services report data as a Span. This ensures that data from various backend and client systems is in a consistent format.
- Service Name for Source Identification: They utilize the service name field (Desktop, iOS, or Android) to uniquely identify the client or service that generated an event.
- Standard Context Names: They use the span name together with the service name to uniquely identify an event across systems. For example, the service name for a "notification: received" event is set to iOS, Android, or Web to tag these events accurately. This standardization allows events from different clients to be queried uniformly.
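The uniform-querying benefit can be shown with a small sketch: one standardized span name selects the same logical event across every platform, and the service field then distinguishes the source. The event and service names are illustrative.

```python
def events_named(spans: list[dict], name: str) -> list[dict]:
    """Select one logical event across all clients by its standardized
    span name; the service field tells you which platform emitted it."""
    return [s for s in spans if s["name"] == name]

spans = [
    {"name": "notification:received", "service": "iOS"},
    {"name": "notification:received", "service": "Android"},
    {"name": "notification:opened",   "service": "iOS"},
]

received = events_named(spans, "notification:received")
assert {s["service"] for s in received} == {"iOS", "Android"}
```

Without standardized names, the same query would need per-platform predicates, one for each client's ad hoc event naming.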
- Consistent Timestamps and Durations: All events have a uniform timestamp with the same resolution and time zone as other events. If an event has a duration associated, they set the duration field. For one-off events, the default duration is 1. This approach provides a centralized location for storing all duration-related information.
- Flexible and Extensible Data Model: This model allows for flexibility and extension. Clients requiring additional context can add extra tags to an existing span. If none of the existing spans suit their needs, they can introduce a new span to the trace without affecting the existing trace data or queries.
- Elimination of Duplicate Events: Using SpanID in events ensures uniqueness at the source. This has significantly reduced double-reported events, eliminating the need for backend de-duplication processes. In contrast, the older method, which reported thrift objects without unique IDs, required de-duplication jobs to identify double-reported events.
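With span IDs unique at the source, de-duplication reduces to a simple in-order filter, a sketch of which is below, rather than the offline jobs the older thrift pipeline needed.

```python
def dedupe(spans: list[dict]) -> list[dict]:
    """Drop double-reported events by span_id, preserving order.

    Uniqueness is guaranteed at the source, so the first occurrence
    of each span_id is authoritative.
    """
    seen: set[str] = set()
    out: list[dict] = []
    for s in spans:
        if s["span_id"] not in seen:
            seen.add(s["span_id"])
            out.append(s)
    return out
```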
- Span Linking for Trace Cohesion: Linking spans across traces is instrumental in maintaining causal relationships without ad hoc data modeling. This preserves the flow and order of events, providing a more accurate representation of the sequence of operations.
- Built-in Sessions: The method incorporates the notification ID as the trace ID for the entire flow, effectively sessionizing all events. This eliminates the need for additional sessionization steps. Although not all events possess a notification ID, we can link them together using the trace ID instead of relying on custom events.
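Because the notification ID is the trace ID, sessionization is just a group-by on one universal join key, as this sketch shows (field names illustrative):

```python
from collections import defaultdict

def sessionize(spans: list[dict]) -> dict[str, list[dict]]:
    """Group every event of every notification flow by trace_id.

    The notification ID *is* the trace ID, so this single join key
    reconstructs each session with no custom sessionization step.
    """
    sessions: dict[str, list[dict]] = defaultdict(list)
    for s in spans:
        sessions[s["trace_id"]].append(s)
    return dict(sessions)
```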
- Clean, simple, and reliable instrumentation: This approach offers a clean, straightforward, and dependable process. By sessionizing the trace, we only need to apply tags once when modeling the notification flow. This streamlines the instrumentation code, making it easier to test and maintain. Furthermore, it simplifies data usage by employing a single universal join key rather than multiple specialized keys for specific event subsets.
How Do Developers Use Notification Trace Data at Slack?
Developers use the notification trace data to prioritize issues. Previously, tracking a notification failure meant combing through the logs of several systems to find where the notification was dropped, a process that consumed hours of senior engineers' time. With notification tracing, anyone can look at a notification's trace and see precisely where it was sent and where the flow was dropped.
With the implementation of notification tracing, Slack has improved the dependability of its notification system and established a model for how intricate workflows can be effectively managed. This improvement demonstrates Slack's dedication to providing all of its users with a reliable and smooth communication experience.
Check out SuprSend if you want a robust Slack notification architecture without having to code and build the components yourself.