Last Updated: May 2026
A user files a support ticket: "I never got my password reset email." Your on-call engineer checks the email vendor dashboard. SendGrid says it delivered 4.2 million emails today with a 99.1% success rate. Everything looks green. But that one user still did not get their email, and you have no way to trace what happened between the moment your backend called the notification API and the moment the email was supposed to land in their inbox.
This is the notification observability gap. Most engineering teams have robust observability for their application layer (APM traces, structured logs, error tracking) but treat notifications as fire-and-forget. You trigger a notification, the vendor accepts it, and you move on. When something breaks, you are left piecing together logs from three or four systems, trying to reconstruct what happened to a single message for a single user.
The problem gets worse at scale. When you are sending millions of notifications across email, push, SMS, and in-app channels, "something went wrong" is not a useful diagnosis. You need to know exactly where in the pipeline it went wrong, for which user, on which channel, and why. That is what notification observability actually means, and it is a fundamentally different problem from notification analytics.
This post covers what real notification observability looks like in production, how to structure logs and traces so that debugging a single notification takes seconds instead of hours, and the common failure patterns that only surface when you have proper tracing in place.
For the SuprSend product story on this same problem, see how SuprSend fixes notification observability.
What Is Notification Observability?
In application monitoring, observability means you can understand the internal state of a system by examining its external outputs: logs, metrics, and traces. Notification observability applies the same principle to the notification pipeline. Given any notification event, you should be able to answer: What triggered it? What decisions did the system make? Was delivery attempted? Did the user see it?
This is different from notification analytics, which deals in aggregates. Analytics tells you that your email open rate is 34% this week. Observability tells you that user #48291's password reset email was triggered at 14:32 UTC, matched the "auth-alerts" workflow, passed the preference check, was routed to email, rendered with the password-reset-v2 template, sent via SendGrid at 14:32.4 UTC, and bounced because the email address had a typo in the domain.
The distinction matters because they answer different questions. Analytics answers "How are notifications performing in aggregate?" Observability answers "What happened to this specific notification for this specific user?" When your VP of Engineering asks why the delivery rate dropped 3% this week, that is analytics. When a user says "I did not get my two-factor code," that is observability. And the second question is the one you need to answer in under two minutes, not after an hour of log spelunking.
Notification observability also covers the decisions the system made that are invisible to the user. A notification that was never sent because the user opted out of that category is just as important to trace as one that failed at the vendor level. Without observability into the decision layer, your support team cannot distinguish between "the system decided not to send" and "the system tried to send and failed." Those are very different problems with very different fixes.
Why "Sent" Is Not the Same as "Delivered"
Most notification systems track two states: triggered and sent. An event comes in, the system processes it, hands it to a delivery vendor, and marks it as "sent." That is where visibility ends. And that is where most delivery failures actually happen.
The gap between "sent" and "delivered" is where entire categories of problems hide:
- Email: Your system sends the email to SendGrid. SendGrid accepts it (status: sent). But the recipient's mail server soft-bounces it because the mailbox is full. SendGrid retries three times over 72 hours, then hard-bounces it. If you only check the "sent" status, you see a successful delivery. The user never got the email.
- Push notifications: Your system sends the push to Firebase Cloud Messaging. FCM accepts it (status: sent). But the user's device has been offline for a week. FCM holds the message, then discards it when the TTL expires. Or the user revoked notification permissions at the OS level, and FCM drops it silently.
- SMS: Your system sends the SMS via Twilio. Twilio accepts it (status: queued). But the carrier flags it as spam and never delivers it. Or the user's phone is on airplane mode. Twilio's status callback eventually returns "undelivered," but only if you are listening for it. Even within the US, A2P traffic can be filtered or throttled when the campaign is not registered with The Campaign Registry. Major carriers (AT&T, T-Mobile, Verizon) drop unregistered 10DLC traffic before it reaches the handset, and the carrier-level reason rarely surfaces in the vendor callback.
- In-app: You push the notification to the user's inbox in your app. It is marked as delivered. But the user has 200 unread notifications and never scrolls down far enough to see it.
Every channel has its own version of this gap. The common thread is that "sent" is your system's perspective, and "delivered" is the user's experience. Notification observability means closing that gap by tracking status at every stage, not just at handoff to the vendor.
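One way to make that gap explicit in code is to model notification status as a small state machine rather than a boolean "sent" flag. The states and transitions below are an illustrative sketch, not any vendor's API; real pipelines add channel-specific states (for example, "queued" for SMS) and per-channel terminal states.

```python
from enum import Enum

class DeliveryStatus(Enum):
    TRIGGERED = "triggered"   # event received by the notification system
    SENT = "sent"             # accepted by the delivery vendor (its perspective)
    DELIVERED = "delivered"   # confirmed delivered via async vendor callback
    BOUNCED = "bounced"       # rejected by the recipient's infrastructure
    SEEN = "seen"             # user opened / read the notification

# Legal forward transitions; anything else indicates a tracking bug.
ALLOWED = {
    DeliveryStatus.TRIGGERED: {DeliveryStatus.SENT},
    DeliveryStatus.SENT: {DeliveryStatus.DELIVERED, DeliveryStatus.BOUNCED},
    DeliveryStatus.DELIVERED: {DeliveryStatus.SEEN},
    DeliveryStatus.BOUNCED: set(),
    DeliveryStatus.SEEN: set(),
}

def advance(current: DeliveryStatus, new: DeliveryStatus) -> DeliveryStatus:
    """Move a notification to its next status, rejecting illegal jumps."""
    if new not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {new.value}")
    return new
```

The useful property of this shape is that "sent" can never be mistaken for a terminal state: the system is forced to record a later transition to delivered or bounced, or the absence of one becomes visible.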
This also explains why vendor dashboards are insufficient for debugging. SendGrid's dashboard tells you aggregate bounce rates. It does not tell you that user #48291's password reset email bounced because their email address is "user@gmial.com" (note the typo). Twilio's dashboard tells you that SMS delivery rates dropped 2% this week. It does not tell you that your OTP messages to T-Mobile subscribers are being filtered because your 10DLC campaign is registered as a marketing use case instead of low-volume transactional, so the carrier is throttling them as promotional. You need per-notification, per-user tracing to get from "something is broken" to "this specific thing is broken for this specific reason."
The Four Layers of Notification Observability
Debug notification delivery issues across enough production systems and a consistent pattern emerges. Failures cluster into four distinct layers, and each layer needs its own instrumentation. If you are designing a notification system architecture, these four layers should be built into the pipeline from day one, not bolted on later. Miss any one of them and you have a blind spot.
Layer 1: Trigger
The first question to answer: did the event even arrive? This sounds trivial, but it is the root cause more often than you would expect. An API call that silently failed. A webhook that timed out. An event payload with a missing user ID that got dropped by validation. A rate limiter that throttled the request.
What you need to log at this layer:
- Event ID (unique, for end-to-end tracing)
- Timestamp of receipt
- Event type / notification category
- User ID or recipient identifier
- Payload validation result (accepted or rejected, with reason)
- Source system (which service triggered this event)
The trigger layer catches problems like: the backend team deployed a change that accidentally stopped sending the "order_confirmed" event. Without trigger-layer logging, this shows up as "order confirmation emails stopped" with no obvious cause. With it, you see that event volume for order_confirmed dropped to zero at 14:22 UTC, which correlates with the deploy at 14:20 UTC.
Layer 2: Workflow
Once the event is received, the notification workflow engine makes a series of decisions. Which workflow matches this event? Should it be sent based on the user's preferences? Is there a batch or digest window open? Which channels should it go to? What template should it use?
This is the decision layer, and it is the most commonly untraced part of the notification pipeline. When a notification is not delivered, the answer is often "the system decided not to send it," but without workflow-level tracing, that decision is invisible.
What you need to log at this layer:
- Which workflow matched (or that no workflow matched, and why)
- Preference check result (allowed, blocked, or no preference set)
- Frequency cap check (passed or throttled)
- Batch/digest decision (added to open batch, or new batch started, or bypass)
- Channel routing decision (which channels selected, and why others were excluded)
- Template selection (which template version was used)
Workflow-layer tracing is what turns "the notification was not sent" from a mystery into a clear explanation: "The notification was not sent because the user opted out of the 'marketing-updates' category on April 12."
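A minimal sketch of what decision tracing looks like in code, assuming a hypothetical in-memory preference store: every evaluation, including a decision not to send, appends an entry with an explicit reason to the notification's trace.

```python
def evaluate_preferences(trace: list, category: str, prefs: dict) -> bool:
    """Run the preference check for one notification, recording the decision.

    `prefs` maps category name -> user setting; None means no preference set.
    Returns True if delivery may proceed.
    """
    setting = prefs.get(category)
    if setting == "opted_out":
        trace.append({
            "step": "preference_check",
            "outcome": "blocked",
            "reason": f"user opted out of category '{category}'",
        })
        return False
    trace.append({
        "step": "preference_check",
        "outcome": "allowed",
        "reason": setting or "no preference set, default allow",
    })
    return True
```

The same pattern extends to frequency caps, batch windows, and channel routing: each check appends one entry, so the finished trace reads as a chronological list of decisions with reasons.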
Layer 3: Delivery
This is the layer most teams think of when they hear "notification monitoring." The notification has been routed to a channel and handed off to a delivery vendor. Was it accepted? Was it delivered? Did the vendor return an error?
What you need to log at this layer:
- Vendor name (SendGrid, FCM, Twilio, etc.)
- Vendor API response code and message
- Vendor message ID (for cross-referencing with vendor logs)
- Delivery status updates (queued, sent, delivered, bounced, failed)
- Retry attempts and outcomes
- Failover events (primary vendor failed, switched to secondary)
The critical detail here is async status tracking. Most vendors accept the message synchronously (returning a 200 OK) but report the actual delivery status asynchronously via webhooks or polling. Your observability system needs to correlate these async updates back to the original event. Without that correlation, you know the vendor accepted it, but you do not know if it actually arrived.
Layer 4: Engagement
The final layer tracks what the user did after the notification was delivered. Did they open the email? Did they tap the push notification? Did they click a link? Did they read the in-app message?
What you need to log at this layer:
- Open events (email opens via tracking pixel, push notification taps)
- Click events (link clicks within the notification)
- Read/seen status (in-app notifications marked as read)
- Dismiss events (user swiped away the push notification)
- Conversion events (user completed the action the notification prompted)
Engagement data completes the observability picture. Without it, you cannot distinguish between "the notification was delivered but ignored" and "the notification was delivered and the user acted on it." This distinction matters for debugging ("the user says they did not see it, but our logs show it was opened") and for optimization ("our password reset emails have a 92% open rate but only a 60% click rate, which means the CTA is unclear").
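Folding engagement events into the trace lets you collapse a notification's final state into one human-readable verdict. The event names below are illustrative assumptions, not a specific provider's schema.

```python
def engagement_summary(events: list[str]) -> str:
    """Classify a delivered notification by its recorded engagement events."""
    if "click" in events or "conversion" in events:
        return "delivered and acted on"
    if "open" in events or "read" in events:
        return "delivered and seen, no action"
    return "delivered but not seen"
```

This is the one-liner that resolves the "user says they did not see it, but our logs show it was opened" dispute in a support ticket.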
Together, these four layers give you a complete trace from trigger to engagement for every notification. When a user reports a problem, you do not start with "let me check SendGrid." You start with "let me pull up the trace" and follow it through each layer until you find where it broke.
What a Step-by-Step Notification Log Should Show
Theory is useful, but let us walk through a concrete example. A user submits a support ticket: "I requested a password reset 20 minutes ago and never got the email."
Here is what a proper notification observability system should show you when you look up that user's recent notifications:
Step 1: Event Received
- Event ID: evt_8f3a2b1c
- Event type: password_reset_requested
- User ID: usr_48291
- Timestamp: 2026-05-05T14:32:01.442Z
- Source: auth-service v2.4.1
- Payload validation: Passed
Good. The event arrived. The auth service triggered it, and it passed validation. No issue at the trigger layer.
Step 2: Workflow Matched
- Workflow: password-reset-v2 (ID: wf_91bc3d)
- Matched at: 2026-05-05T14:32:01.518Z
- Preference check: Passed (category: security-alerts, user setting: required)
- Frequency cap: Not applicable (security category exempt)
- Batch check: Bypass (transactional, no batching)
The workflow engine found the right workflow. Preferences did not block it (security alerts are marked as required and cannot be opted out of). No batching applied because transactional notifications bypass digest rules. Still no issue.
Step 3: Channel Routing
- Channels evaluated: Email, SMS
- Email: Selected (primary channel for password reset)
- SMS: Skipped (fallback, only if email fails)
- Recipient email: jane.doe@gmial.com
- Template: password-reset-v2-email (version 3)
- Template rendered at: 2026-05-05T14:32:01.623Z
Channel routing selected email as the primary channel. SMS is configured as a fallback. The template rendered successfully. But look at the email address: gmial.com instead of gmail.com. That is likely the problem, but let us follow the trace to confirm.
Step 4: Delivery Attempted
- Vendor: SendGrid
- API call: 2026-05-05T14:32:01.891Z
- API response: 202 Accepted
- Vendor message ID: sg_abc123xyz
SendGrid accepted the email. At this point, your system logged it as "sent." If this is where your observability stops, you would see a successful delivery and tell the user "it was sent, check your spam folder." But the trace continues.
Step 5: Delivery Status Update (async)
- Status webhook received: 2026-05-05T14:32:04.112Z
- Status: Bounced
- Bounce type: Hard bounce
- Bounce reason: 550 5.1.1 The email account that you tried to reach does not exist. (gmial.com)
- Retry: No (hard bounces are not retried)
There it is. The email bounced because gmial.com is not a valid domain. Hard bounce, no retries. The user has a typo in their email address on file.
Step 6: Fallback Triggered
- Fallback condition: Email delivery failed
- Fallback channel: SMS
- Recipient phone: +1-555-0142
- Vendor: Twilio
- API call: 2026-05-05T14:32:04.340Z
- API response: 201 Created
- Delivery status: Delivered at 2026-05-05T14:32:06.891Z
The system detected the email failure and triggered SMS as a fallback. The SMS was delivered successfully. At this point, you can tell the user: "Your email address on file has a typo (gmial.com instead of gmail.com), so the email bounced. We sent the reset link to your phone number instead. Please update your email address in your account settings."
Total time to diagnose: about 30 seconds of reading a single trace. Without this observability, the same investigation would involve checking the application logs for the event trigger, then the notification service logs for the routing decision, then the SendGrid dashboard for the bounce, then the Twilio dashboard for the fallback. Four systems, four different interfaces, and you would need to correlate timestamps manually. That process takes 20 to 45 minutes.
Common Debugging Scenarios (and What the Logs Reveal)
The password reset walkthrough above is one pattern. Here are the other failure modes that come up repeatedly, and what proper notification observability reveals in each case.
Scenario 1: Event Triggered, No Workflow Matched
Symptom: A new feature ships with notifications. Users are supposed to get emails when someone shares a document with them. Nobody is getting the emails.
What the logs reveal: The trigger layer shows the document_shared events arriving correctly. But the workflow layer shows "No matching workflow found for event type: document_shared." The engineering team configured the workflow to match on event type doc_shared (without the "ument"). A one-character mismatch between the event the backend sends and the event the workflow expects.
Without observability: This looks like an email delivery problem. The team checks SendGrid. SendGrid shows no emails were sent for that template. They check the template. The template is fine. They check the API integration. The API is fine. Eventually, after 45 minutes of digging, someone compares the event name in the backend code with the workflow configuration and finds the mismatch.
With observability: You search for document_shared events, see that they are being received but no workflow matches, and compare the event name with the workflow config. Five-minute fix.
Scenario 2: Workflow Matched, Preference Check Blocked Delivery
Symptom: A subset of users reports not receiving weekly project update emails. Other users on the same projects get them fine.
What the logs reveal: The workflow layer shows that for the affected users, the preference check returned "blocked: user opted out of category 'project-updates' on 2026-03-15." These users opted out three months ago, possibly accidentally when they were adjusting other notification settings. Or they opted out intentionally but forgot. Either way, the system is working correctly. It is respecting their preferences.
Without observability: This looks like a bug. The team investigates the workflow, the template, the email vendor. Everything is working. They eventually check the user's preference settings one by one and discover the opt-out. For five users, that is a tedious but manageable investigation. For fifty users, it is a project.
With observability: You filter the logs for the "project-updates" category, see that blocked notifications all share the same reason (preference opt-out), and can immediately tell support: "These users opted out of project update emails. They can re-enable them in their notification preferences."
Scenario 3: Delivery Attempted, Vendor Returned Error
Symptom: Push notification delivery rates drop from 94% to 71% over the course of a day. No code changes were deployed.
What the logs reveal: The delivery layer shows a spike in FCM HTTP v1 errors starting at 09:14 UTC, all returning UNREGISTERED. The affected device tokens are all iOS tokens. Cross-referencing with the deployment logs: the mobile team released a new iOS build at 09:00 UTC that changed the APNs token format. The old tokens are now invalid, but the backend has not received updated tokens from the new app version yet because users have not opened the app since updating.
Without observability: You see the delivery rate drop in your analytics dashboard. You check FCM's status page (no incidents). You check your own infrastructure (fine). You do not know which device tokens are failing or why, because your logs only capture "sent to FCM" and not the per-token delivery result.
With observability: You filter delivery failures by channel (push), vendor (FCM), and error code (UNREGISTERED). You see the spike correlates with the iOS app release. You coordinate with the mobile team to handle token refresh on app update.
Scenario 4: Delivered, But User Did Not See It
Symptom: A user says "I never got the invoice email." Your logs show it was delivered. The user is frustrated. Your support team is stuck.
What the logs reveal: The delivery layer shows the email was accepted by the user's mail server (status: delivered). The engagement layer shows no open event. Checking the email headers and deliverability signals: the email was sent from a domain with a DMARC policy of "none," and the user's organization recently enabled strict spam filtering that routes unsigned emails to quarantine.
Without observability: You can tell the user "we sent it," which is technically true but not helpful. You cannot see whether they opened it, and you have no insight into why it might have been filtered.
With observability: You see the delivery was successful but engagement was zero. You check the sending domain's authentication status (SPF, DKIM, DMARC) and find the issue. You either fix the domain configuration or escalate to the user's IT team with specific details about what is being filtered.
How Notification Platforms Compare on Observability
Not all notification platforms provide the same depth of observability. Some offer basic delivery logs. Fewer provide workflow-level execution traces. Here is how the major platforms compare.
Capability comparison based on each platform's public docs as of May 2026. Sources: SuprSend, Knock, Courier, Novu.
The biggest differentiator is not whether a platform has "logs." Every platform has some form of logging. The differentiator is whether you can follow a single notification from the moment it was triggered through every decision the system made, all the way to the delivery result and engagement, in one view. That end-to-end trace is what turns a 30-minute debugging session into a 30-second one. Most platforms give you pieces of the picture. You see the delivery status here, the workflow decision there, and you have to mentally stitch them together.
How SuprSend Handles Notification Observability
We built observability into the notification pipeline from the start, not as an add-on logging layer. Every notification that flows through the system generates a structured trace that covers each stage of the lifecycle: trigger, workflow execution, delivery, and engagement. The platform exposes this through four log views: Requests, Workflow Executions, Broadcast Executions, and Messages. Here is how the key pieces work.
Unified notification log. Every notification has a single, chronological log that shows every step from event receipt to delivery. When a support ticket comes in, you search by user ID, event type, or time range, and you see the full trace. No jumping between dashboards. No correlating timestamps across systems. One view, complete context.
Workflow execution trace. Our workflow engine logs every decision node it evaluates. If a notification was not sent because a preference check blocked it, the log says exactly which preference, when the user set it, and what category it belongs to. If a channel was skipped because the user does not have a valid push token, the log shows that. These decision traces are what make the difference between "the notification failed" and "here is exactly why it was not delivered."
Vendor response capture. We ingest delivery status webhooks from every connected vendor (SendGrid, Mailgun, FCM, APNs, Twilio, and others) and attach them to the notification trace in real time. When an email bounces three seconds after being sent, the bounce reason appears on the same trace. You do not have to log into SendGrid separately to find the bounce code.
Intelligent failover with trace continuity. When a primary delivery vendor fails, the system automatically routes to a configured fallback. The trace shows the primary attempt, the failure reason, the failover decision, and the fallback delivery result, all in sequence. For transactional notifications like OTPs and password resets, this failover visibility is critical. You need to know not just that the message was eventually delivered, but which path it took and why.
Batch and digest tracing. When a notification enters a batch window, the trace shows the batch ID, the grouping key, how many items have been collected, and when the batch is scheduled to close. After the batch fires, the trace shows the rendered digest and per-channel delivery results. Debugging why a user received a digest with 3 items instead of 7 becomes straightforward: you see exactly which events entered the batch and which arrived after it closed.
Search and filter. The notification log is searchable by date range, status, recipient ID, idempotency key, workflow name, channel, and vendor. Enterprise plans add audit logs for account-level changes (auth, API keys, team modifications). If push delivery rates dropped this morning, you can filter for channel=push, status=failed, and see every failed push notification with its error code. If one user is not getting emails, you search by their recipient ID and scan the last 20 notifications in seconds.
The goal is not to generate more logs. It is to make the answer to "what happened to this notification?" available in one place, in under a minute, without needing to cross-reference three vendor dashboards and two internal log systems.
FAQ
What is notification observability?
Notification observability is the ability to trace a single notification from the moment the triggering event is received through every system decision (workflow matching, preference checks, channel routing) to the final delivery status and user engagement. It applies the same principles as application observability (logs, metrics, traces) to the notification pipeline, giving you per-notification, per-user visibility instead of just aggregate analytics.
How is notification observability different from notification analytics?
Analytics answers aggregate questions: "What is our email open rate this week?" or "How many push notifications did we send yesterday?" Observability answers specific questions: "What happened to user #48291's password reset email?" or "Why did this user not receive their order confirmation?" Analytics is for trends and optimization. Observability is for debugging and incident response. You need both, but they serve different purposes.
What are the most common reasons a notification is not delivered?
The four most frequent causes are: (1) the user opted out via notification preferences, and the system correctly suppressed delivery; (2) the delivery vendor returned an error such as an invalid email, expired push token, or rate limit; (3) the event was triggered but no workflow was configured to match it, often due to a typo in the event name; (4) the notification was delivered to the vendor but filtered by the recipient's infrastructure, such as a spam filter or disabled push permissions at the OS level.
What should a notification log include for effective debugging?
At minimum: the event ID and type, timestamp, user ID, workflow match result, preference check result, channel routing decision, delivery vendor and response code, async delivery status updates (bounces, failures), and engagement events (opens, clicks). The key is connecting all of these into a single chronological trace per notification, so you can follow the entire lifecycle without jumping between systems.
How do I track notification delivery status when vendors report it asynchronously?
Most email (SendGrid, Mailgun, SES) and SMS (Twilio) vendors report delivery status via webhooks. You need to set up webhook endpoints that receive these status updates and correlate them back to the original notification using the vendor's message ID. For push notifications, FCM and APNs provide delivery receipts, though the reliability varies. The critical implementation detail is maintaining a mapping between your internal notification ID and the vendor's message ID so you can attach async updates to the right trace.
Can notification observability help with deliverability issues like spam filtering?
Indirectly, yes. Observability shows you the gap between "delivered to the mail server" and "opened by the user." If you see consistently high delivery rates but low open rates for a specific sending domain or template, that is a strong signal that messages are landing in spam. Combined with email authentication checks (SPF, DKIM, DMARC status), observability data helps you identify and fix deliverability problems before they affect a large portion of your users.
How does notification observability work with multi-channel delivery?
In a multi-channel setup (email, push, SMS, in-app), observability needs to trace the routing decision (why each channel was selected or skipped) and the delivery result per channel, all under a single notification trace. If a notification was sent to email and push, you should see both delivery attempts and their results in one place. If the email failed and the system fell back to SMS, the trace should show the primary failure, the fallback decision, and the fallback result in sequence.
What is the difference between notification logs and notification traces?
Logs are individual records of events: "email sent," "webhook received," "preference checked." A trace is a connected sequence of logs that tells the complete story of one notification from trigger to engagement. Think of it like the difference between individual log lines in your application and a distributed trace in OpenTelemetry, Jaeger, or Datadog. The trace gives you the full picture. Individual logs give you fragments that you have to piece together manually. Effective notification observability requires traces, not just logs.
TL;DR
- "Sent" is not "delivered." Most notification systems stop tracking at vendor handoff. Real observability tracks status through delivery confirmation and user engagement.
- Four layers of observability: Trigger (did the event arrive?), Workflow (what did the system decide?), Delivery (did the vendor succeed?), Engagement (did the user see it?). Miss any layer and you have a blind spot.
- Workflow-level tracing is the biggest gap. Most teams trace delivery but not the decisions that happened before delivery: preference checks, channel routing, batch decisions. These invisible decisions are the root cause of most "notification not received" issues.
- End-to-end traces, not scattered logs. A single chronological trace per notification, searchable by user ID, is what turns a 30-minute investigation into a 30-second lookup.
- Async delivery tracking is mandatory. Vendors report bounces, failures, and delivery confirmations asynchronously. If you are not ingesting those webhooks and attaching them to the notification trace, your "delivered" status is a guess.
- Common failure patterns include misconfigured event names, preference opt-outs, expired device tokens, and spam filtering. Each is only diagnosable with the right observability layer.
- SuprSend provides end-to-end notification tracing across all four layers, with workflow execution traces, vendor status ingestion, failover visibility, and user-level search in a single unified log.
If you are building notification infrastructure and your debugging process still involves checking three vendor dashboards and grepping through application logs, you do not have a logging problem. You have an observability problem. The fix is not more logs. It is structured, per-notification traces that follow the message from trigger to inbox.
Want to see what full notification observability looks like? Try SuprSend and trace every notification from trigger to delivery in a single, searchable log. No more cross-referencing vendor dashboards.