Where'd it go?: Monitoring your conversational AI project's uptime
Chatbot projects benefit from close monitoring. Reviewing chat logs can help designers surface problems with flows and give ideas for new features. An analytics dashboard can help the team understand broad usage trends and even drill down into specific intents. There’s no shortage of things to keep an eye on. One thing you might want to monitor… is the service still there?
As they get more advanced, chatbots often rely on a variety of systems and tools to enhance their performance. Developing and integrating these systems means considering what might happen when they fail. Even so, unexpected issues arise. Supporting systems may experience downtime, as the Bing API did earlier this month. Your own system may also experience downtime or irregularities. Such events can disrupt the chat experience, whether the chatbot behaves unusually or goes completely offline.
Uptime monitoring helps you know when your system experiences these unexpected events by conducting health checks of your service. A liveness check, a type of health check, involves regularly interacting with your service to make sure that it is available and serving content. This may involve “visiting” the website URL at regular intervals. Or, it may involve sending an HTTP request to an API endpoint to verify it is not only available but also working as intended.
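As a rough illustration of the second kind of check, here is a minimal Python sketch that uses only the standard library. The function names, the payload, and the expected text are hypothetical placeholders and are not tied to any particular monitoring service. A hosted service would run something like this for you on a schedule.

```python
import http.client
import json
import urllib.request
from typing import Optional


def is_healthy(status: int, body: str, expected_text: Optional[str] = None) -> bool:
    """Pass only on a 2xx status and, if requested, a body containing expected_text."""
    if not 200 <= status < 300:
        return False
    return expected_text is None or expected_text in body


def check_liveness(
    url: str,
    payload: Optional[dict] = None,
    headers: Optional[dict] = None,
    expected_text: Optional[str] = None,
    timeout: float = 5.0,
) -> bool:
    """Send one request to the service and report whether it looks alive.

    With no payload this is a plain GET (a "visit" to the URL).
    With a payload it becomes a POST carrying a JSON body.
    """
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    request_headers = dict(headers or {})
    if data is not None:
        request_headers.setdefault("Content-Type", "application/json")
    request = urllib.request.Request(url, data=data, headers=request_headers)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
            return is_healthy(response.status, body, expected_text)
    except (OSError, http.client.HTTPException):
        # Covers connection failures, timeouts, and the HTTPError that
        # urlopen raises for 4xx/5xx responses. All count as a failed check.
        return False
```

A call might look like check_liveness("https://example.com/api/chat", payload={"message": "ping"}, expected_text="pong"), where the endpoint and the expected reply are assumptions made for the example. Checking the response body, not just the status code, is what separates "available" from "working as intended."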
When a liveness check fails, the monitoring service will usually follow an escalation plan to alert team members to the potential issue. This could involve sending text messages and emails to various members of the team. These plans are usually configurable. For example, if an issue persists for a longer duration, the service might alert a wider set of team members.
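To make the idea of a widening escalation plan concrete, here is a small Python sketch. The roles, channels, and time thresholds are invented for illustration. Real monitoring services expose this kind of plan through their own settings rather than through code like this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int  # how long checks must have been failing
    contacts: tuple     # who is added to the alert at this step
    channel: str        # how they are notified


# Hypothetical plan: the audience widens the longer the outage lasts.
POLICY = (
    EscalationStep(0, ("on-call engineer",), "sms"),
    EscalationStep(10, ("team lead",), "sms"),
    EscalationStep(60, ("engineering manager",), "email"),
)


def who_to_alert(minutes_failing: float, policy=POLICY) -> list:
    """Return everyone who should have been alerted by now, in escalation order."""
    alerted = []
    for step in sorted(policy, key=lambda s: s.after_minutes):
        if minutes_failing >= step.after_minutes:
            for contact in step.contacts:
                if contact not in alerted:
                    alerted.append(contact)
    return alerted
```

With this plan, a failure that has just started reaches only the on-call engineer, while one that has lasted over an hour has reached all three roles.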
There are numerous uptime monitoring services. Some of the ones I like are BinaryCanary, Cronitor, and StatusCake. Major cloud providers like Google and Microsoft also provide liveness checks (and much more monitoring) with their services. These cloud platforms offer more fine-grained health checks for the entire lifecycle of a deployed app, such as startup and readiness checks. As your app grows in complexity, you may want to pursue more configurable monitoring services, such as Datadog or New Relic. These can help you get detailed insight into your application’s availability and performance with a variety of monitoring strategies.
Your choice of tooling for uptime monitoring will depend on the needs of your project and the available resources. Simple “pinging” services can be very quick, useful, and reliable. More complex tools help you discover root causes, allowing you to remediate issues faster. I find value in using multiple tools with overlapping features. The simple and complex tools are not mutually exclusive. Nonetheless, there are some qualities that I think are useful when evaluating basic uptime monitoring services:
HTTP Request Configurability - As mentioned earlier, requests are important for assessing your service’s availability. A monitoring platform is more useful if you can configure requests to meet your service’s needs. Can you send a payload with a request or only generic pings (hits on a URL)? Can you adjust the request headers?
Escalation Configurability - The point of an uptime monitoring service is to initiate action when it’s necessary. Most of the time, that means notifying the right people to look into a potential problem. A monitoring service needs to give you the flexibility to set up a plan that aligns with the needs of your organization and project. When should alerts be sent? Who should they be sent to and how? What happens if the checks keep failing for… 5 minutes? 10 minutes? An hour? What happens if the problem is solved and checks start passing again?
Check Interval - An uptime monitoring service needs to hit your system on a regular schedule. The more frequent the checks, the faster you receive an alert and can start investigating. Many free-tier plans offer checks at 5-minute intervals. In high-traffic systems with a global user base, even 5 minutes may mean thousands of users are impacted before you even know there is a problem. You may want checks to run every minute or even every 30 seconds.
Cost - Cost matters. Choose a service with the features you need and will actually use, and then choose the plan that gives you those features.
Distribution of Checks - Most services will allow you to send requests from multiple locations around the world. If your service targets a global audience, this feature helps you check your service’s availability and performance from different locations. Even if you don’t target a global audience, sending requests from multiple locations can help make the monitoring more reliable when the monitoring service itself experiences glitches. It’s great when the service offers various locations for checks to originate from.
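One way multiple locations can make monitoring more reliable is to alert only when several of them agree that the service is down. The Python sketch below shows that idea. The location names and the 50% threshold are assumptions chosen for the example, and individual monitoring services may decide this differently.

```python
def should_alert(results_by_location: dict, min_failing_fraction: float = 0.5) -> bool:
    """Alert only when more than min_failing_fraction of locations report a failure.

    results_by_location maps a location name to True (check passed)
    or False (check failed).
    """
    if not results_by_location:
        return False
    failures = sum(1 for passed in results_by_location.values() if not passed)
    return failures / len(results_by_location) > min_failing_fraction
```

For example, if only "eu-west" fails while "us-east" and "ap-south" pass, no alert is raised, since a single failing location may point to a glitch on the monitoring side or a regional network problem. If two of the three fail, the alert goes out.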
Many chatbot platforms allow for some form of uptime monitoring. Even with no-code or low-code platforms, you may be able to get an endpoint or a URL to hit to test your service. Services you may want to integrate with your chatbot can be tested as well. Hitting these resources repeatedly may cost money, though. Because they do their own monitoring, many of these companies offer their own status page (WhatsApp status, OpenAI status, Pokemon API status). These status pages usually allow you to receive “incident” alerts from a particular service for free.
Uptime monitoring is an important part of developing a reliable conversational AI service for your users. When unexpected outages happen, health checks will help your team investigate and act faster. They will also help your team communicate about an issue to your users in a timely way. When things are stable, your team can rest easier knowing that the service is being monitored. If you’d like to get started with uptime monitoring…
Try out some uptime monitoring platforms. Explore your options. See what you like and what you don’t.
Explore services you use (or want to use) for a conversational AI project. Find out what they offer for liveness checks or if they have a status page.
Review on-call / escalation plans for digital projects. If your organization has other digital initiatives, you may already have some plans you can draw from. Uptime monitoring isn’t specific or new to conversational AI. Learn from what others have done and are doing.