Webhooks: The Devil in the Details

Christoph Neijenhuis
Published in commercetools tech
5 min read · Dec 7, 2016

Webhooks seem to be such a simple solution for callbacks: Send an HTTP request. Receive an HTTP request, do something, and return 200. Seems pretty straightforward, right?

Once we leave the happy path, we see that it’s not so simple — neither for the sender, nor for the receiver. One can ignore many of these issues if the cost of failure is low, e.g. if a CI build is not triggered, or a message was not posted into a Slack channel. However, when looking at commerce use cases for callbacks, many of these have a high cost of failure, e.g. if your food is not delivered within the promised time, or if a refund is not triggered.

Let’s have a closer look at the details and at what can go wrong:

The Delivery fails

What should the sender do if it can’t deliver the Webhook because the destination server or the network isn’t available?

Turns out: It depends. If the callback is supposed to trigger a refund on your credit card, it should be retried ad nauseam. However, if the callback is supposed to trigger a push notification with “your food delivery is on the way”, it should be dropped after a couple of minutes, as the customer will have received their meal already.
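That decision can be expressed as a per-webhook retry policy: an expiry combined with exponential backoff. A minimal sketch in Python — the `send` callable and the parameter names are assumptions for illustration, not any particular sender's API:

```python
import time

def deliver_with_retry(send, payload, max_age_seconds,
                       base_delay=1.0, sleep=time.sleep):
    """Retry delivery with exponential backoff until the payload expires.

    A refund callback would get a very large max_age_seconds; a
    "food is on the way" notification a small one.
    Returns the attempt count on success, or None once the payload is stale.
    """
    deadline = time.monotonic() + max_age_seconds
    delay = base_delay
    attempt = 0
    while True:
        attempt += 1
        if send(payload):  # send() returns True on a 2xx response
            return attempt
        if time.monotonic() + delay > deadline:
            return None  # drop it: the next retry would land too late
        sleep(delay)
        delay *= 2  # exponential backoff between attempts
```

The same mechanism serves both cases; only the expiry differs.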

The Receiver can’t process a specific Webhook

Bugs hide everywhere — especially in software. A bug in the receiver may only be triggered by a specific payload. How can the developers of the receiver retrieve this payload? A nice sender would offer a dead letter queue or similar that can be accessed via a UI and/or an API.
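Such a dead letter store can be sketched like this — in-memory and hypothetical; a real sender would persist the entries and expose them through the UI/API mentioned above:

```python
class DeadLetterQueue:
    """Holds payloads that could not be delivered, plus the last error,
    so the receiver's developers can inspect and replay them."""

    def __init__(self):
        self._entries = []

    def add(self, payload, error):
        self._entries.append({"payload": payload, "error": str(error)})

    def entries(self):
        return list(self._entries)


def deliver(send, payload, dlq, max_attempts=3):
    """Try delivery a few times; after repeated failure, dead-letter it."""
    last_error = None
    for _ in range(max_attempts):
        try:
            send(payload)
            return True
        except Exception as exc:
            last_error = exc
    dlq.add(payload, last_error)
    return False
```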

The Receiver can only process so many Messages at once

How many HTTP requests can be sent in parallel before overloading the receiver? 5? 50? 500?

Workloads in commerce are often spiky. Your frontend needs to scale, but for many of the background processes, a viable strategy is to be asynchronous and buffer tasks. Webhooks are not a good buffer, because the sender pushes to the receiver. If the receiver can pull the work, it can consume it at its own pace.
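The difference shows in a pull-based consumer: the receiver decides when to take work, and how much. A sketch, using Python's in-process `queue` as a stand-in for a real message queue:

```python
import queue

def consume_batch(q, process, batch_size=10):
    """Pull at most batch_size tasks and process them at our own pace.
    Whatever we don't pull simply stays buffered in the queue."""
    processed = 0
    while processed < batch_size:
        try:
            task = q.get_nowait()
        except queue.Empty:
            break  # nothing left; come back later
        process(task)
        processed += 1
    return processed
```

During a spike the buffer grows; the consumer keeps pulling fixed-size batches and is never overloaded.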

As a workaround, the receiver can refuse to accept Webhooks under high load, and trust the sender to retry the delivery. However, the receiver would then be refusing incoming requests at random. If the sender doesn’t maintain a global FIFO queue (which is quite hard to scale), some tasks may be stuck for hours while others are processed on the first try.
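The workaround amounts to a receiver that caps concurrent requests and rejects everything beyond the cap, trusting the sender's retry policy. A sketch; the capacity and the 503 status code are choices for illustration, not a standard:

```python
import threading

class BackpressureReceiver:
    """Refuse webhooks beyond `capacity` concurrent requests with a 503,
    counting on the sender to retry the delivery later."""

    def __init__(self, capacity):
        self._slots = threading.Semaphore(capacity)

    def handle(self, process, payload):
        if not self._slots.acquire(blocking=False):
            return 503  # overloaded; sender should retry with backoff
        try:
            process(payload)
            return 200
        finally:
            self._slots.release()
```

Which requests hit the 503 is arbitrary, which is exactly the fairness problem described above.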

Integration with Monitoring Systems

Even if the sender has addressed all of the above problems, it is still vital that the system can be monitored and that automated alerts can be sent out when deliveries fail, a message is added to the dead letter queue, or the buffer size grows. This should integrate nicely with the system the (Dev)Ops team is already using to monitor the rest of the infrastructure.

Message Queues are built to solve these Issues

Fortunately, there is already a lot of software out there that is built to solve all of these problems — Message Queues! They allow you to define retry policies and dead letter queues, and are often integrated with your favorite monitoring software. They can be consumed via a pull-API, and many can be integrated with auto-scaling workers/lambdas.

With a Message Queue, the delivery is clearly separated from the processing of the message. The sender can hand the message off with a much higher chance of success, because neither bugs nor performance issues in the receiver will have an impact on the delivery. The receiver, in turn, has control over the messages: not only can the policies be defined, it’s also possible to manually manage the state of the queue. Often this can be done via a UI, which may also allow you to peek at messages.

Forwarding Webhooks to a Message Queue

If you are integrating with a system that only offers Webhooks, fear not: it is quite easy to simply accept the message and put it into the Message Queue of your choice. A few Message Queues even offer an interface to act as a webhook receiver themselves, one example being IronMQ.
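A minimal forwarder sketch: validate, enqueue, and answer 200 right away. Python's in-process queue stands in for your actual Message Queue here, and `accept_webhook` is a hypothetical name you would wire into your web framework's POST handler:

```python
import json
import queue

incoming = queue.Queue()  # stand-in for SQS, Pub/Sub, IronMQ, ...

def accept_webhook(raw_body):
    """Return the HTTP status to answer with. The actual work happens
    later, when a worker pulls the payload off the queue."""
    try:
        payload = json.loads(raw_body)
    except ValueError:
        return 400  # malformed payload: reject immediately
    incoming.put(payload)
    return 200
```

From this point on, retries, dead letter queues, and monitoring are the queue's job, not the webhook endpoint's.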

Sending a Webhook from a Message Queue

I’d argue that it’s better to use a “serverless” worker (e.g. AWS Lambda, IronWorker or Google Cloud Functions) to process messages, but most cloud-based Message Queues do support pushing messages via Webhooks (IronMQ and Pub/Sub do this natively, AWS has SNS). The advantage over “plain” Webhooks is that you are still able to configure retry policies and a dead letter queue, and to monitor the system.
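In the worker approach, the queue hands the function a batch of messages; the standard SQS-to-Lambda integration, for instance, delivers an event with a `Records` list. A sketch — the `process` callable is an assumption for illustration:

```python
def make_sqs_handler(process):
    """Build an AWS-Lambda-style handler for SQS events. If process()
    raises, the batch goes back to the queue, so the queue's retry
    policy and dead letter queue still apply."""
    def handler(event, context=None):
        return [process(record["body"]) for record in event.get("Records", [])]
    return handler
```

Failure handling stays declarative: you tune the queue's redrive policy instead of writing retry loops in the worker.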

Callbacks in the commercetools platform

When designing our API and platform, we’re doing our best to make it not only easy to learn and develop for, but also easy to maintain once an application is in production. When we started designing our callback API, we first looked deeply at Webhooks, because everybody knows them, and they are, hands-down, the easiest thing to implement. But when we started designing for maintainability in a production environment, we found ourselves re-inventing the wheel and building yet another message queue.

We decided to take a step back. While it surely would have been fun to build a Message Queue, our mission is first and foremost to design a great commerce platform! Therefore, we decided to let our callback API stand on the shoulders of giants: messages are put into a Message Queue of your choice. We currently support SQS on AWS, Pub/Sub on Google Cloud and IronMQ from iron.io. Just like with programming languages, you can choose the queue that fits your needs best.

Conclusion

Webhooks allow you to quickly glue two systems together. However, for business-critical callbacks it is necessary to be in control of edge cases and monitor the health of the system.

An application consuming Webhooks can gain scalability and enable monitoring by forwarding the calls to a Message Queue. An application sending Webhooks can address these concerns, but may end up building yet another Message Queue.
