There are a couple of things I'm left wondering:
When SNS is used to notify through email, push, or SMS, then even in the fan-out pattern no SQS is involved on those paths, right? (It seems to me that SQS would only be involved on A2A paths, at least from the diagrams.) So is there anything else to help with reliability there, to make sure the notifications actually go out?
For workloads where you want to alert users through different means, not every user might have the same options selected (e.g. some only push, others SMS and email). Would that be modeled through different topics that have different combinations of subscribers attached (some with only one), or would it be better to skip SNS and push multiple messages to different queues directly?
Regarding fan-out: yes, exactly. Fan-out doesn't mean it needs to involve SQS, just that it can involve SQS. It is also called a fan-out pattern if you do A2P and only notify via email, SMS, etc.
To your second question, about what the architecture would look like for different notification preferences:
I think the main benefit of that architecture is that customers can subscribe to a topic. That means if your user A subscribes to the topic for email but not for in-app notifications, that is fine. It would also still be just the one topic.
The consumer/subscriber has the power to subscribe and unsubscribe to topics (similar to how you can with newsletters, basically).
That is one of the main benefits.
With a queue, the producer would need to define which consumer will get the message, and most probably it will be another application.
I think it may depend on your architecture, but naively:
- SQS to send alerts to, all modes
- A Lambda reading that queue, filtering on user settings
- An SNS topic per mode of communication, e.g. email or text
My thinking is that you'd want to filter on user preference early, to prevent repeated work. A benefit of this approach is that you prevent combinatorial complexity if you have both a selection of kinds of alerts delivered and of ways to deliver them: the Lambda can handle all of that based on the user settings. And you still have a single SNS topic per communication channel.
Your gating Lambda can also implement other features, like volume-aware decisions, where it e.g. rejects "marketing" messages if there have been too many to a single customer recently, while still allowing through "transaction" messages.
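A minimal sketch of what that gating Lambda's core logic could look like. The settings shape, the marketing limit, and the function name are all made up for illustration, and the per-channel SNS publish is omitted so the snippet runs standalone:

```python
# Hypothetical sketch of the gating Lambda's filtering logic. The settings
# shape, MARKETING_DAILY_LIMIT, and channels_for are assumptions; the
# actual SNS publish per channel is omitted so this runs standalone.

MARKETING_DAILY_LIMIT = 3  # assumed business rule

def channels_for(message, user_settings, recent_marketing_count=0):
    """Return the channels this message should be published to."""
    # Volume-aware gate: drop marketing messages for over-contacted users,
    # but always let transactional messages through.
    if message["kind"] == "marketing" and recent_marketing_count >= MARKETING_DAILY_LIMIT:
        return []
    # Filter on the user's opted-in channels, e.g. {"email", "push"}.
    return sorted(user_settings.get("channels", set()))

settings = {"channels": {"email", "push"}}
print(channels_for({"kind": "transaction"}, settings))  # ['email', 'push']
print(channels_for({"kind": "marketing"}, settings, recent_marketing_count=5))  # []
```

The point is just that the preference filter and the volume gate live in one place, in front of the per-channel SNS topics.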
An SNS message is delivered to a number of endpoints, which can include email, push notifications, SMS, or various AWS services. Its payload can carry the content to send to each type of endpoint, so you can make sure it goes through in the right format.
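For example, with boto3 you can publish one message with per-protocol payloads by setting MessageStructure='json'. The topic ARN below is a placeholder, and the publish call is left commented out so the snippet runs without AWS access:

```python
import json

# Per-protocol SNS payload: with MessageStructure='json', SNS requires a
# "default" key and uses the protocol-specific keys when present.
payload = {
    "default": "Your order #1234 has shipped.",
    "email": "Hi!\n\nYour order #1234 has shipped and is on its way.",
    "sms": "Order #1234 shipped.",
}
message = json.dumps(payload)

# With boto3 this would be published as (topic ARN is a placeholder):
# boto3.client("sns").publish(
#     TopicArn="arn:aws:sns:us-east-1:123456789012:order-events",
#     Message=message,
#     MessageStructure="json",
# )
print("default" in json.loads(message))  # True
```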
>SQS has a many-to-one relationship. You can send messages to a queue from many different producers but only one consumer can be defined. A consumer is another application, most often some compute instances such as Lambda, EC2, or Fargate.
My understanding is that you can have multiple consumers of an SQS queue through the use of visibility timeouts[0]. Once a message is consumed, it is as if that message doesn't exist for all other consumers until it reaches a timeout period or is marked done by that consumer. You can also manually mark a message as being ready for other consumers, which moves the message back into the queue for the other consumers to see.
I'm going to be linking this article to my team. We've been talking about moving to SNS/SQS/etc. and this article helps understand the use cases and distinctions better.
1. The main point of the visibility timeout is to handle failure. A message is read by a consumer; the visibility timeout starts; that consumer finishes some processing; then deletes the message from the queue. But, what happens if the consumer encounters a fault during processing which destroys its ability to even tell the queue it encountered a fault? The visibility timeout protects against that; the message just naturally reappears in the queue for processing by another consumer. If one overloaded the visibility timeout to also mean "other consumers should process this", you'd lose the ability to handle faults.
2. It also screws up dead-letter redrive policies, which are primarily based on visibility timeout lapses (in addition to communicated failures). You basically could not reliably put a dead-letter redrive on your queue, which again just means you're protecting against fewer failure modes.
3. There would be natural, avoidable latency in waiting for the visibility timeout on every fan-out, whatever you set it to. 1 second? 100 consumers? That message is just clogging up the queue for over a minute as it gets fanned-out to everyone.
4. Consumer1 eats the first message, then its visibility times out; it's back in the queue; there's no way to ensure that message isn't just processed again by Consumer1 instead of Consumer2! You're basically tossing a coin and hoping that, eventually, Consumer2 gets its turn at the message, all the while having Consumer1 reprocess the message an indefinite number of times.
5. Someone has to delete the message. Who? The "last" component to touch it? Once all the other components are done? How do you coordinate that? There's no guarantee of ordering on when each component sees the message. You'd need some kind of external state, and at that point, why are you even using SQS?
You could theoretically have each consumer read from a queue, process the message, delete that message from the queue, then redrive the message into a new queue for processing by another consumer. This may make sense if you have strict ordering needs for processing but still want the benefits of SQS. You could even have it redrive into N queues for N consumers at the same time. But, at that point, why? We're trying to put a square peg in a round hole; SQS is designed for single consumers. There are far better and simpler tools out there if what you're looking for is multi-consumer fan-out.
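To make point 1 concrete, here's a toy in-memory model of visibility-timeout behaviour. This is not an AWS API, just an illustration of the receive/hide/reappear/delete cycle:

```python
import time

# Toy model of SQS visibility-timeout behaviour; not an AWS API.
class ToyQueue:
    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self.messages = {}  # id -> (body, invisible_until)

    def send(self, msg_id, body):
        self.messages[msg_id] = (body, 0.0)

    def receive(self, now=None):
        now = time.monotonic() if now is None else now
        for msg_id, (body, invisible_until) in self.messages.items():
            if now >= invisible_until:
                # Start the visibility timeout: the message is hidden, not gone.
                self.messages[msg_id] = (body, now + self.visibility_timeout)
                return msg_id, body
        return None

    def delete(self, msg_id):
        self.messages.pop(msg_id, None)

q = ToyQueue(visibility_timeout=30)
q.send("m1", "hello")
assert q.receive(now=0) == ("m1", "hello")   # a consumer picks it up
assert q.receive(now=10) is None             # invisible to everyone else
assert q.receive(now=31) == ("m1", "hello")  # consumer crashed: it reappears
q.delete("m1")                               # successful processing removes it
assert q.receive(now=100) is None
```

The timeout exists so a crashed consumer's message naturally comes back; overloading it to mean "pass this to the next consumer" gives up that protection.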
I have used SQS in the parent's suggested fashion for many years. I feel like your points are overstated. The visibility timeout's "main point" is not only to handle failure, nor do the AWS docs themselves state that. AWS's built-in redrive policies have been more than sufficient to correctly handle error scenarios.
> there's no way to ensure that message isn't just processed again by Consumer1 instead of Consumer2!
Correct, but this isn't the job of the pipe. Smart endpoints, dumb pipes.
You're welcome to design systems however you want. But this is, put simply, bad advice; and when sharing advice like this to people who may be learning these things for the first time it's critical to communicate not just what these complex components are capable of, but how to best work with them to build reliable and effective systems.
If you have multiple heterogeneous consumers, do not use a single SQS queue.
I can't even comprehend how you would engineer around the issue of consumer re-processing. You can quote metaphors all day; if you love the idea of dumb pipes, why doesn't the city transport clean and gray water in the same pipe? Do you want to wash your hands using flushed toilet water?
Similarly, you can't engineer around heterogeneous consumers grabbing a message, putting it back in the queue, then consuming it again. You can make them smart! You can have them say "whoa, hold on, I already saw that message, I don't need to see it again, put it back." Or you can make them idempotent so reprocessing isn't undesirable. But it's still reprocessing; it's still a huge waste, and it will probably require external state to manage. Moreover, there's literally no system guarantee that Consumer2 will ever see that message; it'll probably see it, fifty-fifty, and if one consumer is faster at accessing the AWS API than the other, who knows, anything could happen. But at least it's convenient?
The city doesn't require every household to have gray water filtration. Because that would be insane. The pipes don't have to be "smart". We just build two pipes!
This just blows my mind too. The pipe analogy is apt. Using logic to dispatch to whatever pipe -> consumer you want is the way to use queues. Turning it upside down and using properties of the queue to have consumers decide what they want to take and sending it back to others is just unquestionably bad design when you could just make more queues!
Conceptually it is still a terrible idea to have multiple consumers (by multiple consumers I mean things doing different actions on a message, not concurrent consumers doing the same action) on a single queue. Why overload a queue like that for two different actions when one can fan out an action to two queues with SNS? Then your consumer does not have to determine whether the message is for it or not. Visibility timeouts are for concurrency/errors within a single action. Yes, you could hijack them and have two consumers act on one message and do different things, but that is confusing and has no benefit over just having two queues.
The primary reasons for multiple consumers on a queue are availability and SLA reasons, as well as easier horizontal scaling. Otherwise you'll need a queue-scheduler type system that can signal or serve out queue locators to idle consumers, and you start getting into technical scenarios similar to freakin' ESBs. At enough scale you already have that setup for multi-region failover purposes, sure, but the granularity of queue-consumer routing is based not so much on concurrency to the queues as on concurrency and routing across several regions, with n queues in between that serve as priority queues.
Also, two different queues are two different buffers with durability issues, and in an improperly conceived architecture they can amount to a distributed RAID0 of messages.
It really depends upon the tolerance for message duplication, SLA needs, and how prioritization should be handled. At a previous place we had multiple consumers for multiple SQS queues representing different priorities within the same region, and it worked fine for many years, with the primary headache being that message de-duplication handling was tricky.
This discussion is, at least it seems, mostly about multiple heterogeneous consumers, not homogeneous consumers/replicas/horizontal scaling. So, if Slack sends a queue message for every DM that's sent: the difference between having 1 consumer that updates the database and 1 consumer that sends a push notification, versus having 2 consumers that both only update the database.
The idea of having multiple homogeneous consumers shouldn't be controversial; that's just horizontal scaling. And, well, at least until a few hours ago I also would have said that the idea of having multiple heterogeneous consumers is uncontroversially bad. But I guess everyone has "their way" of doing things.
It's also important to note that there's a third situation I see somewhat often: maybe call it homogeneous delegated consumers, whereby you've got messages like '{"type":"SendDM", "content": {}}'. Or maybe: '{"type":"SendDM", "action": "UpdateDB", "content":{}}'. The consumers are still homogeneous, they all run the same code, but they may internally delegate the message to do different things depending on enums within the message. This is pretty OK; it's different because at least you'd never have a consumer hit a message and be like "I don't want this, take it back".
Though I'd caution against it; just understand that it's something of a 'hack' to make one queue act like N queues, and that's OK if you're small and have a good grasp on the problem domain. The big issue it will inevitably run into is that some queue message "kinds" will take a lot longer to process than others; so if you're e.g. overloading a queue to handle both a simple email send and a much more complex asynchronous database update, you'll inevitably get delayed emails. Absolutely inevitable. But it can work for a time.
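That delegated-consumer pattern can be sketched roughly like this (handler names and message shapes are invented for illustration):

```python
import json

# Sketch of "homogeneous delegated consumers": every consumer runs the
# same dispatcher and routes on the "type" field inside the message.
# Handler names and message shapes are made up for illustration.

def handle_send_dm(content):
    return f"DM sent: {content.get('text', '')}"

def handle_update_db(content):
    return f"DB updated for user {content.get('user_id')}"

HANDLERS = {
    "SendDM": handle_send_dm,
    "UpdateDB": handle_update_db,
}

def dispatch(raw_message):
    msg = json.loads(raw_message)
    handler = HANDLERS.get(msg["type"])
    if handler is None:
        raise ValueError(f"unknown message type: {msg['type']}")
    return handler(msg.get("content", {}))

print(dispatch('{"type": "SendDM", "content": {"text": "hi"}}'))  # DM sent: hi
```

Every consumer accepts every message, so nothing ever gets "handed back" to the queue; the tradeoff is the head-of-line blocking described above.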
Weird example, do people actually use multiple consumers doing different things to a single message? You just queue multiple messages with different properties and consumers process things the same way.
I’m curious why you would do it this way vs publishing to SNS and having that fan out to multiple queues where each consumer can listen for the things it needs to work on (as mentioned in the original article.)
I found it just was not necessary for most cases. And the way I got there was working backwards from the "web scale" technologies like kafka, kinesis, and dynamodb.
I built a data ingestion system that handled an average of 300 messages/sec, peaking at 1,000, and writing to a single R3 RDS instance. You can do a lot by pushing simple scaling strategies to their limit. Everyone thinks they need to handle web scale, but really you just need to handle your scale.
Excellent write-up. I think when dealing with messaging systems it's important to know the difference between Pub/Sub vs Point to Point models or Topics vs Queues.
Can you technically use a Queue as a topic for pub/sub? Yes. But should you? Probably not. You're much better off not using SQS for that and instead using SNS.
I wish SNS had a way to have a process receive a message, or had a watcher for it in an AWS SDK like Boto. It feels like a big hole in actually using SNS as a pub/sub mechanism. It would be much simpler than having to set up and maintain an HTTP endpoint.
Yes, and you can group the messages such that messages within a group are (almost?) always consumed in order. I think the distinction, though, is that with SNS each message is consumed by each consumer, whereas with SQS each message is consumed by one node (so you can only really have one system that reads from the queue).
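For reference, ordering per group is a FIFO queue feature; a send to a FIFO queue looks roughly like this (queue URL and IDs are placeholders; with boto3 you'd pass these to sqs.send_message(**params)):

```python
# Sketch of sending to a FIFO queue; URL and IDs are placeholders.
# With boto3 you would call sqs.send_message(**params).
params = {
    "QueueUrl": "https://sqs.us-east-1.amazonaws.com/123456789012/orders.fifo",
    "MessageBody": "order-created",
    # Messages sharing a MessageGroupId are delivered in order; different
    # groups can be consumed in parallel.
    "MessageGroupId": "customer-42",
    # Required on FIFO queues unless content-based deduplication is enabled.
    "MessageDeduplicationId": "order-9001",
}
print(params["MessageGroupId"])  # customer-42
```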
I would still view it as many-to-one. Visibility timeouts are for concurrency. Semantically speaking, I would consider one consumer with n concurrent workers as one consumer function/service. In workflow terms, an SNS topic is a fan-out, and an SQS queue is a queue.
You can subscribe multiple Lambdas to an SQS queue. It's not recommended, but it's doable. Many-to-many or many-to-one depends on your choices of infrastructure.
I guess we can debate the semantics of it because it is technically possible. But it is terrible design to have one SQS queue feeding many different consumers. If someone did that I would reject it on review. In any proper usage of SQS it is many-to-one.
there are edge cases where that's desirable. I won't enumerate them here but they're discoverable on the Googles. I also would advise against that kind of passionate adherence to infrastructure dogma, taking a more analytical approach to review.
If you could name even one I would remove the "dogma". I cannot think of why anyone would want to do that. And if someone did want to do that they would have to have a very compelling reason to complicate what is usually an easy thing (One action listening on a queue)
I am not referring to multiple homogeneous consumers processing a queue. That is fine. You have a pool of consumers that can pick up from the queue. That is still considered one entity/actor. The people here are proposing having multiple heterogeneous consumers consume from the queue. That is bad.
In my experience, SQS queues attached to a SNS topic is a common configuration. Each subscribed queue has a worker or "worker group" that gets a copy of all messages.
The topic has "zero or more" subscribed queues, and when sending to the SNS topic, you don't need to know how many subscribers there are right now.
In many cases, writing to an SNS topic is a more flexible equivalent of writing directly to an SQS queue.
I'm confused about calling SQS a many-to-one service. One of the use cases I've seen (and that is endorsed by AWS) is a worker queue, where someone puts a job into the queue and there's a whole bunch of workers waiting. One of the workers grabs a message, processes it and then deletes it. That seems like a many-to-many service to me… Or maybe a many-to-one-of-many service?
You're kind of on it.
> One of the workers grabs a message, processes it and then deletes it.
If I understood it correctly, in SQS a message can only be handled by one worker, while SNS can have several workers processing the same message (let's say one worker ingesting the message into system A and another worker ingesting the same message into system B).
I really liked this article because it explained the differences in a very understandable and relatable way, and also because it sold me on ServerlessQ.
This is a great example of content marketing that uses a question to both answer and impart value in a very logical and compelling way. Through understanding the differences between SQS and SNS the reader will know how to use those in the future, get a clearer idea of the complexity, and be presented with a much simpler and still full-featured alternative at the end. Well done.
Technically Lambda polls SQS (minimum 5 pollers when used as an event source for Lambda). But you don't have to think too much about it.
It's mostly important to remember when you're wondering why your mostly-empty queue is still costing you $0.26/month, even if you've got the receive message wait time (long polling) set to 20 seconds.
If you use this, though, you should be aware that the Lambda-SQS polling mechanism doesn't know about the concurrent execution limits for Lambda. If you have set a limit on concurrent executions for your function, or are at the account-level limit, the poller will continue to fetch messages and invoke functions even if execution would fail. This increments the receive count, and if you have set a limit on the receive count, you may find you drop messages or send them to the DLQ before your handler function has had a chance to process a given message.
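If it helps, that receive-count limit lives in the queue's redrive policy, set as a queue attribute. A sketch (the DLQ ARN is a placeholder, and the maxReceiveCount value is just an example you'd tune against the throttling behaviour described above):

```python
import json

# Sketch of setting a dead-letter redrive policy on a queue; the ARN is a
# placeholder. With boto3 you would apply it via
# sqs.set_queue_attributes(QueueUrl=queue_url, Attributes=attributes).
redrive_policy = {
    "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456789012:my-dlq",
    # Throttled invokes still increment the receive count, so keep this
    # high enough to survive transient Lambda concurrency limits.
    "maxReceiveCount": "5",
}
attributes = {"RedrivePolicy": json.dumps(redrive_policy)}
print(json.loads(attributes["RedrivePolicy"])["maxReceiveCount"])  # 5
```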
The Lambda-SQS poller will slow down in response to the lambda's error rate for this exact reason.
Note that this can cause issues: say you have a time sensitive application that receives a batch of "bad" messages which cause failed lambda invocations. The poller will slow down and the throughput will drop drastically, even though your intention might be for the lambda to continue processing at the same rate and power through the bad messages.
This behavior can be disabled with a support request.
Sure but it’s transparent. You don’t need to go write all that logic, against every queue, with possible bugs, handle different scenarios, etc. You just need to write the logic to process the message passed to the lambda.
Under the hood it might be polling. But it gives the illusion it’s push and it’s bloody useful.
Exactly. SQS is a queue where a message resides. Lambda basically polls for these messages.
But like kondro said, it basically doesn't matter (directly). I think it is important to understand that the event source mapping takes over the polling. But in the end it looks like a push system.
But the whole point about costs with empty queues is important. It is definitely worth understanding if you want to customize queues, for example with long polling. That parameter changes how long your Lambda will poll your queue.
Great writeup and comparison. Also, serverlessq looks really interesting. I've struggled with queues quite often in the past – can't wait to give this one a try.
If you want, give ServerlessQ a try and let me know if you need any help!
You find my contact info on my landing page: serverlessq.com or DM me on twitter: twitter.com/sandro_vol
FWIW you can wire an SNS topic directly to a Lambda - But watch out for frequency rules around that setup. Too many posts to the SNS Topic and you'll run into rate limits and dropped messages. (This is also why most suggest going to a queue first)
You can publish to an SNS topic with an SQS queue subscribing to it and invoking your lambda. This provides durability and allows you to configure other nice things like dead-lettering for any message that fails in lambda. You can survive some pretty massive outages with a setup like this. Oh and also, as of recently you can re-drive messages in your dead-letter queue to a source queue via SQS console.
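One wiring detail that trips people up in that SNS -> SQS -> Lambda chain: the queue needs a policy allowing the topic to deliver to it. A sketch with placeholder ARNs:

```python
import json

# Sketch of the SQS queue policy that lets an SNS topic deliver to the
# queue; both ARNs are placeholders.
queue_arn = "arn:aws:sqs:us-east-1:123456789012:orders-queue"
topic_arn = "arn:aws:sns:us-east-1:123456789012:orders-topic"

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "sns.amazonaws.com"},
        "Action": "sqs:SendMessage",
        "Resource": queue_arn,
        # Restrict delivery to this one topic.
        "Condition": {"ArnEquals": {"aws:SourceArn": topic_arn}},
    }],
}
print(policy["Statement"][0]["Action"])  # sqs:SendMessage
```

This would be attached as the queue's Policy attribute (e.g. via set_queue_attributes) alongside the subscription itself.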
Fairly certain that ability has been around a few years.
I left that comment there for folks who didn't know that was possible, as going the SNS -> SQS -> Lambda route is what is most popular and most written about.
Nice writeup! It'd be helpful to mention that SNS allows you to set up a retry policy and also allows you to configure a DLQ for when retries are exhausted.