There are tons of emails that share the same prefix. When you look up a prefix, you can't simply get a boolean response. You have to get a list of emails as the response. The client then searches through the list to see if the desired email is in the list or not. Returning a list of emails instead of a single bit significantly increases the data size.
Additionally, people don't just want a boolean answer to "was my email breached somewhere". They want a list of all the breaches the email appeared in. So the returned data actually needs to be a list of emails plus, for each email, the list of breaches it was in.
>Via the public API. This endpoint also takes an email address as input and then returns all breaches it appears in.
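For concreteness, here's a rough sketch of the prefix lookup flow described above. It is not Troy's actual implementation: SHA-1 as the hash, the 6-hex-character prefix (which is what gives 16^6 ≈ 16 million prefixes), the function names, and the breach names are all assumptions made for illustration.

```python
import hashlib

PREFIX_LEN = 6  # assumed: 6 hex chars -> 16**6 = 16,777,216 possible prefixes


def email_hash(email: str) -> str:
    # SHA-1 is an assumption; the thread doesn't say which hash is used.
    return hashlib.sha1(email.strip().lower().encode("utf-8")).hexdigest()


def build_index(breached: dict[str, list[str]]) -> dict[str, dict[str, list[str]]]:
    """Server side: group hash suffixes (and their breaches) by hash prefix."""
    index: dict[str, dict[str, list[str]]] = {}
    for email, breaches in breached.items():
        h = email_hash(email)
        index.setdefault(h[:PREFIX_LEN], {})[h[PREFIX_LEN:]] = breaches
    return index


def lookup_prefix(index: dict[str, dict[str, list[str]]], prefix: str) -> dict[str, list[str]]:
    """Server side: return every suffix under the prefix, not a single bit."""
    return index.get(prefix, {})


def check_email(index: dict[str, dict[str, list[str]]], email: str) -> list[str]:
    """Client side: fetch the whole bucket, then search it locally."""
    h = email_hash(email)
    bucket = lookup_prefix(index, h[:PREFIX_LEN])
    return bucket.get(h[PREFIX_LEN:], [])


# Hypothetical sample data, not real breach records.
index = build_index({"alice@example.com": ["BreachA", "BreachB"]})
print(check_email(index, "Alice@Example.com"))
print(check_email(index, "bob@example.com"))
```

The point is that the response for one prefix is the whole bucket (every suffix plus its breach list), which is why the payload is so much bigger than a yes/no answer.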
> The client then searches through the list to see if the desired email is in the list or not.
The initial prefix check would probably reduce the number of lookups necessary, as it would only be necessary to do a deeper search if the prefix matches.
>only be necessary to do a deeper search if the prefix matches
There are 5 billion emails in at least 1 breach and 16 million prefixes. Almost all if not all prefixes have at least 1 email in a breach. So almost all prefixes match. I don't see why it's useful to spend a bunch of effort optimizing the very rare case of a prefix not matching.
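A quick back-of-the-envelope check of that claim, assuming hashes spread uniformly over the prefixes (so the number of emails per prefix is roughly Poisson). The variable names are mine, and the 5 billion and 16^6 figures are just the ones quoted above.

```python
import math

EMAILS = 5_000_000_000
PREFIXES = 16 ** 6  # 16,777,216 six-hex-character prefixes

# Average number of breached emails that land in each prefix bucket.
avg_per_prefix = EMAILS / PREFIXES

# Poisson approximation: chance that a given prefix has zero emails.
p_empty = math.exp(-avg_per_prefix)

# Expected number of prefixes with no breached email at all.
expected_empty = PREFIXES * p_empty

print(f"average emails per prefix: {avg_per_prefix:.0f}")
print(f"expected empty prefixes:   {expected_empty:.3g}")
```

With roughly 300 emails per prefix on average, the expected number of empty prefixes is effectively zero under this assumption, so a prefix-level filter would almost never get to say "no".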
Now, if the bloom filter checked emails instead of checking prefixes, that would be useful. However, an optimally sized bloom filter of 5B elements with a 10% false positive rate would be about 2.8 GiB (roughly 3 GB), which is prohibitively large.
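That size comes from the standard formula for an optimally sized bloom filter, m = -n ln(p) / (ln 2)^2 bits. A small sketch to reproduce it (function names are my own):

```python
import math


def bloom_bits(n: int, p: float) -> int:
    """Bits needed for n elements at false positive rate p (optimal sizing)."""
    return math.ceil(-n * math.log(p) / (math.log(2) ** 2))


def optimal_hashes(n: int, m: int) -> int:
    """Number of hash functions that minimizes false positives for m bits."""
    return max(1, round(m / n * math.log(2)))


n = 5_000_000_000  # emails
p = 0.10           # 10% false positive rate

bits = bloom_bits(n, p)
size_gib = bits / 8 / 2 ** 30
k = optimal_hashes(n, bits)

print(f"{bits / n:.2f} bits per element, {size_gib:.2f} GiB, {k} hash functions")
```

That works out to a bit under 5 bits per email, which is where the multi-gigabyte figure comes from. Lowering the false positive rate only makes it bigger.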
Yeah, that was my point: you can get rid of a significant portion of requests at the edge with a bloom filter, and there's no reason you have to build the bloom filter locally as requests come in. Instead, it can be created ahead of time, when the dataset is updated.
Also, regarding "you can get rid of a significant portion of requests at the edge with a bloom filter", Troy's existing design already gets rid of a significant portion of requests at the edge. That's why he says:
>The response from each search was coming back so quickly that the user wasn’t sure if it was legitimately checking subsequent addresses they entered or if there was a glitch.