> 2. "Also, Google wouldn’t be able to scan and index the text of your e-mails. That’s a problem if you need to search for old emails not stored on your own machine. It could be a real issue for Google’s business model as well, which involves scanning the text of emails in order to place contextual advertising."
Caveat: I am no expert whatsoever in crypto/CS so understand this is almost certainly non-workable for some reason (which is what I'm trying to understand). Why couldn't Google hash every word in an email and store the list of hashes alongside the encrypted version. Then when you search for something, the search terms are hashed as well and the service searches for the hashed search terms amongst the lists of hashes.
I haven't read all those papers so I'm not sure how close it to working, but from what I have read I'm not sure if it even is practical for searching your own mail. For google indexing everybody's email, that would be contradictory as indexing an email basically reveals its content to the party that is using the index to look up.
Not sure such an approach is needed. The same key used to encrypt email could be used to encrypt a search catalog. The user decrypts the entire catalog when a search needs to be done. The risk of the catalog getting too big could be mitigated by making the indexer constrain the catalog to emails from the last 30 days or so, and making the complete catalog available offline. It can be tuned by letting users add important older emails to the catalog, and so on.
It's an approach I've used with a store/forward database and worked well for me with that.
Hashing only helps when the number of possible inputs is very very large. When the number of possible inputs is "the number of words in the dictionary", or "every IPv4 address" or "every phone number" etc, then it would take a modern home computer a few seconds to generate a raintable which would make the hashes instantly reversible.
In order to do this, Google would need access to the plaintext of the email message, which defeats half of the purpose of the supposed encryption initiative.
Because that's not really making it private. If you have each word hashed and a rainbow table full of hashes matched to words, it's very easily reversible. It doesn't really introduce any privacy.
Not to mention the computational and storage cost.
No. With a list of the users hashes and their salt, you could reverse it all in a matter of minutes or seconds on even a single low spec machine. It would offer nothing over just storing the plain text.
are you sure? I can think of weaknesses caused by statistical properties of a large corpus of hashed terms (all with the same salt) clustered together in documents which follow a natural language distribution, but matter of minutes or seconds? Why doesn't that simplicity apply for salted hashed passwords?
You can guess 90% of the words in the OEC with only 7,000 words in your rainbow table. I suspect that is a pretty fair representation of e-mails text. Even if it is the 1,000,000 number...
[As of 2011, commercial products are available that claim the ability to test up to 2,800,000,000 passwords per second on a standard desktop computer using a high-end graphics processor.]
http://en.wikipedia.org/wiki/Password_strength#Password_gues...
So ya. If it is a per-word hash, there is no real security value if you have the salt.
The plain text is dictionary words, given that the hash and the salt is known it will be really quick to hash every dictionary word for a single user. There are 99171 words in /usr/share/dict/words and off the shelf hardware can do 1300 million SHA1 hashes per second [0]
There's only around a million (or a couple million if you're generous with conjugations, lulzspeak, etc) English words.
If you figure passwords understand 75 characters, then one million passwords is around 3.2 randomly chosen characters. If you step up to 4 randomly chosen characters, you'd cover 31 million entries, and hence have about the same strength as reversing a hash of 31 million different words.
A 4 character password is woefully weak by modern standards.
There are only probably a million or so frequently-appearing terms in English-language email messages (including the most common proper nouns, though of course not all of the long tail). Computing a million hashes with a custom salt is still trivial for modern computers.
My understanding of why this doesn't work for passwords is because the number of possible passwords (the size of the rainbow table needed for each salt) is much larger than the number of words in the english language.
Look up "bcrypt" and "scrypt". Using salted hashes for storing passwords is still common, but only for legacy reasons. It's considered insecure nowadays. Machines have got veryvery good at hashing over the last few years.
bcrypt is still a hash, it even incorporates a salt. Of course sha256 is too fast to compute, bcrypt fixes that.
I feel that this whole thread is based on cargo-culting. There are plenty of or (interesting) blog posts with misleading titles like "you shouldn't use hashes for passwords", which are correctly explaining why the cryptographic hash functions are not well suited to hashing passwords.
However, the solution is to use another method to hash the password. But it's still a hash.
A key derivation function used to process a password, produces a hash if you intend to use it as a hash, i.e. to compare it with another hash. If you use it to encrypt something then that would be a key.
"Password-based key derivation functions are used for two primary purposes:
First, to hash passwords so that an attacker who gains access to a password file
does not immediately possess the [..]" (emphasis mine)
We say "don't just store salted hashes" because when people hear that, they think one run of SHA256 (or MD5 or whoknows) with a salt. You shouldn't do that. You should use a prepared method, because it's easy and (much more likely to be) correct. If you want to bodge together your own solution based on some large iterations of SHA256, you can, but it will take you much longer than just dropping in (b/s)crypt, and even longer if you have to match the (b/s)crypt feature set it provides out of the box. And you still have to face the fact that your SHA256 solution may not be as secure as you think because it's still a fast hash function, and an adversary may be able to process it much faster than you expect, whereas the (b/s)crypt have considered that in their design.
Pretty much by definition, advice provided to low-crypto-knowledge people can not depend on high levels of crypto knowledge. So, yes, pedantically you can point out that (b/s)crypt still produces a hash by the technical definition of hash. But you only muddy the waters for the low-knowledge people by doing so, and you probably shouldn't hold your breath waiting for plaudits from the high-knowledge people for doing so.
Since the original topic was how to build an server side index that can let users find relevant documents in an encrypted corpus without revealing the corpus to said server (e.g. [1]), I think that whoever diverts the thread into a discussion about the right function to use with passwords is the one being pedantic and missing the point.
Is it possible to build a index for an encrypted corpus that preserves the user privacy even in case the server is taken over?
I don't know ([2], [3]?). But I certainly know that the main weakness is not that it's impossible to hash terms in such a way that a weak machine cannot reverse significant portions of the index in minutes/days.
Seeing the word "hash" in this context clearly doesn't specify a particular hash function. Saying that using a salt fixes the issue with rainbow tables too doesn't specify whether we are talking about naive md5/sha salted hash or a KDF. These details were irrelevant to the discussion.
One thing you need to be wary of is plaintext attacks. Even if you use a salt and a difficult has but I send you an e-mail and know the contents and can obtain the ciphertext or its digests, you are vulnerable. There are ways past this, of course, but it is one example of a valid attack.
The pdf I referenced has what seems to be a pretty workable approach to me and lets you do hidden searches, boolean searches, phrases, proximity queries, etc.
Caveat: I am no expert whatsoever in crypto/CS so understand this is almost certainly non-workable for some reason (which is what I'm trying to understand). Why couldn't Google hash every word in an email and store the list of hashes alongside the encrypted version. Then when you search for something, the search terms are hashed as well and the service searches for the hashed search terms amongst the lists of hashes.
Edit: Thanks everyone!