> 2. "Also, Google wouldn’t be able to scan and index the text of your e-mails. ...

smsm42 · on April 23, 2014

This requires what is called homomorphic encryption: https://en.wikipedia.org/wiki/Homomorphic_encryption

There seems to be active research done in this field, e.g.: http://research.microsoft.com/en-us/people/klauter/cryptosto... http://research.microsoft.com/en-us/um/people/senyk/slides/e...

I haven't read all those papers, so I'm not sure how close it is to working, but from what I have read I'm not sure it is even practical for searching your own mail. For Google indexing everybody's email, it would be contradictory, as indexing an email basically reveals its content to the party that uses the index for lookups.

FourthProtocol · on April 24, 2014

Not sure such an approach is needed. The same key used to encrypt email could be used to encrypt a search catalog. The user decrypts the entire catalog when a search needs to be done. The risk of the catalog getting too big could be mitigated by making the indexer constrain the catalog to emails from the last 30 days or so, and making the complete catalog available offline. It can be tuned by letting users add important older emails to the catalog, and so on.

It's an approach I've used with a store/forward database and worked well for me with that.

mike-cardwell · on April 23, 2014

Hashing only helps when the number of possible inputs is very very large. When the number of possible inputs is "the number of words in the dictionary", or "every IPv4 address" or "every phone number" etc, then it would take a modern home computer a few seconds to generate a raintable which would make the hashes instantly reversible.

dublinben · on April 23, 2014

>raintable

I knew what you meant, but if anyone else was confused, the correct term is "rainbow table."

https://en.wikipedia.org/wiki/Rainbow_table

brown9-2 · on April 23, 2014

In order to do this, Google would need access to the plaintext of the email message, which defeats half of the purpose of the supposed encryption initiative.

level · on April 23, 2014

Because that's not really making it private. If you have each word hashed and a rainbow table full of hashes matched to words, it's very easily reversible. It doesn't really introduce any privacy.

Not to mention the computational and storage cost.

ithkuil · on April 23, 2014

would a per user salt help with the rainbow table issue?

mike-cardwell · on April 23, 2014

No. With a list of the user's hashes and their salt, you could reverse it all in a matter of minutes or seconds on even a single low-spec machine. It would offer nothing over just storing the plain text.

7952 · on April 23, 2014

So store the salt on the client, and reindex if the salt is lost?

ithkuil · on April 23, 2014

are you sure? I can think of weaknesses caused by statistical properties of a large corpus of hashed terms (all with the same salt) clustered together in documents which follow a natural language distribution, but a matter of minutes or seconds? Why doesn't that simplicity apply to salted hashed passwords?

opendais · on April 23, 2014

In theory, passwords are random combinations of words and/or characters, so you cannot 'guess' more than a character at a time.

This scales very quickly to the number of combinations.

http://www.oxforddictionaries.com/us/words/the-oec-facts-abo...

You can guess 90% of the words in the OEC with only 7,000 words in your rainbow table. I suspect that is a pretty fair representation of e-mails text. Even if it is the 1,000,000 number... [As of 2011, commercial products are available that claim the ability to test up to 2,800,000,000 passwords per second on a standard desktop computer using a high-end graphics processor.] http://en.wikipedia.org/wiki/Password_strength#Password_gues...

So ya. If it is a per-word hash, there is no real security value if you have the salt.

desas · on April 23, 2014

The plain text is dictionary words. Given that the hash and the salt are known, it will be really quick to hash every dictionary word for a single user. There are 99171 words in /usr/share/dict/words, and off-the-shelf hardware can do 1300 million SHA1 hashes per second [0]
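To make the point above concrete, here is a minimal Python sketch of that dictionary attack. The function names, the sample salt, and the tiny word list are all made up for illustration; a real attacker would load something like /usr/share/dict/words instead. It assumes each indexed term is stored as SHA-1 of the salt followed by the word, which is only one possible scheme.

```python
import hashlib

def salted_hash(salt: bytes, word: str) -> str:
    """Hash a single term with a per-user salt (assumed scheme: SHA-1(salt + word))."""
    return hashlib.sha1(salt + word.encode("utf-8")).hexdigest()

def reverse_index(salt: bytes, hashed_terms, wordlist):
    """Recover terms by hashing every candidate word once with the known salt."""
    lookup = {salted_hash(salt, w): w for w in wordlist}
    # Terms outside the word list stay unresolved (None).
    return {h: lookup.get(h) for h in hashed_terms}

# Hypothetical data: the server holds the salt and the hashed terms.
salt = b"per-user-salt"
wordlist = ["meeting", "invoice", "password", "lunch", "tomorrow"]
index = [salted_hash(salt, w) for w in ["lunch", "tomorrow", "zyzzyva"]]

recovered = reverse_index(salt, index, wordlist)
print(list(recovered.values()))
```

The cost is one hash per dictionary word per user, so the salt only prevents sharing a precomputed table across users; it does not make the per-user work expensive.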

[0] http://security.stackexchange.com/questions/8607/how-quickly...

DerpDerpDerp · on April 23, 2014

There's only around a million (or a couple million if you're generous with conjugations, lulzspeak, etc) English words.

If you figure passwords draw from 75 characters, then one million possibilities corresponds to around 3.2 randomly chosen characters. If you step up to 4 randomly chosen characters, you'd cover 31 million entries, and hence have about the same strength as reversing a hash of 31 million different words.

A 4 character password is woefully weak by modern standards.
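The arithmetic behind those figures can be checked with a few lines of Python. The variable names are mine; the 75-character alphabet and the one-million-word vocabulary are the assumptions stated in the comment above.

```python
import math

ALPHABET_SIZE = 75        # characters a password may draw from (assumed above)
VOCABULARY = 1_000_000    # rough count of English words (assumed above)

# Length of a random password with as many possibilities as the vocabulary:
# solve ALPHABET_SIZE ** n == VOCABULARY for n.
equivalent_length = math.log(VOCABULARY) / math.log(ALPHABET_SIZE)

# Number of distinct 4-character passwords over the same alphabet.
four_char_space = ALPHABET_SIZE ** 4

print(f"{equivalent_length:.1f} characters")   # about 3.2
print(f"{four_char_space:,} passwords")        # 31,640,625
```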

ithkuil · on April 24, 2014

the client can generate terms by combining a secret key with the term while hashing it (keyed-hash).

See one particular approach in: http://people.csail.mit.edu/akiezun/encrypted-search-report....
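A minimal sketch of the keyed-hash idea, assuming HMAC-SHA256 as the keyed hash and a key that never leaves the client. This is not taken from the linked report; the function names and sample messages are hypothetical.

```python
import hashlib
import hmac
import secrets

def term_token(key: bytes, term: str) -> str:
    """Derive an opaque search token from a term using a client-held secret key."""
    return hmac.new(key, term.lower().encode("utf-8"), hashlib.sha256).hexdigest()

def build_index(key: bytes, docs: dict) -> dict:
    """Client side: map each term's token to the ids of the messages containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.split()):
            index.setdefault(term_token(key, term), set()).add(doc_id)
    return index

def search(index: dict, key: bytes, term: str) -> set:
    """The client computes the token; the server only ever sees the token."""
    return index.get(term_token(key, term), set())

key = secrets.token_bytes(32)  # stays on the client
docs = {"m1": "lunch tomorrow", "m2": "invoice attached", "m3": "Lunch moved"}
index = build_index(key, docs)  # this mapping is what the server would store
print(sorted(search(index, key, "lunch")))
```

Without the key, a server cannot run the dictionary attack described elsewhere in this thread. The tokens are still deterministic, though, so term frequencies and co-occurrence patterns remain visible to whoever holds the index.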

davmre · on April 23, 2014

There are only probably a million or so frequently-appearing terms in English-language email messages (including the most common proper nouns, though of course not all of the long tail). Computing a million hashes with a custom salt is still trivial for modern computers.

jerf · on April 23, 2014

"Why doesn't that simplicity apply for salted hashed passwords?"

It does. Don't store salted hashes for passwords.

shkkmo · on April 24, 2014

Huh? What do you store instead?

My understanding of why this doesn't work for passwords is that the number of possible passwords (the size of the rainbow table needed for each salt) is much larger than the number of words in the English language.

mike-cardwell · on April 24, 2014

Look up "bcrypt" and "scrypt". Using salted hashes for storing passwords is still common, but only for legacy reasons. It's considered insecure nowadays. Machines have got very very good at hashing over the last few years.
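For readers who want to see the difference, here is a rough Python sketch contrasting a single salted SHA-256 with scrypt, using only the standard library. The cost parameters (n=2**14, r=8, p=1) are commonly quoted example values and not a recommendation; the helper names are hypothetical, and `hashlib.scrypt` requires a Python build linked against a recent OpenSSL.

```python
import hashlib
import hmac
import os

def fast_salted_hash(password: bytes, salt: bytes) -> bytes:
    """One round of salted SHA-256: cheap for the server and for an attacker."""
    return hashlib.sha256(salt + password).digest()

def slow_password_hash(password: bytes, salt: bytes) -> bytes:
    """scrypt: deliberately expensive in both CPU time and memory per guess."""
    return hashlib.scrypt(password, salt=salt, n=2**14, r=8, p=1, dklen=32)

def verify(password: bytes, salt: bytes, stored: bytes) -> bool:
    """Recompute the hash and compare in constant time."""
    return hmac.compare_digest(slow_password_hash(password, salt), stored)

salt = os.urandom(16)                               # random per-user salt
stored = slow_password_hash(b"correct horse", salt)

print(verify(b"correct horse", salt, stored))
print(verify(b"wrong guess", salt, stored))
```

Both functions take a salt and produce a fixed-size digest; what changes is the cost of each guess.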

ithkuil · on April 24, 2014

bcrypt is still a hash, it even incorporates a salt. Of course sha256 is too fast to compute, bcrypt fixes that.

I feel that this whole thread is based on cargo-culting. There are plenty of (interesting) blog posts with misleading titles like "you shouldn't use hashes for passwords", which correctly explain why cryptographic hash functions are not well suited to hashing passwords.

However, the solution is to use another method to hash the password. But it's still a hash.

A key derivation function used to process a password, produces a hash if you intend to use it as a hash, i.e. to compare it with another hash. If you use it to encrypt something then that would be a key.

Even the scrypt paper (http://www.tarsnap.com/scrypt/scrypt.pdf) says:

"Password-based key derivation functions are used for two primary purposes: First, to hash passwords so that an attacker who gains access to a password file does not immediately possess the [..]" (emphasis mine)

jerf · on April 24, 2014

We say "don't just store salted hashes" because when people hear that, they think one run of SHA256 (or MD5 or whoknows) with a salt. You shouldn't do that. You should use a prepared method, because it's easy and (much more likely to be) correct. If you want to bodge together your own solution based on some large iterations of SHA256, you can, but it will take you much longer than just dropping in (b/s)crypt, and even longer if you have to match the (b/s)crypt feature set it provides out of the box. And you still have to face the fact that your SHA256 solution may not be as secure as you think because it's still a fast hash function, and an adversary may be able to process it much faster than you expect, whereas the (b/s)crypt have considered that in their design.

Pretty much by definition, advice provided to low-crypto-knowledge people can not depend on high levels of crypto knowledge. So, yes, pedantically you can point out that (b/s)crypt still produces a hash by the technical definition of hash. But you only muddy the waters for the low-knowledge people by doing so, and you probably shouldn't hold your breath waiting for plaudits from the high-knowledge people for doing so.

ithkuil · on April 24, 2014

Since the original topic was how to build a server-side index that lets users find relevant documents in an encrypted corpus without revealing the corpus to said server (e.g. [1]), I think that whoever diverts the thread into a discussion about the right function to use with passwords is the one being pedantic and missing the point.

Is it possible to build an index for an encrypted corpus that preserves the user's privacy even if the server is taken over?

I don't know ([2], [3]?). But I certainly know that the main weakness is not that it's impossible to hash terms in such a way that a weak machine cannot reverse significant portions of the index in minutes/days.

Seeing the word "hash" in this context clearly doesn't specify a particular hash function. Saying that using a salt fixes the issue with rainbow tables too doesn't specify whether we are talking about naive md5/sha salted hash or a KDF. These details were irrelevant to the discussion.

1. http://people.csail.mit.edu/akiezun/encrypted-search-report.... 2. http://rd.springer.com/chapter/10.1007%2F11496137_30 3. http://www.cs.ucla.edu/~rafail/PUBLIC/SSE.ppt

shkkmo · on May 5, 2014

Thanks for the info, I will take a look at those.

arghnoname · on April 24, 2014

There is a lot of research in this area. Here is a paper on how to search on encrypted data:

http://www.cs.berkeley.edu/~dawnsong/papers/se.pdf

One thing you need to be wary of is known-plaintext attacks. Even if you use a salt and a difficult hash, if I send you an e-mail whose contents I know and can then obtain the ciphertext or its digests, you are vulnerable. There are ways past this, of course, but it is one example of a valid attack.
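A small Python sketch of that attack, under two assumptions of mine: the index stores one deterministic token per term (here HMAC-SHA256 with a key the attacker never learns), and the attacker can see which tokens appear after a message is indexed. All names and sample messages are hypothetical.

```python
import hashlib
import hmac

def token(key: bytes, term: str) -> str:
    """Deterministic per-term token, keyed with the victim's secret."""
    return hmac.new(key, term.encode("utf-8"), hashlib.sha256).hexdigest()

def index_message(key: bytes, text: str) -> list:
    """Tokens the server would store for one message."""
    return [token(key, t) for t in text.split()]

victim_key = b"secret known only to the victim"

# The attacker mails the victim single-term messages and watches the index,
# learning the token for each planted term without ever seeing the key.
learned = {}
for term in ["wire", "transfer", "approved"]:
    observed = index_message(victim_key, term)
    learned[observed[0]] = term

# Later, tokens from an unrelated message can be partially decoded.
other = index_message(victim_key, "transfer scheduled friday")
leaked = [learned.get(t, "?") for t in other]
print(leaked)
```

The key is never exposed, yet every planted term becomes recognisable wherever it later appears, because the same term always maps to the same token.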

The pdf I referenced has what seems to be a pretty workable approach to me and lets you do hidden searches, boolean searches, phrases, proximity queries, etc.

read · on April 24, 2014

> this is almost certainly non-workable for some reason

Why is it necessary to have the ability to search past emails? Could there be any value in not looking back?

(It's ok to laugh at this.)