Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

So, I'd just made that suggestion myself a few moments ago.

That said, there are concerns with data aggregation, as patterns and trends become visible which aren't clear in small-sample or live-stream (that is, available in near-time to its creation) data. And the creators of corpora such as Twitter, Facebook YouTube, TikTok, etc., might well have reason to be concerned.

This isn't idle or uninformed. I've done data analysis in the past on what were for the time considered to be large datasets. I've been analyzing HN front-page activity for the past month or so, which is interesting. I've found it somewhat concerning when looking at individual user data, though, here being the submitter of front-page items. It's possible to look at patterns over time (who does and does not make submissions on specific days of the week?) or across sites (what accounts heavily contribute to specific website submissions?). In the latter case, I'd been told by someone (in the context of discussing my project) of an alt identity they have on HN, and could see that the alternate was also strongly represented among submitters of a specific site.

Yes, the information is public. Yes, anyone with a couple of days to burn downloading the front-page archive could do similar analysis. And yes, there's far more intrusive data analytics being done as we speak at vastly greater scale, precision, and insights. That doesn't make me any more comfortable taking a deep dive into that space.

It's one thing to be in public amongst throngs or a crowd, with incidental encounters leaving little trace. It's another to be followed, tracked, and recorded in minute detail, and more, for that to occur for large populations. Not a hypothetical, mind, but present-day reality.

The fact that incidental conversations and sharings of experiences are now centralised, recorded, analyzed, identified, and shared amongst myriad groups with a wide range of interests is a growing concern. The notion of "publishing" used to involve a very deliberate process of crafting and memoising a message, then distributing it through specific channels. Today, we publish our lives through incidental data smog, utterly without our awareness or involvement for the most part. And often in jurisdictions and societies with few or no protections, or regard for human and civil rights, let alone a strong personal privacy tradition.

As I've said many times in many variants of this discussion, scale matters, and present scale is utterly unprecedented.



This is a legitimate concern, but whether the people doing the analysis get the data via scraping vs. a box of hard drives is pretty irrelevant to it. To actually solve it you would need the data to not be public.

One of the things you could do is reduce the granularity. So instead of showing that someone posted at 1:23:45 PM on Saturday, July 1, 2023, you show that they posted the week of June 25, 2023. Then you're not going to be doing much time of day or day of week analysis because you don't have that anymore.


Yes, once the data are out there ... it's difficult to do much.

Though I've thought for quite some time that making the trade and transaction of such data illegal might help a lot.

Otherwise ... what I see many people falling into the trap of is thinking of their discussions amongst friends online as equivalent, say, to a discussion in a public space such as a park or cafe --- possibly overheard by bystanders, but not broadcast to the world.

In fact there is both a recording and distribution modality attached to online discussions that's utterly different to such spoken conversations, and those also give rise to the capability to aggregate and correlate information from many sources.

Socially, legally, psychologically, legislatively, and even technically, we're ill-equipped to deal with this.

Fuzzing and randomising data can help, but has been shown to be stubbornly prone to de-fuzzing and de-randomising, especially where it can be correlated to other signals, either unfuzzed or differently-fuzzed.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: