Yuck, this is going to really harm scientific research.
There is already a problem with papers built on falsified data/samples/etc., and LLMs being able to put out plausible papers is just going to make it worse.
On the bright side, maybe this will get the scientific community and science journalists to finally take reproducibility more seriously. I'd love to see future reporting that instead of saying "Research finds amazing chemical x which does y" you see "Researcher reproduces amazing results for chemical x which does y. First discovered by z".
In my mental model, the fundamental problem of reproducibility is that scientists have a very hard time finding a penny to fund such research. No one wants to grant “hey, I need $1m and 2 years to validate the paper from last year which looks suspicious”.
Until we can change how we fund science at the fundamental level, how we assign grants, it will indeed be a very hard problem to deal with.
Grad students don’t get to publish a thesis on reproduction. Everyone from the undergraduate research assistant to the tenured professor with research chairs is hyper-focused on “publishing” as many “positive results” on “novel” work as possible.
Prerequisite required by who, and why is that entity motivated to design such a requirement? Universities also want more novel breakthrough papers to boast about and to outshine other universities in the rankings. And if one is honest, other researchers also get more excited about new ideas than about a failed replication, which may fail for a thousand different reasons; the original authors will argue you did something wrong or evaluated them unfairly, and generally, publicly accusing other researchers of doing bad work won't help your career much. It's a small world: you'd be making enemies of people who will sit on your funding evaluation committees and hiring committees, and it generally leads to drama. Also, papers are superseded so fast that people don't even care that a no-longer-state-of-the-art paper may have been wrong; there are 5 newer ones that perform better and nobody uses the old one. I'm just stating how things actually are, not saying this is good, but when you say something "should" happen, think about who exactly is motivated to drive such a change.
> Prerequisite required by who, and why is that entity motivated to design such a requirement?
Grant awarding institutions like the NIH and NSF presumably? The NSF has as one of its functions, “to develop and encourage the pursuit of a national policy for the promotion of basic research and education in the sciences”. Encouraging the replication of research as part of graduate degree curricula seems to fall within bounds. And the government’s interest in science isn’t novelty per se, it’s the creation and dissemination of factually correct information that can be useful to its constituents.
The commenter I was replying to wanted it to be a prerequisite for a degree, not for a grant. Grant awarding institutions also have to justify their spending to other parts of government and/or parliament (specifically, politicians). Both politicians and the public want to see breakthrough results that have the potential to cure cancer and whatnot. They want to boast that their funding contributed to winning some big-name prize and so on. You have to think with the mind of specific people in specific positions and what makes them look good, what gets them praise, promotions and friends.
> And the government’s interest in science isn’t novelty per se, it’s the creation and dissemination of factually correct information that can be useful to its constituents.
I think Arxiv and similar could contribute positively by listing replications/falsifications, with credit to the validating authors. That would be enough of an incentive for aspiring researchers to start making a dent.
But that seems almost trivially solved. In software it's common to value independent verification - e.g. code review. Someone who is only focused on writing new code instead of careful testing, refactoring, or peer review is widely viewed as a shitty developer by their peers. Of course there's management to consider and that's where incentives are skewed, but we're talking about a different structure. Why wouldn't the following work?
A single university or even department could make this change - reproduction is the important work, reproduction is what earns a PhD. Or require some split, 20-50% novel work maybe is also expected. Now the incentives are changed. Potentially, this university develops a reputation for reliable research. Others may follow suit.
Presumably, there's a step in this process where money incentivizes the opposite of my suggestion, and I'm not familiar enough with the process to know where.
Is it the university itself which will be starved of resources if it's not pumping out novel (yet unreproducible) research?
> In software it's common to value independent verification - e.g. code review. Someone who is only focused on writing new code instead of careful testing, refactoring, or peer review is widely viewed as a shitty developer by their peers.
That is good practice.
It is rare, not common. Managers and funders pay for features.
Unreliable, insecure software sells very well, so making reliable, secure software is a "waste of money", generally.
> Presumably, there's a step in this process where money incentivizes the opposite of my suggestion, and I'm not familiar with the process to know which.
> Is it the university itself which will be starved of resources if it's not pumping out novel (yet unreproducible) research?
Researchers apply for grants to fund their research; the university is generally not paying for it and instead receives a cut of the grant money if it is awarded (i.e. the grant covers the costs to the university for providing the facilities to do the research). If a researcher could get funding to reproduce a result then they could absolutely do it, but that's not what funds are usually being handed out for.
Universities are not really motivated to slow down the research careers of their employees, on the contrary. They are very much interested in their employees making novel, highly cited publications and bringing in grants that those publications can lead to.
That... still requires funding. Even if your lab happens to have all the equipment required to replicate, you're paying the grad student for their time spent replicating the paper and you'll need to buy some supplies: chemicals, animal subjects, shared equipment time, etc.
We are in the comment section about an AI conference, and until the last few years material/hardware costs for computer science research were very cheap compared to other sciences like medicine, biology, etc., where they use bespoke instruments and materials. In CS, until very recently, all you needed was a good consumer PC for each grad student that lasted for many years. Nowadays GPU clusters are needed more, but funding is generally not keeping up with that, so even good university labs are way under-resourced on this front.
Enough people will falsify the replication and pocket the money, taking you back to where you were in the first place and poorer for it. The loss of trust is an existential problem for the USA.
Here's a work from last year which was plagiarized. The rare thing about this work is it was submitted to ICLR, which opened reviews for both rejected and accepted works.
You'll notice you can click on author names and get links to their various scholar pages, notably DBLP, which makes it easy to see how frequently authors publish with other specific authors.
Some of those authors have very high citation counts... in the thousands, with 3 having over 5k each (one with over 18k).
Not in most fields, unless misconduct is evident. (And what constitutes "misconduct" is cultural: if you have enough influence in a community, you can exert that influence on exactly where that definitional border lies.) Being wrong is not, and should not be, a career-ending move.
If we are aiming for quality, then being wrong absolutely should be. I would argue that is how it works in real life anyway. What we quibble over is what is the appropriate cutoff.
There's a big gulf between being wrong because you or a collaborator missed an uncontrolled confounding factor and falsifying or altering results. Science accepts that people sometimes make mistakes in their work because a) anyone can be expected to miss something eventually and b) a lot of work is done by people in training, in labs you're not directly in control of (collaborators). Researchers already aim for quality, and if you're consistently shown to be sloppy or incorrect when people try to use your work in their own, your reputation suffers.
The final bit is a thing I think most people miss when they think about replication. A lot of papers don't get replicated directly, but their measurements do when other researchers try to use that data to perform their own experiments, at least in the more physical sciences (this gets tougher the more human-centric the research is). You can't fake things or be wrong for long when you're writing papers about the properties of compounds and molecules. Someone is going to come try to base some new idea off your data and find out you're wrong when their experiment doesn't work (or spend months trying to figure out what's wrong and finally double-check the original data).
In fields like psychology, though, you can be wrong for decades. If your result is foundational enough, and other people have "replicated" it, then most researchers will toss out contradictory evidence as "guess those people were an unrepresentative sample". This can be extremely harmful when, for instance, the prevailing view is "this demographic are just perverts" or "most humans are selfish thieves at heart, held back by perceived social consensus" – both examples where researcher misconduct elevated baseless speculation to the position of "prevailing understanding", which led to bad policy, which had devastating impacts on people's lives.
(People are better about this in psychology, now: schoolchildren are taught about some of the more egregious cases, even before university, and individual researchers are much more willing to take a sceptical view of certain suspect classes of "prevailing understanding". The fact that even I, a non-psychologist, know about this, is good news. But what of the fields whose practitioners don't know they have this problem?)
Yeah, like I said, the soft validation by subsequent papers holds more in the baseline physical sciences because they involve fewer uncontrollable variables. That's why I mentioned the 'hard' sciences in my post; messy humans are messy and make science waaay harder.
Well, this is why the funniest and smartest way people commit fraud is faking studies that corroborate very careful collaborators' findings (who are collaborating with many people, to make sure their findings are replicated). That way, they get co-authorship on papers that check out, and nobody looks close enough to realize that they actually didn't do those studies and just photoshopped the figures to save time and money. Eliezer Masliah, btw. Ironically only works if you can be sure your collaborators are honest scientists, lol.
There is actually a ton of replication going on at any given moment, usually because we work off of each other's work, whether those others are internal or external. But, reporting anything basically destroys your career in the same way saying something about Weinstein before everyone's doing it does. So, most of us just default to having a mental list of people and circles we avoid as sketchy and deal with it the way women deal with creepy dudes in music scenes, and sometimes pay the troll toll. IMO, this is actually one of the reasons for recent increases in silo-ing, not just stuff being way more complicated recently; if you switch fields, you have to learn this stuff and pay your troll tolls all over again. Anyway, I have discovered or witnessed serious replication problems four times --
(1) An experiment I was setting up, using the same method both on a protein previously analyzed by the lab (as a control) and on some new ones, yielded consistently "wonky" results in both (read: a different method was needed, as additional interactions were implied that made the standard method inappropriate). I wasn't even in graduate school yet and was assumed to simply be doing shoddy work; after all, the previous work was done by a graduate student who is now faculty at Harvard, so clearly someone better trained and more capable. Well, I finally went through all of his poorly marked lab notebooks and got all of his raw data... his data had the same "wonkiness" as mine; he just presumably wanted to stick to that method and "fixed" it with extreme cherry-picking and selective reporting. Did the PI whose lab I was in publish a retraction or correction? No, it would be too embarrassing to everyone involved, so the bad numbers and data live on.
(2) A model or, let's say, "computational method" was calibrated on a relatively small, incomplete, and partially hypothetical data-set maybe 15 years ago, but, well, that was what people had. There are many other models that do a similar task, by the way, and no reason to use this one... except this one was produced by the lab I was in at the time. I was told to fold the results of this one into something I was working on and instead, when reevaluating it on the much larger data-set we have now, found it worked no better than chance. Any correction or mention of this outside the lab? No, and even in the lab the PI reacted extremely poorly and I was forced to run numerous additional experiments, which all showed the same thing: there was basically no context in which this model was useful. I found a different method worked better and subsequently had my former advisor "forget" (for the second time) to write and submit his portion of a fellowship he had previously told me to apply to. This model is still tweaked in still-useless ways and trotted out in front of the national body that funds a "core" grant that the PI basically uses as a slush fund, as a sign of the "core's" "computational abilities." One of the many reasons I ended up switching labs. The PI is an NAS member, by the way, who also auto-rejects certain PIs' papers and grants because "he just doesn't like their research" (i.e. they pissed him off in some arbitrary way), also flew out a member of the Swedish RAS and helped them get an American appointment seemingly in exchange for winning a sub-Nobel prize for research... they basically had nothing to do with, and also used to use various lab members as free labor on super random stuff for faculty who approved his grants, so you know the type.
(3) Well, here's a fun one with real stakes. Amyloid-β oligomers, a field already rife with fraud. A lab that supposedly has real ones kept "purifying" them for the lab involved in (2), only for the vial to arrive basically destroyed. This happened multiple times, leading them to blame the lab, then the shipping. Okay, whatever. They send raw material and tell people to follow a protocol carefully to make new ones. Various different people try, including people who are very, very careful with such methods and can make everything else. Nobody can make them. The answer is "well, you guys must suck at making them." Can anyone else get the protocol right? Well, not really... But, admittedly, someone did once get a different but similar protocol to work only under the influence of a strong magnetic field, so maybe there's something weird going on in their building that they actually don't know about and maybe they're being truthful. But, alternatively, they're coincidentally the only lab in the world that can make the super special sauce, and everybody else is just a shitty scientist. Does anyone really dig around? No, why would a PI doing what the PI does in (2) want to make an unnecessary enemy of someone just as powerful and potentially shitty? Predators don't like fighting.
(4) Another one that someone just couldn't replicate at all, poured four years into it, origin was a big lab. Same vibe as third case, "you guys must just suck at doing this," then "well, I can't get in contact with the graduate student who wrote the paper, they're now in consulting, and I can't find their data either." No retraction or public comment, too big of a name to complain about except maybe on PubPeer. Wasted an entire R21.
Funding is definitely a problem, but frankly reproduction is common. If you build off someone else's work (as is the norm) you need to reproduce first.
But without reproduction being impactful to your career, and with the pressure to quickly and constantly push new work, a failure to reproduce is generally treated as a reason to move on and tackle a different domain. It takes longer to trace the failure, and the bar is higher to counter an existing work. It's much more likely you've made a subtle mistake. It's much more likely the other work succeeded due to something subtle and unstated. It's much more likely the other work simply wasn't written in enough detail to be sufficiently reproduced.
I speak from experience too. I still remember in grad school I was failing to reproduce a work that was the main competitor to the work I had done (I needed to create comparisons). I emailed the author and got no response. Luckily my advisor knew the author's advisor and we got a meeting set up and I got the code. It didn't do what was claimed in the paper and the code structure wasn't what was described either. The result? My work didn't get published and we moved on. The other work was from a top 10 school and the choice was to burn a bridge and put a black mark on my reputation (from someone with far more merit and prestige) or move on.
That type of thing won't change under a reproduction system alone; it needs an open system, and an open reproduction system as well. Mistakes are common and we shouldn't punish them. The only way to solve these issues is openness.
I often think we should move from peer review as "certification" to peer review as "triage", with replication determining how much trust and downstream weight a result earns over time.
> I'd love to see future reporting that instead of saying "Research finds amazing chemical x which does y" you see "Researcher reproduces amazing results for chemical x which does y. First discovered by z".
Most people (that I talk to, at least) in science agree that there's a reproducibility crisis. The challenge is there really isn't a good way to incentivize that work.
Fundamentally (unless you're independently wealthy and funding your own work), you have to measure productivity somehow, whether you're at a university, government lab, or in the private sector. That turns out to be very hard to do.
If you measure raw number of papers (more common in developing countries and low-tier universities), you incentivize a flood of junk. Some of it is good, but there is such a tidal wave of shit that most people write off your work as a heuristic based on the other people in your cohort.
So, instead it's more common to try to incorporate how "good" a paper is, to reward people with a high quantity of "good" papers. That means quantifying something subjective, though, so you might use something like citation count as a proxy: if a work is impactful, it usually gets cited a lot. Eventually you may arrive at something like the H-index, which is defined as the largest number H such that you have written H papers with at least H citations. Now, the trouble with this method is people won't want to "waste" their time on incremental work.
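For concreteness, here's a minimal sketch of that definition in code (illustrative only; real indexing services differ in which citations they count, not in this core rule):

    # Minimal sketch: compute an H-index from per-paper citation counts.
    def h_index(citations):
        # Rank papers by citations, descending; the H-index is the largest
        # rank h at which the h-th paper still has at least h citations.
        ranked = sorted(citations, reverse=True)
        h = 0
        for rank, cites in enumerate(ranked, start=1):
            if cites >= rank:
                h = rank
            else:
                break
        return h

    print(h_index([10, 8, 5, 4, 3]))  # -> 4
    print(h_index([100, 2, 1, 0]))    # -> 2: one blockbuster paper barely moves it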
And that's the struggle here; even if we funded and rewarded people for reproducing results, they will always be bumping up the citation count of the original discoverer. But it's worse than that, because literally nobody is going to cite your work. In 10 years, they just see the original paper, a few citing works reproducing it, and to save time they'll just cite the original paper only.
There's clearly a problem with how we incentivize scientific work. And clearly we want to be in a world where people test reproducibility. However, it's very very hard to get there when one's prestige and livelihood is directly tied to discovery rather than reproducibility.
I'd personally like to see top conferences grow a "reproducibility" track. Each submission would be a short tech report that chooses some other paper to re-implement. Cap 'em at three pages, have a lightweight review process. Maybe there could be artifacts (git repositories, etc) that accompany each submission.
This would especially help newer grad students learn how to begin to do this sort of research.
Maybe doing enough reproductions could unlock incentives. Like if you do 5 reproductions then the AC would assign your next paper double the reviewers. Or, more invasively, maybe you can't submit to the conference until you complete some reproduction.
The problem is that reproducing something is really, really hard! Even if something doesn't reproduce in one experiment, it might be due to slight changes in some variables we don't even think about. There are some ways to circumvent this (e.g. the team being reproduced cooperating with the reproducing team and agreeing on which variables are important for the experiment and which are not), but it's really hard. The solutions you propose will unfortunately incentivize bad reproductions, and we might reject theories that are actually true because of that. I think one of the best ways to fight the crisis is to actually improve the quality of science: articles where authors refuse to share their data should be automatically rejected. We should also move towards requiring preregistration with strict protocols for almost all studies.
Yeah, this feels like another reincarnation of the ancient "who watches the watchmen?" problem [1]. Time and time again we see that the incentives _really really_ matter when facing this problem; subtle changes can produce entirely new problems.
That's fine! The tech report should talk about what the researchers tried and what didn't work. I think submissions to the reproducibility track shouldn't necessarily have to be positive to be accepted, and conversely, I don't think the presence of a negative reproduction should necessarily impact an author's career negatively.
And that's true! It doesn't make sense to spend a lot of resources on reproducing things when there is low hanging fruit of just requiring better research in the first place.
Is it time for some sort of alternate degree to a PhD beyond a Master's? Showing, essentially, "this person can learn, implement, validate, and analyze the state of the art in this field"?
That's what we call a Staff-level engineer. Proven ability to learn, implement and validate is basically the "it factor" businesses are looking for.
If you are thinking about this from an academic angle then sure, it sounds weird to say "Two Staff jobs in a row from the University of LinkedIn" as a degree. But I submit this as basically the certificate you desire.
No, this is not at all being a staff engineer. One is about delivering high-impact projects toward a business's needs, with all the soft/political things that involves, and the other is about implementing and validating cutting-edge research, with all the deep academic and technical knowledge and work that that involves. They're incredibly different skillsets, and many people doing one would easily fail in the other.
> The challenge is there really isn't a good way to incentivize that work.
What if we got Undergrads (with hope of graduate studies) to do it? Could be a great way to train them on the skills required for research without the pressure of it also being novel?
Those undergrads still need to be advised and they use lab resources.
If you're a tenure-track academic, your livelihood is much safer from having them try new ideas (that you will be the corresponding author on, increasing your prestige and ability to procure funding) instead of incrementing.
And if you already have tenure, maybe you have the undergrad do just that. But the tenure process heavily filters for ambitious researchers, so it's unlikely this would be a priority.
If instead you did it as coursework, you could get them to maybe reproduce the work, but if you only have the students for a semester, that's not enough time to write up the paper and make it through peer review (which can take months between iterations)
Unfortunately, that might just lead to a bunch of type II errors instead, if an effect requires very precise experimental conditions that undergrads lack the expertise for.
Could it be useful as a first line of defence? A failed initial reproduction would not be seen as disqualifying, but it would bring the paper to the attention of more senior people who could try to reproduce it themselves. (Maybe they still wouldn't bother, but hopefully they'd at least be more likely to.)
Most interesting results are not so simple to recreate that we could reliably expect undergrads to perform the replication, even if we ignore the cost of the equipment and consumables that replication would need and the time/supervision required to walk them through the process.
> Eventually you may arrive at something like the H-index, which is defined as the largest number H such that you have written H papers with at least H citations.
It's the Google search algorithm all over again. And it's the certificate trust hierarchy all over again. We keep working on the same problems.
Like the two cases I mentioned, this is a matter of making adjustments until you have the desired result. Never perfect, always improving (well, we hope). This means we need fluidity in the rules and heuristics. How do we best get that?
I'm delighted to inform you that I have reproduced every patent-worthy finding of every major research group active in my field in the past 10 years. You can check my data, which is exactly as theory predicts (subject to some noise consistent with experimental error). I accept payment in cash.
Patent revenue is mostly irrelevant, as it's too unpredictable and typically decades in the future. Academics rarely do research that can be expected to produce economic value in the next 10–20 years, because industry can easily outspend academia on such topics.
Most papers generate zero patent revenue, or don't even lead to patents at all. For major drugs maybe that works, but we already have clinical trials before a drug goes to market that validate its efficacy.
> I'd love to see future reporting that instead of saying "Research finds amazing chemical x which does y" you see "Researcher reproduces amazing results for chemical x which does y. First discovered by z".
usually you reproduce previous research as a byproduct of doing something novel "on top" of the previous result. I don't really see the problem with the current setup.
sometimes you can just do something new and assume the previous result, but that's more the exception. you're almost always going to at least in part reproduce the previous one. and if issues come up, it's often evident.
that's why citations work as a good proxy: X number of people have done work based around this finding and nobody has seen a clear problem.
there's a problem of people fabricating and fudging data and not making their raw data available ("on request" or with not enough metadata to be useful), which wastes everyone's time and almost never leads to negative consequences for the authors.
It's often quite common to see a citation say "BTW, we weren't able to reproduce X's numbers, but we got fairly close number Y, so Table 1 includes that one next to an asterisk."
The difficult part is surfacing that information to readers of the original paper. The semantic scholar people are beginning to do some work in this area.
yeah that's a good point. the citation might actually be pointing out a problem and not be a point in favor. it's a slog to figure out... but seems like the exact type of problem an LLM could handle
give it a published paper and it runs through the papers that have cited it and gives you an evaluation
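to be concrete, a rough sketch of what that pipeline could look like (fetch_citing and ask_llm are hypothetical stand-ins, not real APIs; the first would wrap a citation index like Semantic Scholar or OpenAlex, the second whatever model you use):

    # Sketch only: classify how each citing paper treats the cited result.
    def evaluate_reception(paper_id, fetch_citing, ask_llm):
        verdicts = []
        for citing in fetch_citing(paper_id):  # hypothetical: yields dicts with title/full_text
            prompt = (
                "Does the following citing paper confirm, contradict, or merely "
                "mention the cited result? Quote the relevant passage.\n\n"
                + citing["full_text"]
            )
            verdicts.append((citing["title"], ask_llm(prompt)))  # hypothetical LLM call
        return verdicts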
That feels arbitrary as a measure of quality. Why isn't new research simply devalued and replication valued higher?
"Dr Alice failed to reproduce 20 would-be headline-grabbing papers, preventing them from sucking all the air out of the room in cancer research" is something laudable, but we're not lauding it.
No, you do not have to. You give people with the skills and interest in doing research the money. You need to ensure it's spent correctly, that is all. People will be motivated by wanting to build a reputation and the intrinsic reward of the work.
If we did that, CERN could not publish, because nobody else has the capabilities they do. Do we really want to punish CERN (which has a good track record of scientific integrity) because their work can't be reproduced? I think the model in many of these cases is that the lab publishing has to allow some number of postdocs or competitor labs to come to their lab and work on reproducing it in-house with the same reagents (biological experiments are remarkably fragile).
AFAIK, no, but I could see there being cause to push citations to also cite the validations. It'd be good if standard practice turned into something like
Paper A, by bob, bill, brad. Validated by Paper B by carol, clare, charlotte.
Academics typically use citation count and popularity as a rough proxy for validation. It's certainly not perfect, but it is something that people think about. Semantic Scholar in particular is doing great work in this area, making it easy to see who cites who: https://www.semanticscholar.org/
That is a factor most people miss when thinking about the replication crisis. For the harder physical sciences a wrong paper will fairly quickly be found because as people go to expand on the ideas/use that data and get results that don't match the model informed by paper X they're going to eventually figure out that X is wrong. There might be issues with getting incentives to write and publish that negative result but each paper where the results of a previous paper are actually used in the new paper is a form of replication.
I am still reviewing papers that propose solutions based on a technique X, conveniently ignoring research from two years ago that shows that X cannot be used on its own. Both the paper I reviewed and the research showing X cannot be used are in the same venue!
IMHO, it's mostly ignorance coming from a push/drive to "publish or perish." When the stakes are so high and output is so valued, and when reproducibility isn't required, it disincentivizes thorough work. The system is set up in a way that is making it fail.
There is also the reality that "one paper" or "one study" can be found contradicting almost anything, so if you always deferred to "some other paper/study debunks my premise" then you'd end up producing nothing. Plus many insiders know that there's a lot of slop out there that gets published, so they can (sometimes reasonably, IMHO) dismiss that "one paper" even when they do know about it.
It's (mostly) not fraud or malicious intent or ignorance, it's (mostly) humans existing in the system in which they must live.
However, given the feedback by other reviewers, I was the only one who knew that X doesn’t work. I am not sure how these people mark themselves as “experts” in the field if they are not following the literature themselves.
Reproducibility is overrated and if you could wave a wand to make all papers reproducible tomorrow, it wouldn't fix the problem. It might even make it worse.
? More samples reduce the variance of a statistic. Obviously that cannot identify systematic bias in a model, or establish causality, or make a "bad" question "good". It's not overrated though -- it would strengthen or weaken the case for many papers.
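To put a number on that: for n independent measurements of the same quantity with per-measurement variance sigma^2, the sample mean satisfies (assuming independence and no systematic bias, which is exactly what replication alone can't fix)

    \operatorname{Var}(\bar{X}_n) = \frac{\sigma^2}{n},
    \qquad
    \operatorname{SE}(\bar{X}_n) = \frac{\sigma}{\sqrt{n}}

so pooling an independent replication with the original data effectively increases n and tightens the estimate, while doing nothing about bias in how the question was asked.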
If you have a strong grip on exactly what it means, sure, but look at any HN thread on the topic of fraud in science. People think replication = validity because it's been described as the replication crisis for the last 15 years. And that's the best case!
Funding replication studies in the current environment would just lead to lots of invalid papers being promoted as "fully replicated" and people would be fooled even harder than they already are. There's got to be a fix for the underlying quality issues before replication becomes the next best thing to do.
> look at any HN thread on the topic of fraud in science.
HN is very tedious/lazy when it comes to science criticism -- very much agree with you on this.
My only point is replication is necessary to establish validity, even if it is not sufficient. Whether it gives a scientist a false sense of security doesn't change the math of sampling.
I also agree with you on quality issues. I think alternative investment strategies (other than project grants) would be a useful step for reducing perverse incentives, for example. But there's a lot of things science could do.
while i agree that "reproducibility is overrated", i went ahead and read your medium post. my feedback to you is, my summary of that writing: "mike_hearn's take on policy-adjacent writing conducted by public health officials and published in journals that interacted with mike_hearn's valid and common but nonetheless subjective political dispute about COVID-19."
i don't know how any of that writing generalizes to other parts of academic research. i mean, i know that you say it does, but i don't think it does. what exactly do you think most academic research institutions and the federal government spend money on? for example, wet lab research. you don't know anything about wet lab research. i think if you took a look at a typical e.g. basic science in immunology paper, built on top of mouse models, you would literally lose track of any of its meaning after the first paragraph, you would feed it into chatgpt, and you would struggle to understand the topic well enough to read another immunology paper, you would have an immense challenge talking about it with a researcher in the field. it would take weeks of reading. you have no medicine background, so you wouldn't understand the long horizon context of any of it. you wouldn't be able to "chatbot" your way into it, it would be a real education. so after all of that, would you still be able to write the conclusion you wrote in the medium post? i don't think so, because you would see that by many measures, you cannot generalize a froo-froo policy between "subjective political dispute about COVID-19" writing and wet lab research. you'd gain the wisdom to see that they're different things, and you lack the background, and you'd be much more narrow in what you'd say.
it doesn't even have to be in the particulars, it's just about wisdom. that is my feedback. you are at once saying that there is greater wisdom to be had in the organization and conduct of research, and then, you go and make the highly low wisdom move to generalize about all academic research. which you are obviously doing not because it makes sense to, you're a smart guy. but because you have some unknown beef with "academics" that stems from anger about valid, common but nonetheless subjective political disputes about COVID-19.
Thanks for reading it, or scan reading it maybe. Of the 18 papers discussed in the essay here's what they're about in order:
- Alzheimer's
- Cancer
- Alzheimer's
- Skin lesions (first paper discussed in the linked blog post)
- Epidemiology (COVID)
- Epidemiology (COVID, foot and mouth disease, Zika)
- Misinformation/bot studies
- More misinformation/bot studies
- Archaeology/history
- PCR testing (in general, discussion opens with testing of whooping cough)
- Psychology, twice (assuming you count "men would like to be more muscular" as a psych claim)
- Misinformation studies
- COVID (the highlighted errors in the paper are objective, not subjective)
- COVID (the highlighted errors are software bugs, i.e. objective)
- COVID (a fake replication report that didn't successfully replicate anything)
- Public health (from 2010)
- Social science
Your summary of this as being about a "valid and common but subjective political dispute" I don't agree is accurate. There's no politics involved in any of these discussions or problems, just bad science.
Immunology has the same issues as most other medical fields. Sure, there's also fraud that requires genuinely deep expertise to find, but there's plenty that doesn't. Here's a random immunology paper from a few days ago identified as having image duplications, Photoshopping of western blots, numerous irrelevant citations and weird sentence breaks all suggestive that the paper might have been entirely faked or at least partly generated by AI: https://pubpeer.com/publications/FE6C57F66429DE2A9B88FD245DD...
The authors reply, claiming the problems are just rank incompetence, and each time someone finds yet another problem with the paper leading to yet another apology and proclamation of incompetence. It's just another day on PubPeer, nothing special about this paper. I plucked it off the front page. Zero wet lab experience is needed to understand why the exact same image being presented as two different things in two different papers is a problem.
And as for other fields, they're often extremely shallow. I actually am an expert in bot detection but that doesn't help at all in detecting validity errors in social science papers, because they do things like define a bot as anyone who tweets five times after midnight from a smartphone. A 10 year old could notice that this isn't true.
Part of replication is skeptical review. That's also part of the scientific method. If we're not doing thorough review and replication, we're not really doing science. It's a lot of faith in people incentivized to do sloppy or dishonest work.
Edit: I just read your article linked upthread. It was really good. I don't think we disagree, except I say we need to attempt the steps of science wherever sensible, and there are human/political problems trying to corrupt them. I try to separately address those by changing hearts with the Gospel of Jesus Christ. (Cuz self-interest won't fix science.)
So, we need the replications. We also need to address whatever issues would pop up with them.
Yes, that's true. In theory, by the time it gets to the replication stage a paper has already been reviewed. In practice a replication is often the first time a paper is examined adversarially. There might be a useful form of hybrid here, like paying professional skeptics to review papers. The peer review concept academia works on is a very naive setup of the sort you'd expect given the prevailing ideology ("from each according to their ability, to each according to their needs"). Paying professionals to do it would be a good start, but only if there are consequences to a failed review, which there just aren't today.
"There might be a useful form of hybrid here, like paying professional skeptics to review papers."
This is how the scientific method is described. It's what much of the public thinks their money is paying for. So, I'm definitely for doing it for real or not calling it science.
Even the amount of review I saw you do on papers on your blog seems to exceed what much peer review is doing. So, how can we treat things as science if they aren't even meeting that standard, much less replication?
Yeah, spot on. If all we do is add more plausible sounding text on top of already fragile review and incentive structures, that really could make things worse rather than better
Your second point is the important one. AI may be the thing that finally forces the community to take reproducibility, attribution, and verification seriously. That's very much the motivation behind projects like Liberata, which try to shift publishing away from novelty-first narratives and toward explicit credit for replication, verification, and follow-through. If that cultural shift happens, this moment might end up being a painful but necessary correction.
If there is one thing scientific reports must require, it is not using AI to produce the documentation. AI can be applied to the data, but not to the source or anything else. AI is a tool, not a replacement for actual work.
Reading the article, this is about CITATIONS which are trivially verifiable.
This is just article publishers not doing the most basic verification and failing to notice that the citations in the article don't exist.
What this should trigger is a black mark for all of the authors and their institutions, both of which should receive significant reputational repercussions for publishing fake information. If they fake the easiest to verify information (does the cited work exist) what else are they faking?
> LLMs being able to put out plausible papers is just going to make it worse
If correct form (LaTeX two-column formatting, quoting the right papers and authors of the year etc.) has been allowing otherwise reject-worthy papers to slip through peer review, academia arguably has bigger problems than LLMs.
Correct form and relevant citations have been, for generations up to a couple of years ago, mighty strong signals that a work is good and done by a serious and reliable author. This is no longer the case and we are worse off for it.
I'd need to see the same scrutiny applied to pre-AI papers. If a field has a poor replication rate, meaning there's a good chance that a given published paper is just so much junk science, is that better or worse than letting AI hallucinate the data in the first place?
I think, at least I hope, that part of the LLM value will be in creating tools tailored to specific needs. Instead of asking it to solve any problem, restrict the space to a tool that can help you reach your goal faster, without the statistical nature of LLMs.
I heard that most papers in a given field are already not adding any value. (Maybe it depends on the field though.)
There seems to be a rule in every field that "99% of everything is crap." I guess AI adds a few more nines to the end of that.
The gems are lost in a sea of slop.
So I see useless output (e.g. crap on the app store) as having negative value, because it takes up time and space and energy that could have been spent on something good.
My point with all this is that it's not a new problem. It's always been about curation. But curation doesn't scale. It already didn't. I don't know what the answer to that looks like.
I've long argued for this, as reproduction is the cornerstone of science. There are a lot of potential ways to do this, but one that I like is linking to the original work. Suppose you're looking at the OpenReview page and it has a link for "reproduction efforts", with at minimum an annotation for confirmation or failure.
This is incredibly helpful to the community as a whole. Reproduction failures can be valuable even when the original work has no fraud. In those cases a reproduction failure reveals important information about the necessary conditions that the original work relies on.
But honestly, we'll never get this until we drop the entire notion of "novel" or "impact" and "publish or perish". Novel is in the eye of the reviewer, and the lower the reviewer's expertise the less novel a work seems (nothing is novel at a high enough level). Impact can almost never be determined a priori, and when it can, you already have people chasing those directions because why the fuck would they not? But publish or perish is the biggest sin. It's one of those ideas that looks nice on paper, like you are meaningfully determining who is working hard and who is hardly working. But the truth is that you can't tell without being in the weeds. The real result is that this stifles creativity, novelty, and impact, as it forces researchers to chase lower-hanging fruit: things you're certain will work and can get published. It creates a negative feedback loop as we compete: "X publishes 5 papers a year, why can't you?" I've heard these words even when X has far fewer citations (each of my papers had "more impact").
Frankly, I believe fraud would dramatically reduce were researchers not risking job security. The fraud is incentivized by the cutthroat system where you're constantly trying to defend your job, your work, and your grants. There'll always be some fraud, but (with a few exceptions) researchers aren't rockstar millionaires; it takes a lot of work to get to the point where fraud even works, so there's a natural filter.
I have the same advice as Mervin Kelly, former director of Bell Labs: