
I agree, it's not fraud, strictly speaking.

However, to create a "predictive model", and then not check it on fresh data, is so staggeringly inept as to be reckless.

He should have known better, and the zillions of folks who parroted the results really should have read the study carefully.



It is a bit disappointing. At my university, we were taught about this fallacy in the mandatory introduction-to-science course. I remember a beautiful analogy which went something like:

As long as both your sample data and your test data are collected between January and May, you can make a model that accurately predicts the time the sun sets based on the length of your hair.
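For concreteness, here's a minimal, entirely made-up simulation of that analogy in Python (the numbers and curves are invented for illustration, not real measurements): both hair length and sunset time trend upward from January to May, so a line fit and "validated" only on that window looks like a great predictor, and then fails badly on the second half of the year.

    import numpy as np

    rng = np.random.default_rng(0)
    day = np.arange(365)

    # Sunset time (minutes after noon), a crude sinusoid peaking near the
    # June solstice -- a stand-in for a real ephemeris, not actual data.
    sunset = 420 + 90 * np.sin(2 * np.pi * (day - 80) / 365) + rng.normal(0, 3, day.size)

    # Hair length in cm, growing steadily all year.
    hair = 10 + 0.04 * day + rng.normal(0, 0.2, day.size)

    train = day < 150    # roughly January through May
    test = day >= 180    # roughly July through December

    # Fit a straight line sunset ~ hair on the spring data only.
    slope, intercept = np.polyfit(hair[train], sunset[train], 1)
    pred = slope * hair + intercept

    def rmse(mask):
        return np.sqrt(np.mean((pred[mask] - sunset[mask]) ** 2))

    print(f"in-sample RMSE (Jan-May):     {rmse(train):.1f} minutes")
    print(f"out-of-sample RMSE (Jul-Dec): {rmse(test):.1f} minutes")
    # The spring-only fit looks tight; the second half of the year is off
    # by hours, because hair length never had anything to do with sunsets.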


The evidence would abundantly suggest that merely being exposed to this idea in a course is not sufficient to guarantee that the resulting scientist will have any clue.

I see two problems here: First, by the time you get to that level of schooling, you've developed a "resistance" (for lack of a better term) to schooling. That is, you've long since internalized the fact that there are two worlds, the one described in school and the real world, and the overlap between the two is tenuous at best. Merely being told about this stuff in the school world isn't enough; you somehow need it to penetrate down to the "real" world. Computer scientists get a bit of an advantage here in that after being told about this, their homework assignment will generally be to implement the situations described and verify the results. Computer courses are rather good at bridging this gap; most disciplines lack the equivalent, and a shocking amount of basic stuff is safely locked away in the "schooling" portion of the world for most people.

This is a result of the second part of the problem, which is that the teachers themselves help create this environment. I went the computer science grad school route for this stuff, where not only do they show you the fundamental math, they make you actually implement it (like I said above). I got to watch my wife go through the biology stats course, and they don't do that. At the time I didn't notice how bad it was, but the stats courses they do are terrible; they teach a variety of "leaf node" techniques (that is, the final results of the other math), but they never teach the actual prerequisites for using the techniques, and in the end it is nothing but voodoo. Students who consign this stuff to "in the school world but not real" are, arguably, being more rational than those who don't; things like confidence intervals and p-values are only applicable in Gaussian domains, and I know they say that at some point, but they don't ever cover anything else from what I can see. By never connecting it to the real world (very well), and indeed treating it in the courses as a bit of voodoo ("wave this statistical chicken over the data and it will squawk the answer, which you will then write in your paper"), they contribute to the division between school and real.
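To make the underlying point concrete (my own illustration, not from the thread): the textbook 95% t-interval for a mean is derived assuming the data are roughly Gaussian. A quick simulation shows it holding close to its nominal coverage on Gaussian samples and quietly under-covering on strongly skewed samples at small n, even though the recipe "runs" just fine in both cases.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n, trials = 10, 20_000
    tcrit = stats.t.ppf(0.975, df=n - 1)

    def coverage(sampler, true_mean):
        """Fraction of simulated 95% t-intervals that contain the true mean."""
        hits = 0
        for _ in range(trials):
            x = sampler(n)
            half = tcrit * x.std(ddof=1) / np.sqrt(n)
            hits += abs(x.mean() - true_mean) <= half
        return hits / trials

    print("Gaussian data: ", coverage(lambda k: rng.normal(5, 2, k), 5.0))
    print("Lognormal data:", coverage(lambda k: rng.lognormal(0.0, 1.5, k),
                                      np.exp(1.5 ** 2 / 2)))
    # Nominal coverage is 0.95 in both cases; only the Gaussian case gets
    # close, which is the kind of caveat a cookbook stats course glosses over.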

So, I'd say two things: Those who say this stuff is taught in "any college stats course" are probably literally wrong, in that it is possible to take a lot of stats without ever really covering this. And secondly, even if it is covered, it ends up categorized away unless great efforts are taken to "make it real" for the student, which, short of forcing implementation on the students (impractical in general), I'm not sure how to do. Most scientists seem to come out of PhD programs with a very thin understanding of this stuff.



