The problem is that deep net classifiers in general are not well statistically c...

mumblemumble · on Oct 25, 2024

I'm not an expert, either, but I've poked at this a little. From what I've seen, token logprobs are correlated enough with correctness of the answer to serve as a useful signal at scale, but it's a weak enough correlation that it probably isn't great for evaluating any single output.

My best guess is that somewhere close to the root of the problem is that language models still don't really distinguish syntagmatic and paradigmatic relationships. The examples in this article are a little bit forced in that respect because the alternatives it shows in the illustrations are all paradigmatic alternatives but roughly equivalent from a syntax perspective.

This might relate to why, within a given GPT model generation, the earlier versions with more parameters tend to be more prone to hallucination than the newer, smaller, more distilled ones. At least for the old non-context-aware language models (the last time I really spent any serious time digging deep into language models), it was definitely the case that models with more parameters would tend to latch onto syntagmatic information so firmly that it could kind of "overwhelm" the fidelity of representation of semantics. Kind of like a special case of overfitting just for language models.

singularity2001 · on Oct 26, 2024

maybe this signal needs to be learned in the final step of reinforcement learning where people decide whether "I don't know" is the right answer

trq_ · on Oct 25, 2024

I want to build intuition on this by building a logit visualizer for OpenAI outputs. But from what I've seen so far, you can often trace down a hallucination.

Here's an example of someone doing that for 9.9 > 9.11: https://x.com/mengk20/status/1849213929924513905

z3t4 · on Oct 25, 2024

I'm thinking versioning. 9.9, 9.10, 9.11 etc because in my native language we use the comma, for decimal separation 9,11 9,22 9,90

modeless · on Oct 25, 2024

My understanding is that base models are reasonably well calibrated but the RLHF and other tuning that turns them into chat assistants screws up the calibration.

scottmf · on Oct 25, 2024

There’s much that is lost but imo gpt-4-base would be borderline unusable for most of us compared to its descendants — perhaps even more so than GPT-3 davinci, at least relative to its time.

4 can be an absolute demonic hallucinating machine.