The problem is that deep net classifiers in general are not statistically well calibrated by default. While the entropy of the output distribution is often high when they are "not sure", models can also very often be "confidently wrong", so using the entropy of the logits as an indicator of confidence can easily be very misleading.
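To make that concrete, here's a rough sketch (plain numpy, nothing model-specific) of the quantity I mean: softmax the final-layer logits and compute the Shannon entropy of the resulting distribution. The logit values below are made up for illustration.

    import numpy as np

    def softmax(logits):
        z = logits - logits.max()            # subtract max for numerical stability
        p = np.exp(z)
        return p / p.sum()

    def entropy_from_logits(logits):
        # Shannon entropy of the softmax distribution over the logits.
        p = softmax(np.asarray(logits, dtype=float))
        p = p[p > 0]                         # drop zeros to avoid log(0)
        return float(-(p * np.log(p)).sum())

    print(entropy_from_logits([8.0, 1.0, 0.5]))       # low entropy: looks "confident"
    print(entropy_from_logits([1.2, 1.1, 1.0, 0.9]))  # high entropy: looks "unsure"

The catch is exactly the one above: a low number here only tells you the model is confident, not that it is right.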
I'm not an expert in LLMs though, this is just my understanding of classifiers in general. Maybe with enough data this consideration no longer applies? I'd be interested to know.
I'm not an expert, either, but I've poked at this a little. From what I've seen, token logprobs are correlated enough with correctness of the answer to serve as a useful signal at scale, but it's a weak enough correlation that it probably isn't great for evaluating any single output.
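For what it's worth, the way I checked this was roughly the following: score each answer by its mean token logprob, label it correct or not, and see how well the score ranks correct answers above incorrect ones (a brute-force AUC). The numbers below are made-up placeholders, not my actual data.

    import numpy as np

    # Per-answer scores (mean token logprob) and 0/1 correctness labels.
    mean_logprobs = np.array([-0.12, -0.45, -0.70, -0.90, -0.30, -0.65])
    correct       = np.array([1,     0,     1,     0,     1,     0])

    # ROC-AUC by brute force: fraction of (correct, incorrect) pairs in which
    # the correct answer received the higher mean logprob.
    pos = mean_logprobs[correct == 1]
    neg = mean_logprobs[correct == 0]
    auc = np.mean([p > n for p in pos for n in neg])
    print(f"AUC of mean logprob vs correctness: {auc:.2f}")

An AUC comfortably above 0.5 but well below 1.0 is what I mean by "useful at scale, not great for any single output".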
My best guess is that something close to the root of the problem is that language models still don't really distinguish between syntagmatic relationships (how words combine in sequence) and paradigmatic ones (which words can substitute for each other in a given slot). The examples in this article are a little forced in that respect, because the alternatives shown in the illustrations are all paradigmatic alternatives that are roughly equivalent from a syntax perspective.
This might relate to why, within a given GPT model generation, the earlier versions with more parameters tend to be more prone to hallucination than the newer, smaller, more distilled ones. At least for the old non-context-aware language models (the last time I spent any serious time digging deep into language models), it was definitely the case that models with more parameters would latch onto syntagmatic information so firmly that it could "overwhelm" the fidelity with which semantics were represented. Kind of like a special case of overfitting specific to language models.
I want to build intuition on this by building a logit visualizer for OpenAI outputs. From what I've seen so far, you can often trace a hallucination back through the logits.
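The raw material for the visualizer is easy enough to get; something like this pulls per-token logprobs and the top alternatives from the chat completions API (assumes the openai v1.x Python SDK and an API key in the environment; the model name and prompt are just placeholders):

    import math
    from openai import OpenAI

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Who wrote The Name of the Rose?"}],
        logprobs=True,
        top_logprobs=5,
    )

    # Each generated token comes back with its logprob and the top alternatives,
    # which is exactly what you'd feed into a visualizer.
    for tok in resp.choices[0].logprobs.content:
        print(f"{tok.token!r:<15} p={math.exp(tok.logprob):.3f}")
        for alt in tok.top_logprobs:
            print(f"    alt {alt.token!r:<12} p={math.exp(alt.logprob):.3f}")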
My understanding is that base models are reasonably well calibrated but the RLHF and other tuning that turns them into chat assistants screws up the calibration.
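Concretely, "well calibrated" means that when the model assigns 80% confidence it is right about 80% of the time. A standard way to quantify the gap is expected calibration error (ECE); here's a toy sketch with invented confidence/correctness pairs:

    import numpy as np

    def ece(confidences, correct, n_bins=10):
        # Bucket predictions by confidence, then take the weighted average of
        # |mean confidence - empirical accuracy| across buckets.
        confidences = np.asarray(confidences, dtype=float)
        correct = np.asarray(correct, dtype=float)
        bins = np.linspace(0.0, 1.0, n_bins + 1)
        total = 0.0
        for lo, hi in zip(bins[:-1], bins[1:]):
            mask = (confidences > lo) & (confidences <= hi)
            if mask.any():
                gap = abs(confidences[mask].mean() - correct[mask].mean())
                total += mask.mean() * gap
        return total

    conf = [0.95, 0.90, 0.85, 0.60, 0.55, 0.30]   # toy confidences
    hit  = [1,    1,    0,    1,    0,    0]      # toy correctness labels
    print(f"ECE: {ece(conf, hit):.3f}")

The claim, as I understand it, is that this number is noticeably worse after RLHF-style tuning than for the underlying base model.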
There's much that is lost, but imo gpt-4-base would be borderline unusable for most of us compared to its descendants, perhaps even more so than GPT-3 davinci was, at least relative to its time.
GPT-4 can be an absolute demonic hallucinating machine.