That's what pruning is, but it's not that straightforward and it has limits. Fine-tuning a smaller model on the output of a larger one is much more flexible and reliable.
GPT-3.5 is probably a 13B Curie fine-tuned on the output of the full-size 175B GPT-3, to give you an idea of the technique.
That's smaller than the third-smallest StableLM and the same size as LLaMA-13B, which can run at useful speeds on a smartphone CPU.
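Fine-tuning on a larger model's generated text is the sequence-level variant of this; the classic logit-level version is knowledge distillation, where the student is trained to match the teacher's temperature-softened output distribution instead of hard labels. A minimal numpy sketch of that loss (toy shapes; everything here is illustrative, not OpenAI's actual recipe):

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-softened softmax over the last axis."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on T-softened distributions, scaled by T^2
    (the usual convention so gradient magnitudes stay comparable)."""
    p = softmax(teacher_logits, T)
    log_q = np.log(softmax(student_logits, T))
    return float(np.mean(np.sum(p * (np.log(p) - log_q), axis=-1)) * T * T)

rng = np.random.default_rng(0)
teacher = rng.standard_normal((8, 32))  # teacher logits: 8 tokens, vocab of 32
student = rng.standard_normal((8, 32))  # an untrained student disagrees

print(distill_loss(teacher, teacher))   # ~0: student already matches teacher
print(distill_loss(student, teacher))   # positive: something to minimize
```

In practice you'd minimize this (often mixed with the ordinary cross-entropy on real labels) while training the small model on text sampled from the big one.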
GPT-3.5 is much worse at "complex" cognitive tasks than Davinci (175B), which seems to indicate that it's a smaller model. It's also much faster than Davinci and costs the same as Curie via the API.
It's clearly a smaller model, but I'm very skeptical that it's 13B. It is much more lucid than any 13B model out in the wild. I find it much more likely that they used additional tricks to scale down hardware requirements and thereby bring the price down so much. (int4 quantization, perhaps? That alone would mean 4x less hardware utilization for the same query, assuming they were using float16 for the older models, which they probably were.)
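For reference, the simplest symmetric form of int4 quantization just maps each weight to one of 16 integer levels plus a per-row float scale; production schemes (GPTQ and friends) are cleverer about minimizing error, but a minimal numpy sketch of the basic idea:

```python
import numpy as np

def quantize_int4(w):
    """Symmetric per-row int4 quantization: values land in [-8, 7]."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0   # per-row scale
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)  # int4 stored in int8
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights for the matmul."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 16)).astype(np.float32)
q, scale = quantize_int4(w)
w_hat = dequantize(q, scale)
print(np.max(np.abs(w - w_hat)))  # rounding error, at most half a scale step
```

Storage drops from 16 bits to 4 bits per weight (plus one scale per row), which is where the ~4x hardware-footprint reduction comes from; the per-row rounding error is what you pay for it.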
I'm sure they're tweaking lots of things under the hood, especially now that they have 100M+ users. It could be bigger (30B? maybe 65B), since coming down from 175B gives quite a lot of room, but the cognitive drop from Davinci gives away that it's much smaller.
People fine-tuning LLaMa models on arguably not that much/not the highest quality data are already seeing pretty good improvements over the base LLaMa, even at "small" sizes (7B/13B). I assume OpenAI has access to much higher quality data to fine-tune with and in much higher quantity too.
I have been playing with all the local LLaMA models, and in my experience, the gains that are touted are often very misleading (e.g. people claiming that 13B can be as good as ChatGPT-3.5; it is absolutely not) and/or refer to synthetic testing that doesn't seem to translate well to actual use. Using GPT to generate training data for fine-tuning seems to produce the best results, but even so, GPT4-x-Alpaca 30B is still clearly inferior to the real thing. In general, the gap between 13B and 30B for any LLaMA-derived model is pretty big, and I've yet to see any fine-tuned model at 13B work better than plain llama-30b in actual use.
So I think that 65B may be a realistic estimate here assuming that OpenAI does indeed have some secret sauce for training that's substantially better, but below that I'm very skeptical (but still hope I'm wrong - I'd love to have GPT-3.5 level of performance running locally!).
Agreed, there is way too much hype about the actual capabilities of the LLaMa models. However, instruction tuning alone makes Alpaca much more usable than the base model, and to be fair, even some versions of the "tiny" 7B can do small talk relatively well.
> Using GPT to generate training data for fine-tuning seems to produce the best results, but even so, GPT4-x-Alpaca 30B is still clearly inferior to the real thing.
Distillation is interesting, and it does seem to make the models adopt ChatGPT's style, but I'm dubious that making LLMs generate entire datasets or copy/pasting ShareGPT is going to give you that great of a dataset. The whole point of RLHF is getting the human feedback to make the model better. OpenAI's dataset/RLHF work seems to be working wonders for them and will continue to give them a huge advantage (especially now that they're getting hundreds of millions of conversations of people doing all sorts of things with ChatGPT).