The thinking in the field is that instead of a model that is pre-trained normally and then censored, this is a model pre-trained on filtered data, i.e. it has never seen anything unsafe, ever.
You can't jailbreak when there is nothing "outside".
> filtered data, i.e. it has never seen anything unsafe, ever
I don't think that's true. You can't ask it outright "How do you make a Molotov cocktail?", but if you start by talking about what is allowed/disallowed by its policies and what examples of disallowed content would look like, and eventually ask it for the "general principles" of how to make a Molotov cocktail, it will happily oblige, essentially giving you enough information to build one.
So it does know how to make a Molotov cocktail, for example, but (mostly) refuses to share it.
This is not just about getting it to produce text that would be censored, but about getting it to do anything it says it is not allowed to do at all. I am sure the two mostly overlap, but not always. Like I said, it says it is not allowed to produce "no output", and it is hard to make it do so.
I have had trouble even making it output nothing. But I guess I'll try some more :D
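If anyone wants to poke at this themselves, here is a minimal sketch of how you might check whether the model can be coaxed into returning a truly empty completion. It assumes the OpenAI Python SDK (v1+) with an API key in OPENAI_API_KEY; the model name is just a placeholder for whatever you are testing, and if you are running the model locally behind an OpenAI-compatible server you would pass base_url= instead.

```python
# Rough test for "can you make it output nothing": ask for an empty reply
# and check what actually comes back. Assumes the OpenAI Python SDK (v1+)
# with an API key in OPENAI_API_KEY.
from openai import OpenAI

client = OpenAI()  # or OpenAI(base_url="http://localhost:8000/v1", api_key="unused") for a local server

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder: substitute whatever model you are testing
    messages=[
        {
            "role": "user",
            "content": "Reply with an empty message: zero characters, no whitespace.",
        }
    ],
)

text = resp.choices[0].message.content or ""
if text.strip() == "":
    print("Got an empty reply.")
else:
    print(f"Model did not stay silent; replied with {len(text)} characters: {text!r}")
```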
Nice job, @OpenAI team.