I have a small test suite for the voice AI math tutor we built, about 50 tests, mostly about correctly following the system instructions. The newly released Flash 2.5 is much worse than the current stable version. Gemini 2.5 Pro fails 2-3 tests. Flash 2.5 stable, which we use in production, fails about 10, and the new one fails 20. Every test runs 3 times and the model has to be right every time. I'll look into it more; I haven't yet looked at the actual output.
This is not about solving math; the system follows given solution paths.
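
For anyone wondering what "right every time" means in practice, here is a minimal sketch of that pass criterion. It is not our actual harness: `passes_every_run`, `count_failures`, and the `check` callables are hypothetical names, and each `check` is assumed to stand in for one model call plus an assertion on the response.

```python
from typing import Callable, Dict


def passes_every_run(check: Callable[[], bool], runs: int = 3) -> bool:
    """Return True only if check() succeeds on every attempt.

    all() stops at the first failing run, so a test that fails early
    does not spend further model calls.
    """
    return all(check() for _ in range(runs))


def count_failures(tests: Dict[str, Callable[[], bool]], runs: int = 3) -> int:
    """Count the tests that fail at least one of their runs."""
    return sum(1 for check in tests.values() if not passes_every_run(check, runs))


if __name__ == "__main__":
    # Hypothetical stand-ins for real checks against the model.
    outcomes = iter([True, False, True])
    tests = {
        "always_follows_instructions": lambda: True,
        "sometimes_breaks": lambda: next(outcomes),
    }
    print(count_failures(tests))
```

This criterion is strict on purpose: if a single run passes with probability 0.9 and runs are independent, all three pass only about 73% of the time (0.9^3 ≈ 0.729), so the failure counts above also pick up instructions the model follows most of the time but not reliably.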