Still waiting for GPT-4V, but I doubt it will do this. Yes, I’ve tried Donut and other options, but this is a very gnarly problem.
One option is to extract text blocks along with their coordinates (unstructured.io gives you this, probably via another pkg under the hood, since it’s basically a container for many pkgs). Then do the same with a blank template. What's left is an algorithmic problem: spatially matching the filled-in values to the key locations from the template.
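A rough sketch of that matching step, to make the idea concrete. This is not unstructured.io's API: the `Block` type, the `overlaps` helper, the `match_values_to_keys` function, and the `tol` pixel tolerance are all made up for illustration, and it assumes you've already pulled `(text, x, y, w, h)` out of both the blank template and the filled form. It also assumes a value sits to the right of or below its key, which real forms will break in places.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class Block:
    """Hypothetical text block: text plus its bounding box (top-left origin)."""
    text: str
    x: float  # left edge
    y: float  # top edge
    w: float
    h: float

    @property
    def center(self):
        return (self.x + self.w / 2, self.y + self.h / 2)


def overlaps(a, b, tol=5.0):
    """True if two blocks start at roughly the same position."""
    return abs(a.x - b.x) <= tol and abs(a.y - b.y) <= tol


def match_values_to_keys(template, filled, tol=5.0):
    """Map each template key's text to the filled-in values nearest to it."""
    # Anything in the filled form that also appears in the blank template
    # (same text, same place) is a printed label, not a filled-in value.
    values = [
        f for f in filled
        if not any(f.text == t.text and overlaps(f, t, tol) for t in template)
    ]

    result = {}
    for v in values:
        # Assumption: a value's key starts above it or to its left.
        candidates = [
            k for k in template
            if k.x <= v.x + tol and k.y <= v.y + tol
        ]
        if not candidates:
            continue  # orphan value, no plausible key
        nearest = min(candidates, key=lambda k: math.dist(k.center, v.center))
        result.setdefault(nearest.text, []).append(v.text)
    return result
```

Nearest-center distance is the crudest possible choice here. On dense forms or multi-column layouts you would likely want something smarter (e.g. a one-to-one assignment, or distance from the key's right edge to the value's left edge), and OCR jitter between the two scans means the template and filled coordinates may need aligning first.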
In theory it's exactly this...