I roughly came to the same conclusion a few months back and wrote a simple, containerized, open source general purpose scraper for use with GPT using Playwright in C# and TypeScript that's fairly easy to deploy and use with GPT function calling[0]. My observation was that using `document.body.innerText` was sufficient for GPT to "understand" the page and `document.body.innerText` preserves some whitespace in Firefox (and I think Chrome).
I use more or less this code as a starting point for a variety of use cases and it seems to work just fine for my use cases (scraping and processing travel blogs which tend to have pretty consistent layouts/structures).
Some variations can make this better by adding logic to look for the `main` content and ignore `nav` and `footer` (or variants thereof whether using semantic tags or CSS selectors) and taking only the `innerText` from the main container.
I use more or less this code as a starting point for a variety of use cases and it seems to work just fine for my use cases (scraping and processing travel blogs which tend to have pretty consistent layouts/structures).
Some variations can make this better by adding logic to look for the `main` content and ignore `nav` and `footer` (or variants thereof whether using semantic tags or CSS selectors) and taking only the `innerText` from the main container.
[0] https://github.com/CharlieDigital/playwright-scrape-api