I've been experimenting with this recently as well, but with an app on my Apple Watch. I'm looking for a method/model to split different speakers into separate tracks so I can focus only on audio from myself and certain people.
Ahh, I'm working on the exact same project. I applied to YC with the idea and was told that "nobody wants this" during the interview.
There are a ton of problems around privacy and UX. But I'm incredibly excited about projects in this space because in modern society we're basically surrounded by a million unhealthy things designed to tempt us. Logging forces you to stay honest. I've already been shocked by how many unhealthy habits I'd underestimated and how many healthy habits I'd overestimated.
My #1 priority is just to improve my own physical and mental health. Whether there’s a market for this stuff, who knows.
Check out this model; I've had limited success with it.
The best I've done so far is to attach the speaker labels it gives to whatever Whisper segments they overlap, which means some sentences end up with multiple speakers, though that's mostly due to cross-talk. I'd say it gets it right ~80% of the time with the 5 speakers I've tried it on across ~16 hours of audio. A rough sketch of that overlap step is below.
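To make the overlap idea concrete, here's a minimal sketch of that assignment step. It assumes you already have Whisper segments (dicts with "start", "end", "text", which is what whisper's transcribe output gives you) and diarization turns as (start, end, speaker) tuples; the function names and the turn format are illustrative, not from any particular library.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Seconds of overlap between two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_segments(whisper_segments, diarization_turns):
    """Attach every speaker whose diarization turn overlaps a Whisper segment.

    Segments that overlap turns from several speakers (cross-talk) keep all
    of those labels, which matches the behaviour described above.
    """
    labelled = []
    for seg in whisper_segments:
        speakers = {}
        for turn_start, turn_end, speaker in diarization_turns:
            dur = overlap(seg["start"], seg["end"], turn_start, turn_end)
            if dur > 0:
                speakers[speaker] = speakers.get(speaker, 0.0) + dur
        labelled.append({
            "start": seg["start"],
            "end": seg["end"],
            "text": seg["text"],
            # speakers sorted by how much of the segment they cover
            "speakers": sorted(speakers, key=speakers.get, reverse=True),
        })
    return labelled


# Tiny made-up example: one segment covered by two speakers' turns,
# so it comes back labelled with both (the cross-talk case).
segments = [{"start": 0.0, "end": 4.2, "text": "hey, how's it going"}]
turns = [(0.0, 3.0, "SPEAKER_00"), (2.5, 4.2, "SPEAKER_01")]
print(label_segments(segments, turns))
```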
We're experimenting with building out a version of this too, but on desktop, at www.usebacktrack.com - we should have speaker/input splitting early next year and will see what that's like.