This guy (Phil Wang, https://github.com/lucidrains) seems to have made a hobby of implementing every model and paper he finds interesting. See his GitHub page: he has 228 repos, most of them implementations of machine learning papers, and some of them are quite popular.
The project README thanks "Stability.ai for the generous sponsorship to work and open source cutting edge artificial intelligence research", so it's not necessarily just a hobby (though it's possible they just provide compute resources).
Phil's homepage [1] links to a form [2] where you can suggest a paper for him to implement.
I don't understand how this got so many upvotes. It takes only one minute to read the code and realize that the model is not yet completely implemented. Sometimes I have the feeling that people upvote posts without even reading them...
Of course, it's good work, and knowing lucidrains' track record it will probably be fully implemented in the coming days/weeks. But I wonder how many people actually opened the link before upvoting it.
This question is tangential to your work. Having never used music LMs, and being only cursorily aware of them: how do you keep up with the state of the art in your field?
My day job is in ML, but I also enjoy music making as a hobby (on a very amateur level - mostly making 4-bar loops on a handheld tracker or knob twiddling on 90s synths I couldn't afford as a kid). I see an interesting mix of curiosity, hostility and head-in-the-sand attitudes from the musician communities. Though the "head-in-the-sand" component will almost certainly start becoming less prevalent with this and other models that are sure to come out in short order.
I'm pretty sure we'll soon start seeing the same kind of dynamics in music that have already played out in the visual arts community, not that the dust has settled there yet. I hope there isn't much negative financial impact on people's livelihoods, but maybe some of it will be unavoidable. And of course, AI is also coming for programmers' jobs, which will hit even closer to home. The next decade will be "interesting", so to speak.
I also think that it's still very new, and we have not yet discovered the creative limitations of the tech.
I personally expect it will turn out that there is zero creativity inherent to the tech. After a while it will become apparent that, without constant new training input from real brains, the output doesn't creatively evolve; it will become boring to many and end up relegated to elevators, hold music, etc., while people rebel and quite possibly spark another folk music explosion. Banjos refuse to die, and that will remain true.
I might be wrong, but that's how I'd bet my twenty bucks.
I am not arguing that it won't be successful, just that I don't like it at its current stage of progress. Then again, most of the music people hear these days is nothing special either (either quite dumb or just a rinse-and-repeat of some older successful musical forms).
However, there are many models that do output MIDI. That's actually much simpler, and was already being done a few years ago.
I thought OpenAI did this. But then, I might misremember, because their Jukebox actually also seems to produce raw audio (https://openai.com/blog/jukebox/).
I think that description applies to Riffusion, one of the earlier models in this area, which took a pretty straightforward approach: adapt image-based diffusion models to making music by treating spectrograms as images. But this model uses SoundStream, which is another model with its own paper. It's described as a "neural audio codec": by itself, it's a model that encodes and decodes audio into "tokens", so it's sort of like other codecs (e.g., MP3), except that the compressed representation is a higher-level learned one. MusicLM outputs the tokens, which are then decoded by SoundStream. The tokens probably encode a lot of the same spectral information contained in spectrograms (or, similarly, mel-frequency features) but seem to be a bit more expressive/data-efficient.
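To make the spectrogram-as-image idea concrete, here's a minimal sketch (my own, not from Riffusion's code) using librosa; the file name and parameters are just placeholders. SoundStream's learned tokens replace this kind of hand-crafted representation with one the codec learns end to end.

```python
# Sketch of the "spectrogram as image" idea: audio -> mel spectrogram
# (a 2D array an image model could operate on) -> approximate audio again.
# "clip.wav" and all parameters here are illustrative, not from Riffusion.
import librosa
import numpy as np

y, sr = librosa.load("clip.wav", sr=22050)                   # mono waveform
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128) # mel bins x frames
mel_db = librosa.power_to_db(mel, ref=np.max)                # log scale, image-like

# ...an image diffusion model would generate/edit mel_db here...

# Approximate inversion (Griffin-Lim) from the (possibly modified) spectrogram
mel_power = librosa.db_to_power(mel_db)
y_hat = librosa.feature.inverse.mel_to_audio(mel_power, sr=sr)
```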
Quote: “By relying on pretrained and frozen MuLan, we need audio-only data for training the other components of MusicLM. We train SoundStream and w2v-BERT on the Free Music Archive (FMA) dataset (Defferrard et al., 2017), whereas the tokenizers and the autoregressive models for the semantic and acoustic modeling stages are trained on a dataset containing five million audio clips, amounting to 280k hours of music at 24 kHz.”
TL;DR: you can only get out of these models what you put in, and these are trained on raw audio.
If you want MIDI output, you need to train a model on MIDI data.
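For a sense of what "training on MIDI data" involves, here's a rough, hypothetical sketch of turning a MIDI file into a token sequence using pretty_midi. The (pitch, duration) vocabulary is deliberately naive; real systems use richer event vocabularies (note-on/off, time shifts, velocity, etc.).

```python
# Naive MIDI-to-token sketch (illustrative only): each note becomes a
# (MIDI pitch, quantized duration) pair that a sequence model could train on.
import pretty_midi

def midi_to_tokens(path: str, step: float = 0.125) -> list[tuple[int, int]]:
    pm = pretty_midi.PrettyMIDI(path)
    tokens = []
    for inst in pm.instruments:
        if inst.is_drum:
            continue
        for note in sorted(inst.notes, key=lambda n: n.start):
            dur = max(1, round((note.end - note.start) / step))
            tokens.append((note.pitch, dur))  # (pitch, duration in time steps)
    return tokens

# tokens = midi_to_tokens("song.mid")  # feed these to a sequence model
```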
Pardon my ignorance - what exactly is involved in reimplementing these models?
I assume there's only a superficial description of the architecture, and no weights to load in, so you'll have to train everything from scratch? Do we even have their dataset?
Generally it's without weights, but MusicLM is also a WIP. More mature implementations include instructions for how to train them, plus follow-ups on small-scale/crowd-sourced experiments & research [1].