Hacker News
Open source implementation of Google's MusicLM in PyTorch (github.com/lucidrains)
118 points by bevenky on Jan 31, 2023 | 22 comments


This guy (Phil Wang, https://github.com/lucidrains) seems to have made a hobby of implementing every model and paper he finds interesting. See his GitHub page: he has 228 repos, most of them implementations of some machine learning paper, and some of those repos are quite popular.


The project README thanks "Stability.ai for the generous sponsorship to work and open source cutting edge artificial intelligence research", so it's not necessarily just a hobby (though it's possible they just provide compute resources).

Phil's homepage [1] links to a form [2] where you can suggest a paper for him to implement.

[1]: https://lucidrains.github.io/

[2]: https://forms.gle/Dtrxc6CceHEcqS6X6


He is open to consulting work, so the repos double as a nice gallery of what's possible and good learning material.

https://lucidrains.github.io/

He is also the creator of ThisPersonDoesNotExist.com


The implementations are quite clean too, and provided you have read the papers, they are easy to understand.


I don't understand how this got so many upvotes. It takes only one minute to read the code and realize that the model is not yet completely implemented. Sometimes I have the feeling that people upvote posts without even reading them...

Of course, it's good work, and given lucidrains' track record it will probably be fully implemented in the coming days/weeks. But I wonder how many people have at least opened the link before upvoting it.


This question is a tangent to your work. Having never used music LMs, and only being cursorily aware of them - how do you keep up with the sota in your field?


Google's MusicLM sounds plausible, but quite dull and even sometimes irritating to my musician's ear.


As another musician, I'll point out that there was a point when the only thing AI could produce visually was a lot of dog faces.


My day job is in ML, but I also enjoy music making as a hobby (on a very amateur level - mostly making 4-bar loops on a handheld tracker or knob twiddling on 90s synths I couldn't afford as a kid). I see an interesting mix of curiosity, hostility and head-in-the-sand attitudes from the musician communities. Though the "head-in-the-sand" component will almost certainly start becoming less prevalent with this and other models that are sure to come out in short order.

I'm pretty sure we'll soon start seeing the same dynamics that played out in the visual-arts community play out in music, not that the dust has settled there yet. I hope there isn't much negative financial impact on people's livelihoods, but some may be unavoidable. And of course, AI is also coming for programmers' jobs, which will hit even closer to home. The next decade will be "interesting", so to speak.


I also think that it's still very new, and we have not yet discovered the creative limitations of the tech.

I personally expect it will turn out that there is zero creativity inherent to the tech, and that it will become apparent after a while that, without constant new training input from real brains, the output will not creatively evolve. It will become boring to many and end up relegated to elevators, hold music, etc., while people rebel and quite possibly have another folk-music explosion. Banjos refuse to die, and that will remain true.

I might be wrong, but that's how I'd bet my twenty bucks.


I am not arguing that it won't be successful, just that I don't like it at its current stage of progress. Actually, most of the music people hear nowadays is nothing special either (either quite dumb or just a rinse-and-repeat of some older successful musical forms).


Does anyone know if these models can also output MIDI instead of plain audio?


This model is designed to output raw audio.

However, there are many models which do output MIDI. That's actually much simpler and was already done a few years ago.

I thought OpenAI did this. But then, I might misremember, because their Jukebox actually also seems to produce raw audio (https://openai.com/blog/jukebox/).

Edit: Ah, it was even earlier, OpenAI MuseNet, this: https://openai.com/blog/musenet/

However, MIDI generation is so simple that you even find it in tutorials: https://www.tensorflow.org/tutorials/audio/music_generation
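To illustrate why symbolic generation is so much simpler than raw audio: a "composition" is just a list of note events, so a model only has to predict (pitch, duration) pairs, and turning those into a playable file is a few bytes of bookkeeping. A minimal sketch (the note list here is hardcoded, standing in for whatever a model would sample):

```python
import struct

# Minimal single-track Standard MIDI File writer, done by hand to show
# how little is involved. The (pitch, beats) list stands in for model
# output; a real pipeline would sample these from a sequence model.

TICKS_PER_BEAT = 480

def varlen(n):
    """MIDI variable-length quantity encoding (7 bits per byte)."""
    out = [n & 0x7F]
    n >>= 7
    while n:
        out.append(0x80 | (n & 0x7F))
        n >>= 7
    return bytes(reversed(out))

def write_midi(notes, path):
    """notes: list of (midi_pitch, beats). Writes a format-0 .mid file."""
    track = bytearray()
    for pitch, beats in notes:
        track += varlen(0) + bytes([0x90, pitch, 96])   # note on, velocity 96
        track += varlen(int(beats * TICKS_PER_BEAT)) + bytes([0x80, pitch, 0])  # note off
    track += varlen(0) + bytes([0xFF, 0x2F, 0x00])      # end-of-track meta event
    with open(path, "wb") as f:
        f.write(b"MThd" + struct.pack(">IHHH", 6, 0, 1, TICKS_PER_BEAT))
        f.write(b"MTrk" + struct.pack(">I", len(track)) + bytes(track))

# C major arpeggio, one beat per note, held final note
write_midi([(60, 1), (64, 1), (67, 1), (72, 2)], "toy.mid")
```

Compare that with raw audio, where the model has to predict tens of thousands of samples per second of sound.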


Not out of the box, afaik. They produce spectrograms that get converted into wav/mp3.


I think that description applies to Riffusion, one of the earlier models in this area, which was a pretty straightforward adaptation of image-based diffusion models to making music, since you can treat spectrograms as images.

But this model uses SoundStream, another model with its own paper. It's described as a "neural audio codec": by itself, it is a model that encodes and decodes audio into "tokens", so sort of like other codecs (e.g., MP3), except that the compressed representation it uses is a higher-level learned representation. This model outputs the tokens, which are then decoded by SoundStream. The tokens probably encode much of the same spectral information contained in spectrograms (or, similarly, mel-frequency features) but seem to be a bit more expressive and data-efficient.
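The "audio in, discrete tokens out, audio back" idea can be sketched in a few lines. Everything below is illustrative, not SoundStream itself: a real neural codec learns its encoder, decoder, and codebooks, whereas this toy just does nearest-neighbour vector quantization of waveform frames against a fixed random codebook.

```python
import numpy as np

# Toy sketch of a "codec" that turns audio into discrete tokens and back.
# In a real neural codec, an encoder network produces latents, residual
# vector quantization maps them to codebook indices (the "tokens"), and a
# decoder network reconstructs the waveform. Here the networks are
# replaced by identity and the codebook is random.

rng = np.random.default_rng(0)

FRAME = 4           # samples per frame (real codecs quantize learned latents)
CODEBOOK_SIZE = 16  # number of distinct tokens

codebook = rng.normal(size=(CODEBOOK_SIZE, FRAME))

def encode(audio):
    """Quantize each frame to the index of its nearest codebook vector."""
    frames = audio.reshape(-1, FRAME)
    # squared distance from every frame to every codebook entry
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)  # one integer token per frame

def decode(tokens):
    """Map tokens back to a waveform by codebook lookup."""
    return codebook[tokens].reshape(-1)

audio = rng.normal(size=32)        # fake waveform: 8 frames
tokens = encode(audio)             # discrete sequence an LM could predict
reconstruction = decode(tokens)    # lossy waveform reconstruction
```

The point of the token layer is that the language-model part of MusicLM only has to predict short sequences of integers, not raw samples, and the codec handles the mapping back to audio.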


No. They can’t.

You could train a model that could, but these models can’t.

Paper: https://google-research.github.io/seanet/musiclm/examples/

Quote: “By relying on pretrained and frozen MuLan, we need audio-only data for training the other components of MusicLM. We train SoundStream and w2v-BERT on the Free Music Archive (FMA) dataset (Defferrard et al., 2017), whereas the tokenizers and the autoregressive models for the semantic and acoustic modeling stages are trained on a dataset containing five million audio clips, amounting to 280k hours of music at 24 kHz.”

TL;DR: you can only get out of these models what you put in, and these ones are trained on raw audio.

If you want MIDI output, you need to train a model on MIDI data.


Seems to be an early WIP.


This one is AudioLM, modified from https://github.com/lucidrains/audiolm-pytorch to support the music-generation needs of MuLaN.


Nice work!

Won't training the model be very costly, though?


Implementation of MusicLM, Google's new SOTA model for music generation using attention networks, in PyTorch.

https://github.com/lucidrains/musiclm-pytorch/blob/main/musi...


Pardon my ignorance - what exactly is involved in reimplementing these models?

I assume there's only a superficial description of the architecture, and no weights to load in, so you'll have to train everything from scratch? Do we even have their dataset?


Generally it's without weights, and MusicLM is also a WIP. More mature implementations include descriptions of how to train them, and follow-ups on small-scale/crowd-sourced experiments & research [1].

[1]: https://github.com/lucidrains/denoising-diffusion-pytorch



