Review

Google Research submitted a paper on January 26, 2023 for MusicLM, a remarkably powerful AI model that converts user text prompts into music.

https://google-research.github.io/seanet/musiclm/examples/

On May 10th, Google Research opened a public waitlist that lets applicants try it out. In about 6 seconds, it returns 2 songs to the user, each 20 seconds long.

https://blog.google/technology/ai/musiclm-google-ai-test-kitchen/

The music I have gotten it to make is beyond dreamy. It often sounds human-level to me. It is incredibly fast at realizing my dreams. I could easily make a 20-minute track that gets incredibly complex and advanced if they let us extend a song from X seconds in.

I have been testing its limits and abilities from May 10th up to today, just over 3 months later. About 1 month in, I noticed the AI was losing many abilities: the same prompts were producing noticeably different music, which suggested to me that something was changing for the worse. By about 2 months in, the outputs were simply no longer intelligent, advanced, or jaw-dropping, as if the AI were one of the old-school models that weren't very good back in the day. Now, 3 months in, I can see this truly appears to be happening, because most songs just repeat a short 2 or 4 second beat without doing much at all. It's embarrassingly horrible right now. There are exceptions: a few prompts still somewhat work, some feel like 2-months-in quality, and some like 1-month-in quality. I'd say very few, if any, are at the 0-months-in level. I feel like I lost my pet dog, or a really good dream I once had. I wish they had never changed the model. I saved all my dreamy tests, though there are a few harder prompts I just came up with that I want to document but now can't. After some 100 prompts it says to come back in 30 days (oddly, just when MusicLM would get noticeably worse), but this can be bypassed by coming back the next day.

My early tests. The rest on the 1st SoundCloud are good too:

https://www.reddit.com/r/singularity/comments/13h0zyy/i_really_crank_out_music_tracks_with_musiclm_this/

My 2nd SoundCloud has both the 3-months-in tests and more early tests I had.

https://soundcloud.com/steven-aiello-992033649/sets

I believe the problem starts when MusicLM asks the user to choose which of the 2 generated songs they like more by giving it a trophy, which is used to improve the model. Users are only going to make a snap judgement (I got some wrong myself and couldn't go back to change them), which isn't enough time to truly rank the two. It is also a binary vote. Furthermore, both tracks often show their own unique features and abilities. I'm guessing there are possibly up to hundreds of short, useful features in each, while the whole 20 seconds would be the largest feature/memory in the AI's model. So trying to make the model ignore one song a bit more is only going to remove all sorts of its own abilities at random, and eventually, I'm going to guess, completely. I know sometimes one song might be better, and you would think this would make the AI smarter, but perhaps their algorithm can't properly keep learning/updating the model as intended. Regardless, the AI seems a ton worse right now.

Important notes:

MusicLM finished training a long time ago, it seems. Why would they still be fiddling with training it? It clearly was already jaw-dropping. We already know from my tests that the model did change (I'm 100% sure of that from the August 20th ice castle test set alone), so it seems likely that it is using user feedback, given as we use it, to change the model. They might be trying to shape it to our likes no matter what abilities that costs the model, because they worry about people complaining that it knows perhaps copyrighted songs or NSFW material. Or maybe they simply want to make it dumber, or make it better. Any of those 3 possibilities would be sad; hopefully there is simply an error in their algorithm or their understanding of it. It was already good and definitely didn't get better.

Even though most songs it now makes usually just repeat a short 2-4 second melody, many of the same instruments from my very early tested prompts are still present. Usually.

Many of my old prompts no longer run, for no apparent reason. I tried getting one of them to run, and all I had to do was remove the space between the '.' and the 'L', OR leave the space and change the 'L' to a lowercase 'l'. The part of the prompt I did this to was: '...taiko, and guzheng. Layer in synthesized...'. The short prompt 'violence' won't run in MusicLM, and neither will 'megurine luka' or 'rin kagamine', but the more popular 'hatsune miku' does. Somewhere, AI is behind these word choices, but so are humans, and this changed after public testing started.

We have already seen other AIs that seem susceptible to degrading over time: GPT-4 and DALL-E 2. I personally saw DALL-E 2 change, since the outputs for the same prompt became different. I think it got a little worse, though not much worse. We likely loved it at first because it made 10 small images that look like eye candy when seen close together at a glance, and we never had to see the weird details of any expanded image unless we viewed them one at a time. A study also tracked ChatGPT over time and seems to say it might have gotten worse. It seems GPT-4 was made worse even before release because it was too powerful, though I am not sure about this claim. It still seems quite great to me. Google has not yet released many of its AIs, partially due to fear of misuse, it seems. Additionally, GPT-3.5 and GPT-4 seem to have feedback buttons too, so perhaps they too are using user input to change the model. This seems to be a new thing. I also saw MidJourney 5 do this before the actual release: they said it wasn't yet producing interesting images, so they had to do another round of votes for another few days. From the few dozen images I saved before and after, I'm somewhat sure the earlier MJ5 made more accurate, prompt-following, and interesting images. This year I also saw Stable Diffusion ask us to choose which of 2 images best matches the prompt.

Someone mentioned that some of my early tests are long or start with "Create a..." (a few nice ones were written by GPT-4, which came up with "Create a..."), wording the AI might not usually see in track titles. If anything, this only shows that the early MusicLM had no problem understanding these prompts. I also have other tests, and any prompt you try comes out bad in today's MusicLM, so I don't consider this a problem in my testing.

I made a small correction to some of my prompts on SoundCloud etc. I of course don't write prompts like "The Cat Was..." the way SoundCloud converts them for display, but sometimes the first letter is indeed capitalized, so still check the comments section for the exact prompt. As of August 26 I have fully (or almost fully!) added the exact prompts, showing where capital letters are and aren't. I of course have them on file correctly, and I tested with the same prompts.

In conclusion, MusicLM definitely changed, and based on all my tests I would say it had become almost useless 3 months later. I seriously hope they give us back the day-1 model.

Comments:

Listening to your sets of samples, the collapse in quality does seem drastic.

This is a little puzzling because there's nothing about RLHF mode collapse per se which ought to trigger repetition within the song. Your complaint ought to be that the tracks all now sound similar to each other and blandly mealy-mouthed, not that each track sounds similar to itself... That should actually be strongly discouraged by the user ratings, as repetition is a simple pattern which ought to be easy for listeners to downvote and for reward models to detect & punish.

This makes me wonder if they screwed up in some way; for example, a way in which RLHF could produce this pathological degradation to repeating loops would be if they took a shortcut & applied RLHF only on a small window like 10s (rather than the entire music track). This would lead the model to greedily optimize only for a short snippet, and then mode collapse onto just repeating that snippet indefinitely (but uniquely so per track/prompt, and not necessarily collapsing all similarly-prompted tracks onto the same track/snippet). User ratings penalizing repetition would then have no effect, because once the rating filters down to the chopped-up-10s-snippet dataset, the repetition has disappeared, leaving the model baffled as to why one was preferred when they are (at the snippet level) equally good. (Which could then screw up the model further by adding tons of noise to the already-impoverished feedback signal.)
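
To make the windowing hypothesis above concrete, here is a toy Python sketch. Everything in it is an assumption for illustration only: a track is modeled as a list of one-second labels, the 10-second window is the hypothetical snippet length from the paragraph above, and `chop`, `repetition_score`, and `scores` are made-up helper names. It says nothing about how MusicLM or its feedback pipeline actually works; it only shows that a track which loops one snippet looks identical to a non-repeating track once both are chopped into windows.

```python
def chop(track, window):
    """Split a track (a list of per-second labels) into non-overlapping windows."""
    return [tuple(track[i:i + window]) for i in range(0, len(track), window)]


def repetition_score(items):
    """Fraction of items that duplicate another item (0.0 means all unique)."""
    return 1 - len(set(items)) / len(items)


def scores(track, window=10):
    """Return (repetition inside each window, repetition across whole windows)."""
    windows = chop(track, window)
    per_window = [repetition_score(w) for w in windows]
    whole_track = repetition_score(windows)
    return per_window, whole_track


WINDOW = 10  # hypothetical snippet length in seconds (an assumption)

snippet = list(range(WINDOW))            # ten distinct one-second "sounds"
looping_track = snippet * 2              # 20 s: the same 10 s snippet twice
varied_track = list(range(2 * WINDOW))   # 20 s: never repeats

for name, track in [("looping", looping_track), ("varied", varied_track)]:
    per_window, whole_track = scores(track, WINDOW)
    print(name, "per-window:", per_window, "whole-track:", whole_track)
```

In this toy setup both tracks get the same per-window scores, and only the whole-track score tells them apart, which is the sense in which a listener's dislike of repetition could vanish from a snippet-level dataset.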

Any updates on this? For example, I notice that the new music services like Suno & Udio seem to be betraying a bit of mode collapse and noticeable same-yness, but they certainly do not degenerate into within-song repetition the way these did.

Yes, I see it as a risk that AI will eventually be aligned with mean human values, excluding some outliers, and that this will result in a boring, almost senseless world.