Qwen3-TTS family is now open sourced: Voice design, clone, and generation

simonw · 2026-01-22T17:22:13 1769102533

If you want to try out the voice cloning yourself you can do that on this Hugging Face demo: https://huggingface.co/spaces/Qwen/Qwen3-TTS - switch to the "Voice Clone" tab, paste in some example text and use the microphone option to record yourself reading that text - then paste in other text and have it generate a version of that read using your voice.

I shared a recording of audio I generated with that here: https://simonwillison.net/2026/Jan/22/qwen3-tts/

javier123454321 · 2026-01-22T17:38:00 1769103480

This is terrifying. With this and z-image-turbo, we've crossed a chasm. And a very deep one. We are currently protected by screens, we can, and should assume everything behind a screen is fake unless rigorously (and systematically, i.e. cryptographically) proven otherwise. We're sleepwalking into this, not enough people know about it.

rdtsc · 2026-01-22T17:59:59 1769104799

That was my thought too. You’d have “loved ones” calling with their faces and voices asking for money in some emergency. But you’d also have plausible deniability as anything digital can be brushed off as “that’s not evidence, it could be AI generated”.

rpdillon · 2026-01-22T22:20:10 1769120410

Only if you focus on the form instead of the content. For a long time my family has had secret words and phrases we use to identify ourselves to each other over secure, but unauthenticated, channels (i.e. the channel is encrypted, but the source is unknown). The military has had to deal with this for some time, and developed various forms of IFF that allies could use to identify themselves. E.g. for returning aircraft, a sequence of wing movements that identified you as friend. I think for a small group (in this case, loved ones), this could be one mitigation of that risk. My parents did this with me as a kid, ostensibly as a defense against some other adult saying "My mom sent me to pick you up...". I never did hear of that happening, though.

neevans · 2026-01-22T18:14:48 1769105688

this was already possible with chatterbox for a long while.

freedomben · 2026-01-22T18:33:42 1769106822

Yep, this has been the reality now for years. Scammers have already had access to it. I remember an article years ago about a grandma who wired her life savings to a scammer who claimed to have her granddaughter held hostage in a foreign country. Turns out they just cloned her voice from Facebook data and knew her schedule so timed it while she would be unreachable by phone.

DANmode · 2026-01-22T18:52:22 1769107942

or anyone who refuses to use hearing aids.

fridder · 2026-01-22T21:12:19 1769116339

Admittedly I have not dug into it much, but I wonder if we might finally have a use case for NFTs and web3? We need some way to denote that items are person-generated, not AI. That would certainly be easier than trying to determine if something is AI generated.

simonw · 2026-01-22T22:00:59 1769119259

How would NFTs/web3 help differentiate between something created by a human and something that a human created with AI and then tagged with their signature using those tools?

grumbel · 2026-01-22T21:24:08 1769117048

That's the idea behind C2PA[1], your camera and the tools put a signature on the media to prove its provenance. That doesn't make manipulation impossible (e.g. you could photograph an AI image of a screen), but it does give you a trail of where a photo came from and thus an easier way to filter it or lookup the original.

[1] https://c2pa.org/

u8080 · 2026-01-22T22:08:01 1769119681

https://www.youtube.com/watch?v=diboERFAjkE pretty much this

javier123454321 · 2026-01-22T22:23:02 1769120582

Oh wow. Thank you for this. Amazing, terrifying, spot on, all of it.

arcanemachiner · 2026-01-22T22:52:48 1769122368

I knew what it would be before I even opened it. The crazy thing is that video is like 3 years old.

oceanplexian · 2026-01-22T20:13:38 1769112818

> This is terrifying.

Far more terrifying is Big Tech having access to a closed version of the same models, in the hands of powerful people with a history of unethical behavior (i.e. Zuckerberg's "Dumb Fucks" comments). In fact it's a miracle and a bit ironic that the Chinese would be the ones to release a plethora of capable open source models, instead of the scraps like we've seen from Google, Meta, OpenAI, etc.

javier123454321 · 2026-01-22T20:37:20 1769114240

I do strongly agree. Though the societal impact is only mitigated by open models, not curtailed at all.

echelon · 2026-01-22T18:38:36 1769107116

We're going to be okay.

There are far more good and interesting use cases for this technology. Games will let users clone their voices and create virtual avatars and heroes. People will have access to creative tools that let them make movies and shows with their likeness. People that couldn't sing will make music.

Nothing was more scary than the invention of the nuclear weapon. And we're all still here.

Life will go on. And there will be incredible benefits that come out of this.

javier123454321 · 2026-01-22T19:33:38 1769110418

I'm not denigrating the tech, all I'm saying is that we've crossed to new territory and there will be consequences that we don't understand from this. The same way that social media has been particularly detrimental to young people (especially women) in a way we were not ready for. This __smells__ like it could be worse, alongside with (or regardless of) the benefits of both.

I simply think people don't really know that the new world requires a new set of rules of engagement for anything that exists behind a screen (for now).

supern0va · 2026-01-22T18:42:26 1769107346

We'll be okay eventually, when society adapts to this and becomes fully aware of the capabilities and the use cases for abuse. But, that may take some time. The parent is right to be concerned about the interim, at the very least.

That said, I am likewise looking forward to the cool things to come out of this.

DANmode · 2026-01-22T18:52:56 1769107976

> People that couldn't sing will make music.

I was with you, until

But, yeah. Life will go on.

echelon · 2026-01-22T18:55:01 1769108101

There are plenty of electronic artists who can't sing. Right now they have to hire someone else to do the singing for them, but I'd wager a lot of them would like to own their music end-to-end. I would.

I'm a filmmaker. I've done photons-on-glass production for fifteen years. Meisner trained, have performed every role from cast to crew. I'm elated that these tools are going to enable me to do more with a smaller budget. To have more autonomy and creative control.

javier123454321 · 2026-01-22T19:41:17 1769110877

Yes, the flipside of this is that we're eroding the last bit of ability for people to make a living through their art. We are capturing the market for people to live off of making illustrations, to making background music, jingles, promotional videos, photographs, graphic design, and funnelling those earnings to NVIDIA. The question I keep asking is whether we care to value as a society for people to make a living through their art. I think there is a reason to care.

It's not so much of an issue with art for art's sake aided by AI. It's an issue with artistic work becoming unviable work.

volkercraig · 2026-01-22T20:54:42 1769115282

This feels like one of those tropes that keeps showing up whenever new tech comes out. At the advent of recorded music, I'm sure buskers and performers were complaining that live music is dead forever. Stage actors were probably complaining that film killed plays. Heck, I bet someone even complained that video itself killed the radio star. Yet here we are, hundreds of years later, live music is still desirable, plays still happen, and faceless voices are still around, they're just called v-tubers and podcasters.

javier123454321 · 2026-01-22T21:07:00 1769116020

umm, I don't know if you've seen the current state of trying to make a living with music, but it's widely accepted as dire. Touring is a loss leader, putting out music for free doesn't pay, streaming payouts are abysmally low. No one buys songs.

All that is before the fact that streaming services are stuffing playlists with AI generated music to further reduce the payouts to artists.

> Yet here we are, hundreds of years later, live music is still desirable, plays still happen, and faceless voices are still around...

Yes all those things still happen, but it's increasingly untenable to make a living through it.

DANmode · 2026-01-22T22:06:42 1769119602

What happens to lyricless electronica if suddenly every electronic artist has quality vocal-backing?

Oh no.

Maybe we did frig this up.

magicalhippo · 2026-01-22T19:08:54 1769108934

The HF demo space was overloaded, but I got the demo working locally easily enough. The voice cloning of the 1.7B model captures the tone of the speaker very well, but I found it failed at reproducing the variation in intonation, so it sounds like a monotonous reading of a boring text.

I presume this is due to using the base model, and not the one tuned for more expressiveness.

edit: Or more likely, the demo not exposing the expressiveness controls.

The 1.7B model was much better at ignoring slight background noise in the reference audio compared to the 0.6B model though. The 0.6B would inject some of that into the generated audio, whereas the 1.7B model would not.

Also, without FlashAttention it was dog slow on my 5090, running at 0.3X realtime with just 30% GPU usage. Though I guess that's to be expected. No significant difference in generation speed between the two models.

Overall though, I'm quite impressed. I haven't checked out all the recent TTS models, but a fair number, and this one is certainly one of the better ones in terms of voice cloning quality I've heard.
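
To put the "0.3X realtime" figure in concrete terms, here is a minimal sketch of the arithmetic. It assumes the figure means 0.3 seconds of audio produced per second of wall-clock time; the function names are made up for illustration and are not part of any Qwen3-TTS tooling.

```python
def realtime_factor(audio_seconds: float, wall_seconds: float) -> float:
    """Seconds of audio generated per second of wall-clock time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds


def wall_seconds_needed(audio_seconds: float, factor: float) -> float:
    """Wall-clock time needed to generate audio_seconds at a given factor."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return audio_seconds / factor


if __name__ == "__main__":
    # At 0.3x, one minute of speech takes a bit over three minutes to generate.
    print(round(wall_seconds_needed(60.0, 0.3)))
```

Under that reading, anything below 1.0 is too slow for live streaming but still workable for offline jobs such as audiobooks.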

dsrtslnd23 · 2026-01-22T22:53:01 1769122381

Any idea on the VRAM footprint for the 1.7B model? I guess it fits on consumer cards but I am wondering if it works on edge devices.
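
A rough lower bound can be worked out from the parameter count alone. The sketch below assumes the weights are held in a 2-byte format such as bfloat16, which is an assumption rather than something confirmed in this thread, and it counts weights only: activations, caches, the audio codec and framework overhead all come on top.

```python
# Bytes used per parameter for some common dtypes.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}


def weight_memory_gib(num_params: float, dtype: str = "bfloat16") -> float:
    """Memory taken by the model weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


if __name__ == "__main__":
    for name, params in [("0.6B", 0.6e9), ("1.7B", 1.7e9)]:
        for dtype in ("bfloat16", "float32"):
            print(f"{name} {dtype}: {weight_memory_gib(params, dtype):.2f} GiB")
```

By this estimate the 1.7B weights come to a little over 3 GiB at 2 bytes per parameter, so the real footprint depends mostly on what the runtime adds on top.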

thedangler · 2026-01-22T20:57:46 1769115466

How did you do this locally? Tools? Language?

pseudosavant · 2026-01-22T19:07:40 1769108860

Remarkable tech that is now accessible to almost anyone. My cloned voice sounded exactly like me. The uses for this will be from good to bad and everywhere in-between. A deceased grandmother reading "Good Night Moon" to grandkids, scamming people, the ability to create podcasts with your own voices from just prompts.

kingstnap · 2026-01-22T20:54:17 1769115257

It was fun to try out. I wonder if at some point, given a few minutes of me talking, I could make myself read an entire book to myself.

mohsen1 · 2026-01-22T19:12:20 1769109140

> The requested GPU duration (180s) is larger than the maximum allowed

What am I doing wrong?

gregsadetsky · 2026-01-22T19:19:23 1769109563

you need to login

TheAceOfHearts · 2026-01-22T18:20:26 1769106026

Interesting model, I've managed to get the 0.6B param model running on my old 1080 and I can generate 200-character chunks safely without going OOM, so I thought that making an audiobook of the Tao Te Ching would be a good test. Unfortunately each snippet varies drastically in quality: sometimes the speaker is clear and coherent, but other times it bursts out laughing or moaning. In a way it feels a bit like magical roulette, never being quite certain of what you're going to get. It does have a bit of charm, when you chain the various snippets together you really don't know what direction it's gonna go.

Using speaker Ryan seems to be the most consistent, I tried speaker Eric and it sounded like someone putting on a fake exaggerated Chinese accent to mock speakers.

If it wasn't for the unpredictable level of emotions from each chunk, I'd say this is easily the highest quality TTS model I've tried.
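
For anyone wanting to try the same chunked approach, here is a minimal sketch of splitting a long text into pieces of at most 200 characters, preferring sentence boundaries. The 200-character limit comes from the comment above; the function name and splitting rules are illustrative assumptions, not what was actually used, and this does nothing about the emotional variation between chunks.

```python
import re


def chunk_text(text: str, max_chars: int = 200) -> list[str]:
    """Split text into chunks of at most max_chars, preferring sentence ends."""
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # A single sentence longer than the limit is cut at the last space
        # that fits, or mid-word if there is no space at all.
        while len(sentence) > max_chars:
            cut = sentence.rfind(" ", 0, max_chars + 1)
            if cut <= 0:
                cut = max_chars
            piece, sentence = sentence[:cut], sentence[cut:].lstrip()
            if current:
                chunks.append(current)
                current = ""
            chunks.append(piece)
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be passed to the TTS call one at a time and the resulting audio concatenated in order.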

KaoruAoiShiho · 2026-01-22T18:48:38 1769107718

Have you tried specifying the emotion? There's an option to do so and if it's left empty it wouldn't surprise me if it defaulted to rng instead of bland.

TheAceOfHearts · 2026-01-22T19:35:43 1769110543

For the system prompt I used:

> Read this in a calm, clear, and wise audiobook tone.

> Do not rush. Allow the meaning to sink in.

But maybe I should experiment with something more detailed. Do you have any suggestions?

throwaw12 · 2026-01-22T16:10:01 1769098201

Qwen team, please please please, release something to outperform and surpass the coding abilities of Opus 4.5.

Although I like the model, I don't like the leadership of that company and how closed it is, how divisive they're in terms of politics.

mortsnort · 2026-01-22T16:26:54 1769099214

They were just waiting for someone in the comments to ask!

zeppelin101 · 2026-01-22T21:31:49 1769117509

Someone has to take the first step. Let's be grateful to the brave anon HN poster for stepping up.

mhuffman · 2026-01-22T17:15:08 1769102108

It really is the best way to incentivize politeness!

stuckkeys · 2026-01-22T20:26:34 1769113594

I loled hard at this. Thank you kind stranger.

pseudony · 2026-01-22T17:43:34 1769103814

Same issue (I am Danish).

Have you tested alternatives? I grabbed Open Code and a Minimax m2.1 subscription, even just the 10usd/mo one to test with.

Result? We designed a spec for a slight variation of a tool for which I wrote a spec with Claude - same problem (process supervisor tool), from scratch.

Honestly, it worked great, I have played a little further with generating code (this time golang), again, I am happy.

Beyond that, Glm4.7 should also be great.

See https://dev.to/kilocode/open-weight-models-are-getting-serio...

It is a recent case story of vibing a smaller tool with kilo code, comparing output from minimax m2.1 and Glm4.7

Honestly, just give it a whirl - no need to send money to companies/nations you disagree with.

nunodonato · 2026-01-22T18:14:42 1769105682

I've been using GLM 4.7 with Claude Code. best of both worlds. Canceled my Anthropic subscription due to the US politics as well. Already started my "withdrawal" in Jan 2025, Anthropic was one of the few that was left

stavros · 2026-01-22T20:42:16 1769114536

I much prefer OpenCode these days, give it a try.

nunodonato · 2026-01-22T21:45:11 1769118311

I did, I couldn't get used to it and didn't get results as good. I think Claude Code's tools are really top notch, and maybe the system prompt

bigyabai · 2026-01-22T18:17:59 1769105879

I'm in the same boat. Sonnet was overkill for me, and GLM is cheap and smart enough to spit out boilerplate and FFMPEG commands whenever it's asked.

$20/month is a bit of an insane ask when the most valuable thing Anthropic makes is the free Claude Code CLI.

stavros · 2026-01-22T20:42:43 1769114563

I don't know, I max out my Opus limits regularly. I guess it depends on usage.

TylerLives · 2026-01-22T16:19:05 1769098745

>how divisive they're in terms of politics

What do you mean by this?

throwaw12 · 2026-01-22T16:23:38 1769099018

Dario said not nice words about China and open models in general:

https://www.bloomberg.com/news/articles/2026-01-20/anthropic...

vlovich123 · 2026-01-22T16:48:18 1769100498

I think the least politically divisive issue within the US is concern about China’s growth as it directly threatens the US’s ability to set the world’s agenda. It may be politically divisive if you are aligned with Chinese interests but I don’t see anything politically divisive for a US audience. I expect Chinese CEOs speak in similar terms to a Chinese audience in terms of making sure they’re decoupled from the now unstable US political machine.

subscribed · 2026-01-22T20:14:07 1769112847

Looking at the last year's US agenda I'm okay with that.

cmrdporcupine · 2026-01-22T17:40:23 1769103623

"... for a US audience"

And that's the rub.

Many of us are not.

giancarlostoro · 2026-01-22T17:34:35 1769103275

From the perspective of competing against China in terms of AI the argument against open models makes sense to me. It’s a terrible problem to have really. Ideally we should all be able to work together in the sandbox towards a better tomorrow but that's not reality.

I prefer to have more open models. On the other hand China closes up their open models once they start to show a competitive edge.

Levitz · 2026-01-22T17:31:59 1769103119

I mean, there's no way it's about this right?

Being critical of favorable actions towards a rival country shouldn't be divisive, and if it is, well, I don't think the problem is in the criticism.

Also the link doesn't mention open source? From a google search, he doesn't seem to care much for it.

Balinares · 2026-01-22T17:40:57 1769103657

They're supporters of the Trump administration's military, a view which is not universally lauded.

mohsen1 · 2026-01-22T19:15:50 1769109350

With a good harness I am getting similar results with GLM 4.7. I am paying for TWO! max accounts and my agents are running 24/7.

I still have a small Claude account to do some code reviews. Opus 4.5 does good reviews but at this point GLM 4.7 usually can do the same code reviews.

If cost is an issue (for me it is, I pay out of pocket) go with GLM 4.7

WarmWash · 2026-01-22T16:45:32 1769100332

The Chinese labs distill the SOTA models to boost the performance of theirs. They are a trailer hooked up (with a 3-6 month long chain) to the trucks pushing the technology forwards. I've yet to see a trailer overtake its truck.

China would need an architectural breakthrough to leap American labs given the huge compute disparity.

miklosz · 2026-01-22T17:17:58 1769102278

I have seen indeed a trailer overtake its truck. Not a beautiful view.

digdugdirk · 2026-01-22T18:23:15 1769106195

Agreed. I do think the metaphor still holds though.

A financial jackknifing of the AI industry seems to be one very plausible outcome as these promises/expectations of the AI companies start meeting reality.

overfeed · 2026-01-22T18:10:26 1769105426

Care to explain how the volume of AI research papers authored by Chinese researchers[1] has exceeded US-published ones? Time-traveling plagiarism perhaps, since you believe the US is destined to lead always.

1. Chinese researcher in China, to be more specific.

bfeynman · 2026-01-22T18:30:13 1769106613

Not a great metric, research in academia doesn't necessarily translate to value. In the US they've poached so many academics because of how much value they directly translate to.

WarmWash · 2026-01-22T21:56:21 1769118981

I don't doubt China would be capable of making SOTA models; however, they are very heavily compute constrained. So they are forced to shortcut compute by riding the coattails of compute heavy models.

They need a training-multiplier breakthrough that would allow them to train SOTA models on a fraction of the compute that the US does. And this would also have to be kept a secret and be well hidden (often multiple researchers from around the world put the pieces together on a problem at around the same time, so the breakthrough would have to be something pretty difficult to discover for the greatest minds in the field) to prevent the US from using it to multiply their model strength with their greater compute.

jacquesm · 2026-01-22T18:25:32 1769106332

Volume is easy: they have far more people, it is quality that counts.

overfeed · 2026-01-22T19:45:02 1769111102

Perhaps you should pay attention to where the puck is going to be, rather than where it is currently. Lots of original ideas are coming out of Chinese AI research[1], denying this betrays some level of cope.

1. e.g. select any DeepSeek release, and read the accompanying paper

jacquesm · 2026-01-22T20:30:44 1769113844

I'll pay attention to where the puck is because that is something I can observe, where it is going to be is anybody's guess. Lots of original ideas are coming out of Chinese AI research but there is also lots of junk. I think in the longer term they will have the advantage but right now that simply isn't the case.

Your 'cope' accusation has no place here, I have no dog in the race and do not need to cope with anything.

overfeed · 2026-01-22T22:21:33 1769120493

> Your 'cope' accusation has no place here

I will rephrase my statement and continue to stand by it: "Denying the volume of original AI research being done by China - a falsifiable metric - betrays some level of cope."

You seem to agree on the fact that China has surpassed the US. As for quality, I'll say expertise is a result of execution. At some point in time during off-shoring, the US had qualitatively better machinists than China, despite manufacturing volumes. That is no longer the case today - as they say, cream floats to the top, and that holds true for a pot or an industrial-sized vat.

aaa_aaa · 2026-01-22T17:03:21 1769101401

No, all they need is time. I am awaiting the downfall of the AI hegemony and hype with popcorn at hand.

mhuffman · 2026-01-22T17:16:12 1769102172

I would be happy with an openweight 3 month old Claude

cmrdporcupine · 2026-01-22T17:41:21 1769103681

DeepSeek 3.2 is frankly fairly close to that. GLM 4.7 as well. They're basically around Sonnet 4 level.

amrrs · 2026-01-22T16:14:10 1769098450

Have you tried the new GLM 4.7?

davely · 2026-01-22T17:12:08 1769101928

I've been using GLM 4.7 alongside Opus 4.5 and I can't believe how bad it is. Seriously.

I spent 20 minutes yesterday trying to get GLM 4.7 to understand that a simple modal on a web page (vanilla JS and HTML!) wasn't displaying when a certain button was clicked. I hooked it up to Chrome MCP in Open Code as well.

It constantly told me that it fixed the problem. In frustration, I opened Claude Code and just typed "Why won't the button with ID 'edit' work???!"

It fixed the problem in one shot. This isn't even a hard problem (and I could have just fixed it myself but I guess sunk cost fallacy).

bityard · 2026-01-22T17:29:26 1769102966

I've used a bunch of the SOTA models (via my work's Windsurf subscription) for HTML/CSS/JS stuff over the past few months. Mind you, I am not a web developer, these are just internal and personal projects.

My experience is that all of the models seem to do a decent job of writing a whole application from scratch, up to a certain point of complexity. But as soon as you ask them for non-trivial modifications and bugfixes, they _usually_ go deep into rationalized rabbit holes into nowhere.

I burned through a lot of credits to try them all and Gemini tended to work the best for the things I was doing. But as always, YMMV.

KolmogorovComp · 2026-01-22T17:40:03 1769103603

Exactly the same feedback

girvo · 2026-01-22T21:57:59 1769119079

> I can't believe how bad it is

This has been my consistent experience with every model prior to Opus 4.5, and every single open model I've given a go.

Hopefully we will get there in another 6 months when Opus is distilled into new open models, but I've always been shocked at some of the claims around open models, when I've been entirely unable to replicate them.

Hell, even Opus 4.5 shits the bed with semi-regularity on anything that's not completely greenfield for my usage, once I'm giving it tasks beyond some unseen complexity boundary.

Balinares · 2026-01-22T18:26:40 1769106400

Amazingly, just yesterday, I had Opus 4.5 crap itself extensively on a fairly simple problem -- it was trying to override a column with an aggregation function while also using it in a group-by without referring to the original column by its full qualified name prefixed with the table -- and in typical Claude fashion it assembled an entire abstraction layer to try and hide the problem under, before finally giving up, deleting the column, and smugly informing me I didn't need it anyway.

That evening, for kicks, I brought the problem to GLM 4.7 Flash (Flash!) and it one-shot the right solution.

It's not apples to apples, because when it comes down to it LLMs are statistical token extruders, and it's a lot easier to extrude the likely tokens from an isolated query than from a whole workspace that's already been messed up somewhat by said LLM. That, and data is not the plural of anecdote. But still, I'm easily amused, and this amused me. (I haven't otherwise pushed GLM 4.7 much and I don't have a strong opinion about it.)

But seriously, given the consistent pattern of knitting ever larger carpets to sweep errors under that Claude seems to exhibit over and over instead of identifying and addressing root causes, I'm curious what the codebases of people who use it a lot look like.

throwaw12 · 2026-01-22T16:24:53 1769099093

yes I did, not on par with Opus 4.5.

I use Opus 4.5 for planning; when I reach my usage limits I fall back to GLM 4.7 only for implementing the plan. It still struggles, even though I configure GLM 4.7 as both the smaller model and the heavier model in Claude Code.

Onavo · 2026-01-22T17:17:55 1769102275

Well DeepSeek V4 is rumored to be in that range and will be released in 3 weeks.

sampton · 2026-01-22T16:18:30 1769098710

Every time Dario opens his mouth it's something weird.

genewitch · 2026-01-22T16:29:44 1769099384

it isn't often that technology gives me chills, but this did it. I've used "AI" TTS tools since 2018 or so, and i thought the stuff from two years ago was about the best we were going to get. I don't know the size of these, i scrolled to the samples. I am going to get the models set up somewhere and test them out.

Now, maybe the results were cherrypicked. i know everyone else who has released one of these cherrypicks which to publish. However, this is the first time i've considered it plausible to use AI TTS to remaster old radioplays and the like, where a section of audio is unintelligible but can be deduced from context, like a tape glitch where someone says "HEY [...]LAR!" and it's an episode of Yours Truly, Johnny Dollar...

I have dozens of hours of audio of like Bob Bailey and people of that era.

kamranjon · 2026-01-22T17:08:49 1769101729

I wonder if it was trained on anime dubs cause all of the examples I listened to sounded very similar to a miyazaki style dub.

genewitch · 2026-01-22T19:36:31 1769110591

scroll down to the second to last group, the second one down is obama speaking english, the third one down is trump speaking japanese (a translation of the english phrase)

besides, they know what side their bread is buttered on. I feel like this is almost not the real announcement; or, the engineers that wrote this up and did the demos just ran it that way. The normal speech voices are fine (lower than the anime ones on the page.) i agree that the first few are very infantile. I'll change that word if i can think of a better one.

freedomben · 2026-01-22T18:35:25 1769106925

Indeed, I have a future project/goal of "restoring" Have Gun - Will Travel radio episodes to listenable quality using tech like this. There are so many lines where sound effects and tape rot and other "bad recording" things make it very difficult to understand what was said. Will be amazing, but as with all tech the potential for abuse is very real

genewitch · 2026-01-22T19:40:07 1769110807

hey if you want to collab or trade notes, my email is in my profile. there was java software that did FANTASTIC work cleaning up crappy transfers of audio, like, specifically, it was perfect for "AM Quality Monaural Audio".

  Observe, original: https://www.youtube.com/watch?v=YiRcOVDAryM
  my edit (took about an hour, if memory serves, to set up. forgot render time...): https://www.youtube.com/watch?v=xazubVJ0jz4

i say "was [...] software" because the last 2 times i've tried to use it, it did imperceptible cleanup, making it worthless. Anyhow, all my radio plays are from OTRR, i think.

Audio.Restoration.DeNoise.DeNoiseLF.2.8.3_WiN.OSX is a more recent version i think

p.s. are you a "dude named Ben"?

girvo · 2026-01-22T21:52:24 1769118744

Amusingly one of their examples (the final Age Control example) is prompted to have American English as an accent, but sounds like an Australian trying to sound American to my ear haha

rahimnathwani · 2026-01-22T17:22:46 1769102566

Has anyone successfully run this on a Mac? The installation instructions appear to assume an NVIDIA GPU (CUDA, FlashAttention), and I’m not sure whether it works with PyTorch’s Metal/MPS backend.

magicalhippo · 2026-01-22T18:53:26 1769108006

FWIW you can run the demo without FlashAttention using --no-flash-attn command-line parameter, I do that since I'm on Windows and haven't gotten FlashAttention2 to work.

javier123454321 · 2026-01-22T17:32:43 1769103163

I recommend using modal for renting the metal.

turnsout · 2026-01-22T18:57:54 1769108274

It seems to depend on FlashAttention, so the short answer is no. Hopefully someone does the work of porting the inference code over!

satvikpendem · 2026-01-22T17:28:36 1769102916

This would be great for audiobooks, some of the current AI TTS still struggle.

dangoodmanUT · 2026-01-22T22:14:05 1769120045

Many voices clone better than 11labs, while admittedly lower bitrate

jakobdabo · 2026-01-22T19:30:50 1769110250

Can anyone please provide directions/links to tools that can be run locally, and that take an audio recording of a voice as an input, and produce an output with the same voice saying the same thing with the same intonations, but with a fixed/changed accent?

This is needed for processing an indie game's voice recordings, where the voice actors weren't native speakers and had some accent.

PunchyHamster · 2026-01-22T17:36:37 1769103397

Looking forward to my grandma being scammed by one!

jacquesm · 2026-01-22T18:26:41 1769106401

So far that seems to be the main use case.

bigyabai · 2026-01-22T19:18:51 1769109531

Grandmas should know better, nowadays. It's 2026, half of today's grandparents grew up with QVC and landline psychics.

swaraj · 2026-01-22T20:53:32 1769115212

Tried the voice clone with a 30s trump clip (with reference text), and it didn't sound like him at all.

whinvik · 2026-01-22T18:17:16 1769105836

Haha something that I want to try out. I have started using voice input more and more instead of typing and now I am on my second app and second TTS model, namely Handy and Parakeet V3.

Parakeet is pretty good, but there are times it struggles. Would be interesting to see how Qwen compares once Handy has it in.

woodson · 2026-01-22T22:25:06 1769120706

This is about text to speech, not speech recognition.

Footprint0521 · 2026-01-22T19:27:53 1769110073

Why parakeet over whisper v3 turbo? Just curious as one who heavily uses whisper, I’ve seemed to have better results with that

whinvik · 2026-01-22T20:39:53 1769114393

Parakeet is much smaller and for me the perf/speed combo has just been better.

sails · 2026-01-22T19:56:01 1769111761

Any recommendations for an iOS app to test models like this? There are a few good ones for text gen, and it’s a great way to try models

bigyabai · 2026-01-22T21:05:17 1769115917

Besides UTM, no.

thedangler · 2026-01-22T16:44:26 1769100266

Kind of a noob, how would I implement this locally? How do I pass it audio to process? I'm assuming it's in the API spec?

dust42 · 2026-01-22T16:49:41 1769100581

Scroll down on the Huggingface page, there are code examples and also a link to github: https://huggingface.co/Qwen/Qwen3-TTS-12Hz-0.6B-Base

daliusd · 2026-01-22T20:03:01 1769112181

I wanted to try this locally as well so I have asked AI to write CLI for me: https://github.com/daliusd/qtts

There are some samples. If you have a GPU you might want to fork and improve this; otherwise it is slow but usable on CPU as well.

JonChesterfield · 2026-01-22T18:01:54 1769104914

I see a lot of references to `device_map="cuda:0"` but no cuda in the github repo, is the complete stack flash attention plus this python plus the weights file, or does one need vLLM running as well?

indigodaddy · 2026-01-22T16:12:22 1769098342

How does the cloning compare to pocket TTS?

quinncom · 2026-01-22T21:29:04 1769117344

Pocket TTS is much smaller: 100M parameters versus 600–1800M.

andhuman · 2026-01-22T19:19:49 1769109589

It’s uncannily good. I prefer it to pocket, but then again pocket is much smaller and for realtime streaming.

albertwang · 2026-01-22T15:30:41 1769095841

great news, this looks great! is it just me, or do most of the english audio samples sound like anime voices?

bityard · 2026-01-22T17:42:29 1769103749

Well, if you look at the prompts, they are basically told to sound like that.

And if you ask me, I think these models were trained on tween fiction podcasts. (My kids listen to a lot of these and dramatic over-acting seems to be the industry standard.)

Also, their middle-aged adult with an "American English" accent sounds like no American I've ever met. More like a bad Sean Connery impersonator.

rapind · 2026-01-22T15:57:48 1769097468

> do most of the english audio samples sound like anime voices?

100% I was thinking the same thing.

reactordev · 2026-01-22T16:52:34 1769100754

The real value I see is being able to clone a voice and change timbre and characteristics of the voice to be able to quickly generate voice overs, narrations, voice acting, etc. It's superb!

devttyeu · 2026-01-22T15:47:45 1769096865

Also like some popular youtubers and popular speakers.

pixl97 · 2026-01-22T16:25:53 1769099153

Hmm, wonder where they got their training data from?

thehamkercat · 2026-01-22T16:11:29 1769098289

even the Japanese audio samples sound like anime

htrp · 2026-01-22T16:31:32 1769099492

subbed audio makes better training data (much better than cc data)

ideashower · 2026-01-22T16:49:52 1769100592

Huh. One of the English Voice Clone examples features Obama.

illwrks · 2026-01-22T22:32:56 1769121176

I think the other sounds like Steve Jobs - I could be wrong though!

subscribed · 2026-01-22T20:35:59 1769114159

Distinct, characteristic voice. My first to play with will be Morgan Freeman.

salzig · 2026-01-22T18:14:21 1769105661

So now we're getting every movie in "original voice" but local language? Can't wait to view anime or Bollywood :D

wahnfrieden · 2026-01-22T17:01:46 1769101306

How is it for Japanese?

salzig · 2026-01-22T18:11:39 1769105499

there is a sample clone -> Trump speaks Japanese.

Edit: "Cross-lingual Voice Clone" https://qwen.ai/blog?id=qwen3tts-0115#voice-clone

lostmsu · 2026-01-22T15:58:10 1769097490

I still don't know anyone who managed to get Qwen3-Omni to work properly on a local machine.