As Snowden told us, the video and audio recording capabilities of your devices are NSA spying vectors. OSS/Linux is a safeguard against such capabilities. The massive datacenter investments in the US will be used to classify us all into a patriotic (for Israel)/Oligarchist social credit score; every mega tech company can increase its profits through NSA cooperation, and all of them are legally obligated to comply with government orders.

Speech-to-text and speech automation are useful tech, but always-listening devices in the hands of state-sponsored terrorists are a path, beyond targeted NSA spying, to sweeping future social credit classification of your past life.

Some small LLMs that can be used for speech to text: https://modal.com/blog/open-source-stt
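
As a rough illustration of how little glue is needed, here is a minimal local transcription sketch with faster-whisper (an assumed example engine, not necessarily one from the linked list; the model size, device, and audio path are placeholders):

```python
# Minimal local speech-to-text sketch using faster-whisper, one of many
# open-source options. Model size, device, and audio file are placeholders.
from faster_whisper import WhisperModel

model = WhisperModel("small", device="cpu", compute_type="int8")

segments, info = model.transcribe("recording.wav")
print(f"Detected language: {info.language}")
for segment in segments:
    print(f"[{segment.start:.1f}s -> {segment.end:.1f}s] {segment.text}")
```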

  • brucethemoose@lemmy.world · 20 hours ago

    I mean, there are many. TTS and self-hosted automation are huge in the local LLM scene.

    We even have open source “omni” models now, which can ingest and output speech tokens directly (which means they get more semantic understanding from tone and such, they ‘choose’ the tone to reply with, and output is streamable word-by-word). They support all sorts of tool calling.

    …But they aren’t easy to run. It’s still in the realm of homelabs with at least an RTX 3060 + hacky python projects.
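
    For a flavor of what the “hacky python projects” part looks like: most local servers (llama.cpp’s server, vLLM, etc.) expose an OpenAI-compatible endpoint, so a streaming chat call is roughly the sketch below. The base URL, API key, and model name are placeholders for whatever you actually run.

    ```python
    # Sketch: talking to a locally hosted omni/chat model through an
    # OpenAI-compatible server (llama.cpp server, vLLM, etc.).
    # base_url, api_key, and model name are placeholders.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

    stream = client.chat.completions.create(
        model="local-omni-model",   # whatever name the server registered
        messages=[{"role": "user", "content": "Turn off the living room lights."}],
        stream=True,                # word-by-word streaming
        # tools=[...]               # tool definitions can be passed here too
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)
    ```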


    If you’re mad, you can self-host Longcat Omni

    https://huggingface.co/meituan-longcat/LongCat-Flash-Omni

    And blow Alexa out of the water with a MIT-licensed model from, I kid you not, a Chinese food delivery company.


    EDIT

    For the curious, see:

    Audio-text-to-text (and sometimes TTS): https://huggingface.co/models?pipeline_tag=audio-text-to-text&num_parameters=min%3A6B&sort=modified

    TTS: https://huggingface.co/models?pipeline_tag=text-to-speech&num_parameters=min%3A6B&sort=modified

    “Anything-to-anything,” generally image/video/audio/text -> text/speech: https://huggingface.co/models?pipeline_tag=any-to-any&num_parameters=min%3A6B&sort=modified

    Bigger than 6B to exclude toy/test models.
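
    The same searches can be done programmatically through the Hub API, roughly as below (the exact filter kwargs depend on your huggingface_hub version, and the >6B cut from the links above isn’t applied here):

    ```python
    # Rough programmatic version of the model searches linked above.
    from huggingface_hub import list_models

    for m in list_models(pipeline_tag="audio-text-to-text", limit=20):
        print(m.id)
    ```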

    • fonix232@fedia.io · 18 hours ago

      I do wish there was a smaller LongCat model available. My current AI node has a hard 16GB VRAM limit (yay AMD UMA limitations), so a 27B model can’t really fit. An 8B dynamically loaded model would fit and run much better.

      • brucethemoose@lemmy.world · 5 hours ago

        You can do hybrid inference of Qwen 30B omni for sure. Or fully offload inference of Vibevoice Large (9B). Or really a huge array of models.

        …The limiting factor is free time, TBH. Just sifting through the sea of models, seeing if they work at all, testing whether quantization works and such is a huge timesink, especially if you are trying to load stuff with ROCm.
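
        As a concrete example of the hybrid-inference approach, here is a llama-cpp-python sketch that offloads only as many layers as fit in a 16GB VRAM budget and keeps the rest in system RAM (the GGUF filename, layer count, and context size are placeholder guesses, not a tested config):

        ```python
        # Hybrid CPU/GPU inference sketch with llama-cpp-python: offload part of
        # the model to VRAM, keep the rest on system RAM.
        from llama_cpp import Llama

        llm = Llama(
            model_path="qwen-omni-30b-q4_k_m.gguf",  # hypothetical quantized GGUF
            n_gpu_layers=24,   # tune down until it fits in VRAM; -1 offloads everything
            n_ctx=8192,
        )

        out = llm.create_chat_completion(
            messages=[{"role": "user", "content": "Summarize today's reminders."}]
        )
        print(out["choices"][0]["message"]["content"])
        ```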

        • fonix232@fedia.io · 5 hours ago

          And I am on ROCm - specifically on an 8945HS, which is advertised as a Ryzen AI APU yet is completely unsupported as a target, with major issues around queuing and more complex models (the new 7.0 betas have been promising, but TheRock’s flip-flopping with their Docker images has been making me go crazy…).

          • brucethemoose@lemmy.world · 4 hours ago

            Ah. On an 8000-series APU, to be blunt, you’re likely better off with Vulkan + whatever omni models GGML supports these days. Last I checked, text generation is faster and prompt processing is close to ROCm.

            …And yeah, that was total misadvertisement on AMD’s part. They’ve completely diluted the term, kinda like TV makers did with ‘HDR’.

            • fonix232@fedia.io · 4 hours ago

              The thing is, if AMD actually added proper support for it, given it has a somewhat powerful NPU as well… For the total TDP of the package it’s still one of the best perf-per-watt APUs; the damn software support just isn’t there.

              Feckin AMD.

              • brucethemoose@lemmy.world · 4 hours ago

                The IGP is more powerful than the NPU on these things anyway. The NPU is more for ‘background’ tasks, like Teams audio processing or whatever it’s used for on Windows.

                Yeah, in hindsight, AMD should have tasked (and still should task) a few engineers on popular projects (and pushed NPU support harder), but GGML support is good these days. It’s gonna be pretty close to RAM speed-bound for text generation.
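
                To put rough numbers on “RAM speed-bound”: each generated token has to read roughly all of the active weights from memory, so tokens/sec is bounded by bandwidth divided by model size. The figures below are illustrative assumptions, not measurements.

                ```python
                # Back-of-envelope bound for token generation on a bandwidth-limited APU.
                bandwidth_gb_s = 89.6   # e.g. dual-channel DDR5-5600, theoretical peak
                model_gb = 4.5          # ~8B model at 4-bit quantization

                tokens_per_s = bandwidth_gb_s / model_gb
                print(f"~{tokens_per_s:.0f} tok/s upper bound")  # ~20 tok/s, less in practice
                ```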

                • fonix232@fedia.io · 3 hours ago

                  Aye, I was actually hoping to use the NPU for TTS/STT while keeping the LLM systems GPU bound.

                  • brucethemoose@lemmy.world · 3 hours ago

                    It still uses memory bandwidth, unfortunately. There’s no way around that, though NPU TTS would still be neat.

                    …Also, generally, STT responses can’t be streamed, so you might as well use the iGPU anyway. TTS can be chunked, I guess, but do the major implementations do that?
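
                    The chunking idea itself is simple enough to sketch: buffer the reply text, flush complete sentences to the TTS engine as soon as they exist, and start playback before the full response is done. pyttsx3 below is only a stand-in engine; swap in whatever TTS you actually self-host.

                    ```python
                    # Sketch of "chunked" TTS: speak sentence-sized pieces as the reply
                    # text arrives instead of waiting for the whole response.
                    import re
                    import pyttsx3

                    engine = pyttsx3.init()

                    def speak_streaming(text_chunks):
                        """text_chunks: an iterable yielding pieces of the reply as they arrive."""
                        buffer = ""
                        for chunk in text_chunks:
                            buffer += chunk
                            # Flush complete sentences as soon as they exist.
                            *done, buffer = re.split(r"(?<=[.!?])\s+", buffer)
                            for sentence in done:
                                engine.say(sentence)
                                engine.runAndWait()
                        if buffer.strip():
                            engine.say(buffer)
                            engine.runAndWait()

                    speak_streaming(["The lights are ", "off now. Anything ", "else?"])
                    ```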