Ask HN: What's the best local/open speech-to-speech setup right now?
I’m trying to do the “voice assistant” thing fully locally: mic → model → speaker, low latency, ideally streaming + interruptible (barge-in).

Qwen3 Omni looks perfect on paper (“real-time”, speech-to-speech, etc.). But I’ve been poking around and I can’t find a single reproducible “here’s how I got the open weights doing real speech-to-speech locally” writeup. Lots of “speech in → text out” or “audio out after the model finishes”, but not a usable realtime voice loop. Feels like either (a) the tooling isn’t there yet, or (b) I’m missing the secret sauce.

What are people actually using in 2026 if they want open + local voice?

Is anyone doing true end-to-end speech models locally (streaming audio out), or is the SOTA still “streaming ASR + LLM + streaming TTS” glued together (rough sketch of what I mean at the bottom)?

If you did get Qwen3 Omni speech-to-speech working: what stack (transformers / vLLM-omni / something else), what hardware, and is it actually realtime?

What’s the most “works today” combo on a single GPU?

Bonus: what rough numbers are people seeing for mic → first audio back?

Would love pointers to repos, configs, or “this is the one that finally worked for me” war stories.
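To make “glued together” concrete, here’s the shape of the naive cascade: a minimal, non-streaming sketch. faster-whisper, a llama.cpp server on localhost:8080, and the piper CLI are stand-ins I picked for illustration, not a recommendation; swap in whatever you actually run.

    # Naive cascade: record -> ASR -> LLM -> TTS -> play. Nothing here
    # streams; every stage runs to completion before the next starts.
    import subprocess
    import requests
    import sounddevice as sd
    import soundfile as sf
    from faster_whisper import WhisperModel

    SR = 16000
    asr = WhisperModel("small", device="cuda", compute_type="float16")

    def record_utterance(seconds=5.0):
        # Fixed-window capture; a real loop needs VAD-gated endpointing.
        audio = sd.rec(int(seconds * SR), samplerate=SR, channels=1,
                       dtype="float32")
        sd.wait()
        return audio[:, 0]

    def transcribe(audio):
        segments, _ = asr.transcribe(audio, language="en")
        return " ".join(seg.text for seg in segments).strip()

    def think(prompt):
        # Any OpenAI-compatible local server works here (llama.cpp, vLLM, ...).
        resp = requests.post(
            "http://localhost:8080/v1/chat/completions",
            json={"messages": [{"role": "user", "content": prompt}]},
            timeout=60,
        )
        return resp.json()["choices"][0]["message"]["content"]

    def speak(text):
        # piper reads text on stdin and writes a wav; the voice model name
        # is just an example.
        subprocess.run(
            ["piper", "--model", "en_US-lessac-medium.onnx",
             "--output_file", "reply.wav"],
            input=text.encode(), check=True,
        )
        wav, sr = sf.read("reply.wav", dtype="float32")
        sd.play(wav, sr)
        sd.wait()

    while True:
        text = transcribe(record_utterance())
        if text:
            speak(think(text))

That loop is exactly what I don’t want: mic-to-first-audio latency is the sum of all four stages, and there’s no way to barge in while it’s talking. What I’m asking is what people use to turn each of those arrows into a stream.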