Ask HN: What's the best local/open speech-to-speech setup right now?
I’m trying to do the “voice assistant” thing fully locally: mic → model → speaker, low latency, ideally streaming + interruptible (barge-in).

Qwen3 Omni looks perfect on paper (“real-time”, speech-to-speech, etc.). But I’ve been poking around and I can’t find a single reproducible “here’s how I got the open weights doing real speech-to-speech locally” writeup. Lots of “speech in → text out” or “audio out after the model finishes”, but not a usable realtime voice loop. Feels like either (a) the tooling isn’t there yet, or (b) I’m missing the secret sauce.

What are people actually using in 2026 if they want open + local voice?

Is anyone doing true end-to-end speech models locally (streaming audio out), or is the SOTA still “streaming ASR + LLM + streaming TTS” glued together (rough sketch of what I mean at the bottom)?

If you did get Qwen3 Omni speech-to-speech working: what stack (transformers / vLLM-omni / something else), what hardware, and is it actually realtime?

What’s the most “works today” combo on a single GPU?

Bonus: what rough numbers are people seeing for mic → first audio back?

Would love pointers to repos, configs, or “this is the one that finally worked for me” war stories.
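To make “glued together” concrete, here’s the shape of the naive cascade: a minimal, non-streaming sketch. faster-whisper, a llama.cpp server on localhost:8080, and the piper CLI are stand-ins I picked for illustration, not a recommendation; swap in whatever you actually run.

    # Naive cascade: record -> ASR -> LLM -> TTS -> play. Nothing here
    # streams; every stage runs to completion before the next starts.
    import subprocess
    import requests
    import sounddevice as sd
    import soundfile as sf
    from faster_whisper import WhisperModel

    SR = 16000
    asr = WhisperModel("small", device="cuda", compute_type="float16")

    def record_utterance(seconds=5.0):
        # Fixed-window capture; a real loop needs VAD-gated endpointing.
        audio = sd.rec(int(seconds * SR), samplerate=SR, channels=1,
                       dtype="float32")
        sd.wait()
        return audio[:, 0]

    def transcribe(audio):
        segments, _ = asr.transcribe(audio, language="en")
        return " ".join(seg.text for seg in segments).strip()

    def think(prompt):
        # Any OpenAI-compatible local server works here (llama.cpp, vLLM, ...).
        resp = requests.post(
            "http://localhost:8080/v1/chat/completions",
            json={"messages": [{"role": "user", "content": prompt}]},
            timeout=60,
        )
        return resp.json()["choices"][0]["message"]["content"]

    def speak(text):
        # piper reads text on stdin and writes a wav; the voice model name
        # is just an example.
        subprocess.run(
            ["piper", "--model", "en_US-lessac-medium.onnx",
             "--output_file", "reply.wav"],
            input=text.encode(), check=True,
        )
        wav, sr = sf.read("reply.wav", dtype="float32")
        sd.play(wav, sr)
        sd.wait()

    while True:
        text = transcribe(record_utterance())
        if text:
            speak(think(text))

That loop is exactly what I don’t want: mic-to-first-audio latency is the sum of all four stages, and there’s no way to barge in while it’s talking. What I’m asking is what people use to turn each of those arrows into a stream.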