Show HN: LemonSlice – Give your voice assistants a face
Hey HN, we're the co-founders of LemonSlice (<a href="https://lemonslice.com">https://lemonslice.com</a>). We train interactive avatar video models. Our API lets you upload a photo and immediately jump into a FaceTime-style call with that character. Here's a demo: <a href="https://www.loom.com/share/941577113141418e80d2834c83a5a0a9" rel="nofollow">https://www.loom.com/share/941577113141418e80d2834c83a5a0a9</a>
<p>Chatbots are everywhere. Voice AI has recently taken off. But we believe video avatars will be the most common form factor for conversational AI. Most people would rather watch something than read it. The problem is that generating video in real-time is hard, and overcoming the uncanny valley is even harder.
<p>We haven’t broken the uncanny valley yet. Nobody has. But we’re getting close and our photorealistic avatars are currently best-in-class (judge for yourself: <a href="https://lemonslice.com/try/taylor">https://lemonslice.com/try/taylor</a>). Plus, we're the only avatar model that can do animals and heavily stylized cartoons. Try it: <a href="https://lemonslice.com/try/alien">https://lemonslice.com/try/alien</a>. Warning! Talking to this little guy may improve your mood.
<p>Today we're releasing our new model* - Lemon Slice 2, a 20B-parameter diffusion transformer that generates infinite-length video at 20fps on a single GPU - and opening up our API.
<p>How did we get a video diffusion model to run in real-time? There was no single trick, just a lot of them stacked together. The first big change was making our model causal. Standard video diffusion models are bidirectional (they look at frames both before and after the current one), which means you can't stream.
<p>From there it was about fitting everything on one GPU. We switched from full to sliding window attention, which killed our memory bottleneck.
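To give a rough picture of those two changes, here is a toy numpy sketch of a causal sliding-window attention mask (a simplified illustration, not our production code): each frame can attend only to itself and the few frames before it, so generation can stream, and memory no longer grows with video length.

```python
import numpy as np

def causal_sliding_window_mask(n_frames: int, window: int) -> np.ndarray:
    """True where attention is allowed: frame i may attend to frames
    [i - window + 1, i] -- causal (no future frames, so we can stream)
    and bounded (the attended span stays fixed as the video grows)."""
    i = np.arange(n_frames)[:, None]  # query frame index
    j = np.arange(n_frames)[None, :]  # key frame index
    return (j <= i) & (j > i - window)

# Row i shows which frames frame i can see (window of 3):
print(causal_sliding_window_mask(5, 3).astype(int))
```

A bidirectional model would instead have an all-ones mask, making every frame depend on future frames, which is what rules out streaming.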
We distilled from 40 denoising steps down to just a few - quality degraded less than we feared, especially after using GAN-based distillation (though tuning that adversarial loss to avoid mode collapse was its own adventure).
<p>And the rest was inference work: modifying RoPE from complex to real (this one was cool!), precision tuning, fusing kernels, a special rolling KV cache, lots of other caching, and more. We kept shaving off milliseconds wherever we could and eventually got to real-time.
<p>We set up a guest playground for HN so you can create and talk to characters without logging in: www.lemonslice.com/hn. For those who want to build with our API (we have a new LiveKit integration that we’re pumped about!), grab a coupon code in the HN playground for your first Pro month free ($100 value). See the docs: <a href="https://lemonslice.com/docs">https://lemonslice.com/docs</a>. Pricing is usage-based at $0.12-0.20/min for video generation.
<p>Looking forward to your feedback! And we’d love to see any cool characters you make - please share their links in the comments.
<p>*We did a Show HN last year for our V1 model: <a href="https://news.ycombinator.com/item?id=43785044">https://news.ycombinator.com/item?id=43785044</a>. It was technically impressive but so bad compared to what we have today.
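For the curious, the complex-to-real RoPE rewrite mentioned above changes nothing mathematically: multiplying each feature pair by e^{iθ} is the same rotation as a 2×2 real rotation matrix, and the usual motivation for the real form is that it is easier to fuse into kernels and run in reduced precision. A toy numpy sketch of the equivalence (illustrative only, not our production code):

```python
import numpy as np

def rope_complex(x: np.ndarray, theta: np.ndarray) -> np.ndarray:
    # Treat consecutive feature pairs as complex numbers and rotate
    # by multiplying with e^{i*theta}.
    z = x[..., 0::2] + 1j * x[..., 1::2]
    z = z * np.exp(1j * theta)
    out = np.empty_like(x)
    out[..., 0::2] = z.real
    out[..., 1::2] = z.imag
    return out

def rope_real(x: np.ndarray, theta: np.ndarray) -> np.ndarray:
    # The same rotation in purely real arithmetic: a 2x2 rotation
    # applied to each (even, odd) feature pair.
    cos, sin = np.cos(theta), np.sin(theta)
    x0, x1 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x0 * cos - x1 * sin
    out[..., 1::2] = x0 * sin + x1 * cos
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))      # 4 positions, head_dim 8
theta = rng.standard_normal((4, 4))  # one angle per feature pair
assert np.allclose(rope_complex(x, theta), rope_real(x, theta))
```

Both functions produce identical outputs; only the arithmetic representation differs.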