Trained mRNA language models across 25 species for $165.

24 points · by maziyar · 9 days ago
We built an end-to-end protein AI pipeline covering structure prediction, sequence design, and codon optimization. After comparing multiple transformer architectures for codon-level language modeling, CodonRoBERTa-large-v2 emerged as the clear winner with a perplexity of 4.10 and a Spearman CAI correlation of 0.40, significantly outperforming ModernBERT. We then scaled to 25 species, trained 4 production models in 55 GPU-hours, and built a species-conditioned system that no other open-source project offers. Complete results, architectural decisions, and runnable code below.
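The Spearman CAI correlation quoted above compares model scores against the Codon Adaptation Index, a standard metric: the geometric mean, over a coding sequence, of each codon's relative adaptiveness w (its usage frequency divided by that of the most-used synonymous codon). A minimal sketch of that calculation, using made-up frequencies for two amino-acid families rather than the post's actual 25-species usage tables:

```python
import math
from collections import defaultdict

# Illustrative codon frequencies for Ala and Glu (made-up numbers;
# real values come from a species-specific codon usage table).
CODON_FREQ = {
    "GCT": 0.26, "GCC": 0.40, "GCA": 0.23, "GCG": 0.11,  # Ala
    "GAA": 0.58, "GAG": 0.42,                            # Glu
}
CODON_TO_AA = {
    "GCT": "A", "GCC": "A", "GCA": "A", "GCG": "A",
    "GAA": "E", "GAG": "E",
}

def relative_adaptiveness(freq, codon_to_aa):
    """w(codon) = freq(codon) / freq(most-used synonymous codon)."""
    best = defaultdict(float)
    for codon, f in freq.items():
        best[codon_to_aa[codon]] = max(best[codon_to_aa[codon]], f)
    return {c: f / best[codon_to_aa[c]] for c, f in freq.items()}

def cai(codons, w):
    """Codon Adaptation Index: geometric mean of w over the sequence."""
    logs = [math.log(w[c]) for c in codons]
    return math.exp(sum(logs) / len(logs))

W = relative_adaptiveness(CODON_FREQ, CODON_TO_AA)
print(cai(["GCC", "GAG", "GCA"], W))  # ~0.75 for this toy sequence
```

A model's per-sequence scores can then be correlated (Spearman) against CAI computed this way; the 0.40 figure above presumably comes from such a comparison, though the post does not show the evaluation code.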