Trained mRNA language models across 25 species for $165.

24 points · by maziyar · 9 days ago
We built an end-to-end protein AI pipeline covering structure prediction, sequence design, and codon optimization. After comparing multiple transformer architectures for codon-level language modeling, CodonRoBERTa-large-v2 emerged as the clear winner with a perplexity of 4.10 and a Spearman CAI correlation of 0.40, significantly outperforming ModernBERT. We then scaled to 25 species, trained 4 production models in 55 GPU-hours, and built a species-conditioned system that no other open-source project offers. Complete results, architectural decisions, and runnable code below.
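The Spearman CAI correlation quoted above compares model scores against the Codon Adaptation Index, a standard metric: the geometric mean, over a coding sequence, of each codon's relative adaptiveness w (its usage frequency divided by that of the most-used synonymous codon). A minimal sketch of that calculation, using made-up frequencies for two amino-acid families rather than the post's actual 25-species usage tables:

```python
import math
from collections import defaultdict

# Illustrative codon frequencies for Ala and Glu (made-up numbers;
# real values come from a species-specific codon usage table).
CODON_FREQ = {
    "GCT": 0.26, "GCC": 0.40, "GCA": 0.23, "GCG": 0.11,  # Ala
    "GAA": 0.58, "GAG": 0.42,                            # Glu
}
CODON_TO_AA = {
    "GCT": "A", "GCC": "A", "GCA": "A", "GCG": "A",
    "GAA": "E", "GAG": "E",
}

def relative_adaptiveness(freq, codon_to_aa):
    """w(codon) = freq(codon) / freq(most-used synonymous codon)."""
    best = defaultdict(float)
    for codon, f in freq.items():
        best[codon_to_aa[codon]] = max(best[codon_to_aa[codon]], f)
    return {c: f / best[codon_to_aa[c]] for c, f in freq.items()}

def cai(codons, w):
    """Codon Adaptation Index: geometric mean of w over the sequence."""
    logs = [math.log(w[c]) for c in codons]
    return math.exp(sum(logs) / len(logs))

W = relative_adaptiveness(CODON_FREQ, CODON_TO_AA)
print(cai(["GCC", "GAG", "GCA"], W))  # ~0.75 for this toy sequence
```

A model's per-sequence scores can then be correlated (Spearman) against CAI computed this way; the 0.40 figure above presumably comes from such a comparison, though the post does not show the evaluation code.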