I accidentally made probabilistic programming 30-200x faster.

I'm a web dev contractor who stumbled onto GPU-native probabilistic programming while working on an unrelated hobby project.

By "GPU-native" I mean the entire inference algorithm runs inside GPU kernels with no CPU coordination - no Python overhead, no kernel launch latency between steps.
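Since I'm not sharing the implementation yet, here's a minimal sketch (written for this post, not my actual code) of what "the whole loop in one kernel" means: random-walk Metropolis with one chain per thread, on an assumed standard-normal target, with the step size, chain count, and kernel name all picked arbitrarily for illustration. The point is the single launch - the host never intervenes between steps.

    #include <cuda_runtime.h>
    #include <curand_kernel.h>

    // Log-density of the (assumed) target: a standard normal.
    __device__ float log_target(float x) { return -0.5f * x * x; }

    // One chain per thread; the entire Metropolis loop lives inside
    // this single kernel, so there are no per-step launches.
    __global__ void metropolis_fused(float *samples, int n_steps,
                                     int n_chains, unsigned long long seed) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid >= n_chains) return;

        curandState rng;
        curand_init(seed, tid, 0, &rng);

        float x  = curand_normal(&rng);   // arbitrary initialization
        float lp = log_target(x);

        for (int t = 0; t < n_steps; ++t) {          // no host round-trips
            float prop    = x + 0.5f * curand_normal(&rng);  // step size assumed
            float lp_prop = log_target(prop);
            if (logf(curand_uniform(&rng)) < lp_prop - lp) {
                x  = prop;                           // accept proposal
                lp = lp_prop;
            }
            samples[t * n_chains + tid] = x;         // coalesced store of draw t
        }
    }

    int main() {
        const int n_chains = 4096, n_steps = 1000;
        float *samples;
        cudaMalloc(&samples, sizeof(float) * n_chains * n_steps);
        // One launch for the whole run, vs. one launch per step.
        metropolis_fused<<<(n_chains + 255) / 256, 256>>>(samples, n_steps,
                                                          n_chains, 42ull);
        cudaDeviceSynchronize();
        cudaFree(samples);
        return 0;
    }

In the conventional style, each of those loop iterations would be its own kernel launch plus a host synchronization; at microseconds per launch, that overhead dominates for small models, which is the latency I'm avoiding.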
I benchmarked against NumPyro, JAX, and GPyTorch on 15 different inference algorithms. I don't have a statistics background, so I made sure to track the quality metrics that experts care about.

My R-hat values are 0.9999-1.0003 (should be ~1.0), and ESS/second is up to 600x better on HMC. Some quality metrics favor the baseline implementations - I'm not claiming this beats everything on every dimension, just that it's significantly faster with comparable quality.
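For anyone else without a stats background: R-hat (Gelman-Rubin) compares between-chain and within-chain variance, and approaches 1 as the chains converge to the same distribution. In textbook form, for M chains of N draws each, with chain means \bar\theta_m, grand mean \bar\theta, and per-chain sample variances s_m^2:

    W = \frac{1}{M} \sum_{m=1}^{M} s_m^2,
    \qquad
    B = \frac{N}{M-1} \sum_{m=1}^{M} (\bar\theta_m - \bar\theta)^2,
    \qquad
    \hat{R} = \sqrt{\frac{\tfrac{N-1}{N} W + \tfrac{1}{N} B}{W}}

    \mathrm{ESS} = \frac{MN}{1 + 2 \sum_{t=1}^{\infty} \rho_t}

where \rho_t is the lag-t autocorrelation; ESS/second is just ESS divided by wall-clock sampling time. Note these are the classic definitions - the libraries I benchmarked against may use refinements like split-R-hat with rank normalization, so the exact estimators can differ.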
Tested on an RTX 4060 Laptop GPU.

Full benchmark results:
https://github.com/Aeowulf/nativeppl-results

Not sharing implementation details yet as I'm still figuring out what to make of this discovery. But I'd appreciate feedback on:

- Are these benchmarks meaningful/fair?
- What other algorithms or problem sizes should I test?
- Is there a market for faster probabilistic inference?