Show HN: Cancer diagnosis provides an interesting RL environment for LLMs

11 points · by dchu17 · 7 days ago
Hey HN, this is David from Aluna (YC S24). We work with diagnostic labs to build datasets and evals for oncology tasks.

I wanted to share a simple RL environment I built that gives frontier LLMs a set of tools to zoom and pan across a digitized pathology slide and find the relevant regions to make a diagnosis. Here are some videos of the LLM performing diagnosis on a few slides:

(https://www.youtube.com/watch?v=k7ixTWswT5c): traces of an LLM choosing different regions to view before making a diagnosis on a case of small-cell carcinoma of the lung

(https://youtube.com/watch?v=0cMbqLnKkGU): traces of an LLM choosing different regions to view before making a diagnosis on a case of benign fibroadenoma of the breast

Why I built this:

Pathology slides are the backbone of modern cancer diagnosis. Tissue from a biopsy is sliced, stained, and mounted on glass so a pathologist can examine it for abnormalities.

Today, many of these slides are digitized into whole-slide images (WSIs) in TIF or SVS format, and a single image can run to several gigabytes.

While several pathology-focused AI models already exist, I was curious to test how well frontier LLMs perform on pathology tasks. The main challenge is that WSIs are far too large to fit into an LLM's context window. The standard workaround, splitting them into thousands of smaller tiles, is inefficient for large frontier LLMs.

Inspired by how pathologists zoom and pan under a microscope, I built a set of tools that let the LLM control magnification and coordinates, viewing one small region at a time and deciding where to look next (a rough sketch of what such a tool might look like is at the end of this post).

This led to some interesting behaviors, and with a bit of prompt engineering it actually seemed to yield pretty good results:

- GPT 5: explored up to ~30 regions before deciding (concurred with an expert pathologist on 4 out of 6 cancer subtyping tasks and 3 out of 5 IHC scoring tasks)

- Claude 4.5: typically used 10–15 views but reached accuracy similar to GPT-5 (concurred with the pathologist on 3 out of 6 cancer subtyping tasks and 4 out of 5 IHC scoring tasks)

- Smaller models (GPT 4o, Claude 3.5 Haiku): examined ~8 frames and were less accurate overall (1 out of 6 cancer subtyping tasks and 1 out of 5 IHC scoring tasks)

Obviously this is a small sample set, so we are working on a larger benchmark suite with more cases and task types, but I thought it was cool that this worked at all, so I wanted to share it with HN!
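
P.S. For anyone curious what the zoom/pan tooling could look like, here is a minimal sketch, not the exact implementation from the post. It assumes the WSIs are read with OpenSlide (openslide-python) and that each view is returned to the model as a base64 PNG; the function name, parameters, and magnification handling are all illustrative.

    import base64
    import io

    import openslide  # pip install openslide-python (also needs the OpenSlide C library)

    def view_region(slide_path, x, y, magnification=10.0, size=512):
        """Return a base64 PNG of a size-by-size view of a whole-slide image.

        (x, y) are level-0 pixel coordinates of the top-left corner of the view;
        `magnification` is the requested objective power (e.g. 2.5, 10, 40).
        Names and defaults here are assumptions, not the post's actual API.
        """
        slide = openslide.OpenSlide(slide_path)

        # Scanners usually record the native objective power (often 20x or 40x)
        # in the slide metadata; fall back to 40x if the property is missing.
        base_mag = float(slide.properties.get(
            openslide.PROPERTY_NAME_OBJECTIVE_POWER, 40))

        # Pick the pyramid level whose downsample best matches the requested
        # magnification, then read a size x size patch at that level.
        level = slide.get_best_level_for_downsample(base_mag / magnification)
        region = slide.read_region((x, y), level, (size, size)).convert("RGB")

        buf = io.BytesIO()
        region.save(buf, format="PNG")
        return base64.b64encode(buf.getvalue()).decode("ascii")

The environment then just exposes something like this as a tool call in a loop: the model picks coordinates and a magnification each turn, gets the rendered view back, and either requests another region or commits to a diagnosis.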