Thoughts on Mastra's "state-of-the-art" memory

Author: manthangupta109 · 3 months ago · original post
I went through both the blog and the code for Observational Memory. It's a really interesting direction, and I appreciate the transparency in sharing implementation details. I do have a few thoughts on the SOTA memory claim and the broader framing.

From what I can see:

1. The implementation appears heavily tuned toward performing well on LongMemEval. That's a useful signal, but it doesn't necessarily translate to robust long-term memory behavior in production environments.

2. It feels closer to context compression/context management than a durable long-term agent memory system. This will perform really well for a single long-running task.

3. Both the Observer and Reflector rewrite memory in compressed form. That's helpful for token control, but compression is inherently lossy and can drop smaller details that might become important later.

4. The Reflector seems to validate success primarily via token thresholds, rather than checking whether the rewritten memory remains semantically faithful to the original. Over time, this could allow memory drift.

5. The Observer prompt may introduce assumptions (e.g., inferring that a planned action happened if enough time has passed), which risks creating incorrect memories.

6. The design appears to emphasize recency when rewriting observations. While that keeps context fresh, it may bias the system toward recent information and gradually compress away older but still important details. Durable memory systems usually need mechanisms to preserve salient long-term facts, not just recent activity.

7. The full observations block is repeatedly injected into context. This may increase token cost and introduce irrelevant noise depending on the task.

8. There appears to be limited grounding back to raw message evidence at response time, which makes it harder to detect and correct incorrect compressed memories.

9. Finally, I think we should be cautious about claiming "SOTA" based on performance on a single benchmark. LongMemEval results may demonstrate strong performance on that setup, but production workloads are much messier. Robustness, drift, grounding, and cost behavior typically show up only under sustained real-world usage.

Overall, this looks like strong benchmark-oriented context handling. I am just less convinced that it yet qualifies as a robust, general-purpose long-term memory system. Curious how the team is thinking about these trade-offs beyond benchmark performance.
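To make point 4 concrete: a token-budget check and a faithfulness check answer different questions. The sketch below is not Mastra's actual code; the function names are hypothetical, and the word-overlap metric is a deliberately naive stand-in for a real faithfulness check (embedding similarity or an LLM judge). It just shows how a rewrite can pass the budget test while silently dropping content.

```python
def token_count(text: str) -> int:
    # Rough proxy: whitespace tokenization stands in for a real tokenizer.
    return len(text.split())

def within_budget(rewritten: str, budget: int = 512) -> bool:
    # The token-threshold check described above: cheap, but it says
    # nothing about whether meaning was preserved.
    return token_count(rewritten) <= budget

def content_overlap(original: str, rewritten: str) -> float:
    # Naive faithfulness proxy: fraction of the original's content words
    # (length > 3, punctuation stripped) that survive the rewrite.
    orig = {w.lower().strip(".,;:!?") for w in original.split() if len(w) > 3}
    kept = {w.lower().strip(".,;:!?") for w in rewritten.split() if len(w) > 3}
    return len(orig & kept) / max(len(orig), 1)

def validate_rewrite(original: str, rewritten: str,
                     budget: int = 512, min_overlap: float = 0.6) -> bool:
    # Require BOTH: the rewrite fits the budget AND preserves a floor
    # of the original's content.
    return (within_budget(rewritten, budget)
            and content_overlap(original, rewritten) >= min_overlap)

# Example: both rewrites pass the budget check, but only one is faithful.
original = "User confirmed the meeting moved to Friday and prefers email reminders."
faithful = "Meeting moved to Friday; user prefers email reminders."
lossy = "User likes reminders."
```

Here `within_budget(lossy)` is true even though the lossy rewrite has dropped the meeting change entirely, which is exactly the drift risk a budget-only validator cannot catch.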