Ask HN: How do you search your personal data?

1 point by escapecharacter, 1 day ago
I have personal notes, correspondence, code and documentation from nearly 20 years of work. These are spread across multiple (cloud) services, and searching across these fiefdoms has been impractical.

The problem goes like: "Ah, I remember having a conversation with someone about [algorithm], then recording an important insight. Let's find that."

This isn't a problem solved by an LLM. The blocker is that there isn't a way to run search code on all this plain text.

Services:
* Email (Gmail, synced to my macOS disk with Apple Mail)
* Dropbox
* Notion
* Google Drive
* Obsidian
* GitHub
* Apple Notes
* Discord chats
* Trello
* My own blog

If I had everything synced to my Mac's disk, maybe I could do a plaintext search there. However, Spotlight's indexing is always incomplete and misses obvious files. My Dropbox is so large I don't sync it all locally.

Some services I no longer use, like Evernote. When I archived this service, I exported everything and moved it into my Dropbox. So, if I search Dropbox, it also searches old notes from Evernote. There's no way I could be doing this for all services I actively use.

The way I search now is I guess the service the result is most likely in, and search there. When finding no results, I search the next most likely service, ad nauseam.

For my own blog, I used to use Google's site search, but I recently discovered this was incomplete: https://bsky.app/profile/dustinfreeman.bsky.social/post/3m5l5tto6pk27

I could imagine a solution where there's some 3rd party service that has access keys to all my services. But, let's be real, that's a huge amount of trust. Also, my access to all these services is 2FA'd with expiry, and so I'd be continually re-upping auth to this third party service. At that point, it makes sense to just do search how I do it now.
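
To make the "sync everything to disk and run a plaintext search" idea concrete, here is a minimal sketch in Python. It assumes each service has already been exported or synced into some local folder, which is the hard part and is not solved here. The function name `search_exports` and the suffix list are hypothetical choices for illustration (`.enex` is Evernote's export format and `.emlx` is what Apple Mail stores on disk; the rest are guesses at common export formats). It walks the files directly instead of relying on Spotlight's index.

```python
import os
import sys
from pathlib import Path

# Assumed set of text-like export formats; adjust to whatever your exports contain.
TEXT_SUFFIXES = {".md", ".txt", ".html", ".json", ".csv", ".enex", ".emlx"}


def search_exports(roots, terms, suffixes=TEXT_SUFFIXES):
    """Yield (path, line_number, line) for each line containing every term.

    Matching is case-insensitive substring matching, line by line.
    """
    needles = [t.lower() for t in terms]
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = Path(dirpath) / name
                if path.suffix.lower() not in suffixes:
                    continue
                try:
                    # errors="replace" keeps going on files that are not valid UTF-8.
                    with open(path, encoding="utf-8", errors="replace") as fh:
                        for lineno, line in enumerate(fh, start=1):
                            lowered = line.lower()
                            if all(n in lowered for n in needles):
                                yield path, lineno, line.strip()
                except OSError:
                    continue  # unreadable file (permissions, broken link): skip it


if __name__ == "__main__":
    # Usage: python search_exports.py ROOT_DIR TERM [TERM ...]
    root_dir, *query = sys.argv[1:]
    for path, lineno, line in search_exports([root_dir], query):
        print(f"{path}:{lineno}: {line}")
```

This is a brute-force scan, so it never misses a file the way an incomplete index can, but it rereads everything on each query. It also only sees what is actually on disk, so a partially synced Dropbox would still have gaps.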