SQL access to crypto market data, not just JSON
Hi HN,

I'm Nazim, founder of Koinju.io, and I want to share an exploratory option we opened very recently: access to our database of cryptocurrency market data via SQL. REST works fine for direct retrieval, but we increasingly think SQL access over a unified crypto market data layer could be valuable for analytical work, especially in the context of LLMs.

This was partly triggered by a recent essay by Didier Lopes, CEO of OpenBB, on financial firms owning the infrastructure where financial work happens (https://www.linkedin.com/pulse/how-did-we-end-up-here-didier-rodrigues-lopes-hgeqe/), especially the runtime where workflows execute and AI inference happens.

Most data APIs were designed for software that already knows what it wants: call an endpoint, get JSON, parse it, compute somewhere else. That model worked well and still does. But I'm not sure it maps well to LLM-driven workflows, especially with big data.

A language model can call APIs, read JSON, or write Python to do so (Claude Code can force JSON output). But that doesn't mean the model is efficient at ingesting, reshaping, joining, aggregating, validating, or reasoning over large structured datasets through tokenized rows. At small scale it fits within the context limit; at large scale it gets complicated, and small details can disappear silently, as if they were outliers.
So the thesis we are testing is:

For big datasets, the AI-facing primitive should shift from "return JSON" to "execute a bounded, inspectable operation over the dataset": something you can plan, replay, and even trace precisely. The LLM then takes the role of a planner/controller. It should be able to inspect schemas, understand constraints, express an operation, check limits or even the AST, run the computation through an execution layer, and then reason over a compact, typed result.
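To make "bounded, inspectable operation" concrete, here is a minimal sketch of such a gate. Everything in it is illustrative, not our actual implementation: sqlite stands in for the real engine, and the limits, function names, and toy table are made up.

```python
import sqlite3

MAX_ROWS = 1000  # illustrative cap on what comes back to the model

def run_bounded(conn: sqlite3.Connection, sql: str) -> list[tuple]:
    """Accept exactly one read-only SELECT, cap result size, return typed rows."""
    stmt = sql.strip().rstrip(";")
    if ";" in stmt:                                # one statement per operation
        raise ValueError("exactly one statement per operation")
    if not stmt.lower().startswith("select"):      # read-only surface
        raise ValueError("only SELECT is allowed on this surface")
    rows = conn.execute(stmt).fetchmany(MAX_ROWS + 1)
    if len(rows) > MAX_ROWS:                       # force aggregation instead of dumps
        raise ValueError(f"result exceeds {MAX_ROWS} rows; aggregate or add LIMIT")
    return rows

# toy stand-in for a market data table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (symbol TEXT, price REAL, qty REAL)")
conn.executemany("INSERT INTO trades VALUES (?, ?, ?)",
                 [("BTC", 97000.0, 0.5), ("BTC", 96800.0, 1.2), ("ETH", 2700.0, 4.0)])

# the planner expresses an aggregation; only the compact result crosses back
avg = run_bounded(conn, "SELECT symbol, AVG(price) FROM trades GROUP BY symbol")
```

The point is that the model never sees paginated raw rows: it plans the query, the gate checks it, the engine computes, and only a small typed result returns for reasoning.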
So SQL is our current attempt at that layer.

This is really not new, and not magically "AI-native". But it is explicit, inspectable, composable, and executable close to the data. REST still makes sense for simple retrieval. But for analytical questions over large market datasets, JSON pagination feels like the wrong unit of work.

There is also a governance question here: in the financial sector, many firms do not want their entire workflow to move into a vendor's black-box interface. That seems right. Internal context, permissions, model policy, audit logs, and decision workflows should of course live in the firm's own environment. But that doesn't necessarily mean every external dataset has to be copied locally before any question can be asked.

Maybe the better boundary is:
- the firm owns the workflow and the inference runtime
- the data provider exposes a controlled execution surface
- the LLM issues bounded operations
- the query engine performs the actual computation
- the result comes back
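That round trip could be sketched as a small request/response protocol. Again, this is purely an illustration under assumptions: the `Operation`/`Result` shapes, the `dry_run` flag, and the sqlite-backed engine are made up for the sketch, not our actual interface.

```python
from dataclasses import dataclass
import sqlite3

@dataclass(frozen=True)
class Operation:            # what the firm-side planner (LLM) emits
    sql: str
    max_rows: int = 1000
    dry_run: bool = False   # cost/plan preview without touching the data

@dataclass(frozen=True)
class Result:               # compact, typed payload that comes back
    columns: tuple[str, ...]
    rows: tuple[tuple, ...]

def provider_execute(conn: sqlite3.Connection, op: Operation) -> Result:
    """Provider-side execution surface (sqlite stands in for the engine)."""
    if op.dry_run:
        # dry run: return the query plan instead of data
        cur = conn.execute("EXPLAIN QUERY PLAN " + op.sql)
    else:
        cur = conn.execute(op.sql)
    cols = tuple(d[0] for d in cur.description)
    return Result(cols, tuple(cur.fetchmany(op.max_rows)))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ohlcv (symbol TEXT, ts INT, close REAL)")
conn.executemany("INSERT INTO ohlcv VALUES (?, ?, ?)",
                 [("BTC", 1, 97000.0), ("BTC", 2, 96500.0), ("ETH", 1, 2700.0)])

q = "SELECT symbol, MAX(close) FROM ohlcv GROUP BY symbol"
plan = provider_execute(conn, Operation(q, dry_run=True))  # preview first
data = provider_execute(conn, Operation(q))                # then execute
```

The firm-side runtime keeps the planning, policy, and reasoning; only bounded operations cross the boundary, and only typed results come back.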
I'd be interested in feedback from anyone working on similar things: market data, quant research, analytics. The questions I'm trying to answer:
- What is the right interface today for an LLM working with big data?
- Should the model operate on raw data, JSON, schemas, SQL, typed tools, semantic layers, or something else?
- Where should the boundary sit between the customer-owned runtime and provider-side data execution?
How should query limits, cost previews, dry runs, permissions, and audit logs work when the caller might be an agent?
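One possible shape for that last question: record intent before anything executes, and deny by default. The `PERMISSIONS` table, `AUDIT_LOG` list, and caller names below are hypothetical, just to show the ordering (log first, check second, execute last).

```python
import time

AUDIT_LOG: list[dict] = []              # stand-in for an append-only audit store
PERMISSIONS = {"agent-7": {"ohlcv"}}    # which tables each caller may touch (made up)

def audited_call(caller: str, table: str, sql: str, run):
    """Log the attempt before execution; deny unknown callers by default."""
    entry = {"ts": time.time(), "caller": caller, "table": table,
             "sql": sql, "ok": False}
    AUDIT_LOG.append(entry)             # the attempt is recorded even if denied
    if table not in PERMISSIONS.get(caller, set()):
        raise PermissionError(f"{caller} may not query {table}")
    out = run(sql)                      # hand off to the actual engine
    entry["ok"] = True
    return out
```

Logging before the permission check matters when the caller is an agent: denied attempts are exactly the ones you want in the audit trail.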
I'm not looking only for validation. If the answer is "don't invent a new AI category; just provide clean data, stable schemas, SQL, docs, and predictable limits", that would also be useful.