懒惰迭代与数组链式操作在50万行数据上的基准测试结果

1作者: gvsh_maths大约 2 个月前原帖
我构建了一个 TypeScript 迭代器库(iterflow),并希望测量懒惰和急切管道之间的实际堆内存差异。这是基准测试的详细说明。 <p>管道</p> 急切 - 标准数组链式调用: <p>const data = Array.from(generateRows(500_000));</p> <p>const results = data .filter(r => r.active && r.value > threshold) .map(r => ({ id: r.id, score: r.value * 1.5 })) .slice(0, 10_000);</p> 每一步都会产生一个新的中间数组。 .filter() 分配一个,.map() 分配另一个,.slice() 则丢弃大部分的数组。 懒惰 - 通过 iterflow 实现相同的管道: <p>import { iter } from '@mathscapes/iterflow';</p> <p>const results = iter(generateRows(500_000)) .filter(r => r.active && r.value > threshold) .map(r => ({ id: r.id, score: r.value * 1.5 })) .take(10_000) .toArray();</p> generateRows 是一个生成器,每次生成一行。在 .toArray() 拉取值通过链条之前,没有任何数据被实际化。没有中间数组。 <p>结果</p> 数据集:500,000 行 管道:filter(active && value > 5000) → map(score) → take(10,000) 原生数组 (.filter → .map → .slice) 15.4 MB (最小 15.2 MB,最大 16.2 MB) iterflow (.filter → .map → .take) 5.8 MB (最小 5.8 MB,最大 5.8 MB) <p>方法论</p> - 指标:管道前后的 heapUsed 差异,而不是总进程内存 - 两个管道都来自相同的生成器源——差异仅测量管道分配,而不是源数据——在每次运行之间强制调用显式 gc() 的 expose-gc - 测量前丢弃一次热身运行 - 报告 5 次运行的中位数 - 原生数组运行在管道运行之前将完整的 500k 数据集实际化为数据。该分配不包括在差异中——两种方法在同一基础上进行测量。 <p>关于库的一些说明</p> - iter() 是对 ES2015 生成器和迭代器协议的封装——没有魔法,只是一个流畅的 API,使调用位置看起来与数组链式调用相同 - .sum() 和 .mean() 仅限于 Iterflow<number> 类型——在非数字迭代器上调用它们会导致编译错误 - 具有一些流式统计操作(.streamingMean()、.ewma()、.windowedMin()),可以在不使用单独累加器的情况下进行运行聚合 - 零运行时依赖 https://www.npmjs.com/package/@mathscapes/iterflow
查看原文
I built a TypeScript iterator library (iterflow) and wanted to measure the actual heap difference between lazy and eager pipelines. This is the benchmark writeup.<p>The pipelines<p>Eager - standard array chaining:<p>const data = Array.from(generateRows(500_000));<p>const results = data .filter(r =&gt; r.active &amp;&amp; r.value &gt; threshold) .map(r =&gt; ({ id: r.id, score: r.value * 1.5 })) .slice(0, 10_000);<p>Each step produces a new intermediate array. .filter() allocates one, .map() allocates another, .slice() then discards most of both.<p>Lazy - same pipeline via iterflow:<p>import { iter } from &#x27;@mathscapes&#x2F;iterflow&#x27;;<p>const results = iter(generateRows(500_000)) .filter(r =&gt; r.active &amp;&amp; r.value &gt; threshold) .map(r =&gt; ({ id: r.id, score: r.value * 1.5 })) .take(10_000) .toArray(); generateRows is a generator, yields one row at a time. Nothing is materialized until .toArray() pulls values through the chain. No intermediate arrays.<p>Results<p>Dataset: 500,000 rows Pipeline: filter(active &amp;&amp; value &gt; 5000) → map(score) → take(10,000)<p>native array (.filter → .map → .slice) 15.4 MB (min 15.2 MB, max 16.2 MB) iterflow (.filter → .map → .take) 5.8 MB (min 5.8 MB, max 5.8 MB)<p>Methodology<p>- Metric: heapUsed delta before and after the pipeline, not total process memory - Both pipelines start from the same generator source — the delta measures pipeline allocations only, not source data --expose-gc with explicit gc() calls forced between every run - One warm-up run discarded before measurement - Median of 5 runs reported - The native array run materializes the full 500k dataset into data before the pipeline runs. That allocation is not included in the delta - both approaches are measured on the same footing.<p>A few notes on the library<p>- iter() is a wrapper around ES2015 generators and the iterator protocol - no magic, just a fluent API so the call site looks identical to array chaining - .sum() and .mean() are typed to Iterflow&lt;number&gt; only - calling them on a non-numeric iterator is a compile error - Has some streaming statistical operations (.streamingMean(), .ewma(), .windowedMin()) for running aggregations without a separate accumulator - Zero runtime dependencies<p>https:&#x2F;&#x2F;www.npmjs.com&#x2F;package&#x2F;@mathscapes&#x2F;iterflow