懒惰迭代与数组链式操作在50万行数据上的基准测试结果
我构建了一个 TypeScript 迭代器库(iterflow),并希望测量懒惰和急切管道之间的实际堆内存差异。这是基准测试的详细说明。
<p>管道</p>
急切 - 标准数组链式调用:
<p>const data = Array.from(generateRows(500_000));</p>
<p>const results = data
.filter(r => r.active && r.value > threshold)
.map(r => ({ id: r.id, score: r.value * 1.5 }))
.slice(0, 10_000);</p>
每一步都会产生一个新的中间数组。 .filter() 分配一个,.map() 分配另一个,.slice() 则丢弃大部分的数组。
懒惰 - 通过 iterflow 实现相同的管道:
<p>import { iter } from '@mathscapes/iterflow';</p>
<p>const results = iter(generateRows(500_000))
.filter(r => r.active && r.value > threshold)
.map(r => ({ id: r.id, score: r.value * 1.5 }))
.take(10_000)
.toArray();</p>
generateRows 是一个生成器,每次生成一行。在 .toArray() 拉取值通过链条之前,没有任何数据被实际化。没有中间数组。
<p>结果</p>
数据集:500,000 行
管道:filter(active && value > 5000) → map(score) → take(10,000)
原生数组 (.filter → .map → .slice) 15.4 MB (最小 15.2 MB,最大 16.2 MB)
iterflow (.filter → .map → .take) 5.8 MB (最小 5.8 MB,最大 5.8 MB)
<p>方法论</p>
- 指标:管道前后的 heapUsed 差异,而不是总进程内存
- 两个管道都来自相同的生成器源——差异仅测量管道分配,而不是源数据——在每次运行之间强制调用显式 gc() 的 expose-gc
- 测量前丢弃一次热身运行
- 报告 5 次运行的中位数
- 原生数组运行在管道运行之前将完整的 500k 数据集实际化为数据。该分配不包括在差异中——两种方法在同一基础上进行测量。
<p>关于库的一些说明</p>
- iter() 是对 ES2015 生成器和迭代器协议的封装——没有魔法,只是一个流畅的 API,使调用位置看起来与数组链式调用相同
- .sum() 和 .mean() 仅限于 Iterflow<number> 类型——在非数字迭代器上调用它们会导致编译错误
- 具有一些流式统计操作(.streamingMean()、.ewma()、.windowedMin()),可以在不使用单独累加器的情况下进行运行聚合
- 零运行时依赖
https://www.npmjs.com/package/@mathscapes/iterflow
查看原文
I built a TypeScript iterator library (iterflow) and wanted to measure the actual heap difference between lazy and eager pipelines. This is the benchmark writeup.<p>The pipelines<p>Eager - standard array chaining:<p>const data = Array.from(generateRows(500_000));<p>const results = data
.filter(r => r.active && r.value > threshold)
.map(r => ({ id: r.id, score: r.value * 1.5 }))
.slice(0, 10_000);<p>Each step produces a new intermediate array. .filter() allocates one, .map() allocates another, .slice() then discards most of both.<p>Lazy - same pipeline via iterflow:<p>import { iter } from '@mathscapes/iterflow';<p>const results = iter(generateRows(500_000))
.filter(r => r.active && r.value > threshold)
.map(r => ({ id: r.id, score: r.value * 1.5 }))
.take(10_000)
.toArray();
generateRows is a generator, yields one row at a time. Nothing is materialized until .toArray() pulls values through the chain. No intermediate arrays.<p>Results<p>Dataset: 500,000 rows
Pipeline: filter(active && value > 5000) → map(score) → take(10,000)<p>native array (.filter → .map → .slice) 15.4 MB (min 15.2 MB, max 16.2 MB)
iterflow (.filter → .map → .take) 5.8 MB (min 5.8 MB, max 5.8 MB)<p>Methodology<p>- Metric: heapUsed delta before and after the pipeline, not total process memory
- Both pipelines start from the same generator source — the delta measures pipeline allocations only, not source data --expose-gc with explicit gc() calls forced between every run
- One warm-up run discarded before measurement
- Median of 5 runs reported
- The native array run materializes the full 500k dataset into data before the pipeline runs. That allocation is not included in the delta - both approaches are measured on the same footing.<p>A few notes on the library<p>- iter() is a wrapper around ES2015 generators and the iterator protocol - no magic, just a fluent API so the call site looks identical to array chaining
- .sum() and .mean() are typed to Iterflow<number> only - calling them on a non-numeric iterator is a compile error
- Has some streaming statistical operations (.streamingMean(), .ewma(), .windowedMin()) for running aggregations without a separate accumulator
- Zero runtime dependencies<p>https://www.npmjs.com/package/@mathscapes/iterflow