Ask HN: Is GPU nondeterminism bad for AI?
Argument:
- GPUs use parallelism
- Floating point math is not associative
- Rounding error accumulates differently depending on operation order (see the sketch after this argument)
- GPUs therefore generate noisy computations
- There is a known noise vs. accuracy tradeoff in data
- Noise requires overparameterization/a larger network to generalize
- Overparameterization prevents the network from fully generalizing to the problem space
Therefore, GPU nondeterminism seems bad for AI. Where did I go wrong?
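A minimal Python sketch of the non-associativity and order-dependent rounding steps (the specific constants and the 100,000-element sum are arbitrary illustration values, not anything from a real GPU kernel): the same values summed in different orders, as a nondeterministically scheduled parallel reduction effectively does, give results that differ in the last few bits.

    import random

    # Non-associativity in double precision: (a + b) + c != a + (b + c).
    a, b, c = 1e16, -1e16, 1.0
    print((a + b) + c)  # 1.0
    print(a + (b + c))  # 0.0

    # Summing the same values in different orders, as a parallel reduction
    # with a nondeterministic schedule might, gives slightly different totals.
    random.seed(0)
    vals = [random.uniform(-1.0, 1.0) for _ in range(100_000)]
    forward = sum(vals)
    backward = sum(reversed(vals))
    ascending = sum(sorted(vals))
    print(forward - backward)   # typically a tiny nonzero difference
    print(forward - ascending)  # magnitude depends on the data, but rarely exactly 0.0

The per-step differences are tiny; the question below is whether they matter once they are amplified over many training steps.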
Questions:
- Has this been quantified? As I understand it, the answer would be situational and tied to other details like network depth, width, architecture, learning rate, etc. At the end of the day, entropy means some sort of noise/accuracy tradeoff, but are we talking magnitudes like 10%, 1%, 0.1%?
- Because of the noise/accuracy tradeoff, it seems to follow that a smaller network trained deterministically could achieve the same performance as a bigger network trained non-deterministically. Is this true, even if the difference is only a single neuron?
- If a problem space like driving a car is too large to be fully represented in a dataset (consider the atoms of the universe as a hard drive), how can we be sure a dataset is a perfect sampling of the problem space?
- Wouldn't overparameterization guarantee that the model learns the dataset rather than the problem space? Is it incorrect to conceptualize this as using a polynomial of higher degree to represent another polynomial? (See the sketch after this list.)
- Even with perfect sampling, noisy computation seems incompatible when a small amount of noise can cause an avalanche. If this noise is somehow quantified at 1%, couldn't you say the "impression" the dataset leaves in the network would be 1% larger than it should be, maybe spilling over in a sense? Eval data points "very close to" but not included in the training data would then be more likely to incorrectly evaluate to the same "nearby" training data point. Maybe I'm reinventing edge cases and overfitting here, but I don't think overfitting just spontaneously starts happening towards the end of training.
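To make the polynomial analogy in the fourth question concrete, here is a small NumPy sketch (the target polynomial, noise level, and degrees are made-up illustration values, not a claim about real training dynamics): fitting a degree-15 polynomial to 20 noisy samples of a degree-2 polynomial typically chases the noise and does worse between the sample points than a degree-2 fit does.

    import numpy as np
    from numpy.polynomial import Polynomial

    rng = np.random.default_rng(0)

    # "Problem space": a simple degree-2 polynomial (coefficients chosen arbitrarily).
    def target(x):
        return 1.0 - 2.0 * x + 0.5 * x**2

    # Finite, noisy "dataset" sampled from that problem space.
    x_train = np.linspace(-1.0, 1.0, 20)
    y_train = target(x_train) + rng.normal(0.0, 0.05, x_train.size)

    # A right-sized model versus an overparameterized one.
    fit_small = Polynomial.fit(x_train, y_train, deg=2)
    fit_big = Polynomial.fit(x_train, y_train, deg=15)

    # Evaluate on points "very close to" but not in the training set.
    x_test = np.linspace(-1.0, 1.0, 1000)
    err_small = np.max(np.abs(fit_small(x_test) - target(x_test)))
    err_big = np.max(np.abs(fit_big(x_test) - target(x_test)))
    print(f"max error off the training points, degree 2:  {err_small:.4f}")
    print(f"max error off the training points, degree 15: {err_big:.4f}")
    # The degree-15 fit typically interpolates the noise and shows the larger error.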