
大数据的近因偏差烦恼(下)

The same tends to be true of most complex phenomena in real life: stock markets, economies, the success or failure of companies, war and peace, relationships, the rise and fall of empires. Short-term analyses aren’t only invalid – they’re actively unhelpful and misleading. Just look at the legions of economists who lined up to pronounce events like the 2009 financial crisis unthinkable right until it happened. The very notion that valid predictions could be made on that kind of scale was itself part of the problem.

现实生活中大部分复杂事物的现象正是如此:股票市场、经济发展、企业的成功与失败、战争与和平、国家关系、帝国的崛起和衰落等等。短期分析不仅不扎实、毫无益处,还会产生误导。回头看看,就在2009年全球金融危机袭来的时候,还有那么多经济学家信誓旦旦地宣称这一事件不会发生。认为根据那种短期时间尺度的数据就能做出扎实的预测,这种想法本身就有很大的问题。

It’s also worth remembering that novelty tends to be a dominant consideration when deciding what data to keep or delete. Out with the old and in with the new: that’s the digital trend in a world where search algorithms are intrinsically biased towards freshness, and where so-called link rot infests everything from Supreme Court decisions to entire social media services. A bias towards the present is structurally engrained in almost all the technology surrounding us, not least thanks to our habit of ditching most of our once-shiny machines after about five years.

我们还应当记住,在决定哪些数据该保存还是删除的时候,新颖性往往会成为主要的考虑因素。旧的淘汰,新的进来,在这个搜索算法本质上偏向于新鲜事物的数字世界中,这是一个明显的趋势。从最高法院的裁决,到所有社交媒体服务平台,我们到处都可以看到已经失效的网址。我们身边的几乎所有技术都偏向于当前信息,人也一样:大多数人已经习惯用个四五年就把原本光鲜亮丽的设备丢掉。

What to do? This isn’t just a question of being better at preserving old data – although this wouldn’t be a bad idea, given just how little is currently able to last decades rather than years. More importantly, it’s about determining what is worth preserving in the first place – and what it means meaningfully to cull information in the name of knowledge.

怎么办?这不仅仅是如何更好地保存旧数据的问题,尽管这并不是个坏主意,毕竟目前能够保存数十年而非短短几年的数据少之又少。更重要的是,要确定哪些东西从一开始就值得保存,以及以知识的名义对信息进行有意义的删减究竟意味着什么。

What’s needed is something that I like to think of as “intelligent forgetting”: teaching our tools to become better at letting go of the immediate past in order to keep its larger continuities in view. It’s an act of curation akin to organising a photograph album – albeit with more maths. When are two million photographs less valuable than two thousand? When the larger sample covers less ground; when the questions that can be asked of it are less important; when the level of detail on offer instils not useful scepticism, but false confidence.

我们需要的是我称之为“智能遗忘”的东西:让我们的工具更善于放下刚刚过去的信息,以便看清其中更长远的连续性。这是一种类似于整理相册的筛选工作,只不过要用到更多数学。什么时候两百万张照片的价值还不如两千张?当较大的样本覆盖的范围反而更小的时候;当能够向它提出的问题不那么重要的时候;当它提供的细节带来的不是有益的怀疑,而是虚假的信心的时候。

Many data sets are irreducible and most precious when complete: gene sequences; demographic data; the raw, hard knowledge of geography and physics. The softer the science, however, the more that scale is likely inversely to correlate with quality – and the more important time itself becomes as a filter. Either we choose carefully what endures, matters and meaningfully captures our receding past – or its imprint is silently supplanted by the present’s growing noise.

许多数据集是无法缩减的,只有在完整的情况下才最宝贵,比如,基因序列、人口统计学数据、地理和物理学的原始观测数据等等。数据的科学性越弱,数据规模与数据的质量就越可能呈现负相关,此时时间本身就成为更加重要的过滤工具。我们如果不仔细选择过去保存下来的有价值、有意义的事物,它们就会被迅速膨胀的信息洪流悄无声息地吞没掉。

Time cuts several ways, for there is another crucial sense in which it remains a limiting factor: the availability of human time and attention. Corporations, individuals and governments alike have orders of magnitude more information available today than they did even a few years ago. Yet they don’t have any more available attention, board members, chief executives, elected officials or hours in the day. Better and better tools exist to help decision-makers ask meaningful questions of the information they possess – but you can only analyse what remains accessible. Mere accumulation is no kind of answer. In an era of bigger and bigger data, what you choose not to know matters just as much as what you do.

时间的影响是多方面的,因为它在另一个关键意义上仍然是限制因素:人的时间和注意力是有限的。今天的企业、个人和政府能够获得的信息比几年前多出许多个数量级,但他们可用的注意力、董事会成员、首席执行官、民选官员和一天中的小时数却并没有增加。帮助决策者就所掌握的信息提出有意义问题的工具越来越好,但你只能分析仍然可以获取的数据。单纯的积累算不上答案。在数据越来越大的时代,你选择不去了解什么,与你选择了解什么同样重要。