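Like the previous post, this one comes from a problem I ran into while doing mathematical modeling. For counting over an ordinary amount of data, Counter works fine, but once the data reaches a certain scale we have to look at other algorithms. Here we use the HyperLogLog algorithm for counting at large scale. It is a statistical (probabilistic) counting algorithm: the result is not guaranteed to be exact, but it is fast. If speed is your top priority, it is worth a try. The estimates are also very close to the true value (the typical relative error is about 1.04/√m for m registers), and how much does the difference between 1,000,001 and 1,000,000 really matter? So the algorithm is quite practical.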
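To make the statistical idea concrete, here is a minimal pure-Python sketch of a HyperLogLog-style estimator. It is only an illustration, not the implementation used by any library: the class name `HyperLogLogSketch`, the choice of SHA-1 as the hash, and the parameter `p` are all assumptions made for this example.

```python
import hashlib
import math


class HyperLogLogSketch:
    """Toy HyperLogLog cardinality estimator (illustrative only)."""

    def __init__(self, p=12):
        self.p = p                      # number of bits used to pick a register
        self.m = 1 << p                 # number of registers (4096 for p=12)
        self.registers = [0] * self.m
        # Standard bias-correction constant, valid for m >= 128
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        # Derive a 64-bit hash from SHA-1 (hash choice is an assumption)
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                    # first p bits: register index
        rest = h & ((1 << (64 - self.p)) - 1)       # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def cardinality(self):
        # Harmonic-mean based raw estimate
        raw = self.alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting
            return self.m * math.log(self.m / zeros)
        return raw


sketch = HyperLogLogSketch(p=12)
for i in range(50000):
    sketch.add("user-%d" % (i % 20000))   # 20,000 distinct values, with repeats
print(round(sketch.cardinality()))         # an estimate near 20000, not exact
```

The memory used is fixed by the number of registers, no matter how many items are added, which is the trade-off that buys speed and a small footprint at the cost of exactness.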

Code implementation

Python is a glue language, so there is no need to reinvent the wheel: we can use the bounter library directly, which relies on HyperLogLog for its cardinality estimates.

Installation: pip install bounter

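Here is the code from the official tutorial on bounter's GitHub page. The output shown was produced under Python 2, which is why the `u''` prefixes and the `L` suffix on long integers appear: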

Example 1:

from bounter import bounter
counts = bounter(size_mb=1024)  # use at most 1 GB of RAM
counts.update([u'a', 'few', u'words', u'a', u'few', u'times'])  # count item frequencies
print(counts[u'few'])  # query the counts
2

Example 2:

from bounter import bounter
counts = bounter(size_mb=200)  # default version, unless you specify need_iteration or need_counts
counts.update(['a', 'b', 'c', 'a', 'b'])
print(counts.total(), counts.cardinality())  # total and cardinality still work
(5L, 3)
print(counts['a'])  # individual item frequency still works
2
print(list(counts))  # iterator returns keys, just like Counter
[u'b', u'a', u'c']
print(list(counts.iteritems()))  # supports iterating over key-count pairs, etc.
[(u'b', 2L), (u'a', 2L), (u'c', 1L)]
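
For comparison, the same queries can be answered exactly with the standard library's `collections.Counter`. This is a small sketch for reference, written for Python 3; the variable names `total` and `distinct` are my own. It shows what `total()` and `cardinality()` correspond to when memory is not a concern.

```python
from collections import Counter

counts = Counter()
counts.update(['a', 'b', 'c', 'a', 'b'])

total = sum(counts.values())    # exact counterpart of counts.total()
distinct = len(counts)          # exact counterpart of counts.cardinality()

print(total, distinct)          # 5 3
print(counts['a'])              # 2
print(sorted(counts.items()))   # [('a', 2), ('b', 2), ('c', 1)]
```

Counter keeps every distinct key in memory, so its footprint grows with the number of distinct items. That growth is what becomes a problem at large scale and what a fixed-size probabilistic counter avoids.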