Code Answer: Find the mode of an array of floating point numbers using hashing

I recently read that a method involving hashing could be a good way to find the mode of an array of floating point numbers. I don't know much about hashing or its applications and the text didn't explain further.

Does anyone know what this method would involve?

From stackoverflow

Not being much of a statistics guy, the mode in this sense seems to be defined as "the value that occurs the most frequently".

From that description, it seems the obvious way to implement this using a hash table would be to use the number as a key, and as the value stored, use a frequency counter.

Then just loop through each number, in Python pseudo-code this would be something like:
```
array = [1.0, 1.2, 0.4, ...] # A bunch of numbers
counts = {}
for a in array:
  if a in counts:
    counts[a] += 1
  else:
    counts[a] = 1
```
Then you'd need to extract the greatest value, and find its corresponding key:
```
sorted([(v, k) for k, v in counts.items()])[-1][1]
```
This creates a sorted list of (value, key) pairs, sorted on value, and then extracts the key ([1]) from the last ([-1]) pair.

I didn't bother using a defaultdict or equivalent, for illustration I think the if is rather helpful.

NOTE: I make no guarantees that this is the canonical way of solving the problem. This is just what I'd do, off the top of my head, if someone asked me to solve the problem (and maybe hinted that using a hash is a good idea).

Vatine : It's pretty close to what I would do too.

   NB. random array of floating-point numbers
   ] y =: 10 (?@$%]) 5
0 0.6 0.2 0.4 0.4 0.8 0.6 0.6 0.8 0
   NB. count occurrences
   ({:,#)/.~ y
  0 2
0.6 3
0.2 1
0.4 2
0.8 2
   NB. order by occurrences
   (\:{:"1)({:,#)/.~ y
0.6 3
  0 2
0.4 2
0.8 2
0.2 1
   NB. pick the most frequent
   {.{.(\:{:"1)({:,#)/.~ y
0.6

I would advise against using a hash, as it assumes exact comparisons -- never a good assumption on floating-point numbers. You always want to do an epsilon comparison of some sort. What if your array contains some elements 0.2(00000000) and 0.2(00000001), which really should be considered equal, but aren't because they came from different calculations?

Conveniently, J always does epsilon-comparison by default. Too conveniently, since it's hidden in the /.~ and I have to write more code in order to demonstrate how to do this in other languages like Python...

epsilon = 0.0001
def almost_equal(a, b):
    return -epsilon <= a-b <= epsilon

array = [0.0, 0.6, 0.2, 0.4, 0.4, 0.8, 0.6, 0.6, 0.8, 0.0]

# more efficient would be to keep this in sorted order,
# and use binary search to determine where to insert,
# but this is just a simple demo
counts = []
for a in array:
    for i, (b, c) in enumerate(counts):
        if almost_equal(a, b):
            counts[i] = (b, c + 1)
            break
    else:
        counts.append((a, 1))

# sort by frequency, extract key of most frequent
print "Mode is %f" % sorted(counts, key = lambda(a, b): b)[-1][0]

j_random_hacker : +1 for pointing out the problem of FP inexactness. But J's appalling syntax makes regexes look beautiful. o_O

Code Answer

Friday, April 15, 2011

Find the mode of an array of floating point numbers using hashing

0 comments:

Post a Comment

Blog Archive