Sunday, March 6, 2011

Algorithm to tell if two arrays have identical members.

What's the best algorithm for comparing two arrays to see if they have the same members?

Assume there are no duplicates, the members can be in any order, and that neither is sorted.

compare(
    [a, b, c, d],
    [b, a, d, c]
) ==> true

compare(
    [a, b, e],
    [a, b, c]
) ==> false

compare(
    [a, b, c],
    [a, b]
) ==> false
From stackoverflow
  • The best I can think of is O(n^2), I guess.

    function compare($foo, $bar) {
        if (count($foo) != count($bar)) return false;
    
        foreach ($foo as $f) {
            foreach ($bar as $b) {
                if ($f == $b) {
                    // $f exists in $bar, skip to the next $foo
                    continue 2;
                }
            }
            return false;
        }
        return true;
    }
    
    Brian R. Bondy : O(n) would be the best if you're using a hashtable. If you're sorting first O(nlogn).
    Matt J : Does this still work if foo has duplicates? It doesn't look like it would check to see if they have the same number of duplicates. I.e., this would falsely match [0 0 0 0 1 2 3] and [0 1 1 2 2 3 3]
    nickf : no, it doesn't work in that case, but that was one of the assumptions at the start. I guess that if you couldn't be sure there weren't duplicates, then you could delete items from the bar array as they were matched.
    Jasper Bekkers : Making a multi-threaded variant of this implementation that scales linearly is probably easier than any other implementation on this page.
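nickf's suggestion in the comment above (delete items from bar as they are matched) makes the quadratic approach safe for duplicates too. A minimal Python sketch; the function name is mine:

```python
def compare_quadratic(foo, bar):
    # O(n^2) comparison that also handles duplicates, following nickf's
    # suggestion: delete items from a copy of bar as they are matched.
    if len(foo) != len(bar):
        return False
    remaining = list(bar)          # copy, so the original bar is not disturbed
    for f in foo:
        try:
            remaining.remove(f)    # removes only the first matching element
        except ValueError:
            return False           # f has no unmatched counterpart in bar
    return True
```

Because each match is consumed, Matt J's case [0,0,0,0,1,2,3] vs [0,1,1,2,2,3,3] is correctly rejected.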
  • You could load one into a hash table, keeping track of how many elements it has. Then, loop over the second one checking to see if every one of its elements is in the hash table, and counting how many elements it has. If every element in the second array is in the hash table, and the two lengths match, they are the same, otherwise they are not. This should be O(N).

    To make this work in the presence of duplicates, track how many of each element has been seen. Increment while looping over the first array, and decrement while looping over the second array. During the loop over the second array, if you can't find something in the hash table, or if the counter is already at zero, they are unequal. Also compare total counts.

    Another method that would work in the presence of duplicates is to sort both arrays and do a linear compare. This should be O(N*log(N)).

    jmucchiello : This solution works best in the presence of duplicates.
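The counting scheme described above can be sketched in Python; a plain dict stands in for the hash table (collections.Counter would do the same job), and the function name is mine:

```python
def compare_counts(a, b):
    # O(n) overall: increment a count per element of a,
    # then decrement per element of b.
    if len(a) != len(b):
        return False
    counts = {}
    for x in a:
        counts[x] = counts.get(x, 0) + 1
    for x in b:
        if counts.get(x, 0) == 0:
            return False   # x is missing, or b has more copies than a
        counts[x] -= 1
    return True
```

Since lengths are equal and every decrement is matched by an earlier increment, all counts reach zero exactly when the arrays hold the same members with the same multiplicities.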
  • Obvious answers would be:

    1. Sort both lists, then check each element to see if they're identical
    2. Add the items from one array to a hashtable, then iterate through the other array, checking that each item is in the hash
    3. nickf's iterative search algorithm

    Which one you'd use would depend on whether you can sort the lists first, and whether you have a good hash algorithm handy.

    Ken Gentle : Minor optimizations... 1. As mentioned below, check the length first. 2. Java's Set.add(E o) operation returns true if the element was added, so the iteration can simply test 'if (!setA.add(elementFromB))' and return false.
    jmucchiello : The problem with the hashtable approach is it doesn't work when the list can have duplicate values. Does a[1, 2, 2, 3] == b[1, 2, 3, 3]? Both are the same length and all the items in b can be found in the hashtable of a. You need a set structure that counts the items and check the counts are equal.
  • I would suggest sorting both arrays first. Then compare the first element of each array, then the second, and so on.

    If you find a mismatch you can stop.

  • If you sort both arrays first, you'd get O(N log(N)).
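Both sort-based suggestions boil down to a one-liner in Python; this sketch assumes the elements are mutually comparable, and the function name is mine:

```python
def compare_sorted(a, b):
    # Sorting dominates: O(n log n), then an O(n) element-wise comparison.
    # sorted() works on copies, so neither input array is disturbed.
    return len(a) == len(b) and sorted(a) == sorted(b)
```

Note this version also handles duplicates correctly, since equal multisets sort to identical sequences.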

  • The best way is probably to use hashmaps. Since insertion into a hashmap is O(1), building a hashmap from one array should take O(n). You then have n lookups, which each take O(1), so another O(n) operation. All in all, it's O(n).

    In python:

    def comparray(a, b): 
        sa = set(a)
        return len(sa)==len(b) and all(el in sa for el in b)
    
    Frentos : I'd love to know the details of this data structure with O(1) insertions and O(1) lookups. Unless you have a perfect hashing function (no duplicates) and there is no overhead as the hashmap grows, the insertion is going to be more than O(1).
    Claudiu : Well it is amortized O(1), meaning on average it'll be O(1). Obviously doing enough insertions will cause it to grow and collide, also causing lookups to then take slightly longer. I guess that would apply in this case, since it's doing n of them. Do you know details on python's hashmap performance?
    Frentos : I agree typical inserts/lookups are amortized O(1), and I'd definitely use some form of dictionary/hash/set as a solution in practice. But O() notation is for worst-case. I googled and found data structures which guarantee O(1) access, but no analysis of worst-case insertion time for N elements.
  • Don't forget to compare their lengths first! That may be all the optimization you need :).

    Dana the Sane : As long as your language or code can perform count() in O(1), otherwise you might be an order of magnitude slower.
    toto : We're looking for an algorithm here, not smart things to do :)
  • Ignoring the built-in ways to do this in C#, you could do something like this:

    It's O(1) in the best case, O(N) (per list) in the worst case.

    public bool MatchArrays(object[] array1, object[] array2)
    {
       if (array1.Length != array2.Length)
          return false;

       bool retValue = true;

       Hashtable ht = new Hashtable();

       for (int i = 0; i < array1.Length; i++)
       {
          ht.Add(array1[i], null);
       }

       for (int i = 0; i < array2.Length; i++)
       {
          if (!ht.Contains(array2[i]))
          {
             retValue = false;
             break;
          }
       }

       return retValue;
    }
    
    Frentos : Hash tables aren't O(1) to do lookups. Your solution is somewhere between O(N.log N) and O(N^2) depending on the hash table implementation.
    FlySwat : The O(1) is if the lists don't match; otherwise it's O(N) twice in the worst case and O(N) in the best case. Hashtables are O(1) when there is no chance of a collision, btw.
  • Assuming you don't want to disturb the original arrays and space is a consideration, another O(n.log(n)) solution that uses less space than sorting both arrays is:

      * Return FALSE if arrays differ in size
      * Sort the first array -- O(n.log(n)) time, extra space required is the size of one array
      * For each element in the 2nd array, check if it's in the sorted copy of
         the first array using a binary search -- O(n.log(n)) time
    

    If you use this approach, please use a library routine to do the binary search. Binary search is surprisingly error-prone to hand-code.

    [Added after reviewing solutions suggesting dictionary/set/hash lookups:]

    In practice I'd use a hash. Several people have asserted O(1) behaviour for hashes, leading them to conclude a hash-based solution is O(N). Typical inserts/lookups may be close to O(1), and some hashing schemes guarantee worst-case O(1) lookup, but worst-case insertion -- in constructing the hash -- isn't O(1). Given any particular hashing data structure, there would be some set of inputs which would produce pathological behaviour. I suspect there exist hashing data structures with a combined worst case for [insert-N-elements then lookup-N-elements] of O(N.log(N)) time and O(N) space.

    erikkallen : If we assume the data is not hostile, worst-case times are rarely interesting. Everyone says quicksort is O(n*log(n)), but its worst-case performance is O(n^2).
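A minimal Python sketch of the sort-plus-binary-search approach above, using the standard library's bisect module as the answer advises (the function name is mine); like the original question, it assumes no duplicates:

```python
import bisect

def compare_bsearch(a, b):
    # Sort a copy of the first array only: O(n log n) time,
    # O(n) extra space for the single sorted copy.
    if len(a) != len(b):
        return False
    sorted_a = sorted(a)
    for x in b:
        # Library binary search, as advised: O(log n) per lookup.
        i = bisect.bisect_left(sorted_a, x)
        if i == len(sorted_a) or sorted_a[i] != x:
            return False
    return True
```

Using bisect rather than a hand-rolled search sidesteps the off-by-one mistakes the answer warns about.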
  • I wonder if you could compare their sums and products:

    (defun list= (list-a list-b)
      (if (and (= (reduce #'+ list-a)
                  (reduce #'+ list-b))
               (= (reduce #'* list-a)
                  (reduce #'* list-b)))
          t
          nil))
    

    This would be O(n), but I suspect that it is basically wrong. Is anyone in for a little proof?

    edit: Yeah, when looking at it again (after getting some sleep), there are several problems with this: multiplying all out will produce large numbers, and when they get into the bignum range, there will be quite some overhead. Also, it only works for positive integers.

    Frentos : A counter-example is [0,1,4] and [0,2,3] ; both have sum=5, product=0. The original question doesn't restrict the array elements to be numbers.
    Ken Gentle : I'll give you an up vote for posting LISP - I wonder how many SO users can identify it, much less read it...
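A rough Python translation of the Lisp makes Frentos's counterexample easy to check (the function name is mine):

```python
from functools import reduce
import operator

def same_sum_and_product(a, b):
    # Rough Python translation of the Lisp list= above.
    product = lambda xs: reduce(operator.mul, xs, 1)
    return sum(a) == sum(b) and product(a) == product(b)

# Frentos's counterexample: both have sum 5 and product 0,
# yet the members differ, so the test wrongly reports equality.
```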
  • What is the "best" solution obviously depends on what constraints you have. If it's a small data set, sorting, hashing, or brute-force comparison (like nickf posted) will all perform pretty similarly. Because you know that you're dealing with integer values, you can get O(n) sort times (e.g. radix sort), and the hash table will also use O(n) time. As always, there are drawbacks to each approach: sorting will either require you to duplicate the data or destructively sort your array (losing the current ordering) if you want to save space. A hash table will obviously have memory overhead for creating the hash table. If you use nickf's method, you can do it with little-to-no memory overhead, but you have to deal with the O(n^2) runtime. You can choose which is best for your purposes.

    Jasper Bekkers : However, nickf's solution is the easiest to multi-thread; it doesn't do any writes to shared data and it can scale (near) linearly. It only has problems with duplicate items as mentioned in the comments.
  • Going into deep waters here, but:

    Sorted lists: sorting can be O(n log n), as pointed out. Just to clarify, it doesn't matter that there are two lists, because O(2*n log n) == O(n log n). Then comparing each element is another O(n), so sorting both and then comparing each element is O(n) + O(n log n), which is O(n log n).

    Hash-tables: converting the first list to a hash table is O(n) for reading plus the cost of storing in the hash table, which I guess can be estimated as O(n), giving O(n). Then you'll have to check the existence of each element of the other list in the produced hash table, which is (at least?) O(n) (assuming that checking the existence of an element in the hash-table is constant). All in all, we end up with O(n) for the check.

    The Java List interface defines equals() as each corresponding element being equal.

    Interestingly, the Java Collection interface definition almost discourages implementing the equals() function.

    Finally, the Java Set interface per its documentation implements this very behaviour. The implementation should be very efficient, but the documentation makes no mention of performance. (Couldn't find a link to the source; it's probably too strictly licensed. Download and look at it yourself; it comes with the JDK.) Looking at the source, HashSet (which is a commonly used implementation of Set) delegates the equals() implementation to AbstractSet, which uses the containsAll() function of AbstractCollection using the contains() function again from HashSet. So HashSet.equals() runs in O(n) as expected (looping through all elements and looking them up in constant time in the hash-table).

    Please edit if you know better, to spare me the embarrassment.

    Jasper Bekkers : The difference between O(2 * n log n) and O(n log n) does matter in a practical application; it's the hidden constants that largely make up the performance of the algorithm. The theoretical speed is only relevant on a superficial level.
  • Upon collisions a hashmap is O(n) in most cases, because it uses a linked list to store the collisions. However, there are better approaches, and you should hardly have any collisions anyway, because if you did the hashmap would be useless. In all regular cases it's simply O(1). Besides that, it's not likely to have more than a small n of collisions in a single hashmap, so performance wouldn't suffer that badly; you can safely say that it's O(1), or almost O(1), because the n is so small it can be ignored.

  • This can be done in different ways:

    1 - Brute force: for each element in array1, check that it exists in array2. Note this requires noting the position/index so that duplicates can be handled properly. This requires O(n^2) with much more complicated code; don't even think of it at all...

    2 - Sort both lists, then check each element to see if they're identical. O(n log n) to sort and O(n) to check, so basically O(n log n). The sort can be done in-place if messing up the arrays is not a problem; if not, you need 2n memory to copy both sorted lists.

    3 - Add the items and their counts from one array to a hashtable, then iterate through the other array, checking that each item is in the hashtable; if so, decrement its count if it is not zero, otherwise remove it from the hashtable. O(n) to create the hashtable and O(n) to check the other array's items against it, so O(n) overall. This introduces a hashtable with memory for at most n elements.

    4 - Best of the best (among the above): subtract or take the difference of the elements at the same index in the two arrays, and finally sum up the subtracted values. For example, A1={1,2,3}, A2={3,1,2} gives Diff={-2,1,1}; now sum up the Diff = 0, which means they have the same set of integers. This approach requires O(n) with no extra memory. A C# implementation would look as follows:

        public static bool ArrayEqual(int[] list1, int[] list2)
        {
            if (list1 == null || list2 == null)
            {
                throw new Exception("Invalid input");
            }
    
            if (list1.Length != list2.Length)
            {
                return false;
            }
    
            int diff = 0;
    
            for (int i = 0; i < list1.Length; i++)
            {
                diff += list1[i] - list2[i];
            }
    
            return (diff == 0);
        }
    

    4 doesn't work at all; it is the worst.

    nickf : that fails on the inputs: [2,4,6] and [-2,8,6] --> diff[4,-4,0] = 0.
    The Elite Gentleman : Rather, do this: `diff += Math.abs(list1[i]) - Math.abs(list2[i]);`
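A quick Python translation of approach 4 confirms nickf's counterexample (the function name is mine):

```python
def array_equal_by_diff(a, b):
    # Direct translation of approach 4: sum the element-wise differences.
    # This is flawed, as the comments point out: differences can cancel.
    return len(a) == len(b) and sum(x - y for x, y in zip(a, b)) == 0

# nickf's counterexample: diffs are [4, -4, 0], which sum to 0,
# so the function wrongly reports the arrays as equal.
```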
  • You can use a signature (a commutative operation over the array members) to further optimize this in the case where the arrays are usually different, saving the O(n log n) cost or the memory allocation. A signature can take the form of a bloom filter(s), or even a simple commutative operation like addition or XOR.

    A simple example (assuming a long as the signature size and GetHashCode as a good object identifier; if the objects are, say, ints, then their value is a better identifier; and some signatures will be larger than a long):

        public bool MatchArrays(object[] array1, object[] array2)
        {
            if (array1.Length != array2.Length)
                return false;

            long signature1 = 0;
            long signature2 = 0;
            for (int i = 0; i < array1.Length; i++)
            {
                signature1 = CommutativeOperation(signature1, array1[i].GetHashCode());
                signature2 = CommutativeOperation(signature2, array2[i].GetHashCode());
            }

            if (signature1 != signature2) return false;

            return MatchArraysTheLongWay(array1, array2);
        }

    where (using an addition operation; use a different commutative operation if desired, e.g. bloom filters):

        public long CommutativeOperation(long oldValue, long newElement)
        {
            return oldValue + newElement;
        }
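As a sketch of the same signature idea in Python, using XOR of element hashes as the commutative operation (the names and the fallback hook are mine; the answer's addition operation would work the same way):

```python
def xor_signature(arr):
    # Order-independent signature: XOR of element hashes is commutative.
    # Caveat (my observation, not from the answer): with duplicates,
    # pairs of equal elements cancel out, weakening the filter.
    sig = 0
    for x in arr:
        sig ^= hash(x)
    return sig

def compare_with_signature(a, b, full_compare):
    # Cheap O(n) prefilter: different signatures prove the arrays differ;
    # equal signatures can collide, so fall back to a full comparison.
    if len(a) != len(b):
        return False
    if xor_signature(a) != xor_signature(b):
        return False
    return full_compare(a, b)
```

Usage: pass any of the exact comparisons from this page as the fallback, e.g. `compare_with_signature(a, b, lambda a, b: sorted(a) == sorted(b))`.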

  • Here is another option; let me know what you guys think. It should be T(n) = 2n*log2(n) -> O(n log n) in the worst case.

    private boolean compare(List<Object> listA, List<Object> listB){
        if (listA.size() == 0 && listB.size() == 0) return true;
        List<MatchingItem> runner = new ArrayList<MatchingItem>();
        List<Object> maxList = listA.size() > listB.size() ? listA : listB;
        List<Object> minList = listA.size() > listB.size() ? listB : listA;
        int matches = 0;
        List<Object> nextList = null;
        int maxLength = maxList.size();
        for (int i = 0; i < maxLength; i++){
            for (int j = 0; j < 2; j++) {
                // Alternate between the two lists on each pass.
                nextList = (nextList == null) ? maxList : (nextList == maxList) ? minList : maxList;
                if (i < nextList.size()) {
                    MatchingItem nextItem = new MatchingItem(nextList.get(i), nextList);
                    int position = runner.indexOf(nextItem);
                    if (position < 0){
                        runner.add(nextItem);
                    } else {
                        MatchingItem itemInBag = runner.get(position);
                        if (itemInBag.getList() != nextList) matches++;
                        runner.remove(position);
                    }
                }
            }
        }
        return maxLength == matches;
    }

    public class MatchingItem {
        private Object item;
        private List<Object> itemList;

        public MatchingItem(Object item, List<Object> itemList){
            this.item = item;
            this.itemList = itemList;
        }

        public boolean equals(Object other){
            MatchingItem otherItem = (MatchingItem) other;
            // Equal only when the values match but come from different lists,
            // so indexOf() only ever finds candidates from the other array.
            return otherItem.item.equals(this.item) && otherItem.itemList != this.itemList;
        }

        public Object getItem(){ return this.item; }
        public List<Object> getList(){ return this.itemList; }
    }
