Monday, March 7, 2011

grouping records in one temp table

I have a table where one column has duplicate records but other columns are distinct. so something like this

Code SubCode version status

1234 D1 1 A

1234 D1 0 P

1234 DA 1 A

1234 DB 1 P

5678 BB 1 A

5678 BB 0 P

5678 BP 1 A

5678 BJ 1 A

0987 HH 1 A

So in the above table. subcode and Version are unique values whereas Code is repeated. I want to transfer records from the above table into a temporary table. Only records I would like to transfer are where ALL the subcodes for a code have status of 'A' and I want them in the temp table only once.

So from example above. the temporary table should only have 5678 and 0987 since all the subcodes relative to 5678 have status of 'A' and all subcodes for 0987 (it only has one) have status of A. 1234 is ommited because its subcode 'DB' has status of 'P'

I'd appreciate any help!

From stackoverflow
  • In your select, add a where clause that reads:

    Select [stuff]
    From Table T
    Where Exists
        (Select * From Table 
         Where Code = T.Code
            And Status = 'A')
      And Not Exists
        (Select * From Table I
         Where Code = T.Code 
           And Not Exists
              (Select * From Table
               Where Code = I.Code
                   And SubCode = I.SubCode
                   And Status = 'A'))
    

    In English, Show me the rows, where there is at least one row with status 'A', and there are NO rows with any specific subcode, that do not have at least one row with that code/subcode, with status 'A'

    Tom H. : This doesn't return the correct results. 5678 will be excluded because at least one row exists with a non-"A" status. (also, your query has SubCode where I think you meant status)
    Charles Bretana : Yes thx, corrected that
    Tom H. : This still doesn't work.
    Charles Bretana : Now it works... , I had a "<>" where it shouild have been an "="
  • INSERT theTempTable (Code)
    SELECT t.Code
    FROM   theTable t
           LEFT OUTER JOIN theTable subT ON (t.Code = subT.Code AND subT.status <> 'A')
    WHERE  subT.Code IS NULL
    GROUP BY t.Code
    

    This should do the trick. The logic is a little tricky, but I'll do my best to explain how it is derived.

    The outer join combined with the IS NULL check allows you to search for the absence of a criteria. Combine that with the inverse of what you're normally looking for (in this case status = 'A') and the query succeeds when there are no rows that do not match. This is the same as ((there are no rows) OR (all rows match)). Since we know that there are rows due to the other query on the table, all rows must match.

    Tom H. : This doesn't return the correct results. 5678 will be excluded because at least one row exists with a non-"A" status.
  • It's a little unclear as to whether or not the version column comes into play. For example, do you only want to consider rows with the largest version or if ANY subcde has an "A" should it count. Take 5678, BB for example, where version 1 has an "A" and version 0 has a "B". Is 5678 included because at least one of subcode BB has an "A" or is it because version 1 has an "A".

    The following code assumes that you want all codes where every subcode has at least one "A" regardless of the version.

    SELECT
        T1.code,
        T1.subcode,
        T1.version,
        T1.status
    FROM
        MyTable T1
    WHERE
        (
          SELECT COUNT(DISTINCT subcode)
          FROM MyTable T2
          WHERE T2.code = T1.code
        ) =
        (
          SELECT COUNT(DISTINCT subcode)
          FROM MyTable T3
          WHERE T3.code = T1.code AND T3.status = 'A'
        )
    

    Performance may be abysmal if your table is large. I'll try to come up with a query that is likely to have better performance since this was off the top of my head.

    Also, if you explain the full extent of your problem maybe we can find a way to get rid of that temp table... ;)

    Here are two more possible methods. Still a lot of subqueries, but they look like they will perform better than the method above. They are both very similar, although the second one here had a better query plan in my DB. Of course, with limited data and no indexing that's not a great test. You should try all of the methods out and see which is best for your database.

    SELECT
        T1.code,
        T1.subcode,
        T1.version,
        T1.status
    FROM
        MyTable T1
    WHERE
        EXISTS
        (
         SELECT *
         FROM MyTable T2
         WHERE T2.code = T1.code
           AND T2.status = 'A'
        ) AND
        NOT EXISTS
        (
         SELECT *
         FROM MyTable T3
         LEFT OUTER JOIN MyTable T4 ON
          T4.code = T3.code AND
          T4.subcode = T3.subcode AND
          T4.status = 'A'
         WHERE T3.code = T1.code
           AND T3.status <> 'A'
           AND T4.code IS NULL
        )
    
    SELECT
        T1.code,
        T1.subcode,
        T1.version,
        T1.status
    FROM
        MyTable T1
    WHERE
        EXISTS
        (
         SELECT *
         FROM MyTable T2
         WHERE T2.code = T1.code
           AND T2.status = 'A'
        ) AND
        NOT EXISTS
        (
         SELECT *
         FROM MyTable T3
         WHERE T3.code = T1.code
           AND T3.status <> 'A'
           AND NOT EXISTS
          (
           SELECT *
           FROM MyTable T4
           WHERE T4.code = T3.code
             AND T4.subcode = T3.subcode
             AND T4.status = 'A'
          )
        )
    
    David B : Good solution. FROM is the enemy of performance, but you know this.
  • Here's my solution

    SELECT Code
    FROM
    (
      SELECT
        Code,
        COUNT(SubCode) as SubCodeCount
        SUM(CASE WHEN ACount > 0 THEN 1 ELSE 0 END)
          as SubCodeCountWithA
      FROM
      (
        SELECT
          Code,
          SubCode,
          SUM(CASE WHEN Status = 'A' THEN 1 ELSE 0 END)
            as ACount
        FROM CodeTable
        GROUP BY Code, SubCode
      ) sub
      GROUP BY Code
    ) sub2
    WHERE SubCodeCountWithA = SubCodeCount
    

    Let's break it down from the inside out.

        SELECT
          Code,
          SubCode,
          SUM(CASE WHEN Status = 'A' THEN 1 ELSE 0 END)
            as ACount
        FROM CodeTable
        GROUP BY Code, SubCode
    

    Group up the codes and subcodes (Each row is a distinct pairing of Code and Subcode). See how many A's occured in each pairing.

      SELECT
        Code,
        COUNT(SubCode) as SubCodeCount
        SUM(CASE WHEN ACount > 0 THEN 1 ELSE 0 END)
          as SubCodeCountWithA
      FROM
        --previous
      GROUP BY Code
    

    Regroup those pairings by Code (now each row is a Code) and count how many subcodes there are, and how many subcodes had an A.

    SELECT Code
    FROM
      --previous
    WHERE SubCodeCountWithA = SubCodeCount
    

    Emit those codes with have the same number of subcodes as subcodes with A's.

    Tom H. : Also an interesting solution. On the limited dataset above on a table with no indexes I'm seeing a high cost for sorting (relatively), but with an actual table with data this may be better than the table scans from my method.

0 comments:

Post a Comment