Sunday, February 13, 2011

Help write a query: Confusion over order of operations of GROUP BY and ORDER BY

I have a table named Info of this schema:

int objectId;
int time;
int x, y;

There is a lot of redundant data in the system - that is, objectId is not UNIQUE. For each objectId there can be multiple entries of time, x, y.

I want to retrieve a list of the latest position of each object. I started out with this query:

SELECT * FROM Info GROUP BY objectId

That got me just the kind of list I was looking for. However I want also to get just the latest times for each Object, so I tried:

SELECT * FROM Info GROUP BY objectId ORDER BY time DESC

This gave me a time descended list of Infos. However, I don't think it did what I want - that is return me the latest time, x, y for each object.

Can anyone imagine a query to do what I want?

Update I have tried the top three solutions to see how they perform against each other on a dataset of about 50,000 Infos. Here are the results:

-- NO INDEX: forever
-- INDEX: 7.67 s

SELECT a.*
FROM Info AS a
  LEFT OUTER JOIN Info AS b ON (a.objectId = b.objectId AND a.time < b.time)
WHERE b.objectId IS NULL;

-- NO INDEX: 8.05 s
-- INDEX: 0.17 s

select a.objectId, a.time, a.x, a.y
  from Info a,
       (select objectId, max(time) time from Info group by objectId) b
  where a.objectId = b.objectId and a.time = b.time;

-- NO INDEX: 8.30 s
-- INDEX: 0.18 s

SELECT A.time, A.objectId, B.x, B.y
FROM
(
   SELECT max(time) as time, objectId 
   FROM Info
   GROUP by objectId
) as A 
INNER JOIN Info B
   ON A.objectId = b.objectId AND A.time = b.time;

By a margin, it would seem where outperforms inner join.

  • SELECT A.time, A.objectID, B.X, B.Y
    FROM
    (
       SELECT max(time) as time, objectID 
       FROM table
       GROUP by objectID
    ) as A 
    INNER JOIN table B
       ON A.objectID = b.objectID AND A.Time = b.Time
    

    votenaders, solution won't work, if x & y decrement at anypoint in the time line.

  • One way is using a subquery.

    select distinct a.objectID, a.time, a.x, a.y
      from Info a,
           (select objectID, max(time) time from Info group by objectID) b
      where a.objectID = b.objectID and a.time = b.time
    

    EDIT: Added DISTINCT to prevent duplicate rows if one objectId has multiple records with the same time. Depends on your data if this is necessary, the question author mentioned there were many duplicate rows. (added by Tomalak)

    Eoin Campbell : Great minds think alike ;-)
    Frank Krueger : You guys are too good!
    Tomalak : Using DISTINCT clause prevents duplicate rows in the output, in case there are more than one entry with the same time for a singe objectId.
    Torbjørn : I like Glomek's SQL style better than Campbell's, so my vote goes for this one - but they are basically the same. It's just a matter of taste.
    Frank Krueger : Tomalak: Can you expand on that? Can you take Glomek's SQL and add your DISTINCT clause? Or perhaps Glomek wouldn't mind?
    Tomalak : Done. Writing my own answer wouldn't make much difference anymore I guess. :-)
    From Glomek
  • For what it's worth, here's another way of getting the desired result. I got into the habit of doing tricks like this in the MySQL 4.0 days, before subqueries were supported.

    SELECT a.*
    FROM Info AS a
      LEFT OUTER JOIN Info AS b ON (a.objectID = b.objectID AND a.time < b.time)
    WHERE b.objectID IS NULL;
    

    In other words, show me the row where there no other row exists with the same objectID and a greater time. This naturally returns the row with the max time per objectID. No GROUP BY required.

    Frank Krueger : Wow. That's fantastic. Thanks for teaching me something.
    DOK : Don't you need a MAX in there?
    Bill Karwin : No, you don't need a MAX(). When b.objectID IS NULL, that means the join condition was not met, which means the current row 'a' has the max time value of any row with the same objectID.
  • This is a pretty common way of getting at all the information in a row, for a row that is part of a group.

    Select Info.*
    from Info
    inner join
       (select ObjectId, max(time) as Latest
        from Info
        group by ObjectId)  I
    on Info.ObjectId = I.ObjectID and Info.time = I.Latest
    

    The same question has been asked in different forms a couple of times in the last couple of weeks. I forget how the questions were worded.

0 comments:

Post a Comment