2008-05-29

Missing SQL Feature: Select Top By Group

There's a common problem in SQL, where you want to take records in one table, and join each of them to a matching record in some other table that has the maximum value in some column, but you need data from several columns in that second table.

Let's see if I can come up with an example. Say you have a table of Sales Reps, and you have a table of the Sales they've closed, including the dates and amounts of sales. You want to write a query that shows the Reps' names, and the amount of their largest sale. Easy:

Select r.Name, Max(s.Amount)
From Rep r Inner Join Sale s On r.Id = s.RepId
Group By r.Name

But what if you want to include the date of the sale? Well, now it gets much more complicated. One common solution is to use a subquery that just gets two columns:

Select RepId, Max(Amount) MaxAmount
From Sale
Group By RepId

...and then join against that, and join the Sales table to that on the Amount. Where it falls down, though, is when the maximum is duplicated. What happens, for instance, if a certain rep happened to close two million-dollar deals as his top sellers?

You have to create an additional subquery that finds the maximum Id that matches each RepId and Amount, join on that, and then join on the Sales table to get the rest of the record (the Date field, in this example). The mess looks like this:

Select r.Name, s.Date, s.Amount
From Rep r
Inner Join (
   Select RepId, Max(Amount) MaxAmount
   From Sale
   Group By RepId
) MaxSales On r.Id = MaxSales.RepId
Inner Join (
   Select Max(Id) MaxId, RepId, Amount
   From Sale
   Group By RepId, Amount 
) MaxIds On MaxSales.RepId = MaxIds.RepId And MaxSales.MaxAmount = MaxIds.Amount
Inner Join Sale s On MaxIds.MaxId = s.Id

Wouldn't it be far more intuitive and easier if you could do something like this:

Select r.Name, s.Date, s.Amount
From Rep r
Inner Join Top 1 Sale s Order By s.Amount On r.Id = s.RepId

I found this suggestion on a blog by Raymond Lewallen here. It does leave a bit of assumptions on the SQL compiler, like having to do an order by subgroup on the items in the join clause; perhaps those fields would need to be specified (kind of like how you have to specify everything in "GROUP BY" whenever you use an aggregate, instead of it assuming that it should just by nature of being in the SELECT clause).

No comments: