Sunday, February 19, 2012

Help replacing where not exists in a query

I am trying to clean up an ugly query that's based on trying to find items that exist in one table but not the other.

My tables are like this:

ITEMS
itemID,
itemName,
itemDescription
etc...

ORDERITEMS
OrderLineID,
OrderID,
itemID

ORDERS
OrderID,
OrderCompleted

Currently my query looks something like this:

Select Items.* from ITEMS where not exists (select 1 from ORDERITEMS inner join ORDERS on ORDERITEMS.OrderID = ORDERS.OrderID where ORDERITEMS.itemID=ITEMS.itemID and (ORDERS.OrderCompleted=1))

So this query is looking for ITEMS what don't have a corresponding entry in the ORDERITEMS table. As I understand it this is pretty inefficient as it is going to be executing the sub query in the Not Exists statement for each entry in the ITEMS table. Is the preferred method to do something along the lines of somehow making the sub query into a derived table and doing a left or right join?

Thanks for reading!

Ryan

Does this help?

Find Orphans Example:

An orphan record exists when there is a Contact2 record but NOT a Contact1 record.

SELECT * FROM Contact2 WHERE AccountNoNOT IN (SELECT AccountNo from Contact1)

|||

Would the NOT IN select clause be executed on every loop though? Which is less efficient *as I understand it*, than a left join?

|||

Using NOT EXISTS is the way I've always done it and the way I've always seen it done. So long as your indexes are built right this will perform well -- doing it once per ITEM is no problem as long as there's an index ORDERITEMS.itemId. And, btw, you can simply put a Select * in the subquery, you don't need to say Select top 1.

Re Steve's suggestion, I don't know for sure, but I'm not convinced it will work well, I'd like to see a showplan with your data. Translating his suggestion into your data, I think it turns into:

select *
from ITEMS
where itemId not in (select itemId -- select distinct itemId??
from ORDERITEMS inner join ORDERS on ORDERITEMS.OrderID = ORDERS.OrderID
where ORDERS.OrderCompleted=1)

The subselect will be executed only once, not once for each row of ITEMS.

However, I am not so sure this will perform well -- and it may be dependant on sql server 2000 vs 20005 whether it works well. Among other things, is there a limit to the number of items in an IN list? Like I say, I'd like to see the query plans

|||

If the subquery referenced by the NOT IN clause doesn't reference any fields of the outer query, then it'll be executed only once. In fact, it should have the same query plan as a left join.

select *
from ITEMS
where itemId not in (select itemId -- select distinct itemId??
from ORDERITEMS inner join ORDERS on ORDERITEMS.OrderID = ORDERS.OrderID
where ORDERS.OrderCompleted=1)

should produce a similiar query plan to:

SELECT *

FROM items i

LEFT JOIN (SELECT itemId FROM orderitems oi JOIN orders o ON oi.OrderID=o.OrderID WHERE o.OrderCompleted=1) t1 ON i.ItemID=t1.ItemID

WHERE t1.ItemID IS NULL

|||

FYI, if OrderItems.ItemId is nullable there will be a potential for getting different results in each method, seehttp://weblogs.sqlteam.com/mladenp/archive/2007/05/18/60210.aspx

Motley:

In fact, it should have the same query plan as a left join.

Not trying to argue, but do you know for a fact that SQL Server resolves a nested NOT IN list subquery exactly thes same way as a NOT EXISTS? Do you have a citation on this. I really don't know myself -- for a fact -- one way or the other.

|||

I didn't say that NOT IN and NOT EXISTS produce the same query plans. As you pointed out, NOT EXISTS clauses almost always uses a WHERE/JOIN clause that references a field in the outer query, which would radically change the query plan because the results for the subquery would vary in each iteration of the outer query. In any case, I was referring to NOT IN and LEFT JOIN.

With that said, you are correct. If one or more of the fields are nullable (even if no records actually contain a null value), then it can change the query plan, sometimes significantly. As I wrote that, you'll notice I even backed off from saying that it will produce the exact same query plan (Just similiar) as I wrote the two query examples. Three-valued-logic discussions are often beyond the scope of the simple questions that people usually ask here on these forums and I prefer to keep my answers simple when possible.

The article you linked brings up some very good points regarding the use of IN, EXISTS, and JOIN when dealing with one or more columns that can possibly contain NULL values. I'd have to argue a few points the author makes regarding the "expected" results (Specificially, I suspect that most beginner SQL writers that don't understand three-valued logic would indeed expect the results that the IN gave for the last results -- No records since both the parent and child had records with an ID of NULL. However, there are other test cases that would fail, the author just happened to have test data that gave that result). Still, it's a good article that explores the differences when dealing with nullable fields.

As for a citation -- (overlooking that isn't what was meant to be conveyed) I don't trust what is said, even by experts. Only the truth is important, and even "experts" are too often wrong in what they say. It is however, an excellent starting point until I learn better.

--

After all that, the worst performing queries will be ones that use subqueries that depend on value from each row in an outer query. It is very difficult for the query planner to come up with an efficient query plan to such a query. Optimizing such a query into one that has no such dependancy whether using IN, EXISTS, or JOIN will usually produce much better results, especially as the size of the data grows.

|||

Motley:

I didn't say that NOT IN and NOT EXISTS produce the same query plans

You're right, I misquoted you. My apologies.

Motley:

As for a citation -- (overlooking that isn't what was meant to be conveyed) I don't trust what is said, even by experts. Only the truth is important, and even "experts" are too often wrong in what they say. It is however, an excellent starting point until I learn better

Huh? I'd accept a citation from eg MSDN citation or SQL Server books online, wouldn't you? Could be wrong, but they did design the product.. Simply looking at query plans from individual queries is not enough support a blanket statement. I would rephrase my question to ask whether (how) you know that this NOT IN and LEFT JOIN resolve to the same thing. I don't keep up with every advance of the optimizer, maybe you do, but you made a pretty strong claim and I'm asking on what that is based ("In fact, it should have the same query plan as a left join").

Interestingly, I just did find a 2005 refrerence in which MS says that EXISTS and IN subqueries, as we have been talking about them here, are "semantically equivilent"http://msdn2.microsoft.com/en-us/library/ms188336.aspx

I also just found this 2005 reference,http://msdn2.microsoft.com/en-us/library/ms172984.aspx, for 2005 Compact Edition (didn't know there was one) which says, "The SQL Server Compact Edition query processor always rewrites the IN subquery to use JOIN"

Motley:

After all that, the worst performing queries will be ones that use subqueries that depend on value from each row in an outer query. It is very difficult for the query planner to come up with an efficient query plan to such a query. Optimizing such a query into one that has no such dependancy whether using IN, EXISTS, or JOIN will usually produce much better results, especially as the size of the data grows.

To accept that I would definately want to see a MS reference.

No comments:

Post a Comment