Improve LINQ Query Performance

I was writing a small utility for Outlook 2007 and was using LINQ to query Outlook Tasks.  This query was nested within another query.  While debugging, I realized that looping though the query results in my For Each loop was taking too long.

If you have read anything about LINQ, I am sure you already know about LINQ’s deferred execution.  Deferred execution basically means that your query will not execute until you actually try to use it.  By "use it", I mean anything from looping through it, accessing the "Count" property, and so on. 

The idea behind deferred execution makes sense for LINQ to SQL, since the idea is that your SQL query will only be run when it is absolutely needed and can be optimized at the time it is needed – specially if you try to nest it within another query.  This could indoubtedly produce optimized SQL queries.  Like I said, this makes more sense for LINQ to SQL but doesn’t make too much sense for LINQ to objects.

Maybe an example would make more sense.  The LINQ query below is trying to find all the Outlook tasks that are not in my list of tasks.

Dim results = _
   From myTask In myTasks _
   Where Not (From outlookTask In outlookTasks _
             Select outlookTask.Subject).Contains(myTask) _
   Select myTask 

.csharpcode, .csharpcode pre
{
font-size: small;
color: black;
font-family: consolas, “Courier New”, courier, monospace;
background-color: #ffffff;
/*white-space: pre;*/
}
.csharpcode pre { margin: 0em; }
.csharpcode .rem { color: #008000; }
.csharpcode .kwrd { color: #0000ff; }
.csharpcode .str { color: #006080; }
.csharpcode .op { color: #0000c0; }
.csharpcode .preproc { color: #cc6633; }
.csharpcode .asp { background-color: #ffff00; }
.csharpcode .html { color: #800000; }
.csharpcode .attr { color: #ff0000; }
.csharpcode .alt
{
background-color: #f4f4f4;
width: 100%;
margin: 0em;
}
.csharpcode .lnum { color: #606060; }

To clarify, if this was a SQL query it will be something like

select * from myTasks where name not in 
            (select subject from outlooktasks)

I think what happened was that the nested query selecting the outlook tasks was executed on every iteration in my loop which means that the Contains method ran on the entire outlook tasks list in every iteration.  I was kind of disappointed that the nested query didn’t get cached somehow or that the LINQ engine was smart enough to run it once.

Luckily I found a really neat trick to speed this up by orders of magnitudes.  I am talking minutes to seconds here.  And it was very simple.  By calling the ToList method on the LINQ query, I cause instant execution of the LINQ query and if I store the results in a variable then the query is ran only once.  So I got rid of deferred execution and I got rid of multiple executions.

My final LINQ query looked like this:

Dim outlookTasksName = (From outlookTask _
                        In outlookTasks _
                        Select outlookTask.Subject).ToList
Dim results = _
    (From myTask In myTasks _
    Where Not (outlookTasksName).Contains(myTask) _
    Select myTask).ToList

What do you think of this solution?  Do you see any drawbacks or issues?  Do you have other LINQ optimization tricks  or way to improve LINQ query performance?  Do  you think deferred execution is necessary for LINQ to objects?

NOTE: As I was writing this, I just realized that one drawback for large collections would be memory utilization.  But I think the gains in performance outweigh the cost in memory.

Advertisements

0 thoughts on “Improve LINQ Query Performance

  1. Essentially what you were looking for here was the difference of two collections (myTasks and outlookTasks). Enumerable has a built in method (Except()) to handle these when the collections are of the same type, and LINQ adds the ability to pass in a query to the “Except” method so that you can transform one collection before it is used.Your first approach was O(N^2), iterating over one collection then iterating over the second collection once for each iteration of the first collection.Your second approach is better: O(N) + O(N), which will iterate through collection one, then collection 2. However as the collections get bigger this will begin to take longer, which an optimized intersect algorithm can help you to avoid!if you instead used:Dim results = _ myTasks.Except(From outlookTask In outlookTasks _ Select outlookTask.Subject)which I believe is implemented as O(N log N) which should nicely plateau and provide reliable performance even as the collections get larger, not to mention fewer lines of code!Hope this helped!

    Like

  2. Essentially what you were looking for here was the difference of two collections (myTasks and outlookTasks). Enumerable has a built in method (Except()) to handle these when the collections are of the same type, and LINQ adds the ability to pass in a query to the “Except” method so that you can transform one collection before it is used.Your first approach was O(N^2), iterating over one collection then iterating over the second collection once for each iteration of the first collection.Your second approach is better: O(N) + O(N), which will iterate through collection one, then collection 2. However as the collections get bigger this will begin to take longer, which an optimized intersect algorithm can help you to avoid!if you instead used:Dim results = _ myTasks.Except(From outlookTask In outlookTasks _ Select outlookTask.Subject)which I believe is implemented as O(N log N) which should nicely plateau and provide reliable performance even as the collections get larger, not to mention fewer lines of code!Hope this helped!

    Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s