TLDR: There’s no “silver bullet” for this issue, as each situation is very different from the next. For me, two options have worked. Either split the subqueries out into table writes, which adds IO/CPU but gives a more reliable execution time, or try adding OPTION (USE HINT ('FORCE_LEGACY_CARDINALITY_ESTIMATION')) [read: https://docs.microsoft.com/en-us/sql/relational-databases/performance/cardinality-estimation-sql-server?view=sql-server-ver15 ]
Longer version: If you came here googling “CXCONSUMER” or “CXPACKET” then you probably don’t need much of an introduction to the issue, but if not, here’s the underlying story:
I have an otherwise not particularly large table (about 500 MB, a few million rows, sitting on a PCIe 3 M.2 SSD, which can do around 2 gigs/sec read or something similar). There’s also an identical copy of the data on my laptop, which is a bit slower (PC: R7 1800X, 32 GB, M.2 SSD; laptop: i7-3520M, 16 GB, SATA SSD).
It contains stock pricing info and looks something like this:

There’s a CI on the date and an NCI on the relevant column(s). The code parses this table to find if/what each individual stock/share price correlates with when the price changes. So when it runs for AMD, it will look at the daily data of around 8000ish other shares, stocks, ETFs and whatnots. There are further parts of the query that look at outliers and a few other things that aren’t relevant for the discussion here.
Here’s the problem. There’s a loop [@n = … @n = @n + 1, WHILE BEGIN/END] (not a cursor) that goes through each SEDOL (of which there are 8000ish) and does what’s been explained above. One would think that 8000 * ~240 working days = ~1,920,000 rows per loop, that the runtime then just depends on the computer’s configuration, and that one can sit back, relax and wait for the magic to happen. Right? Wrong. Sooo wrong. The problem is that not all combinations are valid, so the actual valid row counts range from a few thousand to around 1.8-1.9m per loop.
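To give a rough idea of the shape, the driving loop looks something like this (a minimal sketch only; the RUNSTATUS bookkeeping and the per-iteration work are summarised in the comments rather than reproduced):

-- Minimal sketch of the loop skeleton; the real script flags the current SEDOL in
-- SEDOLList (via RUNSTATUS) and does considerably more work per iteration.
DECLARE @n INT = 1;
DECLARE @loops INT = (SELECT COUNT(DISTINCT SEDOL) FROM SEDOLList);

WHILE @n <= @loops
BEGIN
    -- 1) flag the @n-th SEDOL as the current "base" (RUNSTATUS = 1) in SEDOLList
    -- 2) run the scoring query shown further down and post-process the results
    -- 3) reset the flag ready for the next iteration
    SET @n = @n + 1;
END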
As per request, here’s the actual query for the main part. This then gets processed further, but this is the bulk of the time-consuming element.
SELECT
      ScoringTable.DateHuman
    , ScoringTable.SEDOL_COMP
    , ScoringTable.Score
INTO tmpFactTable
FROM
(
    SELECT DISTINCT
          TFP.DateHuman AS DateHuman
        , TF_COMP.SEDOL AS SEDOL_COMP
        , ISNULL(TF_BASE.PriceShift_Yesterday / NULLIF(TF_COMP.PriceShift_Days_Shifted, 0), 0) AS Score
    FROM [tmpfact_Price] TFP
    INNER JOIN
    (
        SELECT
              DateHuman
            , PriceShift_Yesterday
        FROM [tmpfact_Price] --WITH (INDEX(IX_tmpfact_Price_Sedol_CL))
        WHERE SEDOL = (SELECT TOP 1 SEDOL FROM SEDOLList WHERE RUNSTATUS = 1)
    ) TF_BASE
        ON 1=1
        AND TFP.DateHuman = TF_BASE.DateHuman
    INNER JOIN
    (
        SELECT
              DateHuman
            , SEDOL
            , PriceShift_Days_Shifted
        FROM [tmpfact_Price] --WITH (INDEX(IX_tmpfact_Price_Sedol_CL))
        WHERE SEDOL IN (SELECT SEDOL FROM SEDOLList WHERE RUNSTATUS = 0)
    ) TF_COMP
        ON 1=1
        AND TFP.DateHuman = TF_COMP.DateHuman
    WHERE 1=1
    AND NOT (TF_BASE.PriceShift_Yesterday = 0 AND TF_COMP.PriceShift_Days_Shifted = 0)
) ScoringTable
On the PC (R7 1800X), each loop originally took 45 seconds. Compare that to my flatmate’s C# code that did the same task in 3 seconds per loop. This was clearly unacceptable.

It eventually boiled down to me replacing the part of the script that found and dispensed with the outliers, which initially used PERCENT_RANK (PARTITION BY SEDOL), with a simpler AVG (same partition) plus STDEVP (same partition), a bit of calculating and a join/delete. Another part that used a WHERE something = @variable was replaced with a different solution, because SQL has a tendency to ignore indexes when @variable-style code is used. The script was now running at 6 sec/loop. Better, but still rubbish.

After I poked the indexes and separated out a few things, I got the script to run at 1 second per loop. Woohoo, right? No, because around 1300 iterations into the 8000, the whole thing came to a grinding halt and took up to an hour (!!) per iteration to progress. Parameter sniffing comes into play.
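For illustration, the replacement outlier logic was along these lines (a simplified sketch: the column used, the 3-standard-deviation cut-off and the delete target are illustrative rather than the exact production code):

-- Remove outliers using a per-SEDOL AVG + STDEVP instead of a percentile window function.
;WITH Stats AS
(
    SELECT SEDOL,
           AVG(PriceShift_Days_Shifted)    AS AvgShift,
           STDEVP(PriceShift_Days_Shifted) AS StdShift
    FROM tmpfact_Price
    GROUP BY SEDOL
)
DELETE TFP
FROM tmpfact_Price TFP
INNER JOIN Stats S
    ON S.SEDOL = TFP.SEDOL
WHERE ABS(TFP.PriceShift_Days_Shifted - S.AvgShift) > 3 * S.StdShift;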
One option that I found to work was to actually dump the subqueries into temp tables, index them and do the rest of the code on those. The obvious downside of this is that it’s a lot of IO, but it seems to solve the problem of hangs, so how useful it is depends on your config. Admittedly, for me this was the solution that ended up being the “least unacceptable”, so I stuck with it, but here’s the rundown:
On the note of hangs: I have no wish to reinvent the wheel, so for everyone who’s not familiar with how to figure out what’s going on, please read the following articles and then continue with this one.
http://joshthecoder.com/2019/05/06/cxconsumer-slow-join.html (particularly useful)
https://www.brentozar.com/archive/2018/07/cxconsumer-is-harmless-not-so-fast-tiger/ (useful)
https://www.sqlskills.com/blogs/paul/cxconsumer-wait-type-history-and-what-you-need-to-know/ (so-so)
There are also some scripts/tools scattered around in those articles that will help you work out whether your errors are the same as these.
I don’t really want to bore everyone to death with the details, but the Execution Plan didn’t help much, apart from the fact that there was an epically long “something happening” with one of the indexes being scanned and allegedly returning over 2,386,841,210 (!!) rows after an hour. Clearly something got lost in the woods there.


This was my logic: the SQL Query Optimizer makes a number of assumptions, but the data doesn’t always conform to them. See above for the expected 7k rows, which I have no idea where it came from. Nonetheless, the problem is that the number of valid rows for each loop can vary widely, so if you’re debugging, it’s useful to get your values onto the screen one way or another to see what’s going on. Observe below, with the time stamps as well:

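One way to get the per-loop values and timestamps onto the screen while the loop is still running is RAISERROR … WITH NOWAIT, roughly like this (variable names are illustrative):

-- Print per-loop progress with a timestamp; WITH NOWAIT flushes immediately, unlike PRINT.
DECLARE @n INT = 1, @rowsTTP BIGINT = 0;   -- in practice these come from the loop itself
DECLARE @msg NVARCHAR(200) =
      CONVERT(NVARCHAR(30), SYSDATETIME(), 121)
    + N' | loop '      + CAST(@n AS NVARCHAR(10))
    + N' | rows TTP: ' + CAST(@rowsTTP AS NVARCHAR(20));
RAISERROR (@msg, 0, 1) WITH NOWAIT;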
The subquery that does the counts (“rows TTP” above) is a simplified version of the main query that runs comparatively quickly, and from here you can code extra stuff to say: if the difference between the previous loop and this one is more than “x”, then RECOMPILE, otherwise ignore. Right? Wrong again. For one thing, “comparatively quickly” isn’t that quick at all, 0.3s – 0.5s per loop, now on top of the original. The idea was to use the counts to say that if the count difference between ‘this loop’ and the ‘previous one’ is over 150k then recompile, but for some reason SQL still decided to hang randomly. The coding itself is easy to do with dynamic SQL.
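In sketch form, the idea was something like the following (the count query is a simplified stand-in, the scoring query is a placeholder string, and in the real loop @rowsPrevLoop is declared once, outside the WHILE):

-- Recompile only when the row count swings by more than ~150k between iterations (sketch).
DECLARE @rowsThisLoop BIGINT, @rowsPrevLoop BIGINT = 0, @sql NVARCHAR(MAX);

-- simplified version of the main query, just counting the rows it would touch
SELECT @rowsThisLoop = COUNT(*)
FROM tmpfact_Price TFP
INNER JOIN tmpfact_Price TF_COMP
    ON TF_COMP.DateHuman = TFP.DateHuman
WHERE TFP.SEDOL = (SELECT TOP 1 SEDOL FROM SEDOLList WHERE RUNSTATUS = 1)
  AND TF_COMP.SEDOL IN (SELECT SEDOL FROM SEDOLList WHERE RUNSTATUS = 0);

SET @sql = N'SELECT 1 AS placeholder_for_the_main_scoring_query'   -- splice the real query in here
         + CASE WHEN ABS(@rowsThisLoop - @rowsPrevLoop) > 150000
                THEN N' OPTION (RECOMPILE)' ELSE N'' END;
EXEC sp_executesql @sql;

SET @rowsPrevLoop = @rowsThisLoop;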

Didn’t help. The next (useless) step was to explicitly pre-count those (same) values, shove them into a table and have the loops run from largest to smallest with a periodic recompile.

This quickly became unfeasible because I think I was now well exceeding the IO counts of the original “shove subqueries into tables” logic and, if that wasn’t enough, SQL just decided to be a donkey anyway. The actual counting was done at around 100/min, so 8k would have been 80 mins, already “a bit long”. (Remember the C# code doing each whole loop in 2-3 secs.)
Code below for the theory of handling this. The dynamic SQL would be ordered by the ‘RC’ column descending and then occasionally recompile (the @cnt +1 = 1 is totally pointless, I know, it was just a failed experiment.)
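In rough shape it looked like this (the #LoopOrder table, the VARCHAR(20) SEDOL and the every-10-loops recompile cadence are illustrative, and the scoring query is again a placeholder):

-- Pre-count the valid rows per SEDOL, then drive the loop from the largest RC downwards,
-- recompiling every so often.
SELECT SL.SEDOL,
       (SELECT COUNT(*) FROM tmpfact_Price TFP WHERE TFP.SEDOL = SL.SEDOL) AS RC
INTO   #LoopOrder
FROM   SEDOLList SL;

DECLARE @cnt INT = 0, @SEDOL VARCHAR(20), @sql NVARCHAR(MAX);

WHILE EXISTS (SELECT 1 FROM #LoopOrder)
BEGIN
    SELECT TOP (1) @SEDOL = SEDOL FROM #LoopOrder ORDER BY RC DESC;

    SET @sql = N'SELECT 1 AS placeholder_for_the_scoring_query'   -- real query goes here
             + CASE WHEN @cnt % 10 = 0 THEN N' OPTION (RECOMPILE)' ELSE N'' END;
    EXEC sp_executesql @sql;

    DELETE FROM #LoopOrder WHERE SEDOL = @SEDOL;
    SET @cnt = @cnt + 1;
END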

Dummy run w/ 30 random loops:

At this stage I’d tried the following (without trying to recall the full, exhaustive list):
- MAXDOP =1 (or 2)
- QUERYTRACEON trace flags (just about any relevant ones); technically these result in OPTIMIZE FOR UNKNOWN, which is counterproductive
- Splitting the code into a separate SP
- Turning on Query Store
- Fiddling with the logging level (remember this is my home database so I can do as I please)
- RECOMPILE always (that works, but it’s silly long and costly)
- Updating/reorganising/rebuilding relevant stats/refreshing indexes on each loop (less costly than one would think because there’s not much to actually refresh but they still don’t make much difference)
- OPTION (USE HINT ('FORCE_LEGACY_CARDINALITY_ESTIMATION')) [read: https://docs.microsoft.com/en-us/sql/relational-databases/performance/cardinality-estimation-sql-server?view=sql-server-ver15 ]
None of these helped. The legacy cardinality estimation almost worked, but then it got silly-slow at around 95% through, so I’d probably flag it as ‘feasible, perhaps’. The task is not only to stop the hangs but also to keep the speed acceptable; MAXDOP stops the hangs but slows the loops back to the initial ~6 seconds, so it’s not preferred.
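For reference, applying the hint just means appending it to the very end of the statement, e.g. on a cut-down stand-in of the query:

-- Where the legacy CE hint attaches (shown on a trivial stand-in, not the full scoring query):
SELECT P.DateHuman, P.SEDOL, P.PriceShift_Days_Shifted
FROM   [tmpfact_Price] P
WHERE  P.SEDOL IN (SELECT SEDOL FROM SEDOLList WHERE RUNSTATUS = 0)
OPTION (USE HINT ('FORCE_LEGACY_CARDINALITY_ESTIMATION'));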
Ultimately I settled on the split-code-into-tables-and-index-them solution. This worked as follows:

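In outline, the approach looks like the sketch below (temp table and index names are illustrative, and the real code keeps the extra date join and the further processing steps):

-- 1) Materialise the base (RUNSTATUS = 1) SEDOL's rows and index them
SELECT DateHuman, PriceShift_Yesterday
INTO   #TF_BASE
FROM   [tmpfact_Price]
WHERE  SEDOL = (SELECT TOP 1 SEDOL FROM SEDOLList WHERE RUNSTATUS = 1);

CREATE CLUSTERED INDEX CIX_TF_BASE ON #TF_BASE (DateHuman);

-- 2) Materialise the comparison set (RUNSTATUS = 0) and index it
SELECT DateHuman, SEDOL, PriceShift_Days_Shifted
INTO   #TF_COMP
FROM   [tmpfact_Price]
WHERE  SEDOL IN (SELECT SEDOL FROM SEDOLList WHERE RUNSTATUS = 0);

CREATE CLUSTERED INDEX CIX_TF_COMP ON #TF_COMP (DateHuman, SEDOL);

-- 3) Join the two small, indexed temp tables instead of the original subqueries
SELECT DISTINCT
       B.DateHuman,
       C.SEDOL AS SEDOL_COMP,
       ISNULL(B.PriceShift_Yesterday / NULLIF(C.PriceShift_Days_Shifted, 0), 0) AS Score
INTO   tmpFactTable
FROM   #TF_BASE B
INNER JOIN #TF_COMP C
    ON C.DateHuman = B.DateHuman
WHERE  NOT (B.PriceShift_Yesterday = 0 AND C.PriceShift_Days_Shifted = 0);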
Hope this was useful for some of you. Comments welcome/encouraged.

So you’re basically correlating every security’s timeseries against every other security’s timeseries? I think you can both do this more efficiently and better utilize your CPUs by flipping the query a bit and exploiting a parallel nested loop.
What I would do is cross join on the outer side of the loop to produce a set of 8000*8000 security combinations that you’re interested in. Then on the inner side, seek for each security in the pair and evaluate the correlation. Make sure you have a one-to-many merge join on the inner side of the loop. This should give you the most efficiency on a per-thread basis, will drop a lot of the weirdness from your currently over-parallelized plan, will avoid your current one query per security setup, and it will allow you to scale across *all* of your CPUs and drive things to 100%.
Google “parallel apply pattern” for more info on how to set this up.
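In rough terms, the suggested shape is something like this (an illustrative sketch only, not Adam’s actual code; the per-pair scoring is simplified down to a single AVG):

-- Outer side builds every base/comparison pair; inner side evaluates each pair.
SELECT Pairs.SEDOL_BASE,
       Pairs.SEDOL_COMP,
       Scored.Score
FROM
(
    SELECT B.SEDOL AS SEDOL_BASE, C.SEDOL AS SEDOL_COMP
    FROM   SEDOLList B
    CROSS JOIN SEDOLList C
    WHERE  B.SEDOL <> C.SEDOL
) Pairs
CROSS APPLY
(
    SELECT AVG(ISNULL(TB.PriceShift_Yesterday / NULLIF(TC.PriceShift_Days_Shifted, 0), 0)) AS Score
    FROM   [tmpfact_Price] TB
    INNER JOIN [tmpfact_Price] TC
        ON TC.DateHuman = TB.DateHuman
    WHERE  TB.SEDOL = Pairs.SEDOL_BASE
      AND  TC.SEDOL = Pairs.SEDOL_COMP
) Scored;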
Hi Adam, thanks for that, I’ll have a look and try!
I have been applying this technique for years without even knowing it was an actual pattern 🙂
Parallel loop joins are very problematic; SQL Server is not good at choosing a strategy that achieves an even distribution of work between threads, with sufficient work to make it worthwhile. Is there some reason you cannot show the query? Hide the true column names if necessary.
re: actual query, updated the post to have the part in question. thanks 🙂
Hi Viktor,
Thanks for the detailed article. Is this issue random and not consistently reproducible? We have a similar issue with SQL queries stuck on CXCONSUMER for a long time, but when the session is terminated and the same process is re-run, it completes much faster. Apparently the query plan looks exactly the same in both cases, and the issue is not manually reproducible.
Thanks.
Hi Praveen,
Not sure if it’s consistently reproducible; however, what I found (and what I’ve read on other sites) is that it tends to come up when you either loop or have generally varying data sizes. So, for example, in the case described above in the article, I had ~8k loops and therefore the ‘planned’ QP would only be based on the first few, which might or might not change. It’s also reasonably unlikely that even a post-exec QP would show all the different thousands of executed loops in a human-digestible way.
Your issue seems somewhat similar to mine in that in some cases I got to say 1000 iterations before it died and other cases 5000 iterations, so it’s not reproducible per se, annoyingly enough.
I appreciate this might not be much help, but what I’m saying/asking is that your QP might not seem to change because it doesn’t cover the whole of your data/script execution; that’s the issue in the first place, because SQL Server can’t come up with a plan that’s good for all your iterations.
Have you tried the OPTION (USE HINT ('FORCE_LEGACY_CARDINALITY_ESTIMATION')) ?
Thanks for the reply. I will check if FORCE_LEGACY_CARDINALITY_ESTIMATION helps in consistent performance. Specifically, in our case, we are not using any variables/parameters. Queries are mostly fixed/static.