SQL Server — Widely Random Query Exec Times (CXCONSUMER WAIT/CXPACKET) & Possible Remediation

Standard

TLDR: There’s no “silver bullet” to solve this issue as each situation is very different from another. For me two options have worked. Either: Split the subqueries into table writes. This adds IO/CPU but also a more reliable exec time. Or try to add OPTION (USE HINT ( ‘FORCE_LEGACY_CARDINALITY_ESTIMATION’ )) [read: https://docs.microsoft.com/en-us/sql/relational-databases/performance/cardinality-estimation-sql-server?view=sql-server-ver15 ]

Longer version: If you came here googling “CXCONSUMER” or “CXPACKET” then you probably don’t need much of an introduction to the issue, but in case not, here’s the underlying story:

I have an otherwise not particularly large table (about 500 MB, a few million rows, sitting on a PCIE3 M2 SSD, which can do around 2 gigs/sec read or something similar.) — there’s also an identical copy of the data on my laptop, which is a bit slower (PC: R7 1800X, 32G, M2 SSD, laptop: I7-3520M, 16G, SATA SSD)
It contains stock pricing info and looks something like this:

AMD’s share price

There’s a CI on the date and a NCI on the relevant column(s). The code parses this table to find if/what each individual stock/share price correlates with when the price changes. So when it runs for AMD, it will look at the daily data of around 8000ish other shares, stocks, ETFs and whatnots. There are further parts of the query that look at outliers and a few others things that aren’t relevant for the discussion here.

Here’s the problem. There’s a loop [@n = … @n =@n + 1, WHILE BEGIN/END] (not a cursor) that goes through each SEDOL (of which there are 8000ish) and does what’s been explained above. One would think that 8000*~240 working days = ~1,920,000 rows per loop and that’s then dependent on the computer’s configuration but one can sit back and relax and wait for magic to happen. Right? Wrong. Sooo wrong. The problem is that not all combinations are valid and so actual valid row counts range from a few thousand to around 1.8-1.9m

As per request the actual query of the main part: — this then gets processed further but this is the bulk of the time-consuming element.

    SELECT
          ScoringTable.DateHuman
        , ScoringTable.SEDOL_COMP
        , ScoringTable.Score
    INTO tmpFactTable
    FROM 
        (SELECT DISTINCT
              TFP.DateHuman                                                                         AS DateHuman
            , TF_COMP.SEDOL                                                                         AS SEDOL_COMP
            , ISNULL(TF_BASE.PriceShift_Yesterday / NULLIF(TF_COMP.PriceShift_Days_Shifted, 0), 0)  AS Score
        FROM [tmpfact_Price] TFP
        INNER JOIN 
            (SELECT
                  DateHuman
                , PriceShift_Yesterday
              FROM [tmpfact_Price] --WITH (INDEX(IX_tmpfact_Price_Sedol_CL))
              WHERE SEDOL = (SELECT TOP 1 SEDOL FROM SEDOLList WHERE RUNSTATUS = 1)
          ) 
                              TF_BASE
        ON 1=1
            AND TFP.DateHuman = TF_BASE.DateHuman
        INNER JOIN 
            (SELECT
                  DateHuman
                , SEDOL
                , PriceShift_Days_Shifted
              FROM [tmpfact_Price] --WITH (INDEX(IX_tmpfact_Price_Sedol_CL))
              WHERE SEDOL IN (SELECT SEDOL FROM SEDOLList WHERE RUNSTATUS = 0)
            ) 
                              TF_COMP
        ON 1=1
            AND TFP.DateHuman = TF_COMP.DateHuman
        WHERE 1=1
            AND NOT (TF_BASE.PriceShift_Yesterday = 0 AND TF_COMP.PriceShift_Days_Shifted = 0)
        ) ScoringTable

On the PC (R7 1800X) originally each loop took 45 seconds. Compare that to my flatmate’s C# code that did the same task and did so in 3 secs per loop. This was clearly unacceptable and eventually boiled down to me replacing the part of the script that found and dispensed with the outliers, which initially used PERCENTILE_RANK (PARTITION BY SEDOL) to a simpler AVG (partition same) as well as STDDEVP (partition same) and then a bit of calculating and a join/delete. Another part that used a WHERE something = @variable was replaced with another solution because SQL has a tendency to ignore indexes when @variable style code is used… Script was running at 6 sec/loop. Better but still rubbish. After I poked the indexes and separated out a few things I got the script to run at 1 sec per loop. Woohoo, right? Not, because around 1300 into the 8000 iterations the whole thing came to a grinding halt and took up to an hour (!!) per iteration to progress. Parameter sniffing comes into play.

One option that I found to work was to actually dump the sub-suqeries into temp tables, index them and do the rest of the code. The obvious downside of this is that it’s a lot of IOs but seems to solve the problem of hangs, so how useful it is depends on your config. Admittedly for me this was the solution that ended up being the “least unacceptable” so I stuck with it but here’s the rundown:

On the note of hangs: I have no wish to reinvent the wheel so for everyone that’s not familiar with how to figure out what’s going on please read the following articles then continue this one.

http://joshthecoder.com/2019/05/06/cxconsumer-slow-join.html (particularly useful)

https://www.brentozar.com/archive/2018/07/cxconsumer-is-harmless-not-so-fast-tiger/ (useful)

https://www.sqlskills.com/blogs/paul/cxconsumer-wait-type-history-and-what-you-need-to-know/ (so-so)

There are also some scripts/tools scattered around in those articles that will help you find if your errors are the same as this.

I don’t really want to bore everyone to death with the details but the Execution Plan didn’t help much, apart from the fact that there was an epic long “something happening” with one of the indexes being scanned and allegedly returning over 2,386,841,210 (!!) rows after an hour. Clearly something got lost in the woods there.

Compare the Estimated and Actual Rows. Estd 7k rows is plain stupid. (Then again there aren’t that many rows in the table/index as in Actual either.)
Scipts in the articles above will help you find things like Wait Info

This was my logic: SQL Query Optimizer makes a number of assumptions but the data isn’t based on them all the time. See above for the expected 7k rows, which I have no idea where it came from. Nonetheless the problem is that the combination of valid rows for each loop can vary widely and so if you’re debugging it’s useful to get your values onto the screen one way or another to see what’s going on. Observe below with the time stamps as well:

As row counts start to vary the the exec times get oddly stretched out. [this one from the laptop, which is slower anway]

The subquery that does the counts (“rows TTP” above) is a simplified version of the main query that runs comparatively quickly and from here on you can code extra stuff to say if the difference between the previous and this loop is more than “x” then RECOMPILE, otherwise ignore. Right? Wrong again. One thing is that actually that comparatively quickly isn’t that quick at all 0.3s – 0.5s per loop, now on top of the original. The idea was to use the counts to say if the count difference between ‘this loop’ and the ‘previous one’ is over 150k then recompile but for some reason SQL still decided to hang randomly. The coding itself is easy to do w/ dynamic SQL.

That’s certainly an attempt worth trying but it didn’t help me.

Didn’t help. Next useless step was to explicitly pre-count those (same) values, shove them into a table and have the loops run from largest to smallest with periodic recompile.

This quickly became un-feasible because I think I was now well exceeding the IO counts of the original “shove subqueries into tables” logic and if that wasn’t enough, SQL just decided to be a donkey anyway. The actual counting was done at around 100/min, so 8k would have been 80 mins, already “a bit long”. (remember the C# code doing the whole each loop in 2-3 secs.)

Code below for the theory of handling this. The dynamic SQL would be ordered by the ‘RC’ column descending and then occasionally recompile (the @cnt +1 = 1 is totally pointless, I know, it was just a failed experiment.)

Attempt n+2

Dummy run w/ 30 random loops:

Two f…ing minutes.

At this stage I’ve tried, without trying to remember the full exhaustive list:

  • MAXDOP =1 (or 2)
  • QUERYTRACE flags (just about any relevant ones) — technically these result in OPTIMIZE FOR UNKNOWN, which is counterproductive
  • Splitting the code into a separate SP
  • Turning on Querystore
  • Fiddling with the logging level (remember this is my home database so I can do as I please)
  • RECOMPILE always (that works, but it’s silly long and costly)
  • Updating/reorganising/rebuilding relevant stats/refreshing indexes on each loop (less costly than one would think because there’s not much to actually refresh but they still don’t make much difference)
  • OPTION (USE HINT ( ‘FORCE_LEGACY_CARDINALITY_ESTIMATION’ )) [read: https://docs.microsoft.com/en-us/sql/relational-databases/performance/cardinality-estimation-sql-server?view=sql-server-ver15 ]

None of these helped. The legacy cardinality almost worked but then it got into the silly-slow after around 95%, so I’d probably flag it as ‘feasible perhaps’. The task is not only to stop the hangs but also to keep the speed acceptable. MAXDOP stops the hangs but slows the loops to the initial 6ish seconds so not preferred.

Ultimately I settled on the split-code-into-tables-and-index-them solution. This worked as following:

The log is busier but basically each loop is just under 3 secs. Not great not terrible (re: Chernobyl) but this time acceptable I guess. No hangs either. Total run: 5hr 4m so 2.25s/loop.

Hope this was useful for some of you. Comments welcome/encouraged.

On a post-run note, the script is rather CPU-heavy not so much on the IOs

8 thoughts on “SQL Server — Widely Random Query Exec Times (CXCONSUMER WAIT/CXPACKET) & Possible Remediation

  1. So you’re basically correlating every security’s timeseries against every other security’s timeseries? I think you can both do this more efficiently and better utilize your CPUs by flipping the query a bit and exploiting a parallel nested loop.

    What I would do is cross join on the outer side of the loop to produce a set of 8000*8000 security combinations that you’re interested in. Then on the inner side, seek for each security in the pair and evaluate the correlation. Make sure you have a one-to-many merge join on the inner side of the loop. This should give you the most efficiency on a per-thread basis, will drop a lot of the weirdness from your currently over-parallelized plan, will avoid your current one query per security setup, and it will allow you to scale across *all* of your CPUs and drive things to 100%.

    Google “parallel apply pattern” for more info on how to set this up.

  2. Joe Chang

    parallel loop joins are very problematic, SQL Server is not good at choosing a strategy to achieve even distribution of work between threads, and sufficient work to make it worth while. Is there some reason you cannot show the query – hide the true column name if necessary

  3. Praveen

    Hi Viktor,

    Thanks for the detailed article. Is this issue random and not consistently reproducible? We have a similar issue with SQL queries stuck on CXCONSUMER for long time but when session is terminated and same process is re-run, it completes much faster. Apparently, query plan looks exactly same in both cases and issue is not manually reproducible.

    Thanks.

    • Hi Praveen,

      Not sure if it’s consistently reproducible, however what I found (and what I’ve read on other sites) is that it tends to come up when you either loop or have generally varying data sizes — so for example in the case described above in the article, I had ~8k loops and therefore the ‘planned’ QP would only be based on the first few, which might or might not change. It’s also reasonably unlikely that even a post-exec QP would show all the different thousands of executed loops in a human-digestible way.
      Your issue seems somewhat similar to mine in that in some cases I got to say 1000 iterations before it died and other cases 5000 iterations, so it’s not reproducible per se, annoyingly enough.
      Appreciating this might not be much help but what I’m saying/asking is that your QP might not seem to change because it doesn’t cover the whole of your data/script execution, hence the issue in the first place, because SQL Server can’t come up with a plan that’s good for all your iterations.
      Have you tried the OPTION (USE HINT ( ‘FORCE_LEGACY_CARDINALITY_ESTIMATION’ )) ?

      • Praveen

        Thanks for the reply. I will check if FORCE_LEGACY_CARDINALITY_ESTIMATION helps in consistent performance. Specifically, in our case, we are not using any variables/parameters. Queries are mostly fixed/static.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.