DP0.2 submissions sometimes stuck during quantum graph generation

Description

Rescue submissions for DP0.2 step3 production, made from the RSP on data-int.lsst.cloud with stack v23_0_1_rc4, have sometimes become stuck during quantum graph generation, whereas the submission process normally succeeds after 1.5 to 2 hours. In particular, identical submissions can become stuck on one day but succeed on another. An example submission was initiated with more logging, using the command

and turning on sqlalchemy debug logging by adding the following to the bps submit yaml file:

The submission became stuck on a complicated SQL query during quantum graph generation, as shown in the attached log file (which contains the last 1000 lines of the much larger 3.6 GB full log). Also attached are the submission yaml file and the 2 other yaml files used for the submission.
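
For context, SQLAlchemy's debug logging can also be switched on directly from Python. The sketch below is not the bps YAML setting used for this submission; it assumes only the standard `logging` module and SQLAlchemy's documented `sqlalchemy.engine` logger name, and the helper name `enable_sqlalchemy_debug_logging` is hypothetical.

```python
import logging


def enable_sqlalchemy_debug_logging(level=logging.DEBUG):
    """Raise the verbosity of SQLAlchemy's engine logger.

    INFO logs the SQL statements; DEBUG additionally logs result rows.
    Call this before the engine is created so the setting is picked up.
    """
    # Attach a basic handler to the root logger so messages are actually emitted.
    logging.basicConfig(
        format="%(asctime)s %(name)s %(levelname)s %(message)s"
    )
    logger = logging.getLogger("sqlalchemy.engine")
    logger.setLevel(level)
    return logger


engine_logger = enable_sqlalchemy_debug_logging()
```

At DEBUG level the engine logger emits result rows as well as SQL statements, which may explain why the full log grows so large.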

Attachments

4

Activity

Andy Salnikov January 9, 2023 at 9:24 PM

I just rechecked that log file, and there is nothing interesting there in addition to what we said above. I agree we should close it and wait for another occurrence.

Tim Jenness January 9, 2023 at 8:36 PM

Is this ticket going anywhere? It seems that was looking at a sqlalchemy dump but did not report. Given that DP0.2 has finished, we should probably shut this ticket down and re-open it the next time we have a problem.

Jim Bosch March 31, 2022 at 8:36 PM

I'm fixing the unexpected query complexity I commented on earlier on , but I still don't expect that to have played a big role in this hang.

Andy Salnikov March 23, 2022 at 7:30 PM

sqlalchemy would dump the result rows if it received them. So it does feel like the postgres server is stuck in a query, though it may also be that something bad happened silently on the client side that prevented it from reading the result.
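
One way to test the "server is stuck in a query" hypothesis is to look at PostgreSQL's `pg_stat_activity` view while the submission hangs. The following is a minimal sketch, not something that was run for this ticket: it assumes a DB-API connection to the registry database (for example from psycopg2, which uses `%s` placeholders), and the function name and the 10 minute threshold are hypothetical.

```python
# Non-idle sessions whose current query started longer ago than a threshold.
LONG_RUNNING_SQL = """
SELECT pid, state, wait_event_type, wait_event,
       now() - query_start AS running_for, query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND now() - query_start > %s::interval
ORDER BY query_start
"""


def long_running_queries(connection, min_age="10 minutes"):
    """Return pg_stat_activity rows for queries older than min_age.

    `connection` is assumed to be a DB-API connection to PostgreSQL.
    """
    cursor = connection.cursor()
    try:
        cursor.execute(LONG_RUNNING_SQL, (min_age,))
        return cursor.fetchall()
    finally:
        cursor.close()
```

A row in state `active` with a large `running_for` would point at the server still executing the query, while no matching row would suggest the problem is on the client side.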

Tim Jenness March 23, 2022 at 3:34 PM

I think the reason it was felt to be on the database side is that the python process didn't seem to be doing anything (no CPU activity or memory growth). Wouldn't we expect the client process to show activity if it were reading the results from the query?
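
The "no CPU or memory growth" observation can be made quantitative by sampling the client process twice. The sketch below is an illustration only and assumes a Linux host with `/proc` available; the function names are hypothetical, and the field positions follow the proc(5) description of `/proc/<pid>/stat`.

```python
import os
import time


def sample_process(pid):
    """Return (cpu_seconds, rss_bytes) for a process, read from Linux /proc."""
    with open(f"/proc/{pid}/stat") as stat_file:
        data = stat_file.read()
    # The command name is wrapped in parentheses and may contain spaces,
    # so split on the last ')' before tokenizing the remaining fields.
    fields = data.rsplit(")", 1)[1].split()
    # fields[0] is the state (field 3 in proc(5)); utime and stime are
    # fields 14 and 15, and rss (in pages) is field 24.
    cpu_ticks = int(fields[11]) + int(fields[12])
    rss_pages = int(fields[21])
    cpu_seconds = cpu_ticks / os.sysconf("SC_CLK_TCK")
    rss_bytes = rss_pages * os.sysconf("SC_PAGE_SIZE")
    return cpu_seconds, rss_bytes


def activity_delta(pid, interval=5.0):
    """Return the change in (cpu_seconds, rss_bytes) over `interval` seconds."""
    cpu_before, rss_before = sample_process(pid)
    time.sleep(interval)
    cpu_after, rss_after = sample_process(pid)
    return cpu_after - cpu_before, rss_after - rss_before
```

Deltas close to zero over several intervals would be consistent with a client that is waiting on the server rather than reading or processing rows.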

Done

Details

RubinTeam

Ops Middleware

Created March 22, 2022 at 4:31 PM
Updated January 9, 2023 at 10:06 PM
Resolved January 9, 2023 at 10:06 PM