Investigate unexpected config comparison in w23 RC2 run

Description

See https://lsstc.slack.com/archives/C4JQP6FRS/p1686773632719249

We're seeing a config comparison failure in pipetask --init-only, when:

  • there should be no comparison, because this should be a new RUN collection;

  • the configs should compare as equal anyway.

My guess at the former is that QG generation is putting existing refs into the QG, ones found via --skip-existing-in but not present in the output RUN collection, rather than generating new refs.
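
To make the suspected failure mode concrete, here is a toy model in Python. Every name in it (`Ref`, `resolve_output`, the `reuse_from_skip_existing` flag, the run names) is hypothetical, and it is not the pipe_base code. It only contrasts reusing a ref found via --skip-existing-in with creating a fresh ref in the output RUN.

```python
import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Ref:
    """Hypothetical stand-in for a dataset reference."""
    dataset_type: str
    run: str
    id: uuid.UUID = field(default_factory=uuid.uuid4)


def resolve_output(dataset_type, output_run, existing, reuse_from_skip_existing):
    """Pick the ref stored in the graph for an output that still has to be written.

    ``existing`` maps run name -> Ref found in that collection.
    """
    if output_run in existing:
        # Already present in the output RUN itself: keep that ref.
        return existing[output_run]
    if reuse_from_skip_existing and existing:
        # Suspected bug: reuse a ref (and its UUID) that belongs to another RUN.
        return next(iter(existing.values()))
    # Expected behavior: a new ref that belongs to the output RUN.
    return Ref(dataset_type, output_run)


# Made-up run names, for illustration only.
old = Ref("task_config", "previous_run")
found = {"previous_run": old}
buggy = resolve_output("task_config", "new_run", found, reuse_from_skip_existing=True)
fixed = resolve_output("task_config", "new_run", found, reuse_from_skip_existing=False)
```

In this model the buggy path hands back a ref that still points at the earlier run, which would be consistent with a comparison being triggered against that run's stored config.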

Activity

Andy Salnikov June 20, 2023 at 6:22 PM

Looks good, a couple of minor questions.

Jim Bosch June 20, 2023 at 4:26 PM

Could you review this? It's a pair of small fixes to QG generation in pipe_base, combined with more code to test it:

  • pipe_base: the actual bugfixes, and a small upgrade to the new mocks package

  • ctrl_mpexec: a new command-line option to provide access to existing mocks functionality

  • analysis_tools: tiny fix to a config definition for a different bug (which helped reveal the presence of the main one)

  • ci_middleware: tests for QG generation and execution with --skip-existing-in.

Jim Bosch June 15, 2023 at 7:57 PM

I tracked down the non-import config differences to a set being used to provide the default value for a ListField, and I've pushed an analysis_tools branch with a fix for that. I'm guessing the import differences are something akin to and hence really hard to fix, so I'm just going to ignore them on this ticket.
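
The ordering problem can be illustrated without pex_config. The sketch below is plain Python with hypothetical label names. It assumes only that the field ends up holding a list and that lists are compared element by element.

```python
# Hypothetical labels; the real statsLabel values are not shown in this ticket.
labels = {"median", "sigmaMad", "count"}

# Iteration order of a set of strings depends on string hashing, which Python
# randomizes per process by default, so list(labels) may differ between runs.
unstable_default = list(labels)

# Sorting (or writing the default as a list literal) pins the order.
stable_default = sorted(labels)

# Lists with the same elements in a different order do not compare equal,
# which is enough to make an otherwise identical config look different.
same_elements_other_order = ["sigmaMad", "median", "count"]
order_only_difference = (
    stable_default != same_elements_other_order
    and sorted(stable_default) == sorted(same_elements_other_order)
)
```

Writing the default as an ordered literal, or sorting it once, is the simplest way to keep two processes from disagreeing about the same config.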

Jim Bosch June 15, 2023 at 7:14 PM

I think I have a fix for the QG generation issue, but I need to add a unit test somewhere.  Doing a one-off test provided an opportunity to use the same pipeline and software version to make a new QG and hence look at another config, and it's different from both of the above!

But that helped me nail down part of it: those statsLabel lists have the same elements but different orders, so I bet some config overrides are being applied in a nondeterministic order somewhere. Do you have any guesses as to where that might be?

The other differences (in the config imports) may also be due to the same lack of determinism; I think I can imagine ways for that to happen.

Jim Bosch June 15, 2023 at 4:51 PM
Edited

As for the question of why we're doing the comparison at all, my theory from the Slack thread is correct: the rescue QG has the UUID from the previous RUN embedded in it.  So we need to go fix that in QG generation, and probably add a check somewhere that all output DatasetRef.run values are consistent with QuantumGraph.metadata["output_run"].
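
A check of that kind might look roughly like the sketch below. It is a minimal stand-in, not the pipe_base implementation: `Ref`, `Graph`, `find_inconsistent_runs`, and the run names are hypothetical, and only the `run` attribute and the `metadata["output_run"]` key come from the comment above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Ref:
    """Hypothetical stand-in for an output DatasetRef."""
    dataset_type: str
    run: str


@dataclass
class Graph:
    """Hypothetical stand-in for a QuantumGraph."""
    metadata: dict
    output_refs: list = field(default_factory=list)


def find_inconsistent_runs(graph):
    """Return output refs whose run differs from metadata["output_run"]."""
    expected = graph.metadata["output_run"]
    return [ref for ref in graph.output_refs if ref.run != expected]


# Made-up run names, for illustration only.
graph = Graph(
    metadata={"output_run": "u/someone/rescue"},
    output_refs=[
        Ref("task_config", "u/someone/rescue"),
        Ref("task_config", "u/someone/first_attempt"),  # stale ref from the previous RUN
    ],
)
bad = find_inconsistent_runs(graph)
```

Returning the offending refs instead of a bare boolean makes it possible to report which outputs still point at the earlier run.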

Done

Details

Reviewers

Andy Salnikov

RubinTeam

Data Release Production

Created June 15, 2023 at 2:57 PM
Updated September 10, 2023 at 9:51 PM
Resolved June 21, 2023 at 12:59 AM