Improving consistency in butler dataset queries
Description
Checklist
Issue Matrix
hideLucidchart Diagrams
Activity
Jim Bosch September 19, 2022 at 3:29 PM
Reading through the original proposal, I see that I didn't specify what the deprecation procedure should be for the no-results behavior change in queryDataIds. I think the right approach is to also emit a FutureWarning
when a user passes a str
or DatasetType
instance to the datasets
argument that does not correspond to a registered dataset type, stating that in the future this will be an error and they should use getDatasetType
or queryDatasetTypes
to avoid this condition.
Jim Bosch September 19, 2022 at 3:24 PMEdited
The main usage I expect to break here would be a command-line invocation like this one:
or similar calls from Python, which will now fail due the wildcard on the dataset type. But in most of the places I've seen this, the user really wants to either write:
or they should use query-dataset-types to first resolve the wildcard and then query for just one dataset type.
I have a hard time imagining any user code or DM production code (as opposed to easy-to-find-and-fix DM test code) that would be broken by the proposed change in no-results handling or by dropping the components
keyword argument.
John Parejko September 15, 2022 at 11:18 PM
I don't see any obvious downsides to this proposal. It seems like it generally simplifies these interfaces, while making their inputs more explicit. I wonder if we have some examples of uses that we're going to explicitly break here, just to see what people are doing right now?
Jim Bosch September 8, 2022 at 5:42 PM
Escalating immediately due to API changes. Feedback from others still quite welcome!
Dataset type expressions
Butler "dataset type expressions" can be used in a few butler query contexts, and can be any of the following:
str
dataset type names.DatasetType
instances;re.Pattern
objects, representing regular expressions to match against dataset type names (these can also be constructed from string glob patterns);iterables containing any combination of the above;
The
...
singleton, which matches all dataset types.These are resolved to a list of
DatasetType
objects that are actually registered in the data repository (or, in the near future, are consistent with one registered in the data repository, if aDatasetType
instance is given with the same name and different storage class as a registered one).Experts: there's also some component-composite handling stuff I'm going to ignore in this section because it isn't relevant; let's pretend for now that there are no component dataset types.
In
queryDatasetTypes
, we just return this resolved list directly (or return an iterator to it; see RFC-878).In
queryDatasets
, we run a completely separate query for each dataset type, and chain together the results; this appears to the user like a SQL UNION over dataset types, but with the rows for different dataset types not necessarily having the same columns (because of differing dimensions).In
queryDataIds
, we constrain the data IDs returned to those for which all of the matching dataset types "exist". This is a SQL JOIN over dataset types; the full query has columns corresponding to the dimensions of all of the dataset types as well as the dimensions of the requested data IDs, even if the returned columns are a subset of these.All of these behaviors are useful, and need to be preserved, but I worry that the difference between the latter two may be surprising to users, especially when the original expression involves unmatched patterns. Let's consider a multi-element expression that includes some successful elements and a pattern:
When used with
queryDatasets
in a collection with no coadd outputs, you'll just get calexp results: fine. When used withqueryDatasets
in a collection that does have coadd data products, you'll get calexps and a bunch of coadd data products: also fine.When used with
queryDataIds
in a collection with no coadd output,s in a repo that never had any coadd outputs (i.e. never had anydeepCoadd_*
datasets registered), that pattern will have no effect, because the resolved list of dataset types will not include coadd dataset types. When using in a collection with no coadd outputs in a repo that does have thosedeepCoadd_*
dataset types, that pattern will cause you to get no results from the query at all, because the resolved list will include them and the collection doesn't have any of those datasets.The best resolution I see to this is to restrict the dataset type expressions passed to
queryDataIds
to explicitstr
names,DatasetType
instances, and iterables thereof, and to raise when any givenstr
orDatasetType
does not match an existing dataset type. This can be done via a regular deprecation process: we'll warn when patterns appear before removing support for them entirely.Passing
datasets
toqueryDataIds
is (or should be) very rare; it's important forQuantumGraph
generation, and there a few other use cases, but I've most frequently seen it used incorrectly, or at least in cases where the user really should be usingqueryDatasets
instead. So while this change may break some existing code, I think a lot of that code should be reexamined anyway, and the deprecation warning should probably reflect that (i.e. contain a suggestion to considerqueryDatasets
instead).In other contexts where we accept
str
orDatasetType
instances but not regular expressions or "...", I think we should also raise when anystr
orDatasetType
does not match a registered dataset type. Right now this is already the case in some contexts (Butler.get
,Registry.findDatasets
) but not in others (DataCoordinateQueryResults.findDatasets
currently returns no results). In contexts where patterns are accepted, a string orDatasetType
that does not match behaves like the limiting case of a pattern that does not match: no results are returned. I would like to make this change toDataCoordinateQueryResults.findDatasets
immediately
to sidestep other complications in an ongoing refactor.**The current proposal is to make this behavior contingent on the signature of the function or method: if the interface accepts patterns, then it always returns no results instead of raising when an explicit string or
DatasetType
does not match, and hence the current behavior ofqueryDatasetTypes
andqueryDatasets
would not be changed. An alternative that may be we worth discussing is to have the presence or absence of a pattern in the expression determine how non-pattern elements are treated for these methods. Another alternative would be to prohibit patterns (or even only permit single dataset types, rather than lists) inqueryDatasets
, too, requiring users to callqueryDatasetTypes
first to resolve patterns; this seems less error-prone, but is also less convenient.Component datasets in registry queries
Dataset types and dataset references may have components (e.g. "calexp.wcs"), but this is something handled fully by
Datastore
; theRegistry
database only has entries for the parent dataset type and knowledge from aStorageClass
that components may be available.At present, Registry query methods support components in dataset type names and patterns, and handle these simply by "exploding" the natural SQL result row for a parent into N per-component result rows. But this is something that could easily be done in client code that iterates over the results when needed, instead, simplifying the Registry query logic. More importantly, it would avoid the implication that component datasets returned by a query actually exist, in the "does this Exposure have a Psf attached" sense.
I would like to remove the
components
argument to all butler query methods and define them as supporting parent or standalone dataset types only. This would be done through a regular deprecation process.