Improving consistency in butler dataset queries

Description

Dataset type expressions

Butler "dataset type expressions" can be used in a few butler query contexts, and can be any of the following:

  • str dataset type names.

  • DatasetType instances;

  • re.Pattern objects, representing regular expressions to match against dataset type names (these can also be constructed from string glob patterns);

  • iterables containing any combination of the above;

  • The ... singleton, which matches all dataset types.

These are resolved to a list of DatasetType objects that are actually registered in the data repository (or, in the near future, are consistent with one registered in the data repository, if a DatasetType instance is given with the same name and different storage class as a registered one).

Experts: there's also some component-composite handling stuff I'm going to ignore in this section because it isn't relevant; let's pretend for now that there are no component dataset types.

In queryDatasetTypes, we just return this resolved list directly (or return an iterator to it; see RFC-878).

In queryDatasets, we run a completely separate query for each dataset type, and chain together the results; this appears to the user like a SQL UNION over dataset types, but with the rows for different dataset types not necessarily having the same columns (because of differing dimensions).

In queryDataIds, we constrain the data IDs returned to those for which all of the matching dataset types "exist".  This is a SQL JOIN over dataset types; the full query has columns corresponding to the dimensions of all of the dataset types as well as the dimensions of the requested data IDs, even if the returned columns are a subset of these.

All of these behaviors are useful, and need to be preserved, but I worry that the difference between the latter two may be surprising to users, especially when the original expression involves unmatched patterns.  Let's consider a multi-element expression that includes some successful elements and a pattern:

When used with queryDatasets in a collection with no coadd outputs, you'll just get calexp results: fine.  When used with queryDatasets in a collection that does have coadd data products, you'll get calexps and a bunch of coadd data products: also fine.

When used with queryDataIds in a collection with no coadd output,s in a repo that never had any coadd outputs (i.e. never had any deepCoadd_* datasets registered), that pattern will have no effect, because the resolved list of dataset types will not include coadd dataset types.  When using in a collection with no coadd outputs in a repo that does have those deepCoadd_* dataset types, that pattern will cause you to get no results from the query at all, because the resolved list will include them and the collection doesn't have any of those datasets.

The best resolution I see to this is to restrict the dataset type expressions passed to queryDataIds to explicit str names, DatasetType instances, and iterables thereof, and to raise when any given str or DatasetType does not match an existing dataset type.  This can be done via a regular deprecation process: we'll warn when patterns appear before removing support for them entirely.

Passing datasets to queryDataIds is (or should be) very rare; it's important for QuantumGraph generation, and there a few other use cases, but I've most frequently seen it used incorrectly, or at least in cases where the user really should be using queryDatasets instead.  So while this change may break some existing code, I think a lot of that code should be reexamined anyway, and the deprecation warning should probably reflect that (i.e. contain a suggestion to consider queryDatasets instead).

In other contexts where we accept str or DatasetType instances but not regular expressions or "...", I think we should also raise when any str or DatasetType does not match a registered dataset type.  Right now this is already the case in some contexts ( Butler.get, Registry.findDatasets) but not in others (DataCoordinateQueryResults.findDatasets currently returns no results).  In contexts where patterns are accepted, a string or DatasetType that does not match behaves like the limiting case of a pattern that does not match: no results are returned.  I would like to make this change to DataCoordinateQueryResults.findDatasets immediately to sidestep other complications in an ongoing refactor.**

The current proposal is to make this behavior contingent on the signature of the function or method: if the interface accepts patterns, then it always returns no results instead of raising when an explicit string or DatasetType does not match, and hence the current behavior of queryDatasetTypes and queryDatasets would not be changed.  An alternative that may be we worth discussing is to have the presence or absence of a pattern in the expression determine how non-pattern elements are treated for these methods.  Another alternative would be to prohibit patterns (or even only permit single dataset types, rather than lists) in queryDatasets, too, requiring users to call queryDatasetTypes first to resolve patterns; this seems less error-prone, but is also less convenient.

Component datasets in registry queries

Dataset types and dataset references may have components (e.g. "calexp.wcs"), but this is something handled fully by Datastore; the Registry database only has entries for the parent dataset type and knowledge from a StorageClass that components may be available.

At present, Registry query methods support components in dataset type names and patterns, and handle these simply by "exploding" the natural SQL result row for a parent into N per-component result rows.  But this is something that could easily be done in client code that iterates over the results when needed, instead, simplifying the Registry query logic.  More importantly, it would avoid the implication that component datasets returned by a query actually exist, in the "does this Exposure have a Psf attached" sense.

I would like to remove the components argument to all butler query methods and define them as supporting parent or standalone dataset types only.  This would be done through a regular deprecation process.

Checklist

Issue Matrix

hide

Lucidchart Diagrams

Activity

Jim Bosch September 19, 2022 at 3:29 PM

Reading through the original proposal, I see that I didn't specify what the deprecation procedure should be for the no-results behavior change in queryDataIds.  I think the right approach is to also emit a FutureWarning when a user passes a str or DatasetType instance to the datasets argument that does not correspond to a registered dataset type, stating that in the future this will be an error and they should use getDatasetType or queryDatasetTypes to avoid this condition.

Jim Bosch September 19, 2022 at 3:24 PM
Edited

The main usage I expect to break here would be a command-line invocation like this one:

or similar calls from Python, which will now fail due the wildcard on the dataset type.  But in most of the places I've seen this, the user really wants to either write:

or they should use query-dataset-types to first resolve the wildcard and then query for just one dataset type.

I have a hard time imagining any user code or DM production code (as opposed to easy-to-find-and-fix DM test code) that would be broken by the proposed change in no-results handling or by dropping the components keyword argument.

John Parejko September 15, 2022 at 11:18 PM

I don't see any obvious downsides to this proposal. It seems like it generally simplifies these interfaces, while making their inputs more explicit. I wonder if we have some examples of uses that we're going to explicitly break here, just to see what people are doing right now?

Jim Bosch September 8, 2022 at 5:42 PM

Escalating immediately due to API changes.  Feedback from others still quite welcome!

Done
Pinned fields
Click on the next to a field label to start pinning.

Details

Assignee

Reporter

Planned End

Sep 18, 2022, 9:00 PM

Components

Checklist

Created September 8, 2022 at 5:39 PM
Updated September 24, 2024 at 7:22 PM
Resolved September 12, 2024 at 6:27 PM