Replace bit-packed flags column with individual flag fields in APDB/alerts
Implemented in DM-41530
Adopting with four triggered tickets. We'll implement it and measure how much it actually increases the size of an APDB row and/or alert packet and note that here for future reference.
"read in a whole table" will be effectively impossible once LSSTCam starts taking data. All the DiaSources in a single exposure with the current APDB schema will be about half a GB (6040 bytes/diaSource times ~1e5 diaSources per visit). We cannot plan around anyone trying to read "whole tables" into memory.
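The arithmetic behind that estimate can be sketched as below. The 6040 bytes and ~1e5 DiaSources per visit are the figures quoted here; the 8-byte packed field and one byte per individual flag are assumptions for illustration, not measured APDB storage costs, and 28 is the approximate flag count mentioned elsewhere in this thread.

```python
# Figures quoted in this thread.
BYTES_PER_DIASOURCE = 6040       # current APDB schema
DIASOURCES_PER_VISIT = 100_000   # ~1e5 per visit

bytes_per_visit = BYTES_PER_DIASOURCE * DIASOURCES_PER_VISIT
gb_per_visit = bytes_per_visit / 1e9


def extra_bytes_per_row(n_flags, packed_bytes=8, bytes_per_flag=1):
    """Extra bytes per row if one packed field becomes n_flags columns.

    ASSUMPTION: an 8-byte packed integer and one byte per boolean column;
    real backends (Cassandra, Postgres, parquet) may store these differently.
    """
    return n_flags * bytes_per_flag - packed_bytes


extra = extra_bytes_per_row(28)
relative_increase = extra / BYTES_PER_DIASOURCE

print(f"{gb_per_visit:.2f} GB of DiaSources per visit")
print(f"{extra} extra bytes per row ({relative_increase:.2%} of a row)")
```

Under these assumptions the individual flag columns add well under one percent to a row, which is small next to the per-visit total; the measurement promised above would replace the assumed byte counts.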
With bitpacked flags we also have to store the packing schema and keep track of it as it changes (which would be particularly difficult in APDB, I think), which we are not currently doing. There's also a good chance that we would run out of even a 64 bit packed field: there are currently 98 "Flag" columns in the `src` table alone, let alone relevant coadd, diffim, or association flags.
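A minimal sketch of why the packing schema has to travel with the data. The flag maps here are hypothetical (only `centroid_flag` is named in this ticket) and do not reproduce the format of ap_association's flag map.

```python
# HYPOTHETICAL flag maps: names and bit positions are illustrative only.
FLAG_MAP_V1 = {"centroid_flag": 0, "shape_flag": 1, "psfFlux_flag": 2}
# A later schema that inserts a new flag and shifts the others.
FLAG_MAP_V2 = {
    "centroid_flag": 0,
    "newPlugin_flag": 1,
    "shape_flag": 2,
    "psfFlux_flag": 3,
}


def pack(flags, flag_map):
    """Pack a {name: bool} dict into one integer using flag_map."""
    value = 0
    for name, bit in flag_map.items():
        if flags.get(name, False):
            value |= 1 << bit
    return value


def unpack(value, flag_map):
    """Recover a {name: bool} dict from a packed integer."""
    return {name: bool((value >> bit) & 1) for name, bit in flag_map.items()}


def fits_in_word(n_flags, word_bits=64):
    """True if n_flags one-bit flags fit in a single packed field."""
    return n_flags <= word_bits


packed = pack({"shape_flag": True}, FLAG_MAP_V1)
right = unpack(packed, FLAG_MAP_V1)   # decoded with the map used to write it
wrong = unpack(packed, FLAG_MAP_V2)   # decoded with a newer map

print(right["shape_flag"], wrong["shape_flag"], wrong["newPlugin_flag"])
print(fits_in_word(98))
```

Decoding with the wrong map silently reports the wrong flag as set, and 98 flags do not fit in one 64-bit field, so several packed columns (each with its own map) would be needed.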
I am OK with this solution so long as we are pretty confident it won't make alert packets (and/or the APDB / PPDB) too unwieldy. We have some 28 defined flag fields presently. My main concern with this approach is the science use case where someone wants to read in a whole table (e.g., in a notebook in pandas), and because we have 20-some-odd more columns, it becomes a memory issue.
I would honestly prefer a universal bitpacked flag scheme for all our source and object flavored data tables that is trivial to pack/unpack, but we are so far from that at present and this is blocking so many AP analysis_tools metrics, that I'm inclined to kick this can down the road and hope K-T's future of a Pipelines-wide packing and unpacking utility eventually comes to fruition.
The higher priority in the short term is getting the AP and DRP associated DiaSource tables to both have accessible columns so we can write metrics that work for both. (Incidentally, I am used to everything breaking when there is a new version/schema and would rather have this sooner than later, sorry John.)
One interesting aspect of that, mentioned on Slack already, is the need for schema migration when new flags are added. Migrating huge databases is a non-trivial issue, and our case is complicated also by a mix of Cassandra and Postgres (for APDB and PPDB) that have to be migrated in sync. I understand that schema changes are triggered by addition of new AP plugins and would typically result in more than one column being added. One potential approach for managing schema updates could be to use extra tables for per-plugin columns. Adding a new empty table is supposedly a much faster operation than adding a bunch of columns to an existing table (at least in Postgres). The queries would require joins of more than one table then, which is a potential performance issue. This is just an idea, possibly for the future when we actually run into trouble doing migrations.
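The per-plugin table idea could look roughly like the sketch below. SQLite is used only because it ships with Python; it stands in for Postgres and says nothing about migration speed or about how this would map onto Cassandra. The table and column names other than DiaSource are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Core table: left untouched when a new plugin is added.
cur.execute(
    "CREATE TABLE DiaSource (diaSourceId INTEGER PRIMARY KEY, psfFlux REAL)"
)
cur.executemany(
    "INSERT INTO DiaSource VALUES (?, ?)", [(1, 120.5), (2, 98.0)]
)

# HYPOTHETICAL plugin: its flags go in a new, initially empty table
# instead of being added to DiaSource with ALTER TABLE.
cur.execute(
    "CREATE TABLE DiaSource_newPlugin ("
    "diaSourceId INTEGER PRIMARY KEY REFERENCES DiaSource(diaSourceId), "
    "newPlugin_flag INTEGER, "
    "newPlugin_flag_edge INTEGER)"
)
# Only sources processed after the plugin was enabled get a row.
cur.execute("INSERT INTO DiaSource_newPlugin VALUES (2, 1, 0)")

# Reading the flags now requires a join; older sources come back as NULL.
rows = cur.execute(
    "SELECT s.diaSourceId, p.newPlugin_flag "
    "FROM DiaSource AS s "
    "LEFT JOIN DiaSource_newPlugin AS p ON s.diaSourceId = p.diaSourceId "
    "ORDER BY s.diaSourceId"
).fetchall()
print(rows)
conn.close()
```

The join is where the performance concern raised above would show up, and the NULLs for pre-plugin rows are something readers of the table would have to handle.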
APDB has a set of bit-packed "flags" fields (in these tables: DiaSource, DiaObject, SSObject, DiaForcedSource), with the bits defined in ap_association's flag map. DRP's flags are written directly as individual fields (e.g. centroid_flag in DiaSource), not bit-packed. From a discussion on dm-science-pipelines on Slack, it seems that the AP/DRP difference here was not an intentional design choice, but rather one of over-reliance on the DPDD vs. expedience.
This RFC proposes that we remove the `flags` fields from the APDB and instead write the individual flag fields directly, like DRP does. This will simplify the user interface: users won't have to unpack the bitfield to determine if individual bits are set. It will also remove one potential source of bugs, as currently the flag map is not stored directly in APDB (see DM-39498). It will also let us more easily share analysis_tools with DRP, as our parquet file schemas will match.
We could have a period where we are writing both the packed `flags` and the single fields, to ease the transition.
Unfortunately, we still don't have APDB versioning in place (DM-41029); if we could delay implementing this schema change until that is done, it would make sorting out the schema timeline easier.
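For the transition period in which both the packed `flags` and the individual fields are written, a row could be expanded along these lines. This is a sketch under assumptions: the row is a plain dict, and the flag map and flag names (apart from `centroid_flag`) are hypothetical rather than taken from ap_association.

```python
# HYPOTHETICAL flag map: bit positions and most names are illustrative.
FLAG_MAP = {"centroid_flag": 0, "shape_flag": 1, "psfFlux_flag": 2}


def expand_flags(row, flag_map, keep_packed=True):
    """Return a copy of row with one boolean field per entry in flag_map.

    With keep_packed=True the original packed `flags` value is kept
    alongside the new fields (the transition period); with False it is
    dropped (the end state this RFC proposes).
    """
    out = {k: v for k, v in row.items() if keep_packed or k != "flags"}
    for name, bit in flag_map.items():
        out[name] = bool((row["flags"] >> bit) & 1)
    return out


row = {"diaSourceId": 1, "flags": 0b101}
both = expand_flags(row, FLAG_MAP)
only_fields = expand_flags(row, FLAG_MAP, keep_packed=False)
print(both)
print(only_fields)
```

Writing both forms for a while lets downstream code switch to the named fields before the packed column disappears.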
@Eric Bellm has checked that Avro has a `boolean` type (though whether it requires one bit or one byte is unknown), so we can also write individual fields to the alert packet, though we should check how much this increases the alert packet size. Not sending bit-packed flags would be more user friendly for alert recipients, too, as otherwise we would have to publish the bit definitions (which may change over time).
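As an illustration of what individual flag fields might look like in an alert schema, the sketch below builds a small Avro record definition as JSON. The record name, the field list and the `False` defaults are assumptions, not the actual alert schema. For sizing purposes, the Avro specification's binary encoding writes a `boolean` as a single byte, so each non-nullable flag would cost one byte before any compression.

```python
import json

# HYPOTHETICAL subset of flag names; only centroid_flag appears in this ticket.
FLAG_NAMES = ["centroid_flag", "shape_flag"]


def flag_fields(names):
    """Build one Avro boolean field definition per flag name."""
    return [{"name": n, "type": "boolean", "default": False} for n in names]


# Illustrative record only; not the real alert packet schema.
schema = {
    "type": "record",
    "name": "diaSource",
    "fields": [{"name": "diaSourceId", "type": "long"}]
    + flag_fields(FLAG_NAMES),
}

# ASSUMPTION: one byte per non-nullable boolean in Avro binary encoding.
flag_bytes_per_source = len(FLAG_NAMES)

print(json.dumps(schema, indent=2))
print(flag_bytes_per_source, "flag bytes per source (uncompressed)")
```

Because the field names are carried in the schema itself, alert recipients would not need a separately published table of bit definitions.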