Status: Done
Assignee: Yusra AlSayyad
Reporter: John Swinbank
Reviewers: Leanne Guy
RubinTeam: System Management
Due date: Dec 09, 2024
Created January 10, 2019 at 7:25 PM
Updated September 15, 2025 at 10:12 PM
Resolved September 15, 2025 at 10:12 PM

LDM-503-17: Final Operations Rehearsal (WBS 02C.00, due 2025-07-04) requires demonstration of the Data Management (DM) subsystem operating under conditions representative of full survey operations. It builds upon LDM-503-16 and calls for an integrated test of all major DM components, including both batch (Data Release Production) and prompt (Alert Production) pipelines, running at LSSTCam scale.
This milestone was fulfilled by Operations Rehearsal 5 (OR5). Results are documented in DMTN-313. OR5 achieved the following:
LSSTCam-scale load was applied using DESC’s Run2.2i DC2 dataset, replayed to mimic 800 visits per night across 189 detectors, consistent with the design survey cadence.
Batch processing tests demonstrated that the Nightly Validation pipeline (coadds, no DIA) sustained a throughput of 800 visits in 24 hours, validating that week-scale survey data can be processed in less than one week. Critical bottlenecks (quantum graph generation limits, scheduling inefficiencies, and calibration data I/O) were identified and either resolved during the rehearsal or tracked for follow-up.
Prompt processing tests executed the full AP pipeline at design cadence (~37 seconds per exposure), exercising Cassandra APDB, Kafka eventing, and alert distribution. While the core pipeline ran successfully, the rehearsal uncovered scaling limits in dataset export and Butler registry throughput, leading to a redesign path documented in DMTN-310.
Subsystem integration was demonstrated across storage (embargo rack, Weka), workflow managers (HTCondor and PanDA), databases (PostgreSQL, Cassandra), orchestration platforms (Knative, KEDA), and monitoring systems. This satisfied the milestone requirement for “operation of all aspects of the DM subsystem under simulated operational conditions.”
Operational realism was achieved by running with the same infrastructure layers and services that will be in use during commissioning and survey operations at the USDF. Logs, monitoring, and real-time troubleshooting practices mirrored those expected during observatory operations.
Collaborative participation included representatives from pipelines, middleware, campaign management, and data facilities, ensuring that the end-to-end system was validated under integrated team operations.
In summary, DMTN-313 documents that OR5 successfully demonstrated integrated end-to-end operations of the DM subsystem at LSSTCam scale.
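
As a rough cross-check of the load figures quoted above, the arithmetic can be sketched in a few lines of Python. This is only an illustrative back-of-the-envelope calculation, not something taken from DMTN-313: the function names are hypothetical, and it assumes one exposure per visit at the quoted ~37 second cadence.

```python
# Back-of-the-envelope check of the OR5 load figures quoted in this ticket.
# All function names are illustrative; the numbers come from the summary text.

VISITS_PER_NIGHT = 800        # replayed visits per night
DETECTORS_PER_VISIT = 189     # detectors per visit
SECONDS_PER_EXPOSURE = 37     # approximate design cadence ("~37 seconds")
BATCH_WINDOW_HOURS = 24       # window in which batch processed 800 visits


def detector_images_per_night(visits: int, detectors: int) -> int:
    """Number of single-detector images produced in one night."""
    return visits * detectors


def observing_hours(visits: int, seconds_per_exposure: float) -> float:
    """Wall-clock hours to take the visits back to back.

    Assumption: one exposure per visit, with no additional overhead.
    """
    return visits * seconds_per_exposure / 3600


def batch_seconds_per_visit(visits: int, window_hours: float) -> float:
    """Average time budget per visit if the visits fill the batch window."""
    return window_hours * 3600 / visits


if __name__ == "__main__":
    images = detector_images_per_night(VISITS_PER_NIGHT, DETECTORS_PER_VISIT)
    hours = observing_hours(VISITS_PER_NIGHT, SECONDS_PER_EXPOSURE)
    budget = batch_seconds_per_visit(VISITS_PER_NIGHT, BATCH_WINDOW_HOURS)
    print(f"detector images per night: {images}")         # 151200
    print(f"observing time at cadence: {hours:.1f} h")    # 8.2 h
    print(f"batch budget per visit:    {budget:.0f} s")   # 108 s
```

Under these assumptions, a night of 800 visits corresponds to roughly 151,200 detector images taken over about 8.2 hours, while the batch result of 800 visits in 24 hours corresponds to an average of 108 seconds per visit.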