Lockups in the controller-workers protocol of the Qserv Replication system
Description
Checklist
Lucidchart Diagrams
Issue Matrix
hideActivity
Show:
John Gates
January 21, 2021 at 6:14 PM
Looks good.
Igor Gaponenko
December 6, 2020 at 8:48 AM
(edited)
Tests
These two simple tests demonstrate lockups after a few seconds of testing. Both tests launch 20,000 requests. The flow-control logic of the application was configured to allow up to 2,000 requests to be in flight at any moment of time. In both cases, requests are configured to expire 3 seconds after they're submitted.
The first test shows the higher throughput of requests in the initial stage (a few seconds) before request expiration times are starting to fire up. These requests are not queued and get "processed" by the worker instantaneously.
The performance of the queued requests (the second case) is limited by the number of worker threads and the duration of the timeout.
Short (no queued) requests
db01 --requests=20000 --max-active-requests=2000 --proc-time-sec=0 --expiration-ival-sec=3 --events-report-ival-sec=1
2020-12-06 08:22:13.262 STAT active: 1999 finished: 914 success: 914 expired: 0 failed: 0
2020-12-06 08:22:14.262 STAT active: 2000 finished: 1755 success: 1755 expired: 0 failed: 0
2020-12-06 08:22:15.262 STAT active: 2000 finished: 2454 success: 2454 expired: 0 failed: 0
2020-12-06 08:22:16.263 STAT active: 2000 finished: 3068 success: 3068 expired: 0 failed: 0
2020-12-06 08:22:17.263 STAT active: 2000 finished: 3700 success: 3369 expired: 331 failed: 0
2020-12-06 08:22:18.264 STAT active: 1988 finished: 4394 success: 3369 expired: 1025 failed: 0
2020-12-06 08:22:19.264 STAT active: 2000 finished: 5079 success: 3596 expired: 1483 failed: 0
2020-12-06 08:22:20.265 STAT active: 1990 finished: 5692 success: 4181 expired: 1511 failed: 0
2020-12-06 08:22:21.265 STAT active: 1999 finished: 6384 success: 4222 expired: 2162 failed: 0
2020-12-06 08:22:22.265 STAT active: 1987 finished: 7069 success: 4265 expired: 2804 failed: 0
2020-12-06 08:22:23.266 STAT active: 2000 finished: 7725 success: 4758 expired: 2967 failed: 0
2020-12-06 08:22:24.266 STAT active: 2000 finished: 8395 success: 5264 expired: 3131 failed: 0
2020-12-06 08:22:25.268 STAT active: 2000 finished: 9000 success: 5693 expired: 3307 failed: 0
2020-12-06 08:22:26.269 STAT active: 1999 finished: 9727 success: 5693 expired: 4034 failed: 0
2020-12-06 08:22:27.269 STAT active: 1988 finished: 10380 success: 6139 expired: 4241 failed: 0
2020-12-06 08:22:28.269 STAT active: 1999 finished: 11001 success: 6139 expired: 4862 failed: 0
2020-12-06 08:22:29.310 STAT active: 1939 finished: 11492 success: 6139 expired: 5353 failed: 0
2020-12-06 08:22:30.310 STAT active: 2000 finished: 11977 success: 6139 expired: 5838 failed: 0
2020-12-06 08:22:31.786 STAT active: 1958 finished: 12583 success: 6139 expired: 6444 failed: 0
2020-12-06 08:22:32.786 STAT active: 1949 finished: 13136 success: 6139 expired: 6997 failed: 0
2020-12-06 08:22:33.786 STAT active: 1947 finished: 13737 success: 6139 expired: 7598 failed: 0
2020-12-06 08:22:34.840 STAT active: 1937 finished: 13937 success: 6139 expired: 7798 failed: 0
2020-12-06 08:22:36.280 STAT active: 1902 finished: 14537 success: 6139 expired: 8398 failed: 0
2020-12-06 08:22:37.280 STAT active: 2000 finished: 14962 success: 6139 expired: 8823 failed: 0
2020-12-06 08:22:38.280 STAT active: 1983 finished: 15482 success: 6139 expired: 9343 failed: 0
2020-12-06 08:22:39.459 STAT active: 1923 finished: 15896 success: 6139 expired: 9757 failed: 0
2020-12-06 08:22:41.093 STAT active: 1904 finished: 16312 success: 6139 expired: 10173 failed: 0
2020-12-06 08:22:42.147 STAT active: 1938 finished: 16804 success: 6139 expired: 10665 failed: 0
2020-12-06 08:22:43.252 STAT active: 1924 finished: 17032 success: 6139 expired: 10893 failed: 0
2020-12-06 08:22:44.341 STAT active: 2000 finished: 17463 success: 6139 expired: 11324 failed: 0
2020-12-06 08:22:46.099 STAT active: 1929 finished: 17920 success: 6139 expired: 11781 failed: 0
2020-12-06 08:22:47.668 STAT active: 1169 finished: 18831 success: 6139 expired: 12692 failed: 0
2020-12-06 08:22:48.825 STAT active: 609 finished: 19391 success: 6139 expired: 13252 failed: 0
2020-12-06 08:22:50.501 STAT active: 0 finished: 20000 success: 6139 expired: 13861 failed: 0
Long (queued) requests
Workers are configured with 64 processing threads
db01 --requests=20000 --max-active-requests=2000 --proc-time-sec=1 --expiration-ival-sec=3 --events-report-ival-sec=1
2020-12-06 08:43:26.245 STAT active: 1999 finished: 1 success: 1 expired: 0 failed: 0
2020-12-06 08:43:27.683 STAT active: 1999 finished: 64 success: 64 expired: 0 failed: 0
2020-12-06 08:43:28.683 STAT active: 1917 finished: 1331 success: 64 expired: 1267 failed: 0
2020-12-06 08:43:29.837 STAT active: 1999 finished: 2005 success: 64 expired: 1941 failed: 0
2020-12-06 08:43:30.927 STAT active: 1999 finished: 2065 success: 64 expired: 2001 failed: 0
2020-12-06 08:43:31.936 STAT active: 1849 finished: 3023 success: 64 expired: 2959 failed: 0
2020-12-06 08:43:32.936 STAT active: 1999 finished: 4034 success: 64 expired: 3970 failed: 0
2020-12-06 08:43:33.936 STAT active: 2000 finished: 4067 success: 64 expired: 4003 failed: 0
2020-12-06 08:43:34.941 STAT active: 1898 finished: 4723 success: 64 expired: 4659 failed: 0
2020-12-06 08:43:35.946 STAT active: 1924 finished: 5390 success: 64 expired: 5326 failed: 0
2020-12-06 08:43:36.967 STAT active: 1999 finished: 6068 success: 64 expired: 6004 failed: 0
2020-12-06 08:43:38.058 STAT active: 2000 finished: 6517 success: 64 expired: 6453 failed: 0
2020-12-06 08:43:39.058 STAT active: 1998 finished: 7183 success: 64 expired: 7119 failed: 0
2020-12-06 08:43:40.058 STAT active: 1995 finished: 7717 success: 64 expired: 7653 failed: 0
2020-12-06 08:43:41.058 STAT active: 1939 finished: 8365 success: 64 expired: 8301 failed: 0
2020-12-06 08:43:42.122 STAT active: 1975 finished: 8939 success: 64 expired: 8875 failed: 0
2020-12-06 08:43:43.179 STAT active: 2000 finished: 9459 success: 64 expired: 9395 failed: 0
2020-12-06 08:43:44.218 STAT active: 1925 finished: 10040 success: 64 expired: 9976 failed: 0
2020-12-06 08:43:45.222 STAT active: 1894 finished: 10384 success: 64 expired: 10320 failed: 0
2020-12-06 08:43:46.222 STAT active: 2000 finished: 10952 success: 64 expired: 10888 failed: 0
2020-12-06 08:43:47.244 STAT active: 1947 finished: 11366 success: 64 expired: 11302 failed: 0
2020-12-06 08:43:48.294 STAT active: 1913 finished: 11502 success: 64 expired: 11438 failed: 0
2020-12-06 08:43:49.356 STAT active: 1932 finished: 11891 success: 64 expired: 11827 failed: 0
2020-12-06 08:43:50.360 STAT active: 1938 finished: 11891 success: 64 expired: 11827 failed: 0
2020-12-06 08:43:51.458 STAT active: 1946 finished: 12314 success: 64 expired: 12250 failed: 0
2020-12-06 08:43:52.481 STAT active: 1953 finished: 12314 success: 64 expired: 12250 failed: 0
2020-12-06 08:43:53.576 STAT active: 1962 finished: 12314 success: 64 expired: 12250 failed: 0
2020-12-06 08:43:54.576 STAT active: 1935 finished: 12914 success: 64 expired: 12850 failed: 0
2020-12-06 08:43:55.791 STAT active: 1953 finished: 13219 success: 64 expired: 13155 failed: 0
2020-12-06 08:43:57.017 STAT active: 1960 finished: 13219 success: 64 expired: 13155 failed: 0
2020-12-06 08:43:58.017 STAT active: 2000 finished: 13421 success: 64 expired: 13357 failed: 0
2020-12-06 08:43:59.110 STAT active: 1948 finished: 13953 success: 64 expired: 13889 failed: 0
2020-12-06 08:44:00.169 STAT active: 1954 finished: 13953 success: 64 expired: 13889 failed: 0
2020-12-06 08:44:01.209 STAT active: 1961 finished: 13953 success: 64 expired: 13889 failed: 0
2020-12-06 08:44:02.238 STAT active: 1907 finished: 14634 success: 64 expired: 14570 failed: 0
2020-12-06 08:44:03.393 STAT active: 1913 finished: 14634 success: 64 expired: 14570 failed: 0
2020-12-06 08:44:04.394 STAT active: 2000 finished: 15196 success: 64 expired: 15132 failed: 0
2020-12-06 08:44:05.851 STAT active: 1957 finished: 15316 success: 64 expired: 15252 failed: 0
2020-12-06 08:44:06.887 STAT active: 1879 finished: 15725 success: 64 expired: 15661 failed: 0
2020-12-06 08:44:07.898 STAT active: 1882 finished: 15725 success: 64 expired: 15661 failed: 0
2020-12-06 08:44:08.898 STAT active: 1981 finished: 16371 success: 64 expired: 16307 failed: 0
2020-12-06 08:44:10.042 STAT active: 1931 finished: 16471 success: 64 expired: 16407 failed: 0
2020-12-06 08:44:11.043 STAT active: 1971 finished: 16471 success: 64 expired: 16407 failed: 0
2020-12-06 08:44:12.791 STAT active: 2000 finished: 17091 success: 64 expired: 17027 failed: 0
2020-12-06 08:44:16.130 STAT active: 1931 finished: 17595 success: 64 expired: 17531 failed: 0
2020-12-06 08:44:17.130 STAT active: 1952 finished: 17976 success: 64 expired: 17912 failed: 0
2020-12-06 08:44:19.318 STAT active: 1712 finished: 18288 success: 64 expired: 18224 failed: 0
2020-12-06 08:44:22.116 STAT active: 597 finished: 19403 success: 64 expired: 19339 failed: 0
2020-12-06 08:44:23.210 STAT active: 0 finished: 20000 success: 64 expired: 19936 failed: 0
Done
Pinned fields
Click on the next to a field label to start pinning.
Details
Details
Assignee
Igor Gaponenko
Igor GaponenkoReporter
Igor Gaponenko
Igor GaponenkoReviewers
John Gates
Story Points
6
RubinTeam
Data Access and Database
Components
Sprint
None
Checklist
Open Checklist
Checklist
Open Checklist
Created December 3, 2020 at 8:43 PM
Updated January 22, 2021 at 2:36 AM
Resolved January 22, 2021 at 2:36 AM
The problem
The current implementation of the TCP-based protocol (BOOST ASIO) between Qserv Replication system's controllers and workers may have a bug causing communications over an established connection to stall. The bug seems to be triggered by cancelling controller's requests due to timeouts. Although a probability of the alleged lockups is rather low, any such event causes the affected channel to become blocked, this resulting in a multiplicative effect on the Controller.
The lockups were seen when the Replication/Ingest system was under heavy load by executing multiple (many thousand) requests (subclasses of class
replica::Request
) simultaneously.An additional observation was that once a connection from the Master Controller to a worker was locked up, other controller-type applications started from the command line were still able to communicate with the worker.
Further steps to investigate and debug the problem
The first step would be to improve an existing test application to simulate the above described (heavy load, short timeouts) scenario:
core/modules/replica/tools/qserv-replica-messenger-test.cc core/modules/replica/MessengerTestApp.h core/modules/replica/MessengerTestApp.cc
The improved application will be later used to test an improved version of the protocol after finding and fixing the bug.
Revisit the implementation of the protocol at both ends of the communication:
Messenger.h Messenger.cc MessengerConnector.h MessengerConnector.cc WorkerServer.h WorkerServer.cc WorkerServerConnection.h WorkerServerConnection.cc
Instrument these classes to report ongoing activities at the log stream at the
DEBUG
level. Use this info during the investigation by cross-matching identifiers of requests versus activities.Report all findings as comments posted on the ticket.
Notes
A detailed explanation of the protocol is found at https://rubinobs.atlassian.net/wiki/display/DM/Replication+System+design. See section "Messaging Network" starting at page 9 of the PDF document linked from that page.