AdV-COM (automation)
carbognani, masserot - 17:56 Friday 22 January 2021 (50550)
Largely fluctuating latency of Automation nodes

It has been known for some time (already back in O3) that the Automation nodes were characterized by a strange oscillating latency, increasing and then resetting, impacting the downstream DAQ (see the Daq_FbmAlpTest_TESTTRON2_latency channel in Fig. 1 and 2).
This behaviour has been investigated and traced back to the way the metatron nodes are plugged into the DAQ chain within the ezca library.
Modifications have been made to the main metatron/DAQ integration loop, making it more similar to the one implemented in the PyALP application and boiling down to respecting the following order when dealing with frames in the loop (a minimal sketch follows the list):

  1. frame.close()
  2. frame = fdio.get_frame()
  3. out_frame = frame.copy()
  4. out_frame.write_sms_prefix(.....)
  5. fdio.put_frame(out_frame)
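
As an illustration only, the loop could be organized as in the Python sketch below. This is a sketch under assumptions: the fdio object and sms_prefix argument stand in for the real DAQ I/O handles, and the placement of the user-code call between steps 3 and 4 is assumed; it is not the actual metatron source.

  def daq_integration_loop(fdio, sms_prefix, run_user_code):
      frame = None
      while True:
          # 1. release the frame used in the previous cycle
          if frame is not None:
              frame.close()
          # 2. fetch the next input frame from the DAQ chain
          frame = fdio.get_frame()
          # 3. work on a copy, leaving the input frame untouched
          out_frame = frame.copy()
          # (assumed) the user code fills its channels into the output frame
          run_user_code(out_frame)
          # 4. tag the output frame with the node SMS prefix
          out_frame.write_sms_prefix(sms_prefix)
          # 5. ship the output frame downstream
          fdio.put_frame(out_frame)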

These modifications have been implemented in metatron-1.5 and tested within the VirgoOnlineTest framework. Results can be seen in Fig. 3, where the latency of the TESTTRON2 node now fluctuates around an average value (approx. 2.2 sec). The behaviour of the channels accessed in the user code seems unaffected (as shown by the count/COUNTER channels).
We intend to put this new version of metatron in operation during the dedicated shift next Monday afternoon.

Images attached to this report
Comments to this report:
bersanetti, carbognani, poulton - 23:22 Monday 25 January 2021 (50570)

During this afternoon's shift we attempted to put in operation the new metatron version as tested offline.
We soon discovered that the latency fix came at the cost of writing the metatron-specific channels (like the node _index) with one second of additional delay (Fig. 1) with respect to the previous metatron version (see Fig. 2).
After deep investigation, we resorted to splitting the frame interaction between check_connections() and the new manage_connections() ezca method, to replicate the situation in PyALP corresponding to the following sequence (a code sketch follows the list):

  • check_connections()
    • frame = fdio.get_frame()
    • out_frame = frame.copy()
  • user code
  • manage_connections()
    • out_frame.write_sms_prefix(.....)
    • fdio.put_frame(out_frame)
    • frame.close()
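
For illustration, the split could look like the Python sketch below; the class name and constructor are hypothetical, only the method names and the call order around the user code come from the sequence above.

  class EzcaDaqBridge:
      def __init__(self, fdio, sms_prefix):
          self.fdio = fdio
          self.sms_prefix = sms_prefix
          self.frame = None
          self.out_frame = None

      def check_connections(self):
          # before the user code: fetch the input frame and prepare the output copy
          self.frame = self.fdio.get_frame()
          self.out_frame = self.frame.copy()

      def manage_connections(self):
          # after the user code: tag, ship and release the frames
          self.out_frame.write_sms_prefix(self.sms_prefix)
          self.fdio.put_frame(self.out_frame)
          self.frame.close()

Splitting the two halves around the user code is what, per this report, removed the additional one-second delay on the metatron-specific channels (Fig. 4).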


Many tests involving node stop/restart and NARM_LOCK + WARM_LOCK driving to locked were done from a single-user installation.
Then, at the end of the shift, a site installation of metatron 1.5 (containing the updated version 1.4 of guardian and ezca) was done and a last locking of the arms was performed.
Fig. 3 shows the fixed latency of all restarted nodes and Fig. 4 shows the absence of the additional one-second delay.

 

Images attached to this comment
bersanetti - 17:53 Tuesday 26 January 2021 (50579)

Something very unexpected seems to have happened, which is most probably related to yesterday's activity, although the reason is still unclear. Anyway, for the time being, do not pause Metatron nodes, or they will eat up all the memory of olserver52 (this is probably the cause of the sudden death of many *Moni processes earlier this afternoon).

Verifications ongoing.

carbognani - 23:30 Tuesday 26 January 2021 (50585)
Problem analyzed, fully understood, and new version of metatron in development.
For the time being it is better to roll back to the previous metatron version (1.4) and then go as far as possible with offline testing of the new version, which should be possible on the TESTTRON nodes.
I will check with the control room, or agree during the next daily meeting, on the best short time slot for a restart of all automation nodes.
bersanetti, carbognani - 21:32 Wednesday 27 January 2021 (50599)

This afternoon we rolled back to Metatron version 1.4, the one in use up until Monday afternoon, restoring the previous situation (nodes are pausable, latency can fluctuate). After some more offline tests, the new version will be re-deployed in a dedicated time slot.

carbognani - 16:37 Sunday 07 February 2021 (50708)

Today I have put in operation release 1.6 of metatron (corresponding to ezca v1r6 and guardian v1r5), fixing the large latency fluctuations without introducing memory leaks.
This has been done by adding a flag, at the level of the guardian library, in the worker.py main loop running the user code. The flag triggers the execution of manage_connections() at the very beginning of the loop in case this did not happen at the previous cycle for any reason (automation node in pause, exceptions in the user code, etc.).
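
A rough Python sketch of the described mechanism is given below; the names used (pending, node.paused, run_user_code, ...) are hypothetical and this is not the actual worker.py code.

  def worker_main_loop(ezca, node):
      pending = False  # True if manage_connections() still has to be executed
      while True:
          # metatron-1.6 fix: if the previous cycle did not reach
          # manage_connections() (node in pause, exception in the user code, ...),
          # run it at the very beginning of this cycle so no frame stays open
          if pending:
              ezca.manage_connections()
              pending = False

          ezca.check_connections()   # fetch the input frame, prepare the output copy
          pending = True

          if node.paused:
              continue               # deferred to the beginning of the next cycle

          try:
              node.run_user_code()
          except Exception as error:
              node.log(error)        # also deferred to the next cycle
              continue

          ezca.manage_connections()  # normal end of cycle: write back and close frames
          pending = False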
The solution has been verified first on the TESTTRON2 node. In Fig. 1 the memory leak is immediately evident from the memory increase reported in the node logs in VPM, while in Fig. 2 the memory leak is no longer there when the node is paused. Fig. 3 shows that the latency is back to normal behaviour and unaffected during the pause period.

All running automation nodes have then been brought to DOWN, restarted on top of the new 1.6 release and then brought back to the highest possible states, namely:

  • INJ_MAIN: IMC_RESTORED
  • NARM_LOCK: LOCKED_ON_B7_BOOST_ON
  • WARM_LOCK: LOCKED_ON_B8_BOOST_ON
  • SQZ_MAIN: LOCKED_PLL_INT

without problems.
Fig. 4 shows the latency of all running nodes back to normal; it can also be compared with that of FbmMain (from which the nodes get their frames).
Fig. 5 provides a snapshot of the situation with a couple of nodes (INJ_MAIN, SQZ_MAIN) put in pause. Node latencies and the memory of olserver52 (the machine running the nodes) are not affected.
I leave the interferometer with both arms locked on the IR beam as I found it this morning.

Images attached to this comment
masserot - 8:37 Monday 08 February 2021 (50711)

The latency reduction can be seen in the attached plot.

Images attached to this comment