AdV-DAQ (Data Acquisition and Global Control)
berni, letendre, masserot - 16:26 Thursday 12 May 2022 (55812)
rtpc21 only SSFS, ALS_CEB on rtpc10

The ALS_CEB servers previously running with the SSFS on the rtpc21 (SSFS_rtpc) are now on the rtpc10 (CEB_ALS_rtpc):

  • a new Tolm fpath has been set up to connect the rtpc10 link1 in the Computing room to the MxDx_v2_SN19 link13, thanks to Francesco
  • The CEB_ALS DBox SN86 is now managed by the CEB_ALS_dbox server. As a consequence, all the data provided by its mezzanines are now sent to the CEB_ALS_rtpc

The servers have been renamed CEB_ALS and CEB_ALS_BPC, and the ARMS_LOCK metatron server Python configuration has been updated accordingly.

After these updates, the ITF was relocked at LOCKED_ARMS_IR_ALS to adjust the ALS_CEB BPC PZT positions, and then up to ACQUIRE_DRMI.

The attached plots compare the SSFS server's elapsed_time with the ALS_CEB tasks (purple) and without them (blue): the overall cycle time has been reduced by 1.4 µs

 

Images attached to this report
Comments to this report:
masserot - 16:16 Thursday 12 May 2022 (55820)

To take into account a new option at the BIOS level, the rtpc21 was rebooted on May 12 at 16:08:30 LT.

letendre, masserot, pacaud - 9:49 Friday 13 May 2022 (55829)

Since the migration of ALS_CEB to the rtpc10 and the last reboot of the SSFS_rtpc, looking at the SSFS_Ctrl server logfile, one can find the following messages:

  • 2022-05-12-14h01m20-UTC>WARNING-AcAdcChCheck> Err_post - start delayed or missing at GS 1336399298-000092890
  • 2022-05-12-14h13m30-UTC>WARNING-AcAdcChCheck> Err_post - start delayed or missing at GS 1336400028-047670900
  • 2022-05-12-14h28m31-UTC>WARNING-AcAdcChCheck> Err_post - start delayed or missing at GS 1336400928-051550890
  • 2022-05-12-14h38m31-UTC>WARNING-AcAdcChCheck> Err_post - start delayed or missing at GS 1336401528-045290900
  • 2022-05-12-15h53m30-UTC>WARNING-AcAdcChCheck> Err_post - start delayed or missing at GS 1336406028-049330890
  • 2022-05-12-16h23m30-UTC>WARNING-AcAdcChCheck> Err_post - start delayed or missing at GS 1336407828-044290890
  • 2022-05-12-17h18m30-UTC>WARNING-AcAdcChCheck> Err_post - start delayed or missing at GS 1336411128-129490890
  • 2022-05-12-17h48m30-UTC>WARNING-AcAdcChCheck> Err_post - start delayed or missing at GS 1336412928-054850890

which refer to a non-updated content of this channel. In fact, sometimes all the channels sent by the FAST DAC mezzanine seem to be delivered too late.

Analysing the time differences between these events, it appears that they are modulo 300 s (5 min), exactly the same period as between 2 consecutive peaks in the SSFS_Ctrl_Tpro channel and the other Tpro channels.

This period of 300 s (5 min) is the period of the ganglia requests used to monitor the different hosts managed by the IT department.
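As an illustration of this check, here is a minimal Python sketch (not the actual analysis tool used) that extracts the GPS time of each warning above and prints the spacing between consecutive events; the 9-digit fractional part of the GS field is assumed to be in nanoseconds:

import re

# Minimal sketch: extract the GPS time of each "start delayed or missing" warning
# and print the spacing between consecutive events. The first lines of the list
# above are used as input.
log_lines = [
    "2022-05-12-14h01m20-UTC>WARNING-AcAdcChCheck> Err_post - start delayed or missing at GS 1336399298-000092890",
    "2022-05-12-14h13m30-UTC>WARNING-AcAdcChCheck> Err_post - start delayed or missing at GS 1336400028-047670900",
    "2022-05-12-14h28m31-UTC>WARNING-AcAdcChCheck> Err_post - start delayed or missing at GS 1336400928-051550890",
    "2022-05-12-14h38m31-UTC>WARNING-AcAdcChCheck> Err_post - start delayed or missing at GS 1336401528-045290900",
]

pattern = re.compile(r"at GS (\d+)-(\d+)")
times = []
for line in log_lines:
    match = pattern.search(line)
    if match:
        seconds, fraction_ns = int(match.group(1)), int(match.group(2))
        times.append(seconds + fraction_ns * 1e-9)

for earlier, later in zip(times, times[1:]):
    delta = later - earlier
    # A spacing that is (close to) a multiple of 300 s points to a 5-minute periodic process.
    print(f"delta = {delta:8.3f} s, remainder modulo 300 s = {delta % 300:7.3f} s")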

Moreover, as the digital noise cancellation is used for the SSFS, these missing samples introduce a small glitch in the SSFS_Corr signal (zoom)
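As a purely illustrative sketch (it does not reproduce the actual SSFS noise-cancellation filters, only the effect of a stale sample in a subtracted witness channel), one can see how a single late sample turns into a one-sample glitch in the corrected output:

import numpy as np

# Purely illustrative: a correction built by subtracting a scaled witness channel.
# If one witness sample is delivered too late, the previous value is reused, the
# cancellation is imperfect for that sample, and a small glitch appears in the output.
n_samples = 200
witness = np.sin(2 * np.pi * 0.01 * np.arange(n_samples))   # toy witness signal
loop_signal = 0.5 * witness                                  # toy coupling of the witness into the loop

stale_witness = witness.copy()
stale_witness[100] = stale_witness[99]           # sample 100 arrives too late: stale value is used

correction = loop_signal - 0.5 * stale_witness   # perfect cancellation would give exactly zero
print("max residual:", np.max(np.abs(correction)))   # nonzero only around the stale sample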

Images attached to this comment
Non-image files attached to this comment
dibiase - 11:18 Friday 13 May 2022 (55832)

We have already spotted a perturbing event for the rtpcs' interrupts: the ssh access! The periodicity was linked to the ssh check from the external Nagios server which, among other things, probes whether the ssh daemon is able to handle a session.

The first event was reported here: https://logbook.virgo-gw.eu/virgo/?r=50562 (25/01/2021)

The second investigation started around 15/09/2021... we exchanged emails about it with the same conclusion: the issue does not come from ganglia but from the ssh Nagios monitoring.

The final conclusion is that this jump may be related to the writing of the ssh connection logs to disk, since it does not happen on the diskless rtpcs.

Disabling the syslog or journald services has no effect and the issue remains.

Emmanuel suggests:

"We have checked more carefully at LAPP, and it seems we also have jumps at each ssh connection, but with a lower amplitude (1 to 2 µs). We are able to produce the same jumps by clearing some caches using:

sudo echo 1 > /proc/sys/vm/drop_caches"

and in the end:

"No jump at this time. So no correlation between these caches and the ssh induced jumps."

 

letendre, masserot, pacaud - 14:39 Friday 13 May 2022 (55834)

The 300 s period is due to the NAGIOS monitoring, not the ganglia one.

The IT department disabled the NAGIOS monitoring on the rtpc21 around 8h40-UTC.

The attached plots show that when the NAGIOS monitoring is disabled:

  • there are no more jumps above 20 µs on the SSFS_Ctrl_Tpro_elapsed_time channel
  • and there are no more events in the SSFS_Ctrl server logfile complaining about SSFS FASTDAC Tolm packets delivered too late

 

If the NAGIOS monitoring is not mandatory, it would be better to disable it on all the RTPCs involved in the ITF control, to ensure the real-time performance.

Images attached to this comment
cortese, masserot - 15:44 Friday 13 May 2022 (55837)

After talking with S.Cortese, a test doing a simple ssh every 10 s on the rtpc21 was performed between 13h06m59-UTC and 13h13m40-UTC with the command /usr/bin/ssh -o BatchMode=yes -x virgorun@${rtpc} "sh -c 'hostname'".
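A minimal Python sketch of how such a periodic probe could be scripted (the exact script used for the test is not given in this entry; the host, interval and ssh command are taken from the description above):

import subprocess
import time

# Minimal sketch of the test described above: issue a simple ssh command on the
# target rtpc every 10 s and record when each probe was started.
rtpc = "rtpc21"        # target host, as in the test above
interval = 10          # seconds between two consecutive probes
command = ["/usr/bin/ssh", "-o", "BatchMode=yes", "-x",
           f"virgorun@{rtpc}", "sh -c 'hostname'"]

for _ in range(40):    # about 400 s, roughly the duration of the test above
    started = time.strftime("%Y-%m-%d %H:%M:%S")
    result = subprocess.run(command, capture_output=True, text=True)
    print(started, result.stdout.strip() or result.stderr.strip())
    time.sleep(interval)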

The result is obvious:

  • jumps on the elapsed_time
  • and new events of Tolm SSFS FASTDAC packets delivered too late
    • 2022-05-13-13h06m59-UTC-GPS1336482437 - delta 16709(s)
    • 2022-05-13-13h09m45-UTC-GPS1336482602 - delta 165(s)
    • 2022-05-13-13h12m39-UTC-GPS1336482776 - delta 174(s)

Looking at the SSFS_Ctrl logfiles before and after the IT shutdown, the relationship with the NAGIOS or SSH accesses is obvious only after the reboot of the rtpc21 using the "Low Latency Optimized Profile" BIOS option.

Images attached to this comment