AdV-DAQ (Data Acquisition and Global Control)
berni, letendre, masserot - 16:26 Thursday 12 May 2022 (55812)
rtpc21 only SSFS, ALS_CEB on rtpc10

The ALS_CEB servers previously running with the SSFS on the rtpc21 (SSFS_rtpc) are now on the rtpc10 (CEB_ALS_rtpc):

  • a new Tolm fpath has been set up to connect the rtpc10 link1 in the Computing room to the MxDx_v2_SN19 link13, thanks to Francesco
  • The CEB_ALS DBox SN86 is now managed by the CEB_ALS_dbox server. As a consequence, all the data provided by its mezzanines are now sent to the CEB_ALS_rtpc

The servers have been renamed CEB_ALS and CEB_ALS_BPC, and the ARMS_LOCK metatron server Python configuration has been updated accordingly.

After these updates, the ITF was relocked at LOCKED_ARMS_IR_ALS to adjust the ALS_CEB BPC PZT positions, and then up to ACQUIRE_DRMI.

The attached plots compare the SSFS server's elapsed_time with the ALS_CEB tasks (purple) and without them (blue): the overall cycle time has been reduced by 1.4 µs

 

Images attached to this report
Comments to this report:
masserot - 16:16 Thursday 12 May 2022 (55820)

To take into account a new option at the BIOS level, the rtpc21 was rebooted on May 12 at 16:08:30 LT.

letendre, masserot, pacaud - 9:49 Friday 13 May 2022 (55829)

Since the migration of ALS_CEB to the rtpc10 and the last reboot of the SSFS_rtpc, looking at the SSFS_Ctrl server logfile, one can find the following messages:

  • 2022-05-12-14h01m20-UTC>WARNING-AcAdcChCheck> Err_post - start delayed or missing at GS 1336399298-000092890
  • 2022-05-12-14h13m30-UTC>WARNING-AcAdcChCheck> Err_post - start delayed or missing at GS 1336400028-047670900
  • 2022-05-12-14h28m31-UTC>WARNING-AcAdcChCheck> Err_post - start delayed or missing at GS 1336400928-051550890
  • 2022-05-12-14h38m31-UTC>WARNING-AcAdcChCheck> Err_post - start delayed or missing at GS 1336401528-045290900
  • 2022-05-12-15h53m30-UTC>WARNING-AcAdcChCheck> Err_post - start delayed or missing at GS 1336406028-049330890
  • 2022-05-12-16h23m30-UTC>WARNING-AcAdcChCheck> Err_post - start delayed or missing at GS 1336407828-044290890
  • 2022-05-12-17h18m30-UTC>WARNING-AcAdcChCheck> Err_post - start delayed or missing at GS 1336411128-129490890
  • 2022-05-12-17h48m30-UTC>WARNING-AcAdcChCheck> Err_post - start delayed or missing at GS 1336412928-054850890

which refer to a non-updated content of this channel. In fact, sometimes all the channels sent by the FAST DAC mezzanine seem to be delivered too late.

Analysing the time differences between these events, it appears that they are modulo 300 s (5 min), exactly the same period as between 2 consecutive peaks in the SSFS_Ctrl_Tpro channel and the other Tpro channels.

This period of 300 s (5 min) is the period of the ganglia requests used to monitor the different hosts managed by the IT department.
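As an illustration of this check, here is a minimal Python sketch (not the actual analysis tool used) that extracts the GPS time of each warning above and prints the spacing between consecutive events; the 9-digit fractional part of the GS field is assumed to be in nanoseconds:

import re

# Minimal sketch: extract the GPS time of each "start delayed or missing" warning
# and print the spacing between consecutive events. The first lines of the list
# above are used as input.
log_lines = [
    "2022-05-12-14h01m20-UTC>WARNING-AcAdcChCheck> Err_post - start delayed or missing at GS 1336399298-000092890",
    "2022-05-12-14h13m30-UTC>WARNING-AcAdcChCheck> Err_post - start delayed or missing at GS 1336400028-047670900",
    "2022-05-12-14h28m31-UTC>WARNING-AcAdcChCheck> Err_post - start delayed or missing at GS 1336400928-051550890",
    "2022-05-12-14h38m31-UTC>WARNING-AcAdcChCheck> Err_post - start delayed or missing at GS 1336401528-045290900",
]

pattern = re.compile(r"at GS (\d+)-(\d+)")
times = []
for line in log_lines:
    match = pattern.search(line)
    if match:
        seconds, fraction_ns = int(match.group(1)), int(match.group(2))
        times.append(seconds + fraction_ns * 1e-9)

for earlier, later in zip(times, times[1:]):
    delta = later - earlier
    # A spacing that is (close to) a multiple of 300 s points to a 5-minute periodic process.
    print(f"delta = {delta:8.3f} s, remainder modulo 300 s = {delta % 300:7.3f} s")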

Moreover, as the digital noise cancellation is used for the SSFS, these missing samples introduce a small glitch in the SSFS_Corr signal (zoom)
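As a purely illustrative sketch (it does not reproduce the actual SSFS noise-cancellation filters, only the effect of a stale sample in a subtracted witness channel), one can see how a single late sample turns into a one-sample glitch in the corrected output:

import numpy as np

# Purely illustrative: a correction built by subtracting a scaled witness channel.
# If one witness sample is delivered too late, the previous value is reused, the
# cancellation is imperfect for that sample, and a small glitch appears in the output.
n_samples = 200
witness = np.sin(2 * np.pi * 0.01 * np.arange(n_samples))   # toy witness signal
loop_signal = 0.5 * witness                                  # toy coupling of the witness into the loop

stale_witness = witness.copy()
stale_witness[100] = stale_witness[99]           # sample 100 arrives too late: stale value is used

correction = loop_signal - 0.5 * stale_witness   # perfect cancellation would give exactly zero
print("max residual:", np.max(np.abs(correction)))   # nonzero only around the stale sample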

Images attached to this comment
Non-image files attached to this comment
dibiase - 11:18 Friday 13 May 2022 (55832)

We have already spotted a perturbing event for the rtpcs' interrupts: the ssh access! The periodicity was linked to the ssh check from the external Nagios server which, among other things, probes whether the ssh daemon is able to handle a session.

The first event was reported here: https://logbook.virgo-gw.eu/virgo/?r=50562 (25/01/2021)

The second investigation started around 15/09/2021... we exchanged emails about it with the same conclusion: the issue does not come from ganglia but from the ssh Nagios monitoring.

The final conclusion is that this jump may be related to the writing of the ssh connection logs to disk, since it does not happen on the diskless rtpcs.

Disabling the syslog or journald services has no effect and the issue remains.

Emmanuel suggests:

"We have checked more carefully at LAPP, and it seems we also have jumps at each ssh connection, but with a lower amplitude (1 to 2 µs). We are able to produce the same jumps by clearing some caches using:

sudo echo 1 > /proc/sys/vm/drop_caches"

and in the end:

"No jump at this time. So no correlation between these caches and the ssh induced jumps."

 

letendre, masserot, pacaud - 14:39 Friday 13 May 2022 (55834)

The 300 s period is due to the NAGIOS monitoring, not the ganglia one.

The IT department disabled the NAGIOS monitoring on the rtpc21 around 8h40-UTC.

The attached plots show that when the NAGIOS monitoring is disabled:

  • there are no more jumps above 20 µs on the SSFS_Ctrl_Tpro_elapsed_time channel
  • and there are no more events in the SSFS_Ctrl server logfile complaining about SSFS FASTDAC Tolm packets delivered too late

 

If the NAGIOS monitoring is not mandatory, it would be better to disable it on all the RTPCs involved in the ITF control, to ensure the real-time performance.

Images attached to this comment
cortese, masserot - 15:44 Friday 13 May 2022 (55837)

After talking with S.Cortese, a test doing a simple ssh every 10 s on the rtpc21 was performed between 13h06m59-UTC and 13h13m40-UTC with the command /usr/bin/ssh -o BatchMode=yes -x virgorun@${rtpc} "sh -c 'hostname'".
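A minimal Python sketch of how such a periodic probe could be scripted (the exact script used for the test is not given in this entry; the host, interval and ssh command are taken from the description above):

import subprocess
import time

# Minimal sketch of the test described above: issue a simple ssh command on the
# target rtpc every 10 s and record when each probe was started.
rtpc = "rtpc21"        # target host, as in the test above
interval = 10          # seconds between two consecutive probes
command = ["/usr/bin/ssh", "-o", "BatchMode=yes", "-x",
           f"virgorun@{rtpc}", "sh -c 'hostname'"]

for _ in range(40):    # about 400 s, roughly the duration of the test above
    started = time.strftime("%Y-%m-%d %H:%M:%S")
    result = subprocess.run(command, capture_output=True, text=True)
    print(started, result.stdout.strip() or result.stderr.strip())
    time.sleep(interval)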

The result is obvious:

  • jumps on the elapsed_time
  • and new events of Tolm SSFS FASTDAC packets delivered too late
    • 2022-05-13-13h06m59-UTC-GPS1336482437 - delta 16709(s)
    • 2022-05-13-13h09m45-UTC-GPS1336482602 - delta 165(s)
    • 2022-05-13-13h12m39-UTC-GPS1336482776 - delta 174(s)

Looking at the SSFS_Ctrl logfiles before and after the IT shutdown, the relationship with the NAGIOS or SSH accesses is obvious only after the reboot of the rtpc21 using the "Low Latency Optimized Profile" BIOS option.

Images attached to this comment