AdV-DAQ (Data collection)
bouedo, masserot, mours - 9:16 Monday 25 July 2022 (56526)
DAQ: rtpc TolmFrameBuilder frame at 5Hz (dt:0.2s)

Last Friday, 20220722, between 11h00 and 12h30 UTC, we set up frame building at 5Hz on most of the TolmFrameBuilder servers, except the FDS_Img_Tfb one:

  • the frames with a frame dt of 0.2s are received by the FbmFFE server. For now, a single FdIOServer called FbmFFEDy provides low-latency data access for commissioners
  • the TolmFrameBuilder server configurations were updated to allow frame building at 5Hz and frame transmission to the FbmFFE server
  • the 50Hz online frame building and the frame dt extension from 0.2s to 1s are now done between the FbmFFE and the FbmFE frame merger stages by the server called *_50Hz in the data collection Front End DAQ sub-group. Previously this was done between the FbmFE and the FbmMain frame merger stages
  • note that some channels may be missing during the time period of the operations
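As an illustration of the dt extension from 0.2s to 1s mentioned above, here is a minimal Python sketch that groups 0.2s frames into the 1s frame containing them and flags 1s frames that did not receive all five inputs. This is not the code of the *_50Hz server; the function names and the representation of a frame by its start time are assumptions made for the example.

```python
from collections import defaultdict

FRAME_DT = 0.2    # seconds, dt of the frames built at 5Hz
MERGED_DT = 1.0   # seconds, dt after the extension stage
FRAMES_PER_MERGED = round(MERGED_DT / FRAME_DT)  # 5 short frames per 1s frame


def group_frames(start_times):
    """Group 0.2s frame start times (seconds) by the 1s frame that contains them.

    Hypothetical helper: a frame is represented only by its start time.
    """
    groups = defaultdict(list)
    for t in start_times:
        # Use integer frame indices to avoid floating-point edge effects.
        index = round(t / FRAME_DT)
        groups[index // FRAMES_PER_MERGED].append(t)
    return dict(groups)


def incomplete_seconds(start_times):
    """Return the indices of the 1s frames that did not get all 5 short frames."""
    groups = group_frames(start_times)
    return sorted(sec for sec, frames in groups.items()
                  if len(frames) != FRAMES_PER_MERGED)
```

For example, calling incomplete_seconds on the start times of two consecutive seconds with one 0.2s frame removed from the second one reports that second as incomplete.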

Other minor changes were made as well:

  • the input of the FreqAna server is now the FbmMain online SHM instead of the FbmFE one
  • the input for the transmission of the online DAQ stream to olserver53, done by the StToSpare* servers, is now the FbmMain online SHM instead of the FbmFE one

The first attached plot shows the trend data over the weekend for the most relevant DAQ channels:

  • no strange behaviour was observed during the weekend
  • due to the new DAQ frame merging stage, the latency of the 1s frames has increased by 0.3s

On the second attached plot one can see periodic peaks (every 4h) in the latency of the SNEB_Fb and SQB2_Fb servers.

 

Images attached to this report
Comments to this report:
masserot, mours, pacaud - 10:31 Tuesday 26 July 2022 (56542)

The 4h periodic peaks in the latency of SNEB_Fb and SQB2_Fb are due to the rhsmcertd daemon, which is configured by default to run every 4h. The activity of this daemon is logged in /var/log/rhsm/rhsmcertd.log:

  • rtpc8 (SNEB_Fb) rhsmcertd logfile content (timestamps in local time)
    • Mon Jul 25 00:45:01 2022 [INFO] (Cert Check) Certificates updated.
      Mon Jul 25 04:45:01 2022 [INFO] (Cert Check) Certificates updated.
      Mon Jul 25 08:45:02 2022 [INFO] (Cert Check) Certificates updated.
      Mon Jul 25 12:45:02 2022 [INFO] (Cert Check) Certificates updated.
      Mon Jul 25 13:21:38 2022 [INFO] (Auto-attach) Certificates updated.
      Mon Jul 25 16:45:02 2022 [INFO] (Cert Check) Certificates updated.
      Mon Jul 25 20:45:02 2022 [INFO] (Cert Check) Certificates updated.
      Tue Jul 26 00:45:02 2022 [INFO] (Cert Check) Certificates updated.
      Tue Jul 26 04:45:02 2022 [INFO] (Cert Check) Certificates updated.
      Tue Jul 26 08:45:02 2022 [INFO] (Cert Check) Certificates updated.

  • rtpc3 (SQB2_Fb) rhsmcertd logfile content (timestamps in local time)
    • Mon Jul 25 00:42:53 2022 [INFO] (Cert Check) Certificates updated.
      Mon Jul 25 04:42:53 2022 [INFO] (Cert Check) Certificates updated.
      Mon Jul 25 06:39:13 2022 [INFO] (Auto-attach) Certificates updated.
      Mon Jul 25 08:42:53 2022 [INFO] (Cert Check) Certificates updated.
      Mon Jul 25 12:42:53 2022 [INFO] (Cert Check) Certificates updated.
      Mon Jul 25 16:42:53 2022 [INFO] (Cert Check) Certificates updated.
      Mon Jul 25 20:42:54 2022 [INFO] (Cert Check) Certificates updated.
      Tue Jul 26 00:42:54 2022 [INFO] (Cert Check) Certificates updated.
      Tue Jul 26 04:42:54 2022 [INFO] (Cert Check) Certificates updated.
      Tue Jul 26 06:39:13 2022 [INFO] (Auto-attach) Certificates updated.
      Tue Jul 26 08:42:54 2022 [INFO] (Cert Check) Certificates updated.
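The 4h spacing of the "Cert Check" entries can be verified with a short script. The sketch below is only an illustration: the function names are made up for the example, the sample lines are copied from the rtpc8 extract above, and the timestamp parsing assumes an English/C locale for the day and month names.

```python
from datetime import datetime, timedelta


def cert_check_times(log_lines):
    """Return the timestamps of the '(Cert Check)' entries of rhsmcertd.log lines."""
    times = []
    for line in log_lines:
        line = line.strip()
        if "(Cert Check)" not in line:
            continue  # skip e.g. the '(Auto-attach)' entries
        stamp = line.split(" [", 1)[0]  # text before '[INFO]'
        times.append(datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y"))
    return times


def intervals(times):
    """Return the time differences between consecutive timestamps."""
    return [later - earlier for earlier, later in zip(times, times[1:])]


# Sample lines copied from the rtpc8 extract.
sample = [
    "Mon Jul 25 00:45:01 2022 [INFO] (Cert Check) Certificates updated.",
    "Mon Jul 25 04:45:01 2022 [INFO] (Cert Check) Certificates updated.",
    "Mon Jul 25 08:45:02 2022 [INFO] (Cert Check) Certificates updated.",
    "Mon Jul 25 12:45:02 2022 [INFO] (Cert Check) Certificates updated.",
    "Mon Jul 25 13:21:38 2022 [INFO] (Auto-attach) Certificates updated.",
    "Mon Jul 25 16:45:02 2022 [INFO] (Cert Check) Certificates updated.",
]

gaps = intervals(cert_check_times(sample))
# True when every gap is within 2 seconds of the 4h period.
periodic = all(abs(g - timedelta(hours=4)) <= timedelta(seconds=2) for g in gaps)
```

The resulting timestamps can then be compared with the times of the latency peaks seen on the plots.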

The attached plots show, for rtpc8 and rtpc3:

  • the latency channel at the TolmFrameBuilder level
  • the CPU0 idle on which the TolmFrameBuilder is running
  • the red circle marks a jump coinciding with the rhsmcertd activity.

As this daemon is useless on the rtpcs, it must be disabled to ensure correct data taking.

Images attached to this comment
cortese, masserot, pacaud - 13:41 Tuesday 26 July 2022 (56549)

The rhsmcertd daemon has been disabled on all the rtpcs (see the attached file for details).

Non-image files attached to this comment
cortese - 11:11 Thursday 11 August 2022 (56718)

Considerations on data collection glitches caused by ordinary Operating System tasks

So far, at least two tasks have been found to impact the real-time RTPC operations: ssh connections (see https://logbook.virgo-gw.eu/virgo/?r=55832 ) and the rhsmcertd daemon.
In both cases, asymmetric cryptographic key generation/verification is involved:

  1. in ssh, when the session keys are generated and exchanged to set up the encrypted ssh channel.
    This occurs when sshd receives the client connection and before the user authentication, which matches what was observed.
    Note that session re-keying also occurs periodically on a long-lived connection, after 1 to 4 GB has been transferred; therefore, under this hypothesis, the glitches should also appear quasi-randomly on rtpcs with long-running processes started via VPM
  2. when some certificate/public key is verified, as with the rhsmcertd daemon; this could also occur with other tasks that need a public/private key encryption/verification algorithm

It is unlikely that anything specific to these algorithms causes the glitches, other than the fact that they are computationally heavy and so saturate, for a while, the single core dedicated to the OS.
Therefore these episodes may occur more frequently as the default key length for these algorithms increases with OS upgrades, or in any other case where the OS-dedicated core is heavily used.

This kind of impact of OS tasks on real-time operation suggests that the platform has become sensitive to the activity of ordinary system daemons. (By the way, it is not true of rhsmcertd that "this daemon is useless on the rtpcs": we use the free version of the Redhat Subscription infrastructure, Redhat Satellite, for the bare metal automatic installation and package provisioning, which is embodied in the Foreman platform for RTPCs. The daemon will therefore be re-enabled when needed during the maintenance windows.)

The conclusion is that the current configuration, with a single core dedicated to the Operating System and non-realtime tasks while the others are isolated for realtime tasks, may no longer be enough to sustain the ordinary system activity, perhaps even if the DAQ collection load is distributed over a larger number of rtpcs.

As a possible solution, the introduction of a second processor could be re-evaluated in order to dedicate the first processor to the OS, unless some realtime/RTAI constraint forbids this.

pacaud - 10:26 Friday 12 August 2022 (56731)
Another reason why we have become more sensitive to non-realtime activities is that the machines are more loaded now, with a typical elapsed time not far from the real-time scheduling period. While it is acceptable to occasionally have an elapsed time longer than the scheduling period, trouble appears when we miss the deadline several times in a row. So the less margin we have, the more sensitive we are.
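The distinction between an occasional overrun and several consecutive missed deadlines can be made concrete with a small Python sketch. The function names, the units and the numbers in the example are hypothetical and are not taken from the rtpc monitoring channels.

```python
def longest_miss_run(elapsed_times, period):
    """Return the longest run of consecutive cycles whose elapsed time exceeds the period."""
    longest = current = 0
    for elapsed in elapsed_times:
        if elapsed > period:
            current += 1
            longest = max(longest, current)
        else:
            current = 0  # a cycle that meets the deadline resets the run
    return longest


def typical_margin(elapsed_times, period):
    """Return the fraction of the period left free by the median elapsed time."""
    ordered = sorted(elapsed_times)
    median = ordered[len(ordered) // 2]
    return (period - median) / period


# Hypothetical elapsed times, in arbitrary units, for a scheduling period of 100.
elapsed = [90, 105, 95, 110, 120, 130, 80]
run = longest_miss_run(elapsed, 100)    # three consecutive misses: 110, 120, 130
margin = typical_margin(elapsed, 100)   # median is 105, so the margin is negative
```

With a small or negative margin, any extra delay on a cycle is more likely to extend a run of misses rather than being absorbed.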

The use of another processor may improve things, but that is not guaranteed; we have to check whether this is the case. I'm not sure the CPU use of non-realtime tasks is the cause of the trouble, as they run on another core, and the real-time tasks run in an interrupt context, with all interrupts masked. What could explain the increased elapsed time is cache misses. We know, for example, that when the TolmFrameBuilder is accessing the TOLM packet FIFO, we see an increase of the elapsed time.