AdV-DAQ (Data collection)
rolland - 9:03 Saturday 17 August 2019 (46685)
DAQ on-call

I was called because there were gray flags in the DMS for "missing channels", mainly from the suspension lines (*_Electr in particular).

The data collection itself seems to work fine. I noticed that FbsAlp is complaining much more than usual about late replies from some devices (probably the VAC servers, which are asked to send data via the Cm/Tango interface). FbmAlp is complaining a lot this morning about frames received late from different Metatron nodes.

Large spikes in FbmAlp latency appeared on August 12th, 7h22 UTC, see figure 1 (before), figure 2 (August 12th) and figure 3 (today).

Figure 4 shows some trend channels around that time. Channels from the real-time system (CAL_NE_MIR_Z_NOISE, generated in RTPC, and PCAL_power, from an ADC) and channels generated in the data collection pipeline (FbmAlp_latency) are always present, while channels coming from Metatron nodes are sometimes missing (they arrive too late and are rejected by the data collection pipeline at the level of FbmAlp).
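The rejection mechanism described above can be illustrated with a minimal sketch. This is not the FbmAlp code: the `Frame` fields, the `merge` function, the source names and the 2 s latency limit are all assumptions chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    source: str     # hypothetical producer name
    gps_time: int   # start of the one-second span the frame covers
    arrival: float  # time at which the frame reached the merger


def merge(frames, max_latency):
    """Return {gps_time: set of accepted sources}; late frames are dropped."""
    accepted = {}
    for f in frames:
        # A frame can only be complete at the end of its second,
        # so latency is measured from gps_time + 1.
        latency = f.arrival - (f.gps_time + 1)
        if latency <= max_latency:
            accepted.setdefault(f.gps_time, set()).add(f.source)
    return accepted


frames = [
    Frame("rtpc", 100, 101.2),
    Frame("metatron", 100, 101.5),
    Frame("rtpc", 101, 102.2),
    Frame("metatron", 101, 106.0),  # arrives 4 s after the end of its second
]
result = merge(frames, max_latency=2.0)
```

In this toy case the second 101 contains only the "rtpc" source: the slow producer's channels are missing for that second even though the producer itself never stopped, which matches the pattern seen in the trend data.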

I noticed that some Metatron nodes have been complaining rather often about "Tango Errors" for the past few days (for example ITF_LOCK: it has complained often since August 13th, while this had happened on only one day (July 31st) since the beginning of the run).

 

Images attached to this report
Comments to this report:
fcarbogn - 10:39 Saturday 17 August 2019 (46686)
There are indeed problems in collecting data from the suspensions.
This can be clearly seen, without involving the Virgo DAQ or the DMS, by simply issuing: checkSms SatServer
It can be seen that the number of available channels is continuously fluctuating, indicating that those are not collected from the lowest level servers.
Hoping it would help, I restarted the whole Tango SAT machinery on olserver120: all the application servers, the Starter and the database interface DataBaseds.
This did not seem to change the situation much, so I suggested that the operator ask the SAT on-call for support in diagnosing the problem at a lower level.

As mentioned by Loic, these problems manifest themselves not only as missing channels, but also as failed communications with the suspension nodes. For example, it has happened a few times during the week that the PR suspension did not receive the misalign command during the automation sequence and had to be misaligned manually before restarting from DOWN.
This is obviously heavily affecting the Automation in various ways, with nodes often needing to be restarted from scratch.
Boschi - 11:41 Saturday 17 August 2019 (46688)
Looking at the LockMoni VPM, it is clear that the problem is more general. I suggest checking the process load on the olserver120 machine. A quick check on the suspension side does not show any evident problem.
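As a minimal sketch of such a load check, assuming Python is available on the machine: the 1-minute load average is compared with the number of CPU cores. The helper name `is_overloaded` and the threshold of 1.0 per core are arbitrary assumptions, not an established criterion for this server.

```python
import os


def is_overloaded(load_1min, cores, threshold=1.0):
    """True if the 1-minute load average per core exceeds the threshold."""
    return load_1min / cores > threshold


if __name__ == "__main__":
    # os.getloadavg() is available on Unix-like systems only.
    load_1min, load_5min, load_15min = os.getloadavg()
    cores = os.cpu_count() or 1
    print(f"load averages: {load_1min:.2f} {load_5min:.2f} {load_15min:.2f} on {cores} cores")
    print("overloaded" if is_overloaded(load_1min, cores) else "load looks normal")
```

A sustained load well above the core count would suggest that the processes on the machine are competing for CPU, which could explain late replies; a normal load would point elsewhere.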