Virgo Runs (O4c)
menzione - 6:43 Friday 20 June 2025 (67058)
Operator Report - Night shift

ITF found locked at LOW_NOISE_3 in SCIENCE mode.

At 21:37 UTC the ITF unlocked due to an instability in the LN2 NI_MAR -> NE_MIR reallocation filter, or to an end TM (MIR or MAR) correction saturation. Unfortunately the unlock caused the opening of the WI ID loop, which was then properly re-closed.
Relocked at first attempt. SCIENCE mode set at 22:35 UTC.
At 00:19 UTC the ITF unlocked again, due either to the ASC DIFFp TY loop diverging or to an end TM (MIR or MAR) correction saturation.
Autorelocked at first attempt. SCIENCE mode at 01:08 UTC.
The ITF unlocked again at 02:09 UTC (TBC). Unfortunately INJ_MAIN was not able to complete FmodErr: the node was stuck looping between IMC_RESTORED and FMODERR_TUNED due to a problem with the LNFS.
At the suggestion of the ISC expert I temporarily bypassed FmodErr by setting "fmoderr_skip = True" in ITF_LOCK.ini.
Relocked after two cross-alignments in ACQUIRE_DRMI. SCIENCE mode at 04:07 UTC.
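
The bypass relies on a single boolean flag read from ITF_LOCK.ini. As a minimal sketch of how such a flag can be parsed (only the `fmoderr_skip` key comes from this report; the `[LOCK]` section name and the function are hypothetical):

```python
import configparser

# Minimal excerpt of an ITF_LOCK.ini-style file; the [LOCK] section
# name is an assumption, only the fmoderr_skip key appears in the report.
INI_TEXT = """
[LOCK]
fmoderr_skip = True
"""

def fmoderr_skipped(ini_text: str) -> bool:
    """Return True when the FmodErr step should be bypassed."""
    cfg = configparser.ConfigParser()
    cfg.read_string(ini_text)
    return cfg.getboolean("LOCK", "fmoderr_skip", fallback=False)

if fmoderr_skipped(INI_TEXT):
    print("FmodErr bypassed: proceeding past FMODERR_TUNED")
else:
    print("Running FmodErr as usual")
```

With `fallback=False` the node would keep its normal behaviour whenever the flag is absent, so the bypass only acts when explicitly enabled.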

Guard Tour (UTC)
20:56 - 21:35
23:03 - 23:41
01:06 - 01:46
03:08 - 03:50

Sub-system reports

DAQ
01:27 UTC - Lnfs100 crashed. Killed via shell, restarted via VPM.

Pending actions

DAQ
(20-06-2025 03:00 - ) LNFS not responding. INJ_MAIN metatron node stuck before FMODERR_TUNED.

Oncall events

ISC
(20-06-2025 03:00 - 20-06-2025 03:10) Operator on site with expert from remote
Status: Ended
Description: FmodErr
Actions undertaken: INJ_MAIN Metatron node stuck while trying to reach FMODERR_TUNED due to a problem with LNFS.
FmodErr disabled via "fmoderr_skip = True" in ITF_LOCK.ini.

Images attached to this report
Comments to this report:
bersanetti - 8:31 Friday 20 June 2025 (67061)

Looking at the log files of INJ_MAIN, I could find several of these errors:

138:2025-06-20-02h20m22-UTC>WARNING-[FMODERR_CHECK.run] USERMSG 0: EZCA CONNECTION ERROR: Any Other Error: Could not get value from channel: INJ_LNFS_AMPL2

So the node could not read INJ_LNFS_AMPL2 online, as is in fact also the case with dataDisplay (see Figure).

We should investigate why the channels are not available anymore.
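
A node-side mitigation would be to treat a failed channel read explicitly instead of looping on it. A sketch with a stubbed reader standing in for the real EZCA getter (the function names and retry parameters are assumptions, not the Metatron code):

```python
import time

class ChannelReadError(RuntimeError):
    """Raised when a channel value cannot be obtained."""

def read_channel_with_retry(read_fn, channel, attempts=3, delay=0.0):
    """Try read_fn(channel) a few times; raise a clear error on failure.

    read_fn stands in for the real EZCA getter, not available here.
    """
    last_exc = None
    for _ in range(attempts):
        try:
            return read_fn(channel)
        except Exception as exc:
            last_exc = exc
            time.sleep(delay)
    raise ChannelReadError(
        f"Could not get value from channel: {channel}") from last_exc

# Stub that always fails, mimicking the unavailable INJ_LNFS_AMPL2 channel.
def dead_reader(channel):
    raise ConnectionError("EZCA CONNECTION ERROR")

try:
    read_channel_with_retry(dead_reader, "INJ_LNFS_AMPL2")
except ChannelReadError as err:
    print(err)  # Could not get value from channel: INJ_LNFS_AMPL2
```

Raising a dedicated error after a bounded number of attempts would let the node report the dead channel instead of bouncing between IMC_RESTORED and FMODERR_TUNED.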

Images attached to this comment
masserot - 9:08 Friday 20 June 2025 (67068)

The Lnfs SMS data are collected by the FbsISC slow frame builder. According to the operator report, the Lnfs server was stopped and restarted at 01:27 UTC.

At the same time period one can find in the FbsISC logfile the following lines:

  • 2025-06-20-00h57m16-UTC>ERROR..-FbsCmSmsData> GPS:1434416254 No data extracted for Lnfs100  : get answer but without data (see plot)
  • 2025-06-20-01h20m34-UTC>INFO...-FbsFrameSmsQuery> GPS:1434417652, sms Lnfs100 pending 4/3 : no answer from the Lnfs100 server to the FbsISC requests, so stop sending requests
  • 2025-06-20-01h20m34-UTC>INFO...-CfgReachState> Active(Active) Ok
  • 2025-06-20-01h26m35-UTC>INFO...-Cm> CheckMasksPoll> Lnfs100 - POLLERR  : Cm detects an error but not the server disconnection
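
The "pending 4/3" line suggests the frame builder counts unanswered requests and stops querying a server once that count exceeds a limit. A sketch of that bookkeeping (the class and method names are assumptions; only the 4-over-3 behaviour comes from the log):

```python
class SmsQueryThrottle:
    """Stop sending requests to a server once too many go unanswered."""

    def __init__(self, name, pending_limit=3):
        self.name = name
        self.pending_limit = pending_limit
        self.pending = 0

    def should_send(self):
        return self.pending <= self.pending_limit

    def sent(self):
        self.pending += 1

    def answered(self):
        self.pending = 0  # any reply clears the backlog

throttle = SmsQueryThrottle("Lnfs100")
for _ in range(4):          # four requests go out, none is answered
    if throttle.should_send():
        throttle.sent()
# pending is now 4/3: further requests are suppressed
print(throttle.pending, throttle.should_send())  # 4 False
```

Under this logic the throttle stays latched until the server answers again, which matches the need to remove and re-add the SMS to resume the data collection.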

To restore the Lnfs100 SMS data collection the following actions were performed :

  • 2025-06-20-06h35m52-UTC    'Remove SMS [SMS server=Lnfs100]' sent to FbsISC
  • 2025-06-20-06h35m55-UTC    'Reload Configuration' sent to FbsISC

After these operations, the Lnfs SMS channels are again available in the DAQ (see plot).

Images attached to this comment
bersanetti - 16:55 Friday 20 June 2025 (67072)

Later today the issue happened again. The FbsISC framebuilder lost connection with the Lnfs100 process:

FbsISC:
2025-06-20-09h49m05-UTC>ERROR..-FbsCmSmsData> GPS:1434448163 No data extracted for Lnfs100
2025-06-20-10h54m40-UTC>WARNING-Main> local - gps 1434452098-130968000, prev 1434452097-002215000(000000000) - frDt 1, dt 1.13097 - nb 1 - tmo 0.004522

Lnfs100:

2025-06-20-07h46m00-UTC>INFO...-Sent AMPL 2 -6 1 command to LNFS1
2025-06-20-09h49m04-UTC>INFO...-CfgReachState> Error(Error) Ok
2025-06-20-09h49m04-UTC>WARNING-Timeout from worker process!
2025-06-20-09h50m09-UTC>WARNING-Timeout from worker process!

However, the problem became evident only later, once we actually tried to access the LNFS data, i.e. during FmodErr at the beginning of the lock acquisition (Figure 1).

This time Lnfs100 appeared dead from the VPM, but after connecting to olserver129 I could see that both processes were actually still alive:

virgorun@olserver129[~]: ps aux | grep Lnfs100
virgorun 32242  0.0  0.0 113424  1692 ?        S    04:24   0:00 bash /virgoApp/PyLnfs100/v4r1p1/Linux-x86_64-CL7/bin/PyLnfs100-conda /virgoData/VirgoOnline/Lnfs100.cfg Lnfs100
virgorun 32714  1.5  0.9 741672 75620 ?        SNl  04:24   9:11 python3 /virgoApp/PyLnfs100/v4r1p1/scripts/PyLnfs100.py /virgoData/VirgoOnline/Lnfs100.cfg Lnfs100
virgorun 33491  0.0  0.0 112824   992 pts/8    R+   14:15   0:00 grep --color=auto Lnfs100
virgorun@olserver129[~]: kill -9 32242 32714
virgorun@olserver129[~]: ps aux | grep Lnfs100
virgorun 33582  0.0  0.0 112820   988 pts/8    S+   14:16   0:00 grep --color=auto Lnfs100

After killing the processes (as virgorun) I could restart them from the VPM; then the same steps done by Alain (remove the SMS from Lnfs100 and ReloadConfig, both on FbsISC) restored proper communication.
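
The manual ps/kill step above can be scripted. A sketch that extracts the PIDs to kill from `ps aux` output, skipping the `grep` line itself (the pattern and column layout follow the transcript above; the function name is hypothetical):

```python
def pids_to_kill(ps_output: str, pattern: str):
    """Return PIDs of lines matching pattern, ignoring the grep process."""
    pids = []
    for line in ps_output.splitlines():
        if pattern in line and "grep" not in line:
            pids.append(int(line.split()[1]))  # PID is the second column
    return pids

# Abridged copy of the ps aux output from the transcript above.
PS_OUTPUT = """\
virgorun 32242  0.0  0.0 113424  1692 ?     S   04:24  0:00 bash PyLnfs100-conda Lnfs100.cfg Lnfs100
virgorun 32714  1.5  0.9 741672 75620 ?     SNl 04:24  9:11 python3 PyLnfs100.py Lnfs100.cfg Lnfs100
virgorun 33491  0.0  0.0 112824   992 pts/8 R+  14:15  0:00 grep --color=auto Lnfs100
"""

print(pids_to_kill(PS_OUTPUT, "Lnfs100"))  # [32242, 32714]
# The resulting list could then be passed to os.kill(pid, signal.SIGKILL),
# reproducing the manual `kill -9 32242 32714`.
```
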

If this happens again the same procedure can be followed; however, it will most probably not be sufficient: the issue arises when the INJ_MAIN node resets the modulation amplitude in the Lnfs and cannot read it back, but the command is not actually received, because the communication was already lost beforehand.

So, if INJ_MAIN keeps notifying "Waiting for 8MHz mod ampl going to default", one should re-send the command from a standard iPython shell:

cm_send('Lnfs100','SET8AMPL',15)
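
When re-issuing the command, a small send-and-verify loop avoids repeating it blindly. The cm interface is stubbed here, since the real `cm_send` and the amplitude readback are Virgo-specific; everything except `cm_send('Lnfs100','SET8AMPL',15)` is an assumption:

```python
def restore_8mhz_ampl(send_fn, read_fn, target=15, attempts=3):
    """Send SET8AMPL and verify the readback, retrying a few times.

    send_fn/read_fn stand in for cm_send('Lnfs100', 'SET8AMPL', target)
    and for reading back the 8 MHz modulation amplitude channel.
    """
    for _ in range(attempts):
        send_fn("Lnfs100", "SET8AMPL", target)
        if read_fn() == target:
            return True
    return False

# Stubbed server state for illustration.
state = {"ampl": 0}
ok = restore_8mhz_ampl(
    lambda srv, cmd, val: state.__setitem__("ampl", val),
    lambda: state["ampl"],
)
print(ok)  # True
```

A False return after the retries would indicate the communication is still broken, i.e. the kill/restart and FbsISC steps above need to be repeated first.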

Images attached to this comment