Virgo Logbook

AdV-COM (AdV commissioning (1st part) )

Carbognani, Cortese, Rolland - 14:53 Sunday 23 April 2017 (37340)

Recovery after network outage

This morning I received an alarm about the TCS chillers. Checking the DMS and VPM, I saw a flood of red flags, so I came onsite to check the situation.

Once onsite, it was clear that the problem was at the network level: some of the devices in the Central, NE and WE buildings were not reachable. I could not spot any evident electrical problem that might have caused the network outage.

After consulting with Antonella over the phone, Stefano came onsite and started fixing the problem (see his entry for details). The recovery also required rebooting the rtpc servers.

The situation was almost fully recovered by resetting all devices that had been disconnected from the network (such as the netcom bridges) and by restarting all processes running on the rtpc servers and the whole DAQ, with Loic's support.

Communication with the Cryo and Tower vacuum devices also needed to be restored (all the corresponding Tango DAQ bridges are providing data again).

The remaining problems are associated with missing SAT DSP g-names, which are blocking the automation. Stopping and restarting all SAT Tango processes did not recover the situation.

Since I could not reach Valerio by phone, I am leaving the situation as it is; the recovery will continue tomorrow morning.

Comments to this report:

cortese - 15:42 Sunday 23 April 2017 (37341)

The network problem affected many hosts on the site access network switches and was caused by the main radius server malfunctioning because its system disk had filled up.
By configuration, for the majority of the hosts the network connections that were already up should not have been dropped, irrespective of the radius server status. Why this did not happen is under investigation.
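
Since the root cause was a full system disk, a periodic check of disk usage could raise a warning before a server stops working. The following is only a minimal sketch, not what is deployed on site: the function names, the checked path and the 90% threshold are all assumptions chosen for illustration.

```python
import shutil


def disk_usage_fraction(path="/"):
    """Return the used fraction (0.0 to 1.0) of the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Some pseudo-filesystems report zero size; treat them as empty.
        return 0.0
    return usage.used / usage.total


def is_nearly_full(path="/", threshold=0.90):
    """Return True when the filesystem holding `path` is at or above `threshold`.

    The 0.90 default is an arbitrary illustrative value, not a site setting.
    """
    return disk_usage_fraction(path) >= threshold


if __name__ == "__main__":
    # Hypothetical usage: check the system disk and print a warning line.
    fraction = disk_usage_fraction("/")
    status = "WARNING" if is_nearly_full("/") else "OK"
    print(f"{status}: system disk {fraction:.0%} used")
```

Such a check would typically be run from a scheduler or hooked into the existing monitoring, so that a flag is raised well before the disk is completely full.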

G. Ballardin - 11:00 Monday 24 April 2017 (37343)

Because of the problem reported in #37341 it was necessary to restart some motor servers (TSrvMtr_IB, TSrvMtr_PR, TSrvMtr_SR, TSrvMtr_BS, TSrvMtr_NI, TSrvMtr_WI, TSrvMtr_WE, TSrvMtr_DET).
Now everything is working correctly.

G. Ballardin and Vacuum team - 11:12 Monday 24 April 2017 (37344)

Because of the problem reported in #37341 it was necessary to restart the PLCs of both the INJ and DET small-cryotraps:
neither PLC was accessible via the network. In the meantime I also updated the firmware to improve the temperature readout.
Now the temperatures of both small-cryotraps are consistent.

mohan - 11:40 Monday 24 April 2017 (37346)

Because of the problem reported in #37341 it was also necessary to restart some Tango servers in vacuum and injection which were flagged red.

injection:

setenv TANGO_HOST tangoitf:10000; astor &

"Start sdt2011/DDS1"

vacuum:

setenv TANGO_HOST olserver122:10000; astor &

"Start LN2Rack/NI" and "Start LN2Rack/WI".

(Note: LN2Rack/NE and LN2Rack/WE, which are identical to the two servers above, did not go red.)

Images attached to this comment

Boschi, Trozzo - 11:47 Monday 24 April 2017 (37347)

The network outage caused some DSP boards of several SAs (MC, PR, IB, NI, BS) to lose their IP addresses, as happened recently (see entry #37246). All affected boards are now up, and the associated control loops have been closed and are working properly.