IT infrastructure (Network)
cortese - 17:43 Thursday 10 September 2020 (49388)
NAC (Network Authentication and Control) problem

Today, starting from midnight, many physical devices began disconnecting from the network, apparently at random, until the majority were unreachable this morning (see this entry), both on the experiment network and on the general network (thin/VDI clients and phones).
The problem was related to the centralized infrastructure that assigns devices to VLANs/IP networks according to their Ethernet MAC address: it was traced down to the primary RADIUS server, which had started to deny access requests at random.
After reconfiguring the LAN switches to point to the secondary RADIUS server, NAC assignment was working again at around 13:00 LT.
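For context, switching the authentication source in this way typically amounts to a small change in the switches' AAA configuration. The following is a hypothetical Cisco-IOS-style sketch only; the switch vendor, server address and shared secret shown are placeholders, not the site's real values:

```
! Hypothetical sketch: point 802.1X/MAC authentication at the secondary
! RADIUS server. Address and key are illustrative placeholders.
aaa new-model
radius server NAC-SECONDARY
 address ipv4 192.0.2.2 auth-port 1812 acct-port 1813
 key EXAMPLE-SECRET
aaa authentication dot1x default group radius
```

The actual commands depend on the switch platform in use; the point is that the failover here was a manual reconfiguration, not an automatic one.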
Since the current SL6.x RTPCs depend on this infrastructure for their mobility, they were also affected and had to be rebooted.
Some other devices, whose DHCP clients are not resilient to server outages, had to be either restarted manually by Franco (Netcom eth/serial bridges) or have their interfaces flapped remotely from the switches (Hameg PSUs, environmental monitoring boxes).

(In NEB, the link of a device, probably ipcam05, has been disabled since it floods the network with frames carrying multiple source Ethernet addresses, putting load on the RADIUS servers; at the moment it cannot be identified as the cause of the NAC problem.)

The primary RADIUS server will stay offline until the investigations are completed.

Comments to this report:
carbognani - 7:35 Friday 11 September 2020 (49395)
As mentioned, as a side effect of this problem, I had to unplug/replug the Netcom eth/serial bridges in the TCS room, EE Room and INJ Elab and restart the corresponding servers, namely:
Lnfs100 (Tango side)
SL_TempController (Tango side)

I also attempted to reset the Netcoms present on the half rack at the CEB entrance (serving the piezos and translation stages), but they would not connect to the dedicated switch on the rack (no communication LEDs blinking). They are not currently in use, and Stefano confirmed they had not been on the network for some time. To be further investigated.
cortese - 17:15 Monday 14 September 2020 (49413)

The problem has been traced down to an abnormal behaviour of ipcam05 (indeed in NEB, not WEB as previously written) and, to a lesser extent, of ipcam37 (in WEB).

Since the end of August these devices have been flooding the network with Ethernet frames carrying unauthorized, apparently random, source addresses.

They eventually increased the flooding to a rate of 2M packets per day in the first week of September.

Although the primary RADIUS server was denying access to these packets, the transaction log eventually filled its disk space, to the point that the server started to deny all requests (legitimate or not) that it could not log, causing the progressive/random kick-out of many ITF devices and RTPCs.

To avoid similar problems in the future, the two RADIUS servers have just been put under disk-space monitoring/alarming, and the log-file rotation has been augmented with safety measures.
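The core of such a disk-space check can be sketched as follows. This is a minimal Python illustration, not the site's actual tooling; the monitored path and the 80% threshold are assumptions:

```python
#!/usr/bin/env python3
"""Minimal sketch of a disk-space alarm for a RADIUS log partition.
The path and threshold below are illustrative assumptions."""
import shutil

THRESHOLD_PCT = 80.0  # hypothetical alarm threshold


def usage_pct(total: int, used: int) -> float:
    """Used space as a percentage of total capacity."""
    return 100.0 * used / total


def partition_is_full(path: str = "/var/log/radius") -> bool:
    """True if the partition holding `path` exceeds the threshold."""
    du = shutil.disk_usage(path)
    return usage_pct(du.total, du.used) >= THRESHOLD_PCT
```

In practice a check like this would run periodically (e.g. from cron) and raise an alarm rather than just return a boolean.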

Investigation of this anomalous behaviour of the ipcams is ongoing at LAPP.


cortese - 10:09 Wednesday 16 September 2020 (49431)

The anomalous behaviour of the D-Link ipcams in the terminal buildings has been found to be caused by the UPnP service enabled on them, together with a loss of configuration that occurred a few months ago due to electricity interruptions.

Thierry is going to disable this service on all the ipcams, since it is unneeded.

masserot, bouedo, pacaud - 15:50 Monday 21 September 2020 (49456)

The same trouble occurred once more this morning around 05h UTC, caused by the camera "SNEB_FrontRight" (ipcam37), even though the UPnP service was disabled on all IP cameras last week.

A spare camera is available in the CEB-LAPP office to replace the faulty one. It would be helpful if the swap could be done soon.

As a consequence, all the ITF Ethernet devices have to be reconnected to the network.

cortese - 11:13 Tuesday 22 September 2020 (49465)

The problem was the same as last week's, caused by ipcam37 still having the UPnP service enabled, as ipcam05 originally did.
I disabled it yesterday morning.

Unfortunately, the disk-space check we put in place last week on the RADIUS servers to warn of the flooding did not work because of a typo, and therefore the disk filled up again.

The alarming has now been corrected.

pacaud, tacca - 15:10 Friday 25 September 2020 (49501)

ipcam37 has been replaced by ipcam42, and the Telescreen configuration has been upgraded. These operations were done yesterday (2020-09-24) at the beginning of the afternoon.
