On-call intervention (General)
Oncall-system - 9:18 Friday 22 August 2025 (67536)
On-call intervention

The following report has been submitted to the On-call interface.

On-call events -> Network

Title: EGO firewall issue

Author(s): kraja, cortese

Called at: 00:50, 22-08-2025, by: Sposito
Remote intervention: Started: 00:50, 22-08-2025; Ended: 03:45, 22-08-2025
On-site intervention: Started: ; Ended:
Status: Action pending
Operator when issue resolved: Sposito

Details:

The call was originated by the operator on duty (G. Sposito) at approximately 00:50 LT, who reported a problem with the connection to EGO services (Captive portal, DMS site, Mail, Operators) from remote.
After an initial analysis, S. Cortese was also involved (called at approximately 01:10 LT).
The analysis revealed that the problem affects the EGO firewall, more precisely one of its two nodes. Once we were certain that only one node was involved, we relocated the services to the second node, and the situation returned to normal at approximately 03:40 LT.
Tests have revealed that the problem on the first node persists, so further investigations will be carried out in the coming days to understand the type of problem encountered.
During this issue, only external connections were affected, including the various EGO/Virgo websites (e.g., logbook, vmd).
There was no impact on the interferometer network. In fact, for the entire duration of the issue, the interferometer remained locked in science mode.
In addition, the operator was able to connect via VPN and open ThinLinc sessions as usual for the normal duty activities.

* Note that any files attached to this report are available in the On-call interface.

Comments to this report:
carbognani, seder - 16:10 Friday 22 August 2025 (67538)

Among the external network connections, the Low Latency Data Transfer via Kafka was also affected, resulting in Virgo data missing on the CIT side.

Early signs of the problem were reported by the Iciga2 monitoring at the top level, which flagged the Virgo Low Latency machines as not pingable:

lowlatency-virgo is DOWN

Host check output:

PING CRITICAL - Packet loss = 80%, RTA = 8635.61 ms

Notification type: PROBLEM
Date time: Thu Aug 21 20:18:11 2025 UTC

Then the process in the LowLatencyAnalysis VPM dedicated to monitoring the Cascina -> CIT link (V1KafkaCITIn) accurately reported this data loss:

2025-08-22-01h23m36-UTC>WARNING-Miss 11408 seconds between 1439849472 and 1439860880
2025-08-22-01h23m36-UTC>INFO...-CfgReachState> Active(Active) Ok

At the moment, V1KafkaCITIn reports these errors as warnings (typically we can have interruptions of a few seconds that are managed by the internal Kafka mechanism). We intend to modify the process so that it goes into an error state (and triggers DMS notifications, for better monitoring) when the interruption lasts longer than a certain time, to be set as a parameter of the process.
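The planned warning-to-error escalation could be sketched as below. This is a minimal illustration, not the actual V1KafkaCITIn implementation: the threshold name and the `classify_gap` helper are hypothetical, and the threshold value is an arbitrary placeholder for the parameter to be tuned.

```python
# Hypothetical sketch of the proposed escalation logic: short Kafka-stream
# interruptions stay at WARNING (handled by Kafka's internal mechanism),
# while gaps longer than a configurable threshold become ERROR and would
# trigger a DMS notification. Names and threshold are illustrative only.

GAP_ERROR_THRESHOLD_S = 60  # process parameter, value to be tuned


def classify_gap(gps_start: int, gps_end: int,
                 threshold_s: int = GAP_ERROR_THRESHOLD_S) -> str:
    """Classify a data gap between two GPS times as WARNING or ERROR."""
    gap_s = gps_end - gps_start
    return "ERROR" if gap_s > threshold_s else "WARNING"


# The 11408 s gap reported in this incident would be classified as ERROR:
print(classify_gap(1439849472, 1439860880))  # -> ERROR
```

With such a parameter, the few-second interruptions mentioned above would still be logged as warnings, while an outage of this scale would immediately surface on the DMS.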

Note that the data loss also occurred for the incoming data, but only on the LLO link, as reported by the writing process L1KafkaCasIn:

2025-08-22 01h21m13 UTC FdIOGetFrame: miss 11413 seconds between 1439849472.0 and 1439860885.0

2025-08-22 01h21m13 UTC Input frames are back; gps=1439860885 latency=6.6
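The "miss N seconds between A and B" lines above are internally consistent (the reported count equals the GPS interval). A small hedged helper, assuming only the log format shown in the excerpts, could parse and cross-check such lines:

```python
import re

# Illustrative parser for the "miss N seconds between A and B" lines
# emitted by the frame-reading processes (format taken from the log
# excerpts above; the helper name is hypothetical).

GAP_RE = re.compile(
    r"[Mm]iss\s+(\d+)\s+seconds\s+between\s+([\d.]+)\s+and\s+([\d.]+)"
)


def parse_gap(line: str):
    """Return (missed_s, gps_start, gps_end, consistent) or None.

    `consistent` is True when the reported missed-seconds count matches
    the GPS interval end - start.
    """
    m = GAP_RE.search(line)
    if m is None:
        return None
    missed = int(m.group(1))
    start, end = float(m.group(2)), float(m.group(3))
    return missed, start, end, missed == int(end - start)


line = ("2025-08-22 01h21m13 UTC FdIOGetFrame: "
        "miss 11413 seconds between 1439849472.0 and 1439860885.0")
print(parse_gap(line))  # (11413, 1439849472.0, 1439860885.0, True)
```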

The LHO link was not affected. This may be due to the complexity of the network outage generated by the firewall.
