Online Computing (Online System Processes)
carbognani, cortese, dibiase - 18:51 Thursday 11 August 2022 (56721)
Problems with the Tango database cluster (olserver124 node corrupted)

This afternoon we experienced problems on the Tango database cluster: one machine (olserver124) became unresponsive and the cluster automatically failed over to olserver125.
Unfortunately, even after the failover the Tango database servers (DataBaseds) were unable to connect to the database and reported the error:

Starting tango db server in itf domain as user uid=1228(virgod) gid=5000(cvs) groups=5000(cvs),1100(virgo) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
virgod@olserver129[~]: main(): arrived
Received a CORBA::Exception
Tango exception
Severity = ERROR
Error reason = DB_SQLError
Desc : Failed to query TANGO database (error=Got error 157 'Unknown error code' from NDBCLUSTER)
The query was: SELECT count(*) FROM property_device_hist
Origin : DataBase::check_history_tables()
Exiting
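
For monitoring purposes, a dump like the one above can be reduced to a few fields. The following is only a minimal sketch, assuming the log text keeps exactly the layout shown in this report; the helper name parse_tango_error and the keys engine_error and engine are hypothetical and not part of Tango.

```python
import re

# Sample text copied from the DataBaseds startup log in this report.
LOG = """\
Received a CORBA::Exception
Tango exception
Severity = ERROR
Error reason = DB_SQLError
Desc : Failed to query TANGO database (error=Got error 157 'Unknown error code' from NDBCLUSTER)
The query was: SELECT count(*) FROM property_device_hist
Origin : DataBase::check_history_tables()
Exiting
"""


def parse_tango_error(text):
    """Extract the main fields of a Tango exception dump (hypothetical helper)."""
    fields = {}
    for line in text.splitlines():
        # Lines look like "Key = value" or "Key : value".
        m = re.match(r"\s*(Severity|Error reason|Desc|Origin)\s*[=:]\s*(.*)", line)
        if m:
            fields[m.group(1)] = m.group(2).strip()
    # Pull the storage-engine error code out of the description, if present.
    m = re.search(r"Got error (\d+) .* from (\w+)", fields.get("Desc", ""))
    if m:
        fields["engine_error"] = int(m.group(1))
        fields["engine"] = m.group(2)
    return fields


info = parse_tango_error(LOG)
print(info["Error reason"], info["engine"], info["engine_error"])
```

With such fields available, an alarm could for instance distinguish a storage-engine failure (as here) from an ordinary SQL error.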


From the ITF/Automation point of view, processes such as SatServer (interfacing with the Tango DSPServers), which were only reading parameters at that moment, were able to carry on; but as soon as the Automation (INJ_MAIN) tried to set a value on the LNFS, we immediately got the error:
Received set_8_ampl message from INJ_MAIN
2022-08-11 14h00m23 UTC
Error communication with LNFS
.......
desc = Failed to connect to device sys/database/2
....

so the locking procedure was stuck.


Investigating further, we found that olserver124 was corrupted (it needs to be restored later), and completing a working failover to olserver125 required a restart of the whole cluster, which took some time.

As soon as the cluster was operational again, we could restart the DataBaseds correctly and resume standard Tango operation, and ITF locking could proceed.

This event (and a few other recent occurrences of Tango infrastructure problems) is another push toward removing the very few remaining critical dependencies on Tango within the Automation/Locking procedure, which basically involve only LNFS100, SatServer and the Virgopll SQZ servers. Failures of the remaining Tango processes (providing mainly data collection) should not prevent locking.
Porting of the Virgopll SQZ servers to pyserver is already ongoing and, since it is quite straightforward, we are also working on a new version of the PyLnfs100 server that connects directly to the LNFS box, making the Lnfs100 Tango server redundant.

Comments to this report:
dibiase - 10:38 Friday 12 August 2022 (56732)

Olserver124 recovered. All nodes in the MySQL cluster are online and working:

[root@olserver124 ~]# ndb_mgm
-- NDB Cluster -- Management Client --
ndb_mgm> show
Connected to Management Server at: localhost:1186
Cluster Configuration
---------------------
[ndbd(NDB)]    2 node(s)
id=3    @90.147.137.26  (mysql-5.6.21 ndb-7.3.7, Nodegroup: 0, *)
id=4    @90.147.137.27  (mysql-5.6.21 ndb-7.3.7, Nodegroup: 0)


[ndb_mgmd(MGM)]    2 node(s)
id=1    @90.147.137.24  (mysql-5.6.21 ndb-7.3.7)
id=2    @90.147.137.25  (mysql-5.6.21 ndb-7.3.7)


[mysqld(API)]    15 node(s)
id=5    @90.147.137.25  (mysql-5.6.21 ndb-7.3.7)
id=6    @90.147.137.25  (mysql-5.6.21 ndb-7.3.7)
id=7    @90.147.137.25  (mysql-5.6.21 ndb-7.3.7)
id=8    @90.147.137.24  (mysql-5.6.21 ndb-7.3.7)
id=9    @90.147.137.24  (mysql-5.6.21 ndb-7.3.7)
id=10 (not connected, accepting connect from any host)
id=11 (not connected, accepting connect from any host)
id=12 (not connected, accepting connect from any host)
id=13 (not connected, accepting connect from any host)
id=14 (not connected, accepting connect from any host)
id=15 (not connected, accepting connect from any host)
id=16 (not connected, accepting connect from any host)
id=17    @90.147.137.24  (mysql-5.6.21 ndb-7.3.7)
id=18    @90.147.137.25  (mysql-5.6.21 ndb-7.3.7)
id=19    @90.147.137.24  (mysql-5.6.21 ndb-7.3.7)
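
A check like the one above could be automated. The following is only a sketch, assuming the text comes from the management client (for example via ndb_mgm -e show) and keeps the layout shown in this report; the function name summarize and the abridged sample are illustrative assumptions, not an existing tool.

```python
import re

# Abridged sample in the same layout as the "show" output in this report.
SHOW = """\
[ndbd(NDB)]    2 node(s)
id=3    @90.147.137.26  (mysql-5.6.21 ndb-7.3.7, Nodegroup: 0, *)
id=4    @90.147.137.27  (mysql-5.6.21 ndb-7.3.7, Nodegroup: 0)

[ndb_mgmd(MGM)]    2 node(s)
id=1    @90.147.137.24  (mysql-5.6.21 ndb-7.3.7)
id=2    @90.147.137.25  (mysql-5.6.21 ndb-7.3.7)

[mysqld(API)]    15 node(s)
id=5    @90.147.137.25  (mysql-5.6.21 ndb-7.3.7)
id=10 (not connected, accepting connect from any host)
"""


def summarize(show_text):
    """Group node ids by section (NDB, MGM, API) and connection state."""
    summary = {}
    section = None
    for line in show_text.splitlines():
        # Section headers look like "[ndbd(NDB)]    2 node(s)".
        header = re.match(r"\[(\w+)\((\w+)\)\]", line)
        if header:
            section = header.group(2)
            summary[section] = {"connected": [], "not_connected": []}
            continue
        # Node lines start with "id=N" followed by "@host" or "(not connected".
        node = re.match(r"id=(\d+)\s+(@\S+|\(not connected)", line)
        if node and section:
            state = "connected" if node.group(2).startswith("@") else "not_connected"
            summary[section][state].append(int(node.group(1)))
    return summary


status = summarize(SHOW)
# Data and management nodes are the critical ones for cluster health.
cluster_ok = not status["NDB"]["not_connected"] and not status["MGM"]["not_connected"]
print(status, cluster_ok)
```

Unconnected [mysqld(API)] slots are not treated as a failure here, since they may simply be unused connection slots.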

 

 
