From: Brancaleoni Matteo
Date: June 21 2004 5:33pm
Subject: Re: DB node hang on start
List-Archive: http://lists.mysql.com/cluster/23
Message-Id: <1087839220.2694.4.camel@athlon>
MIME-Version: 1.0
Content-Type: text/plain
Content-Transfer-Encoding: 7bit

Hi.

I managed to get it working on the same machine.
What I didn't know is that a db node waits for all other
nodes on startup if the number of db nodes is > 1.

So that is solved... Can we go on with the remote one?
Could I follow this same procedure on the remote node?

Another question: what does NoOfReplicas mean?
According to the manual it can be from 1 to 4, but in practice
I cannot set it to more than 2... So it seems that a cluster with
4 db nodes (all on the same machine, with multiple computer entries)
uses only the first 2, even when all nodes are up.
Is that expected?

Thanks,

Matteo

On Mon, 2004-06-21 at 20:36, Tomas Ulin wrote:
> you should be able to run with 2 COMPUTER definitions on the same
> machine, let's start with making that work
>
> do the following:
> - clean up directory from NDB_Trace, error.log etc
> - get back to the stuck state (make sure it's stuck for a minute or so)
> - identify the "ndb pid" = e.g. 17993 for the stuck node, you should see
> something like 2004-06-21 18:29:18 [NDB] INFO -- Angel pid: 17991
> ndb pid: 17993
> - force an abort with kill -6 17993
> - tar.gz the NDB_Trace... file and send it to me
>
> T
>
> Matteo Brancaleoni wrote:
>
> >Hi.
> >
> >I was able to run 2 db nodes on the same machine.
> >The problem was into the [COMPUTER]
> >definition. Following the demos, I thought that
> >I need 2 [COMPUTER] definitions, even pointing
> >to the same machine, and let DB node 1 to run
> >on computer 1 and db node 2 to run on computer 2
> >(that's the same entry).
> >
> >Simply removing the 2nd computer entry and
> >letting db node #2 to run on computer 1 (as the first
> >db node) works ok.
> >
> >so far so good.
> >
> >but now I have the problem about having the 2nd db node
> >on another machine... still no joy.
> >
> >Matteo
> >
> >On Mon, 2004-06-21 at 17:34, Tomas Ulin wrote:
> >
> >>but, I saw the below. It shows that you did not start cluster empty (-i).
> >>
> >>T
> >>
> >>2004-06-20 16:49:34 [NDB] INFO -- Angel pid: 5558 ndb pid: 5560
> >>2004-06-20 16:49:34 [NDB] INFO -- NDB Cluster -- DB node 2
> >>2004-06-20 16:49:34 [NDB] INFO -- Version 3.5.0 (beta) --
> >>2004-06-20 16:49:34 [NDB] INFO -- Start initiated (version 3.5.0)
> >>Dbdict: name=sys/def/SYSTAB_0,id=0
> >>Dbdict: name=sys/def/NDB$EVENTS_0,id=2
> >>Dbdict: name=test/def/matteotabella2,id=4
> >>Dbdict: name=test/def/4/PRIMARY,id=6
> >>Dbdict: name=test/def/matteo,id=8
> >>Dbdict: name=test/def/8/PRIMARY,id=10
> >>Dbdict: name=test/def/mytabella,id=12
> >>Dbdict: name=test/def/12/PRIMARY,id=14
> >>2004-06-20 16:50:12 [NDB] INFO -- Started (version 3.5.0)
> >>
> >>Tomas Ulin wrote:
> >>
> >>>when going from 1-node to 2-nodes, did you restart both nodes with -i
> >>>flag?
> >>>
> >>>T
> >>>
> >>>Matteo Brancaleoni wrote:
> >>>
> >>>>Hi
> >>>>
> >>>>On Mon, 2004-06-21 at 13:45, Tomas Ulin wrote:
> >>>>
> >>>>>Did you try to start the second node with "ndbd -i"?
> >>>>>
> >>>>yes, without success.
> >>>>
> >>>>>Brancaleoni Matteo wrote:
> >>>>>
> >>>>>>Hi, thanks for the fast answer :)
> >>>>>>see my comments inline.
> >>>>>>
> >>>>>>On Mon, 2004-06-21 at 00:43, Tomas Ulin wrote:
> >>>>>>
> >>>>>>>first of all, if you download the latest source you don't have to
> >>>>>>>specify the "[TCP]" connections at all
> >>>>>>>
> >>>>>>Ok, done.
> >>>>>>
> >>>>>>>1) please look where you started ndb_mgmd, you should find a
> >>>>>>>cluster.log (look at the end "tail -n100 cluster.log")
> >>>>>>>
> >>>>>>ok, got it. unfortunately no trace about the db node #3, that's
> >>>>>>the one onto the remote machine
> >>>>>>
> >>>>>>>2) please make sure that you don't have any trailing "ndbd"
> >>>>>>>processes on the failing machine. (we're working on better
> >>>>>>>detection on clashes), if so kill and restart (if a "ndb" process
> >>>>>>>hangs this is often due to that there are "multiple" processes
> >>>>>>>trying to connect as the same "id")
> >>>>>>>
> >>>>>>ok. no trailing processes.
> >>>>>>
> >>>>>>>3) make sure you have your [COMPUTER] sections correct in the
> >>>>>>>config file
> >>>>>>>
> >>>>>>ok, done
> >>>>>>
> >>>>>>>4) make sure that your Ndb.cfg/NDB_CONNECTSTRING points to the
> >>>>>>>actual host:port that run the ndb_mgmd
> >>>>>>>
> >>>>>>sure done.
> >>>>>>If I write something wrong (done just 4 testing) the node
> >>>>>>doesn't go at all into starting phase (should be phase 1, I think).
> >>>>>>But when starts, is stick in that state.
> >>>>>>
> >>>>>>>and try again until you get the config right
> >>>>>>>
> >>>>>>mmh... I tried to start 2 db nodes on the same machine
> >>>>>>(of course with different fs), the 2nd db node starts,
> >>>>>>but after phase #4 crashes.
> >>>>>>
> >>>>>>I have a rather long trace file for that.
> >>>>>>the error into ndbd error.log is :
> >>>>>>
> >>>>>>Date/Time: x 20 June 2004 - 23:15:49
> >>>>>>Type of error: error
> >>>>>>Message: Internal program error (failed ndbrequire)
> >>>>>>Fault ID: 2341
> >>>>>>Problem data: DbdihMain.cpp
> >>>>>>Object of reference: DBDIH (Line: 1080) 0x00000002
> >>>>>>ProgramName: NDB Kernel
> >>>>>>ProcessID: 10904
> >>>>>>TraceFile: NDB_TraceFile_1.trace
> >>>>>>***EOM***
> >>>>>>
> >>>>>>The mgm config is (for 2 db nodes on same machine)
> >>>>>>[COMPUTER]
> >>>>>>Id: 1
> >>>>>>ByteOrder: Little
> >>>>>>HostName: bestia
> >>>>>>[COMPUTER]
> >>>>>>Id: 2
> >>>>>>ByteOrder: Little
> >>>>>>HostName: bestia
> >>>>>>[MGM]
> >>>>>>Id: 1
> >>>>>>ExecuteOnComputer: 1
> >>>>>>ArbitrationRank: 1
> >>>>>>[DB DEFAULT]
> >>>>>>NoOfReplicas: 2
> >>>>>>[DB]
> >>>>>>Id: 2
> >>>>>>ExecuteOnComputer: 1
> >>>>>>FileSystemPath: /root/ndb/ndb_data1
> >>>>>>[DB]
> >>>>>>Id: 3
> >>>>>>ExecuteOnComputer: 2
> >>>>>>FileSystemPath: /root/ndb/ndb_data2
> >>>>>>[API]
> >>>>>>Id: 4
> >>>>>>ExecuteOnComputer: 1
> >>>>>>ArbitrationRank: 1
> >>>>>>
> >>>>>>Regarding 2 db nodes on different machines, I'm stick
> >>>>>>to node #3 not starting (stops at phase 1, without
> >>>>>>exiting...)
> >>>>>>The only difference in mgm config.ini is the hostname
> >>>>>>of COMPUTER with id #2
> >>>>>>
> >>>>>>any clue?
> >>>>>>
> >>>

--
Brancaleoni Matteo
Espia Srl
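
For the trace-collection steps quoted in this thread (find the "ndb pid" in the startup output, then force an abort with kill -6), here is a minimal Python sketch of how the pid could be pulled out of the log text. The names PID_LINE, find_pids and abort_ndb are hypothetical helpers made up for illustration and are not part of NDB; the only thing taken from the thread is the "Angel pid: ... ndb pid: ..." line format and the use of signal 6 (SIGABRT on Linux).

```python
import os
import re
import signal

# Matches the startup line quoted in the thread, e.g.
# "2004-06-21 18:29:18 [NDB] INFO -- Angel pid: 17991 ndb pid: 17993".
# \s+ also tolerates the line being wrapped between the two pids.
PID_LINE = re.compile(r"Angel pid:\s*(\d+)\s+ndb pid:\s*(\d+)")


def find_pids(log_text):
    """Return (angel_pid, ndb_pid) from the last matching line, or None."""
    matches = PID_LINE.findall(log_text)
    if not matches:
        return None
    angel, ndb = matches[-1]
    return int(angel), int(ndb)


def abort_ndb(ndb_pid):
    """Send SIGABRT to the ndb process (what `kill -6 <pid>` does on Linux)."""
    os.kill(ndb_pid, signal.SIGABRT)


if __name__ == "__main__":
    sample = "2004-06-21 18:29:18 [NDB] INFO -- Angel pid: 17991 ndb pid: 17993"
    # Only prints the parsed pids; abort_ndb is deliberately not called here.
    print(find_pids(sample))
```

The sketch takes the last match so that, if a node was restarted several times, the pid of the most recent start is used. The abort goes to the ndb pid, as in the quoted instructions, not to the angel pid.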