From: Tomas Ulin
Date: June 21 2004 6:36pm
Subject: Re: DB node hang on start
List-Archive: http://lists.mysql.com/cluster/22
Message-Id: <40D72AC6.7070307@mysql.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

You should be able to run with 2 COMPUTER definitions on the same machine, so let's start by making that work.

Do the following:

- clean up the directory (remove NDB_Trace files, error.log, etc.)
- get back to the stuck state (make sure it's stuck for a minute or so)
- identify the "ndb pid" (e.g. 17993) for the stuck node; you should see something like
  2004-06-21 18:29:18 [NDB] INFO -- Angel pid: 17991 ndb pid: 17993
- force an abort with
  kill -6 17993
- tar.gz the NDB_Trace... file and send it to me

T

Matteo Brancaleoni wrote:

>Hi.
>
>I was able to run 2 db nodes on the same machine.
>The problem was in the [COMPUTER]
>definition. Following the demos, I thought that
>I needed 2 [COMPUTER] definitions, even pointing
>to the same machine, with DB node 1 running
>on computer 1 and db node 2 running on computer 2
>(that's the same entry).
>
>Simply removing the 2nd computer entry and
>letting db node #2 run on computer 1 (like the first
>db node) works ok.
>
>so far so good.
>
>but now I have the problem of getting the 2nd db node
>running on another machine... still no joy.
>
>Matteo
>
>On Mon, 2004-06-21 at 17:34, Tomas Ulin wrote:
>
>>but, I saw the below. It shows that you did not start cluster empty (-i).
>>
>>T
>>
>>2004-06-20 16:49:34 [NDB] INFO -- Angel pid: 5558 ndb pid: 5560
>>2004-06-20 16:49:34 [NDB] INFO -- NDB Cluster -- DB node 2
>>2004-06-20 16:49:34 [NDB] INFO -- Version 3.5.0 (beta) --
>>2004-06-20 16:49:34 [NDB] INFO -- Start initiated (version 3.5.0)
>>Dbdict: name=sys/def/SYSTAB_0,id=0
>>Dbdict: name=sys/def/NDB$EVENTS_0,id=2
>>Dbdict: name=test/def/matteotabella2,id=4
>>Dbdict: name=test/def/4/PRIMARY,id=6
>>Dbdict: name=test/def/matteo,id=8
>>Dbdict: name=test/def/8/PRIMARY,id=10
>>Dbdict: name=test/def/mytabella,id=12
>>Dbdict: name=test/def/12/PRIMARY,id=14
>>2004-06-20 16:50:12 [NDB] INFO -- Started (version 3.5.0)
>>
>>Tomas Ulin wrote:
>>
>>>when going from 1-node to 2-nodes, did you restart both nodes with -i
>>>flag?
>>>
>>>T
>>>
>>>Matteo Brancaleoni wrote:
>>>
>>>>Hi
>>>>
>>>>On Mon, 2004-06-21 at 13:45, Tomas Ulin wrote:
>>>>
>>>>>Did you try to start the second node with "ndbd -i"?
>>>>>
>>>>yes, without success.
>>>>
>>>>>Brancaleoni Matteo wrote:
>>>>>
>>>>>>Hi, thanks for the fast answer :)
>>>>>>see my comments inline.
>>>>>>
>>>>>>On Mon, 2004-06-21 at 00:43, Tomas Ulin wrote:
>>>>>>
>>>>>>>first of all, if you download the latest source you don't have to
>>>>>>>specify the "[TCP]" connections at all
>>>>>>>
>>>>>>Ok, done.
>>>>>>
>>>>>>>1) please look where you started ndb_mgmd, you should find a
>>>>>>>cluster.log (look at the end "tail -n100 cluster.log")
>>>>>>>
>>>>>>ok, got it. unfortunately no trace of db node #3, which is
>>>>>>the one on the remote machine
>>>>>>
>>>>>>>2) please make sure that you don't have any trailing "ndbd"
>>>>>>>processes on the failing machine. (we're working on better
>>>>>>>detection of clashes), if so kill and restart (if a "ndb" process
>>>>>>>hangs this is often because there are "multiple" processes
>>>>>>>trying to connect as the same "id")
>>>>>>>
>>>>>>ok. no trailing processes.
>>>>>>
>>>>>>>3) make sure you have your [COMPUTER] sections correct in the
>>>>>>>config file
>>>>>>>
>>>>>>ok, done
>>>>>>
>>>>>>>4) make sure that your Ndb.cfg/NDB_CONNECTSTRING points to the
>>>>>>>actual host:port that run the ndb_mgmd
>>>>>>>
>>>>>>sure done.
>>>>>>If I write something wrong (done just for testing) the node
>>>>>>doesn't go into the starting phase at all (should be phase 1, I think).
>>>>>>But when it starts, it is stuck in that state.
>>>>>>
>>>>>>>and try again until you get the config right
>>>>>>>
>>>>>>mmh... I tried to start 2 db nodes on the same machine
>>>>>>(of course with different fs), the 2nd db node starts,
>>>>>>but crashes after phase #4.
>>>>>>
>>>>>>I have a rather long trace file for that.
>>>>>>the error in ndbd error.log is:
>>>>>>
>>>>>>Date/Time: x 20 June 2004 - 23:15:49
>>>>>>Type of error: error
>>>>>>Message: Internal program error (failed ndbrequire)
>>>>>>Fault ID: 2341
>>>>>>Problem data: DbdihMain.cpp
>>>>>>Object of reference: DBDIH (Line: 1080) 0x00000002
>>>>>>ProgramName: NDB Kernel
>>>>>>ProcessID: 10904
>>>>>>TraceFile: NDB_TraceFile_1.trace
>>>>>>***EOM***
>>>>>>
>>>>>>The mgm config is (for 2 db nodes on same machine)
>>>>>>[COMPUTER]
>>>>>>Id: 1
>>>>>>ByteOrder: Little
>>>>>>HostName: bestia
>>>>>>[COMPUTER]
>>>>>>Id: 2
>>>>>>ByteOrder: Little
>>>>>>HostName: bestia
>>>>>>[MGM]
>>>>>>Id: 1
>>>>>>ExecuteOnComputer: 1
>>>>>>ArbitrationRank: 1
>>>>>>[DB DEFAULT]
>>>>>>NoOfReplicas: 2
>>>>>>[DB]
>>>>>>Id: 2
>>>>>>ExecuteOnComputer: 1
>>>>>>FileSystemPath: /root/ndb/ndb_data1
>>>>>>[DB]
>>>>>>Id: 3
>>>>>>ExecuteOnComputer: 2
>>>>>>FileSystemPath: /root/ndb/ndb_data2
>>>>>>[API]
>>>>>>Id: 4
>>>>>>ExecuteOnComputer: 1
>>>>>>ArbitrationRank: 1
>>>>>>
>>>>>>Regarding 2 db nodes on different machines, I'm stuck
>>>>>>with node #3 not starting (it stops at phase 1, without
>>>>>>exiting...)
>>>>>>The only difference in mgm config.ini is the hostname
>>>>>>of COMPUTER with id #2
>>>>>>
>>>>>>any clue?
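
For reference, the abort steps at the top of this message (find the "ndb pid" in the node's startup output, then send it signal 6) could be scripted roughly as below. This is only a sketch in Python: the function names are made up for illustration, the regular expression assumes the log line looks exactly like the "Angel pid: ... ndb pid: ..." sample above, and it assumes signal 6 is SIGABRT, as it is on Linux.

```python
import os
import re
import signal

# Matches the startup line format shown in this thread, e.g.
# "2004-06-21 18:29:18 [NDB] INFO -- Angel pid: 17991 ndb pid: 17993"
PID_LINE = re.compile(r"Angel pid:\s*(\d+)\s+ndb pid:\s*(\d+)")


def find_ndb_pids(log_text):
    """Return (angel_pid, ndb_pid) from the last matching line, or None.

    Hypothetical helper: the last match is used on the assumption that
    the most recent start is the one that is currently stuck.
    """
    matches = PID_LINE.findall(log_text)
    if not matches:
        return None
    angel_pid, ndb_pid = matches[-1]
    return int(angel_pid), int(ndb_pid)


def abort_ndb(ndb_pid):
    """Send SIGABRT to the ndb process (the equivalent of `kill -6 <pid>`
    on systems where signal 6 is SIGABRT)."""
    os.kill(ndb_pid, signal.SIGABRT)


if __name__ == "__main__":
    sample = "2004-06-21 18:29:18 [NDB] INFO -- Angel pid: 17991 ndb pid: 17993"
    # Only parses and prints; call abort_ndb() yourself once you have
    # confirmed the pid belongs to the stuck node.
    print(find_ndb_pids(sample))
```

The signal goes to the ndb pid, not the angel pid, matching the "kill -6 17993" step above.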