List: Cluster
From: Mikael Ronström
Date: August 27 2004 9:12am
Subject: Re: Bandwidth, and SCI vs. Gigabit

Hi Clint,

2004-08-26 kl. 19.57 skrev Clint Byrum:

> From what I'm seeing, the number of nodes isn't a great way to predict
> how much network bandwidth you'll need. It seems more important to
> determine it based on the level of reads/writes. Since queries send a
> small message to each node, and then could potentially return HUGE
> results, it's also important to think of the types of queries you'll
> have.
>
> I'm not up on the protocols used, but just by the theory involved, it
> would seem that you can predict fairly accurately how much bandwidth
> will be required per insert/update. One insert of a 100-byte row will
> hit two servers (hash based) with 100 bytes + overhead. One update of
> an 8-byte column will hit two servers with 8 bytes + overhead.
>

The request as such to update a record is about 100 bytes in size. Each
message has 20 bytes of overhead, and each primary key access message
(independent of whether it is a read, update, insert, or delete) has
around 60 bytes of fixed content; see
MYSQL_TOP/ndb/include/kernel/signaldata/TcKeyReq.hpp, which contains a
class definition of the message (there are some optional fields in it
as well).

In addition, the primary key needs to be sent, which is always a
multiple of 4 bytes. Tables without a primary key definition get an
8-byte generated PK.

In addition, the update program needs to be sent. The update program is
really simple: it contains one entry for each attribute updated, and
each entry has an AttributeHeader of 4 bytes followed by the data.

Thus an update of one 8-byte attribute with an 8-byte PK requires
20 + 60 + 8 + 4 + 8 = 100 bytes.

Setting a NULL value only requires the AttributeHeader.
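
As a back-of-the-envelope sketch (plain C++, not the actual NDB code;
tckeyreq_bytes is a made-up helper name), the arithmetic above can be
written like this:

#include <cstddef>
#include <cstdio>
#include <vector>

// Rough TCKEYREQ size, using only the constants quoted above: 20 bytes
// of per-message overhead, ~60 bytes of fixed primary key access
// content, the PK itself (a multiple of 4 bytes), and one 4-byte
// AttributeHeader plus data per updated attribute.
std::size_t tckeyreq_bytes(std::size_t pk_bytes,
                           const std::vector<std::size_t>& updated_attr_bytes)
{
    std::size_t total = 20 + 60 + pk_bytes;
    for (std::size_t attr : updated_attr_bytes)
        total += 4 + attr;   // AttributeHeader + data (0 data for NULL)
    return total;
}

int main()
{
    // The worked example: one 8-byte attribute updated, 8-byte PK.
    std::printf("%zu\n", tckeyreq_bytes(8, {8}));   // prints 100
}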

This TCKEYREQ message goes to the DBTC module, which is the coordinator
of the transaction. Next the message goes to the nodes containing the
data: a similar message, LQHKEYREQ, is sent to the DBLQH module. It
carries some more data, since the node ids to update are sent in the
message.

Thus approximately 20 + 70 + 8 + 4 + 8 = 110 bytes are sent in a small
LQHKEYREQ. It is important to note that the first LQHKEYREQ, to the
primary replica, sometimes does not go over the network, since the
primary replica can be on the same node as the DBTC. The message from
the primary replica to the backup replica, however, always goes over
the network.

The response to the DBTC is sent from the last replica as an
LQHKEYCONF, a 32-byte message (if I remember correctly), and then, if
there is a COMMIT phase, a number of COMMIT and COMMITTED messages go
between the TC and the replicas.

Then finally a TCKEYCONF is sent back to the mysqld. After that, a
round of COMPLETE and COMPLETED messages is sent in the background to
complete the transaction and free all resources.
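
Putting those numbers together, here is a rough sketch of the
per-update signal flow with two replicas. It sums only the sizes stated
above; the COMMIT/COMMITTED, TCKEYCONF and COMPLETE/COMPLETED signals
are small fixed-size messages whose exact sizes are not given here, so
they are left out of the sum:

#include <cstdio>

int main()
{
    // Set to true when the primary replica shares a node with the DBTC,
    // in which case the first LQHKEYREQ never touches the network.
    const bool primary_colocated_with_tc = false;

    int bytes = 0;
    bytes += 100;   // TCKEYREQ: mysqld -> DBTC (the 100-byte example above)
    if (!primary_colocated_with_tc)
        bytes += 110;  // LQHKEYREQ: DBTC -> primary DBLQH
    bytes += 110;      // LQHKEYREQ: primary -> backup (always on the wire)
    bytes += 32;       // LQHKEYCONF: last replica -> DBTC
    std::printf("known signal bytes per update: %d"
                " (+ commit/complete rounds)\n", bytes);
}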

If a larger update program is sent, it is segmented into a set of
messages, each carrying 100 bytes of data and 20 bytes of overhead.
The same goes for the primary key.
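
For example (a small sketch of that segmentation rule;
segmented_wire_bytes is just an illustrative name):

#include <cstddef>
#include <cstdio>

std::size_t segmented_wire_bytes(std::size_t payload_bytes)
{
    const std::size_t data_per_seg = 100;  // data bytes per segment
    const std::size_t overhead     = 20;   // overhead bytes per segment
    std::size_t segments = (payload_bytes + data_per_seg - 1) / data_per_seg;
    return payload_bytes + segments * overhead;
}

int main()
{
    // A 450-byte update program travels as 5 segments:
    // 450 + 5 * 20 = 550 bytes on the wire.
    std::printf("%zu\n", segmented_wire_bytes(450));
}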

Reads are mostly quicker, since they can cut out the COMMIT and
COMPLETE rounds, unless of course the read locks need to be held. The
read program contains only AttributeHeaders for the attributes to
read. In addition, a TRANSID_AI signal carrying the data is sent from
the node containing the data to the mysqld. This message contains the
AttributeHeaders plus the data read, one such message per record read.
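
In sketch form, assuming only the numbers above (transid_ai_bytes is a
made-up name, and any fixed signal fields beyond the 20-byte message
overhead are ignored):

#include <cstddef>
#include <cstdio>
#include <vector>

std::size_t transid_ai_bytes(const std::vector<std::size_t>& read_attr_bytes)
{
    std::size_t total = 20;                  // per-message overhead
    for (std::size_t attr : read_attr_bytes)
        total += 4 + attr;                   // AttributeHeader + data read
    return total;                            // one such message per record
}

int main()
{
    // Reading two attributes of 8 and 32 bytes:
    // 20 + (4 + 8) + (4 + 32) = 68 bytes per record.
    std::printf("%zu\n", transid_ai_bytes({8, 32}));
}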

An additional note: the MySQL server always reads the records to be
updated before the update is performed.

For detailed info see, among others, the following message classes in
MYSQL_TOP/ndb/include/kernel/signaldata:
TcKeyConf.hpp
LqhKey.hpp
AttrInfo.hpp

> Queries are tougher. If you have 1 query on a table of 1000 100-byte
> rows, each node could return between (100000 / db node count) bytes +
> overhead and 0 bytes. The per-node thing hits a bit here, as each
> query does get sent to each db node group (multicast, anyone?).
>

Queries use either the PK protocol mentioned above or a similar
protocol for unique keys, which adds an extra table lookup for every
access.

Otherwise they use the scan protocol. This sends a SCAN_TABREQ message
to the DBTC in charge of the scan, which then sends one SCAN_FRAGREQ
per partition (= fragment) in parallel to all nodes. Each node then
sends a batch of records to the mysqld, waits for the mysqld to
process those, and is then told to fetch the next batch, and so forth;
finally, at the end, the scan process is closed.
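
To get a feel for the round trips involved, a small sketch (the batch
size here is an invented parameter, not a number from this mail):

#include <cstddef>
#include <cstdio>

int main()
{
    const std::size_t fragments     = 4;    // partitions scanned in parallel
    const std::size_t records_total = 1000; // the 1000-row example above
    const std::size_t batch_size    = 64;   // assumption, for illustration

    // One SCAN_TABREQ to the DBTC, one SCAN_FRAGREQ per fragment, then
    // each node repeatedly delivers a batch and waits to be told to
    // fetch the next one.
    std::size_t per_frag    = records_total / fragments;
    std::size_t round_trips = (per_frag + batch_size - 1) / batch_size;
    std::printf("1 SCAN_TABREQ, %zu SCAN_FRAGREQs, ~%zu batch round trips"
                " per fragment\n", fragments, round_trips);
}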

The number of records returned to the mysqld for a full table scan
without indexes is equal to the number of records in the table. For
range scans using an ordered index, only records in the range are
returned.

Only the attributes needed by the query are sent back to the mysqld.

> So anyway, has anyone come up with some good formulas for figuring this
> out? The toughest thing must be recovery time. At that time all nodes
> could be sending all data to one node. That sounds like it could really
> drag the whole cluster down.
>

At node recovery the surviving node helps the starting node
resynchronize the data. This is implemented in such a manner that node
recovery is fast in an unloaded system; when load increases, node
recovery takes longer. No particular programming is necessary for
this, since it happens automatically with the message execution
handling: messages are executed in FIFO order, so when the load
increases each message has to wait longer, and there is a carefully
designed parallelism in node recovery to protect the cluster from
degraded service during node recovery.

> ALSO:
> SCI is something not a lot of us in the low-end space are familiar
> with yet. I see where cards run about $1000 each, and if I understand
> correctly, each card has two connectors, and things are set up so
> that each node connects to two others. Is there a definitive site you
> guys can recommend for an SCI newbie to look at?
>
> Still, $1000 extra for each server is a lot if you're going the "lots
> and lots of servers" route. What about bonded gigabit ethernet?
>

The major reason for using SCI is that it decreases the overhead per
message (CPU overhead) by a factor of 10, and the overhead per byte by
a factor of 2-3, compared to using TCP/IP over Ethernet. So obviously,
if your server, equipped with say 8 GBytes of memory and dual CPUs,
costs around $4000, and you add $800-1500 of extra hardware to that
which increases your performance by a factor of 2, then SCI is good
economics. So the choice of using SCI or not is really a business
decision, with different answers based on your server costs and also
on your usage scenario. When the cost of communication dominates, SCI
is usually worthwhile; when most of the processing is not
communication, it is mostly a bad idea. This depends a lot on how the
application behaves.
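
As a rough worked example of that trade-off, using only the numbers
above (server cost $4000, SCI at $800-1500 per server, 2x speedup when
communication dominates):

#include <cstdio>

int main()
{
    const double server  = 4000.0;  // quoted server cost
    const double speedup = 2.0;     // quoted SCI performance factor

    // Compare dollars per unit of throughput with and without SCI,
    // at both ends of the quoted SCI price range.
    for (double sci : {800.0, 1500.0}) {
        double plain    = server / 1.0;
        double with_sci = (server + sci) / speedup;
        std::printf("SCI at $%.0f: $%.0f vs $%.0f per throughput unit\n",
                    sci, with_sci, plain);
    }
    // Prints 2400 and 2750 vs 4000: if the 2x holds for your workload,
    // SCI roughly halves the cost per unit of throughput.
}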

Rgrds Mikael

Mikael Ronström, Senior Software Architect
MySQL AB, www.mysql.com

Clustering:
http://www.infoworld.com/article/04/04/14/HNmysqlcluster_1.html

http://www.eweek.com/article2/0,1759,1567546,00.asp

