2004-08-26 kl. 19.57 skrev Clint Byrum:
>> From what I'm seeing.. the number of nodes isn't a great way to
> how much network bandwidth you'll need. It seems more important to
> determine it based on the level of reads/writes. Since queries send a
> small message to each node, and then could potentially return HUGE
> results, its also important to think of the types of queries you'll
> I'm not up on the protocols used, but just by the theory involved, it
> would seem that you can predict fairly accurately how much bandwidth
> will be required per-insert/update. 1 insert on a 100byte row, will hit
> two servers (hash based..) with 100bytes+overhead. 1 update on an 8byte
> column will hit two servers with 8bytes+overhead.
The request as such to update a record is about 100 bytes in size. Each
has 20 bytes overhead and each Primary key access message (independent
whether read, update, insert, delete) has around of 60 bytes see
MYSQL_TOP/include/kernel/signaldata/TcKeyReq.hpp contains a class
of the message (there are some optional fields in this as well).
In addition the Primary key needs to be sent which is always a multiple
of 4 bytes.
Tables without Primary Key definition will have an 8 byte generated PK.
In addition the update program needs to be sent. The update program is
simple. It contains one entry for each attribute updated. This entry
AttributeHeader of 4 bytes followed by the data.
Thus an update of one 8 bytes attribute with an 8 byte PK requires
20 + 60 + 8 + 4 + 8 = 100 bytes.
Setting a NULL value only requires the AttributeHeader.
This message goes to the DBTC module which is the coordinator of the
Next the message goes to the nodes containing the data. There is a
LQHKEYREQ sent to the module DBLQH which has some more data since the
to update is sent in the message.
Thus 20 + 70 + 8 + 4 + 8 is approximately sent in a small LQHKEYREQ.
Important here is that the first LQHKEYREQ to the primary replica is
sometimes not over
the network since the primary replica could be on the same node as
DBTC. The message
from the primary replica to the backup replica is however always
The response to the DBTC is sent from the last replica by sending
LQHKEYCONF a 32 byte
message (if I remember correctly) and then if there is a COMMIT phase
there is a number of
COMMIT and COMMITTED messages going between the TC and the replicas.
Then finally there is a TCKEYCONF sent back to the mysqld. After that
in the background
a round of COMPLETE and COMPLETED messages are sent to complete the
and free all resources.
If larger update programs are sent then it is segmented into a set of
messages each carrying
100 bytes of data and 20 bytes of overhead.
The same goes for the primary key.
Reads are mostly quicker since they can cut out the COMMIT and COMPLETE
of course the read locks needs to be hold. Read program only contains
Attribute Headers for the
attributes to read. In addition a signal TRANSID_AI is sent from the
node containing the data to
the mysqld with the data. This message contains AttributeHeaders plus
the data read. One
such message per record read.
An additional note is that the MySQL Server always performs reads of
the records to update before
the update is performed.
For detailed info see among others the following message classes
> Queries are tougher. If you have 1 query on a table of 1000 100byte
> rows, each node could return between (100000/db node
> count)bytes+overhead and 0 bytes. The per-node thing hits a bit here,
> each query does get sent to each db node group (multicast, anyone?).
Queries use either the PK protocol mentioned or a similar protocol for
which is similar but uses an extra table lookup for every access.
Otherwise they use the scan protocol. This sends a SCAN_TABREQ message
DBTC in charge of the scan. This then sends one SCAN_FRAGREQ per
(=fragment) in parallel to all nodes. Then each node sends a batch of
to the mysqld, then waits for mysqld to process those and then is told
the next and so forth and finally at the end there is a close of the
The number of records returned to mysqld for a full table scan without
indexes is equal
to the number of records in the table. For range scans using an ordered
index only records
in the range is returned.
Only attributes valid for the query is sent back to mysqld.
> So anyway, has anyone come up with some good formulas for figuring this
> out? The toughest thing must be recovery time. At that time all nodes
> could be sending all data to one node. That sounds like it could really
> drag the whole cluster down.
At node recovery the surviving node helps the starting node resynch the
This is implemented in such a manner as to perform the node recovery
a non-loaded systems. When load increases the node recovery takes
No particular programming for this is necessary since it happens
with the message execution handling. The messages are executed in FIFO
and when the load increases each message has to wait longer and there is
a carefully designed parallelism in the node recovery to protect the
downgraded service at node recovery.
> SCI is something not a lot of us in the low-end space are familiar with
> yet. I see where cards run about $1000 each, and if I understand
> correctly, each card has two connectors, and things are setup where
> node connects to two others. Is there a definitive site you guys can
> recommend for an SCI newbie to look at?
> Still, $1000 extra for each server is a lot if you're going the "lots
> and lots of servers" route. What about bonded gigabit ethernet?
The major reason for using SCI is that decreases the overhead per
message (CPU overhead)
by a factor of 10 and the overhead per byte with a factor of 2-3
compared to using TCP/IP
over Ethernet. So obviously if your server equipped with say 8 GByte of
memory and dual
CPU costs around 4000$ and you add between 800-1500$ of extra HW to
that which increases
your performance by a factor of 2 then SCI is good econometrics. So the
choice of using SCI
or not, is really a business decision with different answers based on
your server costs and also
on your usage scenario. When communication cost is high it is usually
high, when most processing
is not communication then it is mostly a bad idea and this depends a
lot on how the application
> MySQL Cluster Mailing List
> For list archives: http://lists.mysql.com/cluster
> To unsubscribe:
Mikael Ronström, Senior Software Architect
MySQL AB, www.mysql.com