I know this is a long email, but if you could find the time to reply,
it would be greatly appreciated.
I am in the design phase of a new project and am trying to determine
how to use data sharding and replication together to offer
high-availability, performance, and easy scalability for the site.
It's a social network with many of the typical social network site
features (profiles, photos, comments, videos, etc). We are using the
LAMP environment. I've looked into MySQL Cluster and after
researching it, decided not to use it. Instead, I've chosen to use
data sharding and replication.
We anticipate the database to grow very quickly over the next 12
months. We expect the database to reach around 800GB by the end of
the 12 months. As the database size grows the number of transactions
per second will also grow to the point that a single master MySQL
server will probably not be able to keep up. So we are looking at
ways to distribute the load of the database requests. We anticipate
the ratio of READs to WRITEs to be 6 to 1 over the course of the 12
month period. That is 6 selects to every insert/update.
- The MySQL machines will all have atleast 8GB RAM, 4-8 cores of CPU,
and plenty of 10-15K RPM diskspace, 1Gbit Ethernet connections.
- The data will be mostly of types int, tinyint, timestamp, enum, and
varchar (nothing more than varchar(500)).
- Photos, videos, or any other files will not be stored in the
database. Only the meta data for them (i.e. where they are).
- We're either going to go with MySQL 5.0 or MySQL 5.1, not sure yet.
We want to do the following:
A number of web servers (running php/apache) sitting behind some squid
Behind the web servers would be the database layer, which would consist of:
- shards sitting on a number of MySQL servers.
- for HA the MySQL servers would be set up in a Master-Master
(active/passive) configuration as is described in High Performance
MySQL 2nd edition on page 365. This would allow for online
maintenance and other goodies (as described in the book).
- a MySQL server that would act as a directory to locate
information in the shards (i.e user1's data sits in shard 1, thread3's
messages all sit on shard 4, etc)
- we would also use memcached in front of these MySQL servers to
help with the expected loads.
- This would allow us to scale quickly and easier (especially the WRITEs).
My concerns with this setup is:
1. It would seem that without creating another level of MySQL servers
that would act as directories, the 1 MySQL server that would act as a
directory in the current setup could become overwhelmed, because of
the millions of users, comments, photos, etc that need to be looked up
for each and every request? Is there a solution to this problem?
2. How many millions of rows can be inserted into a table before I
should start worrying about performance and capacity?
3. Can you comment on the proposed setup? Do you see any flaws,
possible problems, or inconsistencies? Any ideas would be
Thanks for any advice and insights!