I've read a few technical papers on how large companies (e.g. search engines, phone
companies) handle massive amounts of data. Really it comes down to this: no
general-purpose database is strong enough (including Oracle) to handle the amount
of data involved, which is why phone companies, Google, etc. build their own
"databases". I think the ex/new AT&T calls their "database" Hawkeye, which holds
over 300TB of data. Google refers to theirs as the Google File System (not even
using the term database).
Google released a paper a few years ago on how their system worked. Basically, the
network was both the database and the RAID setup. Web pages were split into
snippets that matched the network packet size (about 1.4K, unless jumbo packets are
used). Each snippet of data was stored on 3 machines (think of the redundant
triples in Arthur C. Clarke's Rama series). The index of the data was also split
and stored across different machines. If a box failed, the data that had been on
that machine would be pulled from the other 2 copies and re-replicated, so there
were always at least 3 copies.
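The store-3-copies-and-re-replicate-on-failure idea described above can be sketched
roughly like this (a toy model with invented names and sizes; the real system's
chunk placement policy is certainly more sophisticated):

```python
import random

REPLICAS = 3  # the scheme above keeps three copies of every snippet

class Cluster:
    """Toy model: tracks which machines hold a copy of each chunk."""
    def __init__(self, machines):
        self.machines = set(machines)
        self.placement = {}  # chunk_id -> set of machine names

    def store(self, chunk_id):
        # Place the chunk on REPLICAS distinct machines.
        self.placement[chunk_id] = set(
            random.sample(sorted(self.machines), REPLICAS))

    def fail(self, machine):
        # When a box dies, pull each chunk it held from the surviving
        # copies and re-replicate until the count is back to REPLICAS.
        self.machines.discard(machine)
        for holders in self.placement.values():
            holders.discard(machine)
            while len(holders) < REPLICAS:
                candidates = sorted(self.machines - holders)
                holders.add(random.choice(candidates))

cluster = Cluster([f"box{i}" for i in range(10)])
cluster.store("chunk-A")
failed = next(iter(cluster.placement["chunk-A"]))
cluster.fail(failed)
```

After the failure, the chunk is back at full replication on surviving machines.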
I may not be remembering it completely accurately. I think this is the white paper:
MySQL can scale pretty well if you carefully match its different table types and
features to the different traffic patterns. I've used InnoDB, MyISAM, MERGE tables,
and even BLACKHOLE with replication to handle different needs. You don't need this
attention to detail with Oracle, but then Oracle costs a little more.
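As a toy illustration of matching engines to traffic patterns (the mapping, table
name, and DDL below are my own invented examples, not a recommendation):

```python
# Hypothetical mapping of workload patterns to MySQL storage engines,
# in the spirit of the approach described above.
ENGINE_FOR_PATTERN = {
    "transactional":    "InnoDB",     # row-level locks, crash recovery
    "read_mostly":      "MyISAM",     # fast reads, table-level locks
    "partitioned_logs": "MERGE",      # a union view over identical MyISAM tables
    "relay_only":       "BLACKHOLE",  # discards writes locally but still
                                      # records them in the binlog for replicas
}

def create_table_sql(name, pattern):
    """Build a minimal CREATE TABLE statement for the chosen pattern."""
    engine = ENGINE_FOR_PATTERN[pattern]
    return f"CREATE TABLE {name} (id INT PRIMARY KEY) ENGINE={engine}"

print(create_table_sql("orders", "transactional"))
```

The point is simply that the engine is a per-table decision, so one schema can mix
them to suit each table's traffic.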
----- Original Message -----
From: "mos" <mos99@stripped>
Sent: Thursday, April 26, 2007 11:53 PM
Subject: Re: FW: MySQL patches from Google
> This sounds a lot like what I'm attempting. I tried a proprietary database and got
> around 30k queries/second, compared to MySQL's 1-1.5k queries/second. I'm torn
> between using the Windows proprietary database (still has some minor buggy parts)
> on each webserver or going with MySQL 5.x and perhaps using clusters to improve
> scalability. I'll need to do more testing before deciding.
> The thing that has me leaning towards MySQL is that if I stick with Windows, I'll
> have to use Vista eventually because XP will no longer be sold. And that's not
> something I'd wish on anyone. :)
> At 08:20 PM 4/25/2007, David T. Ashley wrote:
>>On 4/25/07, mos <mos99@stripped> wrote:
>>>At 02:36 PM 4/25/2007, you wrote:
>>> >On 4/25/07, Daevid Vincent <daevid@stripped> wrote:
>>> >>A co-worker sent this to me, thought I'd pass it along here. We do
>>> >>failover/replication and would be eager to see MySQL implement the
>>> >>patches in the stock distribution. If anyone needs mission critical,
>>> >>scalable, and failover clusters, it's Google -- so I have every
>>> >>confidence their patches are solid and worthy of inclusion...
>>> >This isn't surprising for Google. They've done the same thing to Linux.
>>> >I don't know much about Google's infrastructure these days, but several
>>> >years ago they had a server farm of about 2,000 identical x86 Linux
>>> >boxes serving out search requests. Each machine had a local hard disk
>>> >holding the most recent copy of the search database.
>>>So you're saying they had a MySQL database on the same machine as the
>>>webserver? Or maybe 1 webserver machine and one MySQL machine?
>>>I would have thought a single MySQL database could easily handle the
>>>requests from 25-50 webservers. Trying to maintain 2000 copies of the same
>>>database requires a lot of disk writes. I know Google today is rumored to
>>>have over 100,000 web servers and it would be impossible to have that many
>>>databases in sync at all times.
>>When I read the article some years ago, I got the impression that it was a
>>custom database solution (i.e. nothing to do with MySQL).
>>If you think about it, for a read-only database where the design was known
>>in advance, nearly anybody on this list could write a database solution in
>>'C' that would outperform MySQL (generality always has a cost).
>>Additionally, if you think about it, if you have some time to crunch on the
>>data and the data set doesn't change until the next data set is released,
>>you can probably optimize it in ways that are unavailable to MySQL because
>>of the high INSERT cost. There might even be enough time to tune a hash
>>function that won't collide much on the data set involved so that the query
>>cost becomes O(1) rather than O(log N). You can't do that in real time on
>>an INSERT. It may take days to crunch data in that way.
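The offline hash-tuning idea above can be sketched like this (a purely illustrative
toy: the hash function, salt search, and key set are all made up — the point is only
that a static data set lets you spend crunch time finding a collision-free layout):

```python
def tuned_hash(key, salt, size):
    """A simple salted string hash; the salt is the tunable knob."""
    h = salt
    for ch in key:
        h = (h * 31 + ord(ch)) % size
    return h

def build_table(keys, size):
    """Spend offline crunch time finding a salt with zero collisions
    on exactly this key set, then build a direct-addressed table."""
    for salt in range(10_000):
        slots = {tuned_hash(k, salt, size) for k in keys}
        if len(slots) == len(keys):       # no collisions on this data
            table = [None] * size
            for k in keys:
                table[tuned_hash(k, salt, size)] = k
            return salt, table
    raise ValueError("no collision-free salt found; try a bigger table")

keys = ["mysql", "oracle", "google", "gfs"]
salt, table = build_table(keys, size=16)
# Every lookup is now a single hash plus one array index: O(1).
```

A live database can't do this because each INSERT could introduce a collision; a
frozen data set can.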
>>My understanding was that Google's search servers had custom software
>>operating on a custom database format. My understanding was also that
>>each search server had a full copy of the database (i.e. no additional
>>network traffic involved in providing search results).
>>As far as keeping 100,000 servers in sync, my guess would be that most of
>>the data is distilled for search by other machines and then it is rolled out
>>automatically in a way to keep just a small fraction of the search servers
>>offline at any one time.
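That guessed-at rollout strategy might look something like this in miniature (the
5% offline fraction and server names are my own assumptions for illustration):

```python
import math

def rollout_waves(servers, max_offline_fraction=0.05):
    """Split the fleet into waves so that pushing the new data set to one
    wave at a time takes at most max_offline_fraction of servers offline."""
    wave_size = max(1, math.floor(len(servers) * max_offline_fraction))
    return [servers[i:i + wave_size]
            for i in range(0, len(servers), wave_size)]

fleet = [f"search{i:03d}" for i in range(100)]
waves = rollout_waves(fleet)
# With 100 servers and a 5% cap, the rollout proceeds in 20 waves of 5.
```

Each wave is drained, updated, and returned to service before the next begins, so
the bulk of the fleet keeps serving the old data set throughout.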
> MySQL General Mailing List
> For list archives: http://lists.mysql.com/mysql
> To unsubscribe: http://lists.mysql.com/mysql?unsub=1