List:General Discussion« Previous MessageNext Message »
From:Johan De Meersman Date:July 26 2013 10:19pm
Subject:RE: hypothetical question about data storage
View as plain text  
Your argument against FS assumes that you don't know the exact filename (directory
traversals), but your argument for InnoDB assumes that you do (index lookup). Apples and

Besides, the venerable ext2 handled up to a couple of tens of thousands of files per
directory smoothly when listing; things have only improved since then. "Small amounts" is
a very relative concept.

Rick James <rjames@stripped> wrote:
>"Count the disk hits"
>If you have a filesystem directory, consider that it is designed to
>handle small numbers of files per directory.  Consider that there is a
>limited cache for directories, etc.  Plus there is the inode (vnode,
>whatever) storage for each file.  I don't know the details (and it
>varies wildly with "filesystem" (ext, xfs, zfs, etc)).
>Looking at InnoDB...
>Let's say you have a billion rows in a single table, and you need to
>fetch one row by the PRIMARY KEY, and it is a MD5 (sha-1, UUID, etc). 
>Such a key is _very_ random.
>A billion rows would need about 5 levels of BTree.  The top levels
>would quickly all be cached.  (100M blocks * 16KB = 1.6GB.)  If the
>leaf nodes add up to 200GB, that is probably bigger than you
>innodb_buffer_pool_size.  In that case, a _random_ fetch is likely to
>be a cache miss.
>A cache miss is about 100ms on normal rotating-media; perhaps 10ms on
>SSDs.  This limits your reads to 10 (or 100) per second.
>If you have big BLOBs in the table, then it gets messier.  InnoDB does
>not put more than 8K of a row in the actual 16KB block.  The rest is
>stored in another block(s).  So, it is likely to take an extra disk hit
>If your data size is 100 times as big as your buffer pool, then it
>becomes likely that the next level of the BTree won't be fully
>cacheable.  Now 300ms/30ms.
>I think it is likely that the small number of disk hits for InnoDB is
>better than the many disk hits for traversing a directory tree (with
>large directories) in the filesystem.  I vote for InnoDB over the
>directory tree.
>Yes, you will have seeks.
>No, adding more RAM won't help much.  Here's an argument:
>Suppose your data is 20 times as big as the buffer pool and you are
>doing random fetches (MD5, etc).  Then 1/20 of fetches are cached; 95%
>cache miss.  Estimated time: 0.95 * 100ms = 95ms.
>Now you double your RAM.  1/10 cached -> 90% cache miss -> 90ms average
>-> Not much improvement over 95.
>> -----Original Message-----
>> From: cknipe@stripped [mailto:cknipe@stripped] On Behalf Of
>> Chris Knipe
>> Sent: Friday, July 26, 2013 12:30 AM
>> To: Johan De Meersman
>> Cc: mysql
>> Subject: Re: hypothetical question about data storage
>> Hi All,
>> Thanks for the responces, and I do concur.  I was taking a stab in
>> dark so to speak.
>> We are working with our hosting providers currently and will be
>> introducing a multitude of small iSCSI SANs to split the storage
>> structure over a multitude of disks...   This is something that needs
>> to be addressed from a systems perspective rather than an
>> one.
>> SSD (or Fusion and the like) are unfortunately still way to expensive
>> the capacity that we require (good couple of TBs) - so mechanical
>disks it
>> would need to be.  However, with the use of SANs as we hope, we
>should be
>> able to go up from 4 to over 64 spindles whilst still being able to
>> the storage and have redundancy.
>> Many thanks for the inputs and feedbacks...
>> --
>> C
>> On Fri, Jul 26, 2013 at 9:23 AM, Johan De Meersman
>> wrote:
>> > Hey Chris,
>> >
>> > I'm afraid that this is not what databases are for, and the first
>> you'll likely run into is amount of concurrent connections.
>> >
>> > This is typically something you should really tackle from a systems
>> perspective. Seek times are dramatically improved on SSD or similar
>> storage - think FusionIO cards, but there's also a couple of vendors
>> (Violin comes to mind) who provide full-blown SSD SANs.
>> >
>> > If you prefer staying with spinning disks, you could still improve
>> seeks by focusing on the inner cylinders and potentially by using
>> sector formatting. Again, there's SANs that do this for you.
>> >
>> > Another minor trick is to turn off access timestamp updates when
>> mount the filesystem (noatime).
>> >
>> > Also benchmark different filesystems, there's major differences
>> them. I've heard XFS being recommended, but I've never needed to
>> for seek times myself. We're using IBM's commercial GPFS here, which
>> good with enormous amounts of huge files (media farm here), not sure
>> it'd fare with smaller files.
>> >
>> > Hope that helps,
>> > Johan
>> >
>> > ----- Original Message -----
>> >> From: "Chris Knipe" <savage@stripped>
>> >> To: mysql@stripped
>> >> Sent: Thursday, 25 July, 2013 11:53:53 PM
>> >> Subject: hypothetical question about data storage
>> >>
>> >> Hi all,
>> >>
>> >> We run an VERY io intensive file application service.  Currently,
>> >> problem is that our disk spindles are being completely killed due
>> >> insufficient SEEK time on the hard drives (NOT physical read/write
>> >> speeds).
>> >>
>> >> We have an directory structure where the files are stored based on
>> >> the MD5 checksum of the file name, i.e.
>> >> /0/00/000/000044533779fce5cf3497f87de1d060
>> >> The majority of these files, are between 256K and 800K with the
>> >> exception (say less than 15%) being more than 1M but no more than
>> >> in size.  The content of the files are pure text (MIME Encoded).
>> >>
>> >> We believe that storing these files into an InnoDB table, may
>> >> actually give us better performance:
>> >> - There is one large file that is being read/written, instead of
>> >> BILLIONS of small files
>> >> - We can split the structure so that each directory (4096 in
>> >> sit's on their own database
>> >> - We can move the databases as load increases, which means that we
>> >> can potentially run 2 physical database servers, each with 2048
>> >> databases
>> >> each)
>> >> - It's easy to move / migrate the data due to mysql and
>replication -
>> >> same can be said for redundancy of the data
>> >>
>> >> We are more than likely looking at BLOB columns of course, and we
>> >> need to read/write from the DB in excess of 100mbit/s
>> >>
>> >> Would the experts consider something like this as being feasible? 
>> >> it worth it to go down this avenue, or are we just going to run
>> >> different problems?  If we are facing different problems, what can
>> >> possibly expect to go wrong here?
>> >>
>> >> Many thanks, and I look forward to any input.
>> >>
>> >
>> > --
>> > Unhappiness is discouraged and will be corrected with kitten
>> --
>> Regards,
>> Chris Knipe
>> --
>> MySQL General Mailing List
>> For list archives:
>> To unsubscribe:

Sent from Kaiten Mail. Please excuse my brevity.
hypothetical question about data storageChris Knipe25 Jul
  • Re: hypothetical question about data storageVahric Muhtaryan25 Jul
  • Re: hypothetical question about data storageJohan De Meersman26 Jul
    • Re: hypothetical question about data storageChris Knipe26 Jul
      • RE: hypothetical question about data storageRick James26 Jul
        • RE: hypothetical question about data storageJohan De Meersman26 Jul
          • Re: hypothetical question about data storageChris Knipe26 Jul
            • Re: hypothetical question about data storagehsv27 Jul
            • Re: hypothetical question about data storagewilliam drescher27 Jul
              • RE: hypothetical question about data storageRick James29 Jul
                • RE: hypothetical question about data storageJohan De Meersman29 Jul
                  • RE: hypothetical question about data storageRick James29 Jul
                    • Re: hypothetical question about data storageCarsten Pedersen30 Jul
                    • Re: hypothetical question about data storageManuel Arostegui30 Jul
                    • RE: hypothetical question about data storageJohan De Meersman30 Jul