List:General Discussion« Previous MessageNext Message »
From:Steve Edberg Date:June 22 2007 7:24am
Subject:Re: Blob data
View as plain text  
At 12:11 PM +0530 6/22/07, Ratheesh K J wrote:
>Hello All,
>I want a clarification. Whe run a forum wherein people send messages 
>with/without attachments. Attachments may contain images, documents 
>etc.. We are actually storing the attachment in a blob column. 
>Sometimes the attachments are big. And today the table size has 
>grown to 40 GB. This has created a headache for any maintanance 
>task, backup, restoration. etc.
>I want to know whether this is the right approach. Or should we 
>actually store the attachments in directories and just stiore the 
>attachment path in the database.
>Kindly suggest the best approach so that I can reduce the database size.
>Thanks in advance

Yes, storing files - especially non-textual files - in the file 
system instead of the database is generally considered the best 

At one point I had created a document management system that stored 
everything in the database as you are doing; my rationale was that it 
allowed me to manage permissions using the existing database 
permissions, and to back up the whole database using mysqldump, vs 
mysqldump + doing a tar of the files. However, I abandoned this 
approach for the following reasons:

(1) Storing non-plaintext items (eg; pictures) in the database makes 
it bigger and slower without added value - you can't (at least not 
yet, or in the foreseeable future) do a meaningful search on a blob.

(2) It becomes more difficult to split storage out onto multiple 
filesystems; eg, leaving the database files in /var/database, putting 
the documents themselves into /home/docmanager, etc.

(3) It makes queries on the commandline unwieldy; if you have a blob 
field, doing a select * to check a record's contents can dump a lot 
of garbage on the screen.

(4) It can make doing incremental backups more difficult; if the 
documents themselves are relatively static, but the document metadata 
stored in the database is very dynamic, it becomes simple to do a 
compact daily database dump + a weekly document directory backup (for 
example) if the files are not in the database.

What I do is create a unique SHA1 hash when a file is uploaded (eg; 
sha1(rand()). The original filename and the 40-character hash are 
stored in the database, and the document is stored in the filesystem 
using the hash as the filename. I can optionally compress and encrypt 
the document as well, storing the encryption key in the database. 
This gives (for me) adequate document security. An additional 
advantage is that you can take advantage of the filesystem tree if 
you have a large number of documents. For example, if a document hash 
is 'f0118e9bd2c4fb29c64ee03abce698b8', you can store the file in the 
directory tree f0/11/8e/f0118e9bd2c4fb29c64ee03abce698b8 (extending 
to as many levels as you feel necessary). By keeping the number of 
files per directory fairly small, file retrieval becomes relatively 
fast. As the hashes approximate a random distribution, you should 
always have a close-to-balanced tree.

Lastly, I store a hash of the document itself in the database as 
well. This allows me to detect if duplicate files are uploaded, and 
to determine if a previously-uploaded file has been corrupted in some 


+--------------- my people are the people of the dessert, ---------------+
| Steve Edberg                       |
| UC Davis Genome Center                            sbedberg@stripped |
| Bioinformatics programming/database/sysadmin             (530)754-9127 |
+---------------- said t e lawrence, picking up his fork ----------------+
Blob dataRatheesh K J22 Jun
  • Re: Blob dataSteve Edberg22 Jun
    • Re: Blob dataPaul McCullagh26 Jun
  • Re: Blob dataWarren Young26 Jun
    • Re: Blob dataKevin Waterson26 Jun
      • Re: Blob dataWarren Young27 Jun
Re: Blob dataAlex Arul Lurthu25 Jun
Re: Blob dataPaul McCullagh27 Jun