> On Mo, 1999-08-02 01:38:38 -0500, Jay Jarvinen <3pound@stripped> wrote:
> > In effect, I want to store a huge array of binary-coded integers in a
BLOB
> > type field. So I could store 0 .. 4294967295 in 4 bytes (plus mysql L+4
> > bytes for the BLOB column), and avoid storing useless commas.
>
> Storing these integers into a BLOB field to me only would makes sense,
> if you want to concatenate them all into one huge binary string which
> you then store a a single column.
Yes. A massive string of fixed length (4 bytes) concatenated numbers - which
I'd then extract at runtime with something like:
my @nums = unpack("A4"x( length($bigstring)/4 ), $bigstring); # or
substring or s/// or ...
..... <insert more testing> .....
Silly me. I'm not lost. Not completely.
Same script, adding <perl>$bigint = $bigint x 100</perl> just before the
insert:
With 100 (diff of 9)
-rw-rw---- 1 mysql mysql 409 Aug 1 15:11
/home/mysql/data/test/foo.ISD
-rw-rw-r-- 1 root root 400 Aug 1 15:11 mybinnum.txt
My original tests worked with an array, I then simplified to a single number
for testing purposes - that was the gotcha.
With 3 (diff of 9):
-rw-rw---- 1 mysql mysql 21 Aug 1 15:18
/home/mysql/data/test/foo.ISD
-rw-rw-r-- 1 root root 12 Aug 1 15:18 mybinnum.txt
With 2 (diff of 12):
-rw-rw---- 1 mysql mysql 20 Aug 1 15:18
/home/mysql/data/test/foo.ISD
-rw-rw-r-- 1 root root 8 Aug 1 15:18 mybinnum.txt
Extra overhead at <= 2?
In any case, you can see once I insert more than 2 values I get the desired
effect.
<superfluous explanation>
Perhaps I wasn't clear on why I wanted something like this.
1) I'm considering using MySQL to store an inverted index for free-text
searching throughout the database. Where I've indexed the word column and
have corresponding 'binary-int-array' blob that holds the document ID's.
2) Search 'firewalling', storing given words/rules for users which are then
applied to each search. Basically the same storage concept, except it's a
user's ID and an array of 'banword' ID's which are indexed.
I can live with 4,294,967,295 being the maximum row-ID I can store.
"Modern Information Retrieval" gives some guidelines where a "medium
collection (200mb)" could be represented by an index 36% that size (72MB).
On my text data, just the first 1000 records has ~10,000 unique words. So
using a seperate row with INT columns would be ~93,000 rows, rather than
10,000.
Per process overhead is much larger to slurp in a giant array instead of
iterating over the rows, but with cheap memory and CPU power - it seemed a
win over using many many selects. Opinions, ideas?
Is saving 600 bytes per 100 numbers worth the packing/unpacking trouble?
Maybe not, but storing big integers seperated by commas just didn't sit
right with me, and if a given word matched 100,000 doc-ID's .. that's around
600Kb of savings. If 100 words match 100,000 documents ... etc..
</superfluous explanation>
Thanks for the nudge,
-Jay J