End of story:
What follows is the ugliest hack I've ever taken credit for, but it
does the simple job I need it to do.
If any Perl guru out there can improve on two things, I'll revisit it,
but I have other things to move on to.
1. Can we get the Date separated as its own field? It's critical for
sorting.
2. Subject and sender as their own fields would also be nice.
Now, the ugliness begins (remember, be gentle, folks; I'm not a Perl
programmer):
<begin perl script>
#!/usr/bin/perl -w
# This was sent to me by Thimble Smith (tim@stripped) in response to a
# request I'd sent to the MySQL users group for help finding a decent way
# of parsing a mailbox file into a MySQL-friendly import format.
# Many thanks to him, if I pull this off.
use strict;
my @emails;
while (<>) {
    if (/^From / .. /^$/) {
        # inside the header
        # don't want the line terminator
        chomp;
        # ignore the blank line
        next unless length;
        /^From (.*)/ and do {
            # the From stuff is saved in the $1 variable
            # start a new e-mail
            push @emails, {
                'headers' => {}, # for instant access by name
                'headers_list' => [], # in case you need the order
                'body' => '',
            };
            # reset line number count
            $. = 1;
            next;
        };
        unless (@emails) {
            warn "non 'From' header before any 'From' line\n";
            next;
        }
        my $email = $emails[-1];
        /^([\S:][^:]*):(.*)$/ and do {
            push @{$email->{'headers_list'}}, $1;
            push @{$email->{'headers'}{$1}}, $2;
            next;
        };
        /^\s/ and do {
            unless (@{$email->{'headers_list'}}) {
                warn "found continuation line before any headers\n";
                next;
            }
            # do you understand this next line? ;-P
            $email->{'headers'}{$email->{'headers_list'}[-1]}[-1] .= $_;
            next;
        };
        warn "unrecognized line: $.: $_\n";
    }
    else {
        unless (@emails) {
            warn "body line before any headers\n";
            next;
        }
        my $email = $emails[-1];
        $email->{'body'} .= $_;
    }
}
# print them all out:
for (my $i = 0; $i < @emails; ++$i) {
    my $headers = $emails[$i]{'headers'};
    my $body = $emails[$i]{'body'};
    print "Message #", $i + 1, "";
    print "\tHeaders:";
    foreach (sort keys %$headers) {
        print "$_";
        foreach (@{$headers->{$_}}) {
            $_ =~ s#\t##g;
            print "$_: ";
        }
    }
    print "\tBody:\n";
    $_ = $body;
    $_ =~ s#\n##g;
    $_ =~ s#\t##g;
    print "$_";
    print "\0";
}
exit 0;
<end perl script>
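For the two wish-list items (Date as its own sortable field, plus subject and sender), here is a rough sketch of one way it might be done. It assumes the `headers` hash built by the script above, where each header name maps to a list of values. The helper names `first_header`, `mysql_datetime` and `header_fields`, and the sample message, are made up for illustration. The date parsing is deliberately simple: it ignores the timezone offset and only handles four-digit years.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Month abbreviations as they appear in mail Date headers.
my %MONTH = (
    Jan => 1, Feb => 2, Mar => 3, Apr => 4,  May => 5,  Jun => 6,
    Jul => 7, Aug => 8, Sep => 9, Oct => 10, Nov => 11, Dec => 12,
);

# Hypothetical helper: first value of a header, matched case-insensitively,
# with tabs/newlines turned into spaces and surrounding whitespace trimmed.
sub first_header {
    my ($headers, $name) = @_;
    for my $key (sort keys %$headers) {
        next unless lc($key) eq lc($name);
        my $value = $headers->{$key}[0];
        next unless defined $value;
        $value =~ s/[\t\r\n]+/ /g;
        $value =~ s/^\s+//;
        $value =~ s/\s+$//;
        return $value;
    }
    return '';
}

# Hypothetical helper: "Tue, 14 Sep 1999 09:05:07 -0400" -> "1999-09-14 09:05:07".
# Assumption: the timezone offset is ignored, so times stay in the sender's zone.
# Returns an empty string when the date does not have the expected shape.
sub mysql_datetime {
    my ($date) = @_;
    return '' unless defined $date;
    my ($day, $mon, $year, $hour, $min, $sec) =
        $date =~ /(\d{1,2})\s+([A-Za-z]{3})\s+(\d{4})\s+(\d{1,2}):(\d{2})(?::(\d{2}))?/
        or return '';
    my $month = $MONTH{ ucfirst(lc($mon)) };
    return '' unless $month;
    return sprintf '%04d-%02d-%02d %02d:%02d:%02d',
        $year, $month, $day, $hour, $min, (defined($sec) ? $sec : 0);
}

# Hypothetical helper: (date, sender, subject) for one entry of @emails.
sub header_fields {
    my ($email) = @_;
    my $headers = $email->{'headers'};
    return (
        mysql_datetime(first_header($headers, 'Date')),
        first_header($headers, 'From'),
        first_header($headers, 'Subject'),
    );
}

# Made-up sample in the same shape the parser above produces.
my %sample = (
    'headers' => {
        'Date'    => [' Tue, 14 Sep 1999 09:05:07 -0400'],
        'From'    => [' someone@example.com'],
        'Subject' => [" parsing an mbox\tfor MySQL"],
    },
);
print join("\t", header_fields(\%sample)), "\n";
```

In the print loop, something like `print join("\t", header_fields($emails[$i])), "\t";` ahead of the headers field would emit the three values as extra tab-separated fields. The table would then need three matching columns in the same order, with the date column presumably a datetime type.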
Now, I run:
$ cat nsmail/mbox | perl suckmail.PL > tmp.txt
Then:
mysql> load data infile '/home/vanboers/tmp.txt' into table nsmail fields
terminated by '\t' lines terminated by '\0';
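The script gets away with tab and NUL as separators by deleting tabs and newlines from the data, which flattens the message bodies. A possible alternative, sketched here, is to escape those characters instead. This relies on the default `ESCAPED BY '\\'` behavior of `load data infile`, under which `\t`, `\n`, `\r`, `\0` and `\\` in the file are turned back into the real characters on import. The name `escape_field` is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: escape one field so that tabs, newlines, NULs and
# backslashes survive an import that uses the default ESCAPED BY '\\'.
sub escape_field {
    my ($text) = @_;
    $text = '' unless defined $text;
    $text =~ s/\\/\\\\/g;   # backslashes first, so later escapes are not doubled
    $text =~ s/\t/\\t/g;    # field separator
    $text =~ s/\n/\\n/g;
    $text =~ s/\r/\\r/g;
    $text =~ s/\0/\\0/g;    # line terminator used by the import above
    return $text;
}

# Made-up two-line body containing a tab.
print escape_field("line one\n\tline two\n"), "\n";
```

With this in place, the body could be printed as `print escape_field($body);` in place of the two substitutions that strip newlines and tabs.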
The table schema is:
+---------+-------------+------+-----+---------+-------+
| Field   | Type        | Null | Key | Default | Extra |
+---------+-------------+------+-----+---------+-------+
| msgno   | varchar(12) | YES  |     | NULL    |       |
| headers | longtext    | YES  |     | NULL    |       |
| body    | blob        | YES  |     | NULL    |       |
+---------+-------------+------+-----+---------+-------+
Thanks to Tim, and all who helped me with this kludge.
Regards,
Van
--
=========================================================================
Linux rocks!!! www.dedserius.com
=========================================================================