Rick Dwyer <rdwyer@stripped> wrote on 11/17/2005 10:28:51 AM:
> Hello All.
> I am hoping for a bit of help with some code that has really given me
> some trouble. If this is not he correct forum for this any help in
> pointing me to a more suited list would be appreciated.
> I have a MySQL 4.1.x database containing records with phone numbers.
> Most of the phone numbers are enter in 12035551212 format, but some
> are entered with spaces or "-" or "(" or other characters.
> I need to read the first 4 positions in the phone number to determine
> it's location.
> My statement looks like this:
> 'Select mid(phone, 1,4) as phoneareacode from phonetable'
> This works but if the number is entered as 1(203)-555-1212 the above
> would return "1(20" which is not what I am looking for.
> Is there a way to have the select statement examine only numeric
> values in the phone number so it would disregard the other charcters?
> In Lasso, you can use a Replace with a Regular Expression function to
> have just the digits 0-9 examined but haven't been able find a way
> to do this in SQL.
> Any help is appreciated.
> Thank you.
The problem is, you have "dirty" data making it very hard to use it as a
search target. There are different ways of handling this:
a) scrub your data (preferably during input or import) so that all phone
numbers are stored in the exact same pattern
b) store each number as its component parts (country code, city/area code,
c) add a field of scrubbed data to your existing table and populate it.
d) add fields for each part of the phone number to your tables and
e) handle all searching and substring matching outside of SQL.
Options a) and c) require that a "clean" copy of the data be stored in the
database. That means that you pick a pattern and make all of your numbers
look like that pattern. If, for instance, you get just the number 555-1212
you would need to generate something like 'x-xxx-555-1212' as a replacment
(where the x is used to indicate missing information). This is slower to
search on because you have to do substring matches but since phone numbers
are already organized from least-specific to most-specific (left to right)
it's already optimized for some types of substring matches.
Options b) and d) provide the ability to index each part of a phone
number. Here is an example breakdown:
1 (203) 555-1212 -> country code:1, citycode:203, exchange: 555, circut:
If any part of a number is missing, you can use a NULL value for that
555-1212 -> country code: NULL, citycode: NULL, exchange: 555, circut:
This would be all integer comparisons, VERY fast to search. If you have
millions of phone numbers or international phone numbers, consider this
schema. It may be hard to determine the foreign `exchange` and `circuit`
parts. In that case just pick either the `exchange` or `circuit` field and
put that whole portion of it in there. This takes more time to set up but
is MUCH faster to search (how do you think the phone company does it?)
Options b) and d) also allow you to have searchable data while preserving
the original information.
Option e) has the most flexibility but takes the database server out of
the loop, which will destroy your search performance.
Unimin Corporation - Spruce Pine