As I'm new on this list and can't find a "policy statement", I assume it
is OK to reply to the list in "discussion list" fashion. If instead it is
preferred to reply to each sender privately and later I'd post a summary,
please let me know and I'll comply.
On Sat, 1 Sep 2007, Shawn Green wrote:
> Lucio Chiappetti wrote:
> > I have some queries, involving a largish number of JOIN, which are
> > [...] very slow or [...] remain in the "statistics" status [forever]
> > This involved creating a working table G
> > [...]
> > The simultaneous access was achieved in our interface by a mechanism
> > we called virtual tables, which essentially was
> > SELECT
> > some subset of columns in some of the t0...tn
> > or some expression thereof
> > FROM
> > G left join t0 on G.t0=t0.seq
> > left join t1 on G.t1=t1.seq
> > ...
> > left join tn on G.tn=tn.seq
> > We refer to the t0...tn as "the member tables" of G.
> I normally do not get lost in symbolic descriptions such as yours.
> However your description of G and how you build it leaves me in the
> Can you show me a few sample rows of G (symbolically, if you like) and
> describe what is in each column G.t0 to G.tn? What I am curious to know
> is what do each of these n object have in common that allows them to
> represented as a single tuple on the G table.
The explanation was already (too tersely ?) contained in my sentence
> > This table has columns named t0 t1 ... tn (containing the pointers
> > t0.seq t1.seq ... for counterparts associated and validated according
> > to some criteria) plus other service columns
Let me make an example.
t0 is a table of X-ray sources, it contains sky coordinates (ra,dec),
count rates and fluxes in 5 energy bands, detection probabilities etc.
In addition it contains a column named "seq" which is just an
auto_increment sequential number, and is the unique way to identify a
given source (there is a further complication due to the fact some records
are redundant detections of the same source, but since they are not used
in building the G's you'd better ignore this).
t1 might be e.g. a table of optical sources, with sky coordinates,
magnitudes, and, of course, its own "seq". There may be different tables
of optical sources (t1, t2, t3).
Or t4 can be a table of radio sources, with sky coordinates, fluxes, and
Or t5 can be a table of URLs into external astronomical sites like SIMBAD
or NED (if you know them), again with sky coordinates and a "seq".
Some of all these tables have their own identifiers, but sometimes these
aren't unique, or aren't numeric. So in general these tables MAY have an
UNIQUE PRIMARY index which can be either the original identifier, or some
combination of original columns (for instance a source identifier and a
field identifier) and an unique auxiliary key which is my auto_increment
seq (numeric, built at data ingestion). This again is mostly irrelevant to
you. The point is that each table has an unique auto_increment seq.
A "G" table will simply contain the "seq's" in the "member tables". For
ease of use the column names in G will be the table names of the member
tables. I indicated them as t0 t1 t2 ... that their actual names are e.g.
"nov06", "d1t3", "ukidss", "radio", "simbad" is irrelevant.
So a record in G may contain for instance :
- its own seq 253719 (do not be worried by the fact the number is large
there are lots of gaps for records removed during construction)
- the seq in the X-ray table (t0) : 1521
- the seq in an optical table (t1) : 1229
- the seq in another optical table (t2) : 42168
- the seq in the radio table (t3) : null
- the seq in an IR table (t4) : 9
Another record with seq 260429 can have the same t0=1521, but e.g. the seq
in the IR table t4=11, and all other t1 t2 t3 null.
Essentially G says that X-ray source 1521 can have up to 2 (or 1 or 7 or
whatever) potential counterparts, one is optical t2=1229 which is the same
as optical t3=42168 and the same as IR t4=9 ; the other is only IR t4=11,
All associations are pre-computed via some sort of other (proximity)
> You also mention other "service columns". What kinds of information are
> you keeping in those?
Information which is irrelevant to the present discussion, except for a
marginal point (see below). Like for instance a numeric rank which says
that the association 1521/1229/42168/null/9 is preferred, and the
association 1521/null/null/null/11 is unlikely, or to be rejected. Or
flags produced during the identification. Or the chance probabilities that
the association of a source in t0 and t1 or t2 is real considered the
distance and the density of objects having a given magnitude.
> > We have different versions of G corresponding to different sets
> > of member tables and different association criteria.
> "Each G" ? Again, that makes the concept of what a G really is more
> confusing to me. I understand databases and I know more than a little
> about stellar cartography, cosmology, and physics. Please don't hold
Each G in the sense of different versions. I may have one G starting from
a t0 with X-ray sources in an area of the sky, and another with a t0 in
another area or coming from a different analysis. Or even with the same t0
I may have a G whose members are t0 t1 t2 t3 (identification done two
years ago with tables available then) and a G whose members are t0 t1 t5
t6 t7 t8 t9 t10 (identification done now, with more tables, or with newer
versions of some tables). Again this is irrelevant for MY PROBLEM, except
for the following statement :
- I have no problems with the SELECT statements I reported above for
whatever number of t0...tn member tables. Neither if I issue the
statement as such, nor if I encapsulate it in a CREATE VIEW
- I start encounter problems if I want to join the VIEW created with the
above statement with something else, when n is slighly large (e.g. a G
with 11 members), but if I use ALGORITHM=TEMPTABLE the problem goes
away, at least for n=26 (my largest G).
- I have a problem again for a more complex join, which arises around
Actually it's look like further experimenting shows a solution, but
I'm still unsure whether that gives the same results of the old one :
such a solution is : replace all LEFT JOIN by STRAIGHT_JOIN
Maybe you'd just want to comment on that ?
> Please do explain the data problems you are trying to solve in some more
> detail as I may be able to help you to design a less cumbersome method
> of achieving the same goals.
I thank you for your offer. However we do not want to revolutionize all
our system and interfaces (Shawn, if you are interested I can give you
privately more specific pointers), which has been working well since some
years and used inside our consortium, and has just "gone public" for some
The "data problem" is likely to have no or little relation with the
astronomical nature of the data, or the background explained above (which
howver I hope was of some interest), but can be summarized as follows.
Note the problems occur ONLY in the latest steps, which are those of
lesser importance for the user (or important for less users) :
- there are n tables (all with their "seq" column), which contain
several columns, of which some are particularly interesting, and
some other are less interesting
- these tables are associated via a "G" table which links all the seqs
of associated objects. This is precomputed once forever.
- the majority of the users will be interested in a small number of
interesting columns taken from some of the member tables (or
expressions thereof). Howevever we wanted to screen them from entering
in a SELECT statement columns explicitly by names like nov06.seq,
nov06.fluxb, w1t3.magip or even
dist(nov06.ra_corr,nov06.dec_corr,w1t3.ra,w1t3.decl)*3600 (where dist
is an UDF) ... and I have even worse expressions.
So we originally devised our own java interface to somehow hide the
expansion of such aliases, and later embedded the aliases in a
The "FROM" of such statements is the one listed above, involving
G and a sequence of left joins on t0 to tn.
This works nicely and fast in both the old (no view) and new (with
- a minority of users could be interested in accessing also some of
the columns in the member tables, which are NOT listed in our
non-hidden list or are NOT members of the view.
In the old arrangement, we simply had a check box which enabled
"view member also" and un-hided the hidden columns in a menu, Also
the user could simply type their name (say nov06.snrcd) in the
SELECT if he knew the name.
Such statement was the SAME involving G and a sequence of left joins on
t0 to tn. It is just a matter of listing more columns in the
"select" before the "from".
However if the new arrangement uses a view V to hide the entire
statement, if I WANT USERS TO BE ABLE ALSO TO SEE HIDDEN COLUMNS
(if you want THIS is the data problem !!!) I have to explicitly
repeat the list of joins. I cannot issue
select nov06.snrcd from V
because nov06.snrcd was not included in the definition of the view
(and there are too many columns in the n members t0...tn to be all
named in a viable view). So I generate a statement like that listed
at about 2/3 of the post which started the thread
list of (V.colname and ti.colname with i chosen among 0 and n)
( G left join t0 on G.t0=t0.seq
left join t1 on G.t1=t1.seq
left join tn on G.tn=tn.seq
) left join V on G.seq=V.seq
Again this works provided one uses ALGORITHM=TEMPTABLE in the CREATE
VIEW. Despite the fact some redundant queries are issued.
- a real minority of users (me only ? for debugging purposes ? it
happened me once that I detected and fixed a typo that way )
could be interested in accessing not only the interesting columns
in the view AND the hidden columns in the members, but also
columns in a FURTHER TABLE tk (e.g. which is not yet a member, or
is a new release of some member which I have to evaluate whether
to replace). This involves a more complex join ... which is listed
towards the end of the original post and I won't repeat here
Such join involves one further view which in turn is a join of
the G and a correlation table between tk and the X-ray table t0
(essentially listing all seqs in t0 and tk whose object are closer
than a predefined sky distance).
It is this latter statement (of rare use, hence my reluctance to
revolutionize all the rest) which enters the "statistics state"
forever for large number of members.
However, as I said, replacing LEFT JOIN by STRAIGHT_JOIN seems
to solve it.
Now coming to more general issues, I have read (or re-read) most of the
manual pages you quoted, but your explanation below is what condenses more
effectively what happens.
> EXPLAIN SELECT simply stops a normal SELECT statement from actually
> performing the data retrieval steps and shows us [...]
I used in the past EXPLAIN SELECT mainly to check whether introducing an
index (or sometimes a parenthesized order) improved the performance,
although why it did it was sort of black magic.
> It is during this optimization phase that most of your CPU time is being
> used as the engine will work many permutations of joining one table to
> another until it reaches a decision about which plan is "less expensive"
> that all of the others.
So essentially this matches what a lost Google reference said, that in
case of many joins an excessive times may be spent checking up to n!
combinations unless one somehow "forces the order".
What I gather from some hints in the documentation (and my very rough and
quick test) is that STRAIGHT_JOIN might be the way to force the order (but
will it have some unpleasant side effects or is it safe ?)
Lucio Chiappetti - INAF/IASF - via Bassini 15 - I-20133 Milano (Italy)
For more info : http://www.iasf-milano.inaf.it/~lucio/personal.html
() ascii ribbon campaign - against html mail