List:General Discussion« Previous MessageNext Message »
From:Lucio Chiappetti Date:September 3 2007 11:36am
Subject:Re: excessive time spent in "statistics" status
View as plain text  
As I'm new on this list and can't find a "policy statement", I assume it 
is OK to reply to the list in "discussion list" fashion. If instead it is 
preferred to reply to each sender privately and later I'd post a summary, 
please let me know and I'll comply.

On Sat, 1 Sep 2007, Shawn Green wrote:
> Lucio Chiappetti wrote:

> > I have some queries, involving a largish number of JOIN, which are
> > [...] very slow or [...] remain in the "statistics" status [forever]

> > This involved creating a working table G
> > [...]
> > The simultaneous access was achieved in our interface by a mechanism
> > we called virtual tables, which essentially was
> > 
> > SELECT
> > some subset of columns in some of the t0...tn
> > or some expression thereof
> > FROM
> > G left join t0 on G.t0=t0.seq
> >   left join t1 on G.t1=t1.seq
> > ...
> >   left join tn on G.tn=tn.seq    
> > 
> > We refer to the t0...tn as "the member tables" of G.

> I normally do not get lost in symbolic descriptions such as yours. 
> However your description of G and how you build it leaves me in the 
> dark.
> 
> Can you show me a few sample rows of G (symbolically, if you like) and 
> describe what is in each column G.t0 to G.tn?  What I am curious to know 
> is what do each of these n object have in common that allows them to 
> represented as a single tuple on the G table.

The explanation was already (too tersely ?) contained in my sentence

> > This table has columns named t0 t1 ... tn (containing the pointers
> > t0.seq t1.seq ... for counterparts associated and validated according
> > to some criteria) plus other service columns

Let me make an example. 

t0 is a table of X-ray sources, it contains sky coordinates (ra,dec), 
count rates and fluxes in 5 energy bands, detection probabilities etc.
In addition it contains a column named "seq" which is just an 
auto_increment sequential number, and is the unique way to identify a 
given source (there is a further complication due to the fact some records 
are redundant detections of the same source, but since they are not used 
in building the G's you'd better ignore this).

t1 might be e.g. a table of optical sources, with sky coordinates, 
magnitudes, and, of course, its own "seq". There may be different tables 
of optical sources (t1, t2, t3). 

Or t4 can be a table of radio sources, with sky coordinates, fluxes, and 
its "seq".

Or t5 can be a table of URLs into external astronomical sites like SIMBAD 
or NED (if you know them), again with sky coordinates and a "seq".

Some of all these tables have their own identifiers, but sometimes these 
aren't unique, or aren't numeric. So in general these tables MAY have an 
UNIQUE PRIMARY index which can be either the original identifier, or some 
combination of original columns (for instance a source identifier and a 
field identifier) and an unique auxiliary key which is my auto_increment 
seq (numeric, built at data ingestion). This again is mostly irrelevant to 
you. The point is that each table has an unique auto_increment seq.

A "G" table will simply contain the "seq's" in the "member tables". For 
ease of use the column names in G will be the table names of the member 
tables. I indicated them as t0 t1 t2 ... that their actual names are e.g. 
"nov06", "d1t3", "ukidss", "radio", "simbad" is irrelevant.

So a record in G may contain for instance :

 - its own seq 253719 (do not be worried by the fact the number is large 
   there are lots of gaps for records removed during construction)
 - the seq in the X-ray table (t0) : 1521
 - the seq in an optical table (t1) : 1229
 - the seq in another optical table (t2) : 42168
 - the seq in the radio table (t3) : null
 - the seq in an IR table (t4) : 9

Another record with seq 260429 can have the same t0=1521, but e.g. the seq 
in the IR table t4=11, and all other t1 t2 t3 null.

Etc. etc.

Essentially G says that X-ray source 1521 can have up to 2 (or 1 or 7 or 
whatever) potential counterparts, one is optical t2=1229 which is the same 
as optical t3=42168 and the same as IR t4=9 ; the other is only IR t4=11, 
etc. etc.

All associations are pre-computed via some sort of other (proximity) 
analysis.

> You also mention other "service columns". What kinds of information are 
> you keeping in those?

Information which is irrelevant to the present discussion, except for a 
marginal point (see below). Like for instance a numeric rank which says 
that the association 1521/1229/42168/null/9 is preferred, and the 
association 1521/null/null/null/11 is unlikely, or to be rejected. Or 
flags produced during the identification. Or the chance probabilities that 
the association of a source in t0 and t1 or t2 is real considered the 
distance and the density of objects having a given magnitude.

> > We have different versions of G corresponding to different sets
> > of member tables and different association criteria.

> "Each G" ?  Again, that makes the concept of what a G really is more 
> confusing to me. I understand databases and I know more than a little 
> about stellar cartography, cosmology, and physics. Please don't hold 
> back.

Each G in the sense of different versions. I may have one G starting from 
a t0 with X-ray sources in an area of the sky, and another with a t0 in 
another area or coming from a different analysis. Or even with the same t0 
I may have a G whose members are t0 t1 t2 t3 (identification done two 
years ago with tables available then) and a G whose members are t0 t1 t5 
t6 t7 t8 t9 t10 (identification done now, with more tables, or with newer 
versions of some tables). Again this is irrelevant for MY PROBLEM, except 
for the following statement :

 - I have no problems with the SELECT statements I reported above for
   whatever number of t0...tn member tables. Neither if I issue the
   statement as such, nor if I encapsulate it in a CREATE VIEW

 - I start encounter problems if I want to join the VIEW created with the 
   above statement with something else, when n is slighly large (e.g.  a G 
   with 11 members), but if I use ALGORITHM=TEMPTABLE the problem goes 
   away, at least for n=26 (my largest G).

 - I have a problem again for a more complex join, which arises around
   n=20. 

   Actually it's look like further experimenting shows a solution, but
   I'm still unsure whether that gives the same results of the old one :

   such a solution is : replace all LEFT JOIN by STRAIGHT_JOIN

   Maybe you'd just want to comment on that ?

> Please do explain the data problems you are trying to solve in some more 
> detail as I may be able to help you to design a less cumbersome method 
> of achieving the same goals.
 
I thank you for your offer. However we do not want to revolutionize all 
our system and interfaces (Shawn, if you are interested I can give you 
privately more specific pointers), which has been working well since some 
years and used inside our consortium, and has just "gone public" for some 
datasets.

The "data problem" is likely to have no or little relation with the 
astronomical nature of the data, or the background explained above (which 
howver I hope was of some interest), but can be summarized as follows. 
Note the problems occur ONLY in the latest steps, which are those of 
lesser importance for the user (or important for less users) :

 - there are n tables (all with their "seq" column), which contain
   several columns, of which some are particularly interesting, and
   some other are less interesting

 - these tables are associated via a "G" table which links all the seqs
   of associated objects. This is precomputed once forever.

 - the majority of the users will be interested in a small number of 
   interesting columns taken from some of the member tables (or 
   expressions thereof). Howevever we wanted to screen them from entering 
   in a SELECT statement columns explicitly by names like nov06.seq, 
   nov06.fluxb, w1t3.magip or even 
   dist(nov06.ra_corr,nov06.dec_corr,w1t3.ra,w1t3.decl)*3600 (where dist 
   is an UDF) ... and I have even worse expressions.

   So we originally devised our own java interface to somehow hide the
   expansion of such aliases, and later embedded the aliases in a
   CREATE VIEW.

   The "FROM" of such statements is the one listed above, involving
   G and a sequence of left joins on t0 to tn.

   This works nicely and fast in both the old (no view) and new (with
   view) way.

 - a minority of users could be interested in accessing also some of
   the columns in the member tables, which are NOT listed in our 
   non-hidden list or are NOT members of the view.

   In the old arrangement, we simply had a check box which enabled
   "view member also" and un-hided the hidden columns in a menu, Also
   the user could simply type their name (say nov06.snrcd) in the
   SELECT if he knew the name.

   Such statement was the SAME involving G and a sequence of left joins on 
   t0 to tn. It is just a matter of listing more columns in the 
   "select" before the "from".

   However if the new arrangement uses a view V to hide the entire
   statement, if I WANT USERS TO BE ABLE ALSO TO SEE HIDDEN COLUMNS
   (if you want THIS is the data problem !!!) I have to explicitly
   repeat the list of joins. I cannot issue

     select nov06.snrcd from V

   because nov06.snrcd was not included in the definition of the view
   (and there are too many columns in the n members t0...tn to be all
   named in a viable view). So I generate a statement like that listed
   at about 2/3 of the post which started the thread

   SELECT 
    list of (V.colname and ti.colname with i chosen among 0 and n)
   FROM
    ( G left join t0 on G.t0=t0.seq
        left join t1 on G.t1=t1.seq
        ...
        left join tn on G.tn=tn.seq    
    )   left join V  on G.seq=V.seq 

   Again this works provided one uses ALGORITHM=TEMPTABLE in the CREATE 
   VIEW. Despite the fact some redundant queries are issued.

 - a real minority of users (me only ? for debugging purposes ? it 
   happened me once that I detected and fixed a typo that way )
   could be interested in accessing not only the interesting columns
   in the view AND the hidden columns in the members, but also
   columns in a FURTHER TABLE tk (e.g. which is not yet a member, or
   is a new release of some member which I have to evaluate whether 
   to replace). This involves a more complex join ... which is listed
   towards the end of the original post and I won't repeat here

   Such join involves one further view which in turn is a join of
   the G and a correlation table between tk and the X-ray table t0
   (essentially listing all seqs in t0 and tk whose object are closer
   than a predefined sky distance).

   It is this latter statement (of rare use, hence my reluctance to
   revolutionize all the rest) which enters the "statistics state"
   forever for large number of members.
   
   However, as I said, replacing LEFT JOIN by STRAIGHT_JOIN seems
   to solve it.

Now coming to more general issues, I have read (or re-read) most of the 
manual pages you quoted, but your explanation below is what condenses more 
effectively what happens.

> EXPLAIN SELECT simply stops a normal SELECT statement from actually 
> performing the data retrieval steps and shows us [...] 

I used in the past EXPLAIN SELECT mainly to check whether introducing an 
index (or sometimes a parenthesized order) improved the performance, 
although why it did it was sort of black magic.

> It is during this optimization phase that most of your CPU time is being 
> used as the engine will work many permutations of joining one table to 
> another until it reaches a decision about which plan is "less expensive" 
> that all of the others.

So essentially this matches what a lost Google reference said, that in 
case of many joins an excessive times may be spent checking up to n! 
combinations unless one somehow "forces the order".

What I gather from some hints in the documentation (and my very rough and 
quick test) is that STRAIGHT_JOIN might be the way to force the order (but 
will it have some unpleasant side effects or is it safe ?)

Thanks again.

-- 
-----------------------------------------------------------------------
Lucio Chiappetti - INAF/IASF - via Bassini 15 - I-20133 Milano (Italy)
For more info : http://www.iasf-milano.inaf.it/~lucio/personal.html
-----------------------------------------------------------------------
()  ascii ribbon campaign - against html mail 
/\                          http://arc.pasp.de/
Thread
excessive time spent in "statistics" statusLucio Chiappetti31 Aug
  • Re: excessive time spent in "statistics" statusShawn Green1 Sep
    • Re: excessive time spent in "statistics" statusLucio Chiappetti3 Sep
      • Re: excessive time spent in "statistics" statusShawn Green3 Sep
        • Re: excessive time spent in "statistics" statusLucio Chiappetti4 Sep
          • Re: excessive time spent in "statistics" statusLucio Chiappetti4 Sep
            • Re: excessive time spent in "statistics" statusShawn Green4 Sep
              • Re: excessive time spent in "statistics" statusLucio Chiappetti5 Sep
        • Re: excessive time spent in "statistics" statusJeremy Cole14 Sep
          • Re: excessive time spent in "statistics" statusLucio Chiappetti14 Sep
          • Re: excessive time spent in "statistics" statusPete Harlan17 Sep
            • Re: excessive time spent in "statistics" statusBaron Schwartz17 Sep