List: Commits
From: Øystein Grøvlen
Date: December 3 2010 10:20am
Subject: Re: bzr commit into mysql-next-mr-bugfixing branch (tor.didriksen:3236) WL#1393
Hi,

After thinking a bit more about it, I think my formula for pq_cost was
wrong.  See inline.

On 12/2/10 03:39 PM, Øystein Grøvlen wrote:
>>> I am still not convinced that the two usages of PQ_slowness actually
>>> represents the same constant. What make PQ_slowness applicable in
>>> both contexts? AFAICT, the value of PQ_slowness is based on scenarios
>>> where only one of the formulas apply.
>>
>
> OK. I will try. Maybe the problem is not PQ_slowness, but the formulas
> where it is used.
>
> What is compared by the unit is, AFAICT, the cost when all rows fit in
> the sort buffer.  In that case, the cost of sort merge is
>
> sm_cost= q*n*log(n)
>
> where n is total number of rows and q is a constant specific to qsort.
>
> The cost of using a priority queue is
>
> pq_cost= p*n*log(n) + q*l*log(l)

The above formula is wrong.  Since there are only l elements in the
queue, each attempt at inserting is O(log(l)). So I guess the formula
should be:

pq_cost= p*n*log(l) + q*l*log(l) = (p*n + q*l)*log(l),

which makes things even more complicated:

pq_cost/sm_cost= (p/q + l/n)*log(l)/log(n)

pq_cost/sm_cost < 1 <=>

(p/q + l/n)*log(l)/log(n) < 1 <=>

p/q + l/n < log(n)/log(l) <=>

log(n)/log(l) - l/n > p/q

Correct me if I am wrong, but this does not seem easy to simplify any
further.

If we look at the cost when stripping off add-on fields, the code contains:

+      const double pq_cost=
+        (PQ_slowness * num_rows + param->num_keys_per_buffer) *
+        log(param->max_rows + 1.0) / TIME_FOR_COMPARE_ROWID +
+        param->max_rows * table->file->scan_time();
+

Since param->num_keys_per_buffer is set to param->max_rows + 1 at the
start of this function, using the terminology above, the main-memory
work in this formula is equivalent to:

(PQ_slowness * n + l) * log(l)  (assuming that +1 is not significant)

We see that this is equivalent to my revised pq_cost formula above, when
q=1 and p=PQ_slowness.  So this formula seems to be theoretically sound.

I suggest that the same strategy be used in the first cost comparison,
when all rows fit in memory.  That is, setting

const double pq_cost=
(PQ_slowness * num_rows + param->num_keys_per_buffer) *
log(param->num_keys_per_buffer)

and

const double sm_cost= num_rows * log(num_rows);

and then compare those costs instead of the current "short-cut".
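A sketch of what that comparison could look like (my own illustration,
not the actual patch; the PQ_slowness value of 3.0 is assumed here):

```cpp
// Sketch of the suggested in-memory cost comparison (illustration only,
// not the actual patch).  num_keys_per_buffer plays the role of l (the
// queue size) and num_rows the role of n.
#include <cmath>

const double PQ_slowness = 3.0;  // assumed value; would need re-calibration

// True if the priority queue is expected to beat sorting all rows with
// qsort when everything fits in the sort buffer.
bool prefer_priority_queue(double num_rows, double num_keys_per_buffer) {
  // (p*n + q*l) * log(l), with q = 1 and p = PQ_slowness
  const double pq_cost =
      (PQ_slowness * num_rows + num_keys_per_buffer) *
      std::log(num_keys_per_buffer);
  // q * n * log(n), with q = 1
  const double sm_cost = num_rows * std::log(num_rows);
  return pq_cost < sm_cost;
}
```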

The PQ_slowness value probably needs to be revisited in that case.

--
Øystein

>
> where l is the limit value, p is a constant specific to insertion into
> the priority queue, and n and q are as before.
>
> So,
>
> pq_cost/sm_cost = p/q + (l*log(l))/(n*log(n))
>
> The priority queue should be preferred when pq_cost/sm_cost < 1, which
> gives:
>
> l/n * log(l)/log(n) < 1 - p/q <=>
>
> n/l * log(n)/log(l) > 1/(1-p/q) <=>
>
> n/l * log(n)/log(l) > q/(q-p)
>
> It seems that in the code the log(n)/log(l) factor is ignored.  If you
> do that, you get:
>
> n/l > q/(q-p) <=> l < n/PQ_slowness, where PQ_slowness=q/(q-p).
>
> So the main question is then: Is it OK to ignore the log(n)/log(l) factor?
>
> In addition, I am not convinced that q/(q-p) is the appropriate constant
> for the other formula.  However, I will need to investigate this a bit
> further before I am quite sure.
>