After thinking a bit more about it, I think my formula for pq_cost was
wrong. See inline.
On 12/ 2/10 03:39 PM, Øystein Grøvlen wrote:
>>> I am still not convinced that the two usages of PQ_slowness actually
>>> represent the same constant. What makes PQ_slowness applicable in
>>> both contexts? AFAICT, the value of PQ_slowness is based on scenarios
>>> where only one of the formulas applies.
>> please elaborate.
> OK. I will try. Maybe the problem is not PQ_slowness, but the formulas
> where it is used.
> What is compared by the unit is, AFAICT, the cost when all rows fit in
> the sort buffer. In that case, the cost of sort merge is
> sm_cost= q*n*log(n)
> where n is total number of rows and q is a constant specific to qsort.
> The cost of using a priority queue is
> pq_cost= p*n*log(n) + q*l*log(l)
The above formula is wrong. Since there are only l elements in the
queue, each insertion attempt is O(log(l)). So I guess the formula
should be
pq_cost= p*n*log(l) + q*l*log(l) = (p*n + q*l)*log(l),
which makes things even more complicated:
pq_cost/sm_cost= (p/q + l/n)*log(l)/log(n)
pq_cost/sm_cost < 1 <=>
(p/q + l/n)*log(l)/log(n) < 1 <=>
p/q + l/n < log(n)/log(l) <=>
log(n)/log(l) - l/n > p/q
Correct me if I am wrong, but this does not seem to be easy to simplify.
If we look at the cost when stripping off add-on fields, the code contains:
+ const double pq_cost=
+ (PQ_slowness * num_rows + param->num_keys_per_buffer) *
+ log(param->max_rows + 1.0) / TIME_FOR_COMPARE_ROWID +
+ param->max_rows * table->file->scan_time();
Since param->num_keys_per_buffer is set to param->max_rows + 1 at the
start of this function, using the terminology above, the main-memory
work in this formula is equivalent to:
(PQ_slowness * n + l) * log(l) (assuming that the +1 is not significant)
We see that this is equivalent to my revised pq_cost formula above, when
q=1 and p=PQ_slowness. So this formula seems to be theoretically sound.
I suggest that the same strategy be used in the first cost comparison,
when all rows fit in memory. That is, setting
const double pq_cost=
  (PQ_slowness * num_rows + param->num_keys_per_buffer) *
  log(param->max_rows + 1.0);
const double sm_cost= num_rows * log(num_rows);
and then comparing those costs, instead of the current "short-cut".
The PQ_slowness value probably needs to be revisited in that case.
> where l is the limit value, p is a constant specific to insertion into
> the priority queue, and n and q are as before.
> pq_cost/sm_cost = p/q + (l*log(l))/(n*log(n))
> The priority queue should be preferred when pq_cost/sm_cost < 1, which
> gives:
> l/n * log(l)/log(n) < 1 - p/q <=>
> n/l * log(n)/log(l) > 1/(1-p/q) <=>
> n/l * log(n)/log(l) > q/(q-p)
> It seems that in the code the log(n)/log(l) factor is ignored. In
> that case, you get:
> n/l > q/(q-p) <=> l < n/PQ_slowness, where PQ_slowness=q/(q-p).
> So the main question is then: Is it OK to ignore the log(n)/log(l) factor?
> In addition, I am not convinced that q/(q-p) is the appropriate constant
> for the other formula. However, I will need to investigate this a bit
> further, before I am quite sure.