Fabricio Mota <fabricio.mota@stripped> wrote on 11/22/2005 11:07:51 PM:
> ---------- Forwarded message ----------
> From: Fabricio Mota <fabricio.mota@stripped>
> Date: 22/11/2005 23:06
> Subject: Re: DDE - Distributed indexes
> To: "Peter B. Volk" <peter.benjamin.volk@stripped>
>
> Peter,
>
> I've spend many neurons (hahahaha) thinking about it, too, and the only
> concrete conclusion I've got is: it will give us a big headache.
> The problem is several: degrading local performance, degrading GLOBAL
> performance (a lot, I imagine), unavailability and (bad) inconsistences
> during eventual whole fails, and other thing that has disturbed me a
little,
> that is: it may violate the "Principle of Reusing What MySQL Has Already
> Made".
>
> So, I've rereel the thought to the start of all, and I said to myself:
> "Let's think about Database concepts. How important is an index? Why
does
> the people create index on their tables?
> And myself answered to me: "people do that to improve the performance
for
> only SELECT commands."
> And I asked to myself again "When a global select being requested from a
LDD
> table, what would the system do?"
> And myself answered to me again:
> "The master server forwards the requested command to all servers; they
will
> process it as local command, and will reply their (partial) datasets to
the
> master server. The master server then will merge the results into a only
> dataset, by means of a *union all *command."
>
> And I started to ask to myself, and to you, too: Does a *union all* use
> indexes?
>
> Second point: I started to think with the reverse way, too (when we want
a
> solution, we search it through any ways :) ). "Independently of how
costly
> would it be, how interesting is to us keeping a global index?"
> That is another question good to make us think, just because right now
I'm
> starting to think that the indexes benefits wouldn't work so good for
DDS.
> The problem is that, as each row has its independent reference, it could
> take the system to retrieve row by row. That is not efficiently, is it?
>
> Maybe I am having difficulty to catch the point of need to use global
> indexes.
>
> Well, in the other hand, I'm sure that there may to be new hypothesis
that
> my reasoning didn't reach. And about doing it, as I've seen, you are
good.
>
> Waiting for response.
>
> FM
>
> 2005/11/22, Peter B. Volk <peter.benjamin.volk@stripped>:
> >
> > Hello Fabricio,
> >
> >
> >
> > I've been spending many showers on trying to find a solution how to
> > distribute the indexes of LDD tables to the whole cluster to do query
> > optimizations. Well what came out is that this is a hard one. My
> > idea would
> > be to create a protocol between the server to transport the commands
on
> > the
> > local index to the global index. The server would have 2 physical per
> > logical index on a LDD table. A local index and a global Index. An
I/U/D
> > command would first modify the local index and then distribute the
index
> > command (I/D) to all servers so they can put it into their own local
> > index.
> > A SELECT statement on an index would first look into the global index
> > structure. If the global index structure indicates that the rows are
> > distributed on several servers then the sever would query the other
> > servers
> > (best case: query only the servers that have o potential row). The
queried
> > servers would use their local indexes to look up the local data.
> >
> >
> >
> > How to replicate the commands? Well in general there are only 2
actions on
> >
> > an index (if we assume normal b* trees and not R or R* trees or other
> > special indexes): Insert and delete. These command should be packed
into a
> > few byte and then do a kind of a multicast among the servers. The
Global
> > Index should only have the server_id and the key in the index
structure.
> > So
> > a SELECT statement would always end up in a index lookup of the global
> > index->query all servers (inclusive the init server)-> collect
data->do
> > additional aggregation, grouping etc. forward to end-user.
> >
> >
> >
> > What do you think?
> >
> >
> >
> > Peter
> >
> >
>
>
> --
>
> Sem mais,
>
> Fabricio Mota
> Oda Mae Brown - Aprecie sem moderação.
> http://www.odamaebrown.com.br
>
> --
>
> Sem mais,
>
> Fabricio Mota
> Oda Mae Brown - Aprecie sem moderação.
> http://www.odamaebrown.com.br
A global (cluster-wide) index could be useful with one additional piece of
information, the ID of the server on which the row exists. Maintenance of
a global index can occur similarly to replication (an INDEX update would
be replicated from a node server to the cluster server instead of or in
addition to a DATA update event as the cluster server may or may not have
the actual data on it)
There are many queries that can be answered directly from an index so a
global index could allow response to certain queries directly from the
master (cluster) server. It would also allow you to get a confidence
figure describing how complete your results should be.
What do I mean by confidence figure? If a query is written so that it uses
at least one indexed column, the global index will be able to identify
which servers have data matching a condition on that column. You can
respond with the # of rows found, # of servers containing data, and the #
of servers responding with results. If the last two numbers do not match,
you know you only have partial results (less than 100% confidence). For
queries that use no indexed columns at all, then every server from the
cluster must check its data for matching conditions. Again if any server
fails to respond, you have less than complete confidence in your results.
This also implies that if you want to optimize cluster response, not only
global indexes but a supervisory query parser would be useful. What
happens when data from tables on one server is to be JOINED with data from
tables on another server? In this case, some machine (a node server or the
cluster master) must perform the actual JOIN. The flow of data in this
case tells us that for JOINs that can use indexes, a global index would be
useful in reducing the quantity of cross traffic between servers.
That last train of thought brings up an interesting question: If I run a
query that needs to JOIN two or more tables within the same cluster like
this
SELECT cust.name
, sales.tx_id
, sum(items.qty) as units
, sum(items.qty * items.price) as grand_total
FROM salestickets sales
INNER JOIN transactions tx
on sales.tx_id = tx.id
INNER JOIN customers cust
on tx.customers_id = cust.id
INNER JOIN lineitems items
on items.salestickets_id = sales.id
WHERE cust.id=1234356
GROUP BY cust.name, sales.tx_id;
Assuming that servers RED, BLUE, GOLD, BROWN, and BLACK are in the cluster
and that only RED, BLUE and GOLD are the servers that contain any of these
four tables, what kind of query process would we need to implement in
order to efficiently answer this problem? Should we modify this query or
its results to include some column to identify the server(s) generating
the data present in the result? I ask because if those 3 servers are
geographically separate, it would make sense that each one could have a
different customer with the ID of 12345. We could get 3 sets of rows back
(one from each server) plus possibly other sets of rows generated by the
combinatons of the different tables present in the different servers. This
query would return erroneous results if it tried to combine rows from
different servers (by joining the salestickets table from RED with the
customers table of BLUE, for instance) because the data on each server is
internally consistent only within that server. Other queries may NEED to
combine data from separate servers in order to respond correctly. I think
this means that various "hints" need to be available to help modify the
behavior of cluster-level queries. One hint could eliminate
cross-sever-joins. Another could be to modify the results so that
duplicate responses from multiple servers would be removed (like UNION vs.
UNION ALL but on a cluster scale)
I know I just gave you a lot to think about. Sorry for the headache ;-)
Shawn Green
Database Administrator
Unimin Corporation - Spruce Pine