Peter and Shawn,
Ok, boys! You have the reason! But this does not mean that we don't have
problems! hahaha
Lets bit for bit. See above:
2005/11/23, Peter B. Volk <peter.benjamin.volk@stripped>:
>
> Hey,
>
>
> Exactly what I was going to say :) It's all about optimization. To your
> concerns that it may violate to reuse code that has already been written:
> The local and global indexes would be handled as normal index structures
> with a bit of overhead. a Function could look like this:
>
>
>
> insertIntoIndex_DDE(.......){
>
>
>
> //a bit of overhead for the DDE cluster
>
>
>
> insertIntoIndex_MySql(....)///this is the function that was written by
> MySql
>
>
>
> }
>
>
>
> as you see there would be no violation to the principle of violating
> MySqls
> principles.
HAHAHA, Peter... Cool... You winned me, because you have the advantage to
know the MySQL architecture better than me! :P
Greetings,
>
> Peter
>
>
> ----- Original Message -----
> From: <SGreen@stripped>
> To: "Fabricio Mota" <fabricio.mota@stripped>
> Cc: <internals@stripped>
> Sent: Wednesday, November 23, 2005 4:54 PM
> Subject: Re: Fwd: DDE - Distributed indexes
>
>
> Fabricio Mota <fabricio.mota@stripped> wrote on 11/22/2005 11:07:51 PM:
>
> > ---------- Forwarded message ----------
> > From: Fabricio Mota <fabricio.mota@stripped>
> > Date: 22/11/2005 23:06
> > Subject: Re: DDE - Distributed indexes
> > To: "Peter B. Volk" <peter.benjamin.volk@stripped>
> >
> > Peter,
> >
> > I've spend many neurons (hahahaha) thinking about it, too, and the only
> > concrete conclusion I've got is: it will give us a big headache.
> > The problem is several: degrading local performance, degrading GLOBAL
> > performance (a lot, I imagine), unavailability and (bad) inconsistences
> > during eventual whole fails, and other thing that has disturbed me a
> little,
> > that is: it may violate the "Principle of Reusing What MySQL Has Already
> > Made".
> >
> > So, I've rereel the thought to the start of all, and I said to myself:
> > "Let's think about Database concepts. How important is an index? Why
> does
> > the people create index on their tables?
> > And myself answered to me: "people do that to improve the performance
> for
> > only SELECT commands."
> > And I asked to myself again "When a global select being requested from a
> LDD
> > table, what would the system do?"
> > And myself answered to me again:
> > "The master server forwards the requested command to all servers; they
> will
> > process it as local command, and will reply their (partial) datasets to
> the
> > master server. The master server then will merge the results into a only
> > dataset, by means of a *union all *command."
> >
> > And I started to ask to myself, and to you, too: Does a *union all* use
> > indexes?
> >
> > Second point: I started to think with the reverse way, too (when we want
> a
> > solution, we search it through any ways :) ). "Independently of how
> costly
> > would it be, how interesting is to us keeping a global index?"
> > That is another question good to make us think, just because right now
> I'm
> > starting to think that the indexes benefits wouldn't work so good for
> DDS.
> > The problem is that, as each row has its independent reference, it could
> > take the system to retrieve row by row. That is not efficiently, is it?
> >
> > Maybe I am having difficulty to catch the point of need to use global
> > indexes.
> >
> > Well, in the other hand, I'm sure that there may to be new hypothesis
> that
> > my reasoning didn't reach. And about doing it, as I've seen, you are
> good.
> >
> > Waiting for response.
> >
> > FM
> >
> > 2005/11/22, Peter B. Volk <peter.benjamin.volk@stripped>:
> > >
> > > Hello Fabricio,
> > >
> > >
> > >
> > > I've been spending many showers on trying to find a solution how to
> > > distribute the indexes of LDD tables to the whole cluster to do query
> > > optimizations. Well what came out is that this is a hard one. My
> > > idea would
> > > be to create a protocol between the server to transport the commands
> on
> > > the
> > > local index to the global index. The server would have 2 physical per
> > > logical index on a LDD table. A local index and a global Index. An
> I/U/D
> > > command would first modify the local index and then distribute the
> index
> > > command (I/D) to all servers so they can put it into their own local
> > > index.
> > > A SELECT statement on an index would first look into the global index
> > > structure. If the global index structure indicates that the rows are
> > > distributed on several servers then the sever would query the other
> > > servers
> > > (best case: query only the servers that have o potential row). The
> queried
> > > servers would use their local indexes to look up the local data.
> > >
> > >
> > >
> > > How to replicate the commands? Well in general there are only 2
> actions on
> > >
> > > an index (if we assume normal b* trees and not R or R* trees or other
> > > special indexes): Insert and delete. These command should be packed
> into a
> > > few byte and then do a kind of a multicast among the servers. The
> Global
> > > Index should only have the server_id and the key in the index
> structure.
Peter, your Idea is interesting, and seems like a low cost message. So,
to do that with safety, we must assume a reliable multicast. Wow, our work
is growing up!
hehehe
> > So
> > a SELECT statement would always end up in a index lookup of the global
> > index->query all servers (inclusive the init server)-> collect
data->do
> > additional aggregation, grouping etc. forward to end-user.
I confess I haven't seen this GI (Global Indexes) application in the first
email. In fact, you answered my question before I do that, hahahaha
Really, it could be very useful. However... (see above)
> >
> >
> >
> > What do you think?
> >
> >
> >
> > Peter
> >
> >
>
>
> --
>
> Sem mais,
>
> Fabricio Mota
> Oda Mae Brown - Aprecie sem moderação.
> http://www.odamaebrown.com.br
>
> --
>
> Sem mais,
>
> Fabricio Mota
> Oda Mae Brown - Aprecie sem moderação.
> http://www.odamaebrown.com.br
*A global (cluster-wide) index could be useful with one additional piece of
information, the ID of the server on which the row exists. Maintenance of
a global index can occur similarly to replication (an INDEX update would
be replicated from a node server to the cluster server instead of or in
addition to a DATA update event as the cluster server may or may not have
the actual data on it).*
Don't you (Peter and Shawn) think it could make the process slow? That's
because despite the message exchange - such as low-cost reliable multicast -
to be possible, the user-thread blocking could remain until all GI to be
updated. Or do you think it would not be necessary?
*There are many queries that can be answered directly from an index so a
global index could allow response to certain queries directly from the
master (cluster) server. It would also allow you to get a confidence
figure describing how complete your results should be.*
I've got an idea: what do you (two) think about to mantain only GI instead
of any LI (Local Indexes) for LDD tables?
Such as: an user requesting for index involving columns {C1, C2 ... Cn} in a
LDD table, could be replaced with a physical index on {L, C1, C2 ... Cn},
considering that L is the local (or server) column. Opinions?
*What do I mean by confidence figure? If a query is written so that it uses
at least one indexed column, the global index will be able to identify
which servers have data matching a condition on that column. You can
respond with the # of rows found, # of servers containing data, and the #
of servers responding with results*.
That's good too, because we might save still more processing, taking local
processing on and on. Nice!
*If the last two numbers do not match,
you know you only have partial results (less than 100% confidence).*
And if the select have not required an ordering criteria (or the required
order can be satisfied within), and we assume an deliver-by-demand, we could
keep delivering the available data, until he request a record more than
that's available...
Does it make sense?
*For queries that use no indexed columns at all, then every server from the
cluster must check its data for matching conditions. Again if any server
fails to respond, you have less than complete confidence in your results.
*
That's inevitable.
*This also implies that if you want to optimize cluster response, not only
global indexes but a supervisory query parser would be useful. What
happens when data from tables on one server is to be JOINED with data from
tables on another server? In this case, some machine (a node server or the
cluster master) must perform the actual JOIN. The flow of data in this
case tells us that for JOINs that can use indexes, a global index would be
useful in reducing the quantity of cross traffic between servers.
*
Well, about the joining, I have already forecasted it in section
3.5.7.2 (despite
not to think no much about indexes). When a join involves 2 or more LDD
tables, a cascade subquerying should be propagated to the cluster. That
should be a kind of recursivity.
But as you and Peter proposed me, and GI could save a lot of job an traffic
in network. I'm starting to be convinced of that. :)
*That last train of thought brings up an interesting question: If I run a
query that needs to JOIN two or more tables within the same cluster like
this
SELECT cust.name
, sales.tx_id
, sum(items.qty) as units
, sum(items.qty * items.price) as grand_total
FROM salestickets sales
INNER JOIN transactions tx
on sales.tx_id = tx.id
INNER JOIN customers cust
on tx.customers_id = cust.id
INNER JOIN lineitems items
on items.salestickets_id = sales.id
WHERE cust.id=1234356
GROUP BY cust.name, sales.tx_id;
*
*Assuming that servers RED, BLUE, GOLD, BROWN, and BLACK are in the cluster
and that only RED, BLUE and GOLD are the servers that contain any of these
four tables, what kind of query process would we need to implement in
order to efficiently answer this problem?*
Well, we could use a coloured-index at all! hahahaha, sorry, I could not
help myself!! :)
* Should we modify this query or
its results to include some column to identify the server(s) generating
the data present in the result? I ask because if those 3 servers are
geographically separate, it would make sense that each one could have a
different customer with the ID of 12345. We could get 3 sets of rows back
(one from each server) plus possibly other sets of rows generated by the
combinatons of the different tables present in the different servers. This
query would return erroneous results if it tried to combine rows from
different servers (by joining the salestickets table from RED with the
customers table of BLUE, for instance) because the data on each server is
internally consistent only within that server. *
**
Well, I don't think so. That's because, in the moment I define a table as
LDD, then:
1. it is understood as a global table, and only RDD or LDD tables can be
referred by it through a foreign key (3.4.1.2). So, the system will allow
(eventual) cross-servers foreign keys. So, cross-servers inconsistences, in
principle, will never happen. (considering no faults).
*2. *If I do *not* define the LOCAL column as part of a unique/primary key,
then the system will *never* allow multiple values, neither they being
placed in different servers. (3.4.1.1).
So, don't you think that error could be avoided by system consistance
management?
**
*Other queries may NEED to
combine data from separate servers in order to respond correctly. I think
this means that various "hints" need to be available to help modify the
behavior of cluster-level queries. One hint could eliminate
cross-sever-joins. *
If I am not talking bullshit, That's a (typical) case for 3.5.7.2 (maybe
could you and Peter opine about it?)
*Another could be to modify the results so that
duplicate responses from multiple servers would be removed (like UNION vs.
UNION ALL but on a cluster scale)
*
I confess I did not understand it...
*I know I just gave you a lot to think about. Sorry for the headache ;-)
*
But the heaaches are important. So, if they don't exist, who will feed the
poor medicine laboratories owners? :)
*Shawn Green
Database Administrator
Unimin Corporation - Spruce Pine
*
--
Sem mais,
Fabricio Mota
Oda Mae Brown - Aprecie sem moderação.
http://www.odamaebrown.com.br