From: Alex Esterkin
Date: May 12 2009 6:03pm
Subject: Re: MySQL Reengineering Project
List-Archive: http://lists.mysql.com/internals/36643
Message-Id: <81f5410f0905121103q16d67863o500ccdadbe269bb4@mail.gmail.com>

Let me play devil's advocate and throw in my grumpy 2 cents.

First, let us clarify what exactly constitutes the public handler API.
It is much more than the set of functions defined in the sql/handler.h
header file. Presently, you have to include the entire transitive
closure of all public methods of all classes used as arguments. There
are many storage engines out there, especially the transactional and
the column-oriented ones, that use THD member variables and member
functions. Download and take a look at the Infobright Community Edition
source code to see what I mean.

Second, the entire MySQL code base is glued together in hundreds of
different ways through the use of the same classes and structures. In
software architecture terms, MySQL has an absolutely monolithic domain
model. To decouple different processing areas, you will have to create
independent domain models for each, so that you can consistently
implement separation of concerns. Presently, MySQL lacks separation of
concerns: the same parse tree is used at every stage of query
processing.

In Postgres, by contrast, a parse tree is represented by a Query
structure (a tree of C structures), the query optimizer generates and
considers a bunch of throwaway Path tree structures, then the query
planner generates a tree of Plan nodes (Plan tree). The query execution
is a state machine; Init...(_) functions generate various PlannerInfo
structures/nodes, which are ultimately responsible for runtime query
execution. In the Postgres world, only the Query tree and the Plan tree
need to be copyable and serializable. This makes source tree
organization simpler and cleaner.
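To make the separation-of-concerns argument above concrete, here is a minimal C++ sketch of per-stage domain models. Every name in it (ParsedQuery, CandidatePath, PlanNode, parse, enumerate_paths, choose_plan, execute) is hypothetical and taken from neither MySQL nor Postgres, and the cost figures are invented. It only illustrates the shape of the idea: each stage consumes one structure and produces a different one, so the throwaway candidates never reach the executor.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Stage 1 output: what the user asked for (hypothetical type).
struct ParsedQuery {
    std::string table;
    bool has_indexable_predicate;
};

// Stage 2 working data: throwaway candidates the optimizer compares.
struct CandidatePath {
    std::string access_method;
    double estimated_cost;  // invented numbers, for illustration only
};

// Stage 3 output: the single plan handed to the executor.
struct PlanNode {
    std::string table;
    std::string access_method;
};

ParsedQuery parse(const std::string& table, bool indexable) {
    return ParsedQuery{table, indexable};
}

// The optimizer sees only the parsed query and returns candidates.
std::vector<CandidatePath> enumerate_paths(const ParsedQuery& q) {
    std::vector<CandidatePath> paths;
    paths.push_back(CandidatePath{"seq_scan", 100.0});
    if (q.has_indexable_predicate) {
        paths.push_back(CandidatePath{"index_scan", 10.0});
    }
    return paths;
}

// The planner keeps the cheapest candidate; the rest are discarded.
PlanNode choose_plan(const ParsedQuery& q,
                     const std::vector<CandidatePath>& paths) {
    assert(!paths.empty());
    auto best = std::min_element(
        paths.begin(), paths.end(),
        [](const CandidatePath& a, const CandidatePath& b) {
            return a.estimated_cost < b.estimated_cost;
        });
    return PlanNode{q.table, best->access_method};
}

// The executor depends on PlanNode only, never on the parse tree.
std::string execute(const PlanNode& plan) {
    return plan.access_method + " on " + plan.table;
}
```

In this sketch only ParsedQuery and PlanNode would ever need to be copied or serialized; CandidatePath objects live and die inside the optimizer stage.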
Without implementing such separation of concerns, the MySQL
architecture will remain monolithic. [To be fair, Postgres code is
monolithic as well, but in a very different way.]

Third, for many storage engines, there will be an extra software
engineering cost in chasing your refactoring changes, especially if
some of the changes go against a storage engine's architectural
assumptions.

I wonder whether the MySQL refactoring news indicates a tug of war
between the US-based and EU-based Sun / MySQL teams. We already have
one MySQL refactoring project under way. It is called Drizzle. Is this
a "Drizzle Reloaded" project? By the way, Drizzle can be used as an
illustration of how easy it is to go too far: views and prepared
statements are no longer supported in Drizzle.

Regards,
Alex Esterkin

On Tue, May 12, 2009 at 12:39 PM, Jay Pipes wrote:
>
> Jay Pipes wrote:
>>
>> Mats Kindahl wrote:
>>>
>>> Alaric Snell-Pym wrote:
>>>>
>>>> Excellent news!
>>>>
>>>> One word of warning, though: make sure it's a series of small steps.
>>>> It's far too easy, with this sort of thing, ending up going off on
>>>> huge yak-shaving tangents. By all means take lots of small steps
>>>> towards a lofty distant goal, but make sure each step is useful in its
>>>> own right (even if just by allowing other steps to happen), or you can
>>>> get lost on a branch that will never merge ;-)
>>>
>>> Yes, we don't want to do the work in macro steps (at least not at this
>>> point), we want to proceed carefully.
>>>
>>>> I see that a few macroscopic tasks have appeared on the Forge already,
>>>> but I'd like to add something I think could be changed for the better,
>>>> on a grass-roots level throughout the codebase:
>>>>
>>>> I see a lot of methods that are called with arguments and return a
>>>> value, but most of their input and output is actually through member
>>>> fields of the object - not that the method is operating on its object
>>>> per se, but that the caller actually puts things into the member
>>>> fields, then calls the method, then inspects the results in member
>>>> fields. For example, in the storage engine API, update_row is called
>>>> with a buffer in unireg format, which it almost universally ignores
>>>> and instead uses the array of Field objects set up in the handler
>>>> object by the caller. And we spent some time in debugging our storage
>>>> engine - it would return rows fine when you did selects on the table,
>>>> but when you did certain types of join, it would fail to return any
>>>> rows, despite our logging clearly showing we'd returned rows to MySQL
>>>> - because it seems that sometimes MySQL not only looks at the return
>>>> value of rnd_next, but also checks the 'status' member of the table
>>>> object to see if the current row of the table is valid or not. So our
>>>> rnd_next had to assign success/failure to table->status as well as
>>>> returning success, and then everything worked OK. Doh.
>>>>
>>>> Making the inputs and outputs of every method/function explicit,
>>>> rather than sneaking stuff in and out via members, will make the
>>>> calling interfaces between things a lot easier to read, which will
>>>> reduce the chances of developers working on a module introducing bugs!
>>>> Plus, it'll simplify the classes a lot, and make them easier to read,
>>>> as they will end up with only members that really relate to the actual
>>>> domain object - eg, the table or whatever - rather than members that
>>>> are part of the calling protocol of particular operations on the
>>>> class. Less short-term mutable state in classes means they can be
>>>> shared between threads in more and more contexts, too, as function
>>>> arguments and return values live only on the thread-local stacks!
>>>
>>> Yes; the fact that the handler interface doesn't really honor the
>>> arguments has been a major bummer for me several times. This is
>>> actually because some engines that we support internally ignore the
>>> argument and use the stored records record[0] and record[1] instead,
>>> which means that every engine (with the exception of a few) started
>>> doing that. So now you have to both pass the argument and the record
>>> to make sure that all engines work.
>>>
>>> Just getting clear semantics on how this part of the handler interface
>>> works, and add assertion to weed out the bad usages, would simplify
>>> the code significantly and improve the speed for all.
>>
>> ++
>>
>>> However, do you know of any other interfaces that work this way? I am
>>> personally not aware of any other, but then I don't know every corner
>>> of the code like Serg does. :)
>>
>> "External" interfaces? See all the plugin "interfaces". There's no
>> enforcement of types really at all. Just passing void *'s around.
>>
>> As for the internal interfaces, I would suggest cleaning up the class
>> interfaces of THD, JOIN, and other major classes to enforce public
>> accessors and getters, protecting private member variables behind a
>> clean API. This would, eventually, make some of these classes
>> semi-usable in public interfaces.
>> Right now, the passing of the THD* everywhere, and THD having
>> basically a bunch of public member variables, means that there is no
>> enforcement of state changes through an interface. This leads to
>> serious problems where the "internal" state of a THD is actually
>> public and cannot be seen as reliable for the lifetime of a session's
>> requests.
>>
>> Another thing to think about in your refactoring efforts is detaching
>> the THD from its current inheritance from Statement, Query_arena and
>> ilink. Without doing this, and using encapsulation so that a THD can
>
> s/encapsulation/composition/
>
>> have multiple Statements, it will be very difficult to work on any
>> future parallelization efforts.
>>
>> Cheers,
>>
>> Jay
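The composition point made in the quoted text (a THD that has multiple Statements rather than inheriting from Statement) could look roughly like the following C++ sketch. This is an illustration only, not MySQL code: Session stands in for THD, Statement here is a simplified hypothetical class, and open_statement, statement and statement_count are invented names. Locking and memory arenas are ignored entirely.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for per-statement state.
class Statement {
public:
    explicit Statement(std::string text) : text_(std::move(text)) {}
    const std::string& text() const { return text_; }

private:
    std::string text_;  // private: readable only through the accessor
};

// Hypothetical stand-in for THD. It owns statements (composition)
// instead of being one (inheritance), so it can hold several at once.
class Session {
public:
    // Registers a new statement and returns its handle (an index).
    std::size_t open_statement(std::string text) {
        statements_.emplace_back(std::move(text));
        return statements_.size() - 1;
    }

    // Read-only access by handle; callers cannot mutate session state
    // except through the public interface.
    const Statement& statement(std::size_t id) const {
        assert(id < statements_.size());
        return statements_[id];
    }

    std::size_t statement_count() const { return statements_.size(); }

private:
    std::vector<Statement> statements_;
};
```

A caller would create a Session, call open_statement once per statement, and look each one up by the returned handle. Because the members are private, every state change has to go through the interface, which is the enforcement the quoted message asks for.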