From: Warren Young Date: May 5 2009 6:31am Subject: Re: Why Does SSQLS populate_* Function use Inefficient operator[]? List-Archive: http://lists.mysql.com/plusplus/8546 Message-Id: <49FFDD24.30902@etr-usa.com> MIME-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Chris Morgan wrote: > > Why does the ssqls.pl script use the inefficient operator [] (const char* > field) instead of operator [] (int i)? All the usual dynamic vs. static typing reasons, really. SSQLS did behave that way in v2 and earlier, but I changed it in v3.0.0 because that ties a program too closely to the DB schema, making it brittle in the face of schema changes. Example: Add a column to a table Foo, then run a program that queries "SELECT * FROM Foo" and places the result into a vector of Foo (an SSQLS) with Query::storein(). Your program asplode when it tries to assign a value to foo[n+1], because Foo contains only n fields. You have to change the definition of Foo and recompile the program to make it stop dumping core. There are other fun variants on this problem, if you think about it. One option you had to cope with such problems in v2 was to change the SSQLS definition in your program(s) and rebuild it/them every time you change the DB schema. That's fine in smaller systems where you can stop all the clients, change the schema, replace all the client binaries, and bring everything back up. It's totally impractical in larger systems, where you cannot bring everything down at the same time. The larger a system gets, the more likely you'll have to do incremental roll-outs. v3 fixes this because operator= for SSQLS now skips a field assignment when there is no field with a given name. You can get a tiny bit of this advantage with integer indexes if you test the index against the number of fields in the SSQLS each time, but that eats into your execution time advantage, and still doesn't give most the advantages of the lookup-by-name method. Another v2 coping strategy was to change all "SELECT *" and similar constructs to "SELECT exact,set,of,fields,supported,by,the,ssqls", so things like storein() don't care if new columns are added. But, it still doesn't protect against column deletions or renamings. You still have to change and rebuild the program to track such changes. You're not much better off than before, when you had to change the SSQLS definition and rebuild. And, it means more SQL in your C++, and more verbose SQL at that, something MySQL++ is supposed to fix. v3 allows you to keep loose queries like "SELECT *" because it won't try to populate fields that don't exist in the SSQLS, and fields that still exist in the SSQLS but don't exist in the DB will remain untouched. A side benefit of the new behavior is that you no longer have to declare separate SSQLS types for each and every sub-query you have on a given table. You can now say something like "SELECT foo,bar FROM qux", then store it in a qux instance, which has more than just those two fields. Inefficient, yes, but programmer time is often more expensive than computer time. And as with everything in C++, it's just an option for you to choose. You can still declare SSQLSes for each query subset, if you like. Perhaps you're now thinking about chastising me about how C++ is a statically typed language, and how foolish it is that MySQL++ should make you trade the advantages of static typing for the advantages of dynamic typing. You are using C++ for a reason; if you wanted dynamic typing, you know where to get it, no doubt. Before you ask me to don my Nomex underwear, please reflect that static typing comes from a single-program viewpoint. This viewpoint says that when a C++ compiler stops complaining about type issues, you have no more type issues. You get no such assurance when what you're really building is a distributed system of programs, as happens when you start using a client-server database. Now you have two separate type systems which can change independently, but which somehow have to be made compatible, and remain compatible as the system evolves. This is exactly where dynamically typed systems come into their strength. Since we had to invent a new type system in MySQL++, why not make it dynamic, since that's the right thing in a distributed system? Once the data gets inside your C++ code, away from MySQL++, you can keep all the benefits of static types, where the downsides are not as severe. > The Reference Manual ( > http://tangentsoft.org/mysql++/doc/html/refman/classmysqlpp_1_1Row.html#a23) > specifically warns not to do that. No, what it says is that operator[](int) is faster, not that the const char* variant should not be used. If there's a fast option and a slow option, the slow option should be removed unless there's at least one case where it can pull its own weight. I gave two cases above, and there are more besides. Maybe you still disagree. Okay, you have two options, as I see it. First, you can wait for MySQL++ v3.1, and then provide a patch to give SSQLS v2 an option to make it use index based assignments. Second, you can trawl back through the patch history of ssqls.pl known as custom.pl in v2 and earlier) to find where I changed the way assignments work and revert just that one change. The result should still work, although at least one of the examples (and thus dtest) will then fail, because that example shows off this feature. I imagine you can live with that. Now allow me to quiz you. In all this benchmarking of yours, is this assignment overhead you're worried about even 1% of the time it takes just to retrieve rows from the DB? I would think disk and IPC overheads would totally swamp the assignment time. Benchmarks trump intuition, so maybe you can show me I'm wrong? And if I do happen to be wrong about this, how much will it cost to let Moore's Law fix it instead of the programmers? If you take the standard 18 month doubling time, the law is: d = 547.5 * log2(p) giving time in days to achieve a higher relative power per buck, p. Say you have the worst-case overhead I expect, 1%. Compute power, p, has to be 1.01 times the current level to offset that overhead, so d is a little under 8. That means if the overhead is less than 1%, as I expect, you have but to wait a week or less and the problem solves itself. How does programmer time compare to hardware cost in your organization? Unless you're rolling this system out by the millions -- in which case the incremental rollout issues I brought up above should be biting you, changing the calculus -- it's probably a lot cheaper to just wait for the faster hardware.