Tuesday, March 1, 2011

Why is ORM considered good but "select *" considered bad?

Doesn't an ORM usually involve doing something like a select *?

If I have a table, MyThing, with column A, B, C, D, etc, then there typically would be an object, MyThing with properties A, B, C, D.

It would be evil if that object were incompletely instantiated by a select statement that looked like this, only fetching the A, B, not the C, D:

select A, B from MyThing /* don't get C and D, because we don't need them */

but it would also be evil to always do this:

select A, B, C, D /* get all the columns so that we can completely instantiate the MyThing object */

Does ORM make an assumption that database access is so fast now you don't have to worry about it and so you can always fetch all the columns?

Or, do you have different MyThing objects, one for each combo of columns that might happen to be in a select statement?

EDIT: Before you answer the question, please read Nicholas Piasecki's and Bill Karwin's answers. I guess I asked my question poorly because many misunderstood it, but Nicholas understood it 100%. Like him, I'm interested in other answers.


EDIT #2: Links that relate to this question:

http://stackoverflow.com/questions/18655/why-do-we-need-entity-objects

http://blogs.tedneward.com/2006/06/26/The+Vietnam+Of+Computer+Science.aspx, especially the section "The Partial-Object Problem and the Load-Time Paradox"

http://groups.google.com/group/comp.object/browse_thread/thread/853fca22ded31c00/99f41d57f195f48b?

http://www.martinfowler.com/bliki/AnemicDomainModel.html

http://database-programmer.blogspot.com/2008/06/why-i-do-not-use-orm.html

From stackoverflow
  • SELECT * is not bad. Did you ask whoever considered it to be bad "why?".

    Corey Trager : I'll ask myself: Is select * bad? Yes, if you only need columns A and B, don't pay the cost of fetching C and D columns also. Fetch only what you need.
    Ali A : Yes, but it's not universally "bad". It is "bad" in some cases! What if you later want C and D, should you not get them and cache them? No case is cut and dry, and so we certainly cannot say "Select * is bad". We can say "Select * is sometimes bad"
    le dorfier : The worst consequence of fetching C and D is that whoever comes after you has no idea what you intended to do with them. Put another way, here's also a YAGNI principle that this violates.
    Gordon Bell : select * is bad in stored procedures because it prevents the server from being able to pre-compile an execution plan.
  • Even ORMs need to avoid SELECT * to be effective, by using lazy loading etc.

    And yes, SELECT * is generally a bad idea if you aren't consuming all the data.

    So, do you have different kinds of MyThing objects, one for each column combo? – Corey Trager (Nov 15 at 0:37)

    No, I have read-only digest objects (which only contain important information) for things like lookups and massive collections and convert these to fully hydrated objects on demand. – Cade Roux (Nov 15 at 1:22)

    Corey Trager : So, do you have different kinds of MyThing objects, one for each column combo?
    Cade Roux : No, I have read-only digest objects (which only contain important information) for things like lookups and massive collections and convert these to fully hydrated objects on demand.
    BobbyShaftoe : +1 for you response comment Cade, this bit about digest objects should be in the answer. :)
  • I am not sure why you would want a partially hydrated object. Given a class of Customer with properties of Name, Address, Id. I would want them all to create a fully populated Customer object.

    The list hanging off of Customers called Orders can be lazily loaded when accessed though most ORMs. And NHibernate anyway allows you to do projections into other objects. So if you had say a simply customer list where you displayed the ID and Name, you can create an object of type CustomerListDisplay and project your HQL query into that object set and only obtain the columns you need from the database.

    Friends don't let friends premature optimize. Fully hydrate your object, lazy load it's associations. And then profile your application looking for problems and optimize the problem areas.

    Corey Trager : A partially hydrated object in those cases where you are only interested in a subset of the columns, and you don't want to pay the performance penalty of fetching all the columns.
    NotMyself : so then look into projections...
    gbjbaanb : often the 'performance hit' of fetching all columns is insignificant. The performance hit of repeated round trips to the DB is very significant. Lazy loading will only help in some scenarios, its not a magic bullet.
    Mitch Wheat : @gbjbaanb : it depends how deep the relationship heirarchy goes. It might not just be columns in a single table, but a sequence of joins to other tables.
    Dave Sherohman : Yes, but when you start talking "relationship hierarchy", you've moved past "fully hydrate your object" and into "lazy load its associations", unless you're talking specifically about deep table-per-subclass inheritance schemes, but most ORMs I've seen seem to discourage use of inheritance anyhow...
  • ORMs in general do not rely on SELECT *, but rely on better methods to find columns like defined data map files (Hibernate, variants of Hibernate, and Apache iBATIS do this). Something a bit more automatic could be set up by querying the database schema to get a list of columns and their data types for a table. How the data gets populated is specific to the particular ORM you are using, and it should be well-documented there.

    It is never a good idea to select data that you do not use at all, as it can create a needless code dependency that can be obnoxious to maintain later. For dealing with data internal to the class, things are a bit more complicated.

    A short rule would be to always fetch all the data that the class stores by default. In most cases, a small amount of overhead won't make a huge difference, so your main goal is to reduce maintenance overhead. Later, when you performance profiling of the code, and have reason to believe that it may benefit from adjusting the behavior, that is the time to do it.

    If I saw an ORM make SELECT * statements, either visibly or under its covers, then I would look elsewhere to fulfill my database integration needs.

    Florin : I don't find that your answer addresses the question. Even so ORM may employ methods to have knowledge of the data requested from the database, how else are you to conclusively populate your ORM object model without a SELECT statement whether it is a prepared statement or something else?
    Adam K. Johnson : How the data is loaded into the class is an implementation-specific detail of the ORM. In iBATIS for example, it has XML files that contain a SQL statements or procedures, column data types, and then column-to-class mappings to specify which column matches which specific variable in the class.
    Corey Trager : @Florin, I understand you. @doktaru, I don't understand you. Forget the "select *" part of the question. Instead, explain how MyThing behaves when sometimes its fully hydrated and sometimes partially.
    le dorfier : For me the essential difference is whether the class drives the interface, or the table drives the interface. If the class wants columns the table doesn't yet have, or then the table probably needs adjustment (or a join or ..) The class shouldn't ever deal with columns it doesn't require or know.
  • There are two separate issues to consider.

    To begin, it is quite common when using an ORM for the table and the object to have quite different "shapes", this is one reason why many ORM tools support quite complex mappings.

    A good example is when a table is partially denormalised, with columns containing redundant information (often, this is done to improve query or reporting performance). When this occurs, it is more efficient for the ORM to request just the columns it requires, than to have all the extra columns brought back and ignored.

    The question of why "Select *" is evil is separate, and the answer falls into two halves.

    When executing "select *" the database server has no obligation to return the columns in any particular order, and in fact could reasonably return the columns in a different order every time, though almost no databases do this.

    Problem is, when a typical developer observes that the columns returned seem to be in a consistent order, the assumption is made that the columns will always be in that order, and then you have code making unwarranted assumptions, just waiting to fail. Worse, that failure may not be fatal, but may simply involve, say, using Year of Birth in place of Account Balance.

    The other issue with "Select *" revolves around table ownership - in many large companies, the DBA controls the schema, and makes changes as required by major systems. If your tool is executing "select *" then you only get the current columns - if the DBA has removed a redundant column that you need, you get no error, and your code may blunder ahead causing all sorts of damage. By explicitly requesting the fields you require, you ensure that your system will break rather than process the wrong information.

    Corey Trager : Bevan - What does that mean concretely, that an "object" has different "shapes"? Like, if the language is Java/C#/C++, is there just one "class MyThing" but sometimes its GetSomeStringAttr() returns null?
    le dorfier : To me it sounds like an Object only finally has a sharply defined definition with respect to a certain context; and different contexts in the application may need different shapes (sets of properties and methods) for different contexts.
    Bevan : I was referring to the way that good design can end up at quite different results for objects and tables. A trivial example: an object might end up with a property Date, where the table has Date, day, month, year, dayOfWeek, dayOfCalendarYear & dayOfFinancialYear for reporting purposes.
  • In my limited experience, things are as you describe--it's a messy situation and the usual cop-out "it depends" answer applies.

    A good example would be the online store that I work for. It has a Brand object, and on the main page of the Web site, all of the brands that the store sells are listed on the left side. To display this menu of brands, all the site needs is the integer BrandId and the string BrandName. But the Brand object contains a whole boatload of other properties, most notably a Description property that can contain a substantially large amount of text about the Brand. No two ways about it, loading all of that extra information about the brand just to spit out its name in an unordered list is (1) measurably and significantly slow, usually because of the large text fields and (2) pretty inefficient when it comes to memory usage, building up large strings and not even looking at them before throwing them away.

    One option provided by many ORMs is to lazy load a property. So we could have a Brand object returned to us, but that time-consuming and memory-wasting Description field is not until we try to invoke its get accessor. At that point, the proxy object will intercept our call and suck down the description from the database just in time. This is sometimes good enough but has burned me enough times that I personally don't recommend it:

    • It's easy to forget that the property is lazy-loaded, introducing a SELECT N+1 problem just by writing a foreach loop. Who knows what happens when LINQ gets involved.
    • What if the just-in-time database call fails because the transport got flummoxed or the network went out? I can almost guarantee that any code that is doing something as innocuous as string desc = brand.Description was not expecting that simple call to toss a DataAccessException. Now you've just crashed in a nasty and unexpected way. (Yes, I've watched my app go down hard because of just that. Learned the hard way!)

    So what I've ended up doing is that in scenarios that require performance or are prone to database deadlocks, I create a separate interface that the Web site or any other program can call to get access to specific chunks of data that have had their query plans carefully examined. The architecture ends up looking kind of like this (forgive the ASCII art):

    Web Site:         Controller Classes
                         |
                         |---------------------------------+
                         |                                 |
    App Server:       IDocumentService               IOrderService, IInventoryService, etc
                      (Arrays, DataSets)             (Regular OO objects, like Brand)
                         |                                 |
                         |                                 |
                         |                                 |
    Data Layer:       (Raw ADO.NET returning arrays, ("Full cream" ORM like NHibernate)
                       DataSets, simple classes)
    

    I used to think that this was cheating, subverting the OO object model. But in a practical sense, as long as you do this shortcut for displaying data, I think it's all right. The updates/inserts and what have you still go through the fully-hydrated, ORM-filled domain model, and that's something that happens far less frequently (in most of my cases) than displaying particular subsets of the data. ORMs like NHibernate will let you do projections, but by that point I just don't see the point of the ORM. This will probably be a stored procedure anyway, writing the ADO.NET takes two seconds.

    This is just my two cents. I look forward to reading some of the other responses.

    petr k. : one of the high quality answers.. +1
    chadmyers : yeah, no doubt. +1 just for the dang ASCII art architecture diagram, lol. Very nice indeed.
    Mitch Wheat : +1. Very good points raised.
    Mitch Wheat : I'd vote this answer up twice if I could!
    le dorfier : I think collections of codes, short-descriptions, etc. are perfectly defensible as legitimate classes (subclassing a collection) on their own merits, and there's a conceptually legitimate reason for calling them out.
    Booji Boy : good and solid practical advice. *LOVE* the ASCII art.
  • Select * is bad for a couple of reasons:

    1. Some columns might be costly to retrieve, so if you don't need them you're wasting time

    2. If you write code with select * and you're expecting x number of columns in a given order, and the table changes, you could potentially break the code. If you explicitly ask for the columns you need then the code is less fragile.

    Corey Trager : Sorry, the question was poorly worded. Read Nicholas's answer who understood my question exactly in the way I intended it. How would you respond to him?
  • People use ORM's for greater development productivity, not for runtime performance optimization. It depends on the project whether it's more important to maximize development efficiency or runtime efficiency.

    In practice, one could use the ORM for greatest productivity, and then profile the application to identify bottlenecks once you're finished. Replace ORM code with custom SQL queries only where you get the greatest bang for the buck.

    SELECT * isn't bad if you typically need all the columns in a table. We can't generalize that the wildcard is always good or always bad.

    edit: Re: doofledorfer's comment... Personally, I always name the columns in a query explicitly; I never use the wildcard in production code (though I use it when doing ad hoc queries). The original question is about ORMs -- in fact it's not uncommon that ORM frameworks issue a SELECT * uniformly, to populate all the fields in the corresponding object model.

    Executing a SELECT * query may not necessarily indicate that you need all those columns, and it doesn't necessarily mean that you are neglectful about your code. It could be that the ORM framework is generating SQL queries to make sure all the fields are available in case you need them.

    Corey Trager : You completely understood the intent of my question. Thanks.
    le dorfier : I think using "SELECT *" indicates your decision that you do, in fact, need all the columns in the table. If that's not true, then it's misleading. It also suggests the possibility that you didn't think about it very hard (unless you leave a comment asserting.)
  • Linq to Sql, or any implementation of IQueryable, uses a syntax which ultimately puts you in control of the selected data. The definition of a query is also the definition of its result set.

    This neatly avoids the select * issue by removing data shape responsibilities from the ORM.

    For example, to select all columns:

    from c in data.Customers
    select c
    

    To select a subset:

    from c in data.Customers
    select new
    {
      c.FirstName,
      c.LastName,
      c.Email
    }
    

    To select a combination:

    from c in data.Customers
    join o in data.Orders on c.CustomerId equals o.CustomerId
    select new
    {
      Name = c.FirstName + " " + c.LastName,
      Email = c.Email,
      Date = o.DateSubmitted
    }
    
    Corey Trager : But.... what the heck is the object you have created? Are they both "Customer" domain objects, even though they have different properties?
    Bryan Watts : The new object is an anonymous type, generated by the compiler with the defined schema. It's as if you had defined it yourself, except you don't have to manage the class artifact. See http://msdn.microsoft.com/en-us/library/bb397696.aspx for more information.
    Bryan Watts : The object represents a projection, which by definition is not a domain object and cannot be confused with one. If you need a full customer, select that, otherwise select something else. "Sparsely populated objects" seem like an anti-pattern which compromises the domain for technical concerns.
    le dorfier : And you end up with technical concerns trumping domain concerns, and the application is being designed from technology back to the users rather than properly in the other direction. But maybe the programmers are a little happier.
    Bryan Watts : Those are the growing pains of paradigm shifts. It takes strong developers to flex the IQueryable abstraction. The point is that query definitions can be part of the domain, not just a "please implement this within your application" hook point.
  • SELECT * is a strong indication you don't have design control over the scope of your application and its modules. One of the major difficulties in cleaning up someone else's work is when there is stuff in there that is for no purpose, but no indication what is needed and used, and what isn't.

    Every piece of data and code in your application should be there for a purpose, and the purpose should be specified, or easily detected.

    We all know, and despise, programmers who don't worry too much about why things work, they just like to try stuff until the expected things happen and close it up for the next guy. SELECT * is a really good way to do that.

  • The case you describe is a great example of how ORM is not a panacea. Databases offer flexible, needs-based access to their data primarily through SQL. As a developer, I can easily and simply get all the data (SELECT *) or some of the data (SELECT COL1, COL2) as needed. My mechanism for doing this will be easily understood by any other developer taking over the project.

    In order to get the same flexibility from ORM, a lot more work has to be done (either by you or the ORM developers) just to get you back to the place under the hood where you're either getting all or some of the columns from the database as needed (see the excellent answers above to get a sense of some of the problems). And all this extra stuff is just more stuff that can fail, making an ORM system intrinsically less reliable than straight SQL calls.

    This is not to say that you shouldn't use ORM (my standard disclaimer is that all design choices have costs and benefits, and the choice of one or the other just depends) - knock yourself out if it works for you. I will say that I truly don't understand the popularity of ORM, given the amount of extra un-fun work it seems to create for its users. I'll stick with using SELECT * when (wait for it) I need to get every column from a table.

  • If you feel the need to encapsulate everything within an object, but need something with a small subset of what is contained within a table - define your own class. Write straight sql (within or without the ORM - most allow straight sql to circumvent limitations) and populate your object with the results.

    However, I'd just use the ORMs representation of a table in most situations unless profiling told me not to.

0 comments:

Post a Comment