Sunday, March 6, 2011

How to design a database schema to support tagging with categories?

I am trying to so something like Database Design for Tagging, except each of my tags are grouped into categories.

For example, let's say I have a database about vehicles. Let's say we actually don't know very much about vehicles, so we can't specify the columns all vehicles will have. Therefore we shall "tag" vehicles with information.

1. manufacture: Mercedes
   model: SLK32 AMG
   convertible: hardtop

2. manufacture: Ford
   model: GT90
   production phase: prototype

3. manufacture: Mazda
   model: MX-5
   convertible: softtop

Now as you can see all cars are tagged with their manufacture and model, but the other categories don't all match. Note that a car can only have one of each category. IE. A car can only have one manufacturer.

I want to design a database to support a search for all Mercedes, or to be able to list all manufactures.

My current design is something like this:

vehicles
  int vid
  String vin

vehicleTags
  int vid
  int tid

tags
  int tid
  String tag
  int cid

categories
  int cid
  String category

I have all the right primary and foreign keys in place, except I can't handle the case where each car can only have one manufacturer. Or can I?

Can I add a foreign key constraint to the composite primary key in vehicleTags? IE. Could I add a constraint such that the composite primary key (vid, tid) can only be added to vehicleTags only if there isn't already a row in vehicleTags such that for the same vid, there isn't already a tid in the with the same cid?

My guess is no. I think the solution to this problem is add a cid column to vehicleTags, and make the new composite primary key (vid, cid). It would look like:

vehicleTags
  int vid
  int cid
  int tid

This would prevent a car from having two manufacturers, but now I have duplicated the information that tid is in cid.

What should my schema be?

Tom noticed this problem in my database schema in my previous question, How do you do many to many table outer joins?

EDIT
I know that in the example manufacture should really be a column in the vehicle table, but let's say you can't do that. The example is just an example.

From stackoverflow
  • I think your solution is to simply add a manufacturer column to your vehicles table. It's an attribute that you know all the vehicles will have (i.e. cars don't spontaneously appear by themselves) and by making it a column in your vehicle table you solve the issue of having one and only one manufacturer for each vehicle. This approach would apply to any attributes that you know will be shared by all vehicles. You can then implement the tagging system for the other attributes that aren't universal.

    So taking from your example the vehicle table would be something like:

    vehicle
      vid
      vin
      make
      model
    
    Pyrolistical : You are focusing on the example and not my problem. My example does not completely reflect my problem.
  • One way would be to slightly rethink your schema, normalising tag keys away from values:

    vehicles
      int vid
      string vin
    
    tags
      int tid
      int cid
      string key
    
    categories
      int cid
      string category
    
    vehicleTags
      int vid
      int tid
      string value
    

    Now all you need is a unique constraint on vehicleTags(vid, tid).

    Alternatively, there are ways to create constraints beyond simple foreign keys: depending on your database, can you write a custom constraint or an insert/update trigger to enforce vehicle-tag uniqueness?

    Pyrolistical : what is "string key" under tags?
    Pyrolistical : this wouldn't work. If I wanted to change "softtop" to "soft top" I would have to update all the values in the vehicleTags, and vehicleTags is huge!
    Dan Vinton : @Pyrolistical, when you say that a tag is like this: manufacture: Mercedes 'manufacture' is the tag key and 'Mercedes' is the tag value. From the sample data you supplied it looks like the key parts are duplicated and so can be normalised into their own table.
  • This is yet another variation on the Entity-Attribute-Value design.

    A more recognizable EAV table looks like the following:

    CREATE TABLE vehicleEAV (
      vid        INTEGER,
      attr_name  VARCHAR(20),
      attr_value VARCHAR(100),
      PRIMARY KEY (vid, attr_name),
      FOREIGN KEY (vid) REFERENCES vehicles (vid)
    );
    

    Some people force attr_name to reference a lookup table of predefined attribute names, to limit the chaos.

    What you've done is simply spread an EAV table over three tables, but without improving the order of your metadata:

    CREATE TABLE vehicleTag (
      vid         INTEGER,
      cid         INTEGER,
      tid         INTEGER,
      PRIMARY KEY (vid, cid),
      FOREIGN KEY (vid) REFERENCES vehicles(vid),
      FOREIGN KEY (cid) REFERENCES categories(cid),
      FOREIGN KEY (tid) REFERENCES tags(tid)
    );
    
    CREATE TABLE categories (
      cid        INTEGER PRIMARY KEY,
      category   VARCHAR(20) -- "attr_name"
    );
    
    CREATE TABLE tags (
      tid        INTEGER PRIMARY KEY,
      tag        VARCHAR(100) -- "attr_value"
    );
    

    If you're going to use the EAV design, you only need the vehicleTags and categories tables.

    CREATE TABLE vehicleTag (
      vid         INTEGER,
      cid         INTEGER,     -- reference to "attr_name" lookup table
      tag         VARCHAR(100, -- "attr_value"
      PRIMARY KEY (vid, cid),
      FOREIGN KEY (vid) REFERENCES vehicles(vid),
      FOREIGN KEY (cid) REFERENCES categories(cid)
    );
    

    But keep in mind that you're mixing data with metadata. You lose the ability to apply certain constraints to your data model.

    • How can you make one of the categories mandatory (a conventional column uses a NOT NULL constraint)?
    • How can you use SQL data types to validate some of your tag values? You can't, because you're using a long string for every tag value. Is this string long enough for every tag you'll need in the future? You can't tell.
    • How can you constrain some of your tags to a set of permitted values (a conventional table uses a foreign key to a lookup table)? This is your "softtop" vs. "soft top" example. But you can't make a constraint on the tag column because that constraint would apply to all other tag values for other categories. You'd effectively restrict engine size and paint color to "soft top" as well.

    SQL databases don't work well with this model. It's extremely difficult to get right, and querying it becomes very complex. If you do continue to use SQL, you will be better off modeling the tables conventionally, with one column per attribute. If you have need to have "subtypes" then define a subordinate table per subtype (Class-Table Inheritance), or else use Single-Table Inheritance. If you have an unlimited variation in the attributes per entity, then use Serialized LOB.

    Another technology that is designed for these kinds of fluid, non-relational data models is a Semantic Database, storing data in RDF and queried with SPARQL. One free solution is Sesame.

    Pyrolistical : This is why SO is so awesome. I always felt I was reinventing the wheel, but before SO, I had no good way to ask if I was... Although you didn't exactly answer my question, it is definitely in the right direction. Thanks.
  • I needed to solve this exact problem (same general domain and everything — auto parts). I found that the best solution to the problem was to use Lucene/Xapian/Ferret/Sphinx or whichever full-text indexer you prefer. Much better performance than what SQL can offer.

    These days, I almost never end up building a database-backed web app that doesn't involve a full-text indexer. This problem and the general issue of search just come up way too often to omit indexers from your toolbox.

0 comments:

Post a Comment