Wednesday, April 13, 2011

What is the best data structure for this in-memory lookup table?

I need to store a lookup table as an instance member in one of my classes. The table will be initialized when the object is constructed. Each "row" will have 3 "columns":

StringKey (e.g., "car")
EnumKey (e.g., LookupKeys.Car)
Value (e.g., "This is a car.")

I want to pick the data structure that will yield the best performance for doing lookups either by the StringKey or the EnumKey.

It's kind of awkward having 2 keys for the same dictionary value. I've never encountered this before, so I'm wondering what the norm is for this type of thing.

I could make a Key/Value/Value structure instead of Key/Key/Value, but I'm wondering what type of performance impact that would have.

Am I thinking about this all wrong?

From stackoverflow
  • Well ... "Wrong" is a harsh way of putting it. I think that because the most common dictionary is "single key to value", and a lot of effort goes into providing efficient data structures for that (maps), it's often best to just use two of those, sharing the memory for the values if at all possible.

  • You have two hashmaps.

    • One from StringKey to value.

    • One from EnumKey to value.

    You do not have to duplicate all the Value instances; those objects can be shared between the two hashmaps.

    If it's a LOT of items, you might want to use two treemaps instead of two hashmaps. But the essential principle ("Share the Values") applies to both structures. One set of Values with two maps.

    vg1890 : OK - so in my example, the "value instances" are just strings. I'll make 2 dictionaries (one with StringKey, one with EnumKey) whose values contain the same string reference variable. Does that sound right?
    S.Lott : Precisely. In Python that's all there is to it. In Java, there's String.intern(), which ensures that all intern()'d strings are reduced to a common string pool, eliminating some possible redundancy.
    vg1890 : I'm using C#...do you know if .NET will make a copy of the string when I add it to each dictionary?
    S.Lott : Add a *reference* to each dictionary. String exists once. Many references to one string.
    vg1890 : Got it. The dictionaries get their own *references* to the string, but they all point to the same string object. string s = "joe"; dct1.Add("key", s); -- even though the parameter being passed in is called s, dct1.Add gets its own reference to "joe". Thanks!
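
    A minimal C# sketch of the shared-values approach (type and member names are illustrative, not from the thread):

    using System.Collections.Generic;

    public enum LookupKeys { Car, Boat }

    public class LookupTable
    {
        // Both dictionaries hold references to the same string instances,
        // so the values themselves are never duplicated.
        private readonly Dictionary<string, string> byString = new Dictionary<string, string>();
        private readonly Dictionary<LookupKeys, string> byEnum = new Dictionary<LookupKeys, string>();

        // Usage: table.Add("car", LookupKeys.Car, "This is a car.");
        public void Add(string stringKey, LookupKeys enumKey, string value)
        {
            byString.Add(stringKey, value);
            byEnum.Add(enumKey, value);
        }

        public string Get(string key) { return byString[key]; }
        public string Get(LookupKeys key) { return byEnum[key]; }
    }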
  • Is it really necessary to key into the same structure with both types of key? You probably don't need to rebuild a complex data structure yourself. You could do some sort of encapsulation for the lookup table so that you really have two lookup tables if memory is not an issue. You could use this encapsulating structure to simulate being able to pull out the value from the "same" structure with either type of key.

    OR

    If there is some way to map between the enum value and the string key, you could go that route and keep only one type of lookup table.
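
    If such a mapping exists, here is a rough C# sketch (the lower-casing convention is an assumption, not something from the question):

    public enum LookupKeys { Car, Boat }

    public static class KeyMapper
    {
        // Assumed convention: the string key is just the lower-cased enum name,
        // e.g. LookupKeys.Car -> "car", so a single string-keyed dictionary suffices.
        public static string ToStringKey(LookupKeys key)
        {
            return key.ToString().ToLowerInvariant();
        }
    }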

  • LINQ's ILookup<TKey, TElement> interface may help. Assuming your Dictionary is something like:

    Dictionary<carKey, carValue> cars;
    

    You could use:

    ILookup<carValue, carKey> lookup = cars.ToLookup(x => x.Value, x => x.Key);
    

    (...actually I think I might have slightly misread the question, but an ILookup might still fit the bill; the key/value set might need to be the key and the enum.)
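
    For example, with string keys and string values (a rough sketch; requires System.Linq and System.Collections.Generic):

    var cars = new Dictionary<string, string> { { "car", "This is a car." } };
    ILookup<string, string> byValue = cars.ToLookup(x => x.Value, x => x.Key);
    // byValue["This is a car."] enumerates every key mapped to that value.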

  • If every value is guaranteed to be accessible by both types of keys, another idea would be to convert one type of key to another. For example:

    public Value getValue(String key)
    {
        return dictionary.get(key); // normal way
    }

    public Value getValue(Enum enumKey)
    {
        String realKey = toKey(enumKey);
        return getValue(realKey); // use the String key
    }
    

    You could have your Enum implement a toKey() method that returns its String key, or maybe have another dictionary that maps Enum values to their String counterparts.

C# drag & drop in ListView

I've got a ListBox from which I'm dragging items into a ListView. The ListView has groups, so when an item is dropped at the position of a ListViewGroup it has to be added under that group.

This is the code which handles the drop.

    private void lstvPositions_DragDrop(object sender, DragEventArgs e)
    {
        var group = lstvPositions.GetItemAt(e.X, e.Y);
        var item = e.Data.GetData(DataFormats.Text).ToString();
        lstvPositions.Items.Add(new ListViewItem { Group = group.Group, Text = item });
    }

I didn't find a method that returns the group directly, so I used GetItemAt, through which I also have access to the ListViewGroup.

But GetItemAt always returns null.

Am I doing something wrong? Is there a better way to accomplish this?

From stackoverflow
  • First, I assume you're using a ListView, not a ListBox, as ListBox does not contain a GetItemAt member.

    To solve your problem, convert the point to local coordinates:

    private void lstvPositions_DragDrop(object sender, DragEventArgs e)
    {
       var localPoint = lstvPositions.PointToClient(new Point(e.X, e.Y));
       var group = lstvPositions.GetItemAt(localPoint.X, localPoint.Y);
       var item = e.Data.GetData(DataFormats.Text).ToString();
       lstvPositions.Items.Add(new ListViewItem {Group = group.Group, Text = item});
    }
    
    Gerbrand : Okay that worked.
  • Did that solution work for you? Because if you drop the item on blank space in the ListView, lstvPositions.GetItemAt(..) will return nothing.

    Gerbrand : This works well in my app. Items can't be dropped on blank space in my app, so I don't have this problem.
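
    A defensive variant of the accepted handler (a sketch; it simply omits the group when the drop lands on blank space):

    private void lstvPositions_DragDrop(object sender, DragEventArgs e)
    {
        // DragEventArgs coordinates are screen coordinates; convert them first.
        var localPoint = lstvPositions.PointToClient(new Point(e.X, e.Y));
        var hitItem = lstvPositions.GetItemAt(localPoint.X, localPoint.Y);
        var item = e.Data.GetData(DataFormats.Text).ToString();

        // GetItemAt returns null on blank space, so only assign a group
        // when the drop actually hit an item.
        lstvPositions.Items.Add(hitItem != null
            ? new ListViewItem { Group = hitItem.Group, Text = item }
            : new ListViewItem { Text = item });
    }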

What are my binding options for a self hosted cross domain WCF service with remote thick clients?

I'm trying to build a WCF self-hosted service (eventually in a Windows service) that will receive binary and text-based messages from remote thick clients that have no accounts on my host machine. I'm trying to figure out both my binding options and security options, and in reading the patterns and practices guides, my head has completely spun around at least once.

The clients would be authenticated against a custom SQL based method, so I'd like to be able to pass that info in the initial login request and then set an authorization token of some kind. (This part of the problem is probably outside the scope of the question, but I included it in case it might make a difference.)

Any thoughts at all would be very helpful.

Ryan

From stackoverflow
  • The choice of binding and security option depends on the usage of your WCF service. Is it just for your rich client, or are you planning to expose it to the world as an API? If it's just for your rich app, does it run on a LAN or over the untrusted, unreliable Internet?

    With WCF you can configure the service to expose multiple endpoints with different bindings, for example both SOAP and REST. In general, I'd start with something stateless and lightweight like basicHttpBinding and webHttpBinding, passing user and password on every request. Once you have that up and running you can optimize (cache authentication, provide a binary endpoint, etc.), but only if it actually helps.
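
    As a minimal self-hosting sketch with two endpoints on one host (the contract, addresses, and type names below are illustrative, not from the question):

    using System;
    using System.ServiceModel;

    [ServiceContract]
    public interface IMessageService
    {
        [OperationContract]
        string Send(string message);
    }

    public class MessageService : IMessageService
    {
        public string Send(string message) { return "received: " + message; }
    }

    class Program
    {
        static void Main()
        {
            var host = new ServiceHost(typeof(MessageService),
                new Uri("http://localhost:8000/MessageService"));

            // One service, two bindings: SOAP over HTTP plus a binary TCP endpoint.
            host.AddServiceEndpoint(typeof(IMessageService), new BasicHttpBinding(), "soap");
            host.AddServiceEndpoint(typeof(IMessageService), new NetTcpBinding(),
                "net.tcp://localhost:8001/MessageService");

            host.Open();
            Console.WriteLine("Service running. Press Enter to stop.");
            Console.ReadLine();
            host.Close();
        }
    }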

  • There's no need to have just one binding. Having said that, if it's self-hosted you're "on your own" here. I've never looked at what's involved.

Django Admin SelectMultiple Widget

In my model I have a many-to-many relationship between two tables, Users and Groups. In the admin interface I see the SelectMultiple widget for Groups. Actually, I am using filter_horizontal, so I see the available groups and the selected groups in two separate lists. Is it possible to filter the list of available groups that I see, based on some criteria? I do not want to show all the groups in the groups table. Thank you.

From stackoverflow
  • In your form class, you can specify a custom queryset for the group-field, which then determines which Group-instances are available in the form:

    class UserForm(forms.ModelForm):
        # override the default groups field
        groups = forms.ModelMultipleChoiceField(
            queryset=Groups.objects.filter(YOUR_CONDITIONS),
            widget=forms.SelectMultiple,
        )
    
        class Meta:
            model = User
    
  • How do I change the group widget that I see in the admin?

Searching for a string 'somewhere' in a database

Here's my problem: I'm looking at someone's Postgresql based database application for the first time and trying to find what is causing certain warnings/errors in the system's logfile. I don't know anything about the database schema. I don't know anything about the source code. But I need to track down the problem.

I can easily search the string contents of text-file-based code like PHP and Perl using the UNIX command 'grep'; even for compiled binaries I can use the UNIX commands 'find' and 'strings'.

My problem is that some of the text produced in the logfile comes from the database itself. Checking the error logfile for the database yields nothing useful as there are no problems with the queries used by the application.

What I would like to do is exhaustively search all of the columns and all of the tables of the database for a string. Is this possible, and how?

Thanks in advance for any pointers. The environment used is Postgresql 8.2, but it would be useful to know how to do this in other flavors of relational databases as well.

From stackoverflow
  • I am not familiar with Postgresql, but I would think that, like SQL Server, it has meta-data tables/views that describe the schema of the database (for SQL Server 2005+, I'd be referring you to sys.tables and sys.columns). The idea would be to generate a series of ad-hoc queries based on the table schema, each one finding matches in a particular table/field combination and pumping matches into a "log" table.

  • I've used variants of this in the past.

  • It may not be optimal, but since I already know how to grep a text file, I would just convert the database to a text file and grep that. Converting the database to a text file in this case would mean dumping the data using pg_dump.

    The quickest/easiest/most efficient way isn't always elegant...

    Joshua Berry : Quick and dirty: I like it. This solution works well for those of us more familiar with regex than SQL queries and functions. thanks!

What happens to an existing DB2 view if the table is dropped?

If we have created a view on an existing DB2 table and then drop the table, what will happen to the view?

From stackoverflow
  • The view becomes invalid/inoperative. Attempts to select from it will fail.

    To try it:

    create table TEST_TABLE (
        TEST_COL INTEGER
    );
    
    INSERT INTO TEST_TABLE VALUES(1);
    
    SELECT * FROM TEST_TABLE;
    
    create view TEST_VIEW AS
    SELECT * FROM TEST_TABLE;
    
    SELECT * FROM TEST_VIEW;
    
    DROP TABLE TEST_TABLE;
    
    SELECT * FROM TEST_VIEW;
    

    The last statement gives the error:

    [IBM][CLI Driver][DB2/NT] SQL0575N  View or materialized query table
    "TEST_VIEW" cannot be used because it has been marked inoperative.
    SQLSTATE=51024
    
  • When a view is invalidated, as shown in the above example, DB2 will allow you to recreate that view without dropping it first. This makes it possible to re-run your view DDL files (or simply dump the TEXT column of SYSCAT.VIEWS and execute that).

Does the order of columns in a WHERE clause matter?

Hi,

Does the order of the columns in a WHERE clause affect performance?

e.g.

Say I put a column that has a higher potential for uniqueness first, or vice versa?

From stackoverflow
  • With a decent query optimiser: it shouldn't.

    But in practice, I suspect it might.

    You can only tell for your cases by measuring. And the measurements will likely change as the distribution of data changes in the database.

  • If you are ANDing conditions, the first one that is not true makes the whole expression false, so order can affect performance.

    Harry : I don't see why this was downvoted +1
    SpoonMeiser : Upvoting because you think a downvote was unfair, rather than because you believe the answer merits an upvote, isn't great practice. I believe that a query optimiser will re-arrange the rules, rather than just employ lazy evaluation.
    mwigdahl : SpoonMeiser is correct; the optimizer (at least for SQL Server) uses more complex logic than simple C++-style evaluation.
  • It all depends on the DBMS, query optimizer and rules, but generally it does affect performance.

    If a where clause is ordered such that the first condition reduces the resultset significantly, the remaining conditions will only need to be evaluated for a smaller set. Following that logic, you can optimize a query based on condition order in a where clause.

  • For Transact-SQL there is a defined order of evaluation for the condition of the WHERE clause. The optimizer may be able to detect when the order can be rearranged while staying semantically equivalent, but I suspect the transformations it applies are relatively simple, and it will be possible to construct a condition that performs suboptimally based on the ordering and grouping of the operators. Simplifying your search condition should improve the ability of the optimizer to handle it.

    Ex:

     WHERE (a OR b) AND (b OR c)
    

    could be simplified to

     WHERE b OR (a AND c)
    

    Clearly in this case if the query can be constructed to find whether b holds first, it may be able to skip the evaluation of a and c and thus would run faster. Whether the optimizer can do this simple transformation I can't answer (it may be able to), but the point is that it probably can't do arbitrarily complex transformations, and you may be able to affect query performance by rearranging your condition.

    EDIT: With regard to your question about ordering based on uniqueness, I would assume that any hints you can provide to the optimizer based on your knowledge (actual, not assumed) of the data couldn't hurt. Pretend that it won't do any optimization and construct your query as if you needed to define it from most to least selective, but don't obsess about it until performance is actually a problem.

    Quoting from the reference above:

    The order of precedence for the logical operators is NOT (highest), followed by AND, followed by OR. The order of evaluation at the same precedence level is from left to right. Parentheses can be used to override this order in a search condition. For more information about how the logical operators operate on truth values, see AND, OR, and NOT.

    Lieven : +1. The best explanation of "it depends" I've seen.
  • Unless I have missed something here, this question is not about the query optimizer's interpretation of the precedence of logical operators, but rather about how the ordering of columns in the WHERE clause, based on selectivity, affects the query plan that is produced.

    The query optimizer will determine the most efficient way to select the data you have requested, irrespective of the ordering of the SARGS defined in the WHERE clause.

    The ordering is therefore determined by factors such as the selectivity of the column in question (which SQL Server knows based on statistics) and whether or not indexes can be used.

  • For SQL Server 2000 / 2005 / 2008, the query optimizer usually will give you identical results no matter how you arrange the columns in the WHERE clause. Having said this, over the years of writing thousands of T-SQL commands I have found a few corner cases where the order altered the performance. Here are some characteristics of the queries that appeared to be subject to this problem:

    1. If you have a large number of tables in your query (10 or more).

    2. If you have several EXISTS, IN, NOT EXISTS, or NOT IN statements in your WHERE clause.

    3. If you are using nested CTEs (common table expressions) or a large number of CTEs.

    4. If you have a large number of sub-queries in your FROM clause.

    Here are some tips on trying to evaluate the best way to resolve the performance issue quickly:

    1. If the problem is related to 1 or 2, then try reordering the WHERE clause and compare the sub-tree cost of the queries in the estimated query plans.

    2. If the problem is related to 3 or 4, then try moving the sub-queries and CTEs out of the query and have them load temporary tables. The query plan optimizer is FAR more efficient at estimating query plans if you reduce the number of complex joins and sub-queries in the body of the T-SQL statement.

    3. If you are using temporary tables, then make certain you have specified primary keys for the temporary tables. This means avoid using SELECT INTO FROM to generate the table. Instead, explicitly create the table and specify a primary KEY before using an INSERT INTO SELECT statement.

    4. If you are using temporary tables and MANY processes on the server use temporary tables as well, then you may want to make a more permanent staging table that is truncated and reloaded during the query process. You are more likely to encounter disk contention issues if you are using the TempDB to store your working / staging tables.

    5. Move the statements in the WHERE clause that will filter the most data to the beginning of the WHERE clause. Please note that if this is your solution to the problem, then you will probably have poor performance again down the line when the query plan gets confused again about generating and picking the best execution plan. You are BEST off finding a way to reduce the complexity of the query so that the order of the WHERE clause is no longer relevant.

    I hope you find this information helpful. Good luck!