Friday, March 4, 2011

Ticking function grapher

Hello everyone,

I am trying to figure out the following problem. I am building yet another math function grapher. The function is drawn on its predefined x, y range, and that's all good.

Now I am working on the background and the ticking of X, Y axes (if any axes are shown).

I worked out the following. I have a fixed width of 250 px, and the tick gap should be between 12.5 and 50 px.

The ticks should indicate either a unit or half-unit range; by that I mean the following.

x range (-5, 5): one tick = 1

x range (-1, 1): one tick = 0.5 or 0.1, depending on the gap that each of these options would generate.

x range (0.1, 0.3): 0.05

Given an x range, how would you get the number of ticks at either a full or half-unit spacing?

Or maybe there are other ways to approach this type of problem.

From stackoverflow
  • Using deltaX

    If deltax is between 2 and 10, use a half increment; if deltax is between 10 and 20, use a unit increment. If it is smaller than 2, we multiply by 10 and test again; if it is larger than 20, we divide by 10 and test again. Then we get the position of the first unit or half increment on the width using xmin.

    I still need to test this solution.
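
    A rough Python sketch of that heuristic (my own reading of it - the 2/10/20 boundaries and the meaning of "half" and "unit" increments are assumptions, not tested code from the thread):

    import math

    def tick_step(deltax):
        # Rescale the range by powers of 10 until it falls between 2 and 20,
        # then pick a half or unit increment at that scale.
        scale = 1.0
        while deltax < 2:
            deltax *= 10
            scale /= 10
        while deltax > 20:
            deltax /= 10
            scale *= 10
        return (0.5 if deltax < 10 else 1.0) * scale

    def first_tick(xmin, step):
        # position of the first half/unit increment at or after xmin
        return math.ceil(xmin / step) * step

    # x range (-5, 5): deltax = 10 gives a unit step of 1 and a first tick at -5
    print(tick_step(10), first_tick(-5, tick_step(10)))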

  • One way to do this would be to "normalise" the difference between the minimum and maximum and do a case distinction on that value. In python:

    delta = maximum - minimum
    factor = 10**math.ceil(math.log(delta,10))  # smallest power of 10 greater than delta
    normalised_delta = delta / factor           # 0.1 <= normalised_delta < 1
    if normalised_delta/5 >= 0.1:
      step_size = 0.1
    elif normalised_delta/5 >= 0.05:
      step_size = 0.05
    elif normalised_delta/20 <= 0.01:
      step_size = 0.01
    step_size = step_size * factor
    

    The above code assumes you want the biggest possible gap. For the smallest you would use the following if:

    if normalised_delta/20 == 0.005:
      step_size = 0.005
    elif normalised_delta/20 <= 0.01:
      step_size = 0.01
    elif normalised_delta/5 >= 0.05:
      step_size = 0.05
    

    Besides the possibility that there is more than one suitable value, there is also the somewhat worrisome possibility that there is none. Take for example the range [0,24], where a gap of 12.5 px would give a step size of 1.2 and a gap of 50 px would give a step size of 4.8. There is no "unit" or "half unit" in between. The problem is that the difference between a gap of 12.5 px and one of 50 px is a factor of 4, while the difference between 0.01 and 0.05 is a factor of 5. So you will have to widen the range of allowable gaps a bit and adjust the code accordingly.

    Clarification of some of the magic numbers: divisions by 20 and 5 correspond to the number of segments with the minimal and maximal gap size, respectively (ie. 250/12.5 and 250/50). As the normalised delta is in the range [0.1,1), you get that dividing it by 20 and 5 gives you [0.005,0.05) and [0.02,0.2), respectively. These ranges result in the possible (normalised) step sizes of 0.005 and 0.01 for the first range and 0.05 and 0.1 for the second.

    coulix : Thanks! The factor = 10**math.ceil(math.log(delta,10)) did the trick!
  • You might want to take a look at Jgraph, which solves a complementary problem: it is a data grapher rather than a function grapher. But there are a lot of things in common such as dealing with major and minor tick marks, axis labels, and so on and so forth. I find the input language a little verbose for my taste, but Jgraph produces really nice technical graphs. There are a lot of examples on the web site and probably some good ideas you could steal.

    And you know what they say: talent imitates, but genius steals :-)

  • This seems to do what I was expecting.

    import math

    def main():
        getTickGap(-1, 1.5)

    def next_multiple(x, y):
        return math.ceil(x/y)*y

    def getTickGap(xmin, xmax):
        xdelta = xmax - xmin
        width = 250
        # smallest power of 10 greater than delta
        factor = 10**math.ceil(math.log(xdelta, 10))
        # 0.1 <= normalised_delta < 1
        normalised_delta = xdelta / factor
        print("normalised_delta", normalised_delta)

        # we want the largest gap
        if normalised_delta/4 >= 0.1:
            step_size = 0.1
        elif normalised_delta/4 >= 0.05:
            step_size = 0.05
        elif normalised_delta/20 <= 0.01:
            step_size = 0.01
        step_size = step_size * factor

    ##    # for the smallest gap instead:
    ##    if normalised_delta/20 == 0.005:
    ##        step_size = 0.005
    ##    elif normalised_delta/20 <= 0.01:
    ##        step_size = 0.01
    ##    elif normalised_delta/4 >= 0.05:
    ##        step_size = 0.05
    ##    step_size = step_size * factor

        print("step_size", step_size)
        totalsteps = xdelta/step_size
        print("Total steps", totalsteps)
        print("Range [", xmin, ",", xmax, "]")

        firstInc = next_multiple(xmin, step_size)
        count = (250/xdelta)*(firstInc - xmin)
        print("firstInc ", firstInc, 'tick at ', count)
        print("start at ", firstInc - xmin, (width/totalsteps)*(firstInc - xmin))
        inc = firstInc

        while inc < xmax:
            inc += step_size
            count += (width/totalsteps)
            print(" inc", inc, "tick at ", count)

    if __name__ == "__main__":
        main()

  • On range -1, 0

    I get

    normalised_delta 1.0
    step_size 0.1
    Total steps 10.0
    Range [ -1 , 0 ]
    firstInc  -1.0 tick at  0.0
    start at  0.0 0.0
     inc -0.9 tick at  25.0
     inc -0.8 tick at  50.0
     inc -0.7 tick at  75.0
     inc -0.6 tick at  100.0
     inc -0.5 tick at  125.0
     inc -0.4 tick at  150.0
     inc -0.3 tick at  175.0
     inc -0.2 tick at  200.0
     inc -0.1 tick at  225.0
     inc -1.38777878078e-16 tick at  250.0
     inc 0.1 tick at  275.0
    

    How come the second line from the bottom gets this number?

    mweerden : This is due to the inaccuracies of floating-point numbers and operations on computers. Specifically, 0.1 does not have a precise representation and with + you keep adding the error. If you use -1.0+9*0.1 the error is much smaller. (See http://en.wikipedia.org/wiki/Floating_point#Accuracy_problems)
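
    A small Python snippet (not from the original thread) showing the difference between accumulating the step and computing each tick directly:

    acc = -1.0
    for _ in range(10):
        acc += 0.1
    print(acc)            # roughly -1.39e-16, as in the output above
    print(-1.0 + 10*0.1)  # 0.0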

Metamodelling tools..

What tools are available for metamodelling?

Especially for developing diagram editors; at the moment I am trying out Eclipse GMF.

I am wondering what other options are out there. Is any comparison available?

From stackoverflow
  • Although generally a UML tool, I would look at StarUML. It supports additional modules beyond what is already built in. If it doesn't have what you need built in or as a module, I suppose you could make your own, but I don't know how difficult that is.

  • Dia has an API for this - I was able to fairly trivially frig their UML editor into a basic ER modelling tool by changing the arrow styles. With a DB reverse-engineering tool I found on SourceForge (it took the schema and spat out Dia files), you could use this to document databases. While what I did was fairly trivial, the API was quite straightforward and it didn't take me that long to work out how to make the change.

    If you're of a mind to try out Smalltalk, there used to be a Smalltalk meta-case framework called DOME which does this sort of thing. If you download VisualWorks, DOME is one of the contributed packages.

  • Your question is simply too broad to get a single answer, due to its many aspects.

    First, meta-modelling is not a well-defined term, but rather a very fuzzy thing, including modelling models of models and reaching out to terms like MDA.

    Second, there are numerous options for developing diagram editors - going the Eclipse way is surely a nice option.

    To get you at least started in the Eclipse department:

    • have a look at MOF, the architecture for "meta-modelling" from the OMG (the guys that maintain UML)
    • from there, approach EMOF, a subset which is supported by the Eclipse Modelling Framework in the incarnation of Ecore.
    • building something on top of GMF might indeed be a good idea, because that's the route existing diagram editors for the Eclipse platform take (e.g. Omondo's EclipseUML)
    • there are a lot of existing tools in the Eclipse environment that can utilize Ecore - I simply hope that GMF builds on top of Ecore itself.
  • GMF is a nice example. At the core of this sits EMF/Ecore, as computerkram says. Ecore is also used as the base of Eclipse's UML2. The showcase use case and proof of concept for GMF is certainly UML2 Tools.

  • Meta-modeling is mostly done in Smalltalk.

    You might want to take a look at MOOSE (http://moose.unibe.ch). There are a lot of tools being developed for program understanding. Most are Smalltalk based. There is also some Java and C++ work.

    Two of the most impressive tools are CodeCity and Mondrian. CodeCity can visualize code development over time, Mondrian provides scriptable visualization technology.

    And of course there is the classic HotDraw, which is also available in Java.

    For web development there is also Magritte, providing meta-descriptions for Seaside.

  • I would strongly recommend you look into DSM (Domain-Specific Modeling) as a general topic; meta-modeling is directly related. There are Eclipse-based tools like GMF that currently require Java coding, but integrate nicely with other Eclipse tools and UML. However, there are two other classes of tool out there.

    1. MetaCase, which I will call a pure DSM tool, as it focuses on allowing a developer/modeler to create a usable graphical model without nearly as much coding. Additionally, it can be easily deployed for others to use. GMF and Microsoft's beta software factory/DSM tool fall into this category.

    2. Pure meta-modeling tools which are not intended for DSM tooling, code generation, and the like. I do not follow these tools as closely, as I am interested in applications that generate tooling for SMEs, domain experts, and others to use and contribute value to an active project - not modeling for modeling's sake, or just documentation and theory.

    If you want to learn more about number 1, the tooling applications for DSMs/Meta-modeling, then check out my post "DSMForum.org great resources, worth a look." or just navigate directly to the DSMForum.org

Processing Linux SIGnals using Gambas

I would like to send an (as yet undetermined) SIGnal from a bash script to a Gambas program when a specific file has been changed.

How can I get my Gambas program to process this SIGnal?

From stackoverflow

IDebugControl::WaitForEvent works once then returns E_HANDLE

I'm trying to make a small tool that makes use of the Debugger Engine API, but I'm having very limited success.

I can get my IDebugClient and IDebugControl instances, and from there I am able to attach into an already running user process. I then enter a main loop where I call WaitForEvent, OutputStackTrace, SetExecutionStatus(DEBUG_STATUS_GO), and repeat. In essence this will be a very crude sampling based profiler.

Good so far..

My loop runs for one full iteration, I can see a stack trace being displayed and then the target process going back into a running state.

The problem I have is that on my 2nd iteration the call to WaitForEvent returns E_HANDLE ("The handle is invalid"). I cannot see in the documentation why this error should be returned. Does anyone know why this might be happening?

From stackoverflow
  • The problem turned out to be that I was compiling, linking, and running against an old version of the SDK. Now that I've upgraded my SDK to the latest version (which I presume is the version that the online docs refer to) I get behaviour that is at least consistent with the docs.

    I still have problems, but no longer this problem.

How can I list the missing dates from an array of non-continuous dates in Java?

I have a table of data sorted by date, from which a user can select a set of data by supplying a start and end date. The data itself is non-continuous, in that I don't have data for weekends and public holidays.

I would like to be able to list all the days that I don't have data for in the extracted dataset. Is there an easy way, in Java, to go:

  1. Here is an ordered array of dates.
  2. This is the selected start date. (The first date in the array is not always the start date)
  3. This is the selected end date. (The last date in the array is not always the end date)
  4. Return a list of dates which have no data.
From stackoverflow
  • You should be able to create a filtered iterator that provides this. Perhaps have the method for the iterator accept the start and stop date of your sub-collection. As for the actual implementation of the iterator, I can't think of anything much more elegant than a brute-force run at the whole collection once the start element has been found.

  • You could create a temp list of all the dates and remove entries from it as needed.

    (Not actual Java. Sorry, my memory of it is horrible.)

    dates = [...]; // list you have now;
    
    // build list
    unused = [];
    for (Date i = startdate; i < enddate; i += day) {
        unused.push(i);
    }
    
    // remove used dates
    for (int j = 0; j < dates.length; j += 1) {
        if (unused.indexOf((Date) dates[j]) > -1) { // time = 00:00:00
            unused.remove(unused.indexOf((Date) dates[j]));
        }
    }
    
  • You can either create a list of all possible dates between start and end date and then remove dates which appear in the list of given data (works best when most dates are missing), or you can start with an empty list of dates and add ones that don't appear in the given data.

    Either way, you basically iterate over the range of dates between the start date and end date, keeping track of where you are in the list of given dates. You could think of it as a 'merge-like' operation where you step through two lists in parallel, processing the records that appear in one list but not in the other. In pseudo-code, the empty list version might be:

    # given   - array of given dates
    # N       - number of dates in given array
    # missing - array of dates missing
    
    i = 0;    # Index into given date array
    j = 0;    # Index into missing data array
    for (current_date = start_date; current_date <= end_date; current_date++)
    {
        while (i < N && given[i] < current_date)
            i++
        if (i >= N)
            break
        if (given[i] != current_date)
            missing[j++] = current_date
    }
    while (current_date <= end_date)
    {
        missing[j++] = current_date
        current_date++
    }
    

    I'm assuming that the date type is quantized in units of a day; that is, date + 1 (or date++) is the day after date.

  • While the other answers already given look rather simple and enjoyable and hold some good ideas (I especially agree with the iterator suggestion by Nerdfest), I thought I'd give this a shot anyway and code a solution, just to show how I'd do it as a first iteration; I'm sure there's room for improvement in what's below.

    I also may have taken your requirements a bit too literally, but you know how to adjust the code to your liking. Oh, and sorry for the horrible naming of objects. Also, since this sample uses Calendar, remember that Calendar.roll() may not update the entire Calendar object in some cases, so that's a potential bug right there.

    protected List<Calendar> getDatesWithNoData(Calendar start, Calendar end,
            Calendar[] existingDates) throws ParseException {

        List<Calendar> missingData = new ArrayList<Calendar>();

        for (Calendar c = start; c.compareTo(end) <= 0; c.roll(Calendar.DAY_OF_MONTH, true)) {

            if (!isInDataSet(c, existingDates)) {
                Calendar c2 = Calendar.getInstance();
                c2.setTimeInMillis(c.getTimeInMillis());

                missingData.add(c2);
            }
        }
        return missingData;
    }

    protected boolean isInDataSet(Calendar toSearch, Calendar[] dataSet) {
        for (Calendar l : dataSet) {
            if (toSearch.equals(l)) return true;
        }
        return false;
    }
    
  • Start with this: what's a date? Is it GMT or local?

    If it's GMT, each day is the java.util.Date.getTime() value divided by 86400000. You can quickly run through your array, and add the resulting Long values to a TreeSet (which is sorted). Then iterate the TreeSet to find gaps.

    If a date is local time, you'll have to add/subtract an appropriate offset before dividing.
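
    The same idea sketched in Python rather than Java (UTC timestamps in milliseconds are assumed, mirroring java.util.Date.getTime(); a plain set is enough here, so a sorted TreeSet is not strictly required):

    def missing_days(timestamps_ms, start_ms, end_ms):
        MS_PER_DAY = 86400000
        have = {t // MS_PER_DAY for t in timestamps_ms}   # epoch-day numbers with data
        first, last = start_ms // MS_PER_DAY, end_ms // MS_PER_DAY
        # every epoch day in the selected range that has no data
        return [day for day in range(first, last + 1) if day not in have]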

mod_python.publisher always gives content type 'text/plain'

I've just set up mod_python with Apache and I'm trying to get a simple script to work, but what happens is that it publishes all my HTML as plain text when I load the page. I figured this is a problem with mod_python.publisher, the handler I set it to. I searched through the source of it and found the line where it differentiates between 'text/plain' and 'text/html': it searches the last hundred characters of the file it's outputting for ' in my script, so I put it in, and then it still didn't work. I even tried commenting out some of the code so that publisher would set everything as 'text/html', but it still did the same thing when I refreshed the page. Maybe I've set up something wrong.

Here's my configuration in httpd.conf:

<Directory "C:/Program Files/Apache Software Foundation/Apache2.2/htdocs">
SetHandler mod_python
PythonHandler mod_python.publisher
PythonDebug On
</Directory>

From stackoverflow
  • Your configuration looks okay: I've got a working mod_python.publisher script with essentially the same settings.

    A few other thoughts:

    • When you tried editing the publisher source code, did you restart your web server? It only loads Python libraries once, when the server is first started.

    • Publisher's autodetection looks for a closing HTML tag: </html>. Is that what you added? (I can't see it in your question, but possibly it just got stripped out when you posted it.)

    • If nothing else works, you can always set the content type explicitly. It's more code, but it's guaranteed to work consistently. Set the content_type field on your request to 'text/html'.

    For example, if your script looks like this right now:

    def index(req, an_arg='default'):
        return some_html
    

    it would become:

    def index(req, an_arg='default'):
        req.content_type = 'text/html'
        return some_html
    

Need to get specific indexes out of a PHP array

I made an array in PHP which holds a bunch of Unix timestamps.

I'm trying to make a function that will return an array containing the indexes of the 3 largest numbers in that array.

For instance, if the largest numbers are located at indexes 3,5 and 8

And if the largest is at index 5, the second largest at index 8 and the smallest of the three at index 3, I want an array that holds the values (5, 8, 3) in that order.

And frankly, I don't have a clue how to pull this off. Does anybody know how to do this?

From stackoverflow
  • Simon posted the simple and probably good-enough performing method.

    The other option, only if you have a really large array, is to scan through the array and keep track of the indexes of the three highest values you see. This is O(n), but (especially since it's in interpreted PHP code, not a compiled built-in function) probably slower for all but the largest of arrays.
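
    A quick illustration of that single pass in Python rather than PHP (heapq keeps only the top three as it scans and returns their indexes ordered largest first):

    import heapq

    def three_largest_indexes(values):
        return heapq.nlargest(3, range(len(values)), key=values.__getitem__)

    print(three_largest_indexes([7, 1, 9, 42, 3, 99, 2, 5, 64]))  # [5, 8, 3]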

  • You could use asort to sort the array while maintaining the indexes, then use array_slice with its 4th parameter (again, to maintain the indexes) to grab the top x number of elements you are after, and finally use array_keys.

    There may well be a quicker way, but it's just to show there are plenty of PHP array functions to help you achieve the effect you're looking for.

    Vordreller : Thanks, this solved my problem with one little exception. The function that gave me the 3 largest values as the first 3 numbers in the new array needs to be arsort, instead of asort
    Simon : Ah, glad it helped. Alternatively you could have used array_slice to slice the other end of the array - but it's the same difference really.
  • In pseudo-code:

    function select(list[1..n], k)
         for i from 1 to k
             maxIndex = i
             maxValue = list[i]
             for j from i+1 to n
                 if list[j] > maxValue
                     maxIndex = j
                     maxValue = list[j]
             swap list[i] and list[maxIndex]
         return list[k]
    
    newarray[] = select(array, 1);
    newarray[] = select(array, 2);
    newarray[] = select(array, 3);
    
  • In PHP code:

    function threeLargest($array){
        arsort($array, SORT_NUMERIC);   // sort by value, descending, keeping key association
        $keys = array_keys($array);
        $return[0] = $keys[0];
        $return[1] = $keys[1];
        $return[2] = $keys[2];
        return $return;
    }
    
    Vordreller : This did not work. For some reason, none of the sorting functions for arrays seem to be working in my debugger...

Determining the TCP port number to which client got bound.

Hi,

I create a TCP socket without bothering about the port number to bind to [socket.sin_port = 0]. However, later on, if I want to print the port number of the client, how do I do that? The client C application (on Linux) creates many clients which get connected to a server. To debug issues I capture the traffic in Ethereal. I thought of printing the port number in the logs when an issue arises so that filtering in Ethereal becomes easy.

Any help would be appreciated.

-Prabhu

From stackoverflow
  • Use the getsockname() call to get the socket address and port after a successful connection.

    Edit: correct method name. Sometimes I can't copy a simple word from one window to another!

  • I believe that Darron meant getsockname(). This is what you want if you need to determine the port number on the client side (the side calling connect()) programmatically. On the server side (the side calling bind()), you would use getpeername() to get the same information from the connected socket.

    However, if you are debugging and can't change the code, then you end up using things like the netstat or sockstat utility depending on the operating system. I'm not sure what utilities are available under Linux (or even your particular package) but I would start with man sockstat. If it's installed, you can run it from either side and see which endpoints have been assigned to which processes. Combine this with grep and you can usually figure out which address to filter on in Ethereal. Good luck!
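
    The same call is exposed in most languages; a quick Python check of the idea (the C equivalent is getsockname(2) on the connected socket):

    import socket

    # After connect(), getsockname() reports the local address and the
    # ephemeral port the kernel picked for the client side.
    s = socket.create_connection(("example.com", 80))
    print("client socket bound to", s.getsockname())
    s.close()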

  • lsof -p <process id> | grep TCP
    

List.Add seems to be duplicating entries. What's wrong?

I have a class like this:

public class myClass
{
  public List<myOtherClass> anewList = new List<myOtherClass>();

  public void addToList(myOtherClass tmp)
  {
    anewList.Add(tmp);
  }

}

So I call "addToList" a hundred times, each adding a unique item to the list. I've tested my items to show that before I run the "addToList" method, they are unique. I even put a line in to test "tmp" to make sure it was what I was expecting.

However, when I do this (let's say the myClass object is called tmpClass):

int i = tmpClass.anewList.Count();
for (int j = 0; j<i; j++)
{
   //write out each member of the list based on index j...
}

I get the same exact item, and it's the last one that was written into my list. It's as if when I add, I'm overwriting the entire list with the last item I've added.

Help? This makes no sense. I've also tried List.Insert, where I'm always inserting at the end or at index 0. Still no dice. Yes, I'm doubly sure my indexing is correct, and when I do my test I'm indexing through each of the elements.

:)

UPDATE: Okay, I tried this and still had the same problem:

foreach(myOtherClass tmpC in tmpClass.anewList)
{    
    Console.WriteLine(tmpC.theStringInMyClass.ToString());
}

and still for each of the 100 items, I got the same string output... I'm sure I'm doing something completely stupid, but I don't know what yet. I'm still 100% sure that the right string is getting passed in to begin with.

-Adeena

From stackoverflow
  • Try iterating through the list using foreach rather than by index. I suspect that the problem is in the code that you have omitted from your example, not the list itself.

    foreach (MyOtherClass item in tmpClass.anewList)
    {
         Console.WriteLine( item );  // or whatever you use to write it
    }
    

    EDIT

    Have you examined the list structure in the debugger to ensure that your unique items are actually added? Also, you might want to use .Count (the property) instead of .Count() (the extension method). The extension method may actually iterate over the list to count the methods, while the property is just looking up the value of a private variable that holds the count.

    @James may be onto something here. If you are just changing the properties of the item you've inserted and reinserting it, instead of creating a new object each time, this would result in the behavior you are seeing.

  • Okay, I tried this and still had the same problem:

    foreach(myOtherClass tmpC in tmpClass.anewList)
    {
        Console.WriteLine(tmpC.theStringInMyClass.ToString());
    }
    

    and still for each of the 100 items, I got the same string output... I'm sure I'm doing something completely stupid, but I don't know what yet. I'm still 100% sure that the right string is getting passed in to begin with.

    -Adeena

  • Also note that when you use the exact form:

     for (int j = 0; j < tmpClass.anewList.Count(); j++)
    

    The C# compiler performs a special optimization on the loop. If you vary from that syntax (e.g. by pulling the Count property out of the loop into a separate variable, as you did in your example), the compiler skips that optimization.

    It won't affect what is displayed, but it will take longer.

  • In this case, it would likely be helpful to see how you are validating each item to be sure that the items are unique. If you could show the ToString() method of your class it might help: you might be basing it on something that is actually the same between each of your objects. This might help decide whether you really are getting the same object each time, or if the pieces under consideration really are not unique.

    Also, rather than accessing by index, you should use a foreach loop whenever possible.

    Finally, the items in a list are not universally unique, but rather references to an object that exists elsewhere. If you're trying to check that the retrieved item is unique with respect to an external object, you're going to fail.

    One more thing, I guess: you probably want to have the access on anewList to be private rather than public.

  • For debugging, to ensure that I was building up my list "uniquely", I did the following:

    public class myClass
    {
      public List<myOtherClass> anewList = new List<myOtherClass>();
    
      public void addToList(myOtherClass tmp)
      {
        //A
        Console.WriteLine(" the input = "+tmp.theStringInMyClass.ToString());
        anewList.Add(tmp);
        int i = anewList.Count();
        //B
        Console.WriteLine(" the new list count = "+ i);
        Console.WriteLine(" the last entry =" +anewList.ElementAt(i-1).theStringInMyClass.ToString());
        if (i == 100) // just for debug, I know this is my last entry
        {
           foreach (myOtherClass tmpL in anewList)
           {
               //C
               Console.WriteLine( tmpL.theStringInMyClass.ToString());
           }
        }
      }
    
    }
    

    The output after comment "A" is what I expect - all the unique entries that are input. The output in the two lines after comment "B" is what I expect - i increases, and all the string output is the unique ones I expect. "C" is what's not right... it writes out the same entry, which happens to be the last one.

    ???

    -Adeena

    tvanfosson : Show the code that's calling addToList.
  • Given the signature of your addToList method:

    public void addToList(myOtherClass tmp)
      {
        anewList.Add(tmp);
      }
    

    Is it possible that in the consumer of that method, you aren't actually creating a new instance?

    You said that you are calling addToList 100 times. Presumably, that is in a loop. At each loop iteration, you will need to create a new instance of "myOtherClass", otherwise, you'll just be updating the same object in memory.

    For example, if you do the below, you will have 100 copies of the same object:

    myOtherClass item = new myOtherClass();
    
    for(int i=0; i < 100; i++)
    {
      item.Property = i;
      addToList(item);
    }
    

    However, if your loop looks like the below, it will work fine:

    myOtherClass item = null;
    for(int i=0; i < 100; i++)
    {
      item = new myOtherClass();
      item.Property = i;
      addToList(item);
    }
    

    Hope that helps!

    VVS : Exactly my guess: he's probably adding the same reference to an instance a hundred times.
    alexandrul : +1 for the detailed answer, this should be the accepted answer.
    configurator : +1, my thoughts exactly. Except better articulated :)
  • Got it! Thank you James -

    here's the stupid thing I did wrong:

    I had:

    myClass tmpClass = new myClass();
    myOtherClass anewitem = new myOtherClass();
    string tst = "";
    
    for (int i = 0; i < 100; i++) 
    {
        tst += "blah";
        anewitem.theStringInMyClass = tst;
        tmpClass.AddToList(anewitem);
    }
    

    when I changed it to be this:

    myClass tmpClass = new myClass();
    string tst = "";
    
    for (int i = 0; i < 100; i++) 
    {
        myOtherClass anewitem = new myOtherClass()
        tst += "blah";
        anewitem.theStringInMyClass = tst;
        tmpClass.AddToList(anewitem);
    }
    

    All was well. I get it. :)

    Thanks for the help guys!

    -Adeena

  • Well, from what I've read here, I suppose your problem could be in how you add items to the list - are you sure you're not adding the same reference again and again? That could be the reason why you have 100 "last items" in the list.

    adeena : Yep - that was exactly it. :) -A

Reading a large file into a Dictionary

Hi,

I have a 1GB file containing pairs of string and long. What's the best way of reading it into a Dictionary, and how much memory would you say it requires?

File has 62 million rows. I've managed to read it using 5.5GB of ram.

Say 22 bytes overhead per Dictionary entry, that's 1.5GB. long is 8 bytes, that's 500MB. Average string length is 15 chars, each char 2 bytes, that's 2GB. Total is about 4GB, where does the extra 1.5 GB go to?

The initial Dictionary allocation takes 256MB. I've noticed that each 10 million rows I read, consume about 580MB, which fits quite nicely with the above calculation, but somewhere around the 6000th line, memory usage grows from 260MB to 1.7GB, that's my missing 1.5GB, where does it go?

Thanks.

From stackoverflow
  • You'll need to specify the file format, but if it's just something like name=value, I'd do:

    Dictionary<string,long> dictionary = new Dictionary<string,long>();
    using (TextReader reader = File.OpenText(filename))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            string[] bits = line.Split('=');
            // Error checking would go here
            long value = long.Parse(bits[1]);
            dictionary[bits[0]] = value;
        }
    }
    

    Now, if that doesn't work we'll need to know more about the file - how many lines are there, etc?

    Are you using 64 bit Windows? (If not, you won't be able to use more than 3GB per process anyway, IIRC.)

    The amount of memory required will depend on the length of the strings, number of entries etc.

    Unkwntech : 3.5GB on 32-bit windows.
    Jon Skeet : I thought the 3.5GB was the amount of physical memory the whole system would use, but with a 3GB per process limit. Either way, it's less than 5 :)
    Aaron Fischer : And your application needs to be set to any cpu or x64 to take advantage of a 64 bit system.
    sixlettervariables : 32bit still has 2GB per process limit without PAE. The 3GB switch just means all applications share 3GB and the kernel uses 1GB.
  • Thinking about this, I'm wondering why you'd need to do it... (I know, I know... I shouldn't wonder why, but hear me out...)

    The main problem is that there is a huge amount of data that needs to be presumably accessed quickly... The question is, will it essentially be random access, or is there some pattern that can be exploited to predict accesses?

    In any case, I would implement this as a sliding cache. E.g. I would load as much as feasibly possible into memory to start with (with the selection of what to load based as much on my expected access pattern as possible) and then keep track of accesses to elements by time last accessed. If I hit something that wasn't in the cache, then it would be loaded and replace the oldest item in the cache.

    This would result in the most commonly used stuff being accessible in memory, but would incur additional work for cache misses.

    In any case, without knowing a little more about the problem, this is merely a 'general solution'.

    It may be that just keeping it in a local instance of a sql db would be sufficient :)

    Meidan Alon : 1 GB is the bare minimum for gaining any kind of performance improvement (after exploiting all possible access patterns). Actually, I'm really thinking about an in-memory DB.
    Andrew Rollings : Well, in that case, you could try the above... With a bit of tuning, it may well do what you need. :) Failing that, stick it in a local db instance and let that take care of the cacheing.
  • Loading a 1 GB file in memory at once doesn't sound like a good idea to me. I'd virtualize the access to the file by loading it in smaller chunks only when the specific chunk is needed. Of course, it'll be slower than having the whole file in memory, but 1 GB is a real mastodon...

    lubos hasko : It maybe doesn't sound like a good idea to you but you bet that it sounds like a good idea to Google because this is exactly what they're doing. Storing in memory all their indexes.
    Boyan : @lubos hasko: Maybe it's easier when you have a Google datacenter. But assuming every client has one still doesn't sound like a good idea.
  • Maybe you can convert that 1 GB file into a SQLite database with two columns key and value. Then create an index on key column. After that you can query that database to get the values of the keys you provided.
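
    A one-off conversion sketch in Python (the file names, table name and tab-separated "string<TAB>long" line format are assumptions - adjust them to your actual data):

    import sqlite3

    conn = sqlite3.connect("pairs.db")
    conn.execute("CREATE TABLE IF NOT EXISTS pairs (key TEXT, value INTEGER)")
    with open("data.txt", "r", encoding="utf-8") as f:
        conn.executemany("INSERT INTO pairs VALUES (?, ?)",
                         (line.rstrip("\n").split("\t") for line in f))
    conn.execute("CREATE INDEX IF NOT EXISTS idx_pairs_key ON pairs(key)")
    conn.commit()

    # Lookups then hit the on-disk index instead of a multi-GB in-memory Dictionary.
    print(conn.execute("SELECT value FROM pairs WHERE key = ?", ("some key",)).fetchone())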

  • Everyone here seems to be in agreement that the best way to handle this is to read only a portion of the file into memory at a time. Speed, of course, is determined by which portion is in memory and what parts must be read from disk when a particular piece of information is needed.

    There is a simple method to handle deciding what's the best parts to keep in memory:

    Put the data into a database.

    A real one, like MSSQL Express, or MySql or Oracle XE (all are free).

    Databases cache the most commonly used information, so it's just like reading from memory. And they give you a single access method for in-memory or on-disk data.

  • Don't read a 1GB file into memory even if you have 8 GB of physical RAM; you can still have many problems - based on personal experience.

    I don't know what you need to do, but find a workaround: read partially and process. If that doesn't work, then consider using a database.

  • I am not familiar with C#, but if you're having memory problems you might need to roll your own memory container for this task.

    Since you want to store it in a dict, I assume you need it for fast lookup? You have not clarified which one should be the key, though.

    Let's hope you want to use the long values for keys. Then try this:

    Allocate a buffer that's as big as the file. Read the file into that buffer.

    Then create a dictionary with the long values (32 bit values, I guess?) as keys, with their values being a 32 bit value as well.

    Now browse the data in the buffer like this: Find the next key-value pair. Calculate the offset of its value in the buffer. Now add this information to the dictionary, with the long as the key and the offset as its value.

    That way, you end up with a dictionary which might take maybe 10-20 bytes per record, and one larger buffer which holds all your text data.

    At least with C++, this would be a rather memory-efficient way, I think.

  • It's important to understand what's happening when you populate a Hashtable. (The Dictionary uses a Hashtable as its underlying data structure.)

    When you create a new Hashtable, .NET makes an array containing 11 buckets, which are linked lists of dictionary entries. When you add an entry, its key gets hashed, the hash code gets mapped on to one of the 11 buckets, and the entry (key + value + hash code) gets appended to the linked list.

    At a certain point (and this depends on the load factor used when the Hashtable is first constructed), the Hashtable determines, during an Add operation, that it's encountering too many collisions, and that the initial 11 buckets aren't enough. So it creates a new array of buckets that's twice the size of the old one (not exactly; the number of buckets is always prime), and then populates the new table from the old one.

    So there are two things that come into play in terms of memory utilization.

    The first is that, every so often, the Hashtable needs to use twice as much memory as it's presently using, so that it can copy the table during resizing. So if you've got a Hashtable that's using 1.8GB of memory and it needs to be resized, it's briefly going to need to use 3.6GB, and, well, now you have a problem.

    The second is that every hash table entry has about 12 bytes of overhead: pointers to the key, the value, and the next entry in the list, plus the hash code. For most uses, that overhead is insignificant, but if you're building a Hashtable with 100 million entries in it, well, that's about 1.2GB of overhead.

    You can overcome the first problem by using the overload of the Dictionary's constructor that lets you provide an initial capacity. If you specify a capacity big enough to hold all of the entries you're going to be added, the Hashtable won't need to be rebuilt while you're populating it. There's pretty much nothing you can do about the second.

    Meidan Alon : I've created the Dictionary with a big enough initial capacity. I can also see that memory usage grows slowly, wthout any big jumps.
  • Can you convert the 1G file into a more efficient indexed format, but leave it as a file on disk? Then you can access it as needed and do efficient lookups.

    Perhaps you can memory map the contents of this (more efficient format) file, then have minimum ram usage and demand-loading, which may be a good trade-off between accessing the file directly on disc all the time and loading the whole thing into a big byte array.

  • If you choose to use a database, you might be better served by a dbm-style tool, like Berkeley DB for .NET. They are specifically designed to represent disk-based hashtables.

    Alternatively you may roll your own solution using some database techniques.

    Suppose your original data file looks like this (dots indicate that string lengths vary):

    [key2][value2...][key1][value1..][key3][value3....]
    

    Split it into index file and values file.

    Values file:

    [value1..][value2...][value3....]
    

    Index file:

    [key1][value1-offset]
    [key2][value2-offset]
    [key3][value3-offset]
    

    Records in index file are fixed-size key->value-offset pairs and are ordered by key. Strings in values file are also ordered by key.

    To get a value for key(N) you would binary-search for key(N) record in index, then read string from values file starting at value(N)-offset and ending before value(N+1)-offset.

    Index file can be read into in-memory array of structs (less overhead and much more predictable memory consumption than Dictionary), or you can do the search directly on disk.
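
    A minimal Python sketch of that layout (hypothetical file names; an 8-byte integer key, UTF-8 strings and an index already sorted by key are assumed):

    import struct

    RECORD = struct.Struct(">qq")   # fixed-size index record: (key, value_offset)

    def lookup(index_path, values_path, wanted_key):
        with open(index_path, "rb") as idx, open(values_path, "rb") as vals:
            idx.seek(0, 2)
            n = idx.tell() // RECORD.size          # number of index records
            lo, hi = 0, n - 1
            while lo <= hi:                        # binary search over the index file
                mid = (lo + hi) // 2
                idx.seek(mid * RECORD.size)
                key, offset = RECORD.unpack(idx.read(RECORD.size))
                if key == wanted_key:
                    if mid + 1 < n:                # value ends where the next one starts
                        idx.seek((mid + 1) * RECORD.size)
                        _, end = RECORD.unpack(idx.read(RECORD.size))
                    else:
                        vals.seek(0, 2)
                        end = vals.tell()
                    vals.seek(offset)
                    return vals.read(end - offset).decode("utf-8")
                elif key < wanted_key:
                    lo = mid + 1
                else:
                    hi = mid - 1
        return None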

How to use nhibernate with ASP.NET 3.5 (not MVC)

What is the best way to interop with NHibernate 2.0 and ASP.NET 3.5? How can I easily develop a CRUD application?

Is the ObjectDataSource the way to go?

Thank you.

From stackoverflow
  • you can watch screencasts at http://www.summerofnhibernate.com/ that will explain how to set up CRUD, and will shed some light on more advanced topics

  • You might find Rhino Commons a good option. It offers Repository<T> and UnitOfWorkApplication. Together these provide data gateway and session management in the context of a web application. Use with Castle.Service.Transaction to handle transactions transparently.


How to transform rows to columns

I have a simple problem when querying a SQL Server 2005 database. I have tables called Customer and Products (1->M). One customer has at most 2 products. Instead of output like

CustomerName, ProductName ...

I would like to output as

CustomerName, Product1Name, Product2Name ...

Could anybody help me?

Thanks!

From stackoverflow
  • In SQL Server 2005, there are operators called "PIVOT" and "UNPIVOT" which can be used to transform between rows and columns.

    Hope that could help you.

    Here are two links about pivot:

    http://www.tsqltutorials.com/pivot.php

    http://www.simple-talk.com/sql/t-sql-programming/creating-cross-tab-queries-and-pivot-tables-in-sql/

    I solved my problem with pivot ;)

  • Like others have said, you can use the PIVOT and UNPIVOT operators. Unfortunately, one of the problems with both PIVOT and UNPIVOT are that you need to know the values you will be pivoting on in advance or else use dynamic SQL.

    It sounds like, in your case, you're going to need to use dynamic SQL. To get this working well you'll need to pull a list of the products being used in your query. If you were using the AdventureWorks database, your code would look like this:

    USE AdventureWorks;
    GO
    
    DECLARE @columns NVARCHAR(MAX);
    
    SELECT x.ProductName
    INTO #products
    FROM (SELECT p.[Name] AS ProductName
     FROM Purchasing.Vendor AS v
     INNER JOIN Purchasing.PurchaseOrderHeader AS poh ON v.VendorID = poh.VendorID
     INNER JOIN Purchasing.PurchaseOrderDetail AS pod ON poh.PurchaseOrderID = pod.PurchaseOrderID
     INNER JOIN Production.Product AS p ON pod.ProductID = p.ProductID
     GROUP BY p.[Name]) AS x;
    
    SELECT @columns = STUFF(
     (SELECT ', ' + QUOTENAME(ProductName, '[') AS [text()]
        FROM #products FOR XML PATH ('')
     ), 1, 1, '');
    
    SELECT @columns;
    

    Now that you have your columns, you can pull everything that you need pivot on with a dynamic query:

    DECLARE @sql NVARCHAR(MAX);
    
    SET @sql = 'SELECT CustomerName, ' + @columns + '
    FROM (
     // your query goes here
    ) AS source
    PIVOT (SUM(order_count) FOR product_name IN (' + @columns + ')) AS p';
    
    EXEC sp_executesql @sql
    

    Of course, if you need to make sure you get decent values, you may have to duplicate the logic you're using to build @columns and create an @coalesceColumns variable that will hold the code to COALESCE(col_name, 0) if you need that sort of thing in your query.

  • As others have mentioned, SQL 2005 has the PIVOT function which is probably the best for general use. In some cases, however, you can simply do something like this.

    Select 
     Customer,
     Sum(Case When Product = 'Foo' Then 1 Else 0 End) Foo_Count,
     Sum(Case When Product = 'Bar' Then 1 Else 0 End) Bar_Count
    From Customers_Products
    Group By Customer
    
  • If this is a real-life problem and you are just asking whether it's possible directly in SQL, then you might consider querying the database from Excel or Access; they have excellent built-in pivot tools accessible without building any SQL.

    le dorfier : Interesting. We have someone who is offended by simply Access and/or Excel. And won't identify themselves. I guess a certain % of your answers are subject to random vandalism. Oh well.
  • Thank you all for the help! I will try to make it a little clearer. Suppose customer 'AW0000002' has two products, and 'AW0000003' has one product (every customer has a maximum of two products). I would like to have the output as follows:

    CustomerName  ProductName1             ProductName2
    AW00000002    Adjustable Race          Bearing Ball
    AW00000003    Headset Ball Bearings    null
    

    Thanks!

Read & Update filestream

I have a little utility that does a search of a number of files. I had to create it because both Google & Windows desktop searches were not finding the appropriate lines in files. The searching works fine (I am willing to improve on it) but one of the things I would like to add to my util is a batch find/replace.

So what would be the best way to read a line from a file, compare it to a SearchTerm and, if it matches, update the line and then continue through the rest of the file?

Also, am I asking the completely wrong question? Is there a reliable tool to do this already? [Google & MS have failed me :-( ]

From stackoverflow
  • I would do the following for each file:

    • Do the search as normal. Also check for the token to replace. As soon as you've seen it, start that file again. If you don't see the token to replace, you're done.
    • When you start again, create a new file and copy each line that you read from the input file, doing the replacement as you go.
    • When you've finished with the file:
      • Move the current file to a backup filename
      • Move the new file to the original filename
      • Delete the backup file

    Be careful that you don't do this on binary files etc though - the consequences of doing a textual search and replace on binary files would usually be dire!

    grepsedawk : There is a trade off. If you know that most likely the files will have the SearchTerm, it may be better to start a temporary file copying the contents of the other file as you go. Instead of searching through the file twice.
    Jon Skeet : Yes, I wondered about that. I figured I would give just the simple algorithm instead :)
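
    A minimal Python sketch of the copy-and-swap flow described above (using the single-pass variant from the comment: write the temp file while scanning and throw it away if nothing matched):

    import os

    def replace_in_file(path, search, replace):
        tmp_path, bak_path = path + ".new", path + ".bak"
        changed = False
        with open(path, "r", encoding="utf-8") as src, \
             open(tmp_path, "w", encoding="utf-8") as dst:
            for line in src:
                if search in line:
                    line = line.replace(search, replace)
                    changed = True
                dst.write(line)
        if not changed:
            os.remove(tmp_path)       # no match: discard the copy
            return False
        os.replace(path, bak_path)    # move the current file to a backup name
        os.replace(tmp_path, path)    # move the new file to the original name
        os.remove(bak_path)           # delete the backup
        return True
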
  • Have you tried googling for "grep for windows"?

  • Total Commander ?

    Nathan Koop : Thanks for the suggestion, I'll take a look into this tool.
  • If PowerShell is an option, the function defined below can be used to perform find and replace across files. For example, to find 'a string' in text files in the current directory, you would do:

    dir *.txt | FindReplace 'a string'
    

    To replace 'a string' with another value, just add the new value at the end:

    dir *.txt | FindReplace 'a string' 'replacement string'
    

    You can also call it on a single file using FindReplace -path MyFile.txt 'a string'.

    function FindReplace( [string]$search, [string]$replace, [string[]]$path ) {
      # Include paths from pipeline input.
      $path += @($input)
    
      # Find all matches in the specified files.
      $matches = Select-String -path $path -pattern $search -simpleMatch
    
      # If replacement value was given, perform replacements.
      if( $replace ) {
        # Group matches by file path.
        $matches | group -property Path | % {
          $content = Get-Content $_.Name
    
          # Replace all matching lines in current file.
          foreach( $match in $_.Group ) {
            $index = $match.LineNumber - 1
            $line = $content[$index]
            $updatedLine = $line -replace $search,$replace
            $content[$index] = $updatedLine
    
            # Update match with new line value.
            $match | Add-Member NoteProperty UpdatedLine $updatedLine
          }
    
          # Update file content.
          Set-Content $_.Name $content
        }
      }
    
      # Return matches.
      $matches
    }
    

    Note that Select-String also supports regex matches, but it has been constrained to simple matches here for simplicity ;) You can also perform a more robust replacement like Jon suggested, rather than just overwriting the file with the new content.

Migration From Datatable to Linq to Sql

In the past I used dynamic SQL and a DataTable to get data from the database.

Such as :

Public Shared Function GetUsersByUsername(ByVal username As String) As DataTable
    Dim strSQL As String = "select * from Users where Username='" & username & "'"
    Return dbClass.datatable(strSQL)
End Function

And I could use this data like this:

Dim Email As String = GetUsersByUsername("mavera").Rows(0).Item("email")

or

datagrid1.datasource=GetUsersByUsername("mavera")

datagrid1.databind()

And now I want to use LINQ to SQL to do that. I can write the query with LINQ, but I can't use it like a DataTable. What should my new usage look like?

From stackoverflow
  • You should get rid of GetUsersByName() altogether, because you can do it in one line. You will also have to change how you get things like the user's email. So GetUsersByName() would be rewritten something like:

    dc.Users.Where(Function(u) u.Username = username)
    

    and your email assignment statement would be written as:

    Dim Email As String = users.First().Email
    

    Forgive me if my VB syntax is off. I never use it anymore...

Business Object DAL design

When designing business objects I have tried several different methods of writing the data access layer. Some have worked out better than others but I have always felt there must be a "better" way.

I would really just like to see the different ways people have handled the DAL in different situations, and their opinion of how well the technique worked or didn't.

From stackoverflow
  • Unfortunately I don't think there is a "better way", it's too dependent on the specific situation as to what DAL approach you use. A great discussion of the "state of the art" is Patterns of Enterprise Application Architecture by Martin Fowler.

    Chapter 10, Data Source Architectural Patterns specifically talks about most of the most commonly used patterns for business applications.

    In general though, I've found using the simplest approach that meets the basic maintainability and adaptability requirements is the best choice.

    For example on a recent project a simple "Row Data Gateway" was all I needed. (This was simply code generated classes for each relevant database table, including methods to perform the CRUD operations). No endless debates about ORM versus stored procs, it just worked, and did the required job well.

    Bob Dizzle : I agree that for different situations the answer will be different, but I'm just looking for some different techniques and how they worked out.
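
    For readers unfamiliar with the pattern, here is a minimal Row Data Gateway sketch in Python (the table and column names are invented for illustration; the classes in the project described above were code-generated per database table):

    import sqlite3

    class PersonGateway:
        """One class per table, exposing only the CRUD operations."""
        def __init__(self, conn, row_id=None, name=None):
            self.conn, self.id, self.name = conn, row_id, name

        def insert(self):
            cur = self.conn.execute("INSERT INTO person(name) VALUES (?)", (self.name,))
            self.id = cur.lastrowid

        def update(self):
            self.conn.execute("UPDATE person SET name = ? WHERE id = ?", (self.name, self.id))

        def delete(self):
            self.conn.execute("DELETE FROM person WHERE id = ?", (self.id,))

        @classmethod
        def find(cls, conn, row_id):
            row = conn.execute("SELECT id, name FROM person WHERE id = ?", (row_id,)).fetchone()
            return cls(conn, *row) if row else None
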
  • There are several common patterns. The 'Patterns of Enterprise Application Architecture' book is a good reference for these:

    • Table Data Gateway
    • Row Data Gateway
    • Active Record
    • Data Mapper

    If you use an ORM, such as llblgen, you get the choice of self-servicing or adaptor.

  • I've relied heavily on Billy McCafferty's NHibernate Best Practices article / sample code for many Web / WinForms applications now. It's a wonderfully written article that will provide you with a good solid sample architecture -- in addition to teaching you basic NHibernate and TDD. He tries to give you an overview of his architecture and design decisions.

    He creates a very elegant DAL using generic DataAccessObjects which you can extend for each domain object -- and its very loosely coupled to the BL using interfaces and a DAOFactory. I would recommend looking at the BasicSample first, especially if you haven't worked with NHibernate before.

    Note, this article relies heavily on NHibernate, but I think the general approach could easily be altered to suit other ORMs.

  • If you're going down the NHibernate route (good article link BTW from @Watson above), then I'd strongly recommend that you check out the suvius-flamingo sample project from CodeBetter. He has a very nice, succinct sample project which shows MVC and NHibernate in action.

    Here's the suvius-flamingo link.

  • In our open source project Bunian, we concluded that the Business Objects (the whole component) is the core of the system, and everything should revolve around it including that data access layer.

    The Business component will dictate to others what it needs, implying that through interfaces. For example, the Business Object Person will have an interface member called IRepositoryForPerson; this member will be assigned an instance through a Dependency Injection container when needed.

    For more details check my blog post here:

    http://www.emadashi.com/index.php/2008/11/data-access-within-business-objects-bunian-design//

    and check Bunian's code here (although it's still amateurish):

    http://www.codeplex.com/Bunian

    Of course, new things will emerge with this approach, like the life cycle of the data access session (if you are using NHibernate, for example), but that would be for another question, I guess :)

    I hope you find this useful.

  • I am going to assume you mean writing a DAL that is accessing SQL, because this is the most common case today. One of the biggest problems in writing a DAL against SQL is the ORM part. That is, there is a fundamental impedance mismatch between OO programming and relational database schemas. There have been many great, even successful, attempts at writing ORMs. But they all suffer from the same problem that is also their benefit: they abstract you away from the underlying SQL being generated. Why this is a problem is that the performance of your database is a critical component of how well your system functions overall. Many ORMs (perhaps most) not only have less-than-stellar performance for many standard queries, but actually encourage patterns of usage that will degrade performance considerably (traversing relationships repeatedly within loops when querying collections being one common example, making resolving deadlocks difficult being another). Of course, after learning the ORM API in detail, you can usually find ways around these performance potholes.

    My current take on the state of ORMs is that I want them to do as little as possible, while still giving me the efficiencies of a solid library that takes care of all of the nuts and bolts of data access. In other words, because I don't think they are "good enough" yet, and may never be with SQL as the back end, I want to retain control at the bare-metal level, and I will drop down to writing SQL by hand without hesitation in many cases, regardless of the ORM, because I know the specific way I want the data to be queried for my given needs.

    This is obviously a more brittle approach to coding than if you religiously use the ORM as it was intended, so as a result, you have to be extra diligent in terms of unit testing, SQL injection, and proper separation of concerns. So, to sum up, I agree with Ash, although that does not imply he/she agrees with me :)

Application Frameworks - Buy, Build, or Assimilate?

I was curious as to what other shops are doing regarding base application frameworks. I look at an application framework as being able to provide additional or extended functionality to improve the quality of applications built from it.

There are a variety of out-of-the-box frameworks, such as Spring (or Spring.NET), etc. I find that the largest problem with these is that they are not a la carte. Basically, they have too much functionality, and unless every piece of that functionality is the best implementation available, chances are that you will end up using a patchwork of multiple frameworks to accomplish these tasks - causing bloat and confusion. This applies to free and commercial systems, in my opinion.

Of course, writing your own is largely re-inventing the wheel. I don't think it is without merit, though, as it provides the most customizable option. Some things are just too large to develop, though, and seem to be poorly implemented or not implemented at all in this case because of the hesitation to commit to the upfront costs of development.

There are a large variety of open source projects that address individual portions of a could-be application framework as well. These can be adopted or assimilated (obviously depending upon license agreements) to help frame in a comprehensive framework from diverse sources.

We approached the situation by looking at some of the larger concerns in our applications across the entire enterprise and came up with a list of valid cross-cutting concerns and recurring implementation issues. In the end, we came up with hybrid solution that is partially open source, partially based on existing open source options, and partially custom developed.

A few examples of things that are in our framework:

  • Exception and event logging providers. A simple, uniform means by which every application can log exceptions and events in an identical fashion with a minimal coding effort. Out of the box, it can log to a SQL Server, text file, event viewer, etc. It contains extensibility points to log to other sources, as well.
  • Variable assignment enforcement. A generic class that exposes extension methods based upon the object type, using a syntax that is inspired by JUnit. For example, to determine if myObject is not null, we can do a simple Enforce.That(myObject).IsNotNull(); or determine if it is a specific type by doing a simple Enforce.That(myObject).IsOfType(typeof(Hashtable)); Enforcement failures raise the appropriate exception, both reducing the amount of code and providing consistency in implementation.
  • Unit testing helpers. A series of classes, based upon reflection that can automatically test classes and their properties. (Inspired by Automatic Class Tester from CodePlex) but written from the ground up. Helps to simplify the creation of unit tests for things that are traditionally hard or time-consuming to test.

We have also outright adopted some other functionality as-is. For example, we are using PostSharp for AOP, Moq for mocking, and Autofac for DI.

Just wondering what other people might have done and what concerns your framework addresses that you did not find tooling that you were satisfied with? As for our experience, we are definitely reaping the benefits of the new framework and are content with the approach that we have taken.

From stackoverflow
  • My simple advice is that you use a framework that suits your needs. Of course, in order to do this you have to experiment and know beforehand what are you looking for. Even if the framework comes with much more than you need, what is the cost of this? For the average problem, the cost is only a few extra Mbs in a jar, which I think is OK for most projects.

    In the end, you should choose a framework that does the job right, so that your focus is at providing user value and easing the maintenance of the developer. Of course, there isn't a single framework that addresses everyone's problems, but there are some frameworks that hit the sweet spot on what they aim for. It's all a matter of going with the best compromise.

    joseph.ferris : Exactly. That is why we went with the approach of finding as many pieces to the puzzle as possible and built around them.
  • Our approach was to devote an entire team of architects (namely 'Technical Architects') to:

    • either adapting existing open-source frameworks, in some cases encapsulating them in an in-house API in order to be able to change frameworks should the need arise
    • or creating new frameworks based on the specific needs found across several teams and several projects.

    Whatever the approach, those frameworks need to be very well documented (at least with a complete public API), and their releases need to be well advertised:
    since all teams will base their work on those frameworks, they need to upgrade to new framework versions as soon as possible in order to build their own deliveries.

    joseph.ferris : That is very similar to what we are doing and the requirements that we have. The only difference is that I am the "team" of tech archs saddled with the task. ;-)
    VonC : Impressive! Good luck with that :)