Sunday, April 17, 2011

Quickest way in C# to find a file in a directory with over 20,000 files

I have a job that runs every night to pull xml files from a directory that has over 20,000 subfolders under the root. Here is what the structure looks like:

rootFolder/someFolder/someSubFolder/xml/myFile.xml
rootFolder/someFolder/someSubFolder1/xml/myFile1.xml
rootFolder/someFolder/someSubFolderN/xml/myFile2.xml
rootFolder/someFolder1
rootFolder/someFolderN

So looking at the above, the structure is always the same: a root folder, then two levels of subfolders, then an xml directory, and then the xml file. Only the names of the root folder and the xml directory are known to me.

The code below traverses all the directories and is extremely slow. Any recommendations on how I can optimize the search, especially given that the directory structure is known?

string[] files = Directory.GetFiles(@"\\somenetworkpath\rootFolder", "*.xml", SearchOption.AllDirectories);
From stackoverflow
  • Are there additional directories at the same level as the xml folder? If so, you could probably speed up the search if you do it yourself and eliminate that level from searching.

    // rootPath is the known root folder, e.g. @"\\somenetworkpath\rootFolder".
    System.IO.DirectoryInfo root = new System.IO.DirectoryInfo(rootPath);
    System.Collections.Generic.List<System.IO.FileInfo> xmlFiles =
        new System.Collections.Generic.List<System.IO.FileInfo>();

    // Walk exactly two levels down, then look only inside the "xml" folder.
    foreach (System.IO.DirectoryInfo subDir1 in root.GetDirectories())
    {
        foreach (System.IO.DirectoryInfo subDir2 in subDir1.GetDirectories())
        {
            System.IO.DirectoryInfo xmlDir = new System.IO.DirectoryInfo(
                System.IO.Path.Combine(subDir2.FullName, "xml"));

            if (xmlDir.Exists)
            {
                xmlFiles.AddRange(xmlDir.GetFiles("*.xml"));
            }
        }
    }

  • I can't think of anything faster in C#, but do you have indexing turned on for that file system?

    Joey : Indexing shouldn't do much for simple traversal.
  • Rather than doing a brute-force search with GetFiles, you could most likely use GetDirectories: first get the list of first-level subfolders, loop through those and repeat the process for the second level, then look for the xml folder in each one, and finally search it for .xml files.

    The exact speedup will vary, but searching for directories first, and only THEN getting to the files, should help a lot!

    Update

    Ok, I did a quick bit of testing and you can actually optimize it much further than I thought.

    The following code snippet will search a directory structure and find ALL "xml" folders inside the entire directory tree.

    string startPath = @"C:\Testing\Testing\bin\Debug";
    string[] oDirectories = Directory.GetDirectories(startPath, "xml", SearchOption.AllDirectories);
    Console.WriteLine(oDirectories.Length.ToString());
    foreach (string oCurrent in oDirectories)
        Console.WriteLine(oCurrent);
    Console.ReadLine();
    

    If you drop that into a test console app you will see it output the results.

    Now, once you have this, just look in each of the found directories for your .xml files.

    Henk Holterman : It will probably be faster and maybe it's possible to start processing in the background.
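
    To tie the last two points together, here is a minimal sketch of looking in each xml directory while results stream back, so processing can start before the whole tree has been listed. It assumes .NET 4 or later, where Directory.EnumerateDirectories and Directory.EnumerateFiles return lazily, and it assumes the fixed depth described in the question. The class and method names (XmlFinder, FindXmlFiles) are made up for illustration.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.IO;

    static class XmlFinder
    {
        // Hypothetical helper: lazily yields every *.xml file found in an "xml"
        // directory that sits exactly two levels below rootPath.
        public static IEnumerable<string> FindXmlFiles(string rootPath)
        {
            foreach (string level1 in Directory.EnumerateDirectories(rootPath))
            {
                foreach (string level2 in Directory.EnumerateDirectories(level1))
                {
                    string xmlDir = Path.Combine(level2, "xml");
                    if (!Directory.Exists(xmlDir))
                        continue; // this subfolder has no xml directory

                    foreach (string file in Directory.EnumerateFiles(xmlDir, "*.xml"))
                        yield return file;
                }
            }
        }

        static void Main(string[] args)
        {
            // The default path is the placeholder from the question.
            string root = args.Length > 0 ? args[0] : @"\\somenetworkpath\rootFolder";

            // Each path is printed as soon as it is found.
            foreach (string file in FindXmlFiles(root))
                Console.WriteLine(file);
        }
    }
    ```

    Whether this beats the GetDirectories version on a network share is something to measure. The main gain is that the caller does not wait for a complete array before handling the first file.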
  • The only way I can see to make much difference is to replace the brute-force hunt with some third-party or OS indexing routine to speed up the lookup. That way the search is done offline from your app.

    I would also suggest looking at better ways to structure that data, if at all possible.

  • Use P/Invoke on FindFirstFile/FindNextFile/FindClose and avoid overhead of creating lots of FileInfo instances.

    But this will be hard work to get right (you will have to do all the handling of file vs. directory and recursion yourself). So try something simple (Directory.GetFiles(), Directory.GetDirectories()) to start with and get things working. If it is too slow, look at alternatives (but always measure; it is too easy to make things slower).
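
    For reference, a rough sketch of what the P/Invoke route might look like for a single directory level. It is Windows-only. The declarations follow the documented kernel32 signatures, but the NativeFind class and its List helper are hypothetical names, and for brevity any failure is treated as "no results" (real code would inspect Marshal.GetLastWin32Error).

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Runtime.InteropServices;
    using FILETIME = System.Runtime.InteropServices.ComTypes.FILETIME;

    static class NativeFind
    {
        [StructLayout(LayoutKind.Sequential, CharSet = CharSet.Unicode)]
        struct WIN32_FIND_DATA
        {
            public FileAttributes dwFileAttributes;
            public FILETIME ftCreationTime;
            public FILETIME ftLastAccessTime;
            public FILETIME ftLastWriteTime;
            public uint nFileSizeHigh;
            public uint nFileSizeLow;
            public uint dwReserved0;
            public uint dwReserved1;
            [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 260)]
            public string cFileName;
            [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 14)]
            public string cAlternateFileName;
        }

        [DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
        static extern IntPtr FindFirstFile(string lpFileName, out WIN32_FIND_DATA lpFindFileData);

        [DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
        [return: MarshalAs(UnmanagedType.Bool)]
        static extern bool FindNextFile(IntPtr hFindFile, out WIN32_FIND_DATA lpFindFileData);

        [DllImport("kernel32.dll", SetLastError = true)]
        [return: MarshalAs(UnmanagedType.Bool)]
        static extern bool FindClose(IntPtr hFindFile);

        static readonly IntPtr InvalidHandle = new IntPtr(-1);

        // Hypothetical helper: lists entries of ONE directory that match pattern,
        // returning either only directories or only files. No recursion.
        public static List<string> List(string directory, string pattern, bool wantDirectories)
        {
            var results = new List<string>();
            WIN32_FIND_DATA data;
            IntPtr handle = FindFirstFile(Path.Combine(directory, pattern), out data);
            if (handle == InvalidHandle)
                return results; // sketch only: no match and errors look the same here

            try
            {
                do
                {
                    if (data.cFileName == "." || data.cFileName == "..")
                        continue; // skip the self and parent entries

                    bool isDirectory = (data.dwFileAttributes & FileAttributes.Directory) != 0;
                    if (isDirectory == wantDirectories)
                        results.Add(Path.Combine(directory, data.cFileName));
                } while (FindNextFile(handle, out data));
            }
            finally
            {
                FindClose(handle); // always release the search handle
            }
            return results;
        }

        static void Main(string[] args)
        {
            // The default path is the placeholder from the question.
            string root = args.Length > 0 ? args[0] : @"\\somenetworkpath\rootFolder";

            // Two known levels of subfolders, then the "xml" directory.
            foreach (string level1 in List(root, "*", true))
                foreach (string level2 in List(level1, "*", true))
                    foreach (string file in List(Path.Combine(level2, "xml"), "*.xml", false))
                        Console.WriteLine(file);
        }
    }
    ```

    This returns plain path strings and never builds FileInfo objects, which is the overhead the answer is talking about. Time it against the simple Directory calls before committing to it.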
