Thursday, February 17, 2011

What would be the ideal script language for parsing text files?

My code reads in a file, usually HTML, but it could be any plain text. I am thinking of having each piece as a separate module loaded externally at run time so I don't have to maintain it. I would like to use a scripting language to parse the text/strings and call my appropriate C or C++ functions. What scripting language would be good to use? I'd need to move forward and backward in the string/text file and extract things such as HTML links, as well as numbers, dates, and a few other pieces of data. What should I use?

I'd like to be able to do something like:

goto(".com"); //move pos forward
rgoto("method"); //move back
string1 = "Username is %STRING%" //extract a string with a space delimiter
time = Time%HH_MM_SS%?m //extract the time down to the second

Where ? would be one letter, so it wouldn't matter whether it's am or pm. Doing %MH_MM% would be good too (military hours, 0-23, and minutes, no seconds available).

But any script language would do. I'd just like something that doesn't take much effort to parse text with.
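
The cursor-style operations described in the question can be approximated in a few lines of Python. This is only a rough sketch: the `TextCursor` class, its method names, and the sample string are all hypothetical, and the `%STRING%` and `%HH_MM_SS%` placeholders are replaced here with ordinary regular expressions.

```python
import re


class TextCursor:
    """Hypothetical helper: a movable position inside a string."""

    def __init__(self, text):
        self.text = text
        self.pos = 0

    def goto(self, needle):
        """Move forward to just past the next occurrence of needle."""
        i = self.text.find(needle, self.pos)
        if i == -1:
            return False
        self.pos = i + len(needle)
        return True

    def rgoto(self, needle):
        """Move backward to the start of the previous occurrence of needle."""
        i = self.text.rfind(needle, 0, self.pos)
        if i == -1:
            return False
        self.pos = i
        return True

    def extract(self, pattern):
        """Find a regex at or after the cursor, return its groups, move past it."""
        m = re.compile(pattern).search(self.text, self.pos)
        if m is None:
            return None
        self.pos = m.end()
        return m.groups()


# Made-up sample input, for illustration only.
sample = "method=login Username is alice Time 11:42:07pm see example.com"
cur = TextCursor(sample)

cur.goto("Username is ")                     # move pos forward
(user,) = cur.extract(r"(\S+)")              # string up to the next space
hh, mm, ss = cur.extract(r"Time (\d{1,2}):(\d{2}):(\d{2})[ap]m")
cur.goto(".com")                             # move pos forward
cur.rgoto("method")                          # move back
```

The `[ap]m` part plays the role of the `?m` wildcard: it accepts either letter and does not capture it.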

From stackoverflow
  • Personally I would use Python for this, especially to be able to call your existing C/C++ code.

    Vinko Vrsalovic : Lua is another good option
    EBGreen : I have briefly looked at Lua, but I am more conversant with parsing in Python, which is why I would suggest it, not because I have any criticism of Lua for the task. Thanks for the edit btw. My typing-fu is weak today.
  • Lua is designed for embedding. It's perfect for what you are trying to do.

    Vinko Vrsalovic : Python is also a good option
    Paul : Both work. Lua's considerably smaller (which may or may not be an issue) and as David said, designed from the get-go for easy embedding
    EBGreen : Ya. I was a little confused by the question. When I read it I got the feeling that the OP wanted a language that lived outside his app. On second read, embedded may be what the OP really wanted in which case Lua probably would be easier than Python.
  • Lua is not only expressly designed for embedding, but there's also LPEG (by the same author), which lets you write a parser almost in EBNF. It's like having YACC (or Bison) in a small library.

  • Vote for Python - the text processing is top notch, and it's great for embedding. It can call your C/C++ code and you can dynamically import modules - great for keeping that maintainability cost down. Dive Into Python has a really great tutorial on how to set that up using the getattr() function and a consistent naming scheme.

    http://diveintopython.org/object_oriented_framework/index.html
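
    The getattr() approach mentioned above might look roughly like the following. This is a minimal sketch, not the tutorial's own code: the `Extractor` class and the `extract_<kind>` naming scheme are assumptions made up for illustration.

```python
class Extractor:
    """Hypothetical extractor: one method per kind of data, named extract_<kind>."""

    def extract_number(self, text):
        # Keep whitespace-separated tokens made only of digits.
        return [int(tok) for tok in text.split() if tok.isdigit()]

    def extract_upper(self, text):
        # Keep tokens written entirely in upper case.
        return [tok for tok in text.split() if tok.isupper()]

    def run(self, kind, text):
        # Look the handler up by name at run time instead of using if/elif chains.
        handler = getattr(self, "extract_" + kind, None)
        if handler is None:
            raise ValueError("no extractor for " + repr(kind))
        return handler(text)


result = Extractor().run("number", "port 8080 retries 3")
```

    Adding a new kind of extraction then only means adding a method with the right name. The same lookup works on a module object, so combined with the standard library's `importlib.import_module` it can pick up handlers from modules loaded by name at run time.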

  • How come nobody's mentioned Perl yet? I personally can't stand Perl, but it is purpose-built for doing this kind of stuff.

    If you can't stomach Perl either, I'd recommend Ruby - it takes a lot of the powerful things from Perl, like built-in regular expression syntax, but without a lot of the mess. It is also very easy to create Ruby <-> C libraries to handle those transitions.

    Python doesn't have a lot of this, so while it's a great general purpose language, and may or may not be as easy to link to your C/C++ code, I don't think it'd be quite as nice for pure string processing.

  • Perl is definitely the traditional answer - it'll certainly get the job done, and would totally have been my answer two months ago.

    On the other hand, I've recently discovered Python, which I've completely switched over to for this kind of thing. The inclusion of HTML/XML/DOM parsers in the standard library is what did it for me.

    Just the other day I knocked an HTML data extractor / screen scraper together in Python in 20 minutes and ~40 lines of code. Loved every second of it.
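
    As an illustration of the standard-library HTML parser mentioned above, here is a small sketch that collects link targets. It is not the scraper described in this answer; it assumes Python 3's `html.parser` module, and the `LinkCollector` class and sample markup are made up for the example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper that records the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Made-up sample markup, for illustration only.
page = '<p>See <a href="http://example.com/a">A</a> and <a class="x" href="/b">B</a>.</p>'
collector = LinkCollector()
collector.feed(page)
collector.close()
```

    After the call to `feed`, `collector.links` holds the two href values in document order.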

  • Perl, anyone? That's what it was invented for.

  • AWK was designed for that... http://en.wikipedia.org/wiki/Awk

  • Perl is the language that was made specifically for this. Noting that, I would recommend you pick up Python or Ruby, both of which are sort of a modern equivalent to Perl. Perl's syntax can be a little maddening.

    Python is for people who appreciate a very clean language with one way to accomplish each thing.

    Ruby is for people who like being able to do neat "tricks" and accomplish the same thing different ways.

    Oh, and Perl was made as a replacement for scripts using awk, sed, and grep. Don't start with them; they're really odd if you didn't know them from long ago.

  • Chiming in with Perl.

  • This is pretty much what Perl was made for.
