Wednesday, April 20, 2011

How can I repeatedly match from A until B in VIM?

I need to get all text between <Annotation> and </Annotation>, where a word MATCH occurs. How can I do it in VIM?

<Annotation about="MATCH UNTIL </Annotation>   " timestamp="0x000463e92263dd4a" href="     5raS5maS90ZWh0YXZha29rb2VsbWEvbGFza2FyaS8QyrqPk5L9mAI">                                                                        
  <Label name="las" />
  <Label name="_cse_6sbbohxmd_c" />
  <AdditionalData attribute="original_url" value="MATCH UNTIL </Annotation>       " />
</Annotation>
<Annotation about="NO MATCH" href="     Cjl3aWtpLmhlbHNpbmtpLmZpL2Rpc3BsYXkvbWF0aHN0YXRLdXJzc2l0L0thaWtraStrdXJzc2l0LyoQh_HGoJH9mAI">
  <Label name="_cse_6sbbohxmd_c" />
  <Label name="courses" />
  <Label name="kurssit" />
  <AdditionalData attribute="original_url" value="NO MATCH" />
</Annotation>
<Annotation about="MATCH UNTIL </ANNOTATION>     " score="1" timestamp="0x000463e90f8eed5c" href="CiZtYXRoc3RhdC5oZWx     zaW5raS5maS90ZWh0YXZha29rb2VsbWEvKhDc2rv8kP2YAg">
  <Label name="_cse_6sbbohxmd_c" />
  <Label name="exercises_without_solutions" />
  <Label name="tehtäväkokoelma" />
  <AdditionalData attribute="original_url" value="MATCH UNTIL </ANNOTATION>" />
</Annotation>
From stackoverflow
  • Does it have to be done within vim? Could you cheat, and open a second window where you pipe something into more/less that tells you what line number to go to within vim?

    -- edit --

    I have never done a multi-line match/search in vi[m]. However, to cheat in another window:

    perl -n -e 'if ( /<tag/ .. /<\/tag/)' -e '{ print "$.:$_"; }' file.xml | less
    

    will show the elements/blocks for "tag" (or other longer matching names), with line numbers, in less, and you can then search for the other text within each block.

    Close enough?

    -- edit --

    within "less", type

    /MATCH
    

    to search for occurrences of MATCH. On the left margin will be the line number where that instance (within the targeted element/tags) is.

    within vi[m], type

    :n
    

    where "n" is the desired line number.

    Of course, if what you really wanted to do was some kind of search/yank/replace, it's more complicated. At that point, awk / perl / ruby (or something similar which meets your tastes ... or xsl?) is really the tool you should be using for the transformation.

    Eddie : I think something like this will be the only possible answer, as to do this right you need to use an XML parser.
    Masi : Where is the MATCH word supposed to be? In the place of ..?
  • First, a disclaimer: Any attempt to slice and dice XML with regular expressions is fragile; a real XML parser would do better.

    The pattern:

    \(<Annotation\(\s*\w\+="[^"]\{-}"\s\{-}\)*>\)\@<=\(\(<\/Annotation\)\@!\_.\)\{-}"MATCH\_.\{-}\(<\/Annotation>\)\@=
    

    Let's break it down...

    Group 1 is <Annotation\(\s*\w\+="[^"]\{-}"\s\{-}\)*>. It matches the start-tag of the Attribute element. Group 2, which is embedded in Group 1, matches an attribute and may be repeated 0 or more times.

    Group 2 is \s*\w\+="[^"]\{-}"\s\{-}. Most of these pieces are commonly used; the most unusual is \{-}, which means non-greedy repetition (*? in Perl-compatible regular expressions). The non-greedy whitespace match at the end is important for performance; without it, Vim will try every possible way to split the whitespace between attributes between the \s* at the end of Group 2 and the \s* at the beginning of the next occurrence of Group 2.

    Group 1 is followed by \@<=. This is a zero-width positive look-behind. It prevents the start-tag from being included in the matched text (e.g., for s///).

    Group 3 is \(<\/Annotation\)\@!\_.. It includes Group 4, which matches the beginning of the Attribute end-tag. The \@! is a zero-width negative look-ahead and \_. matches any character (including newlines). Together, this groups matches at any character except where the Attribute end-tag starts. Group 3 is followed by a non-greedy repetition marker \{-} so that it matches the smallest block of text before MATCH. If you were to use \_. instead of Group 3, the matched text could include the end-tag of an Annotation element that did not include MATCH and continue through into the next Annotation element with MATCH. (Try it.)

    The next bit is straightforward: Find MATCH and a minimal number of other characters before the end-tag.

    Group 5 is easy: It's the end tag. \@= is a zero-width positive look-ahead, which is included here for the same reason as the \@<= for the start-tag. We have to repeat <\/Attribute rather than use \4 because groups with zero-width modifiers aren't captured.

    Masi : +1 for the explanations. It takes me some time to thoroughly understand them :)

0 comments:

Post a Comment