HowTo: grep Text Between Two Words in Unix / Linux

Last updated August 12, 2012

I have hundreds of HTML files in the following format:

<HTML>
<HEAD>
 <TITLE>Statistics for ABC LTD - January 2007 - Rang IDXYZZAZZZZ</TITLE>
</HEAD>
 
<BODY BGCOLOR="#E8E8E8" TEXT="#000000" LINK="#0000FF" VLINK="#FF0000">
<H2>Statistics for ABC LTF</H2>
<SMALL><STRONG>
Summary Period: January 2007<BR>
Generated 01-Feb-2007 06:40 CET<BR>
</STRONG></SMALL>
<CENTER>
<HR>
<P>
<FONT SIZE="-1"></CENTER><PRE>
 
my data 1
my data 2
my data 3
my data 10000
my data N times
</PRE>

Generated by MyAppDbStatsWriter (UNIX) version 1.9b2



How do I extract the text between two words (<PRE> and </PRE>) on Unix or Linux using the grep command?

The grep command is not well suited to this kind of work, because it matches one line at a time and the text here spans several lines. I suggest that you use the sed command instead. The syntax is:

sed -n "/START-WORD-HERE/,/END-WORD-HERE/p" input
sed -n "/START-WORD-HERE/,/END-WORD-HERE/p" input > output

In this example, extract the text between <PRE> and </PRE> using the sed command:

sed -n "/<PRE>/,/<\/PRE>/p" input.html
sed -n "/<PRE>/,/<\/PRE>/p" input.html > output.html
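To try the command without touching real data, here is a minimal sketch. The temporary file and its contents are made up for illustration, and the variable names `sample` and `extracted` are assumptions; only the sed expression comes from the article.

```shell
# Build a tiny sample file (hypothetical contents, for illustration only).
sample=$(mktemp)
cat > "$sample" <<'EOF'
<HTML>
<BODY>
<PRE>
my data 1
my data 2
</PRE>
</BODY>
</HTML>
EOF

# -n turns off automatic printing; p prints only the lines in the range that
# starts at a line matching <PRE> and ends at a line matching </PRE>.
extracted=$(sed -n '/<PRE>/,/<\/PRE>/p' "$sample")
printf '%s\n' "$extracted"
rm -f "$sample"
```

Note that the range is line based and case sensitive, and that the two marker lines themselves are part of the output.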

Posted by: Vivek Gite

The author is the creator of nixCraft, a seasoned sysadmin, and a trainer for the Linux operating system and Unix shell scripting. He has worked with global clients and in various industries, including IT, education, defense and space research, and the nonprofit sector.

12 comments

  1. awk '/WORD1/,/WORD2/' /path/to/file
      awk '/<PRE>/,/<\/PRE>/' /path/to/file
      awk '/<PRE>/,/<\/PRE>/' /path/to/file > output.file
  2. Thanks for a simple and elegant solution, but it prints START-WORD and END-WORD as well. Is there a simple way to exclude them, for example with something like Vim's \zs or \ze in START-WORD and END-WORD?
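One common way to leave out the marker lines is a flag variable in awk. This is a minimal sketch, not something from the article: the sample lines and the names `sample`, `inner`, and `f` are assumptions, and it assumes each marker sits on its own line (or on a line whose other contents you do not need).

```shell
# Hypothetical sample input, for illustration only.
sample=$(mktemp)
printf '%s\n' 'header' '<PRE>' 'my data 1' 'my data 2' '</PRE>' 'footer' > "$sample"

# f is a flag: it is cleared when the end marker is seen and set only after
# the start marker has been processed. The bare pattern "f" prints the current
# line while the flag is set, so neither marker line is printed.
inner=$(awk '/<\/PRE>/{f=0} f; /<PRE>/{f=1}' "$sample")
printf '%s\n' "$inner"
rm -f "$sample"
```

The order of the three rules matters: the end marker is checked before printing and the start marker after it, which is what keeps both marker lines out of the result.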

  3. @SSengupta: I don’t think you understand the question that Surya, myself, and now billy have asked. Run the example above:
    – copy the example HTML into a file, input.html
    – execute the sed command to extract the text between `pre` and `/pre`
    – you end up with both the pre tags *and* the text in between

    For something where there is just one chunk and you can remove one set of `pre` and `/pre` matches by hand, that is not a big deal. But I was doing this on a big file with tons of matches, so keeping the “bookend” text really doesn’t help me much.

    A workaround is to run the `sed` command above, and then go back through with two other steps:

    sed -i "s/<PRE>//g" output.html
    sed -i "s,</PRE>,,g" output.html

    (Use commas as the separator in the second command, since the usual separator, “/”, appears in the closing pre tag.)
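The extraction and the cleanup can also be done in one sed call. The sketch below is an assumption-laden variant, not the commenter's method: it deletes the whole marker lines rather than only the tag text, so it fits input where each tag sits on its own line. The sample data and the names `sample` and `chunks` are hypothetical.

```shell
# Hypothetical input with two separate <PRE> blocks.
sample=$(mktemp)
printf '%s\n' 'intro' '<PRE>' 'a' 'b' '</PRE>' 'middle' '<PRE>' 'c' '</PRE>' 'outro' > "$sample"

# Inside each <PRE>...</PRE> range, delete the opening and closing marker
# lines and print everything else. -n suppresses lines outside the ranges.
chunks=$(sed -n '/<PRE>/,/<\/PRE>/{
  /<PRE>/d
  /<\/PRE>/d
  p
}' "$sample")
printf '%s\n' "$chunks"
rm -f "$sample"
```

Because the range address restarts at every new `<PRE>` line, this handles a file with many blocks in a single pass and does not need `-i` or an intermediate output file.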

  4. In contrast to what is stated in this article, grep is actually a perfectly good tool for this job! With the -P option it supports both positive and negative lookahead and lookbehind. To extract text from in between HTML tags, for example:

     grep -iPo '(?<=START-WORD-HERE)(.*)(?=END-WORD-HERE)'

    Now, I don’t know if it’s faster than sed or awk, but it works! :-)
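As an illustration of the lookaround approach, here is a minimal sketch that pulls the text out of a one-line TITLE element like the one in the sample file. It assumes GNU grep built with PCRE support (the `-P` option is not available in every grep), and the sample contents and the names `sample` and `title` are made up for the example.

```shell
# Hypothetical sample input, for illustration only.
sample=$(mktemp)
printf '%s\n' '<HEAD>' ' <TITLE>Statistics for ABC LTD</TITLE>' '</HEAD>' > "$sample"

# -P: Perl-compatible regex (GNU grep), -o: print only the matched part,
# -i: ignore case. The lookbehind and lookahead anchor the match between the
# tags without including the tags themselves in the output.
title=$(grep -iPo '(?<=<title>).*(?=</title>)' "$sample")
printf '%s\n' "$title"
rm -f "$sample"
```

grep still matches one line at a time here, so this works when both tags are on the same line; a block that spans several lines, such as the PRE section in the article, is easier to handle with sed or awk.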
