
HowTo: grep Text Between Two Words in Unix / Linux

I have hundreds of HTML files in the following format:

<HTML>
<HEAD>
 <TITLE>Statistics for ABC LTD - January 2007 - Rang IDXYZZAZZZZ</TITLE>
</HEAD>
 
<BODY BGCOLOR="#E8E8E8" TEXT="#000000" LINK="#0000FF" VLINK="#FF0000">
<H2>Statistics for ABC LTD</H2>
<SMALL><STRONG>
Summary Period: January 2007<BR>
Generated 01-Feb-2007 06:40 CET<BR>
</STRONG></SMALL>
<CENTER>
<HR>
<P>
<FONT SIZE="-1"></CENTER><PRE>
 
my data 1
my data 2
my data 3
my data 10000
my data N times
</PRE>

Generated by MyAppDbStatsWriter (UNIX) version 1.9b2



How do I extract the text between two words (<PRE> and </PRE>) on Unix or Linux using the grep command?

The grep command matches line by line, so it is not well suited to extracting a block that spans several lines. I suggest that you use the sed command instead. The syntax is:

sed -n "/START-WORD-HERE/,/END-WORD-HERE/p" input
sed -n "/START-WORD-HERE/,/END-WORD-HERE/p" input > output

In this example, sed extracts the text between <PRE> and </PRE>. The lines that contain the two tags are printed as well:

sed -n "/<PRE>/,/<\/PRE>/p" input.html
sed -n "/<PRE>/,/<\/PRE>/p" input.html > output.html
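
To try the command without touching real data, here is a minimal sketch that builds two stand-in reports in a scratch directory and runs the same sed range over each of them. The file names (report1.html, report2.html) and their contents are made up for illustration. The -n option suppresses sed's default printing, and p prints only the lines inside the range.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two small stand-ins for the HTML reports (hypothetical names and contents).
for n in 1 2; do
  cat > "report$n.html" <<EOF
<HTML>
<BODY>
<FONT SIZE="-1"></CENTER><PRE>
my data $n
</PRE>
Generated by MyAppDbStatsWriter (UNIX) version 1.9b2
</BODY>
</HTML>
EOF
done

# Run the same range on every report; each result goes to its own .txt file.
for f in report*.html; do
  sed -n '/<PRE>/,/<\/PRE>/p' "$f" > "${f%.html}.txt"
done

cat report1.txt
```

Each .txt file should then hold the line with <PRE>, the data lines, and the line with </PRE>. Writing to a separate file per input keeps the originals intact.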
{ 12 comments }
  • Arash August 12, 2012, 4:27 pm

    Easy and practical, thanks.
    I just wanted to know: is it possible to use `awk` instead of `sed`?

    • nixCraft August 12, 2012, 9:41 pm
      awk '/WORD1/,/WORD2/' /path/to/file
      awk '/<PRE>/,/<\/PRE>/' /path/to/file
      awk '/<PRE>/,/<\/PRE>/' /path/to/file > output.file
      
  • Surya December 2, 2012, 8:34 am

    Thanks for a simple and elegant solution. But it prints START-WORD and END-WORD as well. Is there a simple way to exclude these, for example with something like \zs or \ze (from Vim) in START-WORD and END-WORD?

  • jwhendy May 20, 2013, 11:28 pm

    I have the same inquiry as Surya — how do you omit the start and end regexp’s?
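
One possible way to leave out the start and end lines with awk is to track a flag instead of using the range form. This is only a sketch: the file name awk-sample.html, its contents, and the variable name inside are made up for illustration, and it assumes each tag sits on a line of its own.

```shell
# Stand-in input (hypothetical contents).
printf '%s\n' 'header' '<PRE>' 'my data 1' 'my data 2' '</PRE>' 'footer' > awk-sample.html

# Order matters: the end test runs before printing and the start test after it,
# so neither boundary line is printed.
inner=$(awk '/<\/PRE>/ {inside = 0} inside {print} /<PRE>/ {inside = 1}' awk-sample.html)
printf '%s\n' "$inner"
```

Because the flag is cleared before the print rule and set after it, the line that opens the block and the line that closes it are both skipped. If a tag shares a line with data, that whole line is dropped too.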

  • SSengupta January 28, 2014, 12:31 pm

    Replace the words START-WORD & END-WORD with your keyword.

  • billy February 13, 2014, 1:17 pm

    Whenever I use the sed or awk version of

    /word1/,/word2/

    on a file that contains

    word1 and word2

    it prints

    word1 and word2

    and not just

    and
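
The /word1/,/word2/ range selects whole lines, so it cannot trim text inside a single line. When both words sit on the same line, a substitution with a capture group is one way to keep only what lies between them. This is a sketch under assumptions: the file name one-line.txt is made up, -E (extended regular expressions) is assumed to be supported by your sed, and only one match per line is handled.

```shell
# The one-line file from the comment above.
printf '%s\n' 'word1 and word2' > one-line.txt

# Capture what lies between the two words and print only the capture group.
between=$(sed -nE 's/.*word1[[:space:]]*(.*[^[:space:]])[[:space:]]*word2.*/\1/p' one-line.txt)
printf '%s\n' "$between"
```

The [[:space:]]* parts trim the whitespace around the captured text. Lines that do not contain both words produce no output because of -n.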

  • jwhendy February 13, 2014, 9:27 pm

    @SSengupta: I don’t think you understand the question that Surya, I, and now billy have asked. Try the example above:
    – copy the example HTML into a file, input.html
    – execute the sed command to extract the text between `pre` and `/pre`
    – you end up with both the pre tags *and* the text in between

    When there is just one chunk and you only need to remove one set of `pre` and `/pre` matches, that is not a big deal. But I was doing this on a big file with tons of matches, so keeping the “bookend” text really doesn’t help me much.

    A workaround is to run the `sed` command above, and then go back through with two more steps:

    sed -i "s/<PRE>//g" output.html
    sed -i "s,</PRE>,,g" output.html

    (The second command uses commas because the usual separator, “/”, appears in the closing pre tag.)

    In short: just manually sed the pre tags out afterward.
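
The two clean-up passes can probably be folded into the extraction itself. Here is a minimal sketch, assuming each tag sits on its own line (any other text on a tag line is dropped with it); the file name sed-sample.html and its contents are made up for illustration.

```shell
# Stand-in input (hypothetical contents).
printf '%s\n' 'header' '<PRE>' 'my data 1' 'my data 2' '</PRE>' 'footer' > sed-sample.html

# Inside the range, delete the two boundary lines and print everything else.
body=$(sed -n '/<PRE>/,/<\/PRE>/{/<PRE>/d;/<\/PRE>/d;p;}' sed-sample.html)
printf '%s\n' "$body"
```

This works for any number of <PRE> blocks in one file, since the range restarts each time the start pattern matches again.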

  • Andi F. July 12, 2016, 7:00 pm

    In contrast to what is stated in this article, grep is actually a perfect tool for this job! It supports both positive and negative look-ahead and look-behind. To extract text from in between HTML tags, for example:

     grep -iPo '(?<=)(.*)(?=)'
    

    Now. I don’t know if it’s faster than sed or awk, but it works! :-)

    • Andi F. July 12, 2016, 7:01 pm

      It seems my code was interpreted as HTML and not fully displayed. I really don’t know a way around that, so never mind, I guess.
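
The pattern in the comment above lost its tags and cannot be recovered, so the following is only a guess at the kind of command that was meant. It uses the TITLE element from the sample page; the file name title-sample.html is made up. It also assumes GNU grep built with -P (Perl-compatible regular expression) support. Since grep still works line by line, this only helps when both tags are on the same line.

```shell
# One-line stand-in based on the sample page's TITLE element.
printf '%s\n' '<TITLE>Statistics for ABC LTD - January 2007</TITLE>' > title-sample.html

# (?<=...) and (?=...) require the tags around the match without including them.
# -o prints only the matched part, -i ignores case, -P enables Perl-style regexes.
title=$(grep -ioP '(?<=<TITLE>).*(?=</TITLE>)' title-sample.html)
printf '%s\n' "$title"
```

For a block that spans several lines, such as the <PRE> section in the article, the sed and awk ranges remain the simpler choice.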
