HowTo: grep Text Between Two Words in Unix / Linux

August 12, 2012 · 9 comments · Last updated August 12, 2012


I have hundreds of HTML files in the following format:

 
<HTML>
<HEAD>
 <TITLE>Statistics for ABC LTD - January 2007 - Rang IDXYZZAZZZZ</TITLE>
</HEAD>
 
<BODY BGCOLOR="#E8E8E8" TEXT="#000000" LINK="#0000FF" VLINK="#FF0000">
<H2>Statistics for ABC LTF</H2>
<SMALL><STRONG>
Summary Period: January 2007<BR>
Generated 01-Feb-2007 06:40 CET<BR>
</STRONG></SMALL>
<CENTER>
<HR>
<P>
<FONT SIZE="-1"></CENTER><PRE>
 
my data 1
my data 2
my data 3
my data 10000
my data N times
</PRE></FONT>
</CENTER>
<P>
<HR>
<TABLE WIDTH="100%" CELLPADDING=0 CELLSPACING=0 BORDER=0>
<TR>
<TD ALIGN=left VALIGN=top>
<SMALL>Generated by MyAppDbStatsWriter (UNIX) version 1.9b2</A>
</SMALL>
</TD>
</TR>
</TABLE>
</BODY>
</HTML>
 

How do I extract the text between two words (<PRE> and </PRE>) on Unix or Linux using the grep command?

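The grep command is not well suited to this kind of work: it matches one line at a time and has no way to select a range of lines between two patterns. I suggest that you use the sed command instead. The syntax is: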

 
sed -n "/START-WORD-HERE/,/END-WORD-HERE/p" input
sed -n "/START-WORD-HERE/,/END-WORD-HERE/p" input > output
 

In this example, sed extracts the text between <PRE> and </PRE>, including the lines that contain the tags themselves:

 
sed -n "/<PRE>/,/<\/PRE>/p" input.html
sed -n "/<PRE>/,/<\/PRE>/p" input.html > output.html
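A quick way to check this behaviour is to run the same range on a small sample. The sketch below is illustrative only: the sample text and the extract_block function name are assumptions made for the demo, not part of the original question. Because the range prints whole lines, the lines holding <PRE> and </PRE> appear in the output along with the data.

```shell
#!/bin/sh
# Hypothetical miniature of input.html, kept in a variable for the demo.
sample='<HR>
<FONT SIZE="-1"></CENTER><PRE>
my data 1
my data 2
</PRE></FONT>
<HR>'

# Print every line from the first line matching <PRE>
# through the next line matching </PRE> (both inclusive).
extract_block() {
    sed -n '/<PRE>/,/<\/PRE>/p'
}

printf '%s\n' "$sample" | extract_block
```

The two <HR> lines fall outside the range and are dropped; the four lines from the opening tag through the closing tag are printed.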
 

{ 9 comments }

1 Arash August 12, 2012 at 4:27 pm

Easy and practical, thanks.
I just wanted to know is it possible to use `awk` instead of `sed` ?


2 nixCraft August 12, 2012 at 9:41 pm
awk '/WORD1/,/WORD2/' /path/to/file
awk '/<PRE>/,/<\/PRE>/' /path/to/file
awk '/<PRE>/,/<\/PRE>/' /path/to/file > output.file


3 Surya December 2, 2012 at 8:34 am

Thanks for a simple & elegant solution. But it prints START-WORD & END-WORD as well. Is there a simple way to exclude these, for example by including something like \zs or \ze (from Vim) in the START-WORD & END-WORD?


4 jwhendy May 20, 2013 at 11:28 pm

I have the same inquiry as Surya — how do you omit the start and end regexp’s?

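One common way to leave out the START-WORD and END-WORD lines is to track state with a flag in awk instead of using a range. This is only a sketch, assuming each tag sits on a different line from the data you want; the function name inner_lines and the variable name inside are made up for illustration.

```shell
#!/bin/sh
# Print only the lines strictly between a line matching <PRE>
# and the next line matching </PRE>. Reads standard input.
inner_lines() {
    awk '
        /<\/PRE>/ { inside = 0 }   # closing line: switch off before printing
        inside                     # print the line only while the flag is set
        /<PRE>/   { inside = 1 }   # opening line: switch on after it is skipped
    '
}

printf '%s\n' '<HR>' '<PRE>' 'my data 1' 'my data 2' '</PRE>' '<HR>' | inner_lines
```

The order of the three rules matters: the closing tag clears the flag before the print rule runs, and the opening tag sets it only after the print rule has already skipped that line. If both tags appear on the same line, this approach will not do what you want.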

5 SSengupta January 28, 2014 at 12:31 pm

Replace the words START-WORD & END-WORD with your keyword.


6 billy February 13, 2014 at 1:17 pm

Whenever I use the sed or awk version of

/word1/,/word2/

Say I have a file containing

word1 and word2

it prints

word1 and word2

and not just

and

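The /word1/,/word2/ form selects whole lines, so when both words are on the same line the entire line is printed. For that single-line case, a substitution with a capture group may be closer to what is wanted. A minimal sketch, assuming the two words contain no "/" and no characters that are special in basic regular expressions; between_words is a hypothetical helper name.

```shell
#!/bin/sh
# Print the text between the start word ($1) and the end word ($2)
# for each input line that contains both, in that order.
# Both arguments are interpolated into the sed script unescaped,
# so they are treated as basic regular expressions.
between_words() {
    sed -n "s/.*$1\(.*\)$2.*/\1/p"
}

printf 'word1 and word2\n' | between_words word1 word2
```

For the line "word1 and word2" this prints " and ", with the surrounding spaces kept because they sit between the two words. Lines that do not contain both words produce no output.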

7 jwhendy February 13, 2014 at 9:23 pm

@SSengupta: I don’t think you understand the question that Surya, myself, and now billy have asked. Conduct the example above:
- copy the example html into a file, input.html
- execute the sed command to “extract the text between `pre` and `/pre`”
- You end up with “

the text

For something where this is just one “chunk” and you can just remove on set of `pre` and `/pre` matches, not a big deal. But I was doing this on a big file with tons of matches and thus keeping the “bookend” text really doesn’t help me much.

A workaround is to do the `sed` command above, and then go back through with two other steps:
- sed -i “s/

//g” output.html
- set -i “s,

,,g” output.html

(Use commas for second command since typical separator, “/” is used in the closing pre tag)


8 jwhendy February 13, 2014 at 9:27 pm

Yeah, that went horribly… Trying once more:
—–
@SSengupta: I don’t think you understand the question that Surya, myself, and now billy have asked. Conduct the example above:
- copy the example html into a file, input.html
- execute the sed command to extract the text between `pre` and `/pre`
- you end up with both the pre tags *and* the text in between

For something where this is just one chunk and you can just remove one set of `pre` and `/pre` matches, not a big deal. But I was doing this on a big file with tons of matches and thus keeping the “bookend” text really doesn’t help me much.

A workaround is to do the `sed` command above, and then go back through with two other steps:

sed -i "s/<PRE>//g" output.html
sed -i "s,</PRE>,,g" output.html

(Use commas for second command since typical separator, “/” is used in the closing pre tag).

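The separate clean-up passes can probably be folded into the extraction itself, which helps when a file has many <PRE> blocks. This is a sketch rather than a definitive fix: it assumes you are happy to drop the whole line that carries each tag, and inner_lines_sed is a hypothetical name.

```shell
#!/bin/sh
# Within each <PRE> ... </PRE> range, delete the lines that hold the
# tags and print everything else. Reads standard input.
inner_lines_sed() {
    sed -n '/<PRE>/,/<\/PRE>/{
        /<PRE>/d
        /<\/PRE>/d
        p
    }'
}

printf '%s\n' '<PRE>' 'a' '</PRE>' 'junk' '<PRE>' 'b' '</PRE>' | inner_lines_sed
```

The range restarts at every new <PRE>, so each block in the file is handled in a single pass. Unlike the two substitutions above, this removes the entire tag line, so any data sharing a line with a tag is lost.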

9 jwhendy February 13, 2014 at 9:33 pm

You get the idea… just manually sed the pre tags out afterward. Sorry, huge comment fail…

