I have hundreds of HTML files in the following format:
<HTML> <HEAD> <TITLE>Statistics for ABC LTD - January 2007 - Rang IDXYZZAZZZZ</TITLE> </HEAD> <BODY BGCOLOR="#E8E8E8" TEXT="#000000" LINK="#0000FF" VLINK="#FF0000"> <H2>Statistics for ABC LTF</H2> <SMALL><STRONG> Summary Period: January 2007<BR> Generated 01-Feb-2007 06:40 CET<BR> </STRONG></SMALL> <CENTER> <HR> <P> <FONT SIZE="-1"></CENTER><PRE> my data 1 my data 2 my data 3 my data 10000 my data N times
Generated by MyAppDbStatsWriter (UNIX) version 1.9b2 |
Easy and practical, thanks.
I just wanted to know: is it possible to use `awk` instead of `sed`?
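A minimal sketch of an awk equivalent, assuming the article's sed command is the usual range form `sed -n '/START-WORD/,/END-WORD/p'`. The sample file and its contents are made up for illustration.

```shell
# Hypothetical sample input; START-WORD and END-WORD are the placeholder markers.
sample=$(mktemp)
printf '%s\n' 'header' 'START-WORD' 'line a' 'line b' 'END-WORD' 'footer' > "$sample"

# sed: -n suppresses default output, p prints only lines inside the range.
sed_range=$(sed -n '/START-WORD/,/END-WORD/p' "$sample")

# awk: a range pattern with no action prints every line inside the range.
awk_range=$(awk '/START-WORD/,/END-WORD/' "$sample")

printf '%s\n' "$awk_range"
rm -f "$sample"
```

Both commands select whole lines, and both include the lines that contain the markers.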
Thanks for a simple & elegant solution. But it prints START-WORD & END-WORD as well. Is there a simple way to exclude these, for example with something like Vim's \zs or \ze in the START-WORD & END-WORD patterns?
I have the same inquiry as Surya — how do you omit the start and end regexp’s?
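One possible way to drop the marker lines, sketched under the assumption that each marker sits on its own line (the sample data is hypothetical). Neither sed nor awk has a direct counterpart to \zs and \ze, so this filters the marker lines out instead.

```shell
# Hypothetical sample input with the markers on their own lines.
sample=$(mktemp)
printf '%s\n' 'header' 'START-WORD' 'line a' 'line b' 'END-WORD' 'footer' > "$sample"

# awk flag idiom: clear the flag on the end marker, print while the flag is set,
# and set the flag only after the start marker line has been passed.
inner_awk=$(awk '/END-WORD/{f=0} f; /START-WORD/{f=1}' "$sample")

# sed: inside the range, delete the two marker lines and print the rest.
inner_sed=$(sed -n '/START-WORD/,/END-WORD/{/START-WORD/d;/END-WORD/d;p;}' "$sample")

printf '%s\n' "$inner_awk"
rm -f "$sample"
```

The order of the three awk rules matters: putting the start-marker rule first would print the START-WORD line itself.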
Replace the words START-WORD & END-WORD with your keyword.
Whenever I use the sed or awk version of
/word1/,/word2/
it prints the whole matching line. Say I have a file containing
word1 and word2
it prints
word1 and word2
and not just
and
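The range form selects whole lines, so it cannot trim text within a line. When both words are on the same line, a substitution with a capture group is one option; this is only a sketch, and the variable names are made up.

```shell
line='word1 and word2'

# Capture whatever sits between word1 and word2 and print only that capture.
# -n plus the p flag prints a line only if the substitution succeeded.
between=$(printf '%s\n' "$line" | sed -n 's/.*word1\(.*\)word2.*/\1/p')

# Brackets make the surrounding spaces visible.
printf '[%s]\n' "$between"
```

The capture keeps the spaces around "and". Because `.*` is greedy, a line with several word1 or word2 occurrences may not split where you expect.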
@SSengupta: I don’t think you understand the question that Surya, myself, and now billy have asked. Conduct the example above:
– copy the example html into a file, input.html
– execute the sed command to “extract the text between `pre` and `/pre`”
– You end up with “
”
For something where this is just one “chunk” and you can just remove one set of `pre` and `/pre` matches, not a big deal. But I was doing this on a big file with tons of matches and thus keeping the “bookend” text really doesn’t help me much.
A workaround is to do the `sed` command above, and then go back through with two other steps:
– sed -i “s/
,,g” output.html
(Use commas for second command since typical separator, “/” is used in the closing pre tag)
Yeah, that went horribly… Trying once more:
—–
@SSengupta: I don’t think you understand the question that Surya, myself, and now billy have asked. Conduct the example above:
– copy the example html into a file, input.html
– execute the sed command to extract the text between `pre` and `/pre`
– you end up with both the pre tags *and* the text in between
For something where this is just one chunk and you can just remove one set of `pre` and `/pre` matches, not a big deal. But I was doing this on a big file with tons of matches and thus keeping the “bookend” text really doesn’t help me much.
A workaround is to do the `sed` command above, and then go back through with two other steps:
,,g" output.html
(Use commas for second command since typical separator, “/” is used in the closing pre tag).
You get the idea… just manually sed the pre tags out afterward. Sorry, huge comment fail…
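The workaround described above might look roughly like this. The exact commands were lost in the comment, so this is a guess at their shape, with hypothetical file contents and lowercase tags assumed. It pipes through sed rather than using `-i`, since in-place editing differs between GNU and BSD sed.

```shell
# Hypothetical input with the tags on the same lines as the data.
input=$(mktemp)
printf '%s\n' '<html>' '<pre>my data 1' 'my data 2' 'my data 3</pre>' '</html>' > "$input"

# Step 1: range extraction, which keeps the pre tags.
# Steps 2 and 3: strip the opening tag, then the closing tag.
# The last expression uses commas as the delimiter so the "/" in </pre>
# does not need escaping.
cleaned=$(sed -n '/<pre>/,/<\/pre>/p' "$input" \
  | sed -e 's/<pre>//g' -e 's,</pre>,,g')

printf '%s\n' "$cleaned"
rm -f "$input"
```

If a tag sits on a line by itself, this leaves an empty line behind where the tag used to be.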
In contrast to what is stated in this article, grep is actually a perfect tool for this job! With the `-P` (PCRE) option, GNU grep supports both positive and negative look-ahead and look-behind. To extract text from in between HTML tags, for example:
Now. I don’t know if it’s faster than sed or awk, but it works! :-)
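The original command did not survive, so here is a sketch of the kind of grep call being described. It assumes GNU grep built with PCRE support (`-P`), which the default grep on BSD and macOS does not provide; the sample string is made up.

```shell
html='<p>intro</p><pre>my data 1</pre><p>outro</p>'

# -o prints only the matched part, -P enables Perl-compatible regexes.
# (?<=<pre>) is a look-behind, (?=</pre>) a look-ahead, and .*? is non-greedy,
# so the tags themselves are not part of the match.
extracted=$(printf '%s\n' "$html" | grep -oP '(?<=<pre>).*?(?=</pre>)')

printf '%s\n' "$extracted"
```

grep matches one line at a time, so this only works when the opening and closing tags are on the same line.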
It seems my code was interpreted as HTML and not fully displayed. I really don’t know a way around that, so never mind, I guess.
Put it between pre html tags.