C# - How can I extract just the text from HTML?
I have a requirement to extract all the text present in the <body> of an HTML document. Sample HTML input:
    <html>
      <title>title</title>
      <body>
        <h1> big title.</h1>
        how doing you?
        <h3> fine </h3>
        <img src="abc.jpg"/>
      </body>
    </html>
The output should be:
    big title. how doing you? fine
I want to use HtmlAgilityPack for this purpose. No regular expressions, please.
I know how to load an HtmlDocument, and with an XPath query like '//body' I can get the body contents. But how do I strip the HTML, as I have shown in the output?
Thanks in advance :)
You can use the body's InnerText:
    // Requires: using HtmlAgilityPack;
    string html = @"
    <html>
      <title>title</title>
      <body>
        <h1> big title.</h1>
        how doing you?
        <h3> fine </h3>
        <img src=""abc.jpg""/>
      </body>
    </html>";

    // Parse the markup, then read the text content of the <body> node.
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);
    string text = doc.DocumentNode.SelectSingleNode("//body").InnerText;
Next, you may want to collapse spaces and new lines:
    // Requires: using System.Text.RegularExpressions;
    text = Regex.Replace(text, @"\s+", " ").Trim();
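Since the question asks to avoid regular expressions, the same whitespace collapsing can be sketched with plain string methods. This is only a minimal sketch: the class name `Whitespace` and the method name `Collapse` are made up for illustration and are not part of HtmlAgilityPack.

```csharp
using System;

static class Whitespace
{
    // Hypothetical helper: collapses every run of whitespace
    // (spaces, tabs, new lines) into one space and trims both ends.
    public static string Collapse(string text)
    {
        // A null separator array makes Split treat all whitespace
        // characters as delimiters; empty entries are discarded.
        string[] words = text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        return string.Join(" ", words);
    }
}
```

With this in place, `text = Whitespace.Collapse(text);` could stand in for the Regex.Replace call.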
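Note, however, that while this works in this case, markup such as hello<br>world or hello<i>world</i> is converted by InnerText to helloworld, because the tags are simply removed. This issue is difficult to solve in general, because how text is displayed is often determined by CSS, not just by the markup.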
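One partial workaround for the hello<br>world case is to walk the node tree yourself and insert a space around elements that usually break the text flow. The sketch below is an assumption-laden illustration, not a complete solution: the class `TextExtractor`, the method `ExtractText`, and the lists of "breaking" and "skipped" tags are hypothetical choices, and, as noted above, CSS can still change how any element is actually displayed.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using HtmlAgilityPack;

static class TextExtractor
{
    // Assumed list of tags that act as word boundaries; adjust as needed.
    static readonly HashSet<string> BreakingTags =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { "br", "p", "div", "li", "tr", "td", "h1", "h2", "h3", "h4", "h5", "h6" };

    // Assumed list of tags whose content is not visible text.
    static readonly HashSet<string> SkippedTags =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "script", "style" };

    public static string ExtractText(HtmlNode root)
    {
        var sb = new StringBuilder();
        Walk(root, sb);
        // Collapse whitespace without regular expressions.
        string[] words = sb.ToString().Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        return string.Join(" ", words);
    }

    static void Walk(HtmlNode node, StringBuilder sb)
    {
        if (node.NodeType == HtmlNodeType.Comment || SkippedTags.Contains(node.Name))
            return;

        if (node.NodeType == HtmlNodeType.Text)
        {
            // Decode entities such as &amp; into plain characters.
            sb.Append(HtmlEntity.DeEntitize(node.InnerText));
            return;
        }

        bool breaks = BreakingTags.Contains(node.Name);
        if (breaks) sb.Append(' ');
        foreach (HtmlNode child in node.ChildNodes)
            Walk(child, sb);
        if (breaks) sb.Append(' ');
    }
}
```

It could be called as `TextExtractor.ExtractText(doc.DocumentNode.SelectSingleNode("//body"))`. Under these assumptions, hello<br>world would come out as two separate words, while hello<i>world</i> would still be joined, since <i> is an inline element.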