C# - How can I extract just the text from HTML?


I have a requirement to extract all the text present in the <body> of an HTML document. Sample HTML input:

<html>
    <title>title</title>
    <body>
           <h1> big title.</h1>
           how doing you?
           <h3> fine </h3>
           <img src="abc.jpg"/>
    </body>
</html>

The output should be:

big title. how doing you? fine

I want to use HtmlAgilityPack for this purpose. No regular expressions, please.

I know how to load an HtmlDocument, and by using the XPath expression '//body' I can get the body contents. But how do I strip the HTML tags, as I have shown in the output?

Thanks in advance :)

You can use the body's InnerText:

using HtmlAgilityPack;

string html = @"
<html>
    <title>title</title>
    <body>
           <h1> big title.</h1>
           how doing you?
           <h3> fine </h3>
           <img src=""abc.jpg""/>
    </body>
</html>";

// Parse the markup and read the text content of the <body> element.
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
string text = doc.DocumentNode.SelectSingleNode("//body").InnerText;
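The snippet above assumes the document always has a <body>. SelectSingleNode returns null when the XPath matches nothing, and InnerText typically leaves entities such as &amp;amp; encoded. Here is a minimal sketch of a more defensive wrapper; the class name BodyText and the method name Extract are made up for illustration, while HtmlEntity.DeEntitize is HtmlAgilityPack's own entity decoder.

```csharp
using System;
using HtmlAgilityPack;

static class BodyText
{
    // Hypothetical helper: returns the text of <body>, or "" when there is no <body>.
    public static string Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when the XPath matches nothing.
        HtmlNode body = doc.DocumentNode.SelectSingleNode("//body");
        if (body == null)
            return string.Empty;

        // Decode entities such as &amp; that may still be present in InnerText.
        return HtmlEntity.DeEntitize(body.InnerText);
    }
}
```

Returning an empty string for a missing body is only one possible choice; throwing an exception may suit stricter callers better.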

Next, you may want to collapse spaces and new lines:

// Requires: using System.Text.RegularExpressions;
text = Regex.Replace(text, @"\s+", " ").Trim();
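Since the question asks for no regular expressions, the same collapsing can probably be done with plain string methods. The sketch below uses only the standard library; the names Whitespace and Collapse are invented for this example. For ordinary spaces, tabs and new lines it should give the same result as the Regex.Replace call.

```csharp
using System;

static class Whitespace
{
    // Hypothetical helper: collapses runs of white space into single spaces
    // and drops leading and trailing white space.
    public static string Collapse(string text)
    {
        // A null separator array makes Split break on any white-space character;
        // RemoveEmptyEntries discards the empty pieces between consecutive separators.
        string[] words = text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        return string.Join(" ", words);
    }
}
```

Usage would look like text = Whitespace.Collapse(text); in place of the regex line.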

Note, however, that while this works in this case, markup such as hello<br>world or hello<i>world</i> is converted by InnerText to helloworld: the tags are simply removed. This issue is difficult to solve, because how text is displayed is often determined by CSS, not just by the markup.
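One partial workaround is to walk the node tree yourself and insert a space around elements that usually break the text visually. This is only a sketch under stated assumptions: the list of "breaking" tag names is a guess and ignores CSS entirely, and the names SpacedText, Breaking and Extract are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using HtmlAgilityPack;

static class SpacedText
{
    // Assumed set of elements that separate text visually; adjust as needed.
    static readonly HashSet<string> Breaking = new HashSet<string>
    {
        "br", "p", "div", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"
    };

    // Hypothetical helper: collects the text under a node, putting spaces
    // around breaking elements, then collapses white space.
    public static string Extract(HtmlNode node)
    {
        var sb = new StringBuilder();
        Append(node, sb);
        string[] words = sb.ToString().Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        return string.Join(" ", words);
    }

    static void Append(HtmlNode node, StringBuilder sb)
    {
        if (node.NodeType == HtmlNodeType.Text)
        {
            sb.Append(node.InnerText);
            return;
        }

        // Skip comments, and skip script and style content, which is not visible text.
        if (node.NodeType == HtmlNodeType.Comment || node.Name == "script" || node.Name == "style")
            return;

        bool breaks = Breaking.Contains(node.Name);
        if (breaks) sb.Append(' ');
        foreach (HtmlNode child in node.ChildNodes)
            Append(child, sb);
        if (breaks) sb.Append(' ');
    }
}
```

With this sketch, hello<br>world becomes "hello world", while hello<i>world</i> still becomes "helloworld" because i is treated as inline. Pages that use CSS to change how elements are displayed will still not be handled correctly.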

