
How To Parse Html From Javafx Webview And Transfer This Data To Jsoup Document?

I am trying to parse the sidebar TOC (table of contents) of a documentation site. I have tried Jsoup, but I cannot get the TOC elements because the HTML content in this tag is not p

Solution 1:

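    // needs javafx.concurrent.Worker, javafx.scene.web.WebView/WebEngine and org.jsoup.Jsoup
    WebView browser = new WebView();
    WebEngine webEngine = browser.getEngine();
    String url = "https://docs.microsoft.com/en-us/ef/ef6/";
    // load() is asynchronous, so read the document only after loading has succeeded
    webEngine.getLoadWorker().stateProperty().addListener((obs, oldState, newState) -> {
        if (newState == Worker.State.SUCCEEDED) {
            // get the W3C document from the WebEngine
            org.w3c.dom.Document w3cDocument = webEngine.getDocument();
            // use the jsoup helper to serialize it to a string
            String html = new org.jsoup.helper.W3CDom().asString(w3cDocument);
            // create a jsoup document by parsing the HTML (second argument is the base URI)
            org.jsoup.nodes.Document doc = Jsoup.parse(html, url);
        }
    });
    webEngine.load(url);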

Solution 2:

I can't promise this is the best way as I've not used Jsoup before and I'm not an expert on the XML API.

The org.jsoup.Jsoup class has a method for parsing HTML in String form: Jsoup.parse(String). This means we need to get the HTML from the WebView as a String. The WebEngine class has a document property that holds an org.w3c.dom.Document, which represents the HTML content of the web page currently being shown. We just need to convert this Document into a String, which we can do with a Transformer.

    import java.io.StringWriter;
    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerException;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.dom.DOMSource;
    import javax.xml.transform.stream.StreamResult;
    import org.jsoup.Jsoup;

    public class Utils {

      private static Transformer transformer;

      // not thread safe
      public static org.jsoup.nodes.Document convert(org.w3c.dom.Document doc)
          throws TransformerException {
        if (transformer == null) {
          transformer = TransformerFactory.newDefaultInstance().newTransformer();
        }

        StringWriter writer = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(writer));
        return Jsoup.parse(writer.toString());
      }

    }

You would call this every time the document property changes. I did some "tests" by browsing Google and printing the org.jsoup.nodes.Document to the console and everything seems to be working.
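The Transformer step can also be tried on its own, without a WebView. The following is only a minimal sketch using JDK classes: the names DomToHtml, toHtml and sampleDocument are made up for illustration, and the small hand-built DOM merely stands in for whatever webEngine.getDocument() would return. It sets the output method explicitly to HTML, on the assumption that HTML serialization is what you want to hand to Jsoup.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper, not part of JavaFX or jsoup.
final class DomToHtml {

    // Serializes a W3C DOM document to a String using the HTML output method.
    static String toHtml(Document doc) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.METHOD, "html");
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        transformer.setOutputProperty(OutputKeys.INDENT, "no");
        StringWriter writer = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(writer));
        return writer.toString();
    }

    // Builds a tiny DOM as a stand-in for webEngine.getDocument().
    static Document sampleDocument() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element html = doc.createElement("html");
        Element body = doc.createElement("body");
        Element list = doc.createElement("ul");
        Element item = doc.createElement("li");
        item.setTextContent("Get started");
        list.appendChild(item);
        body.appendChild(list);
        html.appendChild(body);
        doc.appendChild(html);
        return doc;
    }

    public static void main(String[] args) throws Exception {
        // Print the serialized markup; this String is what would be passed to Jsoup.parse
        System.out.println(toHtml(sampleDocument()));
    }
}
```

In an application, the same conversion could be triggered from a listener registered with webEngine.documentProperty().addListener(...), skipping the null value the property holds while a page is still loading.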

There is a caveat, though: as far as I understand it, the document property does not change when there are changes within the same web page (the Document itself may be updated, however). I'm not a web person, so pardon me if I don't make sense here, but I believe this includes things like a frame changing its content. There may be a way around this by interfacing with the JavaScript using WebEngine.executeScript(String), but I don't know how.
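One possible way to use WebEngine.executeScript(String) for this, offered as a sketch rather than a tested solution: ask the page for document.documentElement.outerHTML, which serializes the live DOM as it currently stands. The class LiveDomSnapshot and its members are hypothetical names. To keep the example free of JavaFX and jsoup dependencies, the script runner is passed in as a function; in a real application it would be webEngine::executeScript.

```java
import java.util.function.Function;

// Hypothetical helper, not part of JavaFX or jsoup.
final class LiveDomSnapshot {

    // JavaScript expression that serializes the current state of the page's DOM.
    static final String OUTER_HTML_SCRIPT = "document.documentElement.outerHTML";

    // scriptRunner is assumed to be webEngine::executeScript, which returns
    // a java.lang.String when the script evaluates to a JavaScript string.
    static String currentHtml(Function<String, Object> scriptRunner) {
        Object result = scriptRunner.apply(OUTER_HTML_SCRIPT);
        if (!(result instanceof String)) {
            throw new IllegalStateException("Script did not return a string: " + result);
        }
        return (String) result;
    }
}
```

Usage would look something like String html = LiveDomSnapshot.currentHtml(webEngine::executeScript); followed by Jsoup.parse(html, webEngine.getLocation()). The call has to happen on the JavaFX Application Thread, and only after the page has finished loading. Content that scripts add later would require taking another snapshot.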

