The Read Webpage module is a data module that can extract the text from a specific URL.

This module has several configuration options that extend its capabilities.

  • Depth of sub-links: The number of sub-links that will be read in each webpage. Note that the deeper the setting, the longer the module takes to run.
  • Scroll the Webpage: When enabled, the model can intelligently scroll the webpage to find more information. This matters for sites that don’t load completely until the user scrolls.
  • Advanced Options:
    • Continue on Error: If enabled and the scraping fails, the output value of Text contains the value <ERROR> instead of the workflow failing.
    • Parse HTML to Markdown: If disabled, the output value of Text contains the raw HTML instead of the interpreted Markdown version of the content.
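
With Continue on Error enabled, a later step has to tell real page text apart from the <ERROR> placeholder. The following Python sketch shows one way such a check might look. It assumes the Text value arrives as a plain string and that the placeholder is exactly <ERROR>, as described above. The function name `usable_text` and the `fallback` parameter are hypothetical and not part of the module.

```python
# Placeholder written to Text when scraping fails and
# Continue on Error is enabled (taken from the option description).
ERROR_SENTINEL = "<ERROR>"


def usable_text(text: str, fallback: str = "") -> str:
    """Return the scraped text, or `fallback` if the scrape failed.

    Hypothetical helper for a downstream step; not part of the module.
    """
    # Tolerate surrounding whitespace around the placeholder (an assumption).
    if text.strip() == ERROR_SENTINEL:
        return fallback
    return text


if __name__ == "__main__":
    print(repr(usable_text("<ERROR>", fallback="(page unavailable)")))
    print(repr(usable_text("# Page title\n\nSome content")))
```

Checking for the placeholder explicitly keeps a failed scrape from being passed on as if it were page content.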

The Read Webpage module has one input and two outputs:

  • Input:
    • URL, the link to the webpage you want to scrape. The input must contain only the link; any additional text will result in an error.
  • Output:
    • Pages, the text that was extracted from the webpage.
    • Links Found, a list with all the links found on each page. To transform the sequence of links in each page into a list, use the Single value to list module.
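
Because the URL input accepts nothing but the link, it can help to check a value before passing it to the module. The Python sketch below is one possible pre-check, not the module's own validation, which is not documented here. It assumes that a valid input is a single http or https URL with no whitespace or surrounding text. The function name `is_bare_url` is hypothetical.

```python
from urllib.parse import urlparse


def is_bare_url(value: str) -> bool:
    """Return True if `value` looks like a single http(s) URL and nothing else.

    Hypothetical pre-check; the module's real rules may differ.
    """
    # Any whitespace means there is extra text around or inside the link.
    if not value or any(ch.isspace() for ch in value):
        return False
    try:
        parsed = urlparse(value)
    except ValueError:
        # urlparse raises ValueError on some malformed inputs.
        return False
    # Assumption: only http and https links with a host are accepted.
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


if __name__ == "__main__":
    for candidate in (
        "https://example.com/docs",
        "Please read https://example.com/docs",
        "example.com",
    ):
        print(candidate, "->", is_bare_url(candidate))
```

A value that fails this check, such as a sentence that contains a link, would need the link extracted first.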