Octolooks
You are interested in extracting content from a site and wonder whether Scrapes can do it. A few simple checks will give you the answer right away.
Scrapes is a WordPress scraper plugin written in PHP. Once the necessary settings are made, it runs automatically on the server that hosts your site. The settings you make in the web browser with the visual selector are processed and saved as a task that runs automatically when its scheduled time comes. Because the task runs entirely in the background on the server, the web browser does not need to stay open during the process.
Scraping requests are sent to the source site directly from the server, without a web browser. JavaScript code on the source site that was written to run in a web browser is therefore discarded during the process and does not run. If some of that code uses Ajax to populate the desired content, the task may not work as intended.
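One quick check that follows from this is whether the text you want is already present in the raw HTML the server receives, before any JavaScript runs. The sketch below is only an illustration in Python and is not part of Scrapes; the class name `StaticTextExtractor` and the function `appears_in_static_html` are hypothetical. It assumes you have already saved the page source (for example with "View page source") and passes it in as a string.

```python
from html.parser import HTMLParser


class StaticTextExtractor(HTMLParser):
    """Collects text from raw HTML, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that a non-JavaScript client would see in the markup.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def appears_in_static_html(raw_html, snippet):
    """Return True if `snippet` occurs in the page text without running JavaScript."""
    parser = StaticTextExtractor()
    parser.feed(raw_html)
    parser.close()
    # Normalize whitespace on both sides so line breaks do not cause false misses.
    page_text = " ".join(" ".join(parser.chunks).split())
    return " ".join(snippet.split()) in page_text


static_page = "<html><body><h1>Hello World</h1><p>Full article text.</p></body></html>"
ajax_page = '<html><body><div id="app"></div><script>render("Hello World")</script></body></html>'
print(appears_in_static_html(static_page, "Hello World"))
print(appears_in_static_html(ajax_page, "Hello World"))
```

If the check fails for a sentence you can see in the browser, the content is most likely injected by JavaScript after the page loads.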
We continue to make improvements to overcome these technical limitations, but they still prevent content from being grabbed from some source sites. To help you create a problem-free task, the criteria the source site should meet on its listing and detail pages are listed below, together with suggestions for working around the technical barriers you may encounter.
Listing page criteria
1) On the listing page, there should be a “Pagination” or “Next page” button that leads to the next page. Listing pages that load more content dynamically without changing the address, for example through a “load more” button or infinite scroll, are only partially supported.
Workarounds
- Without defining “Next page”, you can still grab the limited number of recently added items listed on the first page. Older content on the following pages will not be grabbed, but new content appears on the first page, so it will be extracted automatically whenever the task's scheduled run time comes.
- You can define the “Next page” value manually, without using the visual selector. In some cases the next page link exists in the page source even though it is not visible as a button on the page. Find the HTML element that contains this link in the page source and enter its XPath in the “Next page” field. If the message “Next page is found automatically.” is displayed, you do not need to make this definition.
- You can select the “Enter URL parameter” option so that the following listing pages are found through a URL parameter in the “?page=1” format. Once you add the URL parameter and define its name, initial value and increment amount, listing pages that use this parameter are detected automatically.
- You can manually create a single listing page that gathers all the links to detail pages found across the different listing pages. If you upload this file in HTML format to your server and enter its web address in the “Source URL” field, the content on the linked detail pages will be scraped automatically.
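To make the “Enter URL parameter” workaround more concrete, the following Python sketch shows how a parameter name, initial value and increment amount map to a series of listing page addresses. It is an illustration only: how Scrapes builds these addresses internally is not described here, and the function `listing_page_urls` and the example.com addresses are hypothetical.

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse


def listing_page_urls(source_url, name, initial, increment, count):
    """Build `count` listing URLs by stepping one query parameter.

    Assumption: the source site paginates purely through this parameter.
    """
    parts = urlparse(source_url)
    # Keep any other query parameters, but drop an existing copy of `name`.
    base_query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != name
    ]
    urls = []
    for step in range(count):
        query = base_query + [(name, str(initial + step * increment))]
        urls.append(urlunparse(parts._replace(query=urlencode(query))))
    return urls


# Page-number style: ?page=1, ?page=2, ?page=3
for url in listing_page_urls("https://example.com/news?cat=tech", "page", 1, 1, 3):
    print(url)

# Offset style: ?offset=0, ?offset=20, ?offset=40
for url in listing_page_urls("https://example.com/list", "offset", 0, 20, 3):
    print(url)
```

Opening the first two or three generated addresses in a browser is a quick way to confirm that the site really responds to the parameter before you define it in the task.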
2) On the listing page, there should be links that appear in a consistent order and lead to the detail pages. Listing pages whose links are not positioned in a consistent order, that open detail pages in a popup instead of following a link, or that load content with JavaScript / Ajax after the page has loaded are not supported.
Workarounds
- You can activate the “exact match only” feature to identify similar links. Links on the listing page that lead to detail pages when clicked will be detected automatically if their structure is similar to the defined “Post Item”, even if they are not in a fixed format and are not wrapped in the same container element.
- You can set the task type to “Feed” instead of “Serial” and enter the feed address of the source site in “Source URL”. Because a feed consists of summary content in a standard format, with each item leading to its detail page, there is no need to define the “Post Item” field for other links of the same type to be detected automatically.
- You can enter the web address of the mobile version of the source site in the “Source URL” field. To find out whether a site has a mobile version, add the “m” subdomain to the base web address (https://m.facebook.com) or visit the site with your mobile device. Because mobile sites are built with the most basic technologies in order to support all devices, all links of the same type as the defined “Post Item” will be detected automatically.
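The reason a feed needs no “Post Item” definition can be seen in a small Python sketch: in the standard RSS 2.0 and Atom formats, every item already carries its detail page link in a fixed place. This is only an illustration of the feed formats, not of how Scrapes reads them; the function `feed_item_links` and the sample feeds are hypothetical.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"


def feed_item_links(feed_xml):
    """Return the detail page links found in an RSS 2.0 or Atom feed string."""
    root = ET.fromstring(feed_xml)
    links = []
    # RSS 2.0: <rss><channel><item><link>URL</link></item></channel></rss>
    for item in root.iter("item"):
        link = item.findtext("link")
        if link and link.strip():
            links.append(link.strip())
    # Atom: <feed><entry><link href="URL"/></entry></feed>
    for entry in root.iter(ATOM_NS + "entry"):
        for link in entry.findall(ATOM_NS + "link"):
            # A missing rel attribute means rel="alternate" in Atom.
            if link.get("rel", "alternate") == "alternate" and link.get("href"):
                links.append(link.get("href"))
                break
    return links


rss_sample = """<rss version="2.0"><channel><title>Example</title><link>https://example.com/</link>
<item><title>A</title><link>https://example.com/a</link></item>
<item><title>B</title><link>https://example.com/b</link></item>
</channel></rss>"""

atom_sample = (
    '<feed xmlns="http://www.w3.org/2005/Atom">'
    '<entry><title>A</title><link href="https://example.com/a"/></entry>'
    "</feed>"
)

print(feed_item_links(rss_sample))
print(feed_item_links(atom_sample))
```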
Detail page criteria
1) The content should be loaded when the page is opened. Page content that is loaded afterwards with JavaScript / Ajax technologies is not supported.
2) The detail pages linked from the listing page should share the same HTML page structure and must be valid HTML. Invalid detail pages or detail pages with differing HTML structures may not be supported.
Workarounds
- Instead of using the visual selector to identify fields such as “Title” and “Featured image” on detail pages that have different page structures, you can match them with generic HTML tags used on most websites and select the “Detect automatically” option for the “Content” field. If the generic tags you define exist on the page, the content within them will be extracted automatically. Some generic tags that can be defined are as follows.
Title
//title
//h1
//meta[@property="og:title"]/@content
Featured image
//meta[@property="og:image"]/@content
- You can activate “Disable styles” at the top left of the visual selector when the desired content is present in the page source but not visible in the interface. Content that the source site hides with CSS and later reveals with JavaScript will then become visible, so the visual selector can locate it.
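To see why the generic tags listed above work across differently structured detail pages, the Python sketch below reads the same four sources (title, h1, og:title and og:image) from any HTML string. It uses the standard-library HTML parser rather than XPath, and the order in which the title sources are tried (og:title, then h1, then title) is an assumption made for this example, not documented Scrapes behavior. The names `GenericTagParser` and `extract_generic_fields` are hypothetical.

```python
from html.parser import HTMLParser


class GenericTagParser(HTMLParser):
    """Reads <title>, the first <h1>, and og:title / og:image meta tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}                       # og property -> content value
        self.text = {"title": "", "h1": ""}  # collected element text
        self._open = None                    # element currently being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("property") in ("og:title", "og:image"):
            # Keep the first occurrence of each property.
            self.meta.setdefault(attrs["property"], attrs.get("content") or "")
        elif tag in self.text and not self.text[tag]:
            self._open = tag

    def handle_endtag(self, tag):
        if tag == self._open:
            self._open = None

    def handle_data(self, data):
        if self._open:
            self.text[self._open] += data


def extract_generic_fields(raw_html):
    """Return the title and featured image found through generic tags."""
    parser = GenericTagParser()
    parser.feed(raw_html)
    parser.close()
    # Assumed priority for this sketch: og:title, then <h1>, then <title>.
    title = (
        parser.meta.get("og:title")
        or parser.text["h1"].strip()
        or parser.text["title"].strip()
    )
    return {"title": title or None, "featured_image": parser.meta.get("og:image") or None}


page = (
    "<html><head><title>Site | Post</title>"
    '<meta property="og:image" content="https://example.com/a.jpg"></head>'
    "<body><h1>Post <em>heading</em></h1></body></html>"
)
print(extract_generic_fields(page))
```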
Feed criteria
1) RSS and Atom feeds with an invalid XML structure may not be supported by our WordPress news aggregator plugin.
Workarounds
- To fix errors in feeds that do not have a valid XML structure, you can install the Tidy PHP extension if it is not already installed on your server. Where possible, this extension will fix the errors in the feed of the source site.
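Before deciding whether a repair step is needed, you can check whether a feed parses as XML at all. The Python sketch below only detects the problem and reports where the parser stopped; it does not repair anything and is not a substitute for the Tidy extension. The function `feed_xml_problem` is hypothetical, and the check covers well-formedness only, not conformance to the RSS or Atom formats.

```python
import xml.etree.ElementTree as ET


def feed_xml_problem(feed_text):
    """Return None if the feed is well-formed XML, otherwise the parser's error message."""
    try:
        ET.fromstring(feed_text)
    except ET.ParseError as err:
        # The message includes the line and column where parsing failed.
        return str(err)
    return None


good_feed = "<rss><channel><title>Ok</title></channel></rss>"
bad_feed = "<rss><channel><title>Broken</channel></rss>"  # <title> is never closed

print(feed_xml_problem(good_feed))
print(feed_xml_problem(bad_feed))
```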
You can check whether the source site you are interested in is suitable for Scrapes by trying the Scrapes live demo or by contacting us at any time.