Custom Extraction – Page Copy & Specific HTML – With Screaming Frog [2024]

  • Open Screaming Frog
  • Go to Configuration in the top menu
  • Custom > Custom Extraction
  • Use Inspect Element (in Chrome, right-click the copy and choose “Inspect”) to identify the name, class or ID of the div or element that contains the page copy:

    In this example the Div class is “prose” (f8ck knows why)

  • You can copy the XPath instead – but it appears to do the same thing as just entering the class or ID of the div.
  • The following will scrape any text in the div called “prose”:


Once you are in the Custom Extraction Window – Choose:

  • Extractor 1
  • XPath
  • In the next box enter //div[@class='classofdiv'] – in this example, //div[@class='prose']
  • Extract Text
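
To sanity-check an expression before running a crawl, the same idea can be sketched offline in Python. This is only an approximation: it uses the standard library's ElementTree, which supports a subset of XPath and needs well-formed markup, and the PAGE sample and extract_text helper are made up for illustration rather than anything Screaming Frog provides.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page; a real crawl would use the live HTML.
PAGE = """
<html><body>
  <div class="header">Site navigation</div>
  <div class="prose">
    <p>First paragraph of copy.</p>
    <p>Second paragraph.</p>
  </div>
</body></html>
""".strip()


def extract_text(html: str, xpath: str) -> list[str]:
    """Return the whitespace-collapsed text of every element matched by xpath."""
    root = ET.fromstring(html)
    return [" ".join("".join(el.itertext()).split()) for el in root.findall(xpath)]


# ElementTree wants a leading "." so the search is relative to the root element.
print(extract_text(PAGE, ".//div[@class='prose']"))
```

Only the text inside the “prose” div comes back; the header div is ignored, which is roughly what “Extract Text” gives you in the crawl output.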


Here are some more examples:

How to Extract Common HTML Elements

//div[@class='read-more']
XPath | Output
//h1 | Extract all H1 tags
//h3[1] | Extract the first H3 tag
//h3[2] | Extract the second H3 tag
//div/p | Extract any <p> contained within a <div>
//div[@class='author'] | Extract any <div> with class “author” (check that the quote marks are straight ' and not curly ‘ ’)
//p[@class='bio'] | Extract any <p> with class “bio”
//*[@class='bio'] | Extract any element with class “bio”
//ul/li[last()] | Extract the last <li> in a <ul>
//ol[@class='cat']/li[1] | Extract the first <li> in an <ol> with class “cat”
count(//h2) | Count the number of H2s (set extraction filter to “Function Value”)
//a[contains(.,'click here')] | Extract any link with anchor text containing “click here”
//a[starts-with(@title,'Written by')] | Extract any link with a title starting with “Written by”
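
A few of the expressions above can be tried out against a small test page in Python. This is a rough sketch using the standard library's ElementTree: it handles positional predicates and last(), but not functions such as count() or contains(), so those are imitated in plain Python. The SAMPLE markup and the texts helper are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page.
SAMPLE = """
<html><body>
  <h1>Main title</h1>
  <h2>Section A</h2>
  <h2>Section B</h2>
  <div><h3>First H3</h3><h3>Second H3</h3></div>
  <ul><li>One</li><li>Two</li><li>Three</li></ul>
  <p><a href="/offer">Please click here now</a></p>
</body></html>
""".strip()

root = ET.fromstring(SAMPLE)


def texts(path):
    """Text of each element matched by an ElementTree path."""
    return [el.text for el in root.findall(path)]


h1s = texts(".//h1")                     # //h1
first_h3 = texts(".//h3[1]")             # //h3[1]
second_h3 = texts(".//h3[2]")            # //h3[2]
last_li = texts(".//ul/li[last()]")      # //ul/li[last()]
h2_count = len(root.findall(".//h2"))    # stand-in for count(//h2)

# Stand-in for //a[contains(.,'click here')], done in Python instead of XPath.
click_here_links = [
    a.get("href") for a in root.iter("a") if "click here" in "".join(a.itertext())
]

print(h1s, first_h3, second_h3, last_li, h2_count, click_here_links)
```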

 

How to Extract Common HTML Attributes

XPath | Output
//@href | Extract all links
//a[starts-with(@href,'mailto')]/@href | Extract link that starts with “mailto” (email address)
//img/@src | Extract all image source URLs
//img[contains(@class,'aligncenter')]/@src | Extract all image source URLs for images with the class name containing “aligncenter”
//link[@rel='alternate'] | Extract <link> elements with the rel attribute set to “alternate”
//@hreflang | Extract all hreflang values
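
The attribute expressions can be imitated in Python as well. ElementTree cannot select attributes directly (there is no equivalent of /@href), so this sketch walks the elements and reads the attributes with get() instead. The SAMPLE page and its URLs are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page.
SAMPLE = """
<html><head>
  <link rel="alternate" hreflang="de" href="https://example.com/de/"/>
</head><body>
  <a href="/about">About</a>
  <a href="mailto:hello@example.com">Email us</a>
  <img class="aligncenter size-large" src="/img/hero.jpg"/>
  <img class="thumb" src="/img/thumb.jpg"/>
</body></html>
""".strip()

root = ET.fromstring(SAMPLE)

# //@href : every href value, on any element, in document order
all_hrefs = [el.get("href") for el in root.iter() if el.get("href") is not None]

# //a[starts-with(@href,'mailto')]/@href
mailto_links = [
    a.get("href") for a in root.iter("a") if a.get("href", "").startswith("mailto")
]

# //img/@src
image_srcs = [img.get("src") for img in root.iter("img")]

# //img[contains(@class,'aligncenter')]/@src
centered_srcs = [
    img.get("src") for img in root.iter("img") if "aligncenter" in img.get("class", "")
]

# //@hreflang
hreflangs = [el.get("hreflang") for el in root.iter() if el.get("hreflang") is not None]

print(all_hrefs, mailto_links, image_srcs, centered_srcs, hreflangs)
```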

 

How to Extract Meta Tags (including Open Graph and Twitter Cards)

I recommend setting the extraction filter to “Extract Inner HTML” for these ones.

Extract Meta Tags:

XPath | Output
//meta[@property='article:published_time']/@content | Extract the article publish date (commonly-found meta tag on WordPress websites)

Extract Open Graph:

XPath | Output
//meta[@property='og:type']/@content | Extract the Open Graph type object
//meta[@property='og:image']/@content | Extract the Open Graph featured image URL
//meta[@property='og:updated_time']/@content | Extract the Open Graph updated time

Extract Twitter Cards:

XPath | Output
//meta[@name='twitter:card']/@content | Extract the Twitter Card type
//meta[@name='twitter:title']/@content | Extract the Twitter Card title
//meta[@name='twitter:site']/@content | Extract the Twitter Card site object (Twitter handle)
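
If you want to spot-check a single page outside the crawler, the meta tag lookups can be approximated with Python's built-in html.parser, which copes with the unclosed <meta> tags found in real HTML. The MetaCollector class and the HEAD sample are hypothetical and only illustrate the idea.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta> content values keyed by their property or name attribute."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # Open Graph and article tags use "property"; Twitter Cards use "name".
        key = attr.get("property") or attr.get("name")
        if key and "content" in attr:
            self.meta[key] = attr["content"]


# Hypothetical <head> snippet.
HEAD = """
<head>
<meta property="og:type" content="article">
<meta property="article:published_time" content="2024-01-15T09:00:00+00:00">
<meta name="twitter:card" content="summary_large_image">
</head>
"""

collector = MetaCollector()
collector.feed(HEAD)
collector.close()

print(collector.meta["og:type"])
print(collector.meta["article:published_time"])
print(collector.meta["twitter:card"])
```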

How to Extract Schema Markup in Microdata Format

If it’s in JSON-LD format, then jump to the section on how to extract schema markup with regex.

Extract Schema Types:

XPath | Output
//*[@itemtype]/@itemtype | Extract all of the types of schema markup on a page
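
As a quick illustration of what that expression returns, here is a small Python sketch using ElementTree, which does support the [@itemtype] attribute test. The product markup in SAMPLE is made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical microdata snippet, written as well-formed markup.
SAMPLE = """
<html><body>
  <div itemscope="" itemtype="https://schema.org/Product">
    <span itemprop="name">Widget</span>
    <div itemprop="offers" itemscope="" itemtype="https://schema.org/Offer">
      <span itemprop="price">9.99</span>
    </div>
  </div>
</body></html>
""".strip()

root = ET.fromstring(SAMPLE)

# Equivalent of //*[@itemtype]/@itemtype: find every element carrying an
# itemtype attribute, then read the attribute value.
schema_types = [el.get("itemtype") for el in root.findall(".//*[@itemtype]")]

print(schema_types)
```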

Update:

If the ‘shorter code’ in the tables above doesn’t work for some reason, you may have to right-click, choose Inspect, and copy the full XPath to be more specific about what you want to extract.

For sections of text like paragraphs and on-page descriptions, select the actual text in the Inspect window before copying the XPath.
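
To show the difference between the two styles, this sketch resolves a short class-based expression and a full positional path against the same made-up page and checks that they land on the same element. It uses Python's ElementTree, where paths are written relative to the root element, so a copied /html/body/div[2]/p[2] becomes ./body/div[2]/p[2]. The SAMPLE markup is an assumption for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page.
SAMPLE = """
<html><body>
  <div class="nav">Menu</div>
  <div class="content">
    <p>Intro text.</p>
    <p class="description">On-page description.</p>
  </div>
</body></html>
""".strip()

root = ET.fromstring(SAMPLE)  # root is the <html> element

# Short form, as in the tables: //p[@class='description']
short = root.find(".//p[@class='description']")

# Full path as a browser might copy it: /html/body/div[2]/p[2]
full = root.find("./body/div[2]/p[2]")

same_element = short is full
print(full.text, same_element)
```

The full path is more specific, but it breaks as soon as the page layout shifts (an extra div, a reordered paragraph), whereas the class-based version keeps working as long as the class name stays the same.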

Update 2

We wanted to compare the copy and internal links before and after a site-migration to a new CMS.

To see the links in HTML format – you just need to change “Extract Text” to “Extract Inner HTML” in the final drop-down.

On the new CMS, it was easier to just copy the XPath.
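
Once you have the inner HTML from both crawls exported, the link comparison can be done with a few lines of Python. This is a minimal sketch, assuming you have the old and new HTML for a page as strings; the LinkCollector class, the links_in helper and the sample snippets are all hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a fragment of HTML."""

    def __init__(self):
        super().__init__()
        self.hrefs = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.add(href)


def links_in(inner_html):
    """Return the set of link targets found in an extracted HTML fragment."""
    parser = LinkCollector()
    parser.feed(inner_html)
    parser.close()
    return parser.hrefs


# Hypothetical extractions for the same page on the old and new CMS.
old_html = '<p>See our <a href="/pricing">pricing</a> and <a href="/contact">contact</a> pages.</p>'
new_html = '<p>See our <a href="/pricing">pricing</a> page.</p>'

missing = links_in(old_html) - links_in(new_html)  # links lost in the migration
added = links_in(new_html) - links_in(old_html)    # links that only exist on the new site

print("Missing after migration:", missing)
print("New links:", added)
```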

Why Use Custom Extraction with Screaming Frog?

I’m glad you asked.

We used it to check that page copy had migrated properly to a new CMS.

We also extracted the HTML within the copy, to check the internal links were still present.
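
For the page copy check itself, a rough similarity score is often enough to flag pages that need a closer look. The sketch below uses Python's difflib and ignores differences in whitespace and letter case; the normalise and copy_similarity helpers and the sample text are hypothetical, and any threshold you set for “close enough” is a judgment call.

```python
import difflib


def normalise(text):
    """Collapse whitespace and lowercase so formatting changes are ignored."""
    return " ".join(text.split()).lower()


def copy_similarity(old, new):
    """Return a 0..1 similarity ratio between two pieces of page copy."""
    return difflib.SequenceMatcher(None, normalise(old), normalise(new)).ratio()


# Hypothetical extracted copy from the old site, the new site, and a page
# where part of the copy went missing.
old_copy = "Our widgets are  hand-made in Leeds.\nFree delivery on all orders."
new_copy = "Our widgets are hand-made in Leeds. Free delivery on all orders."
changed_copy = "Our widgets are hand-made in Leeds."

print(copy_similarity(old_copy, new_copy))      # identical once whitespace is normalised
print(copy_similarity(old_copy, changed_copy))  # lower, because a sentence is missing
```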

One cool thing you can do – is scrape reviews and then analyse the reviews to see key feedback/pain points that could inform superior design.

Here’s a good way to use custom extraction/search to find text that you want to use for anchor text for internal links:

(Screenshot: custom search in Screaming Frog)

I’m still looking into how to analyse the reviews – but this tool is a good starting point: https://seoscout.com/tools/text-analyzer

Throw the reviews in and see what words are repeated etc
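
If you would rather count the repeated words yourself, a few lines of Python will do it. This is a minimal sketch: the STOPWORDS list is a tiny made-up sample, top_words is a hypothetical helper, and the reviews are invented.

```python
import re
from collections import Counter

# A deliberately small stopword list for illustration; extend it as needed.
STOPWORDS = {"the", "a", "and", "is", "it", "to", "was", "but", "too", "i", "of"}


def top_words(reviews, n=5):
    """Return the n most common non-stopword words across all reviews."""
    words = []
    for review in reviews:
        words += [w for w in re.findall(r"[a-z']+", review.lower()) if w not in STOPWORDS]
    return Counter(words).most_common(n)


# Hypothetical scraped reviews.
reviews = [
    "Delivery was slow but the quality is great.",
    "Great quality, slow delivery.",
    "The delivery took too long.",
]

for word, count in top_words(reviews):
    print(word, count)
```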

This tool is also very good:

https://voyant-tools.org

Or – just paste into ChatGPT and ask for insights and pain points to help develop a better product.

I asked ChatGPT and Google Gemini why you would use custom extraction and got these responses. ChatGPT said:

Monitoring Competitors

  1. Content Strategy Insights: Extract headings, titles, and meta descriptions from competitor websites to understand their content strategy and how they optimize for search engines.
  2. Pricing Information: For e-commerce sites, custom extraction can be used to monitor competitors’ pricing strategies by extracting prices for similar products.
  3. Promotional Strategies: Extract information on promotions, discounts, or special offers from competitors’ sites to compare with your own promotional strategies.
  4. Keyword Usage: Extract specific keywords from competitors’ content to see how they are targeting certain topics or niches.
  5. Outbound Link Analysis: Although Screaming Frog is not primarily a backlink tool, you can use custom extraction to identify external links on a page, which shows who a competitor links out to (these are outbound links, not backlinks).

Optimizing Your Own Website

  1. Content Optimization: Use custom extraction to audit your own site’s headings, titles, meta descriptions, and keyword usage to identify areas for improvement.
  2. Structured Data: Extract and analyze structured data (e.g., Schema markup) across your site to ensure it’s correctly implemented and to identify opportunities for enhancement.
  3. Internal Linking: Extract internal linking structures to analyze the flow of page authority throughout your site and identify orphan pages or pages with weak internal link support.
  4. Audit Forms and Calls to Action (CTAs): Extract form action URLs and CTA links to ensure they are functioning as intended and to optimize their placement and wording.
  5. Identify Outdated Content: Use custom extraction to find dates and timestamps on your content, helping to identify articles or pages that may need updating.

Gemini said:

Reasons to Use Custom Extraction in Screaming Frog:

The custom extraction feature in Screaming Frog offers several benefits for analyzing websites and gathering data. Here are some key reasons to use it:

1. Extract Specific Data:

  • Go beyond basic SEO metrics: While Screaming Frog provides valuable insights, custom extraction allows you to target specific data points not readily available, like product prices, article publish dates, or user reviews.
  • Gain deeper competitor insights: Extract competitor pricing strategy, product descriptions, or content structure to compare against your own website.
  • Gather information for internal analysis: Track specific elements on your site, such as call to action (CTA) button text or internal linking structure, to monitor changes and measure their impact.

2. Automate Data Collection:

  • Save time and effort: Manually collecting data can be tedious and error-prone. Custom extraction automates the process, allowing you to gather information efficiently across multiple pages.
  • Maintain consistent data: By establishing automated data extraction, you ensure consistent data collection over time, enabling better trend analysis and comparisons.

3. Enhance Reporting and Analysis:

  • Combine extracted data with existing Screaming Frog metrics: Integrate the extracted data with other SEO parameters like page titles, meta descriptions, and internal links for a more comprehensive analysis.
  • Create custom reports: Utilize the extracted data to build tailored reports for specific needs, such as competitor pricing comparisons or content performance evaluations.

Monitoring Competitors:

Custom extraction can be a powerful tool for competitor monitoring in several ways:

  • Extract competitor pricing data: Track competitor pricing trends, identify potential gaps in your own pricing strategy, and make informed pricing decisions.
  • Analyze competitor content structure and keywords: Understand how competitors format their content, identify their targeted keywords, and gain insights for improving your own content strategy.
  • Monitor competitor product descriptions and features: Compare your product offerings to competitors, identify potential areas for improvement, and refine your product messaging.

Optimizing Your Own Website:

Custom extraction can also be used to optimize and update your own website:

  • Track internal linking structure: Extract data on internal links to identify broken links, orphaned pages, and opportunities for optimization.
  • Monitor call to action (CTA) button text: Track the wording and placement of CTAs across your website and analyze their effectiveness in driving conversions.
  • Extract product information for quality control: Ensure product descriptions, prices, and other information are consistent and accurate across all product pages.

Overall, the custom extraction feature in Screaming Frog offers a versatile tool for extracting valuable data from websites, enabling deeper competitor analysis, improved website optimization, and enhanced reporting capabilities.
