Screaming Frog – Custom Extraction – Extract Specific Page Copy [2025]

Last Updated – a few days ago (probably)

  • Open Screaming Frog
  • Go to Configuration in the top menu
  • Custom > Custom Extraction
  • Use Inspect Element (right click on the copy and choose "Inspect" if you use the Chrome browser) to identify the name, class or ID of the div or element the page copy is contained in:

    In this example the Div class is “prose” (f8ck knows why)

  • You can copy the XPath instead – but it appears to do the same thing as just entering the class or ID of the div:
  • The following will scrape any text in the div called “prose”:


Once you are in the Custom Extraction Window – Choose:

  • Extractor 1
  • XPath
  • In the next box enter //div[@class='classofdiv']

    in this example – //div[@class='prose']
  • Extract Text

//div[@class='prose']

Enter the above into the third box in the custom extraction window/tab.
Replace "prose" with the name of the div you want to scrape.


If you copy the XPath using Inspect Element, select the exact element you want. For example, don't select the div that contains the text you want to scrape – select the text itself.

Here are some more examples:

How to Extract Common HTML Elements

XPath – Output
//div[@class='read-more'] – Extract any <div> with class "read-more"
//h1 – Extract all H1 tags
//h3[1] – Extract the first H3 tag
//h3[2] – Extract the second H3 tag
//div/p – Extract any <p> contained within a <div>
//div[@class='author'] – Extract any <div> with class "author" (remember to check the quote marks are correct)
//p[@class='bio'] – Extract any <p> with class "bio"
//*[@class='bio'] – Extract any element with class "bio"
//ul/li[last()] – Extract the last <li> in a <ul>
//ol[@class='cat']/li[1] – Extract the first <li> in an <ol> with class "cat"
count(//h2) – Count the number of H2s (set the extraction filter to "Function Value")
//a[contains(.,'click here')] – Extract any link with anchor text containing "click here"
//a[starts-with(@title,'Written by')] – Extract any link with a title starting with "Written by"

 

How to Extract Common HTML Attributes

XPath – Output
//@href – Extract all links
//a[starts-with(@href,'mailto')]/@href – Extract links that start with "mailto" (email addresses)
//img/@src – Extract all image source URLs
//img[contains(@class,'aligncenter')]/@src – Extract image source URLs for images with a class name containing "aligncenter"
//link[@rel='alternate'] – Extract elements with the rel attribute set to "alternate"
//@hreflang – Extract all hreflang values

 

How to Extract Meta Tags (including Open Graph and Twitter Cards)

I recommend setting the extraction filter to “Extract Inner HTML” for these ones.

Extract Meta Tags:

XPath – Output
//meta[@property='article:published_time']/@content – Extract the article publish date (a commonly found meta tag on WordPress websites)

Extract Open Graph:

XPath – Output
//meta[@property='og:type']/@content – Extract the Open Graph type object
//meta[@property='og:image']/@content – Extract the Open Graph featured image URL
//meta[@property='og:updated_time']/@content – Extract the Open Graph updated time

Extract Twitter Cards:

XPath – Output
//meta[@name='twitter:card']/@content – Extract the Twitter Card type
//meta[@name='twitter:title']/@content – Extract the Twitter Card title
//meta[@name='twitter:site']/@content – Extract the Twitter Card site object (Twitter handle)

How to Extract Schema Markup in Microdata Format

If it’s in JSON-LD format, then jump to the section on how to extract schema markup with regex.

Extract Schema Types:

XPath – Output
//*[@itemtype]/@itemtype – Extract all of the types of schema markup on a page
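The regex approach for JSON-LD isn't reproduced in this section, but as a rough local check you can pull the JSON-LD blocks out of a page yourself. Here is a sketch of my own in Python (requests, lxml and the standard json module are assumptions, not the Screaming Frog method); it simply lists the @type values it finds.

# List schema.org @type values from JSON-LD blocks in a page's HTML.
# The URL is a placeholder; nested @graph structures are not unpacked here.
import json
import requests
from lxml import html

page = html.fromstring(requests.get("https://example.com/", timeout=10).text)

for raw in page.xpath('//script[@type="application/ld+json"]/text()'):
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        continue  # malformed JSON-LD is common enough - just skip it
    for item in (data if isinstance(data, list) else [data]):
        if isinstance(item, dict):
            print(item.get("@type"))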


Update:

If the 'shorter code' in the tables above doesn't work for some reason, you may have to right click, choose Inspect, and copy the full XPath to be more specific about what you want to extract.

For sections of text like paragraphs and on-page descriptions, select the actual text in the inspect window before copying the XPath.

Update 2

We wanted to compare the copy and internal links before and after a site-migration to a new CMS.

To see the links in HTML format, just change "Extract Text" to "Extract Inner HTML" in the final drop-down.

On the new CMS, it was easier to just copy the XPath
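For the actual before/after comparison, one option (my own sketch, not a Screaming Frog feature) is to export the custom extraction column from both crawls and diff them in pandas. The file and column names below are assumptions – match them to your real exports – and it assumes URLs stayed the same across the migration.

# Compare extracted page copy from two crawls (old CMS vs new CMS).
# File and column names are placeholders - rename to match your exports.
import pandas as pd

old = pd.read_csv("old_cms_extraction.csv")[["Address", "Copy 1"]]
new = pd.read_csv("new_cms_extraction.csv")[["Address", "Copy 1"]]

merged = old.merge(new, on="Address", how="outer",
                   suffixes=("_old", "_new"), indicator=True)

# Pages where the extracted copy differs, or is missing on one side
changed = merged[merged["Copy 1_old"].fillna("") != merged["Copy 1_new"].fillna("")]
changed.to_csv("copy_differences.csv", index=False)
print(len(changed), "pages with differences")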

Why Use Custom Extraction with Screaming Frog?

I’m glad you asked.

We used it to check that page copy had migrated properly to a new CMS.

We also extracted the HTML within the copy, to check the internal links were still present.

One cool thing you can do is scrape reviews and then analyse them for key feedback and pain points that could inform a better product or design.

Here’s a good way to use custom extraction/search to find text that you want to use for anchor text for internal links:

(Screenshot: custom search in Screaming Frog)

I’m still looking into how to analyse the reviews – but this tool is a good starting point: https://seoscout.com/tools/text-analyzer

Throw the reviews in and see which words are repeated, etc.
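For a quick first pass before reaching for those tools, a simple word count in Python works too. The CSV filename and the "Reviews 1" column name are hypothetical – use whatever your extraction export actually contains.

# Rough frequency count of words in scraped reviews.
# The CSV filename and column name are placeholders.
import re
from collections import Counter
import pandas as pd

reviews = pd.read_csv("review_extraction.csv")["Reviews 1"].dropna()

stopwords = {"the", "and", "a", "to", "of", "it", "is", "was", "i", "for",
             "in", "this", "that", "with", "on", "but", "very", "my", "not"}

counts = Counter()
for text in reviews:
    for word in re.findall(r"[a-z']+", str(text).lower()):
        if len(word) > 2 and word not in stopwords:
            counts[word] += 1

print(counts.most_common(30))  # repeated themes / potential pain points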

This tool is also very good:

https://voyant-tools.org

Or just paste them into ChatGPT and ask for insights and pain points to help develop a better product.

I asked ChatGPT and Google Gemini why I would want to scrape copy from sites and got these responses:


Monitoring Competitors

Content Strategy Insights: Gather headings, titles, and meta descriptions from competitor websites to gain insights into their content strategies and how they optimize for search engines.

Pricing Information: For e-commerce platforms, you can use custom extraction to keep an eye on competitors’ pricing strategies by pulling prices for similar products.

Promotional Strategies: Collect details on promotions, discounts, or special offers from competitors’ sites to evaluate how they stack up against your own promotional efforts.

Keyword Usage: Identify specific keywords from competitors’ content to understand how they are targeting particular topics or niches.

Backlink Analysis: While Screaming Frog isn’t primarily a backlink tool, you can utilize custom extraction to pinpoint external links on a page, providing insights into a competitor’s backlink strategy.

Optimizing Your Own Website

Content Optimization: Employ custom extraction to review your own site’s headings, titles, meta descriptions, and keyword usage to spot areas that need improvement.

Structural Data: Extract and assess structured data (like Schema markup) across your site to ensure proper implementation and identify potential enhancement opportunities.

Internal Linking: Analyze internal linking structures to understand the flow of page authority throughout your site and to find orphan pages or those with insufficient internal link support.

Audit Forms and Calls to Action (CTAs): Extract form action URLs and CTA links to verify they are working correctly and to optimize their placement and wording.

Identify Outdated Content: Use custom extraction to locate dates and timestamps on your content, which can help you identify articles or pages that may require updates.

Gemini said:

Reasons to Use Custom Extraction in Screaming Frog:

The custom extraction feature in Screaming Frog provides numerous advantages for website analysis and data collection. Here are some important reasons to utilize it:

  1. Extract Specific Data:

Go beyond basic SEO metrics: While Screaming Frog offers valuable insights, custom extraction enables you to focus on specific data points that may not be easily accessible, such as product prices, article publish dates, or user reviews.

Gain deeper competitor insights: Extract information on competitor pricing strategies, product descriptions, or content structures to compare with your own website.

Gather information for internal analysis: Monitor specific elements on your site, like call to action (CTA) button text or internal linking structures, to observe changes and assess their impact.

  2. Automate Data Collection:

Save time and effort: Collecting data manually can be labor-intensive and prone to errors. Custom extraction streamlines the process, allowing you to efficiently gather information across numerous pages.

Maintain consistent data: By setting up automated data extraction, you ensure uniform data collection over time, which facilitates better trend analysis and comparisons.

  3. Enhance Reporting and Analysis:

Combine extracted data with existing Screaming Frog metrics: Merge the extracted data with other SEO parameters such as page titles, meta descriptions, and internal links for a more thorough analysis.

Create custom reports: Use the extracted data to generate tailored reports for specific purposes, like competitor pricing comparisons or evaluations of content performance.

Monitoring Competitors:

Custom extraction serves as a valuable tool for competitor monitoring in various ways:

Extract competitor pricing data: Keep track of competitor pricing trends, identify potential gaps in your own pricing strategy, and make informed pricing decisions.

Analyze competitor content structure and keywords: Learn how competitors format their content, pinpoint their targeted keywords, and gain insights to enhance your own strategy.

Note to self – for Magento 2 (Hyva theme) sub-category page copy, scrape using:

//div[@id='descriptionDiv']

Product page descriptions, upper and lower divs:

//div[@class="product-description"]

//*[@id="specifications"]/div/div[2]/div/div/div/div/div

//*[@id="description"]/div/div[2]

Auditing Canonicals With Screaming Frog [2024]

  • TL;DR – go to the "Issues" reports: Bulk Report > Issues > All > export into a folder
  • View the canonicals / canonicalised reports

Some of the names of the reports can be confusing:

  • The "Canonicals Missing Inlinks" report is a list of the pages missing canonical URLs, plus the inlinks to those pages
  • The "Canonicals Missing" report – as you'd expect – shows you the pages without canonical URLs/tags
  • "Canonicals Canonicalised" – pages whose canonical points to a different URL. So you might have example.com/help/contact canonicalised to example.com/help – which may or may not be a problem.
  • "Canonicals Canonicalised Inlinks" – the inlinks to those pages whose canonical points to a different URL.

  • Check the canonical with JS turned off (using a developer Chrome extension) and check it remains the same
  • Check the page source with and without JS turned on – the canonical should remain the same
  • Check the canonical is not added using JS – this is not ideal
  • Check for multiple canonical URLs using Screaming Frog, and check visually in the page source (JS rendering might be required to see all canonicals) – see the sketch below
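A rough way to script the multiple-canonicals check: fetch the raw (non-rendered) HTML and count the rel="canonical" tags. This is a sketch of my own using requests and lxml – it only sees the source HTML, so a canonical injected by JS won't appear, which is exactly the difference you want to spot.

# Count rel="canonical" tags in the raw HTML of a URL (no JS rendering).
# The URL is a placeholder.
import requests
from lxml import html

url = "https://example.com/help/contact"
page = html.fromstring(requests.get(url, timeout=10).text)

canonicals = page.xpath('//link[@rel="canonical"]/@href')
print(len(canonicals), "canonical tag(s):", canonicals)
# Zero tags, more than one tag, or a tag pointing at a different URL all
# warrant a closer look in the rendered source as well.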

  • For paginated pages – if you want each page indexed, check that each page has its own canonical URL
  • Check that facets/filters on pages don't change the canonical URL (generally you don't want them to)

Exclude/filter these out – in this instance by adding a filter in Excel: does not contain "+"

Check this blog post too about auditing canonicals and Hreflang tags

Creating an Actionable 404 Report from Screaming Frog

Update – I don’t think all the process below is required.

Just download the 404 inlinks report from Screaming Frog:

Bulk Export (very top of the GUI, slightly to the left) > Response Codes > Internal > Client Error 4xxs

Copy the "Destination" column (column C in the report) and paste it into a new Excel tab/sheet, then remove duplicates

In the first sheet, copy and paste the source column into column D

In the second sheet, do a VLOOKUP using the destination URL, looking it up against columns C and D in the first sheet, to return the relevant source URL

Copy the VLOOKUP and paste – values only – into column A in the second sheet

You can also copy and paste the anchor text and location into column C

Follow this protocol to produce a sheet you can send to devs etc. to remove 404s:

  • This will get rid of the site-wide 404s and some individual 404s

Run a crawl with Screaming Frog

Export the report: Screaming Frog > Bulk Export > Response Codes > Internal > Internal Client Error (4xxs) (check 500s too)

In Excel – Copy and paste the “destination” URLs into a new sheet – into column A

Remove duplicates from the destination URLs that you’ve just copied into a new sheet

Rename the column "404s Destination"

  • Copy and paste the Source URLs and the Anchor Text into a new sheet.

Paste Source URLs in column A, and Anchor Text into column C

In cell B1 type – ” | ”

In cell D1 – give the column the heading “Source | Anchor”

In cell D2 concatenate – =CONCATENATE(A2,$B$1,C2)

Drag the formula down.

You’ll now have the anchor text and the source URL together, so you can vlookup the destination (404) URL

  • Create a new sheet
  • Copy and paste all of the Source URL | Anchor Text values (from the concatenate formula) into column C – paste special, values only
  • Copy and paste the Destination URLs from the original sheet into column B of the new sheet you just made.


You need “destination” in column B and “Source | Anchor Text” in column C, as vlookup has to go left to right

  • So you’ll have – 404s Destination – Destination – Source | Anchor Text

Name column D in the new sheet "Example Source URL & Anchor Text" and in cell D2 enter the lookup =VLOOKUP(A2,B:C,2,0). Drag the formula down.

Copy column A and paste into a new sheet. Name the sheet “Final”.

Copy column D with the VLOOKUP and paste values only into column B of the "Final" sheet

In “final”, you should now have all the unique 404s and an example of a page that links to those 404s with the anchor text.

  • You can use "text to columns" to separate the source URLs and anchor text if you wish
  • If you're sending the spreadsheet on to a dev or someone else to fix the 404s, you are probably best sending the full sheet with all the inlinks to the 404s, plus the one you've just made. It depends how they go about fixing the 404s.

    Once 404s have been fixed, rerun a crawl and recheck them.

Look out for 404s that are classed as HTTP Redirects in the "Type" column – these don't seem to have a unique source URL. You may have to search for the URL in the search box in Screaming Frog and click the "Inlinks" tab to see the original link to the non-secure HTTP page.

If you like, before you send off the report to someone, you can double-check that the "destination" URLs definitely are 404s by pasting them into Screaming Frog in "List" mode.
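If you'd rather script the lookup steps above than do them in Excel, here is a pandas sketch along the same lines. It assumes the bulk export CSV has "Source", "Destination" and "Anchor" columns – check your own export, as the exact names can vary.

# Build a "unique 404 + one example source page + anchor text" sheet from the
# Screaming Frog 4xx inlinks bulk export. File and column names are assumptions.
import pandas as pd

inlinks = pd.read_csv("internal_client_error_4xx_inlinks.csv")

# Glue source URL and anchor text together, like the CONCATENATE step above
inlinks["Source | Anchor"] = inlinks["Source"] + " | " + inlinks["Anchor"].fillna("")

report = (inlinks
          .drop_duplicates(subset="Destination")   # one example row per 404
          [["Destination", "Source | Anchor"]]
          .rename(columns={"Destination": "404s Destination",
                           "Source | Anchor": "Example Source URL & Anchor Text"}))

report.to_csv("404_report_for_devs.csv", index=False)
print(report.head())

The output is the same shape as the "Final" sheet from the Excel steps: one row per unique 404, with an example source page and its anchor text.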