How to Crawl a Sub-Folder Only in Screaming Frog [2025] – example.com/example-sub-folder/

If Your Folder URL Resolves with a 200 with a Trailing Slash /

For example – to crawl mysite.com/buyers-guides/

To crawl only URLs in the /buyers-guides/ folder on mysite.com using Screaming Frog, follow these steps:

  1. Open Screaming Frog SEO Spider and enter the URL: https://www.mysite.com/buyers-guides/
  2. Go to Configuration > Spider > Limits
  3. Tick “Limit Crawl Depth” and set it to 1 to crawl only the specified folder
  4. Go to Configuration > Include and add this pattern: https://www.mysite.com/buyers-guides/.*
  5. Start the crawl by clicking the “Start” button

This configuration will ensure Screaming Frog only crawls URLs within the /buyers-guides/ folder on mysite.com, excluding other sections of the website.
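For any other folder, the Include pattern follows the same shape – for example (example.com and /blog/ are just placeholders here):

https://www.example.com/blog/.*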


If Your Folder URL Resolves with a 200 without a Trailing Slash /

https://www.mysite.com/buyers-guides

Enter the URL of the domain plus sub-folder into the main address bar in Screaming Frog.

Choose the “Subfolder” option to the right of the address bar:

Go to Configuration (top menu bar, to the left) > Include
– Add the sub-folder (without trailing slash) to the include section:

Click “Start”. (Button to the right of the “Subfolder” drop down).
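As a sketch, the Include pattern would look something like this – note that without the trailing slash, the first version would also match sibling URLs like /buyers-guides-old, so the second, tighter version may be safer:

https://www.mysite.com/buyers-guides.*
https://www.mysite.com/buyers-guides(/.*)?$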


Crawling a sub-folder – plus all internal & external links that are contained within the sub-folder pages.

The protocol above would only check the status codes of any URLs actually held within the /buyers-guides/ folder.

For example, if a football goal guide links to the FA’s website and that link 404s, the above methods would not pick it up (as the FA doesn’t have /buyers-guides/ in its URLs etc.)

  1. Crawl and check all the URLs within a sub-folder e.g.

https://www.example.com/buyers-guides

  2. Get the status codes of any internal and external links that point outside the folder.



For example:


Our football goal guide –

https://www.example.com/buyers-guides/football-pitch-size-guide

Contains links that point outside of the /buyers-guides/ folder – to our product pages, and external links to thefa.com etc.

Sub-Folder with Trailing Slash /buyers-guides/

Crawl https://www.example.com/buyers-guides/ with ‘Crawl Outside of Start Folder’ disabled, but with ‘Check Links Outside of Start Folder’ enabled (both settings live under Configuration > Spider > Crawl).

Sub-Folder with NO Trailing Slash /buyers-guides

Perform the crawl outlined above (using the include), and get the list of https://www.example.com/buyers-guides URLs.

Then switch to list mode (Mode > List), go to ‘Config > Spider > Limits’, change the ‘Limit Crawl Depth’ from ‘0’ to ‘1’, and upload and crawl the URLs.

Remember to delete the /buyers-guides include from the crawl config before doing the above
i.e. Config > Include – remove anything in the box/field.

Scraping JSON Schema with Screaming Frog Custom Extraction [2025]

Last updated – a few days ago (probably).

Go to Configuration > Custom > Custom Extraction

Click the +Add button on the bottom right and choose “Regex” from the drop-down menu, which is to the right of “Extractor 1” text box.

Add the regex below in the box/field to the right of “Regex”.

<script type=\"application\/ld\+json\">(.*?)</script>

I’m using the regex below to extract product schema only – I can export to Excel and filter to the URLs containing product schema, but that doesn’t easily show which ones are missing the aggregateRating:

<script type=\"application\/ld\+json\">(.*?"@type":\s*"Product".*?)<\/script>

Scrape Product Schema & Identify Missing Fields
(Product Schema missing aggregateRating field in this case)

  • Turned out easier to use this regex to identify all the URLs that have aggregateRating fields:
"aggregateRating":\s*\{[^}]+\}


  • and set up a second custom extraction to check for URLs/pages with any reviews –
"review":\s*\[\s*\{[^]]+\}

If a page had review schema, but not aggregateRating – then it needed fixing.
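For reference, the aggregateRating pattern above would match a block like this (sample values, obviously):

"aggregateRating": {"@type": "AggregateRating", "ratingValue": "4.8", "reviewCount": "127"}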


Screaming Frog – How to Save & Load a Custom Crawl Configuration

Handy if you have loads of custom extractions set up etc.

  • Saving Custom Config File:

Under the Configuration menu, go to Profiles > Save As

  • Importing The Custom Config File

Configuration > Profiles > Load…

  • Save as default

    Also worth knowing – if you use a similar setup each time you run Screaming Frog – let’s say you exclude anything with “account” or “customer” in the URL – you can save your setup/config as the default via the same Profiles menu, which can be a nice time saver too!

I pretty much always add the following to the excludes ‘file’, so it’s easier to keep it in the default setup:

^https?://[^/]+/customer/account/.*
^https?://[^/]+/checkout/cart/.*

^.*\?.*

.js$

.css$

Screaming Frog – Custom Extraction – Extract Specific Page Copy [2025]

Last Updated – a few days ago (probably)

  • Open Screaming Frog
  • Go to Configuration in the top menu
  • Custom > Custom Extraction
  • Use Inspect Element (right click on the copy and choose “Inspect” if you use the Chrome browser) to identify the name, class or ID of the div or element the page copy is contained in:

    In this example the Div class is “prose” (f8ck knows why)

  • You can copy the Xpath instead – but it appears to do the same thing as just entering the class or id of the div:
  • The following will scrape any text in the div called “prose”:


Once you are in the Custom Extraction Window – Choose:

  • Extractor 1
  • X Path
  • In the next box enter: //div[@class='classofdiv']

    in this example – //div[@class='prose']
  • Extract Text

//div[@class='prose']

^Enter the above into the 3rd 'box' in the custom extraction window/tab. 
Replace "prose" with the name of the div you want to scrape.


If you copy the Xpath using Inspect Element – select the exact element you want. For example, don’t select the Div that contains text you want to scrape – select the text itself:

Here are some more examples:

How to Extract Common HTML Elements

XPath – Output

//div[@class='read-more'] – Extract any <div> with class “read-more”
//h1 – Extract all H1 tags
//h3[1] – Extract the first H3 tag
//h3[2] – Extract the second H3 tag
//div/p – Extract any <p> contained within a <div>
//div[@class='author'] – Extract any <div> with class “author” (remember to check the ' quote marks are correct)
//p[@class='bio'] – Extract any <p> with class “bio”
//*[@class='bio'] – Extract any element with class “bio”
//ul/li[last()] – Extract the last <li> in a <ul>
//ol[@class='cat']/li[1] – Extract the first <li> in an <ol> with class “cat”
count(//h2) – Count the number of H2s (set the extraction filter to “Function Value”)
//a[contains(.,'click here')] – Extract any link with anchor text containing “click here”
//a[starts-with(@title,'Written by')] – Extract any link with a title starting with “Written by”

 

How to Extract Common HTML Attributes

XPath – Output

//@href – Extract all links
//a[starts-with(@href,'mailto')]/@href – Extract links that start with “mailto” (email addresses)
//img/@src – Extract all image source URLs
//img[contains(@class,'aligncenter')]/@src – Extract the source URLs of all images with a class name containing “aligncenter”
//link[@rel='alternate'] – Extract elements with the rel attribute set to “alternate”
//@hreflang – Extract all hreflang values

 

How to Extract Meta Tags (including Open Graph and Twitter Cards)

I recommend setting the extraction filter to “Extract Inner HTML” for these ones.

Extract Meta Tags:

XPath – Output

//meta[@property='article:published_time']/@content – Extract the article publish date (a commonly-found meta tag on WordPress websites)

Extract Open Graph:

XPath – Output

//meta[@property='og:type']/@content – Extract the Open Graph type object
//meta[@property='og:image']/@content – Extract the Open Graph featured image URL
//meta[@property='og:updated_time']/@content – Extract the Open Graph updated time

Extract Twitter Cards:

XPath – Output

//meta[@name='twitter:card']/@content – Extract the Twitter Card type
//meta[@name='twitter:title']/@content – Extract the Twitter Card title
//meta[@name='twitter:site']/@content – Extract the Twitter Card site object (Twitter handle)

How to Extract Schema Markup in Microdata Format

If it’s in JSON-LD format, then jump to the section on how to extract schema markup with regex.

Extract Schema Types:

XPath – Output

//*[@itemtype]/@itemtype – Extract all of the types of schema markup on a page
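A small addition of my own – you can pull the property names as well:

//*[@itemprop]/@itemprop – Extract all of the itemprop property names on a page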


Update:

If the ‘shorter code’ in the tables above doesn’t work for some reason, you may have to right click > Inspect and copy the full XPath code to be more specific about what you want to extract:

For sections of text like paragraphs and on page descriptions, select the actual text in the inspect window before copying the Xpath.

Update 2

We wanted to compare the copy and internal links before and after a site-migration to a new CMS.

To see the links in HTML format – you just need to change “Extract Text” to “Extract Inner HTML” in the final drop down:

On the new CMS, it was easier to just copy the XPath

Why Use Custom Extraction with Screaming Frog?

I’m glad you asked.

We used it to check that page copy had migrated properly to a new CMS.

We also extracted the HTML within the copy, to check the internal links were still present.

One cool thing you can do is scrape reviews and then analyse them to see key feedback/pain points that could inform superior design.

Here’s a good way to use custom extraction/search – to find text that you want to use as anchor text for internal links.

I’m still looking into how to analyse the reviews – but this tool is a good starting point: https://seoscout.com/tools/text-analyzer

Throw the reviews in and see what words are repeated etc

This tool is also very good:

https://voyant-tools.org

Or – just paste into ChatGPT and ask for insights and pain-points to help develop a better product.

I asked ChatGPT and Google Gemini why I would want to scrape copy from sites and got these responses:


Monitoring Competitors

Content Strategy Insights: Gather headings, titles, and meta descriptions from competitor websites to gain insights into their content strategies and how they optimize for search engines.

Pricing Information: For e-commerce platforms, you can use custom extraction to keep an eye on competitors’ pricing strategies by pulling prices for similar products.

Promotional Strategies: Collect details on promotions, discounts, or special offers from competitors’ sites to evaluate how they stack up against your own promotional efforts.

Keyword Usage: Identify specific keywords from competitors’ content to understand how they are targeting particular topics or niches.

Backlink Analysis: While Screaming Frog isn’t primarily a backlink tool, you can utilize custom extraction to pinpoint external links on a page, providing insights into a competitor’s backlink strategy.

Optimizing Your Own Website

Content Optimization: Employ custom extraction to review your own site’s headings, titles, meta descriptions, and keyword usage to spot areas that need improvement.

Structural Data: Extract and assess structured data (like Schema markup) across your site to ensure proper implementation and identify potential enhancement opportunities.

Internal Linking: Analyze internal linking structures to understand the flow of page authority throughout your site and to find orphan pages or those with insufficient internal link support.

Audit Forms and Calls to Action (CTAs): Extract form action URLs and CTA links to verify they are working correctly and to optimize their placement and wording.

Identify Outdated Content: Use custom extraction to locate dates and timestamps on your content, which can help you identify articles or pages that may require updates.

Gemini said:

Reasons to Use Custom Extraction in Screaming Frog:

The custom extraction feature in Screaming Frog provides numerous advantages for website analysis and data collection. Here are some important reasons to utilize it:

  1. Extract Specific Data:

Go beyond basic SEO metrics: While Screaming Frog offers valuable insights, custom extraction enables you to focus on specific data points that may not be easily accessible, such as product prices, article publish dates, or user reviews.

Gain deeper competitor insights: Extract information on competitor pricing strategies, product descriptions, or content structures to compare with your own website.

Gather information for internal analysis: Monitor specific elements on your site, like call to action (CTA) button text or internal linking structures, to observe changes and assess their impact.

  2. Automate Data Collection:

Save time and effort: Collecting data manually can be labor-intensive and prone to errors. Custom extraction streamlines the process, allowing you to efficiently gather information across numerous pages.

Maintain consistent data: By setting up automated data extraction, you ensure uniform data collection over time, which facilitates better trend analysis and comparisons.

  3. Enhance Reporting and Analysis:

Combine extracted data with existing Screaming Frog metrics: Merge the extracted data with other SEO parameters such as page titles, meta descriptions, and internal links for a more thorough analysis.

Create custom reports: Use the extracted data to generate tailored reports for specific purposes, like competitor pricing comparisons or evaluations of content performance.

Monitoring Competitors:

Custom extraction serves as a valuable tool for competitor monitoring in various ways:

Extract competitor pricing data: Keep track of competitor pricing trends, identify potential gaps in your own pricing strategy, and make informed pricing decisions.

Analyze competitor content structure and keywords: Learn how competitors format their content, pinpoint their targeted keywords, and gain insights to enhance your own strategy.

Note to self – for Magento 2, Hyva theme sub-category page copy – scrape using:

//div[@id='descriptionDiv']

Product page descriptions – upper and lower divs:

//div[@class="product-description"]

//*[@id="specifications"]/div/div[2]/div/div/div/div/div

//*[@id="description"]/div/div[2]

Creating an Actionable 404 Report from Screaming Frog

Update – I don’t think all of the process below is required.

Just download the 404 inlinks report from Screaming Frog:

Bulk Export (very top, slightly to the left in the GUI) > Response Codes > Internal > Internal Client Error (4xx) Inlinks

Copy the “Destination” column (column C on the report) and paste it into a new Excel tab/sheet, then remove duplicates

In the first sheet, copy and paste the source column into column D

In the second sheet, do a vlookup using the destination URL as the lookup value, against columns C and D in the first sheet, to return the relevant source URL

Copy the vlookup and paste – values – into column A in the second sheet

You can also copy and paste the anchor text and location into column C

Follow this protocol to produce a sheet you can send to devs etc. to remove 404s

  • This will get rid of the site-wide 404s and some individual 404s

Run a crawl with Screaming Frog

Export the report: Bulk Export > Response Codes > Internal > Internal Client Error (4xx) Inlinks (check 500s too)

In Excel – Copy and paste the “destination” URLs into a new sheet – into column A

Remove duplicates from the destination URLs that you’ve just copied into a new sheet

Rename the column – 404s Destination

  • Copy and paste the Source URLs and the Anchor Text into a new sheet.

Paste Source URLs in column A, and Anchor Text into column C

In cell B1 type – ” | ”

In cell D1 – give the column the heading “Source | Anchor”

In cell D2 concatenate – =CONCATENATE(A2,$B$1,C2)

Drag the formula down.

You’ll now have the anchor text and the source URL together, so you can vlookup the destination (404) URL
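As an aside (my shortcut, not part of the original protocol) – you can get the same result without the helper cell in B1, assuming Source is in column A and Anchor Text in column C:

=A2&" | "&C2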

  • create a new sheet
  • Copy and paste all of the Source URLs | Anchor Text (from the concatenate formula) – paste special – values only
  • Copy & Paste Destination URLs from the original sheet into columns B and C in the new sheet you just made.


You need “destination” in column B and “Source | Anchor Text” in column C, as vlookup has to go left to right

  • So you’ll have – 404s Destination – Destination – Source | Anchor Text

Name column D in the new sheet “Example Source URL & Anchor Text” and in cell D2 enter the lookup – =VLOOKUP(B2,B:C,2,0). Drag the formula down

Copy column A and paste into a new sheet. Name the sheet “Final”.

Copy column D with the vlookup and paste values into column B in the “Final Spreadsheet”

In “final”, you should now have all the unique 404s and an example of a page that links to those 404s with the anchor text.

  • You can use “text to columns” to separate the source URLs and anchor text if you wish
  • If you’re sending the spreadsheet onto a dev or someone to fix the 404s, you are probably best sending the full sheet with all the inlinks to the 404s, plus the one you’ve just made. It depends how they go about fixing the 404s.

    Once 404s have been fixed, rerun a crawl and recheck them.

Look out for 404s that are classed as HTTP Redirects in the “type” column – these don’t seem to have a unique source URL. You may have to search for the URL in the search box in Screaming Frog and click the “inlinks” tab to see the original link to the non-secure http page

If you like, before you send off the report to someone, you can double check the “destination” URLs definitely are 404s, by pasting them into Screaming Frog in “list” mode

Finding & Fixing 404s with Screaming Frog [2024]

Export the 404 Inlinks Report into Excel

The best way in Screaming Frog to see the source and destination of all 404s is to go to Bulk Export in the top menu:

And then Response Codes > Client Error (4xx) Inlinks

In the Exported Excel Sheet:

Copy the “destination” column

Paste into a new sheet/tab – In Column B

Remove duplicates

Back in the first sheet – paste “Source” into column D, to the right of “destination” in the first tab/sheet


In the second sheet –
Do a vlookup in the second sheet/tab – to import a source URL to each unique “destination” URL
Add the word “source” to cell C1 in the second sheet

I think you have to click the file name to download the example sheet below:

Sheet / Tab 1 should look like this:

The Second Sheet / Tab should look like this:

The Lookup Value in Cell C2, in the sheet above is:

=VLOOKUP(Sheet1!B2,'1 – Client Error (4xx) Inlinks'!C:D,2,0)

Just double click / drag it down to complete the lookup
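Optionally (my addition), wrap the lookup in IFERROR so any misses show a friendly message instead of #N/A:

=IFERROR(VLOOKUP(Sheet1!B2,'1 – Client Error (4xx) Inlinks'!C:D,2,0),"no source found")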

You can add more lookups and columns to provide more details about the link location.

Just copy and paste values on the last Vlookup.

Paste another column like “Anchor” into column D on the first Tab

Paste the vlookup into the adjacent column

A “0” result means the lookup found a blank cell – i.e. no anchor text etc

Update –

404s found in the Navigation or “Aside” tend to be site-wide, so you can find one instance of this and update the 404 and Bob’s your uncle.

If a 404 is within the page content/copy however, it is possibly unique and needs fixing individually. So take a look at the content 404s separately. They probably only have one source URL, so no need to faff with vlookups etc.

Bulk checking Canonical URLs & Hreflang Tags for SEO on an eCommerce Store (With Screaming Frog)

TLDR – the easiest way to check hreflang is to just go to Reports > Hreflang > Non-200 Hreflang URLs:

Filter the “Source” column to your current domain and then you should see all the hreflang URLs that 30x or 404.

  • However – you ideally want to get all of your indexable, 200 status code URLs – to have the relevant hreflang tags.
  • So first find all the indexable, 200, URLs with exact-match canonicals
  • Then check the hreflang on these only
  • Get all the URLs that 200 and are indexable from “internal_all” report
  • Paste in list mode, all of these URLs – check hreflang tab
  • Perform a Crawl Analysis
  • Check The overview panel

  • Order the Hreflang tab by occurrences to double check all URLs have at least 1 hreflang

Checklist here – https://docs.google.com/spreadsheets/d/1IHRZNFravVKaZ5T9uOEOg_zQ5ee8dHcRP8m92vkn4KY/edit#gid=0

Configuration > Spider > Crawl – tick crawl and store for Hreflang

Find URLs that should have a hreflang tag – Indexable URLs

  • Crawl the website with Screaming Frog – with crawl and store Hreflang enabled in the configuration
  • Download the “internal” report on the first tab
  • Filter the relevant columns to find all URLs that return 200 (Status Code) and are indexable (Indexability)

Check Canonical URLs are Correct and Indexable

  • From the internal Report – filter status code to 200 only and Indexable
  • Paste the filtered columns into a new tab/sheet and delete irrelevant columns so you just have URL and Canonical columns
  • Use the =EXACT() formula to find all URLs that match the canonical URL – add an extra column for this and call it “exact?” (see the formula sketch after this list)
  • For canonical checks – filter to “FALSE” and see which URLs don’t have self ref canonicals
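A minimal sketch of that check, assuming the URL is in column A and the canonical in column B (adjust to your layout) – it returns TRUE/FALSE and is case-sensitive, which matters for URLs:

=EXACT(A2,B2)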

Back in Internal – Original Spreadsheet

  • Unfilter the internal report.
  • Filter to show all the canonical URLs. Canonical Column – filter – untick “blanks”
  • Copy and paste into Screaming Frog in list mode – check the status codes of all these canonical URLs

  • For Hreflang – Filter Canonical URL – “exact?” column to “true”
  • You should be left with all URLs that are indexable (and should therefore have hreflang if an international site) – i.e. 200 status codes and URLs that exactly match canonical URLs.

Hreflang Checks

For the indexable URLs found above, check hreflang using:

https://technicalseo.com/tools/hreflang/

  • Go to the Screaming Frog GUI – and the Hreflang window

Order by Occurrences – check if any URLs have fewer hreflang entries than they should have

  • Check all the reports in the drop down menu on the left

Non-200 Hreflang URLs – URLs within the rel=”alternate” markup that don’t result in a 200 status code

Download the non-200 hreflang report – by going to the “hreflang” tab in Screaming Frog.

Delete all the columns except those containing the actual hreflang URLs

Then in the column adjacent to all the hreflang URLs add the formula:
=INDEX($A$1:$G$126,CEILING(ROW()/7, 1),MOD(ROW()-1,7)+1)

The formula above would be used if the hreflang URLs are in columns A to G, and the last populated row is number 126.
You’ll also need to update the number 7 in the formula – for example, if you have 5 columns of hreflang URLs, change 7 to 5.

Drag this formula down until you start seeing #REF! errors, indicating you’ve covered all the data in your original columns
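So, as a concrete example of the adjustment described above – with hreflang URLs in columns A to E and 80 populated rows, the formula becomes:

=INDEX($A$1:$E$80,CEILING(ROW()/5,1),MOD(ROW()-1,5)+1)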

  • You should now have all the hreflang URLs listed in the non-200 hreflang report
  • To find the specific URLs that are non-200 – paste the last column containing all the URLs into Screaming Frog in “List” mode

Export the List mode ‘crawl’ from Screaming Frog and filter by status code

  • Unlinked Hreflang URLs – pages that contain one or more hreflang tags/URLs that are only linked to via hreflang annotations, and not from actual links in the webpages

  • Missing Return Links – Hreflang should be reciprocal.
  • Inconsistent Language & Region Return Links – This filter includes URLs with inconsistent language and regional return links to them. This is where a return link has a different language or regional value than the URL is referencing itself
  • Non Canonical Return Links – URLs with non canonical hreflang return links. Hreflang should only include canonical versions of URLs.
  • Noindex Return Links – Return links which have a ‘noindex’ meta tag. All pages within a set should be indexable.
  • Incorrect Language & Region Codes – This simply verifies the language (in ISO 639-1 format) and optional regional (in ISO 3166-1 Alpha 2 format) code values are valid
  • Missing Self Reference – URLs missing their own self referencing rel=”alternate” hreflang annotation. It was previously a requirement to have a self-referencing hreflang, but Google has updated their guidelines to say this is optional. It is however good practice and often easier to include a self referencing attribute.
  • Not Using Canonical – URLs not using the canonical URL on the page, in its own hreflang annotation. Hreflang should only include canonical versions of URLs.
  • Missing – URLs missing an hreflang attribute completely. These might be valid of course, if there aren’t multiple versions of a page.
  • Outside <head> – Pages with an hreflang link element that is outside of the head element in the HTML. The hreflang link element should be within the head element, or search engines will ignore it.

Official documentation from Screaming Frog here.

You can also use this Google sheet to check hreflang further

Excluding URLs in Screaming Frog Crawl [2024]

To exclude URLs just go to:

Configuration > Exclude (in the very top menu bar)

To exclude URLs within a specific folder, use the following regex:

^https://www.mydomain.com/customer/account/.*
^https://www.mydomain.com/checkout/cart/.*

The above regex will stop Screaming Frog from crawling the customer/account folder and the cart folder.

Or – this is easier for me, as I have to check and crawl lots of international domains with the same site structure and folders:

^https?://[^/]+/customer/account/.*
^https?://[^/]+/checkout/cart/.*
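The [^/]+ part just means “any hostname”, so those same two lines cover every domain/subdomain – e.g. both of these (made-up) URLs would be excluded:

https://www.mysite.de/customer/account/login/
https://shop.mysite.fr/checkout/cart/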

Excluding Images –

I’ve just been using the image extensions to block them in the crawl, e.g.

.*jpg

Although you can block them in the Configuration > Spider menu too.

Excluding Parameter URLs

This appears to do the job:

^.*\?.*
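For example, that pattern would exclude parameter URLs like these (made-up examples):

https://www.mysite.com/shoes?colour=red
https://www.mysite.com/search?q=goal+nets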

My typical “Excludes” looks like this:

^https?://[^/]+/customer/account/.*
^https?://[^/]+/checkout/cart/.*

^.*\?.*
jpg$

png$

.js$

.css$

Update – you can just use this to block any URLs containing “cart” or “account”

/account/|/cart/

Update:

Currently using this for my excludes config, as I actually want to crawl images:

^https?://[^/]+/customer/account/.*
^https?://[^/]+/checkout/cart/.*

^.*\?.*

.js$

.css$

  • You can just exclude crawling JS and CSS in Configuration > Spider > Crawl, but I find it slightly quicker this way
  • If you are using JS rendering to crawl, you might want to crawl JS files too, depending on whether they’re required to follow any JS links etc. (it’s generally a bad idea to have JS links – if you do, have an HTML backup or prerendering in place)