Screaming Frog – Custom Extraction – Extract Specific Page Copy [2025]

Last Updated – a few days ago (probably)

  • Open Screaming Frog
  • Go to Configuration in the top menu
  • Custom > Custom Extraction
  • Use Inspect Element (right-click on the copy and choose “Inspect” if you use the Chrome browser) to identify the name, class or ID of the div or element that the page copy is contained in:

    In this example the Div class is “prose” (f8ck knows why)

  • You can copy the XPath instead – but it appears to do the same thing as just entering the class or ID of the div:
  • The following will scrape any text in the div called “prose”:


Once you are in the Custom Extraction window – choose:

  • Extractor 1
  • XPath
  • In the next box enter: //div[@class='classofdiv']

    in this example – //div[@class='prose']
  • Extract Text

//div[@class='prose']

Enter the above into the third box in the Custom Extraction window/tab.
Replace “prose” with the class of the div you want to scrape.
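For reference, here's a minimal sketch of the kind of markup this matches (the class name “prose” and the copy are just from this example) – Screaming Frog would return the text inside the div:

<div class="prose">
  <p>This is the page copy that will be extracted.</p>
  <p>This paragraph will be extracted too.</p>
</div>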


If you copy the XPath using Inspect Element – select the exact element you want. For example, don’t select the div that contains the text you want to scrape – select the text itself:

Here are some more examples:

How to Extract Common HTML Elements

XPath – Output

//div[@class='read-more'] – Extract any <div> with class “read-more”
//h1 – Extract all H1 tags
//h3[1] – Extract the first H3 tag
//h3[2] – Extract the second H3 tag
//div/p – Extract any <p> contained within a <div>
//div[@class='author'] – Extract any <div> with class “author” (remember to check the ' quote marks are correct)
//p[@class='bio'] – Extract any <p> with class “bio”
//*[@class='bio'] – Extract any element with class “bio”
//ul/li[last()] – Extract the last <li> in a <ul>
//ol[@class='cat']/li[1] – Extract the first <li> in an <ol> with class “cat”
count(//h2) – Count the number of H2s (set the extraction filter to “Function Value”)
//a[contains(.,'click here')] – Extract any link with anchor text containing “click here”
//a[starts-with(@title,'Written by')] – Extract any link with a title starting with “Written by”
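These patterns can also be combined. For example – assuming the same “prose” div from earlier – if you only wanted the paragraphs inside it, rather than all of its text, something like this should do it:

//div[@class='prose']//p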

 

How to Extract Common HTML Attributes

XPath – Output

//@href – Extract all links
//a[starts-with(@href,'mailto')]/@href – Extract links that start with “mailto” (email addresses)
//img/@src – Extract all image source URLs
//img[contains(@class,'aligncenter')]/@src – Extract the source URLs of images whose class name contains “aligncenter”
//link[@rel='alternate'] – Extract elements with the rel attribute set to “alternate”
//@hreflang – Extract all hreflang values
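One more attribute extraction I find handy, following the same pattern as the table above – pulling the canonical URL from each page:

//link[@rel='canonical']/@href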

 

How to Extract Meta Tags (including Open Graph and Twitter Cards)

I recommend setting the extraction filter to “Extract Inner HTML” for these ones.

Extract Meta Tags:

XPath – Output

//meta[@property='article:published_time']/@content – Extract the article publish date (a commonly-found meta tag on WordPress websites)

Extract Open Graph:

XPath – Output

//meta[@property='og:type']/@content – Extract the Open Graph type object
//meta[@property='og:image']/@content – Extract the Open Graph featured image URL
//meta[@property='og:updated_time']/@content – Extract the Open Graph updated time

Extract Twitter Cards:

XPath – Output

//meta[@name='twitter:card']/@content – Extract the Twitter Card type
//meta[@name='twitter:title']/@content – Extract the Twitter Card title
//meta[@name='twitter:site']/@content – Extract the Twitter Card site object (Twitter handle)

How to Extract Schema Markup in Microdata Format

If it’s in JSON-LD format, then jump to the section on how to extract schema markup with regex.

Extract Schema Types:

XPath – Output

//*[@itemtype]/@itemtype – Extract all of the types of schema markup on a page
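To pull individual values out of the microdata, you can target itemprop attributes the same way – a sketch, assuming the page marks up product names with itemprop="name" (check your own markup first):

//*[@itemprop='name']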


Update:

If the ‘shorter code’ in the tables above doesn’t work for some reason, you may have to right-click > Inspect and copy the full XPath to be more specific about what you want to extract:

For sections of text like paragraphs and on-page descriptions, select the actual text in the Inspect window before copying the XPath.
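For example, a copied “full XPath” is an absolute path, something like the (made up) one below – it’s much more specific, but also brittle, since it breaks if the page layout changes:

/html/body/div[2]/main/div/p[1]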

Update 2

We wanted to compare the copy and internal links before and after a site-migration to a new CMS.

To see the links in HTML format, you just need to change “Extract Text” to “Extract Inner HTML” in the final drop down.

On the new CMS, it was easier to just copy the XPath.

Why Use Custom Extraction with Screaming Frog?

I’m glad you asked.

We used it to check that page copy had migrated properly to a new CMS.

We also extracted the HTML within the copy, to check the internal links were still present.

One cool thing you can do is scrape reviews, and then analyse them to see the key feedback/pain points that could inform superior design.
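As a rough sketch – if the reviews on the site sit in divs with a class like “review” (a hypothetical class name – inspect the actual markup first), the extractor could be:

//div[contains(@class,'review')]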

Here’s a good way to use custom extraction/search to find text that you want to use as anchor text for internal links:

[Screenshot: custom search in Screaming Frog]

I’m still looking into how to analyse the reviews – but this tool is a good starting point: https://seoscout.com/tools/text-analyzer

Throw the reviews in and see what words are repeated, etc.

This tool is also very good:

https://voyant-tools.org

Or – just paste them into ChatGPT and ask for insights and pain points to help develop a better product.

I asked ChatGPT and Google Gemini why I would want to scrape copy from sites and got these responses:


Monitoring Competitors

Content Strategy Insights: Gather headings, titles, and meta descriptions from competitor websites to gain insights into their content strategies and how they optimize for search engines.

Pricing Information: For e-commerce platforms, you can use custom extraction to keep an eye on competitors’ pricing strategies by pulling prices for similar products.

Promotional Strategies: Collect details on promotions, discounts, or special offers from competitors’ sites to evaluate how they stack up against your own promotional efforts.

Keyword Usage: Identify specific keywords from competitors’ content to understand how they are targeting particular topics or niches.

Backlink Analysis: While Screaming Frog isn’t primarily a backlink tool, you can utilize custom extraction to pinpoint external links on a page, providing insights into a competitor’s backlink strategy.

Optimizing Your Own Website

Content Optimization: Employ custom extraction to review your own site’s headings, titles, meta descriptions, and keyword usage to spot areas that need improvement.

Structural Data: Extract and assess structured data (like Schema markup) across your site to ensure proper implementation and identify potential enhancement opportunities.

Internal Linking: Analyze internal linking structures to understand the flow of page authority throughout your site and to find orphan pages or those with insufficient internal link support.

Audit Forms and Calls to Action (CTAs): Extract form action URLs and CTA links to verify they are working correctly and to optimize their placement and wording.

Identify Outdated Content: Use custom extraction to locate dates and timestamps on your content, which can help you identify articles or pages that may require updates.

Gemini said:

Reasons to Use Custom Extraction in Screaming Frog:

The custom extraction feature in Screaming Frog provides numerous advantages for website analysis and data collection. Here are some important reasons to utilize it:

  1. Extract Specific Data:

Go beyond basic SEO metrics: While Screaming Frog offers valuable insights, custom extraction enables you to focus on specific data points that may not be easily accessible, such as product prices, article publish dates, or user reviews.

Gain deeper competitor insights: Extract information on competitor pricing strategies, product descriptions, or content structures to compare with your own website.

Gather information for internal analysis: Monitor specific elements on your site, like call to action (CTA) button text or internal linking structures, to observe changes and assess their impact.

  2. Automate Data Collection:

Save time and effort: Collecting data manually can be labor-intensive and prone to errors. Custom extraction streamlines the process, allowing you to efficiently gather information across numerous pages.

Maintain consistent data: By setting up automated data extraction, you ensure uniform data collection over time, which facilitates better trend analysis and comparisons.

  3. Enhance Reporting and Analysis:

Combine extracted data with existing Screaming Frog metrics: Merge the extracted data with other SEO parameters such as page titles, meta descriptions, and internal links for a more thorough analysis.

Create custom reports: Use the extracted data to generate tailored reports for specific purposes, like competitor pricing comparisons or evaluations of content performance.

Monitoring Competitors:

Custom extraction serves as a valuable tool for competitor monitoring in various ways:

Extract competitor pricing data: Keep track of competitor pricing trends, identify potential gaps in your own pricing strategy, and make informed pricing decisions.

Analyze competitor content structure and keywords: Learn how competitors format their content, pinpoint their targeted keywords, and gain insights to enhance your own strategy.

Note to self – for Magento 2 (Hyva theme), sub-category page copy – scrape using:

//div[@id='descriptionDiv']

Product page descriptions, upper and lower divs –

//div[@class="product-description"]

//*[@id="specifications"]/div/div[2]/div/div/div/div/div

//*[@id="description"]/div/div[2]

Using Google Translate in Google Sheet [2024]

With my copy in column B (starting in cell B2), the source language short code “en” in cell E1, and the target language code “no” (for Norwegian – what I want to translate the English into) in cell F1, the formula I place in cell D2 is:
=GOOGLETRANSLATE(B2,$E$1,$F$1)

I can drag the formula down to translate all the English in column B.

That’s it really

🙂

By the way – if you have a site that’s randomly in two or more languages, you can use the DETECTLANGUAGE Google Sheets function, pointing it at the cell you want to check:

=DETECTLANGUAGE(B2)
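You can also chain the two functions, so each row is translated from whatever language is detected – assuming, as above, the copy is in B2 and the target language is Norwegian:

=GOOGLETRANSLATE(B2, DETECTLANGUAGE(B2), "no")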

Product Schema Example (with review schema) 2024

Here’s an example:

<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "Product",
  "description": "The best pads you can buy online for MMA and boxing. Made with leather, manufactured by BJJ black belts and elves.",
  "gtin8": "sdfdfsf3w5455",
  "name": "Boxing and MMA Pads",
  "image": "https://cdnfake.com/media/catalog/product/m/i/boxing-pads-solo.jpg",
  "sku": "boxing-mini-pads-only",
  "url": "https://www.blackbeltwhitehat.co.uk/nice-mma-target-pads-boxing-only.html",
  "brand": "Nice MMA",
  "offers": [
    {
      "@type": "Offer",
      "itemCondition": "http://schema.org/NewCondition",
      "price": "89.99",
      "availability": "http://schema.org/InStock",
      "priceCurrency": "GBP",
      "url": "https://www.blackbeltwhitehat.co.uk/nice-mma-target-pads-boxing-only.html"
    }
  ],
  "review": [
    {
      "@type": "Review",
      "author": {
        "@type": "Person",
        "name": "DAVE MACDONALD"
      },
      "datePublished": "2017-07-27",
      "description": "Grandson loves using these",
      "name": "ALFFI-JAC MACDONALD",
      "reviewRating": {
        "@type": "Rating",
        "bestRating": "5",
        "ratingValue": "5",
        "worstRating": "1"
      }
    }
  ],
  "aggregateRating": {
    "@type": "AggregateRating",
    "ratingValue": "5",
    "reviewCount": "1"
  }
}
</script>

I use these tools to check schema

https://classyschema.org/Visualisation
https://search.google.com/test/rich-results
https://validator.schema.org/

and use https://www.diffchecker.com/ to compare an existing schema – one that I know works and validates – against another one that I’m testing.

Here’s another example I just found within the Google documentation:

 <html>
  <head>
    <title>Executive Anvil</title>
    <script type="application/ld+json">
    {
      "@context": "https://schema.org/",
      "@type": "Product",
      "name": "Executive Anvil",
      "description": "Sleeker than ACME's Classic Anvil, the Executive Anvil is perfect for the business traveler looking for something to drop from a height.",
      "review": {
        "@type": "Review",
        "reviewRating": {
          "@type": "Rating",
          "ratingValue": 4,
          "bestRating": 5
        },
        "author": {
          "@type": "Person",
          "name": "Fred Benson"
        }
      },
      "aggregateRating": {
        "@type": "AggregateRating",
        "ratingValue": 4.4,
        "reviewCount": 89
      }
    }
    </script>
  </head>
  <body>
  </body>
</html>

Enjoy!

How to Audit Hreflang Tags with Screaming Frog

Quickest way to check Hreflang:

  • In Configuration, store (but don’t crawl) the hreflang
  • Complete a crawl
  • In the top menu go to Reports > Hreflang > Non-200
  • Also look at all the hreflang in the reports and order by number – to check for any missing hreflang

Check using the tests and reports here first:

https://businessdaduk.com/wp-content/uploads/2024/02/hreflang-checklist-nov-2023.xlsx

To crawl and include Hreflang Reports –

Menu Bar > Configuration > Spider > Crawl – check the boxes to store and crawl hreflang (if you want to crawl the hreflang URLs and check they all return 200)

To Filter on the Hreflang Tab

Click on the sliders icon at the top right of the search window.

You then get “Excel style” filtering options:

[Screenshot: hreflang filter in Screaming Frog]

TL;DR – Use the hreflang Tab/Window
– Use the drop down menu to show each report
– Download – and check “indexable” URLs only –> non-indexable URLs don’t need hreflang
– Check each indexable URL has appropriate number of hreflang
– Also check non-200 report

Find URLs without all of the relevant Hreflang Attributes:

  • Hreflang Window –
    select “All” in the drop down menu
    – Filter out any parameter URLs using the filter in the screenshot above – Does Not Contain (!~) – ?
  • Export to Excel
  • Filter so only shows “Indexable” URLs
  • Find any URLs without the relevant number of hreflangs – e.g. if you have 8 sites in different languages/regions – you’ll probably want most of your Indexable URLs to have 8 “occurrences” of hreflang
  • Check the Non-200 Hreflang dropdown for any errors – the easiest way to export the specific hreflang URLs that don’t return a 200 is to go to Reports > Hreflang > Non-200.
    Filter the spreadsheet to the non-empty hreflang column D:
  • Unlinked Hreflang URLs – perform a Crawl Analysis to check this

  • Missing Return Links

Use the search function near the top right (click the slider icon) – Filter to show only “indexable” URLs to find URLs that should have return links, that are missing them.

Notes from Screaming Frog’s website

1) Select ‘Crawl’ and ‘Store’ Hreflang under ‘Config > Spider > Crawl’

2) To Crawl Hreflang In XML Sitemaps, Select ‘Crawl Linked XML Sitemaps’ Under ‘Config > Spider > Crawl’

3) Crawl The Website

4) View The Hreflang Tab

5) View the different Hreflang reports using the drop down menu

6) Perform a “crawl analysis” to see the 6th report

Reports

  • Contains Hreflang – URLs that have the rel=”alternate” markup
  • Non-200 Hreflang URLs – URLs within the rel=”alternate” markup that don’t result in a 200 status code
  • Unlinked Hreflang URLs – pages that contain one or more hreflang tags, where the URL is only linked to via hreflang tags and not from links within the actual webpages
  • Missing Return Links – hreflang should be reciprocal.
  • Inconsistent Language & Region Return Links – this filter includes URLs with inconsistent language and regional return links to them. This is where a return link has a different language or regional value than the URL is referencing itself.
  • Non-Canonical Return Links – URLs with non-canonical hreflang return links. Hreflang should only include canonical versions of URLs.
  • Noindex Return Links – return links which have a ‘noindex’ meta tag. All pages within a set should be indexable.
  • Incorrect Language & Region Codes – this simply verifies the language (in ISO 639-1 format) and optional regional (in ISO 3166-1 Alpha 2 format) code values are valid.
  • Missing Self Reference – URLs missing their own self-referencing rel=”alternate” hreflang annotation. It was previously a requirement to have a self-referencing hreflang, but Google has updated their guidelines to say this is optional. It is, however, good practice and often easier to include one.
  • Not Using Canonical – URLs not using the canonical URL on the page in their own hreflang annotation. Hreflang should only include canonical versions of URLs.
  • Missing – URLs missing an hreflang attribute completely. These might be valid, of course, if there aren’t multiple versions of the page.
  • Outside <head> – pages with an hreflang link element that is outside of the head element in the HTML. The hreflang link element should be within the head element, or search engines will ignore it.

To bulk export details of source pages, that contain errors or issues for hreflang, use the ‘Reports > Hreflang’ options.

For example, the ‘Reports > Hreflang > Non-200 Hreflang URLs’ export lists the source pages containing hreflang URLs that don’t return a 200 status code.

Notes from Search Engine Land

Irrelevant Hreflang Values

Sometimes, the language and region values in a hreflang tag are not properly aligned with the page’s relevant languages or countries. This error can be trickier to handle as tools won’t be able to identify it, so a manual review will be needed to detect if the hreflang values are really showing the correct language and/or country for the page in question.

Remember, hreflang attributes require a language to be specified, but region is optional and should only be used when necessary (for example, if you want to serve different pages to Spanish speakers in Mexico and Spanish speakers in Spain).

It’s critical to verify, before implementing anything, whether the site is language or country targeted (or if there’s a mix of approaches that you need to be aware of). The hreflang values will need to be generated according to this targeting.

Another scenario I’ve found is that, in some cases, the language (or country) code hasn’t been correctly implemented and always specifies the same language (or country) for each alternate URL. In this example from Audible, the home pages for France and Germany have been tagged as English language pages, even though they’re really in French and in German, respectively.

Irrelevant URLs

Similar to the previous example, sometimes the hreflang attributes are showing the right language and/or country values, but the URLs have not been correctly specified.

For example, in the case of Skype, you can see that the English language version URL is always specified instead of the relevant language URL for each case. (Similarly, the canonical tag is always showing the English URL instead of the relevant one, as in the case of the Spanish language page below).

URLs Missing the Full Prefix – e.g. www. instead of https://www. in hreflang

There are also situations where URLs that are meant to have absolute paths don’t include the “http://” or “https://” at the start, making them relative URLs which don’t point to the correct page.

Notes from LinkedIn article

In some cases, the same page may contain information for people speaking different languages, so using hreflang tags alone may not be sufficient. Using schema.org markup can help search engines more accurately recognize parts of web pages. For example, inLanguage defines the language of the content or performance or used in an action in schemes such as Event, CreativeWork, BroadcastService and others.

There are multiple free online tools available for testing. My favorite is https://technicalseo.com/tools/hreflang/. Google Search Console deprecated its country-targeting feature in September 2022; however, third-party crawl tools such as Screaming Frog and Ryte.com can uncover site-wide language and regional targeting issues fairly well.

If you use a tool and get the message:

“Missing region-independent link for that language (en)”
It can mean – for example, with the Technical SEO tool – that we need a generic URL for English-speaking visitors, regardless of what region/country they come from.

In practice, it’s often recommended to have a ‘fallback’ or a default hreflang tag for each language. For English, this would be a tag with the language code “en” without a country code. This tag acts as a catch-all for English speakers in regions not specifically targeted by other tags (like en-GB or en-US).

For example, if your website has English pages specifically for the US and the UK, your hreflang tags might look something like this:

  • <link rel="alternate" href="http://example.com/en-gb" hreflang="en-gb" /> for English speakers in the UK
  • <link rel="alternate" href="http://example.com/en-us" hreflang="en-us" /> for English speakers in the US

To resolve the error, you would add a tag for a generic English version:

  • <link rel="alternate" href="http://example.com/en" hreflang="en" /> for English speakers in general, regardless of region

This setup ensures that search engines know which page to show to English-speaking users based on their location, and that they also have a default page to show to English speakers in locations not covered by your region-specific tags.
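Putting that together, a full reciprocal set – repeated on each of the three pages – might look like the sketch below, with an optional x-default added as the catch-all for everyone else:

<link rel="alternate" href="http://example.com/en" hreflang="en" />
<link rel="alternate" href="http://example.com/en-gb" hreflang="en-gb" />
<link rel="alternate" href="http://example.com/en-us" hreflang="en-us" />
<link rel="alternate" href="http://example.com/en" hreflang="x-default" />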

Headless CMS & SEO [2023]

Summarised a few articles about SEO and headless CMSs below:

  • Headless SEO:
    • Refers to SEO processes when using a headless CMS.
    • Requires content modelling for SEO, intentional technical setup, and an omnichannel approach.
    • Traditional SEO tools and techniques may not automatically apply to a headless CMS.
    • Core SEO principles remain the same: provide the best content to answer user queries.

  • Three Main Aspects of Headless SEO:
    • Content Models for SEO:
      • Headless CMS decouples content from its presentation.
      • Content is treated as data, stored centrally, and can be reused.
      • Content modeling defines content types, attributes, and relationships.

  • Limited Plugins and Add-ons:
    • Can’t rely on traditional SEO plugins like Yoast.
    • SEOs have full control over technical setup.
    • Explicit technical requests to the development team are necessary.


  • Omnichannel SEO:
    • SEO now goes beyond just website optimization.
    • Includes optimizing for search engines, social media, email, and other channels.
    • SEOs must ensure consistent experiences across all channels.


  • Is Headless Better for SEO?:
    • Trade-offs exist with headless CMS.
    • Opportunities for content distribution and repurposing are larger.
    • Benefits include decoupling content from presentation and managing content in multiple languages.
    • Headless CMS can be combined with static site generators for modern web architecture.



  • Best Practices for Headless CMS SEO:
    • Meta Tags:
      • Request essential meta tags like title, meta description, meta robots, viewport, content type, Open Graph tags, and language.
      • Include validation rules for these fields.

    • URL Slug:
      • Ensure content editors can modify the URL slug.
      • Keep URLs user-friendly and keyword-rich.

    • Canonical URLs:
      • Define canonical URLs in the page’s head or HTTP header.
      • Use absolute URLs and define only one canonical URL per page.
      • Consider URL types and evaluate with the technical team for additional canonicalization rules.
    • Define your XML sitemap setup:
      • Sitemaps should be dynamic and updated regularly.
      • Option to clear sitemap cache and regenerate on demand is beneficial.
      • Include only indexable, canonical URLs with a 200 HTTP response code in the sitemap.
      • Sitemaps typically reside in the website’s root directory but can be elsewhere if indicated in the robots.txt file.
      • Sitemaps can be divided based on content types (e.g., posts, pages, authors).
      • Google often ignores <priority> and <changefreq> tags in sitemaps.
      • Sitemaps can submit non-web page content like videos, images, or news articles.

    • Request a field for schema markup:
      • Schema markup helps search engines understand content better.
      • Can be added per URL or at a content component level.
      • Enables eligibility for various rich results (e.g., breadcrumbs, video results, FAQs).
    • Ensure logical heading hierarchy:
      • Headings in a headless CMS can be tricky due to content being decoupled from layout.
      • Heading hierarchy should reflect content organization.
      • Proper hierarchy is essential for web accessibility, aiding visually impaired users.

  • Summary:
    • Headless architecture empowers businesses to control their digital experience.
    • Initial technical setup for headless SEO can be challenging but offers greater control and integration.
    • Headless SEO provides vast possibilities for content-led experiences in ecommerce and content management for large international sites.
    • The true potential of headless SEO lies in the innovative ways digital teams will harness it in the future.

  • What is a headless CMS?
    • A CMS that separates content creation and management from its presentation.
    • Offers flexibility in delivering content to various devices and platforms.
    • Results in faster loading times and personalized experiences.


  • How headless CMS works in practice:
    • Removes the need for content changes to go through the development team.
    • Content marketers can publish and edit content using familiar CMS platforms.
    • Content is sent to the main website via API integration.


  • Benefits of headless CMS for organizations and SEO:
    • Scalability: Allows rapid content production across various platforms.
    • Improved Relationships: Separation fosters better collaboration between developers and SEOs.
    • SEO-friendly Features: Includes structured data support and customizable URLs.
    • Enhanced Security: Decouples presentation from backend, shielding sensitive systems.
    • Faster Load Speeds: Reduces complexity associated with traditional CMS architectures.

  • Google’s stance on headless CMS:
    • Google is neutral about the CMS choice.
    • John Mueller, Google’s Senior Search Analyst, stated Google doesn’t look for specific CMS signals in its algorithm.
    • Headless CMS neither positively nor negatively impacts SEO directly.

  • Best practices for SEO with headless CMS:
    • Ensure schema, metadata, and URL structure are consistent.
    • Validate content for mobile-friendliness.
    • Ensure content is included in fetch and render tests.
    • Update XML sitemap with new content.

  • Conclusion:
    • Adopting a headless CMS offers numerous benefits, including improved SEO performance.
    • Understanding its workings allows organizations to leverage its advantages for content production, collaboration, security, and speed.

  • Differences between headless SEO and traditional SEO:
    • Aim: Both aim to optimize content for search intent and high SERP ranking.
    • Omnichannel content delivery:
      • Traditional SEO has content delivery restrictions.
      • Headless SEO supports omnichannel content delivery for personalized experiences across devices.

    • Presentation layer freedom:
      • Traditional solutions limit innovative front-end experiences.
      • Headless SEO offers more creative freedom.

    • Platform dependence:
      • Traditional SEO locks users into a specific CMS platform.
      • Headless SEO offers freedom to choose tools and technologies.

    • Loading speed:
      • Traditional SEO can result in slow-loading websites.
      • Headless SEO optimizes loading speed and overall performance.

  • Pros and Cons of headless SEO:
    • Pros:
      • Enhances performance: Optimizes front-end code for speed and user experience.
      • Effective at Scale: Allows independent work of developers and content teams, beneficial for large websites.
    • Cons:
      • Technical skills required: Need for developers to build the front end and manage data fetching.
      • Requires knowledge of technical SEO: Users must handle aspects like website crawlability, speed, mobile responsiveness, redirects, and more, which were typically managed by traditional CMSs.

Pros and Cons of Traditional SEO

  • Pros:
    • No need for technical expertise due to traditional CMS.
    • Reduces decision-making time as tools and integrations are predefined.
  • Cons:
    • Slow website speeds due to simultaneous loading of frontend, backend, and content.
    • Limited omnichannel content delivery and digital experiences.

Comparison: Headless SEO vs. Traditional SEO

  • Headless SEO:
    • Supports omnichannel delivery.
    • Offers freedom in tool and technology choice.
    • Faster page load times.
  • Traditional SEO:
    • Limited content delivery channels.
    • Restricted to platform’s tools and integrations.
    • Slower page load times.

SEO Best Practices for Developers

  • Use semantic HTML for better content structure.
  • Ensure websites are mobile-friendly.
  • Optimize URLs for search with clear and relevant slugs.
  • Implement schema markup for enhanced search results.
  • Use server-side rendering for better page indexing (look into “pre-rendering” JS).
  • Include essential SEO meta tags (see the sketch after this list).
  • Create XML sitemaps for improved website crawlability.
  • Optimize images for faster loading and better user experience.
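As a sketch of what “essential SEO meta tags” might cover – a minimal <head> (all URLs and copy below are placeholders) could look like this:

<head>
  <title>Page Title – Brand</title>
  <meta name="description" content="A short summary of the page for the SERPs." />
  <meta name="robots" content="index, follow" />
  <meta name="viewport" content="width=device-width, initial-scale=1" />
  <link rel="canonical" href="https://www.example.com/page" />
  <meta property="og:title" content="Page Title" />
  <meta property="og:image" content="https://www.example.com/image.jpg" />
</head>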

Conclusion

  • Components like link building and pillar pages remain consistent regardless of CMS type.
  • Headless SEO demands technical skills and in-depth SEO knowledge.
  • Combining on-page, off-page, and technical SEO with the right tools ensures optimal search engine rankings.

Technical SEO And Headless CMS

Tech Audit case study from Bizwrites.com

Background

  • Client used a headless CMS without considering marketing or SEO.
  • Headless CMS without proper planning can lead to preventable problems.
  • Traditional CMS (e.g., WordPress) covers basic SEO with plugins like Yoast.
  • Headless CMS requires developers to add specific SEO functionalities.
  • Client’s team lacked technical SEO knowledge during site development.

Approach

  • Client’s main goal: reach their ideal customer.
  • Instead of focusing on content, a technical SEO audit was recommended.
  • Preliminary look revealed numerous technical issues.
  • Decision made to prioritize technical audit over content.

Technical Issues Identified and Dealt With

  1. Irrelevant pages being crawled:
    • Over 4000 irrelevant pages crawled by Google.
    • Adjusted robots.txt and eliminated source of dynamically generated duplicate pages.
    • Result: Significant reduction in crawled but not indexed pages.
  2. Duplicate content issues:
    • No meta tag control: Duplicated page titles and meta descriptions.
      • Solution: Added a plugin for easy editing of titles and descriptions.
    • Incorrectly implemented redirects: Soft 404s caused by meta refresh redirects.
      • Solution: Use 301 redirects instead of meta refresh redirects.
    • Missing redirects: No consistent URL format, leading to multiple versions of the same page.
      • Solution: Redirected non-trailing-slash, non-WWW URL formats to the correct format.
      • Result: Eliminated duplicate content issues and improved user experience.

Results

  • Collaborated with client’s dev team for implementation.
  • Changes made within two weeks in December.
  • Significant improvements observed:
    • Avg. weekly clicks increased by 92.5%.
    • Avg. weekly impressions increased by 2.3X.
    • Total keywords the domain ranked for increased by 4.2X.
  • SEMRush’s crawl comparison showed positive technical SEO metrics.
  • Emphasized that while growth is impressive, it will eventually level out.

Key Points about SEO in the JAMStack World with a Headless CMS

  • Living in the JAMStack world doesn’t alter primary SEO priorities.
  • Migrating from a Legacy CMS to a Headless CMS offers SEO benefits related to page performance, security, user experience, and multi-platform content delivery.

  • Unlike traditional CMSs, a headless CMS doesn’t offer plug-and-play simplicity for SEO factors.
  • The main difference between traditional and headless CMS is the ability to edit metadata instantly.
  • Traditional CMSs, like WordPress or Drupal, allow easy addition of page titles, descriptions, and other meta tags.

  • Headless CMSs, such as Hygraph, require different handling due to their cross-platform flexibility.
  • Standard best practices for on-page and off-page optimization remain unchanged: quality content, keyword optimization, interlinked content, domain authority, social sharing, and reputable backlinks.
  • SEO starts with the build and technical implementations, setting the foundation for content teams.

Technical SEO Best Practices for Headless CMS

  • Ease-of-crawlability and Page Structure:
    • Use Schema.org structured data markup to enrich HTML tags, making content more understandable to search engines.
    • Structured data aids in better indexing and understanding by search engine bots.
    • Example provided: JSON-LD structured data snippet by Google for contact information.
    • Structured data can help content appear as featured snippets in search results.
  • Meta Tags:
    • Essential for describing a page’s content to search engines.
    • Four key meta tags: Title tag, Meta Description, Meta Keywords, and Meta Robots. (not sure about meta keywords – think they went out with the ark)
    • In a headless CMS like Hygraph, meta tags should be added as String fields to content models, allowing content authors to add relevant metadata.
    • Proper use of a headless CMS with these practices can enhance SEO results.
  • React or Vue:
    • While they favor fast loading, there are SEO challenges.
    • Essential to add a component for metadata.
    • React Helmet can manage metadata in React apps.
    • React Router can enhance URL structure.
    • Use Isomorphic Javascript or Prerender for better SEO with client-side JavaScript.

  • Static Site Generators (SSGs):
    • Hugo, Jekyll, and Gatsby handle major SEO challenges.
    • They offer fast loading and manage metadata effectively.
    • Gatsby has plugins like React Helmet and Sitemap.

  • Content Delivery Network (CDN):
    • Distributes content across global servers.
    • Enhances website performance and user experience.

  • Image Optimization:
    • Optimize image size in headless CMS.
    • Use Lazy Loading for images and videos.
    • Prefer SVG or WebP formats for faster loading.
    • Add Open Graph metatags for images.

  • HTTPS:
    • Upgrade to HTTPS for security and user trust.
    • HTTPS websites load faster and rank better on Google.

Setting Foundations for Content Creators

  • URL Structure:
    • Use SEO-friendly URLs like example.com/page-title-with-keywords.
    • Include target keywords in URLs.
    • Ensure content has clear tags like H1, H2, etc.

  • Content Enrichment:
    • Optimize the use of images and media for better user experience and SEO.
    • Ensure assets are resized, compressed, and have relevant file names and alt-attributes.
    • Make pages easily shareable with OpenGraph and Twitter Card meta information.
    • Shareable pages improve backlinks, referral traffic, and brand awareness.

Final Takeaway

  • Proper technical foundation is crucial before content creation for better SEO.
  • Empower content teams with the right tools and practices.
  • Continuous engagement and regular audits are essential for maintaining and improving SEO.
  • While challenges exist with headless CMS, with the right practices, SEO improvement is achievable.

References:

Search Engine Land

https://opensenselabs.com/blog/articles/omnichannel-content-delivery-guide

Bizwrites.com

Hygraph.com

In case you’re wondering what “omnichannel” means:

What is an omnichannel content approach? An omnichannel content approach is simply about bringing together all content channels so that they work in parallel to improve and customize the user’s experience.

OfferCatalog Schema Example for eCommerce Product-Listing / Category Page [2024]

Below is an example for a webpage which lists different cakes for sale:

<script type="application/ld+json">

{
  "@context": "https://schema.org/",
  "@type": "OfferCatalog",
  "name": "Cakes for Sale",
  "description": "A catalog of cakes offered by our online store",
  "url": "https://www.example.com/all-cakes",
  "numberOfItems": 2,
  "itemListOrder": "https://schema.org/ItemListOrderAscending",
  "itemListElement": [{
    "@type": "Offer",
    "itemOffered": {
      "@type": "Product",
      "name": "Cake1",
      "image": "https://www.example.com/images/product1.jpg",
      "description": "Description of cake 1",
      "sku": "sku1",
      "manufacturer": "Manufacturer 1",
      "brand": "Brand 1",
      "offers": {
        "@type": "Offer",
        "price": "9.99",
        "priceCurrency": "USD",
        "availability": "https://schema.org/InStock"
      }
    }
  }, {
    "@type": "Offer",
    "itemOffered": {
      "@type": "Product",
      "name": "Product 2",
      "image": "https://www.example.com/images/product2.jpg",
      "description": "Description of product 2",
      "sku": "sku2",
      "manufacturer": "Manufacturer 2",
      "brand": "Brand 2",
      "offers": {
        "@type": "Offer",
        "price": "199.99",
        "priceCurrency": "USD",
        "availability": "https://schema.org/InStock"
      }
    }
  }]
}
</script>

Once you’ve done your own code, you can check it using the schema testing sheet – https://docs.google.com/spreadsheets/d/1dDqJ9_qrTVoJOZRxieJDp9XE3Q1TCq4ixNVudVtgp2M/edit#gid=0

Here’s another example:

<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "OfferCatalog",
  "name": "Competition Tennis Balls",
  "description": "High-quality competition tennis balls for professional and amateur players.",
  "url": "https://www.example.com/tennis-la/balls/competition-tennis-balls.html",
  "itemListElement": [
    {
      "@type": "Offer",
      "itemOffered": {
        "@type": "Product",
        "name": "Pro Tournament Tennis Ball",
        "image": "https://www.example.com/images/pro_tournament_tennis_ball.jpg",
        "description": "Premium quality tennis ball designed for professional tournaments.",
        "brand": "example Pro",
        "sku": "NWPRO123",
        "offers": {
          "@type": "Offer",
          "priceCurrency": "USD",
          "price": "20.00",
          "availability": "http://schema.org/InStock",
          "url": "https://www.example.com/pro-tournament-tennis-ball.html"
        }
      }
    },
    {
      "@type": "Offer",
      "itemOffered": {
        "@type": "Product",
        "name": "Championship Level Tennis Ball",
        "image": "https://www.example.com/images/championship_level_tennis_ball.jpg",
        "description": "High-performance tennis ball suitable for championship level play.",
        "brand": "Example La",
        "sku": "NWCHAMP456",
        "offers": {
          "@type": "Offer",
          "priceCurrency": "USD",
          "price": "15.00",
          "availability": "http://schema.org/InStock",
          "url": "https://www.example.com/championship-level-tennis-ball.html"
        }
      }
    }
  ]
}
</script>

Instead of OfferCatalog, you may want to consider using CollectionPage schema…

Here’s an example of CollectionPage Schema for an eCommerce category page:

<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "CollectionPage",
  "name": "Shopify Apps",
  "url": "https://sherpas.design/pages/shopify-apps",
  "description": "We build apps that function and feel natively Shopify",
  "image": "https://cdn.shopify.com/s/files/1/0085/8515/0560/files/logox2_500x500.png?v=1555661781",
  "isPartOf": {
    "@type": "WebSite",
    "name": "Sherpas Design",
    "url": "https://sherpas.design"
  },
  "relatedLink": [
    "https://sherpas.design/about-us",
    "https://sherpas.design/contact"
  ],
  "mainEntity": {
    "@type": "ItemList",
    "itemListElement": [
      {
        "@type": "ListItem",
        "position": 1,
        "url": "http://example.com/coffee_cake.html",
        "name": "Coffee Cake",
        "image": "http://example.com/images/coffee_cake.jpg",
        "description": "product is yummy",
        "brand": "NanesBakes"
      },
      {
        "@type": "ListItem",
        "position": 2,
        "url": "http://example.com/apple_pie.html",
        "name": "Apple Pie",
        "image": "http://example.com/images/apple_pie.jpg",
        "description": "product is yummy",
        "brand": "MumsCakesYo"
      },
      {
        "@type": "ListItem",
        "position": 3,
        "url": "http://example.com/blueberry-pie.html",
        "name": "Blueberry Pie",
        "image": "http://example.com/images/blueberry_pie.jpg",
        "description": "product is yummy",
        "brand": "NanesBakes"
      }
    ]
  }
}
</script>

More info on CollectionPage Schema for eCommerce in my other blog post here.

Auditing Canonicals With Screaming Frog [2024]

  • TL;DR – go to the “Issues” reports – Bulk Report – Issues – All – and export into a folder
  • View the canonicals canonicalised reports

Some of the names of the reports can be confusing:

  • The “Canonicals Missing Inlinks” report is a list of the pages missing canonical URLs, plus the inlinks to those pages
  • The “Canonicals Missing” report – as you’d expect – shows you the pages without canonical URLs/tags
  • “Canonicals Canonicalised” – pages whose canonical points to a different URL. So you might have example.com/help/contact canonicalised to example.com/help – which may or may not be a problem.
  • “Canonicals Canonicalised Inlinks” – the inlinks to those pages whose canonicals differ from their own URL.

  • Check the canonical with JS turned off (using the Developer Chrome extension) and check it remains the same
  • Check the view source code with and without JS turned on – check the canonical remains the same
  • Check the canonical is not added using JS – this is not ideal – more info here
  • Check for multiple canonical URLs using Screaming Frog, and check visually in the view source code (JS rendering might be required to see all canonicals)

  • For paginated pages – if you want each page indexed, check that each page has its own canonical URL (see the sketch after this list)
  • Check that faceted/filtered pages don’t change the canonical URL (generally you don’t want them to)
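For example (hypothetical URLs) – if page 2 of a paginated category should be indexed, it would self-canonicalise rather than point at page 1:

<link rel="canonical" href="https://www.example.com/category?page=2" />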

Exclude/filter these out – in this instance, by adding a “does not contain” filter in Excel for “+”

Check this blog post too about auditing canonicals and Hreflang tags

Tools & Checklist for Schema [2023]

I’ve put together a spreadsheet for SEO schema checks here:

https://docs.google.com/spreadsheets/d/1dDqJ9_qrTVoJOZRxieJDp9XE3Q1TCq4ixNVudVtgp2M/edit#gid=0

Use https://classyschema.org/Visualisation and import a URL – test using different URLs, especially for eCommerce websites – like homepage, about page, product page, category pages

Test also with:

https://search.google.com/test/rich-results – add any issues in column C of sheet

and

https://validator.schema.org/ – add issues in Column D

  • Put your “to do” list in columns E and F

When using the validator, make sure your schema has all the relevant parameters.

The separate sheets on the Google Sheet linked above have some examples for you to check against – written in JSON-LD.

If the site is already live, check the schema reports in Search Console for any errors.

If you have Screaming Frog, I’d also recommend running a crawl –

Go to Configuration – Spider – Crawl – and tick all the “Structured Data” boxes:

[Screenshot: structured data crawl settings in Screaming Frog]

Then, when the crawl has run, go to Reports – Structured Data and download each report.

Finally, you can also check individual pages using the Ryte Structured Data Helper plugin for Chrome:
https://chrome.google.com/webstore/detail/ryte-structured-data-help/ndodccbbcdpcmabmiocobdnfiaaimgnk/related

SEMRush Guide to the Basics (2023)

SEMRush Tech Audit

Use the tech audit to keep an eye on your website.

You’ll need a Google Search Console account to do an SEMRush tech audit.

Go to “Site Audit” near the bottom of the side menu on the left and then click the “+ Project” button.

Once the site audit has been carried out, you’ll get a load of metrics that score the health of your website from a tech SEO perspective.


Check the notices, warnings and errors, to see what you need to fix.

More info in this blog post.

SEMRush Keyword Gap

Keyword gap – add competitors, see what they are ranking for, that you are not.

Filter by search volume of keywords and keyword difficulty.

Keyword gap is also good for Google Ads insights. If a keyword costs less than a few dollars per click, but is transactional and difficult to rank for organically, it might be worth bidding on in Google Ads.

On Page SEO Checker

Gives you ideas, including the top pages to optimize.

Suggests keywords to include on specific pages.

Suggests backlinks to go after.

UX issues can be flagged here too – connect Google Analytics, then check out pages with a high bounce rate and a short time on site.

Organic Traffic insights

Connect Google Analytics and Search Console.

Here, you can see pages with good and bad UX metrics and the keywords driving traffic

You can also see which keywords each page ranks for.

Use this report to decide if you should include other keywords on a given page – in the page’s existing content, or create new, more specific pages.

Domain overview

Key metrics, shown at the top include bounce rate and average time on site.

You can see your own, or a competitor’s, top pages in terms of traffic.

See traffic sources – what percentage of traffic is direct, coming from social media, etc. (you may have to click on “Traffic Analytics” in the side menu to see this).

You can compare domain metrics against another competitor’s, or your own, website.

You can use Bulk Analysis to enter up to 100 domains to examine.

Keyword overview tool

Good for finding head/seed keywords.

You can view the Google SERPs and see what the competition for a given keyword is like.

See what other keywords competitors are ranking for, on a given URL:

You can see questions related to/containing a given keyword

You can see international search volume for a keyword.

Related keywords are shown too – e.g. “football supplies” is related to “football equipment”

See SERP Features – what rich snippets etc are showing in the SERPs.

Keyword Magic tool

Start with a seed keyword.

By default, the report will give you broad match keywords.

Be sure to check out “related” match types (“all keywords” is also helpful).

For KW research – click on one of the main head KWs – and see what keywords your competitors rank for
(clicking on the keyword takes you to the Keyword Overview tool – scroll down to SERP Analysis)

You can add your selected keywords to a list.

More info in the semrush blog post about the tool.

Content marketing

Topic research – add a keyword to get topic ideas.

For example, if you enter “football equipment” (in the US) the tool gives lots of article ideas around “shoulder pads”, “football helmets”, “football cleats” etc.

Sometimes you get random ideas that you would never have thought of, like the searches around the dog called “Pickle” when you look for topic ideas about Pickleball.

  • There is loads more you can do with SEMRush, including position tracking. I’ll cover these additional tools in dedicated blog posts.

Keyword Manager

If you have a big list of keywords you want to get search volume for, I like to click “Create a regular list”, then name the list and click “Add keywords” in the top right of the screen.

It’s also handy if you’ve done keyword research for one country and need the search volumes for another country.