Extracting URLs from Page Code in Excel

Removing Code from the End of the URLs

If the links don’t have .html in the protocol (at the end of them), then first concatenate the URLs to add * at the end of them all.

If they do have .HTML, then find and replace .HTML with .html*

Next – use Text to Columns, use * as the “delimiter”

Removing code from the Start of the URLs

If the links are relative:

Use text to columns again and use / as the delimiter

If the links contain the full domain name, then concatenate * at the start of the domain name and use text to columns again to remove the code fro the start of the URLs

Alternative Method

Here is another method for doing it – https://businessdaduk.com/data/data-analysis-with-excel/getting-domains-from-a-list-of-urls/

Ensure all the domains have http:// prefix

Use the formula:

=IF(ISERROR(FIND("//www.",A2)), MID(A2,FIND(":",A2,4)+3,FIND("/",A2,9)-FIND(":",A2,4)-3), MID(A2,FIND(":",A2,4)+7,FIND("/",A2,9)-FIND(":",A2,4)-7))
Advertisement

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s