How to Find All Existing and Archived URLs on a Website

There are plenty of reasons you might need to find all of the URLs on a website, but your exact goal will determine what you're looking for. For example, you may want to:

Discover every indexed URL to analyze issues like cannibalization or index bloat
Collect current and historic URLs Google has seen, especially for site migrations
Find all 404 URLs to recover from post-migration errors
In each case, a single tool won't give you everything you need. Unfortunately, Google Search Console isn't exhaustive, and a "site:example.com" search is limited and hard to extract data from.

In this post, I'll walk you through some tools to build your URL list before deduplicating the data using a spreadsheet or Jupyter Notebook, depending on your site's size.

Old sitemaps and crawl exports
If you're looking for URLs that recently disappeared from the live site, there's a chance someone on your team saved a sitemap file or a crawl export before the changes were made. If you haven't already, check for these files; they can often give you what you need. But if you're reading this, you probably didn't get that lucky.
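If you do turn up an old sitemap, pulling the URLs out of it takes only a few lines. Here's a minimal sketch in Python using the standard library; the filename old-sitemap.xml is just a placeholder for whatever file your team saved.

```python
import xml.etree.ElementTree as ET

# Standard sitemap namespace used by <urlset> files.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

tree = ET.parse("old-sitemap.xml")  # placeholder filename
urls = [loc.text.strip() for loc in tree.getroot().findall("sm:url/sm:loc", NS)]

print(f"Recovered {len(urls)} URLs from the old sitemap")
```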

Archive.org
Archive.org is an invaluable tool for SEO tasks, funded by donations. If you search for a domain and select the "URLs" option, you can access up to 10,000 listed URLs.

However, there are a few limitations:

URL limit: You can only retrieve up to 10,000 URLs, which is insufficient for larger sites.
Quality: Many URLs may be malformed or reference resource files (e.g., images or scripts).
No export option: There isn't a built-in way to export the list.
To work around the lack of an export button, use a browser scraping plugin like Dataminer.io. However, these limits mean Archive.org may not provide a complete solution for larger sites. Also, Archive.org doesn't indicate whether Google indexed a URL, but if Archive.org found it, there's a good chance Google did, too.
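If you'd rather skip the scraping plugin, the Wayback Machine also exposes its index through the CDX API, which you can query directly. Here's a rough sketch using the requests library; treat the exact parameters as a starting point and check the CDX documentation for your use case.

```python
import requests

# Query the Wayback Machine CDX API for captured URLs on a domain.
resp = requests.get(
    "https://web.archive.org/cdx/search/cdx",
    params={
        "url": "example.com/*",      # your domain, with a wildcard path
        "output": "json",
        "fl": "original",            # return only the original URL field
        "collapse": "urlkey",        # one row per unique URL
        "filter": "statuscode:200",  # optional: only successful captures
    },
    timeout=120,
)
resp.raise_for_status()

rows = resp.json()
urls = [row[0] for row in rows[1:]]  # first row is the header
print(f"{len(urls)} archived URLs found")
```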

Moz Pro
While you would typically use a link index to find external sites linking to you, these tools also discover URLs on your own site in the process.


How to use it:
Export your inbound links in Moz Pro to get a quick and easy list of target URLs from your site. If you're managing a huge website, consider using the Moz API to export data beyond what's manageable in Excel or Google Sheets.
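Once you have the inbound-links export, the only part you need for this exercise is the column of target URLs. Below is a small sketch of pulling those out with pandas; the filename and the "Target URL" column name are assumptions, so check them against the header of your actual export.

```python
import pandas as pd

# Filename and column name are assumptions; match them to your Moz Pro export.
links = pd.read_csv("moz-inbound-links.csv")
target_urls = links["Target URL"].dropna().drop_duplicates()

target_urls.to_csv("moz-target-urls.csv", index=False)
print(f"{len(target_urls)} unique target URLs")
```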

It's important to note that Moz Pro doesn't confirm whether URLs are indexed or discovered by Google. However, because most sites apply the same robots.txt rules to Moz's bots as they do to Google's, this approach generally works well as a proxy for Googlebot's discoverability.

Google Search Console
Google Search Console offers several valuable sources for building your list of URLs.

Links reports:


Similar to Moz Pro, the Links section provides exportable lists of target URLs. Unfortunately, these exports are capped at 1,000 URLs each. You can apply filters for specific pages, but since filters don't apply to the export, you might need to rely on browser scraping tools, limited to 500 filtered URLs at a time. Not ideal.

Performance → Search Results:


This export gives you a list of pages receiving search impressions. While the export is limited, you can use the Google Search Console API for larger datasets. There are also free Google Sheets plugins that simplify pulling more extensive data.
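For larger sites, the Search Analytics endpoint of the Search Console API returns up to 25,000 rows per request and supports pagination via startRow. The sketch below assumes a service account key that has been granted access to the property; the dates and property URL are placeholders.

```python
from google.oauth2 import service_account
from googleapiclient.discovery import build

SCOPES = ["https://www.googleapis.com/auth/webmasters.readonly"]
SITE_URL = "https://www.example.com/"  # placeholder property

creds = service_account.Credentials.from_service_account_file(
    "service-account.json", scopes=SCOPES  # key file with access to the property
)
service = build("searchconsole", "v1", credentials=creds)

pages, start_row = set(), 0
while True:
    body = {
        "startDate": "2024-01-01",  # placeholder date range
        "endDate": "2024-03-31",
        "dimensions": ["page"],
        "rowLimit": 25000,
        "startRow": start_row,
    }
    resp = service.searchanalytics().query(siteUrl=SITE_URL, body=body).execute()
    rows = resp.get("rows", [])
    pages.update(row["keys"][0] for row in rows)
    if len(rows) < 25000:  # last page of results
        break
    start_row += 25000

print(f"{len(pages)} pages with impressions")
```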

Indexing → Pages report:


This section provides exports filtered by issue type, though these are also limited in scope.

Google Analytics
The Engagement → Pages and screens default report in GA4 is an excellent source for collecting URLs, with a generous limit of 100,000 URLs.


Even better, you can apply filters to create specific URL lists, effectively surpassing the 100k limit. For example, if you want to export only blog URLs, follow these steps (there's also a sketch of pulling the same data via the GA4 Data API after the note below):

Step 1: Add a segment to the report.

Step 2: Click "Create a new segment."


Step 3: Define the segment with a narrower URL pattern, such as URLs containing /blog/


Note: URLs found in Google Analytics may not be discoverable by Googlebot or indexed by Google, but they still offer valuable insights.
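If the UI export is too limiting, the same page data can be pulled programmatically with the GA4 Data API. Here's a minimal sketch using the official Python client; the property ID is a placeholder, and credentials are assumed to come from a service account with read access to the property.

```python
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange, Dimension, Metric, RunReportRequest,
)

# Assumes GOOGLE_APPLICATION_CREDENTIALS points to a service-account key
# with read access to the GA4 property.
client = BetaAnalyticsDataClient()

request = RunReportRequest(
    property="properties/123456789",  # placeholder property ID
    dimensions=[Dimension(name="pagePath")],
    metrics=[Metric(name="screenPageViews")],
    date_ranges=[DateRange(start_date="90daysAgo", end_date="today")],
    limit=100000,
)
response = client.run_report(request)

paths = [row.dimension_values[0].value for row in response.rows]
print(f"{len(paths)} page paths returned")
```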

Server log data files
Server or CDN log files are perhaps the ultimate tool at your disposal. These logs capture an exhaustive list of every URL path requested by users, Googlebot, or other bots during the recorded period.

Considerations:

Data size: Log files can be massive, so many sites only retain the last two months of data.
Complexity: Analyzing log files can be challenging, but various tools are available to simplify the process (a bare-bones parsing sketch follows below).
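If you only need the list of requested paths rather than a full log analysis, a few lines of Python are enough. This is a bare-bones sketch assuming an Apache/Nginx combined-format access log named access.log; CDN logs often use different formats, so adjust the pattern accordingly.

```python
import re

# Matches the request line of a common/combined log format entry,
# e.g. "GET /blog/post-1?utm=x HTTP/1.1"
REQUEST_RE = re.compile(r'"(?:GET|POST|HEAD) (\S+) HTTP/[\d.]+"')

paths = set()
with open("access.log", encoding="utf-8", errors="replace") as handle:
    for line in handle:
        match = REQUEST_RE.search(line)
        if match:
            paths.add(match.group(1).split("?", 1)[0])  # drop query strings

print(f"{len(paths)} unique paths requested")
```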
Combine, and good luck
Once you've gathered URLs from all of these sources, it's time to combine them. If your site is small enough, use Excel or, for larger datasets, tools like Google Sheets or Jupyter Notebook. Make sure all URLs are consistently formatted, then deduplicate the list.
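For the Jupyter route, a sketch along these lines covers most sites. The input filenames are placeholders for whatever exports you ended up with, and the normalization rules (lowercase scheme and host, drop fragments, strip trailing slashes) are just one reasonable convention; adjust them to match how your site actually serves URLs.

```python
import pandas as pd
from urllib.parse import urlsplit, urlunsplit

# Placeholder filenames: one column of URLs (or paths) per file, no header row.
sources = ["archive-urls.csv", "gsc-pages.csv", "ga4-paths.csv", "log-paths.csv"]
urls = pd.concat(
    [pd.read_csv(f, header=None, names=["url"]) for f in sources],
    ignore_index=True,
)["url"].dropna().astype(str)

def normalize(url: str) -> str:
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    # Lowercase scheme and host, keep the query string, drop fragments.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))

deduped = urls.map(normalize).drop_duplicates().sort_values()
deduped.to_csv("all-urls.csv", index=False, header=False)
print(f"{len(deduped)} unique URLs")
```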

And voilà: you now have a comprehensive list of current, old, and archived URLs. Good luck!
