There are many reasons you might need to find all of the URLs on a website, and your exact goal will determine what you’re looking for. For example, you may want to:
Find every indexed URL to analyze issues like cannibalization or index bloat
Collect current and historic URLs Google has seen, especially for site migrations
Find all 404 URLs to recover from post-migration errors
In each of these scenarios, a single tool won’t give you everything you need. Unfortunately, Google Search Console isn’t exhaustive, and a “site:example.com” search is limited and difficult to extract data from.
In this post, I’ll walk you through some tools to build your URL list before deduplicating the data using a spreadsheet or Jupyter Notebook, depending on your website’s size.
Old sitemaps and crawl exports
If you’re looking for URLs that recently disappeared from the live site, there’s a chance someone on your team may have saved a sitemap file or a crawl export before the changes were made. If you haven’t already, check for these files; they can often provide what you need. But if you’re reading this, you probably didn’t get so lucky.
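If you do turn up an old sitemap, pulling the URLs out of it takes only a few lines. Here’s a minimal sketch, assuming a standard XML sitemap saved locally as “old-sitemap.xml” (the filename is just a placeholder):

```python
# Minimal sketch: extract <loc> URLs from a saved XML sitemap.
# "old-sitemap.xml" is a placeholder filename.
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

tree = ET.parse("old-sitemap.xml")
urls = [loc.text.strip() for loc in tree.getroot().findall(".//sm:loc", NS)]

print(f"{len(urls)} URLs found")
for url in urls[:10]:
    print(url)
```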
Archive.org
Archive.org is an invaluable tool for SEO tasks, funded by donations. If you search for a domain and select the “URLs” option, you can access up to 10,000 listed URLs.
However, there are a few limitations:
URL limit: You can only retrieve up to 10,000 URLs, which is insufficient for larger sites.
Quality: Many URLs may be malformed or reference resource files (e.g., images or scripts).
No export option: There isn’t a built-in way to export the list.
To work around the lack of an export button, use a browser scraping plugin like Dataminer.io. However, these limitations mean Archive.org may not provide a complete solution for larger sites. Also, Archive.org doesn’t indicate whether Google indexed a URL, but if Archive.org found it, there’s a good chance Google did, too.
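If you’d rather skip the scraping plugin, the Wayback Machine also exposes a CDX API that returns captured URLs for a domain. Here’s a minimal sketch; the query parameters reflect the publicly documented options, but treat the exact details as an assumption and check the current docs:

```python
# Minimal sketch: pull archived URLs for a domain from the Wayback Machine CDX API.
# The endpoint and parameters are based on the public CDX documentation; verify
# them before relying on this at scale.
import requests

domain = "example.com"  # placeholder domain

resp = requests.get(
    "https://web.archive.org/cdx/search/cdx",
    params={
        "url": f"{domain}/*",   # all paths under the domain
        "output": "json",
        "fl": "original",       # only return the original URL field
        "collapse": "urlkey",   # deduplicate by normalized URL
    },
    timeout=60,
)
rows = resp.json()
urls = [row[0] for row in rows[1:]]  # first row is the header
print(f"{len(urls)} archived URLs found")
```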
Moz Pro
Although you’d typically use a link index to find external sites linking to you, these tools also discover URLs on your own site in the process.
How to use it:
Export your inbound links in Moz Pro to get a quick and easy list of target URLs from your site. If you’re dealing with a massive website, consider using the Moz API to export data beyond what’s manageable in Excel or Google Sheets.
It’s important to note that Moz Pro doesn’t confirm whether URLs are indexed or discovered by Google. However, since most sites apply the same robots.txt rules to Moz’s bots as they do to Google’s, this method generally works well as a proxy for Googlebot’s discoverability.
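For very large sites, a short script against the Moz Links API can page through more link data than a manual export. The sketch below assumes the v2 links endpoint with HTTP Basic auth using an access ID and secret key; the endpoint, request body, and response fields are assumptions, so confirm them against Moz’s current API documentation.

```python
# Hedged sketch: fetch inbound-link records (and therefore target URLs on your
# site) from the Moz Links API. Endpoint, auth scheme, request body, and the
# response field names are assumptions based on the v2 docs.
import requests

ACCESS_ID = "your-access-id"      # placeholder credentials
SECRET_KEY = "your-secret-key"

resp = requests.post(
    "https://lsapi.seomoz.com/v2/links",   # assumed v2 endpoint
    auth=(ACCESS_ID, SECRET_KEY),
    json={
        "target": "example.com",           # placeholder domain
        "target_scope": "root_domain",
        "limit": 50,
    },
    timeout=60,
)
data = resp.json()
# The response shape is an assumption; inspect it and collect the target-page
# URLs from the link records it contains.
target_urls = {link.get("target") for link in data.get("results", [])}
print(f"{len(target_urls)} unique target URLs in this page of results")
```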
Google Search Console
Google Search Console offers several valuable sources for building your list of URLs.
Links reports:
Similar to Moz Pro, the Links section provides exportable lists of target URLs. Unfortunately, these exports are capped at 1,000 URLs each. You can apply filters for specific pages, but since filters don’t apply to the export, you might need to rely on browser scraping tools, which are limited to 500 filtered URLs at a time. Not ideal.
Performance → Search results:
This export gives you a list of pages receiving search impressions. While the export is limited, you can use the Google Search Console API for larger datasets (see the sketch after this section). There are also free Google Sheets plugins that simplify pulling more extensive data.
Indexing → Pages report:
This section provides exports filtered by issue type, though these are also limited in scope.
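To pull more pages than the Performance export allows, you can page through the Search Analytics API. Here’s a minimal sketch using the google-api-python-client library with a service account; the property URL, key file, and date range are placeholders:

```python
# Minimal sketch: page through the Search Console Search Analytics API to list
# every page with impressions. Property URL, key file, and dates are placeholders.
from google.oauth2.service_account import Credentials
from googleapiclient.discovery import build

creds = Credentials.from_service_account_file(
    "service-account.json",  # placeholder key file
    scopes=["https://www.googleapis.com/auth/webmasters.readonly"],
)
service = build("searchconsole", "v1", credentials=creds)

SITE_URL = "https://example.com/"  # placeholder property

urls, start_row = set(), 0
while True:
    body = {
        "startDate": "2024-01-01",
        "endDate": "2024-12-31",
        "dimensions": ["page"],
        "rowLimit": 25000,   # API maximum per request
        "startRow": start_row,
    }
    response = service.searchanalytics().query(siteUrl=SITE_URL, body=body).execute()
    rows = response.get("rows", [])
    if not rows:
        break
    urls.update(row["keys"][0] for row in rows)
    start_row += len(rows)

print(f"{len(urls)} pages with impressions")
```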
Google Analytics
The Engagement → Pages and screens default report in GA4 is an excellent source for collecting URLs, with a generous limit of 100,000 URLs.
Even better, you can apply filters to create different URL lists, effectively surpassing the 100k limit. For example, if you want to export only blog URLs, follow these steps:
Step 1: Add a segment to the report
Step 2: Click “Create a new segment.”
Step 3: Define the segment with a narrower URL pattern, such as URLs containing /blog/
Note: URLs found in Google Analytics might not be discoverable by Googlebot or indexed by Google, but they offer valuable insights.
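If you’d rather pull the same data programmatically instead of building segments, the GA4 Data API is one way to do it. A hedged sketch, assuming a service account with access to the property (the property ID and credentials are placeholders):

```python
# Hedged sketch: list page paths from a GA4 property via the Data API, filtered
# to paths containing "/blog/" as in the segment example above. The property ID
# is a placeholder; credentials come from GOOGLE_APPLICATION_CREDENTIALS.
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange, Dimension, Filter, FilterExpression, Metric, RunReportRequest,
)

client = BetaAnalyticsDataClient()

request = RunReportRequest(
    property="properties/123456789",  # placeholder property ID
    dimensions=[Dimension(name="pagePath")],
    metrics=[Metric(name="screenPageViews")],
    date_ranges=[DateRange(start_date="2024-01-01", end_date="2024-12-31")],
    dimension_filter=FilterExpression(
        filter=Filter(
            field_name="pagePath",
            string_filter=Filter.StringFilter(
                value="/blog/",
                match_type=Filter.StringFilter.MatchType.CONTAINS,
            ),
        )
    ),
    limit=100000,
)

response = client.run_report(request)
urls = [row.dimension_values[0].value for row in response.rows]
print(f"{len(urls)} blog page paths found")
```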
Server log files
Server or CDN log files are perhaps the ultimate tool at your disposal. These logs capture an exhaustive list of every URL path requested by users, Googlebot, or other bots during the recorded period.
Considerations:
Data size: Log files can be enormous, so many sites only retain the last two weeks of data.
Complexity: Analyzing log files can be challenging, but various tools are available to simplify the process.
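As a rough illustration of what that analysis can look like, here’s a minimal sketch that pulls the requested paths out of a log in the combined Apache/Nginx format and flags which ones Googlebot hit. The filename and log format are assumptions; adapt the regex to your server’s format.

```python
# Minimal sketch: extract requested URL paths from an access log in the
# combined log format and note which were requested by Googlebot.
# "access.log" and the regex are assumptions; adjust for your log format.
import re
from collections import defaultdict

LINE_RE = re.compile(
    r'"(?:GET|POST|HEAD) (?P<path>\S+) HTTP/[\d.]+" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

paths = defaultdict(lambda: {"hits": 0, "googlebot": False})

with open("access.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        match = LINE_RE.search(line)
        if not match:
            continue
        entry = paths[match.group("path")]
        entry["hits"] += 1
        if "Googlebot" in match.group("agent"):
            entry["googlebot"] = True

print(f"{len(paths)} unique paths seen")
```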
Combine, and good luck
Once you’ve gathered URLs from all of these sources, it’s time to combine them. If your site is small enough, use Excel; for larger datasets, use tools like Google Sheets or a Jupyter Notebook. Make sure all URLs are consistently formatted, then deduplicate the list.
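If you go the Jupyter Notebook route, a few lines of pandas will handle the combining and deduplication. A minimal sketch, assuming each source has been exported to a one-column CSV of URLs (the filenames are placeholders):

```python
# Minimal sketch: combine URL exports from several sources, normalize
# formatting, and deduplicate. The CSV filenames are placeholders.
import pandas as pd

sources = [
    "archive_org.csv",
    "moz_links.csv",
    "gsc_pages.csv",
    "ga4_pages.csv",
    "log_paths.csv",
]

frames = [pd.read_csv(path, names=["url"], header=None) for path in sources]
urls = pd.concat(frames, ignore_index=True)

# Normalize: trim whitespace and drop trailing slashes so duplicates match.
urls["url"] = (
    urls["url"]
    .astype(str)
    .str.strip()
    .str.replace(r"/+$", "", regex=True)
)

deduped = urls.drop_duplicates(subset="url").sort_values("url")
deduped.to_csv("all_urls_deduped.csv", index=False)
print(f"{len(deduped)} unique URLs")
```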
And voilà, you now have a comprehensive list of current, old, and archived URLs. Good luck!