Or Better Yet -
It's been over a year and Google still has the old URLs in its index
Just over a year ago, I started working on a website that had over 900k files sitting directly at the top level of the domain. We changed the URL structure to a more organized hierarchy. The pages' content changed slightly, but most importantly, instead of all of the site's pages residing directly under the main domain (using a broad-to-longtail computer example: domain.com/computer.html, domain.com/laptop-computer.html, and domain.com/500gb-laptop-computer.html), we changed them to a directory-and-file structure that actually represents that hierarchy (domain.com/computer/, then domain.com/computer/laptop-computer/, then domain.com/computer/laptop-computer/500gb.html).
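For the curious, the redirects themselves were the standard server-side kind. Here is a minimal sketch of what that mapping might look like on Apache with mod_rewrite, using the made-up example URLs above - your server and rules will differ, and a real migration needs a generated old-to-new mapping, not three hand-written lines:

    # Illustrative only: permanent (301) redirects from the old flat
    # URLs to the new hierarchy (assumes Apache with mod_rewrite)
    RewriteEngine On
    RewriteRule ^computer\.html$ /computer/ [R=301,L]
    RewriteRule ^laptop-computer\.html$ /computer/laptop-computer/ [R=301,L]
    RewriteRule ^500gb-laptop-computer\.html$ /computer/laptop-computer/500gb.html [R=301,L]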
Why the URL Hierarchy?
The quick and simple explanation is that while URLs are fairly dynamic these days, the bots like to see and understand how a website is organized on a server. Remember the old-school folder and file structure from back when sites were built in plain HTML? The URLs you have today should represent that organized file structure as much as possible. I cover this in my SEO Workshop (slide 23), but I also found a pretty good article that explains the hierarchy relatively simply and quickly.
The Process of Setting Up the 301s
I mention a bit about breaking the site up into sections for analytics purposes in my previous post "SEO Issues - is it Penguin? Is it Panda? or is it me?" under "Figuring out what was hit by Penguin". The "video" to the left is a quick (and very raw) animation to help explain exactly what we did. Now that the site is organized, it not only helps the bots understand the structure, but also helps us understand which sections bring in which SEO traffic in Google Analytics.
How Long Does It Take Google to See a New URL via a 301 Redirect?
This whole undertaking took place over the course of 2-3 months, starting in June 2012 (last year) and finishing up with the last of the redesigns and URL changes in August, plus one final directory change (no redesigned pages) in January of this year (2013). The most important of the old URLs are still showing 550,000 pages in Google's index (11 months later):
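Those counts come from Google's "site:" search. A rough sketch of the kind of queries involved, using the made-up example URLs from earlier - the exact patterns depend on how your old URLs are distinguishable from the new ones. The first checks whether one specific old URL is still indexed; the second estimates how many pages matching an old flat-filename pattern remain:

    site:domain.com/computer.html
    site:domain.com inurl:laptop-computer.html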
As I Googled to see if others had a solution for speeding up the recrawling (and removal) of these old URLs, or even if anyone had run into the same problem, I found a lot of questions in various forums (both reliable and unreliable) but no real articles, blog posts, or anything from reputable SEOs. The most common answer in the forums is to just "wait." It's, of course, what I tell others when they ask me: "Be patient. Google will eventually hit those pages again, recognize that they have changed, and correct the index." But after nearly a year and so many pages, this is getting ridiculous.
I spoke with my friend (and SEO mentor) Bruce Clay, who came back with the suggestion to build an XML sitemap of the old URLs we want removed and submit it to Google.
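A minimal sketch of what such a sitemap might look like, again using the made-up example URLs from above. Note the counterintuitive part: it lists only the old, redirected URLs, to coax Google into recrawling them and seeing the 301s:

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <!-- old, pre-migration URLs only: the goal is to get them
           recrawled so the 301s are seen, not to keep them indexed -->
      <url><loc>http://domain.com/computer.html</loc></url>
      <url><loc>http://domain.com/laptop-computer.html</loc></url>
      <url><loc>http://domain.com/500gb-laptop-computer.html</loc></url>
    </urlset>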
It kind of made sense: because those old URLs are no longer linked to, and there are so many of them, Google wasn't crawling them much anymore. They were just sitting there in the index - not getting "updated."
Unfortunately, getting a sitemap added is not an easy feat. I would have to define the strategy and present it to the powers that be with data to back up the success metrics in order to get the project prioritized. With so many other SEO initiatives in flight, all of which were more important and affected the business in a positive way, it was in my best interest to keep pushing those and not deal with the sitemap.
My workaround, though, was about as black hat as I would get (Matt Cutts, if you are reading this, I apologize and throw myself at your mercy, but it had to be done). One weekend over a month ago, I grabbed one of my many impulsively purchased domains and quickly set up hosting and an old-school HTML site that consisted of one page. I then exported all of the links from the Google "site:" search through a Firefox plugin called SEOquake, which exports the results into a CSV file. It's not the prettiest, and there was a lot of work still needed to get down to just the URLs, but it was the best solution I could find (note: if any SEO reading this knows of an easier way to do this, please add it to the comments for posterity). I then parsed out the parameters in the URLs in a separate document and used those as the anchor text for each URL. Finally, using Excel, I concatenated the URLs and parameters (now anchor text) into an HTML href string.
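For anyone retracing that last step, the concatenation was just an Excel formula along these lines - assuming the cleaned-up URL sits in column A and the parsed-out anchor text in column B; your columns will vary:

    =CONCATENATE("<a href=""",A2,""">",B2,"</a>")

Dragged down the sheet, that fills a "string" column with one finished link per row.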
Then, after copying and pasting the "string" column into the HTML, the page looked something like this:
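(What follows is a rough reconstruction using the made-up example URLs from earlier, not the actual page - the real thing was rows like these repeated thousands of times:)

    <html>
    <body>
    <!-- one pasted row per old URL; the real page had thousands -->
    <a href="http://domain.com/computer.html">computer</a>
    <a href="http://domain.com/laptop-computer.html">laptop computer</a>
    <a href="http://domain.com/500gb-laptop-computer.html">500gb laptop computer</a>
    </body>
    </html>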
The page wasn't the prettiest, and it had thousands of links (the above is just an example), so it was bad all around, but the point was to get those links crawled by Google.
Of course every SEO knows that you can't just build a website and expect it to immediately get crawled - right?
So I set it up in Google Webmaster Tools and submitted the page to the index:
I even got fancier: to ensure Google would see the page and crawl all of those old URLs, I +1'd it on Google+.
Did it work?
I checked the URLs this evening to see how many Google is still showing, and the number has dropped from 550,000 to only 175.
I took the domain off the server, parked it elsewhere (back where it belongs), and removed the Webmaster Tools account. All traces of it ever existing are now gone, and my small moment of trickery to get those URLs removed has passed.
Thanks for the Advice, Jenn - Now I'm Going to Try This!
If you have come across this post because you need to do something similar, I'm going to give you the same disclaimer they run when a very dangerous stunt is performed in commercials.
Do not attempt this at home - this stunt was performed by a trained professional on a closed course.
So don't go adding a bunch of links to a random domain just because your attempt from a few weeks ago to 301 pages doesn't seem to be working. There were far too many links on that external domain for a single page, and they were extremely spammy. In addition, the redirecting pages that were supposed to pass value to the new URLs now had many spammy links pointing at them from a very spammy domain. Left up too long, or done incorrectly, this could cause far more damage than it ever helps.
If you have any questions, or feel you need to try this same strategy, please don't hesitate to contact me. I'm here to help, and I want to make sure you have considered all possible options for your website before attempting any such trickery.
Some Helpful Links on This Very Subject:
A Year in the Life of 14 Million 301 Redirects - by Adam Sherk