Entering the Matrix: Robots.txt URL Sourcing

List diving and web scraping have definitely become art forms in recent years. I’m lucky enough to chat with Aaron Lintz once in a while and absorb some advanced techniques for unlocking the secrets of the Matrix. Usually, the stuff is too advanced for me the first time he explains it.

One day, while looking for directories and attendee lists, we got to talking about data extraction and some specific syntax. He pulled up this robots.txt trick I’d never seen before. When Lintz opened the robots file and started indexing entire websites, I thought the sentinels would seek us out and destroy us for ignoring the disallow rules. However, this is publicly shared information, so anyone can view the .txt files for company websites (without fear of vengeful gigantic robots). The outcome has been pretty eye-opening, particularly on the medical side of things.

 

Enter the Matrix of Robots.txt

So what the heck is robots.txt? It’s a plain-text file placed on many organizations’ websites to “block” or “disallow” automated bots, search engines, and web crawlers from accessing specific URLs. By reading the robots.txt file, you are looking for clues that point past the “disallow” rules. Typically, you can access a public sitemap (which is often an index of the entire website). Sometimes you can stumble upon something bigger. I’ve had some success (and a ton of fun) while looking for indexing trends or trying promising URL variations.

Keep in mind this type of method requires a lot of trial and error and is best used once you’ve exhausted most of the normal outlets. This is truly meant for the deep web dives.

 

Following the White Rabbit

Start with the main company website. This typically works best with larger organizations, but anything is worth a quick look:

Let’s index a hospital (based in NY).

Go to the main website: www.mskcc.org

Add /robots.txt to the tail end of the URL: https://www.mskcc.org/robots.txt
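This step can be sketched in a couple of lines of Python; `urljoin` from the standard library builds the robots.txt URL from any page on the site:

```python
from urllib.parse import urljoin

# A minimal sketch of the step above: appending /robots.txt to a site's root.
# urljoin with a leading-slash path replaces whatever path the page had.
page = "https://www.mskcc.org/some/deep/page"   # any page on the site
robots_url = urljoin(page, "/robots.txt")
print(robots_url)  # → https://www.mskcc.org/robots.txt
```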

[Screenshot: the robots.txt file for www.mskcc.org]

 

This pulls up a pretty expansive list of programming jargon (with a nice explanation of the robots file). Ignore most of this for now; we are looking for a sitemap, which serves as a table of contents for the website (and usually sits at the bottom of the text file).
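If you’d rather not scroll, a short Python sketch can scan the file for those Sitemap lines. The sample text below is invented for illustration, not the real mskcc.org file:

```python
# Scan a robots.txt file for "Sitemap:" lines. Sample text is made up.
robots_txt = """\
User-agent: *
Disallow: /search
Disallow: /print/
Sitemap: https://www.example.org/sitemap.xml
"""

def find_sitemaps(text):
    """Return the URL from every line that starts with 'Sitemap:'."""
    return [line.split(":", 1)[1].strip()
            for line in text.splitlines()
            if line.lower().startswith("sitemap:")]

print(find_sitemaps(robots_txt))  # → ['https://www.example.org/sitemap.xml']
```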

 

[Screenshot: the bottom of the robots.txt file, with the sitemap link highlighted]

 

The highlighted link is the sitemap. Plug this into your browser:  https://www.mskcc.org/sitemap.xml

 

[Screenshot: the sitemap.xml index listing the other sitemaps]

 

This sitemap leads to 31 other sitemaps. The bio clinical URL looks promising, as does the doctor file. Trial and error led us to this full list of clinical profiles:

https://www.mskcc.org/sitemap.xml?entity_type=node&type=bio_clinical
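For big indexes like this, a quick sketch with Python’s standard-library XML parser can list every child sitemap. The XML fragment below is a made-up example in the standard sitemap schema, not the real mskcc.org file:

```python
import xml.etree.ElementTree as ET

# List the <loc> URLs inside a sitemap index. The XML here is invented.
sitemap_xml = """<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://www.example.org/sitemap.xml?entity_type=node&amp;type=bio_clinical</loc></sitemap>
  <sitemap><loc>https://www.example.org/sitemap.xml?entity_type=node&amp;type=doctor</loc></sitemap>
</sitemapindex>
"""

ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(sitemap_xml)
child_sitemaps = [loc.text for loc in root.findall(".//sm:loc", ns)]
print(child_sitemaps)
```

The same pattern works on a leaf sitemap, since each page entry also lives in a `<loc>` tag.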

 

[Screenshot: the bio_clinical sitemap listing clinical profile URLs]

 

Click directly on a link, and it sends us straight to a Nurse Practitioner.

[Screenshot: a Nurse Practitioner’s profile page]

Now you can use some of the cross-referencing techniques here to dig in further.

 

Learning Kung Fu and Trendsetting

Again, not all of these searches are this clean and easy. Dead ends are common when messing around with this stuff, but you can identify certain trends when exploring sitemaps from similar organizations.

Sometimes you can try a certain URL segment and pull similar results.

For example, put this into your web browser: http://memorialhermann.org/robots.txt

[Screenshot: the robots.txt file for memorialhermann.org]

Notice the following line: Disallow: /doctors.htm

Like in the Matrix, you can bend the rules a bit. Bypass the “disallow” and substitute /doctors.htm for /robots.txt at the end of the URL:


http://www.memorialhermann.org/doctors.htm
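The same rule-bending can be scripted: the sketch below turns each Disallow path into a full URL to try in the browser. The robots.txt sample is invented — only the /doctors.htm line comes from the real file shown above:

```python
# Turn each Disallow rule into a candidate URL worth trying directly.
# Sample robots.txt text is illustrative; only /doctors.htm is from the article.
base = "http://www.memorialhermann.org"
robots_txt = """\
User-agent: *
Disallow: /doctors.htm
Disallow: /print/
"""

candidate_urls = [base + line.split(":", 1)[1].strip()
                  for line in robots_txt.splitlines()
                  if line.lower().startswith("disallow:")]
print(candidate_urls)
```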

[Screenshot: the doctors.htm page listing physician links]

We just found a Zion of MDs. All those glorious links with the doctors’ names and listed departments, just waiting to be sourced!

 

Speak URL, not Boolean, Neo

We can take it a step further from here. Say we need MDs based in emergency medicine. We search the matrix for “emergency medicine” and find a short list of potential candidates. Notice you’ll need to use a hyphen (-) between words in this search, since we are speaking URL, not Boolean.

Press Ctrl-F (to pull up your browser’s find bar) and search: emergency-medicine

[Screenshot: browser find results for emergency-medicine]
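The Ctrl-F step can also be done in code once you have a list of links — handy when the page has hundreds of them. The URLs below are hypothetical stand-ins, not real profiles:

```python
# Keep only the URLs containing the hyphenated keyword ("speaking URL").
# The URL list here is made up for illustration.
urls = [
    "http://www.example.org/doctors/jane-doe-emergency-medicine.htm",
    "http://www.example.org/doctors/john-smith-cardiology.htm",
    "http://www.example.org/doctors/amy-lee-emergency-medicine.htm",
]

keyword = "emergency-medicine"
matches = [u for u in urls if keyword in u]
print(matches)  # the two emergency-medicine profiles
```

Swapping the keyword (say, to pediatrics) reruns the whole “search” instantly, which is the trendsetting part.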

We found our candidate, and it looks like she has experience with pediatrics as well.

This can be much easier than accessing a company’s contact finder. You can see the full list of results and change the keywords quickly instead of working within a profile finder that may be more limiting or that may have some restrictions on total access.

Now you know a little URL kung fu. As you can see, it can land a big score if you look at a website through another doorway. I hope this helps. Happy hunting!

Greg Hawkes

As a Senior Recruiter focusing on sourcing in the Texas Medical Center, Greg Hawkes has worked as a Technical Recruiter and Sourcing Analyst for healthcare, engineering, biotechnology, manufacturing, and many other industries. He is an ongoing contributor to SourceCon, with topics ranging from site searches and CSEs to deep dives and URL sourcing. While preparing to speak at SourceCon 2017, he built the HRSourcingToolbox with free recruitment tools to share with the sourcing community. He is a huge fan of emerging technologies and Boolean syntax and is always willing to share a new trick or technique to find the elusive purple squirrel.