The Art of Building a Google Custom Search Engine – Part 3: Site Searches With a Side of URL Manipulation

In the first entry of this series, we went over some basic methods for building a Custom Search Engine (CSE) to search a specific site (like Manta.com). There are a few ways to build a CSE this way, but you may need to experiment with some Boolean syntax and manipulate a few URLs. Here’s a quick review of how we built the CSE for Manta:

Getting Started:

First, you’ll need to log into a Gmail account and go here to get started:

https://cse.google.com

Click on “New Search Engine” up at the top of the page.

Next, you’ll need to enter the site(s) you want to search. For instance, let’s build a CSE to search Manta.com. You have some options listed below, but I want to search the entire domain and the entire site of Manta…so add the following code shown in the screenshot. You’ll need to name your CSE as well (toward the bottom of the screen).

[Screenshot: cse1]

Click the “CREATE” Button and you’ll move to the next screen.

NOTE: Some sites require less fancy syntax. Just remember to start with *. and end with /* to look for everything on one specific site.

Building a Stronger CSE

Now let’s go over the alternative methods for doing this. When you create a new CSE, pay attention to this syntax (per the Google CSE notation):

You can add any of the following:

Individual pages: www.example.com/page.html

Entire site: www.mysite.com/*

Parts of site: www.example.com/docs/* or www.example.com/docs/

Entire domain: *.example.com

I’ll show why this is important in a minute. What I’ve found most effective in building some CSEs is finding a specific profile or publication listing of someone on the site. The best way to find this is with a string like the following:

(profile OR /pub OR user OR “cited by”)

I used pubpdf so you can see some of the URLs with /pub. This can be an indicator of public profiles or publications, so you may need to do some experimentation to locate the right wording based on industry.

“medicinal chemistry” (profile OR /pub OR user OR “cited by”) pubpdf
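If you test a lot of these strings, it can help to assemble them with a small script. The following is only a minimal sketch of how the string above could be put together in Python. The function names `or_group` and `build_query` are my own for illustration, and the output uses straight quotes rather than curly ones.

```python
def or_group(terms):
    """Join terms into a parenthesized OR group, quoting multi-word phrases."""
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    return "(" + " OR ".join(quoted) + ")"


def build_query(topic, indicators, site_hint):
    """Combine a topic, an OR group of profile indicators, and a site keyword."""
    topic_part = f'"{topic}"' if " " in topic else topic
    return f"{topic_part} {or_group(indicators)} {site_hint}"


# Same pieces as the example string: topic, indicator words, and the site keyword.
query = build_query(
    "medicinal chemistry",
    ["profile", "/pub", "user", "cited by"],
    "pubpdf",
)
print(query)
```

Swapping in different indicator words (for example "member" or "author") is then a one-line change, which makes the industry-by-industry experimentation a little faster.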

[Screenshot: cse2]

So this is actually a great example because it has two variations of URLs that may be useful. Look at the URL syntax, especially what comes after the main website: www.pubpdf.com

 

[Screenshot: cse3]

When you click on one of the hyperlinked names, it pulls up an “author” URL.

 

[Screenshot: cse4]

From this, we can determine that:

Publications are denoted by the following URL:

http://www.pubpdf.com/pub

Authors are identified by the following URL:

http://www.pubpdf.com/search/author
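Spotting these patterns by eye works fine for a handful of results. For a longer list, a short script can group URLs by the path that follows the domain. This is a rough sketch using Python's standard `urllib.parse`. The sample URLs are made up for illustration (they are not real pubpdf pages), and `path_prefix` is a hypothetical helper name.

```python
from collections import Counter
from urllib.parse import urlparse


def path_prefix(url, depth=1):
    """Return the first `depth` path segments of a URL, such as '/pub'."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    return "/" + "/".join(segments[:depth])


# Hypothetical result URLs, invented only to show the two URL shapes.
sample_urls = [
    "http://www.pubpdf.com/pub/12345/example-article",
    "http://www.pubpdf.com/pub/67890/another-article",
    "http://www.pubpdf.com/search/author/Jane+Doe",
]

# Count how often each top-level path shows up in the results.
counts = Counter(path_prefix(u) for u in sample_urls)
for prefix, n in counts.most_common():
    print(prefix, n)

# A deeper prefix separates author pages from other /search pages.
print(path_prefix(sample_urls[2], depth=2))
```

The prefixes that show up most often are good candidates to test under "Sites to Search".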

IMPORTANT NOTE: Sometimes Google “hides” the https:// or http:// before a website. This will be important information when you build the advanced CSE.

So take this information back to your CSE:

[Screenshot: cse5]

And test to make sure your searches work:

 

[Screenshot: cse6]

This is where you may have to do some additional testing and move syntax around. I’ve found that if the http or https method doesn’t work, *.website.com/* or site:www.website.com/* may do the trick. Test the different syntax combinations (especially the * and / suggestions listed above) until you get a results list. A single character change can be the key to unlocking the CSE.

So to sum up, try these methods under “Sites to Search” to target publications (in this case):

http://www.pubpdf.com/pub

*.www.pubpdf.com/pub/*

site:www.pubpdf.com/pub/*
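To save some typing when you try this on other sites, the three variations above can be generated from a host and a path. This is just a sketch: `candidate_patterns` is a name I made up, and it only builds the strings. You still have to paste each one into the CSE and test it yourself.

```python
def candidate_patterns(host, path):
    """Build the three 'Sites to Search' variations for a host and path."""
    path = "/" + path.strip("/")  # normalize so 'pub', '/pub' and '/pub/' all work
    return [
        f"http://{host}{path}",    # full URL with the protocol
        f"*.{host}{path}/*",       # wildcard on both ends
        f"site:{host}{path}/*",    # site: prefix with a trailing wildcard
    ]


candidates = candidate_patterns("www.pubpdf.com", "/pub")
for pattern in candidates:
    print(pattern)
```

Calling it with "/search/author" instead of "/pub" gives the matching set of strings to try for author pages.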

You can build CSEs in a similar way to target authors, users, members or profiles based on the site. You just have to understand the syntax of the URL and look for patterns.

It helps me to think about parts of a URL like operators in a Boolean string, so we are using https://, http://, *, /, and site: as a way to focus our searches in a CSE.

So there you have it. Hopefully, with these tips, you’ll be able to build one heck of a Custom Search Engine. Go out and create, and let us all know what you come up with!

 

Greg Hawkes is a Strategic Talent Sourcer, Speaker, Author, and Founder of the HRSourcingToolbox. He has worked as both a Technical Recruiter and Sourcing Analyst for healthcare, engineering, biotechnology, manufacturing and many other industries. He has been in the recruitment field for over 10 years, and got into heavy sourcing and headhunting back in 2012. He is an ongoing contributor to SourceCon – with topics ranging from Site Searches and CSEs, to Deep Dives and URL Sourcing. While preparing to speak at SourceCon 2017, he built the HRSourcingToolbox with a large inventory of Free Recruitment and Sourcing Tools. He has recently joined Houghton Mifflin Harcourt as a Strategic Sourcer and is loving every minute of it! He is a huge fan of emerging technologies and Boolean Syntax and is always willing to share a technique or hack to find the elusive purple squirrel.
