In the first entry of this series, we went over some basic methods for building a Custom Search Engine (CSE) to search a specific site (like Manta.com). There are a few ways to build a CSE this way, but you may need to experiment with some Boolean syntax and manipulate a few URLs. Here’s a quick review of how we built the CSE for Manta:
Getting Started:
First, you’ll need to log into a Gmail account and go here to get started:
Click on “New Search Engine” up at the top of the page.
Next, you’ll need to enter the site(s) you want to search. For instance, let’s build a CSE to search Manta.com. You have some options listed below, but I want to search the entire domain and the entire site of Manta, so add the code shown in the screenshot. You’ll need to name your CSE as well (toward the bottom of the screen).
Click the “CREATE” Button and you’ll move to the next screen.
NOTE: Some sites require less fancy syntax. Just remember to start with *. and end with /* to look for everything on one specific site.
Building a Stronger CSE
Now let’s go over the alternative methods for doing this. When you create a new CSE, pay attention to this syntax (per the Google CSE notation):
You can add any of the following:
Individual pages: www.example.com/page.html
Entire site: www.mysite.com/*
Parts of site: www.example.com/docs/* or www.example.com/docs/
Entire domain: *.example.com
I’ll show why this is important in a minute. What I’ve found most effective in building some CSEs is finding a specific profile or publication listing of someone on the site. The best way to find this is with a string like the following:
(profile OR /pub OR user OR “cited by”)
I used pubpdf so you can see some of the URLs with /pub. This can be an indicator of public profiles or publications, so you may need to do some experimentation to locate the right wording based on industry.
“medicinal chemistry” (profile OR /pub OR user OR “cited by”) pubpdf
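If you end up testing a lot of keyword and indicator combinations, it can help to assemble the string programmatically. Here is a minimal sketch in Python; the build_query helper is hypothetical (it is not part of any Google tool), and it simply reproduces the string above using straight quotes:

```python
def build_query(phrase, indicators, site_keyword):
    """Compose a search string: phrase, OR-group of indicators, site keyword.

    Hypothetical helper for illustration only.
    """
    def quote(term):
        # Quote multi-word terms so they are searched as an exact phrase.
        return f'"{term}"' if " " in term else term

    group = " OR ".join(quote(term) for term in indicators)
    return f"{quote(phrase)} ({group}) {site_keyword}"


query = build_query(
    "medicinal chemistry",
    ["profile", "/pub", "user", "cited by"],
    "pubpdf",
)
print(query)
```

Swapping in a different phrase or a different list of indicators gives you a new string to paste into the search box, which makes the industry-by-industry experimentation a little faster.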
This search is actually a great example because its results show two variations of URLs that may be useful. Look at the URL syntax, especially what comes after the main website: www.pubpdf.com
When you click on one of the hyperlinked names, it pulls up an “author” URL.
From this, we can determine that:
Publications are denoted by the following URL:
http://www.pubpdf.com/pub
Authors are identified by the following URL:
http://www.pubpdf.com/search/author
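To spot these patterns across a longer list of result URLs, you could sort them by path prefix. The sketch below is one way to do that, assuming the two prefixes noted above; the classify function and the sample URLs are made up for illustration and are not real pages:

```python
from urllib.parse import urlparse

# Path prefixes observed on pubpdf.com above; other sites will use different ones.
PATH_KINDS = {
    "/search/author": "author",
    "/pub": "publication",
}


def classify(url):
    """Return the kind of page a URL points to, based on its path prefix."""
    path = urlparse(url).path
    for prefix, kind in PATH_KINDS.items():
        # Match the prefix itself or anything nested beneath it.
        if path == prefix or path.startswith(prefix + "/"):
            return kind
    return "other"


# Hypothetical URLs, shaped like the ones described in the article.
samples = [
    "http://www.pubpdf.com/pub/12345/example-title",
    "http://www.pubpdf.com/search/author/Jane+Doe",
    "http://www.pubpdf.com/",
]
kinds = [classify(url) for url in samples]
print(kinds)
```

Once you know which prefix marks the pages you care about, that prefix is what you carry back into the CSE.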
IMPORTANT NOTE: Sometimes Google “hides” the https:// or http:// before a website. This will be important information when you build the advanced CSE.
So take this information back to your CSE:
And test to make sure your searches work:
This is where you may have to do some additional testing and move syntax around. I’ve found that if the http or https method doesn’t work, *.website.com/* or site:www.website.com/* may do the trick. Test the different syntax combinations (especially the * and / suggestions listed above) until you get a results list. A single character change can be the key to unlocking the CSE.
So to sum up, try these methods under “Sites to Search” to target publications (in this case):
http://www.pubpdf.com/pub
*.www.pubpdf.com/pub/*
site:www.pubpdf.com/pub/*
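If you are testing several sites, a short script can write out these three variations for you. This is only a sketch: candidate_patterns is a hypothetical helper, and it does nothing more than rearrange the pieces of the URL into the three forms listed above. You still need to paste each one into the CSE and see which returns results.

```python
from urllib.parse import urlparse


def candidate_patterns(url):
    """Build the three 'Sites to Search' variations for a URL.

    Hypothetical helper: it only formats strings and does not contact Google.
    """
    parsed = urlparse(url)
    host = parsed.netloc
    path = parsed.path.rstrip("/")  # avoid a doubled slash before the wildcard
    return [
        f"{parsed.scheme}://{host}{path}",  # plain URL with its scheme
        f"*.{host}{path}/*",                # leading and trailing wildcards
        f"site:{host}{path}/*",             # site: prefix with trailing wildcard
    ]


patterns = candidate_patterns("http://www.pubpdf.com/pub")
for pattern in patterns:
    print(pattern)
```

Passing the author URL (http://www.pubpdf.com/search/author) instead would give you the same three variations for targeting authors.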
You can build CSEs in a similar way to target authors, users, members or profiles based on the site. You just have to understand the syntax of the URL and look for patterns.
It helps me to think about parts of a URL like operators in a Boolean string, so we are using https://, http://, *, /, and site: as a way to focus our searches in a CSE.
So there you have it. Hopefully, with these tips, you’ll be able to build one heck of a Custom Search Engine. Go out and create, and let us all know what you come up with!