Advertisement

Huge List of Google CSEs = Simple Data Cleaning

Article main image
May 26, 2020

You’re likely familiar with frequent SourceCon contributor Dean Da Costa‘s list of Google Custom Search Engines, CSE Utopia, containing almost 1,000 CSEs. While researching to help answer a Google CSE-related question on a sourcing-related Facebook group, I happened upon a similarly large list of CSEs or if you have trouble accessing, view the cached version. I don’t think it’s recently updated, since it includes a few of the early CSEs created by past SourceCon Grandmaster, Irina Shamaeva but not all CSEs she’s shared publicly. But it’s still useful to illustrate some simple but versatile data cleaning methods you’ll find handy for other large lists that arise in sourcing.

Data cleaning mini-lesson using Microsoft Office

The main problems preventing this SimpleNote list from being useful to sourcers are: 1) the URLs for the CSEs are not hyperlinked, 2) each line begins with sequential numbers that appear to correspond to lines of code when the author generated the list, and 3) sometimes one line contains two different CSEs.  The good news is, all these issues are easily fixable with Microsoft Word and Excel (even if you have the older Office 2013).  And there’s no VBA programming or special add-ins required:

    1. Go to the SimpleNote page (or cached URL) above, then select all (Ctrl+A, or Cmd+A on iOS), and copy (Ctrl+C).
    2. Open a new blank Microsoft Word document and do paste special (Alt+Ctrl+V) to paste the plain text values, not the default paste (Ctrl+V). This is a good practice in general, because sometimes you’ll copy over hidden special characters in regular paste that can make your data cleanup harder. If you press Ctrl+* (that’s Ctrl+Shift+8 on most keyboards), you will toggle Show/hide hidden symbols mode, revealing paragraph symbols at the end of lines and a dot where spaces would normally appear. The comparison screenshots below show the difference in hidden symbols when you do regular paste vs. paste special.
    3. Note that each row begins with a line number, followed by a space, that we don’t need. The longest versions are 3-digit numbers, so we’ll replace those first. Using Word’s replace function, click in the Find What field, then click the Special button at the bottom of the Replace popup window, which reveals a menu (see screenshot below).
    4. What immediately precedes each replacement instance is the line return on the previous line, so you will want to select Paragraph, which inserts ^p (you can also do this manually by typing the carat symbol, created by pressing Shift with number 6 key, and then lowercase p). Then follow it by selecting menu value Any Number, which inserts ^# (carat and number sign), and do it 2 more times, because we’re looking to replace 3 consecutive digits, and finally select White Space, or ^w. After that is the actual URL for the CSE, so we’ll leave that alone. In the Replace with field, enter ^p (to preserve the line return from the previous line that we used as our starting marker point).
    5. Your Find and Replace popup should look like the right-hand side of the screenshot below. Then press the Replace All button and watch the magic (should look like left-hand side of screenshot)!
    6. Now that we know that works, let’s replace all the lines that begin with a number from 10 to 99. Same steps as above, except it’ll be ^p^#^#^w (just 2 Any Digit replacements in the middle) in the Find what field. Note that if something goes wrong, you can always press Word’s Undo button (or Ctrl+Z) to go back a step, or re-copy the original data into a new Word doc and repeat all steps up to here.
    7. Finally, you could do it for rows numbered 7, 8 and 9 with just one ^# in the middle, but for only 3 rows, it’s probably just as fast to fix manually. You may have noticed after doing each of the preceding two mass-replacements that the number in the confirmation message didn’t match the expected number of rows (e.g., for the 10-99 replacement, it says “All done. We made 70 replacements”). So what’s going on? Like any good sourcer would, do a visual scan of your data and look for exceptions to the pattern. Did you find them? (The screenshot for step 5 illustrated one of the remaining problem types, though for this list, you might have noticed the issues more readily by scanning the original web page.)
    8. But let’s say you didn’t notice anything, and you thought you were done with this part of the process. Next you would want to copy everything into Excel or a Google Sheet, because it’s easier to add hyperlinks there, as well as manage big lists to share with your team. So copy all the doc’s content (Ctrl+A, then Ctrl+C) into a blank Excel file (again, as a habit I’d recommend using Paste Special, which like all MS Office apps, Excel supports, but not necessary in this case because we’re already working with clean plain-text data). I bet you noticed the problems now!
    9. Fortunately, the pattern exceptions appear to fall under only 2 categories: (a) instances where 2 different CSEs were on the same line (whoever created the original list missed inserting line returns after each CSE name in some cases) and (b) some of the original attempts to grab the CSE must have yielded no data, because we have instances of “noopener noreferrer”.
    10. To fix the first type, let’s go back to the Word doc content. Upon scanning, we notice the remaining instances are: the name of the preceding CSE followed by a space, then the number we don’t like, then another space, and finally the CSE’s URL (that we want to reside on its own line).
    11. This can be addressed with a Find What of ^w^#^#^#^w (i.e., one white space, 3 digits corresponding to any row numbers between 100 and 999, one white space) and a Replace with ^p (to make sure what’s next starts on its own line). I get a confirmation of over 200 replacement fixes; looks good.
    12. You can repeat this with just ^#^# in the middle for rows 10-99 but the danger is that maybe some CSEs have a number as part of their name (e.g., “top 50” whatever) so you may inadvertently create a new problem with your fix. This is something that data wranglers (the fancy title for people who clean large datasets as a main part of their job) have to deal with. Fortunately, the time savings benefit of this replacement fix far outweighs the risk, so go ahead! (Spoiler alert: you will only create a couple of problem instances across the whole list with this 2-digit replacement, out of only 5 rows in total that you’ll need to manually “scrub” later.) However, it is a good reminder that it’s worth reviewing a healthy subset of your data before embarking on a given fix to anticipate how much of a different problem it might create.
    13. Now let’s go back to the Word doc content and look carefully at instances of the second exception type. If you repeatedly do a Find (Ctrl+F) for noopener noreferrer, you will see all instances have Done on the previous line, and there’s a paragraph after noreferrer.
    14. So this time, the Find what will be Done^pnoopener^wnoreferrer^p (i.e., you’re looking for Done, a Paragraph mark, the text noopener, a space, the text noreferrer, and a Paragraph mark). The Replace field should be blank, because you want to replace every Find what instance with nothing!
    15. When you click Replace All, you should get a confirmation of 27 replacement fixes. Again, if something seems wrong, hit Undo (Ctrl+Z).
    16. Now it’s time to copy all the doc’s content (Ctrl+A, then Ctrl+C) into a blank Excel file (again, as a habit I’d recommend using Paste Special, which like all MS Office apps, Excel supports, but it’s not necessary in this case because we’re already working with plain-text data).
    17. The next important step is to separate the URLs from the names of the CSEs. Fortunately every line starts with the URL, then there’s a space, and the remainder of the line is the CSE name. The formula for capturing up to the first space in a cell is =LEFT(A1,FIND(” “,A1,1)-1) so (assuming you pasted your data starting in cell A1; if not, adjust formula accordingly) you would put that formula in cell B1.
    18. And to grab the right-hand part after the space, put the formula =RIGHT(A1,LEN(A1)-FIND(” “,A1,1)) in cell C1.
    19. To hyperlink the text URL in cell B1, add the HYPERLINK command and surround what was in the formula before with parentheses. It should end up as: =HYPERLINK(LEFT(A1,FIND(” “,A1,1)-1))
    20. You can now select cells B1 and C1 together to highlight them, then hold/drag the bottom rightmost corner of cell C1 all the way down the column, and the formulas will auto-adjust accordingly in columns B and C. This is called filling a formula down into adjacent cells, explained here. You should now have a set of clickable hyperlinks in column B and CSE names in column C!
    21. After you fix the few non-standard rows that remain manually (easy to spot at this point), you may have noticed that there are duplicates within the list! That is also easily fixed in Excel following the steps here. You should end up with slightly over 500 unique rows.
    22. Finally, if you prefer sharing your content in Google Sheets, in the Excel file, just select all (though you really only need columns B-C), copy, then paste into a new Google sheet! If you don’t need the formulas, open a new Google sheet, and in the Edit menu, select Paste Special –> Paste Values only.

If you want additional methods and examples to do complex filtering using Microsoft Word (it’s an equivalent to Regular Expressions), read this post.

Bonus: How to find more CSEs

To find additional CSEs, you can google any of the following (replace LinkedIn with a topical keyword related to the type of CSE you seek). Don’t forget to click the “you can repeat the search with the omitted results included” hyperlink at the bottom of the initial results page:

  1. This is Dean Da Costa’s recommended method (focusing on links that are meant to be public CSEs):
    domain:cse.google.com/cse/publicurl LinkedIn
  2. This gets you many more results, but might also add additional false positives to filter through:
    (domain:cse.google.com/cse/publicurl OR “cse.google.com/cse?cx”) LinkedIn
  3. If you are using more narrow keywords than LinkedIn (e.g., financial example below) and you don’t get enough results, you can loosen it up a bit as follows:
    “sovereign wealth fund” (domain:cse.google.com/cse/publicurl OR “cse.google.com/cse?cx” OR “google cse” OR “google custom”)
  4. You want the mega-list? This vendor (https://trends.builtwith.com/websitelist/Google-Custom-Search) claims to have compiled a list of almost every web domain that uses Google CSEs (over 552,000 as of this writing!). Unfortunately, it only displays the first 50 sites for free.
Get articles like this
in your inbox
Subscribe to our mailing list and get interesting articles about talent acquisition emailed weekly!
Advertisement