Automated scraper to gather email addresses

This is a twofold post: about how we can leverage programming, scripting, and other tools in our tool belt to automate tasks and make IT work look easy (although it isn't always easy), and about IT security.

  1. Demonstrate an automated method to download and parse email addresses from a web page
  2. Security…you need to protect your email addresses if you post them on a web page

I had a client who needed to gather email addresses from a public website (which will remain unnamed).  The first thing to do was visit a few pages: I visited the home page, then poked around at their sites.  Each entity/division had a different URL, but the structure was essentially the same.

Example:  http://[entity].[parent-company-here].com/directory

Microsoft uses "contoso.com" as the example domain in its documentation, so in this example "http://wi.contoso.com" would be the parent company's (contoso's) entity in Wisconsin.  There were about 16 URLs and entities in total, each with an employee/staff directory at the same "/directory" path that held the staff listing.

We can now program against this.  I used a tool called "wget" from GNU.  wget lets you programmatically "get" or download files, among other things.  I "got" one directory page (an HTML file with code in it) as an example, and its contents showed me the line I needed to focus on, the line with the "mailto:someone@[parent-company-here].com":

<td><span style="font-size: 13px;"><a href="mailto:someone@[parent-company-here].com">someone@[parent-company-here].com</a></span></td>
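
As a sketch of that first download step, here is the sort of wget call involved ("wi.contoso.com" is the stand-in example subdomain from above):

  # Fetch one entity's staff directory page and save it as directory.txt
  wget -q -O directory.txt "http://wi.contoso.com/directory"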

What we did then was this:

  1. Create a file with just the URLs of each site.  Remember, they were all similar (http://xx.parent-company-here.com/directory), so all we needed was a list of the 16 URLs, one per line, in the file we read from.  Our script iterates through each line and grabs the data we need.
  2. Parse the file, chop out the email address, and dump it to a new file with just email addresses in it.  We accomplished this with a pipeline built around the "cat" command:

     cat directory.txt | grep mailto | awk -F 'mailto:' '{print $2}' | cut -d '"' -f 1 | sort -u >> email.dump.txt

     After wget dumps each URL's page into "directory.txt", we read that file with cat, use the "grep" command to keep only the lines containing "mailto", then use "awk" to print only the data after the "mailto:" delimiter, so our original line gets chopped down to this:  someone@[parent-company-here].com">someone@[parent-company-here].com</a></span></td>.  We then use the "cut" command with the quote mark (") as the delimiter (that's the -d option, and it's hard to see) to keep only the first field (the -f 1), meaning everything before that first quote mark, which is just the email address.  The "sort -u" sorts the results and drops duplicates, and the double greater-than symbol ">>" means to keep writing the data (append to the file).  The full loop is sketched just after this list.
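Putting the two steps together, here is a minimal sketch of the whole script.  The URL list file name ("urls.txt") is my own placeholder, since the post doesn't name the file, and the contoso subdomains are stand-ins:

  #!/bin/bash
  # urls.txt holds one staff-directory URL per line, e.g.:
  #   http://wi.contoso.com/directory
  #   http://mn.contoso.com/directory
  while read -r url; do
      # Download this entity's directory page
      wget -q -O directory.txt "$url"
      # Keep only the mailto lines, strip the surrounding HTML,
      # and append the bare addresses to the dump file
      cat directory.txt | grep mailto | awk -F 'mailto:' '{print $2}' \
          | cut -d '"' -f 1 >> email.dump.txt
  done < urls.txt
  # Sort the combined list and drop duplicates
  sort -u email.dump.txt -o email.dump.txt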

The result is a clean file with only email addresses in it (over 600 addresses from 16 different websites)!  We can now run this script anytime we want, upload the list somewhere, or even script against it to send email or do any number of other things with it.
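
For example, if the list needed to stay fresh, the script could even be scheduled; a cron entry along these lines (the script path here is hypothetical) would re-run the scrape every Monday morning:

  # In crontab -e: run the scraper at 6:00 AM each Monday
  0 6 * * 1 /home/user/scripts/scrape-emails.sh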

Finally, to answer part 2 of this post…you now know not to use unprotected "mailto:" links on your website!  They can be gathered by hackers (like me) and exploited or abused (not like me).  Once that horse has left the barn…well, good luck getting it back into the barn.  There are ways to protect mailto links on your website, which we typically implement using JavaScript that obscures the email address from bots, scrapers, scammers, and simple scripts like mine.  We're certainly glad theirs weren't protected!
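
As one illustration of that kind of protection (a minimal sketch, not any particular site's implementation), the address can be split into pieces and assembled by JavaScript when the page loads, so the plain "mailto:" string never appears in the HTML that a downloader like wget sees:

  <span id="contact"></span>
  <script>
    // Assemble the address at runtime; "someone" and "contoso.com"
    // are stand-ins for the real mailbox and domain
    var user = "someone";
    var domain = "contoso.com";
    var addr = user + "@" + domain;
    document.getElementById("contact").innerHTML =
      '<a href="mailto:' + addr + '">' + addr + '</a>';
  </script>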

If you need automation, programming help, a second look at your systems and processes, process improvement, or help bringing LEAN principles into your IT environment, contact us!  We can help!

800-864-9497

Comments or questions are welcome.
