Friday 16 October 2015

Building A Web Scraper With Java

This guide is for those interested in learning how to build a basic web scraper that can download and parse HTML pages. The reader is expected to be familiar with Java and an IDE. Everything in this guide was run against my own website only; please note that it can be an offence to run a scraper against anyone else's website without permission.

The tools I am using are the Java JDK 1.7.0_79, Eclipse Mars and, lastly, Jsoup.jar (a library for parsing HTML documents).

Step 1: Create Project

Open the IDE, select the File menu, then New and Java Project.



Click Finish.

Step 2: Download Jsoup Library
URL: http://jsoup.org/download (jsoup-1.8.3.jar)
Once the Jsoup library has been downloaded, we need to add it to the project's build path as a dependency.

Right-click the project folder, select Build Path, then Configure Build Path.

Make sure the dialog that opens is showing the Libraries tab and not one of the others.



Select the option to add an external JAR (Add External JARs). Once the file dialog opens, find the jar you downloaded and click OK.

Step 3: Writing Scraper Code

To write the code, create a new Java class. It can be called anything; mine is called RunScraper. Add the following main method to the class. This will be the entry point.

public static void main(String[] args) {

}

Code Snippet:



I will paste the rest of the code, which is fairly self-explanatory. Most of the work is done by the Jsoup library: once we have connected to and downloaded a specific page, Jsoup creates a Document object that lets us access the HTML DOM structure much as we would in the browser, with DOM-style methods like those in JavaScript or with CSS-style selectors.
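As a rough illustration of the flow described above, here is a minimal sketch of a RunScraper class. It is not necessarily the exact code used for this post. The extractLinks helper and the http://example.com/ address are placeholders of my own for illustration; swap in the address of a site you own. It assumes jsoup-1.8.3.jar is on the build path as set up in Step 2.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class RunScraper {

    // Hypothetical helper: collects the absolute URL of every <a href="...">
    // element found in an already parsed page.
    static List<String> extractLinks(Document doc) {
        List<String> links = new ArrayList<String>();
        // select() takes a CSS-style selector, here "anchors with an href attribute".
        for (Element link : doc.select("a[href]")) {
            // absUrl() resolves relative links against the page's base URI.
            links.add(link.absUrl("href"));
        }
        return links;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder address (an assumption): pass your own site as the first argument.
        String url = args.length > 0 ? args[0] : "http://example.com/";

        // Connect, download the page and parse it into a Document.
        Document doc = Jsoup.connect(url).get();

        System.out.println("Title: " + doc.title());
        for (String href : extractLinks(doc)) {
            System.out.println(href);
        }
    }
}
```

Keeping the link extraction in its own method means it can also be tried on a page parsed from a string with Jsoup.parse(html, baseUri), without touching the network.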

This tutorial covers only very basic usage of Jsoup and web scraping. However, it can be scaled up quite a bit by incorporating threads, a database and some more code.
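To give an idea of the threading part, here is a small sketch that downloads several pages at once using a fixed thread pool from the standard library. The ParallelScraper class, the PageFetcher interface and the fetchAll method are names I made up for illustration; in a real scraper the fetcher would presumably wrap the Jsoup call, while the demo below uses a stand-in that only returns a string, so it runs without any network access.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelScraper {

    // Hypothetical hook: turns a URL into page content (for example via Jsoup).
    interface PageFetcher {
        String fetch(String url) throws Exception;
    }

    // Fetches every URL on a pool of worker threads and returns the results
    // keyed by URL, in the same order as the input list.
    static Map<String, String> fetchAll(List<String> urls, final PageFetcher fetcher, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit one task per URL; each Future will hold that page's result.
            List<Future<String>> pending = new ArrayList<Future<String>>();
            for (final String url : urls) {
                pending.add(pool.submit(new Callable<String>() {
                    public String call() throws Exception {
                        return fetcher.fetch(url);
                    }
                }));
            }
            // Wait for each task in turn and collect its result.
            Map<String, String> results = new LinkedHashMap<String, String>();
            for (int i = 0; i < urls.size(); i++) {
                try {
                    results.put(urls.get(i), pending.get(i).get());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while waiting for " + urls.get(i), e);
                } catch (ExecutionException e) {
                    throw new RuntimeException("Failed to fetch " + urls.get(i), e.getCause());
                }
            }
            return results;
        } finally {
            // Always release the worker threads, even if a fetch failed.
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Placeholder addresses and a stand-in fetcher (both assumptions).
        List<String> urls = Arrays.asList("http://example.com/a", "http://example.com/b");
        Map<String, String> pages = fetchAll(urls, new PageFetcher() {
            public String fetch(String url) {
                return "content of " + url;
            }
        }, 2);
        for (Map.Entry<String, String> page : pages.entrySet()) {
            System.out.println(page.getKey() + " -> " + page.getValue());
        }
    }
}
```

A fixed pool keeps the number of simultaneous connections bounded, which matters when every request hits the same website.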