The tools I am using are Java JDK 1.7.0_79, Eclipse Mars, and Jsoup (jsoup.jar, a library for parsing HTML documents).
Step 1: Create Project
Open the IDE and select File > New > Java Project.
Give the project a name and click Finish.
Step 2: Download Jsoup Library
Download URL: http://jsoup.org/download (jsoup-1.8.3.jar)
Once the Jsoup library is downloaded, it has to be added to the project's build path as a dependency.
Right-click the project folder and select Build Path > Configure Build Path.
Make sure the dialog that opens is showing the Libraries tab and not one of the others.
Click the option to add an external JAR. When the file dialog opens, find the jar you downloaded, select it, and click OK.
Step 3: Writing Scraper Code
To write the code, create a new Java class. It can be called anything; mine is called RunScraper. Add the following method to your class:
public static void main(String[] args) {
}
This is the entry point of the program.
Code Snippet:
The rest of the code I will paste; it is pretty self-explanatory. Most of the work is done by the Jsoup library: once we have connected to a page and downloaded it, Jsoup builds a Document object that lets us query the HTML DOM structure much as we would in the browser with JavaScript or CSS selectors.
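As a rough illustration of that flow, here is a minimal sketch of what a RunScraper class could look like. It is not necessarily the exact code used for this tutorial. The target URL (http://example.com/), the user agent string, the 10 second timeout, and the extractLinks helper are all assumptions made for the example; swap in the page and the selectors you actually need.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class RunScraper {

    // Hypothetical helper: collects "link text -> absolute URL" for every
    // anchor with an href attribute in an already parsed page.
    static List<String> extractLinks(Document doc) {
        List<String> links = new ArrayList<String>();
        // CSS selector syntax, the same idea as querying the DOM in a browser.
        Elements anchors = doc.select("a[href]");
        for (Element anchor : anchors) {
            // absUrl resolves relative hrefs against the document's base URI.
            links.add(anchor.text() + " -> " + anchor.absUrl("href"));
        }
        return links;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: placeholder URL, overridden by the first argument if given.
        String url = args.length > 0 ? args[0] : "http://example.com/";

        // Connect, download the page and parse it into a Document.
        Document doc = Jsoup.connect(url)
                .userAgent("Mozilla/5.0") // assumption: a generic user agent
                .timeout(10000)           // milliseconds
                .get();

        System.out.println("Title: " + doc.title());
        for (String link : extractLinks(doc)) {
            System.out.println(link);
        }
    }
}
```

Keeping the extraction in its own method means it can also be tried offline on a string of HTML via Jsoup.parse(html, baseUri), without touching the network.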
This tutorial covers only very basic usage of Jsoup for scraping the web; however, it can be scaled up quite a bit by incorporating threads, a database, and some more code.
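To give an idea of the threading part, here is one possible sketch using a fixed thread pool from java.util.concurrent. The ParallelScraper class, the PageFetcher interface, and the fetchAll method are hypothetical names invented for this example. The fetcher in main is a stub that does no network access; in a real scraper it would wrap a Jsoup call such as Jsoup.connect(url).get().title().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelScraper {

    // Hypothetical hook for fetching one page. A real implementation
    // would download and parse the page with Jsoup.
    interface PageFetcher {
        String fetchTitle(String url) throws Exception;
    }

    // Fetches all URLs on a fixed-size pool and returns the results in
    // the same order as the input list. Failures become "ERROR: ..." entries.
    static List<String> fetchAll(List<String> urls, final PageFetcher fetcher, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<Future<String>>();
            for (final String url : urls) {
                futures.add(pool.submit(new Callable<String>() {
                    @Override
                    public String call() throws Exception {
                        return fetcher.fetchTitle(url);
                    }
                }));
            }
            List<String> results = new ArrayList<String>();
            for (Future<String> future : futures) {
                try {
                    results.add(future.get()); // blocks until that task is done
                } catch (ExecutionException e) {
                    results.add("ERROR: " + e.getCause().getMessage());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve interrupt status
                    results.add("INTERRUPTED");
                }
            }
            return results;
        } finally {
            pool.shutdown(); // stop accepting work so the JVM can exit
        }
    }

    public static void main(String[] args) {
        // Stub fetcher: stands in for a Jsoup download in this sketch.
        PageFetcher stub = new PageFetcher() {
            @Override
            public String fetchTitle(String url) {
                return "title of " + url;
            }
        };
        List<String> urls = Arrays.asList("http://example.com/a", "http://example.com/b");
        for (String line : fetchAll(urls, stub, 2)) {
            System.out.println(line);
        }
    }
}
```

A small fixed pool keeps the number of simultaneous requests bounded, which is usually kinder to the site being scraped than starting one thread per URL.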