Web scraping using Jsoup (Java)

Web scraping is the extraction of data from websites, and Jsoup is a popular tool for doing it conveniently. It is an open-source Java library designed to parse, extract, and manipulate data stored in HTML documents.

There are lots of use cases. For example, you may be looking for a new apartment to rent or monitoring discounts in an e-commerce store. If the website offers no way to subscribe to newly added records, checking it by hand for changes is inconvenient. One solution is to implement a scraper that extracts the needed information and to run it on a schedule (using cron jobs or another scheduler).
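If you would rather schedule from inside the JVM than with cron, the regular execution can be sketched with the JDK's `ScheduledExecutorService`. This is only one possible option; the class `ScheduledScrapeSketch`, the helpers `scheduleEvery` and `runDemo`, and the placeholder job are hypothetical names, and the short interval is there only so the demo finishes quickly.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class ScheduledScrapeSketch {
    /** Runs the job immediately and then repeatedly with the given period. */
    static ScheduledExecutorService scheduleEvery(Runnable job, long period, TimeUnit unit) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            try {
                job.run();
            } catch (RuntimeException e) {
                // An uncaught exception would cancel all future runs, so log and continue.
                System.err.println("Scrape failed: " + e.getMessage());
            }
        }, 0, period, unit);
        return scheduler;
    }

    /** Demo helper: schedules a placeholder job, waits, stops, and returns the run count. */
    static int runDemo(long periodMillis, long durationMillis) {
        AtomicInteger runs = new AtomicInteger();
        // Placeholder job (an assumption); a real job would call the scraper here.
        ScheduledExecutorService scheduler =
                scheduleEvery(runs::incrementAndGet, periodMillis, TimeUnit.MILLISECONDS);
        try {
            Thread.sleep(durationMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            scheduler.shutdownNow();
        }
        return runs.get();
    }

    public static void main(String[] args) {
        System.out.println("Scrape runs completed: " + runDemo(200, 1000));
    }
}
```

In a real setup the period would more likely be hours than milliseconds, and the scheduler would stay alive for as long as the application runs.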

For a developer, code often says more than words. So, let’s define a problem and solve it using Jsoup.

Problem: extract information about daily deals on eBay

  • Add Maven dependency for Jsoup
 <dependency>
     <groupId>org.jsoup</groupId>
     <artifactId>jsoup</artifactId>
     <version>1.7.2</version>
 </dependency>
  • Investigate the HTML structure of a website

Basically, web scraping consists of two main parts: parsing an HTML document and querying its structure. This means we need to inspect the structure of the website and find the required class names, tags, and attributes. After that, we can prepare query selectors and start writing the scraper.
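
The two parts, parsing and querying, can be sketched on an inline HTML string, so no network access is needed. The markup, the class names, and the base URL `https://example.com/` below are made up for illustration and are not eBay's real structure; the Jsoup calls used (`Jsoup.parse`, `select`, `first`, `attr`, `absUrl`, `text`) are part of the library.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class SelectorSketch {
    // Hypothetical markup, invented for this example; a real page will differ.
    static final String SAMPLE_HTML =
            "<div class='item'>"
            + "<h3 class='item-title' title='Sample watch'><a href='/itm/1'>Sample watch</a></h3>"
            + "<span class='item-price'>US $9.99</span>"
            + "</div>";

    /** Returns {title, absolute link, price} of the first item in the given HTML. */
    static String[] extractFirstItem(String html) {
        // Part 1: parse the HTML into a document tree.
        // The base URI (an assumption here) lets Jsoup resolve relative links.
        Document doc = Jsoup.parse(html, "https://example.com/");

        // Part 2: query the tree with CSS selectors.
        // first() returns null when nothing matches; the sample markup always matches.
        Element title = doc.select(".item .item-title").first();
        Element link = doc.select(".item .item-title a").first();
        Element price = doc.select(".item .item-price").first();

        return new String[] {title.attr("title"), link.absUrl("href"), price.text()};
    }

    public static void main(String[] args) {
        for (String field : extractFirstItem(SAMPLE_HTML)) {
            System.out.println(field);
        }
    }
}
```

Trying selectors against a saved HTML snippet like this is a cheap way to check them before pointing the scraper at the live site.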

  • Implement a scraper
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class JsoupScrapper {
    private static final String EBAY_GLOBAL_DEALS_URL = "https://www.ebay.com/globaldeals";
    
    private static final String PRODUCT_CARD_CLASS = "dne-itemtile-detail";
    private static final String PRODUCT_TITLE_CLASS = "dne-itemtile-title";
    private static final String PRODUCT_LINK_SELECTOR = ".dne-itemtile-title a";
    private static final String PRODUCT_PRICE_SELECTOR = ".dne-itemtile-price .first";

    // Static nested class: a Product does not need a reference to the enclosing scraper
    static class Product {
        private String name;
        private String link;
        private String formattedPrice;

        public String getName() {
            return name;
        }

        public void setName(String name) {
            this.name = name;
        }

        public String getLink() {
            return link;
        }

        public void setLink(String link) {
            this.link = link;
        }
        
        public String getFormattedPrice() {
            return formattedPrice;
        }

        public void setFormattedPrice(String formattedPrice) {
            this.formattedPrice = formattedPrice;
        }
    }
    
    public List<Product> extractProducts() {
        List<Product> products = new ArrayList<>();
        
        Document doc;
        try {
            doc = Jsoup.connect(EBAY_GLOBAL_DEALS_URL).get();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        
        Elements productElements = doc.getElementsByClass(PRODUCT_CARD_CLASS);
        for (Element productElement : productElements) {
            Product product = new Product();
            Elements titleElements = productElement.getElementsByClass(PRODUCT_TITLE_CLASS);
            if (!titleElements.isEmpty()) {
                product.setName(titleElements.get(0).attr("title"));
            }
            Elements linkElements = productElement.select(PRODUCT_LINK_SELECTOR);
            if (!linkElements.isEmpty()) {
                product.setLink(linkElements.get(0).attr("href"));
            }
            Elements priceElements = productElement.select(PRODUCT_PRICE_SELECTOR);
            if (!priceElements.isEmpty()) {
                product.setFormattedPrice(priceElements.get(0).text());
            }
            products.add(product);
        }
        
        return products;
    }
    
    public static void main(String[] args) {
        JsoupScrapper jsoupScrapper = new JsoupScrapper();
        List<Product> products = jsoupScrapper.extractProducts();
        for (Product product : products) {
            System.out.println(
                    String.format("Product:\n%s\n%s\n%s\n\n", product.getName(), product.getFormattedPrice(), product.getLink())
            );
        }
    }
}
  • Run the scraper and check results
Product:
LEMFO W8 Smart Watch Men Women Heart Rate Blood Oxygen Pressure Fitness Bracelet
US $14.99
https://www.ebay.com/itm/LEMFO-W8-Smart-Watch-Men-Women-Heart-Rate-Blood-Oxygen-Pressure-Fitness-Bracelet/264379674226?_trkparms=5373%3A0%7C5374%3AFeatured


Product:
Bluetooth Headphones Bluedio T7 ANC Wireless Headset music with face recognition
US $35.51
https://www.ebay.com/itm/Bluetooth-Headphones-Bluedio-T7-ANC-Wireless-Headset-music-with-face-recognition/223649956355?_trkparms=5373%3A0%7C5374%3AFeatured


Product:
NSEE FJ800AC-3 800KG/1800LB Automatic Sliding Gate Door Operator Rack & Pinion
US $546.23
https://www.ebay.com/itm/NSEE-FJ800AC-3-800KG-1800LB-Automatic-Sliding-Gate-Door-Operator-Rack-Pinion/272622872854?_trkparms=5373%3A0%7C5374%3AFeatured

...

Note that Jsoup works only with the HTML returned when the page is requested. It does not execute JavaScript or wait for any JS events, so content that is loaded dynamically won’t be in the document Jsoup receives.
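
A small sketch can make this limitation concrete. The hypothetical page below has an empty container that a script would fill in a browser; Jsoup keeps the script as plain text and never runs it, so the selector finds nothing. The markup and the names `deals` and `deal` are invented for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class StaticHtmlSketch {
    // Hypothetical page: in a browser, the script would add a paragraph to #deals.
    static final String PAGE =
            "<div id='deals'></div>"
            + "<script>"
            + "document.getElementById('deals').innerHTML = '<p class=deal>Loaded later</p>';"
            + "</script>";

    /** Counts the elements with class "deal" that Jsoup can see in the given HTML. */
    static int countDeals(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select(".deal").size();
    }

    public static void main(String[] args) {
        // The script body is never executed, so no .deal element exists in the parsed tree.
        System.out.println("Deals visible to Jsoup: " + countDeals(PAGE));
    }
}
```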

If you need to scrape data after all JavaScript has executed, have a look at headless browsers (like PhantomJS), which we’ll cover in the next post.

So, we’ve learned that web scraping using Jsoup is not complex at all and can be easily adapted to your needs.
