PhantomJS scraping on word translation example

In the previous article, we showed an example of web scraping using Jsoup. You may wonder how it differs from PhantomJS scraping. Jsoup parses the HTML content that is available when a page loads. In most cases that is enough, but many modern websites load content dynamically via JavaScript, so Jsoup may receive a nearly empty page without the data you need (e.g. products). In such cases headless browsers are the way to go, and PhantomJS is one of them.

PhantomJS is a headless web browser scriptable with JavaScript. It runs on Windows, macOS, and Linux, and can be downloaded for your platform from the PhantomJS website.

Probably the most common use of headless browsers is automated testing, though web scraping is also popular. Since PhantomJS is not just an HTML parser like Jsoup, we are free to include dynamic behavior such as waiting for JavaScript to finish executing after a page load, clicking buttons, setting input values, and so on.

Let’s get familiar with PhantomJS scraping through a practical example.
The idea is to implement a scraper that translates a list of words using Google Translate.

Note that this is only an example to explain how PhantomJS can be used. You should not use it for large lists of words or for any commercial purpose; there are convenient APIs available for that. Web scraping always carries a risk of violating a service’s rules or licenses, so be careful with it. It is also not recommended to scrape while logged in to a real account, because that account may end up banned.

Anyway, let’s get into the details:

  • Download PhantomJS, extract it, and note the path to its binary
  • Add the Maven dependency for the PhantomJS driver
<dependency>
    <groupId>com.codeborne</groupId>
    <artifactId>phantomjsdriver</artifactId>
    <version>1.4.1</version>
</dependency>
  • Implement a scraper
import org.openqa.selenium.WebElement;
import org.openqa.selenium.phantomjs.PhantomJSDriver;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class GoogleTranslateScraper {
    private static final String TRANSLATE_URL = "https://translate.google.com/#view=home&op=translate&sl=%s&tl=%s";

    private static final String WORD_SOURCE_ID = "source";
    private static final String TRANSLATION_SELECTOR = ".tlid-translation.translation span";

    private static final int WORD_TRANSLATION_WAIT_TIMEOUT = 1000;
    private static final int SERVICE_LOADING_TIMEOUT = 1000;

    private static final GoogleTranslateScraper INSTANCE = new GoogleTranslateScraper();

    private final PhantomJSDriver driver;

    private GoogleTranslateScraper() {
        // Replace <path_to_binary> with the path to the PhantomJS binary you downloaded.
        System.setProperty("phantomjs.binary.path", "<path_to_binary>");
        String userAgent = "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.41 Safari/535.1";
        System.setProperty("phantomjs.page.settings.userAgent", userAgent);

        this.driver = new PhantomJSDriver();
    }

    public static GoogleTranslateScraper getInstance() {
        return INSTANCE;
    }

    public List<String> translateWords(List<String> words, String from, String to) {
        List<String> translations = new ArrayList<>(words.size());

        driver.get(String.format(TRANSLATE_URL, from, to));
        waitInterval(SERVICE_LOADING_TIMEOUT);
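        for (String word : words) {
            WebElement source = driver.findElementById(WORD_SOURCE_ID);
            source.sendKeys(word);
            // Give the service time to render the translation.
            waitInterval(WORD_TRANSLATION_WAIT_TIMEOUT);

            // findElementsByCssSelector returns an empty list when nothing matches,
            // whereas findElementByCssSelector throws NoSuchElementException.
            List<WebElement> translationElements = driver.findElementsByCssSelector(TRANSLATION_SELECTOR);
            if (!translationElements.isEmpty()) {
                translations.add(translationElements.get(0).getText());
            }

            // Clear the input only after the translation has been read.
            source.clear();
        }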

        return translations;
    }

    private void waitInterval(long timeInMs) {
        try {
            Thread.sleep(timeInMs);
        } catch (InterruptedException e) {
            // Restore the interrupt flag before propagating the failure.
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        final GoogleTranslateScraper translator = getInstance();
        final List<String> wordsToTranslate = Arrays.asList("dog", "cat");
        final String from = "en";
        final String to = "de";

        try {
            List<String> translations = translator.translateWords(wordsToTranslate, from, to);
            System.out.println(translations);
        } finally {
            // Shut down the PhantomJS process once we are done.
            translator.driver.quit();
        }
    }
}
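
The scraper opens a URL built from a template with two language codes. As a small illustration, the sketch below fills in that template on its own, without starting a browser. The class name TranslateUrlExample and the helper buildUrl are illustrative names that do not appear in the scraper; only the template string is copied from TRANSLATE_URL.

```java
public class TranslateUrlExample {
    // Same template as TRANSLATE_URL in the scraper.
    private static final String TRANSLATE_URL =
            "https://translate.google.com/#view=home&op=translate&sl=%s&tl=%s";

    // Hypothetical helper: fills in the source and target language codes.
    public static String buildUrl(String from, String to) {
        return String.format(TRANSLATE_URL, from, to);
    }

    public static void main(String[] args) {
        System.out.println(buildUrl("en", "de"));
        // prints https://translate.google.com/#view=home&op=translate&sl=en&tl=de
    }
}
```

Changing the two arguments is all it takes to switch the language pair, which is why the scraper accepts from and to as parameters.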

As you may notice from the GoogleTranslateScraper code, we first initialize the PhantomJS driver by setting a system property with the path to its binary. Then we load the page, specifying the languages to translate from and to. After that, the following actions are performed for every word in the list:
– Set the source input to the word to be translated
– Wait for a configured interval so that the service has time to translate the word
– Scrape the translation
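
A fixed wait is simple, but it is either too long or too short on any given run. One possible alternative is to poll until the text appears or a timeout elapses. The sketch below shows that idea in plain Java. The PollingWait class and the waitForText helper are hypothetical names, and the Supplier stands in for whatever code reads the translation element; it is not wired to PhantomJS here. Selenium also ships its own WebDriverWait class for this purpose.

```java
import java.util.function.Supplier;

public class PollingWait {
    /**
     * Polls the supplier until it returns a non-null, non-empty string
     * or the timeout elapses. Returns null on timeout or interruption.
     */
    public static String waitForText(Supplier<String> source, long timeoutMs, long pollMs) {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (true) {
            String text = source.get();
            if (text != null && !text.isEmpty()) {
                return text;
            }
            if (System.currentTimeMillis() >= deadline) {
                return null;
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and give up waiting.
                Thread.currentThread().interrupt();
                return null;
            }
        }
    }

    public static void main(String[] args) {
        // Stand-in for "read the translation element": empty for the first two polls.
        int[] calls = {0};
        Supplier<String> fakePage = () -> ++calls[0] < 3 ? "" : "Hund";

        System.out.println(waitForText(fakePage, 1000, 50)); // prints Hund
        System.out.println(waitForText(() -> "", 100, 20));  // prints null
    }
}
```

With this approach the timeout becomes an upper bound rather than a delay paid on every word.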

Here is the output of running GoogleTranslateScraper:

[Hund, Katze]

We’ve got the German translations for the given list of English words.
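
The result is a plain list, so each translation is matched to its word only by position. If a lookup by word is more convenient, the two lists could be combined into a map. The sketch below is one way to do that; TranslationPairs and pair are hypothetical names, not part of the scraper. It rejects lists of different sizes, because the scraper skips words for which no translation element is found, and the positions would then no longer line up.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TranslationPairs {
    // Pairs each word with the translation at the same index, keeping input order.
    public static Map<String, String> pair(List<String> words, List<String> translations) {
        if (words.size() != translations.size()) {
            throw new IllegalArgumentException(
                    "Expected " + words.size() + " translations but got " + translations.size());
        }
        Map<String, String> result = new LinkedHashMap<>();
        for (int i = 0; i < words.size(); i++) {
            result.put(words.get(i), translations.get(i));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("dog", "cat");
        List<String> translations = Arrays.asList("Hund", "Katze");
        System.out.println(pair(words, translations)); // prints {dog=Hund, cat=Katze}
    }
}
```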

So, we’ve learned that web scraping with PhantomJS is not complex at all and can easily be adapted to your needs. More details are covered in the video that accompanies this article.
