Select & Query Chain

The select() method is the default interpreter: it takes a Select String and, under the hood, builds a query chain of Selectors. To make this explicit, the previous scraper example can also be written as follows:

import Impressionist from 'impressionist';

const result = await Impressionist.execute('http...', async (browser, page) => {
    return await page.evaluate(async () => {
        return await collector({
          name: select('h1'),
          content: select('.content{outerHTML}'),
          media_gallery: select('.carousel img{src}*')
        }).call();
    });
});

console.log(result);

So far, Select Strings work quite well for extracting any property of a DOM element. Now we want to go a little deeper and understand the process behind them.

Each Query is composed of Selectors concatenated into a chain, and each Selector has a specific responsibility. Next, we will look at the basic selectors and then rebuild the scraper from the last section using only Query Chains.
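To make the idea of a chain concrete, here is a minimal sketch in plain JavaScript. It is not Impressionist's implementation: the Query class and its add() and call() methods are hypothetical names chosen for illustration, and plain objects stand in for DOM elements.

```javascript
// Hypothetical sketch: a query is an ordered list of steps (selectors),
// and each step receives the output of the previous one.
class Query {
  constructor(steps = []) {
    this.steps = steps;
  }

  // Return a new Query with one more step appended, so chains stay immutable.
  add(step) {
    return new Query([...this.steps, step]);
  }

  // Run every step in order, feeding each result into the next step.
  call(context) {
    return this.steps.reduce((value, step) => step(value), context);
  }
}

// Plain objects stand in for DOM elements in this sketch.
const fakeDocument = [
  { tag: 'h1', innerText: 'Plato Plugin' },
  { tag: 'p', innerText: 'A short description.' },
];

const headingQuery = new Query()
  .add((elements) => elements.filter((el) => el.tag === 'h1')) // extractor step
  .add((elements) => elements.map((el) => el.innerText))       // property step
  .add((values) => values[0]);                                  // keep one value

const heading = headingQuery.call(fakeDocument);
console.log(heading); // 'Plato Plugin'
```

Each step only knows about its own input and output, which is what lets the selectors described below be combined freely.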

Selectors

While select() is the most commonly used form, there are additional selector types that can be used as well.

Extractors

css

The 'css' selector, as the name implies, uses a CSS selector to extract all matching elements in the DOM. Additionally, you can set an alternative CSS selector using the .alt() method.

return await collector({
  name: css('h1')
}).call(); // [h1]
return await collector({
  name: css('h1').alt('h2.name')
}).call(); // [h1]
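The fallback idea behind .alt() can be sketched as follows. This assumes the alternative is only consulted when the primary selector matches nothing, which the text above implies but does not spell out. The helper name queryWithAlternatives and the fake page object are hypothetical.

```javascript
// Hypothetical helper: try the primary selector first and fall back to the
// alternatives, in order, only when nothing matched.
function queryWithAlternatives(lookup, primary, ...alternatives) {
  for (const selector of [primary, ...alternatives]) {
    const matches = lookup(selector);
    if (matches.length > 0) {
      return matches;
    }
  }
  return [];
}

// A stand-in for document.querySelectorAll, backed by a plain object.
// This fake page has no h1, only an h2 with the class "name".
const fakePage = { 'h2.name': ['<h2 class="name">Plato Plugin</h2>'] };
const lookup = (selector) => fakePage[selector] ?? [];

const matched = queryWithAlternatives(lookup, 'h1', 'h2.name');
console.log(matched); // the h2 match, because 'h1' found nothing
```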

xpath

The 'xpath' selector uses an XPath expression to extract all matching elements in the DOM. Additionally, you can set an alternative XPath expression using the .alt() method.

return await collector({
  name: xpath('//h1')
}).call();  //[h1]
return await collector({
  name: xpath('//h1').alt('//h2')
}).call();  //[h1]

property

The 'property' selector extracts a specific property from a list of DOM elements. Additionally, you can set an alternative property using the .alt() method.

return await collector({
  name: css('h1').property('innerText')
}).call(); // 'Plato Plugin'
return await collector({
  name: css('h1').property('outerHTML')
}).call(); // '<h1>Plato Plugin</h1>'

Validators

single

Ensures that only one element is matched and returned. By default, the 'single()' selector is applied to every Query Chain.

return await collector({
  name: css('h1').property('innerText').single()
}).call(); // 'Plato Plugin'

Now let's suppose there is more than one h1 element in the DOM. In this case, single() guards against strange behavior or unwanted values by throwing an error that reports that more than one element matches the specified selector. This gives the developer the opportunity to refine the selector so that it matches a single element.

return await collector({
  name: css('h1').property('innerText').single()
}).call(); // Error: There is more than one element that match the selector.
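The check that single() performs can be pictured with a small sketch. The function below is a hypothetical stand-in that works on a plain array of values; the library's real error message and internals may differ.

```javascript
// Hypothetical validator: accept a list of values and return the only one,
// or throw when the query matched more than one.
function single(values) {
  if (values.length > 1) {
    throw new Error(`Expected one value but the query matched ${values.length}.`);
  }
  return values[0];
}

const oneHeading = single(['Plato Plugin']);

let ambiguousError = null;
try {
  single(['Plato Plugin', 'Another heading']);
} catch (error) {
  ambiguousError = error;
}

console.log(oneHeading);             // 'Plato Plugin'
console.log(ambiguousError.message); // 'Expected one value but the query matched 2.'
```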

all

Returns all the values. This selector prevents 'single()' from being applied by default.

return await collector({
  name: css('h1').property('innerText').all()
}).call(); // ['Plato Plugin']

require

Throws an error if there are no values, which happens when the selector does not match any DOM element or the property doesn't exist. By default, the 'require()' selector is applied to every Query Chain.

In the following example, given that the selector and the property are valid we obtain the expected result:

return await collector({
  name: css('h1').property('innerText').single().require()
}).call(); // 'Plato Plugin'

Here the outcome is different: the selector does not match any element, so even though the property is valid, 'require()' throws an error.

return await collector({
  name: css('h0').property('innerText').single().require()
}).call(); // Error: No elements, please check the query chain.

default

Returns the specified value if, for example, the selector didn't match any element. This selector prevents 'require()' from being applied by default.

Here you get the expected result:

return await collector({
  name: css('h1').property('innerText').single().default('No name')
}).call(); // 'Plato Plugin'

Now suppose that, for some reason, the page has no h1 element. The selector and the property are both valid, but HTML layouts are often inconsistent from page to page, so a missing element is a realistic case. The default() method gives the developer the flexibility to return a default value instead of the error generated by require().

return await collector({
  name: css('h1').property('innerText').single().default('No name')
}).call(); // 'No name'
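The contrast between require() and default() can be sketched with two small functions over a list of extracted values. Both function names and the error text are hypothetical; they only illustrate the behavior described above, not the library's internals.

```javascript
// Hypothetical validators over the list of values a query extracted.
function requireValues(values) {
  if (values.length === 0) {
    throw new Error('No values were extracted.');
  }
  return values;
}

function withDefault(values, fallback) {
  // Substitute the fallback only when nothing was extracted.
  return values.length === 0 ? [fallback] : values;
}

const noMatches = []; // what an extractor would yield when the page has no h1

let requireError = null;
try {
  requireValues(noMatches);
} catch (error) {
  requireError = error;
}

const [fallbackName] = withDefault(noMatches, 'No name');
const [realName] = withDefault(['Plato Plugin'], 'No name');

console.log(requireError.message); // 'No values were extracted.'
console.log(fallbackName);         // 'No name'
console.log(realName);             // 'Plato Plugin'
```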

Use of Query Chains

If we take our previous scraper:

import Impressionist from 'impressionist';

const result = await Impressionist.execute('http...', async (browser, page) => {
    return await page.evaluate(async () => {
        return await collector({
          name: 'h1',
          content: '.content{outerHTML}',
          media_gallery: '.carousel img{src}*'
        }).call();
    });
});

console.log(result);

We can use Query chains to obtain the same results:

import Impressionist from 'impressionist';

const result = await Impressionist.execute('http...', async (browser, page) => {
    return await page.evaluate(async () => {
        return await collector({
          name: css('h1').property('innerText').single().require(),
          content: css('.content').property('outerHTML').single().require(),
          media_gallery: css('.carousel img').property('src').all().require()
        }).call();
    });
});

console.log(result);
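The mapping between the two scrapers above can be sketched as a tiny parser. It covers only the three Select String shapes used in this section and is a hypothetical illustration, not the parser that select() actually uses; parseSelectString and toChainDescription are made-up names.

```javascript
// Hypothetical parser for the three Select String shapes used in this section:
//   'h1'                  -> selector only (innerText as the property)
//   '.content{outerHTML}' -> selector plus {property}
//   '.carousel img{src}*' -> trailing * means all values instead of a single one
function parseSelectString(selectString) {
  const match = /^(.*?)(?:\{([^{}]+)\})?(\*)?$/.exec(selectString.trim());
  if (match === null) {
    throw new Error(`Unsupported Select String: ${selectString}`);
  }
  const [, selector, property, star] = match;
  return {
    selector: selector.trim(),
    property: property ?? 'innerText',
    all: star === '*',
  };
}

// Render the equivalent query chain as text, mirroring the example above.
function toChainDescription({ selector, property, all }) {
  const cardinality = all ? 'all' : 'single';
  return `css('${selector}').property('${property}').${cardinality}().require()`;
}

for (const selectString of ['h1', '.content{outerHTML}', '.carousel img{src}*']) {
  console.log(toChainDescription(parseSelectString(selectString)));
}
```

The three printed lines match the chains written by hand in the scraper above.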

If we remove the selectors that are applied by default (single() and require()), our scraper looks like this:

import Impressionist from 'impressionist';

const result = await Impressionist.execute('http...', async (browser, page) => {
    return await page.evaluate(async () => {
        return await collector({
          name: css('h1').property('innerText'),
          content: css('.content').property('outerHTML'),
          media_gallery: css('.carousel img').property('src').all()
        }).call();
    });
});

console.log(result);
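The default rules described in the Validators section can be summarized in a short sketch: single() is added unless all() is present, and require() is added unless default() is present. The function below works on step names only and is a hypothetical model; where the library actually inserts the default steps is an assumption here.

```javascript
// Hypothetical model of the defaults, using step names instead of real selectors.
function applyDefaults(stepNames) {
  const completed = [...stepNames];
  // single() is applied unless the chain already chose single() or all().
  if (!completed.includes('single') && !completed.includes('all')) {
    completed.push('single');
  }
  // require() is applied unless the chain already chose require() or default().
  if (!completed.includes('require') && !completed.includes('default')) {
    completed.push('require');
  }
  return completed;
}

console.log(applyDefaults(['css', 'property']));
// [ 'css', 'property', 'single', 'require' ]
console.log(applyDefaults(['css', 'property', 'all']));
// [ 'css', 'property', 'all', 'require' ]
console.log(applyDefaults(['css', 'property', 'single', 'default']));
// [ 'css', 'property', 'single', 'default' ]
```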

At this point you may be wondering when to use a Select String, a select() selector, or a Query Chain of Selectors. The answer: if you need something more configurable, use a Query Chain. Select Strings are very easy to use, but they don't allow us to set alternatives or get elements using XPath expressions. Also, as we will see in the next topic, Select Strings can't use Advanced Selectors, because those advanced concepts only exist in a Query Chain of Selectors.
