I discovered scrape-it and it is incredibly easy to use

There are many libraries to scrape content, but scrape-it makes it so simple. I had no prior scraping experience, and it took me just a few minutes to extract the information I needed.

Preface

This is the first time I've tackled a scraping project. I recently had some ideas that require scraping content, and I would have to work on this at some point, so...

This whole thing came from my need to find new piano pieces that I could practice according to my level. If you have played an instrument for some time, you’ve probably felt the same way I do.

You may know tons of pieces you’d like to play, aside from the “boring” repertoire we all know helps us improve, but you don’t know enough music yet to tell at first glance whether they are way above your level or something approachable that you can propose to your teacher.

DISCLAIMER: Let me clarify that I wrote “boring” in quotes above because, even if some of these pieces may not seem very interesting to learn, you should always listen to what your teacher recommends.

Idea

I know there are different scales that grade musical pieces into levels (ABRSM, Henle Verlag, Royal Conservatory of Music, ...), taking different aspects into account. I particularly like the explanation from PianoLibrary about this.

I decided to go for Henle because it has a marketplace with enough piano works to give this scraping thing a try. Also, as this is just a pet project of my own, I wasn’t interested in scraping the whole pianolibrary.org page just to present the same content in a different format.

I think what they do with that page is super valuable work, and I strongly recommend that any piano enthusiast visit their website, because you’ll also find links to IMSLP (another amazing project) where you can download the pieces for free.

Code

Initially I thought of using Beautiful Soup, a Python package widely adopted for scraping, but as I’m not used to Python package management (pip/virtualenvs always felt a bit like magic to me), I decided to go with JS to develop faster, as I use it a lot at Factorial.

Once I chose the language, I started checking some scraping packages and scrape-it showed up outta nowhere. It has been maintained for several years (released in 2016, latest version from Aug 2021), has a good number of downloads, and has a lightweight unpacked size, which made it a perfect candidate for what I was looking for.

And tbh it took me less time than I expected to start scraping the exact information I wanted. The main function I use to scrape content receives a url (obviously) and a schema with the content to be retrieved from the page. This schema can include individual elements, or nested elements if you want to scrape the same information for a set of child elements (exactly what I needed). Just indicate the selector rules to match the parent node (listItem is the naming convention in scrape-it) and the rules to match the information you want from its children, and you are good to go.

    import scrapeIt from 'scrape-it'

    const workCollectionSchema = {
      workCollection: {
        listItem: '.result-item.clearfix',
        data: {
          title: {
            selector: 'h2.main-title',
          },
          author: {
            selector: 'h2.sub-title',
          },
          href: {
            selector: 'a.cover-wrapper',
            attr: 'href'
          }
        },
      }
    }

    const url = 'https://www.henle.de/en/search/?Scoring=Keyboard+instruments&Instrument=Piano+solo'
    const res = await scrapeIt(url, workCollectionSchema)

    console.log(res.data.workCollection) // [{ title: ..., author: ..., href: ...}, {...}, ...]

Considerations and credits

It’s important to scrape ethically. For that reason I added a delay of 10 seconds between each scrape, so that my information retrieval doesn’t compromise the stability of the Henle page. Apparently there is no standard about this, but as Henle didn’t have any specific robots.txt information on scraping, and people in different threads suggest waiting somewhere between 5 and 15 seconds, I thought it would be a good idea to do so.
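A delay like this can be sketched in a few lines of plain JS. This is only a minimal sketch, not necessarily how the project does it: `sleep`, `scrapeSequentially` and the `scrapePage` callback are names made up for this example, and the 10-second default simply mirrors the delay mentioned above.

```javascript
// Resolve after `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))

// Scrape `urls` one at a time, waiting `delayMs` between consecutive requests.
// `scrapePage` is a hypothetical callback that receives a url and returns a
// promise with the scraped data. With scrape-it it could look like:
//   (url) => scrapeIt(url, workCollectionSchema).then((res) => res.data)
async function scrapeSequentially(urls, scrapePage, delayMs = 10_000) {
  const results = []
  for (const [index, url] of urls.entries()) {
    // No need to wait before the very first request.
    if (index > 0) await sleep(delayMs)
    results.push(await scrapePage(url))
  }
  return results
}
```

Passing the scraping function as a parameter keeps the waiting logic separate from scrape-it itself, so the delay can be tested with a fake callback and a tiny `delayMs` before pointing it at a real page.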

Also, I think it is extremely important to credit the different creators from whom I’ve learned something, or who made my development faster while working on this project. For that reason, I want to thank the following libraries / people / organisations:

  • PianoLibrary: for a fantastic explanation of the difficulty scales available for piano pieces.
  • Henle Verlag: for owning all the content I extracted to make this pet project possible.
  • Ionica Bizau: creator of the scrape-it package (among others) for making my first scraping experience extremely smooth.