Web Scraping with Python

Web scraping is an automated way of extracting large amounts of unstructured data from web pages and storing it in a structured form, using online services, APIs, or custom code. People scrape for research and information gathering, for digital marketing, and for a whole host of other purposes.

Python was used because it has a very large collection of libraries, it is easy to use, and its community is active, so help with your code is easy to find. The first day of the course introduced us to the apps and libraries needed to get started with scraping in Python. First-time Python users were taken through downloading and installing Python and an IDE on the various operating systems (on some, such as Linux, Python already comes installed), as well as setting up virtual environments for installing the scraping tools to be used.

Anaconda was recommended for the course. The next session looked at the basics of the Python programming language for the benefit of first-time users who did not quite know their way around the language.

In the third session, participants were introduced to the basics of HTML and CSS, so they could reference HTML elements and CSS properties and IDs, and to JavaScript basics, to better understand web scraping. Since a scraping program makes HTTP requests and parses HTML responses, the requests and Beautiful Soup libraries for Python were installed.
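To make the request-then-parse idea concrete, here is a minimal sketch using the two libraries named above. The function names, the `h2.title` selector, and the sample HTML are all assumptions made for illustration, not material from the course. The inline page stands in for a real response so the example runs without a network connection.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url):
    """Download a page and return its HTML text (makes a network call)."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # raise an error on 4xx/5xx responses
    return response.text


def extract_headlines(html):
    """Return the text of every <h2 class="title"> element in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    # select() takes a CSS selector, which is where the HTML/CSS basics pay off
    return [tag.get_text(strip=True) for tag in soup.select("h2.title")]


# Hypothetical page content, used instead of a live request.
SAMPLE_HTML = """
<html><body>
  <h2 class="title">First post</h2>
  <h2 class="title"> Second post </h2>
  <h2 class="other">Not a headline</h2>
</body></html>
"""

headlines = extract_headlines(SAMPLE_HTML)
print(headlines)  # ['First post', 'Second post']
```

Against a real site, the same parser would be fed the result of `fetch_html(url)`, with the selector adjusted to whatever the page inspection turns up.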

The basic scraping sequence is to first find the target URLs and inspect the pages to identify the arguments you’d need to pass to the scraper. Then, if no API or service is available, write and execute the code to extract the required data. The data can then be stored in structured formats such as XML or tables, or presented as charts, using the Pandas library.
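The storage step at the end of that sequence can be sketched as follows. The records and column names are hypothetical stand-ins for whatever a scraper extracts, and CSV and JSON are used here only as convenient examples of structured output.

```python
import io

import pandas as pd

# Hypothetical records, shaped the way a scraper might return them.
rows = [
    {"title": "First post", "url": "https://example.com/1"},
    {"title": "Second post", "url": "https://example.com/2"},
]

# A DataFrame gives the scraped data a tabular structure.
df = pd.DataFrame(rows)

# Write CSV to an in-memory buffer; pass a path such as "posts.csv"
# to to_csv() instead to save the table to disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# The same table serialized as a JSON array of records.
json_text = df.to_json(orient="records")

print(csv_text)
```

Pandas also offers other writers and plotting helpers, so the same DataFrame could feed a chart or another file format.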

The legality of scraping depends on the consent of the site being scraped, especially where the privacy of its users is concerned. What is generally allowed or disallowed can be checked in the site’s robots.txt file. Social media platforms do a lot of scraping themselves, for instance to determine a user’s activity log over a period. The course proved very effective, as participants got hands-on practice scraping some sites, and indeed practice makes perfect.
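The robots.txt check mentioned above can be automated with Python's standard library. This is a small sketch under stated assumptions: the rules, the `my-scraper` user agent, and the example.com URLs are invented for illustration, and the rules are parsed from a string so the example runs offline. Note that robots.txt only states a site's crawling preferences; it does not by itself settle questions of consent or privacy.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real file is served at the site root,
# e.g. https://example.com/robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(url, user_agent="my-scraper"):
    """Return True if the parsed rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


print(is_allowed("https://example.com/blog/post"))     # True
print(is_allowed("https://example.com/private/data"))  # False
```

For a live site, `parser.set_url(...)` followed by `parser.read()` would download and parse the real file instead of the inline string.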

Many thanks to PyLadies Ghana for this great initiative, and a big thank you to the facilitator, Joy Ayittey.