Lesson 20: Python and the Requests Module

Watch the video above.

In this lesson, we will begin parsing the HTML data we saw in Lesson 19. We will be working with some simple code for extracting data from Wikipedia. In many digital humanities applications, you will find that to get the data you need, whether it's primary sources from a website or Twitter data posted online, you will need to read and parse HTML code. To do that, we use the Requests module. Requests is very powerful: it can quickly scrape all of the HTML code from a website, and in Python we can store that data as an object. The amazing thing about all of this is that you really only need to know a few functions to make it happen, and you can write the code in seconds.

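Here is a minimal sketch of the code, with its lines arranged to match the walkthrough below; the exact URL string and file name are not shown in the video, so treat both as placeholder assumptions:

```python
import requests

# The page to fetch and a file name for later use (both assumed values).
url = "https://en.wikipedia.org/wiki/Theuderic_IV"
file = "theuderic.html"

s = requests.get(url)
```
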
The code above is remarkably simple, yet if you run it, you will find that it has grabbed all of the HTML code from the address stored in the url object, which is a string holding the URL of the Wikipedia page for King Theuderic IV.

In line 1, we simply import the requests library. Then, in lines 4 and 5, we create our url and file objects. The only bit of code that comes from requests is line 7, in which we create an object, s, with the function requests.get(), which takes url as its argument. When we do this, requests goes to that URL, and s becomes a response object holding the HTML code the server sent back.

Try printing just s and see what happens. You should see something like this: <Response [200]>. That is the response object itself, not the page. To get the HTML code out of the object s, we must access either s.text or s.content. When you do this, you will notice that you now have the HTML code. Requests cannot, however, parse this data for us. For that, we need another module: BeautifulSoup, which is what we will discuss in Lesson 21.

For now, get comfortable with Requests and try getting the HTML code for different websites of your choice; the sketch below is a good starting point. When you are ready, move on to Lesson 21. We will have a single coding exercise covering both of these modules after Lesson 21: Key Concepts.
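Before moving on, here is a short sketch to experiment with, reusing the same assumed Wikipedia URL; the comments show roughly what each print call should produce:

```python
import requests

url = "https://en.wikipedia.org/wiki/Theuderic_IV"
s = requests.get(url)

print(s)                # <Response [200]> -- the response object, not the page
print(s.status_code)    # 200, the HTTP status code behind that message
print(s.text[:300])     # the first 300 characters of the HTML, as a string
print(type(s.content))  # <class 'bytes'> -- the same body as raw bytes
```

The difference between the two attributes: s.text decodes the page into a string, while s.content returns the raw bytes. For reading HTML, s.text is usually the one you want.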