Please download the source files first, since this post is written for one specific URL only.
Before starting, install the BeautifulSoup library. I have described the installation in one of my previous posts, and you can also find instructions on many websites.
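If you already have pip set up for Python 2, one common way to install it is the following (this assumes the old BeautifulSoup 3 package, which is what the imports below use):
pip install BeautifulSoup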
First, import the necessary libraries.
from BeautifulSoup import BeautifulSoup
import urllib2
Then open a file to write the content into.
f1 = open("Output.txt", "w")
Let’s open a website. Remember, this post is written only for the given URL.
soup = BeautifulSoup(urllib2.urlopen("Your url"))
You can also do this in two steps, as follows.
webContent = urllib2.urlopen("Your url")
soup = BeautifulSoup(webContent)
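If there is a chance the URL cannot be opened, you can wrap these same two steps in a try/except block. This is only an optional sketch and not part of the original steps:
try:
    webContent = urllib2.urlopen("Your url")   # may raise URLError if the site is unreachable
    soup = BeautifulSoup(webContent)
except urllib2.URLError as e:
    print "Could not open the url:", e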
Now let’s extract the content of the site. For example, let’s extract a table and part of its content.
table1 = soup.findAll('table')[2]
This step extracts the table at index 2 (the third table on the page), since soup.findAll returns a list.
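If you are not sure which index you need, a quick sketch like this shows how many tables the page has before you pick one:
tables = soup.findAll('table')
print "Number of tables found:", len(tables)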
Let’s read the rows. I assume you already know what <tr> means in HTML.
rows = table1.findAll('tr')
In order to identify the content, let’s extract the title as well.
title = ''.join(rows[0].findAll('td')[0].findAll(text=True))+"\n"
Here, make sure to use two single quotation marks in ''.join; this line extracts the text content of the first <td></td> in the first row.
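To see what is happening, findAll(text=True) returns a list of the text pieces inside the tag, and ''.join glues them into one string. Here is a tiny standalone illustration; the list contents are made up and not from the scraped page:
pieces = ['Monthly ', 'Report']   # the kind of list findAll(text=True) might return
print ''.join(pieces)             # prints: Monthly Report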
And now let’s write the title to the file.
f1.write("Table Name: %s" % title)
To extract the content we have to loop through the rows; here we loop over only the first 7 rows. Then we read all the columns, which are identified by the <td></td> tags.
for tr in rows[:7]:
    cols = tr.findAll('td')
The columns at indices 8 to 11 (i.e. cols[8:12]) are read and appended to a variable named tblCont. On each pass through the outer loop, the accumulated content of the row is written to the file.
tblCont =""
for td in cols[8:12]:
tblCont = tblCont + "\t"+td.find(text=True)
f1.write("%s" % tblCont+"\n")
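Since the loop is shown in two pieces above, here is the same loop assembled in one place so the nesting and indentation are clear:
for tr in rows[:7]:
    cols = tr.findAll('td')
    tblCont = ""
    for td in cols[8:12]:
        tblCont = tblCont + "\t" + td.find(text=True)
    f1.write("%s\n" % tblCont)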
Finally, the file object has to be closed.
f1.close()
Now save the file and run it by pressing F5 (if you are using IDLE).