Hello IS,
Preface
In this post, and in others to come, I will be sharing my little experience with NLP (Natural Language Processing). At present I am conducting research on Question Generation, which is a part of Question Answering systems in NLP. In the course of my research I decided to first try generating questions from a corpus about fruits. But my search for such a corpus was in vain, as I could not find one. Therefore I decided to make an annotated (tagged) corpus of my own using Wikipedia.
Things I assume you know ....
1. Basic knowledge of Python, string manipulation, and file reading and writing
2. A little experience with Python's urllib library and BeautifulSoup
3. Passion to learn and patience to read.
Let's start,
What do you mean by NLP or Natural Language Processing?
"Natural language processing is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human languages. As such, NLP is related to the area of human–computer interaction."
----- Taken from Wikipedia
What is meant by Text Corpus?
"In linguistics, a corpus (plural corpora) or text corpus is a large and structured set of texts (nowadays usually electronically stored and processed). They are used to do statistical analysis and hypothesis testing, checking occurrences or validating linguistic rules within a specific language territory.
A corpus may contain texts in a single language (monolingual corpus) or text data in multiple languages (multilingual corpus). Multilingual corpora that have been specially formatted for side-by-side comparison are called aligned parallel corpora."
----- Taken from Wikipedia
So, in order to get all the articles about fruits, I first need a list of all fruits. Simple Wikipedia has a page that lists them; its URL, https://simple.wikipedia.org/wiki/List_of_fruits, is the one used in the code below.
So, my first task is to extract the list of all fruits from that page. If you look carefully, every fruit page link consists of a common base URL with '/wiki/<Fruit_name>' appended to it.
We define a variable holding the base URL :-
base_url = "https://simple.wikipedia.org"
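Therefore we have the URL structure as :-
base_url/wiki/<Fruit_name>
If you look carefully, you can see that the fruit links all start with /wiki/, while most other links on the page have a different structure. A few non-fruit links also start with /wiki/, which is why the code below keeps only a slice of the matches.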
We use BeautifulSoup for HTML parsing.
Here's the code that returns the list of all the fruits in the above-mentioned URL structure, but without the base_url.
We store the fruit list URL in a separate variable just to make the code more readable.
self.__list_of_fruit_url = "https://simple.wikipedia.org/wiki/List_of_fruits"
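def get_fruit_list_links(self):
    fruit_list_links = []
    # Send the User-Agent header along with the request.
    req = Request(self.__list_of_fruit_url, headers=self.__headers)
    page = urlopen(req)
    soup = BeautifulSoup(page, "html.parser")
    scrap_links = soup.findAll("a")
    for link in scrap_links:
        tmp = link.get("href")
        # Keep only internal article links such as "/wiki/Apple".
        if str(tmp).startswith("/wiki/"):
            fruit_list_links.append(self.__remove_non_ascii(tmp))
    # The slice keeps the fruit links and drops the other "/wiki/" links
    # around them; the positions depend on the page layout at the time.
    return fruit_list_links[2:87]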
This function removes any non-ASCII characters that may be present due to Unicode.
@staticmethod
def __remove_non_ascii(text):
    return "".join([i if ord(i) < 128 else "" for i in text])
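To see what this helper does to a string, here is a small standalone sketch. It is written for Python 3 (the tutorial code targets Python 2), the function names are my own, and the sample path is made up for illustration. It also shows an alternative that goes through the ASCII codec, which should give the same result.

```python
def remove_non_ascii(text):
    """Drop every character whose code point is 128 or above."""
    return "".join(ch for ch in text if ord(ch) < 128)


def remove_non_ascii_codec(text):
    """Same idea via the ASCII codec, ignoring characters it cannot encode."""
    return text.encode("ascii", "ignore").decode("ascii")


# A made-up sample path containing two accented letters ("/wiki/Açaí").
sample = "/wiki/A\u00e7a\u00ed"

print(remove_non_ascii(sample))        # /wiki/Aa
print(remove_non_ascii_codec(sample))  # /wiki/Aa
```

Note that the removal is lossy: the accented letters are simply dropped, not replaced by their plain equivalents.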
What is happening above?
First, we set the headers so that our crawler is not identified as a bot (even without the headers it will work for Wikipedia, but I consider it good practice), and then the HTML data is read.
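As a rough illustration of the header step, the sketch below builds a request object with a User-Agent header without sending anything over the network. It assumes Python 3, where Request lives in urllib.request instead of urllib2, and build_request is a helper name I made up for this example.

```python
from urllib.request import Request


def build_request(url, user_agent="Mozilla/5.0"):
    """Build (but do not send) a request carrying a User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


req = build_request("https://simple.wikipedia.org/wiki/List_of_fruits")

# urllib stores header names capitalized, so the key is "User-agent".
print(req.get_header("User-agent"))  # Mozilla/5.0
```

Passing the resulting object to urlopen() would then perform the actual download.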
Then we call the BeautifulSoup class and initialize it with the HTML data of the web page.
Then we get all the links from the page by using the findAll() method of the BeautifulSoup instance. For each link we check whether the 'href' of the anchor tag starts with '/wiki/'. If it does, we store it in a list, and finally we return that list.
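If you want to try the filtering idea without touching the network or installing anything, here is a minimal sketch. It swaps BeautifulSoup for the standard library's html.parser module (Python 3), and both the class name and the tiny sample page are invented for this example.

```python
from html.parser import HTMLParser


class WikiLinkCollector(HTMLParser):
    """Collect href values of <a> tags that start with '/wiki/'."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Skip anchors without an href and links with another structure.
        if href and href.startswith("/wiki/"):
            self.links.append(href)


# A tiny hand-written page standing in for the real list page.
SAMPLE_HTML = """
<ul>
  <li><a href="/wiki/Apple">Apple</a></li>
  <li><a href="https://example.org/">External</a></li>
  <li><a href="/w/index.php?title=Special:Search">Search</a></li>
  <li><a href="/wiki/Banana">Banana</a></li>
  <li><a>No href</a></li>
</ul>
"""

collector = WikiLinkCollector()
collector.feed(SAMPLE_HTML)
collector.close()
print(collector.links)  # ['/wiki/Apple', '/wiki/Banana']
```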
When you test the above, it will return this list :-
['/wiki/Apple', '/wiki/Apricot', '/wiki/Avocado', '/wiki/Banana', '/wiki/Breadfruit', '/wiki/Bilberry', '/wiki/Blackberry', '/wiki/Blackcurrant', '/wiki/Blueberry', '/wiki/Boysenberry', '/wiki/Cantaloupe', '/wiki/Currant', '/wiki/Cherry', '/wiki/Cherimoya', '/wiki/Cloudberry', '/wiki/Coconut', '/wiki/Cranberry', '/wiki/Cucumber', '/wiki/Damson', '/wiki/Date_palm', '/wiki/Dragonfruit', '/wiki/Durian', '/wiki/Eggplant', '/wiki/Elderberry', '/wiki/Feijoa', '/wiki/Common_fig', '/wiki/Goji_berry', '/wiki/Gooseberry', '/wiki/Grape', '/wiki/Raisin', '/wiki/Grapefruit', '/wiki/Guava', '/wiki/Huckleberry', '/wiki/Honeydew', '/wiki/Jackfruit', '/wiki/Jambul', '/wiki/Jujube', '/wiki/Kiwi_fruit', '/wiki/Kumquat', '/wiki/Lemon', '/wiki/Lime', '/wiki/Loquat', '/wiki/Lychee', '/wiki/Mango', '/wiki/Melon', '/wiki/Cantaloupe', '/wiki/Honeydew_(melon)', '/wiki/Watermelon', '/wiki/Rock_melon', '/wiki/Miracle_fruit', '/wiki/Mulberry', '/wiki/Nectarine', '/wiki/Nut_(fruit)', '/wiki/Olive', '/wiki/Orange_(fruit)', '/wiki/Clementine', '/wiki/Mandarine', '/wiki/Blood_Orange', '/wiki/Tangerine', '/wiki/Papaya', '/wiki/Passionfruit', '/wiki/Peach', '/wiki/Pepper', '/wiki/Chili_pepper', '/wiki/Pear', '/wiki/Williams_pear', '/wiki/Bartlett_pear', '/wiki/Persimmon', '/wiki/Physalis', '/wiki/Plum', '/wiki/Pineapple', '/wiki/Pomegranate', '/wiki/Pomelo', '/wiki/Purple_Mangosteen', '/wiki/Quince', '/wiki/Raspberry', '/wiki/Rambutan', '/wiki/Redcurrant', '/wiki/Salal', '/wiki/Satsuma', '/wiki/Star_fruit', '/wiki/Strawberry', '/wiki/Tomato', '/wiki/Ugli', '/wiki/Melon']
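Notice that the list above contains '/wiki/Cantaloupe' and '/wiki/Melon' twice, so those pages end up being downloaded twice. If that bothers you, one possible way to drop the repeats while keeping the original order is sketched below (Python 3; unique_in_order is a name I made up, and the short list is just a sample taken from the output above).

```python
def unique_in_order(items):
    """Return items with later duplicates removed, keeping first-seen order."""
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


links = ["/wiki/Cantaloupe", "/wiki/Melon", "/wiki/Cantaloupe", "/wiki/Ugli", "/wiki/Melon"]
print(unique_in_order(links))  # ['/wiki/Cantaloupe', '/wiki/Melon', '/wiki/Ugli']
```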
Okay, so now we have the list of all the URL paths that can be used to scrape the articles from Wikipedia.
We have the base_url from above and the fruit-specific paths from the list, so if we concatenate each item in the list with the base_url we get the complete URL of each fruit's article.
Example :- Let's take the first item from the list : "/wiki/Apple"
Now we concatenate it with base_url and we get the real URL.
base_url + "/wiki/Apple" = "https://simple.wikipedia.org" + "/wiki/Apple" = "https://simple.wikipedia.org/wiki/Apple"
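The same step can be checked in a few lines. This is only a sketch for Python 3: it uses urljoin from the standard library instead of plain string concatenation, which for paths starting with "/" should give the same result, and full_article_url is a helper name I invented.

```python
from urllib.parse import urljoin

BASE_URL = "https://simple.wikipedia.org"


def full_article_url(path, base_url=BASE_URL):
    """Return the absolute URL for a site-relative path like '/wiki/Apple'."""
    return urljoin(base_url, path)


print(full_article_url("/wiki/Apple"))  # https://simple.wikipedia.org/wiki/Apple
```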
But before giving the method that does this, let me explain how to scrape the articles.
We use the same approach as when we scraped the URLs. But in this case, when you look at the page source, you will find that the article text is written in "p" (paragraph) tags. So we again use BeautifulSoup to read the HTML content, but this time we use the findAll() method to find the "p" tags and not the "a" tags.
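To get a feel for what collecting the "p" tags gives you, here is a small offline sketch. It uses the standard library's html.parser module (Python 3) in place of BeautifulSoup, and the class name and the sample HTML are made up for this example.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text inside each <p> element, one string per paragraph."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._parts = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self.paragraphs.append("".join(self._parts))
            self._in_p = False

    def handle_data(self, data):
        # Text inside nested tags such as <b> or <a> still arrives here.
        if self._in_p:
            self._parts.append(data)


# Hand-written sample markup, not real Wikipedia content.
SAMPLE_HTML = (
    "<h1>Apple</h1>"
    "<p>An <b>apple</b> is a fruit.</p>"
    "<div>navigation</div>"
    "<p>It grows on <a href='/wiki/Tree'>trees</a>.</p>"
)

parser = ParagraphText()
parser.feed(SAMPLE_HTML)
parser.close()
print("\n".join(parser.paragraphs))
```

Only the paragraph text survives; the heading and the navigation block are ignored, which is the reason for looking at "p" tags in the first place.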
We create a directory and save the contents that we extract, with file names of the form
<Fruit_name>.txt
Now, how do we get the file name from the link? Well, it's very simple. The following line of code does that. I hope you will test it yourself.
Here, link corresponds to each item in the list that we scraped earlier.
filename = link[str(link).rindex("/") + 1: len(link)]
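If you would like to test it quickly, the sketch below does the same thing with rsplit(), which I find a little easier to read. It is written for Python 3 and filename_from_link is just a name I picked for the example.

```python
def filename_from_link(link):
    """Return the last path segment of a link such as '/wiki/Date_palm'."""
    return link.rsplit("/", 1)[-1]


for link in ["/wiki/Apple", "/wiki/Date_palm", "/wiki/Honeydew_(melon)"]:
    print(filename_from_link(link) + ".txt")
# Apple.txt
# Date_palm.txt
# Honeydew_(melon).txt
```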
So here's the code that scrapes the article content and stores it in the directory.
def create_corpus(self, links_list):
    dir_name = "fruit_corpus_simplewiki"
    dir_loc = self.__dir_location + dir_name + "/"
    base_url = "https://simple.wikipedia.org"
    # Create the output directory next to this script if it is missing.
    if not os.path.exists(dir_loc):
        try:
            os.mkdir(dir_loc)
        except OSError as e:
            print(e)
    for link in links_list:
        fruit_url = base_url + link
        print("Working with %s" % fruit_url)
        soup = BeautifulSoup(urlopen(fruit_url).read(), "html.parser")
        html_content = soup.findAll("p")
        # Join the text of every paragraph, one paragraph per line.
        main_content = str(self.__remove_non_ascii("\n".join(["".join(w.text) for w in html_content])))
        filename = link[str(link).rindex("/") + 1: len(link)]
        try:
            with open(dir_loc + filename + ".txt", "wb") as f:
                f.write(main_content)
            print("File \"%s.txt\" saved in %s" % (filename, dir_loc))
        except IOError as e:
            print("Error in writing to file: %s" % e)
In the above code some variables may look unfamiliar, but they will make sense once you have seen the full code.
The above code creates the output directory if it does not already exist. You can change the path by giving a different directory. By default it creates a subdirectory in the same location as the script file. We get the directory location with the following line of code.
__dir_location = os.path.dirname(os.path.realpath(__file__)) + "/"
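For comparison, here is one possible way to write the "create the directory if it is missing" step on its own. This is a sketch for Python 3 (the exist_ok argument is not available in Python 2), ensure_corpus_dir is a made-up helper name, and the demo runs in a temporary directory so it does not leave anything behind.

```python
import os
import tempfile


def ensure_corpus_dir(parent_dir, dir_name="fruit_corpus_simplewiki"):
    """Create parent_dir/dir_name if needed and return its path."""
    dir_loc = os.path.join(parent_dir, dir_name)
    # exist_ok=True makes the call safe to repeat on later runs.
    os.makedirs(dir_loc, exist_ok=True)
    return dir_loc


# Demonstrate in a throwaway directory instead of next to a real script.
with tempfile.TemporaryDirectory() as tmp:
    corpus_dir = ensure_corpus_dir(tmp)
    print(os.path.isdir(corpus_dir))  # True
```

Using os.path.join() also avoids having to add the "/" separators by hand.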
So, the rest should be pretty readable, and I hope the variable names are self-explanatory.
So here's the log you will get once all the files have been downloaded.
Working with https://simple.wikipedia.org/wiki/Apple
File "Apple.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Apricot
File "Apricot.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Avocado
File "Avocado.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Banana
File "Banana.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Breadfruit
File "Breadfruit.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Bilberry
File "Bilberry.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Blackberry
File "Blackberry.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Blackcurrant
File "Blackcurrant.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Blueberry
File "Blueberry.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Boysenberry
File "Boysenberry.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Cantaloupe
File "Cantaloupe.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Currant
File "Currant.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Cherry
File "Cherry.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Cherimoya
File "Cherimoya.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Cloudberry
File "Cloudberry.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Coconut
File "Coconut.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Cranberry
File "Cranberry.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Cucumber
File "Cucumber.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Damson
File "Damson.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Date_palm
File "Date_palm.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Dragonfruit
File "Dragonfruit.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Durian
File "Durian.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Eggplant
File "Eggplant.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Elderberry
File "Elderberry.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Feijoa
File "Feijoa.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Common_fig
File "Common_fig.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Goji_berry
File "Goji_berry.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Gooseberry
File "Gooseberry.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Grape
File "Grape.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Raisin
File "Raisin.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Grapefruit
File "Grapefruit.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Guava
File "Guava.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Huckleberry
File "Huckleberry.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Honeydew
File "Honeydew.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Jackfruit
File "Jackfruit.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Jambul
File "Jambul.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Jujube
File "Jujube.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Kiwi_fruit
File "Kiwi_fruit.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Kumquat
File "Kumquat.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Lemon
File "Lemon.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Lime
File "Lime.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Loquat
File "Loquat.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Lychee
File "Lychee.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Mango
File "Mango.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Melon
File "Melon.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Cantaloupe
File "Cantaloupe.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Honeydew_(melon)
File "Honeydew_(melon).txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Watermelon
File "Watermelon.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Rock_melon
File "Rock_melon.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Miracle_fruit
File "Miracle_fruit.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Mulberry
File "Mulberry.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Nectarine
File "Nectarine.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Nut_(fruit)
File "Nut_(fruit).txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Olive
File "Olive.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Orange_(fruit)
File "Orange_(fruit).txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Clementine
File "Clementine.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Mandarine
File "Mandarine.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Blood_Orange
File "Blood_Orange.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Tangerine
File "Tangerine.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Papaya
File "Papaya.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Passionfruit
File "Passionfruit.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Peach
File "Peach.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Pepper
File "Pepper.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Chili_pepper
File "Chili_pepper.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Pear
File "Pear.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Williams_pear
File "Williams_pear.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Bartlett_pear
File "Bartlett_pear.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Persimmon
File "Persimmon.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Physalis
File "Physalis.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Plum
File "Plum.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Pineapple
File "Pineapple.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Pomegranate
File "Pomegranate.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Pomelo
File "Pomelo.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Purple_Mangosteen
File "Purple_Mangosteen.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Quince
File "Quince.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Raspberry
File "Raspberry.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Rambutan
File "Rambutan.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Redcurrant
File "Redcurrant.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Salal
File "Salal.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Satsuma
File "Satsuma.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Star_fruit
File "Star_fruit.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Strawberry
File "Strawberry.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Tomato
File "Tomato.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Ugli
File "Ugli.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
Working with https://simple.wikipedia.org/wiki/Melon
File "Melon.txt" saved in /root/PycharmProjects/Ques_&_Ans-Draft/WikiFruitCorpus/fruit_corpus_simplewiki/
So here's the code of the complete working class :-
import os
from urllib2 import urlopen, Request

from bs4 import BeautifulSoup

__author__ = "Psycho_Coder"
__date__ = "30th July, 2014"


class WikiFruitCorpus:
    __list_of_fruit_url = ""
    __headers = ""
    __dir_location = os.path.dirname(os.path.realpath(__file__)) + "/"

    def __init__(self, header="Mozilla/5.0"):
        self.__list_of_fruit_url = "https://simple.wikipedia.org/wiki/List_of_fruits"
        self.set_headers(header)

    def set_headers(self, header):
        self.__headers = {"User-Agent": header}

    def get_fruit_list_links(self):
        fruit_list_links = []
        # Send the User-Agent header along with the request.
        req = Request(self.__list_of_fruit_url, headers=self.__headers)
        page = urlopen(req)
        soup = BeautifulSoup(page, "html.parser")
        scrap_links = soup.findAll("a")
        for link in scrap_links:
            tmp = link.get("href")
            # Keep only internal article links such as "/wiki/Apple".
            if str(tmp).startswith("/wiki/"):
                fruit_list_links.append(self.__remove_non_ascii(tmp))
        # The slice keeps the fruit links and drops the other "/wiki/" links
        # around them; the positions depend on the page layout at the time.
        return fruit_list_links[2:87]

    @staticmethod
    def __remove_non_ascii(text):
        return "".join([i if ord(i) < 128 else "" for i in text])

    def create_corpus(self, links_list):
        dir_name = "fruit_corpus_simplewiki"
        dir_loc = self.__dir_location + dir_name + "/"
        base_url = "https://simple.wikipedia.org"
        # Create the output directory next to this script if it is missing.
        if not os.path.exists(dir_loc):
            try:
                os.mkdir(dir_loc)
            except OSError as e:
                print(e)
        for link in links_list:
            fruit_url = base_url + link
            print("Working with %s" % fruit_url)
            soup = BeautifulSoup(urlopen(fruit_url).read(), "html.parser")
            html_content = soup.findAll("p")
            # Join the text of every paragraph, one paragraph per line.
            main_content = str(self.__remove_non_ascii("\n".join(["".join(w.text) for w in html_content])))
            filename = link[str(link).rindex("/") + 1: len(link)]
            try:
                with open(dir_loc + filename + ".txt", "wb") as f:
                    f.write(main_content)
                print("File \"%s.txt\" saved in %s" % (filename, dir_loc))
            except IOError as e:
                print("Error in writing to file: %s" % e)
Sample Code to use the class :-
def main():
    cor = WikiFruitCorpus()
    header = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) " \
             "Ubuntu Chromium/34.0.1847.116 Chrome/34.0.1847.116 Safari/537.36"
    cor.set_headers(header)
    fruit_list = cor.get_fruit_list_links()
    cor.create_corpus(fruit_list)


if __name__ == "__main__":
    main()
So, that's all folks.
Note :- Simple Wikipedia is a simplified version of Wikipedia with less content. But the above code should work on the full version of Wikipedia as well. All you need to do is set :-
base_url = "https://en.wikipedia.org"
In the above code 'en' is the language code for the English version; if you replace it with a different language code it should still work, as long as the articles exist under the same titles in that language.
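If you plan to switch editions often, a tiny helper like the following might be handy. It is only a sketch (Python 3), and wikipedia_base_url is a name I made up, not something used in the class above.

```python
def wikipedia_base_url(language_code="simple"):
    """Return the base URL of the Wikipedia edition for a language code."""
    return "https://%s.wikipedia.org" % language_code


print(wikipedia_base_url())      # https://simple.wikipedia.org
print(wikipedia_base_url("en"))  # https://en.wikipedia.org
```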
Conclusion
So, I hope you enjoyed the tutorial and learned a few things. I used this corpus to create a POS-tagged corpus and also an NER-tagged version, but since they are for research purposes I cannot disclose the tagged ones for now.
Also, I wrote this tutorial after a long break, so I just hope that I still have the same spirit with which I wrote my earlier tutorials; the explanation above should reflect that. So, thank you, and I would love to have your feedback.
Thank you,
Sincerely,
Psycho_Coder.