In the previous part we covered some of NLTK's basic functionality, such as tokenisation, stemming and stop word filtering. In this part of the tutorial we are going to explore some of the more challenging features NLTK has to offer.
Part of Speech (PoS) Tagging : PoS tagging means labelling each word in a sentence with its part of speech, such as noun, adjective or verb, including finer distinctions like tense.
#PART OF SPEECH TAGGING
import nltk
from nltk import word_tokenize
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer
# Train the unsupervised Punkt sentence tokenizer on the 2005 State of the
# Union speech, then use it to split the 2006 speech into sentences
train_text = state_union.raw('2005-GWBush.txt')
sample_text = state_union.raw('2006-GWBush.txt')
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)
def process_content():
    try:
        # Tag the first five sentences and print the (word, tag) pairs
        for i in tokenized[:5]:
            words = word_tokenize(i)
            tagged = nltk.pos_tag(words)
            print(tagged)
    except Exception as e:
        print(str(e))

process_content()
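The tagger returns a list of (word, tag) pairs using the Penn Treebank tag set, which makes downstream filtering easy. Here is a minimal sketch that keeps only the nouns; the tagged list is a hand-written stand-in for `nltk.pos_tag` output, so no corpus download is needed:

```python
# Hand-written stand-in for nltk.pos_tag output (Penn Treebank tags);
# the real list would come from nltk.pos_tag(word_tokenize(sentence))
tagged = [('President', 'NNP'), ('Bush', 'NNP'), ('addressed', 'VBD'),
          ('the', 'DT'), ('nation', 'NN'), ('tonight', 'RB')]

# Noun tags all start with 'NN' (NN, NNS, NNP, NNPS)
nouns = [word for word, tag in tagged if tag.startswith('NN')]
print(nouns)
```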
Named Entity Recognition (NER) : NER is the process of recognising and pulling out all the named "entities" from a given sentence or corpus. Entities can be organisation names, people, places and locations, financial figures, and so on.
#NAMED ENTITY RECOGNITION
import nltk
from nltk import word_tokenize
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer
train_text = state_union.raw('2005-GWBush.txt')
sample_text = state_union.raw('2006-GWBush.txt')
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)
def process():
    try:
        for i in tokenized[5:]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            # binary=True labels every entity simply as 'NE' instead of
            # distinguishing PERSON, ORGANIZATION, GPE and so on
            namedEnt = nltk.ne_chunk(tagged, binary=True)
            namedEnt.draw()  # opens a GUI window for each sentence
    except Exception as e:
        print(str(e))

process()
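Instead of drawing the tree, the entities can be pulled out programmatically by walking the subtrees labelled 'NE' (the single label `ne_chunk` uses when binary=True). A sketch using a hand-built nltk Tree as a stand-in for `ne_chunk` output, so no chunker model download is needed:

```python
from nltk.tree import Tree

# Hand-built stand-in for the tree that ne_chunk(tagged, binary=True)
# would return for "George Bush spoke in Washington"
named_ent = Tree('S', [
    Tree('NE', [('George', 'NNP'), ('Bush', 'NNP')]),
    ('spoke', 'VBD'), ('in', 'IN'),
    Tree('NE', [('Washington', 'NNP')]),
])

# Join the words of every 'NE' subtree into an entity string
entities = [' '.join(word for word, tag in subtree.leaves())
            for subtree in named_ent.subtrees(filter=lambda t: t.label() == 'NE')]
print(entities)
```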
Chunking : As the name suggests, chunking means grouping the words of a sentence into meaningful "chunks", such as noun phrases. Chunking works on top of PoS tagging: in NLTK, a chunk grammar of regular expressions over the PoS tags decides which words get grouped together.
#CHUNKING
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer
train_text = state_union.raw('2005-GWBush.txt')
sample_text = state_union.raw('2006-GWBush.txt')
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)
"""
"""
def process_content():
try:
for i in tokenized:
words = nltk.word_tokenize(i)
tagged = nltk.pos_tag(words)
chunkGram = r"""Chunk: {
chunkParser = nltk.RegexpParser(chunkGram)
chunked = chunkParser.parse(tagged)
print(chunked)
#ITERATING THROUGH ALL THE SUBTREES AND FILTERING ONLY THE CHUNKS
for subtree in chunked.subtrees(filter=lambda t: t.label() == 'Chunk'):
print(subtree)
chunked.draw()
except Exception as e:
print(str(e))
process_content()
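RegexpParser itself needs no trained models, only PoS tags, so a chunk grammar can be tried out on a hand-tagged sentence. A minimal sketch with a pattern that groups optional adverbs and verbs together with one or more proper nouns:

```python
import nltk

# Hand-tagged sentence, so no tagger model is required
tagged = [('He', 'PRP'), ('quickly', 'RB'), ('praised', 'VBD'),
          ('President', 'NNP'), ('Bush', 'NNP')]

# Zero or more adverbs and verbs, then one or more proper nouns,
# then an optional noun
grammar = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
parser = nltk.RegexpParser(grammar)
tree = parser.parse(tagged)

# Collect the leaves of every subtree labelled 'Chunk'
chunks = [subtree.leaves()
          for subtree in tree.subtrees(filter=lambda t: t.label() == 'Chunk')]
print(chunks)
```

The pronoun 'He' is left out of the chunk, while the adverb, verb and proper nouns are grouped into one.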
Chinking : Chinking is a lot like chunking, except that it removes unwanted parts from the chunks obtained through chunking. A chink, in other words, is a sequence of tokens that gets carved out of a chunk.
#CHINKING
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer
train_text = state_union.raw('2005-GWBush.txt')
sample_text = state_union.raw('2006-GWBush.txt')
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)
'''
REMOVING FROM THE CHUNK ONE OR MORE VERBS, PREPOSITIONS, DETERMINERS,
OR THE WORD 'to'
'''
def process_content():
    try:
        for i in tokenized[5:]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            # {...} chunks everything; }...{ chinks (removes) the
            # matching tags back out of the chunks
            chunkGram = r"""Chunk: {<.*>+}
                                   }<VB.?|IN|DT|TO>+{"""
            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)
            chunked.draw()  # opens a GUI window for each sentence
    except Exception as e:
        print(str(e))

process_content()
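The same chink grammar can be checked on a hand-tagged sentence without any GUI or model downloads. Everything is chunked first, then the verbs, prepositions, determiners and 'to' are chinked out, splitting the big chunk into smaller ones:

```python
import nltk

# Hand-tagged sentence, so no tagger model is required
tagged = [('The', 'DT'), ('big', 'JJ'), ('cat', 'NN'),
          ('sat', 'VBD'), ('on', 'IN'), ('the', 'DT'), ('mat', 'NN')]

# Chunk everything, then chink (remove) verbs, prepositions,
# determiners and 'to' out of the chunks
grammar = r"""Chunk: {<.*>+}
                     }<VB.?|IN|DT|TO>+{"""
parser = nltk.RegexpParser(grammar)
tree = parser.parse(tagged)

chunks = [subtree.leaves()
          for subtree in tree.subtrees(filter=lambda t: t.label() == 'Chunk')]
print(chunks)
```

Chinking 'The', 'sat', 'on' and 'the' leaves two chunks behind: 'big cat' and 'mat'.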
Understanding all of the above concepts will take a bit of time and effort (it took me a lot more than I thought it would), so it would be wise to conclude this post here. In the next part we are going to learn even more advanced NLP concepts using NLTK.
I have a GitHub repository with all of the code explained above, in a well-commented structure. I will make sure to update it as this series of posts advances.
Stay tuned. Until next time…!