Python’s Natural Language Toolkit (NLTK) Tutorial - Part 1

It is quite possible that you have heard the term ‘Natural Language Processing’, or NLP for short. Natural Language Processing is a very hot topic in the field of Artificial Intelligence, and particularly in machine learning, because of its enormous range of day-to-day applications.

These applications include chatbots, language translation, text classification, paragraph summarization, spam filtering and many more. There are a few open-source NLP libraries that do the job of processing text, such as NLTK, the Stanford NLP suite and Apache OpenNLP. NLTK is the most popular, as well as one of the easiest to understand.

OK, enough chit-chat. Let us start with some basic scripts in order to get our hands dirty with NLTK.
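One housekeeping note first: the models and corpora used below have to be downloaded once. Here is a minimal setup sketch (these package names are the standard NLTK data identifiers):

#ONE-TIME SETUP
import nltk

nltk.download('punkt')                        # Punkt tokenizer models
nltk.download('averaged_perceptron_tagger')   # PoS tagger model
nltk.download('state_union')                  # State of the Union corpus
nltk.download('maxent_ne_chunker')            # named entity chunker model
nltk.download('words')                        # word list used by the NE chunker

Part of Speech (PoS) Tagging : PoS tagging is the process of labelling each word of a sentence with its part of speech: noun, verb, adjective, adverb and so on. NLTK’s tagger uses the Penn Treebank tag set, so you will see tags like NNP (proper noun, singular) and VBD (past-tense verb) in the output.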

#PART OF SPEECH TAGGING
import nltk
from nltk import word_tokenize
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

# Train the unsupervised Punkt sentence tokenizer on the 2005 address
train_text = state_union.raw('2005-GWBush.txt')
sample_text = state_union.raw('2006-GWBush.txt')

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)

# Split the 2006 address into sentences
tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process_content():
    try:
        for i in tokenized[:5]:           # only the first five sentences
            words = word_tokenize(i)      # split the sentence into words
            tagged = nltk.pos_tag(words)  # pair each word with its PoS tag
            print(tagged)

    except Exception as e:
        print(str(e))

process_content()
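If a tag in the output is unfamiliar, NLTK can describe it for you. A quick sketch (this needs the extra 'tagsets' data package):

nltk.download('tagsets')       # one-time download of the tag documentation
nltk.help.upenn_tagset('NNP')  # prints a description of NNP: noun, proper, singular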

Named Entity Recognition (NER) : NER is the process of recognising and pulling out all the named “entities” from a given sentence or corpus. Entities can be organisations, people, places, financial figures and so on.

#NAMED ENTITY RECOGNITION
import nltk
from nltk import word_tokenize
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw('2005-GWBush.txt')

sample_text = state_union.raw('2006-GWBush.txt')

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)

tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process():
    try:
        for i in tokenized[5:]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            namedEnt = nltk.ne_chunk(tagged, binary=True)  # binary=True labels every entity simply as 'NE'
            namedEnt.draw()                                # opens a window with the parse tree

    except Exception as e:
        print(str(e))

process()
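Popping up a tree window for every sentence gets old quickly. Here is a minimal sketch of collecting the entities as plain strings instead, assuming the variables defined above (the helper name extract_entities is mine; with binary=True every entity subtree carries the label 'NE'):

def extract_entities(sentence):
    words = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(words)
    tree = nltk.ne_chunk(tagged, binary=True)
    # keep only the subtrees labelled 'NE' and join their words back together
    return [" ".join(word for word, tag in subtree.leaves())
            for subtree in tree.subtrees(filter=lambda t: t.label() == 'NE')]

for sentence in tokenized[5:8]:
    print(extract_entities(sentence))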

Chunking : As the name suggests, chunking means grouping the words of a sentence into ‘chunks’, i.e. phrases such as noun phrases. Chunking works on top of PoS tagging: in NLTK you describe the chunks you want with regular expressions over PoS tags.

#CHUNKING
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw('2005-GWBush.txt')

sample_text = state_union.raw('2006-GWBush.txt')

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)

tokenized = custom_sent_tokenizer.tokenize(sample_text)

"""
* = "0 or more of any tense of adverb," followed by:
* = "0 or more of any tense of verb," followed by:
+ = "One or more proper nouns," followed by
? = "zero or one singular noun."
"""

def process_content():
    try:
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)
            print(chunked)

            #ITERATING THROUGH ALL THE SUBTREES AND FILTERING ONLY THE CHUNKS
            for subtree in chunked.subtrees(filter=lambda t: t.label() == 'Chunk'):
                print(subtree)

            chunked.draw()

    except Exception as e:
        print(str(e))

process_content()
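On the 2006 address, the filtered subtrees print in NLTK’s bracketed tree notation; the output looks roughly like this (abridged):

(Chunk PRESIDENT/NNP GEORGE/NNP W./NNP BUSH/NNP)
(Chunk ADDRESS/NNP)

Each Chunk subtree holds the (word, tag) pairs that matched the grammar.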

Chinking : Chinking is a lot like chunking, except that it removes unwanted pieces from the chunks obtained through chunking. A chink, then, is a sequence of tokens that gets carved out of a chunk. In the grammar, {...} defines what goes into a chunk, while }...{ defines the chink to be removed from it.

#CHINKING
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw('2005-GWBush.txt')

sample_text = state_union.raw('2006-GWBush.txt')

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)

tokenized = custom_sent_tokenizer.tokenize(sample_text)

'''
CHUNK EVERYTHING, THEN CHINK (REMOVE) FROM EACH CHUNK ONE OR MORE
VERBS, PREPOSITIONS, DETERMINERS, OR THE WORD 'to'
'''

def process_content():
    try:
        for i in tokenized[5:]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)

            chunkGram = r"""Chunk: {<.*>+}
                            }<VB.?|IN|DT|TO>+{"""

            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)
            chunked.draw()

    except Exception as e:
        print(str(e))

process_content()

Understanding all of the above concepts will take a bit of time and effort (it took me a lot more than I thought it would). Therefore, it would be wise to conclude this post here. In the next part of this series we are going to learn even more advanced NLP concepts using NLTK.

I have a GitHub repository with all of the code explained above, in a well-commented structure. I will make sure to update it as this series of posts advances.

Stay tuned. Until next time…!