In the previous part we covered some of NLTK's basic functionality, such as tokenisation, stemming and stop word filtering. In this part of the tutorial we are going to explore some of the more challenging features NLTK has to offer.
Part of Speech (PoS) Tagging : PoS tagging means labelling each word in a sentence with its part of speech, such as noun, adjective or verb, including finer distinctions like tense.
#PART OF SPEECH TAGGING
import nltk
from nltk import word_tokenize
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer
# Train the unsupervised Punkt sentence tokenizer on the 2005 State of the
# Union speech, then use it to split the 2006 speech into sentences
train_text = state_union.raw('2005-GWBush.txt')
sample_text = state_union.raw('2006-GWBush.txt')
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)
def process_content():
    try:
        # Tag the first five sentences and print the (word, tag) pairs
        for i in tokenized[:5]:
            words = word_tokenize(i)
            tagged = nltk.pos_tag(words)
            print(tagged)
    except Exception as e:
        print(str(e))

process_content()
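The tagger returns a list of (word, tag) pairs using the Penn Treebank tag set, which makes downstream filtering easy. Here is a minimal sketch that keeps only the nouns; the tagged list is a hand-written stand-in for `nltk.pos_tag` output, so no corpus download is needed:

```python
# Hand-written stand-in for nltk.pos_tag output (Penn Treebank tags);
# the real list would come from nltk.pos_tag(word_tokenize(sentence))
tagged = [('President', 'NNP'), ('Bush', 'NNP'), ('addressed', 'VBD'),
          ('the', 'DT'), ('nation', 'NN'), ('tonight', 'RB')]

# Noun tags all start with 'NN' (NN, NNS, NNP, NNPS)
nouns = [word for word, tag in tagged if tag.startswith('NN')]
print(nouns)
```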
Named Entity Recognition (NER) : NER is the process of recognising and pulling out all the named "entities" from a given sentence or corpus. Entities can be organisation names, people, places and locations, financial figures, and so on.
#NAMED ENTITY RECOGNITION
import nltk
from nltk import word_tokenize
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer
train_text = state_union.raw('2005-GWBush.txt')
sample_text = state_union.raw('2006-GWBush.txt')
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)
def process():
    try:
        for i in tokenized[5:]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            # binary=True labels every entity simply as 'NE' instead of
            # distinguishing PERSON, ORGANIZATION, GPE and so on
            namedEnt = nltk.ne_chunk(tagged, binary=True)
            namedEnt.draw()  # opens a GUI window for each sentence
    except Exception as e:
        print(str(e))

process()
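Instead of drawing the tree, the entities can be pulled out programmatically by walking the subtrees labelled 'NE' (the single label `ne_chunk` uses when binary=True). A sketch using a hand-built nltk Tree as a stand-in for `ne_chunk` output, so no chunker model download is needed:

```python
from nltk.tree import Tree

# Hand-built stand-in for the tree that ne_chunk(tagged, binary=True)
# would return for "George Bush spoke in Washington"
named_ent = Tree('S', [
    Tree('NE', [('George', 'NNP'), ('Bush', 'NNP')]),
    ('spoke', 'VBD'), ('in', 'IN'),
    Tree('NE', [('Washington', 'NNP')]),
])

# Join the words of every 'NE' subtree into an entity string
entities = [' '.join(word for word, tag in subtree.leaves())
            for subtree in named_ent.subtrees(filter=lambda t: t.label() == 'NE')]
print(entities)
```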
Chunking : As the name suggests, chunking means grouping the words of a sentence into meaningful "chunks", such as noun phrases. Chunking works on top of PoS tagging: in NLTK, a chunk grammar of regular expressions over the PoS tags decides which words get grouped together.
#CHUNKING
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer
train_text = state_union.raw('2005-GWBush.txt')
sample_text = state_union.raw('2006-GWBush.txt')
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)
"""
"""
def process_content():
try:
for i in tokenized:
words = nltk.word_tokenize(i)
tagged = nltk.pos_tag(words)
chunkGram = r"""Chunk: {
chunkParser = nltk.RegexpParser(chunkGram)
chunked = chunkParser.parse(tagged)
print(chunked)
#ITERATING THROUGH ALL THE SUBTREES AND FILTERING ONLY THE CHUNKS
for subtree in chunked.subtrees(filter=lambda t: t.label() == 'Chunk'):
print(subtree)
chunked.draw()
except Exception as e:
print(str(e))
process_content()
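RegexpParser itself needs no trained models, only PoS tags, so a chunk grammar can be tried out on a hand-tagged sentence. A minimal sketch with a pattern that groups optional adverbs and verbs together with one or more proper nouns:

```python
import nltk

# Hand-tagged sentence, so no tagger model is required
tagged = [('He', 'PRP'), ('quickly', 'RB'), ('praised', 'VBD'),
          ('President', 'NNP'), ('Bush', 'NNP')]

# Zero or more adverbs and verbs, then one or more proper nouns,
# then an optional noun
grammar = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
parser = nltk.RegexpParser(grammar)
tree = parser.parse(tagged)

# Collect the leaves of every subtree labelled 'Chunk'
chunks = [subtree.leaves()
          for subtree in tree.subtrees(filter=lambda t: t.label() == 'Chunk')]
print(chunks)
```

The pronoun 'He' is left out of the chunk, while the adverb, verb and proper nouns are grouped into one.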
Chinking : Chinking is a lot like chunking, except that it removes unwanted parts from the chunks obtained through chunking. A chink, in other words, is a sequence of tokens that gets carved out of a chunk.
#CHINKING
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer
train_text = state_union.raw('2005-GWBush.txt')
sample_text = state_union.raw('2006-GWBush.txt')
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)
'''
REMOVING FROM THE CHUNK ONE OR MORE VERBS, PREPOSITIONS, DETERMINERS,
OR THE WORD 'to'
'''
def process_content():
    try:
        for i in tokenized[5:]:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            # {...} chunks everything; }...{ chinks (removes) the
            # matching tags back out of the chunks
            chunkGram = r"""Chunk: {<.*>+}
                                   }<VB.?|IN|DT|TO>+{"""
            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)
            chunked.draw()  # opens a GUI window for each sentence
    except Exception as e:
        print(str(e))

process_content()
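The same chink grammar can be checked on a hand-tagged sentence without any GUI or model downloads. Everything is chunked first, then the verbs, prepositions, determiners and 'to' are chinked out, splitting the big chunk into smaller ones:

```python
import nltk

# Hand-tagged sentence, so no tagger model is required
tagged = [('The', 'DT'), ('big', 'JJ'), ('cat', 'NN'),
          ('sat', 'VBD'), ('on', 'IN'), ('the', 'DT'), ('mat', 'NN')]

# Chunk everything, then chink (remove) verbs, prepositions,
# determiners and 'to' out of the chunks
grammar = r"""Chunk: {<.*>+}
                     }<VB.?|IN|DT|TO>+{"""
parser = nltk.RegexpParser(grammar)
tree = parser.parse(tagged)

chunks = [subtree.leaves()
          for subtree in tree.subtrees(filter=lambda t: t.label() == 'Chunk')]
print(chunks)
```

Chinking 'The', 'sat', 'on' and 'the' leaves two chunks behind: 'big cat' and 'mat'.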
Understanding all of the above concepts will take a bit of time and effort (it took me a lot more than I thought it would), so it would be wise to conclude this post here. In the next part we are going to learn even more advanced NLP concepts using NLTK.
I have a GitHub repository with all of the code explained above, in a well-commented structure. I will make sure to update it as this series of posts advances.
Stay tuned. Until next time…!