Python’s Natural Language Toolkit (NLTK) Tutorial - Part 3

In the previous part we covered some of the more advanced concepts like Chunking, PoS tagging, Chinking and NER. In this part of the tutorial we are going to learn the basics of text classification.

Text classification is the process of categorising text into groups. It’s like tagging a piece of text with a label. For example, sentiment analysis is a form of text classification wherein text is classified as either positive or negative. Language detection is another form of text classification, where the text is to be tagged as English, German, French and so on.
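
To make the idea concrete, here is a minimal sketch of what labelled data for sentiment analysis looks like (the sentences and labels below are made up purely for illustration):

# Hypothetical labelled examples: each pair is (text, label).
labelled_texts = [
    ("What a wonderful movie, I loved it!", "pos"),
    ("Terrible plot and wooden acting.", "neg"),
]
# A text classifier learns to map the text in each pair to its label.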

You will need to have scikit-learn, scipy and NLTK installed in order to run the following program. Without further ado, let’s dive into the code.
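
If any of these are missing, they can be installed with pip. The movie_reviews corpus used throughout this part also has to be downloaded once via NLTK’s downloader; a one-time setup sketch:

# From a shell: pip install nltk scikit-learn scipy
import nltk
nltk.download('movie_reviews')  # one-time download of the corpus used below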

WHOAAA! The full program, when put together, can look like one big, complex chunk of code. Not to worry at all. We will break it down bit by bit. Let’s start:

  • First and foremost, import all the necessary libraries:

import random
import pickle

import nltk
from nltk.corpus import movie_reviews
from nltk.classify.scikitlearn import SklearnClassifier
from nltk.classify import ClassifierI

from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC

from statistics import mode

  • The second step is to build a list of (tokenized review, category) pairs, shuffle it so that our data is not ordered by category, and print one entry as a sanity check. A short check on the corpus size follows the code below.

listings = []

for category in movie_reviews.categories():
    for fileid in movie_reviews.fileids(category):
        listings.append((list(movie_reviews.words(fileid)), category))

random.shuffle(listings)

print(listings[1])
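
As a quick sanity check, the movie_reviews corpus ships with 1000 positive and 1000 negative reviews, so our shuffled list should hold 2000 pairs:

print(movie_reviews.categories())  # ['neg', 'pos']
print(len(listings))               # 2000 (review, category) pairs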

  • In the third step we will create a list of all the words in our data and convert it into an NLTK frequency distribution. A frequency distribution gives us insights into the data, such as which words are most common, which are least common, and how many times a particular word is repeated throughout.

all_words = []

for item in movie_reviews.words():
    all_words.append(item.lower())

all_words = nltk.FreqDist(all_words)

print(all_words.most_common(5))
print(all_words['stupid'])

  • The fourth step is to find the top 2000 words in our data, look for those words in each pos/neg listing, and mark each word’s presence or absence in the review. This is nothing but creating a feature set. A quick peek at the result follows the code below.

word_features = [w for (w, c) in all_words.most_common(2000)]

def find_features(listing):
    words = set(listing)
    features = {}
    for w in word_features:
        features[w] = (w in words)
    return features

print(find_features(movie_reviews.words('neg/cv000_29416.txt')))

featuresets = [(find_features(rev), category) for (rev, category) in listings]
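
Each entry in featuresets pairs a dictionary of word-presence features with the review’s label. A quick peek (the exact output will vary because the list was shuffled):

print(len(featuresets))   # 2000, one feature dictionary per review
print(featuresets[0][1])  # the label of the first entry: 'pos' or 'neg'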

  • Step five: applying the classifier. We apply the Naive Bayes classifier to our dataset. We first carve out a training dataset for our model, then a testing dataset to test it against. We then call NLTK’s built-in Naive Bayes classifier and train it on the training dataset. At last we print the accuracy of the classifier, followed by the 15 most informative features that our model used. A short example of classifying new text follows the code below.

train = featuresets[:1900]
test = featuresets[1900:]

classifier = nltk.NaiveBayesClassifier.train(train)

print("Naive Bayes Algo accuracy percent:", (nltk.classify.accuracy(classifier, test))*100)

classifier.show_most_informative_features(15)
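
Once trained, the classifier can label unseen text as well. A minimal sketch, with a made-up review sentence:

# Hypothetical input sentence, lower-cased and split to match our features.
sample_text = "This movie was absolutely wonderful, full of great performances."
sample_words = [w.lower() for w in sample_text.split()]
print(classifier.classify(find_features(sample_words)))  # prints 'pos' or 'neg'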

  • Step six is to save our classifier. The classifier is a Python object, so it cannot be saved as a CSV, Excel or plain-text file. Instead we serialise the object and save it using Python’s pickle module. To pickle it, we first open a file with write permission, dump the classifier object into that file, and close it. Ta - Daaaa!!! Our object is now saved and can be loaded whenever necessary. Of course, one will have to retrain from scratch to make changes to the classifier. Loading the object is almost identical, the only difference being that we use load instead of dump. A quick verification follows the code below.

save_classifier = open("naivebayes.pickle","wb")
pickle.dump(classifier, save_classifier)
save_classifier.close()

classifier_f = open("naivebayes.pickle", "rb")
classifier = pickle.load(classifier_f)
classifier_f.close()
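
To confirm the round trip worked, one can re-check the loaded classifier against the same testing set (a small sanity check; the accuracy should match the figure printed earlier):

print("Loaded classifier accuracy percent:", (nltk.classify.accuracy(classifier, test))*100)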

  • Naive Bayes is not the only classification technique available to us. There are quite a few classification algorithms at our disposal. The following compares the accuracy of several of scikit-learn’s classification algorithms on our dataset; a more compact version is sketched after these blocks.

'''
COMPARING SOME OF THE ALGORITHMS SK-LEARN HAS TO OFFER FOR CLASSIFICATION
THE NAMES OF ALL THE METHODS MENTIONED BELOW ARE PRETTY SELF EXPLANATORY
'''

MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(train)
print("MultinomialNB accuracy percent:", (nltk.classify.accuracy(MNB_classifier, test))*100)

BNB_classifier = SklearnClassifier(BernoulliNB())
BNB_classifier.train(train)
print("BernoulliNB accuracy percent:", (nltk.classify.accuracy(BNB_classifier, test))*100)

LogisticRegression_classifier = SklearnClassifier(LogisticRegression())
LogisticRegression_classifier.train(train)
print("LogisticRegression_classifier accuracy percent:", (nltk.classify.accuracy(LogisticRegression_classifier, test))*100)

SGDClassifier_classifier = SklearnClassifier(SGDClassifier())
SGDClassifier_classifier.train(train)
print("SGDClassifier_classifier accuracy percent:", (nltk.classify.accuracy(SGDClassifier_classifier, test))*100)

SVC_classifier = SklearnClassifier(SVC())
SVC_classifier.train(train)
print("SVC_classifier accuracy percent:", (nltk.classify.accuracy(SVC_classifier, test))*100)

LinearSVC_classifier = SklearnClassifier(LinearSVC())
LinearSVC_classifier.train(train)
print("LinearSVC_classifier accuracy percent:", (nltk.classify.accuracy(LinearSVC_classifier, test))*100)

NuSVC_classifier = SklearnClassifier(NuSVC())
NuSVC_classifier.train(train)
print("NuSVC_classifier accuracy percent:", (nltk.classify.accuracy(NuSVC_classifier, test))*100)

There we have it. The basics of text classification using NLTK. The important thing to understand here, alongside applying a classifier, is the methodology adopted to reach the output and the logic behind each and every step. Further down the road of learning machine learning, applying various models will only be a small part of the work. There will be data gathering, data cleaning, choosing the model, training the model, testing the model, tuning the hyper-parameters and much more.

I have a GitHub repository containing all of the above code in a well-commented structure.

Stay tuned. Until next time…!