In the previous part we covered some of the more advanced concepts like chunking, PoS tagging, chinking and NER. In this part of the tutorial we are going to learn the basics of text classification.
Text classification is the process of categorising text into organised groups. It's like tagging a piece of text with a label so it can be sorted into a category. For example, sentiment analysis is a form of text classification wherein a text is classified as either positive or negative. Language detection is another form of text classification, where the text is tagged as English, German, French and so on.
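To make the idea concrete, here is a minimal toy sketch (the words and labels are made up purely for illustration): an NLTK classifier learns from (feature dict, label) pairs and predicts a label for unseen input. This is exactly the shape of data we will build from the movie reviews corpus below.

import nltk

# Hypothetical toy data: each item is a (feature dict, label) pair.
toy_train = [
    ({'wonderful': True, 'boring': False}, 'pos'),
    ({'wonderful': False, 'boring': True}, 'neg'),
]
toy_clf = nltk.NaiveBayesClassifier.train(toy_train)
print(toy_clf.classify({'wonderful': True, 'boring': False}))  # expected: 'pos'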
You will need to have scikit-learn, scipy and NLTK installed in order to run the following program (for example via pip install scikit-learn scipy nltk). Without further ado, let's dive into the code.
WHOAAA! The full program may look like one big complex chunk of code at first glance. Not to worry at all, we will break it down bit by bit. Let's start with the imports:
import random
import pickle
import nltk
from nltk.corpus import movie_reviews
from nltk.classify.scikitlearn import SklearnClassifier
from nltk.classify import ClassifierI
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC
from statistics import mode
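If you have never used the movie reviews corpus before, it has to be fetched once (a one-time setup step, safe to re-run):

nltk.download('movie_reviews')  # one-time download of the corpus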
# Build a list of (document words, category) pairs from the corpus.
listings = []
for category in movie_reviews.categories():
    for fileid in movie_reviews.fileids(category):
        listings.append((list(movie_reviews.words(fileid)), category))

# Shuffle so that positive and negative reviews are mixed together.
random.shuffle(listings)
print(listings[1])
# Build a frequency distribution over every (lowercased) word in the corpus.
word = []
for item in movie_reviews.words():
    word.append(item.lower())

word = nltk.FreqDist(word)
print(word.most_common(5))
print(word['stupid'])

# Use the 2000 most frequent words as our feature vocabulary.
word_feature = [w for (w, count) in word.most_common(2000)]
def find_features(listing):
    # Mark, for each of the 2000 feature words, whether it
    # appears in the given document.
    words = set(listing)
    features = {}
    for w in word_feature:
        features[w] = (w in words)
    return features

print(find_features(movie_reviews.words('neg/cv000_29416.txt')))

# Convert every (document, category) pair into a (features, category) pair.
featuresets = [(find_features(rev), category) for (rev, category) in listings]
# Hold out the first 1900 feature sets for training and the rest for testing.
train = featuresets[:1900]
test = featuresets[1900:]
classifier = nltk.NaiveBayesClassifier.train(train)
print("Naive Bayes Algo accuracy percent:", (nltk.classify.accuracy(classifier, test))*100)
classifier.show_most_informative_features(15)
# Save the trained classifier to disk...
save_classifier = open("naivebayes.pickle", "wb")
pickle.dump(classifier, save_classifier)
save_classifier.close()

# ...and load it back, so we never have to retrain from scratch.
classifier_f = open("naivebayes.pickle", "rb")
classifier = pickle.load(classifier_f)
classifier_f.close()
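As a quick sanity check (a sketch, assuming the code above has been run), the reloaded classifier can label a single review straight from the corpus:

sample = find_features(movie_reviews.words(movie_reviews.fileids('pos')[0]))
print(classifier.classify(sample))  # hopefully 'pos'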
# Comparing some of the algorithms scikit-learn has to offer for
# classification. The names of the methods below are pretty
# self-explanatory.
MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(train)
print("MultinomialNB accuracy percent:", (nltk.classify.accuracy(MNB_classifier, test))*100)

BNB_classifier = SklearnClassifier(BernoulliNB())
BNB_classifier.train(train)
print("BernoulliNB accuracy percent:", (nltk.classify.accuracy(BNB_classifier, test))*100)

LogisticRegression_classifier = SklearnClassifier(LogisticRegression())
LogisticRegression_classifier.train(train)
print("LogisticRegression_classifier accuracy percent:", (nltk.classify.accuracy(LogisticRegression_classifier, test))*100)

SGDClassifier_classifier = SklearnClassifier(SGDClassifier())
SGDClassifier_classifier.train(train)
print("SGDClassifier_classifier accuracy percent:", (nltk.classify.accuracy(SGDClassifier_classifier, test))*100)

SVC_classifier = SklearnClassifier(SVC())
SVC_classifier.train(train)
print("SVC_classifier accuracy percent:", (nltk.classify.accuracy(SVC_classifier, test))*100)

LinearSVC_classifier = SklearnClassifier(LinearSVC())
LinearSVC_classifier.train(train)
print("LinearSVC_classifier accuracy percent:", (nltk.classify.accuracy(LinearSVC_classifier, test))*100)

NuSVC_classifier = SklearnClassifier(NuSVC())
NuSVC_classifier.train(train)
print("NuSVC_classifier accuracy percent:", (nltk.classify.accuracy(NuSVC_classifier, test))*100)
There we have it: the basics of text classification using NLTK. The important thing to understand here, beyond applying a classifier, is the methodology used to reach the output and the logic behind each and every step. Further down the road of learning machine learning, applying various models will be only a small part of the development. There will be data gathering, data cleaning, choosing the model, training the model, testing the model, tuning the hyper-parameters and much more.
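As a small taste of that last step, here is a hedged sketch of hyper-parameter tuning with scikit-learn's GridSearchCV. It converts our NLTK-style feature dicts into a sparse matrix and searches over LinearSVC's regularization strength C (the grid values are illustrative, not recommendations):

from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import GridSearchCV

# Turn the boolean feature dicts into a sparse 0/1 matrix.
vec = DictVectorizer()
X_train = vec.fit_transform([feats for (feats, label) in train])
y_train = [label for (feats, label) in train]

# Cross-validated search over a few illustrative values of C.
grid = GridSearchCV(LinearSVC(), {'C': [0.01, 0.1, 1, 10]}, cv=3)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)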
I have a GitHub repository containing all of the code explained above in a well-commented structure.
Stay tuned. Until next time…!