A Review on Feature Selection Techniques for Sentiment Analysis

Sentiment Analysis is the technique of identifying and categorizing emotions in order to examine how people feel about services such as movies, products, events, and politics. It is a widely researched topic in text mining. This paper presents a review and evaluation of several feature selection techniques: TF-IDF, document frequency, word frequency, sparsity reduction, and chi-square statistics. To test these techniques, the study used Twitter data on the stock market and the Naïve Bayes classifier, chosen for its computational simplicity and effectiveness. The implementation of the study was done in R.


Introduction
Sentiment Analysis is the technique of identifying and categorizing emotions in order to examine how people feel about services such as movies, products, events, and politics. Enterprises benefit from research in sentiment analysis since it lets them accurately understand users' opinions about their products and improve them accordingly. Natural Language Processing (NLP) is used in Sentiment Analysis to interpret human language in both written and spoken form. NLP is divided into four subtasks that allow a computer to interpret language by evaluating sentence structure and grammar: Summarization, Part-of-Speech (PoS) Tagging, Text Categorization, and Sentiment Analysis.
Sentiment analysis can be conducted with machine learning or lexicon-based algorithms. A lexicon-based technique computes the sentiment from the semantic orientation of the words in the text. To put it another way, the words in the text are split up and given scores; the final score, which indicates the sentence's sentiment, is the sum of these scores. In machine learning, by contrast, classification is performed on two sets of documents: a training dataset and a test dataset. Many classifier algorithms can be trained on labelled emotional samples. Without further human input, the machine learns to recognize emotions and categorizes them into negative and positive feelings. One such classifier is the Naive Bayes algorithm.
The Naive Bayes classifier is built on Bayes' theorem. It is a supervised learning method frequently used for classification problems such as spam filtering, sentiment analysis, and article classification. Because it is a probabilistic classifier, it makes predictions based on the probability of an object, and it is well suited to text classification problems with high-dimensional datasets. It is called naive because it assumes that the occurrence of one feature is unrelated to the occurrence of any other. The basis of Naive Bayes classification is Bayes' rule:

P(A|B) = P(B|A) · P(A) / P(B)

where P(A|B) is the posterior probability, P(A) and P(B) are the prior probabilities, and P(B|A) is the conditional probability. Given a set of attribute values X = (a1, ..., an) (an instance) and a class c, the probability of each attribute ai relative to the class must be estimated. Applying the product rule, that is, assuming conditional independence among the attribute values P(ai|c), gives:

P(c|X) ∝ P(c) · ∏i P(ai|c)

and the prediction task reduces to:

c* = argmax over c of P(c) · ∏i P(ai|c)

The Confusion Matrix is a table that displays the predictions of a classifier. The matrix is N x N, with N being the number of classes. The rows represent the predicted class, whereas the columns represent the actual class.
The Confusion Matrix compares the actual target values to the predictions of the machine learning model, summarizing the classification model's performance as well as its weaknesses.
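The study's implementation was in R; as an illustrative sketch only (the toy data, function names, and smoothing choice below are our own), the training and prediction steps described above might look like this in Python, using Laplace smoothing for unseen words:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Estimate class priors P(c) and per-class word counts for P(a_i|c)."""
    priors = Counter(labels)                 # class -> number of documents
    word_counts = defaultdict(Counter)       # class -> word -> count
    vocab = set()
    for doc, label in zip(docs, labels):
        for word in doc.split():
            word_counts[label][word] += 1
            vocab.add(word)
    return priors, word_counts, vocab

def predict_nb(doc, priors, word_counts, vocab):
    """Pick the class maximizing log P(c) + sum_i log P(a_i|c)."""
    total_docs = sum(priors.values())
    best, best_score = None, float("-inf")
    for c in priors:
        score = math.log(priors[c] / total_docs)
        denom = sum(word_counts[c].values()) + len(vocab)  # Laplace smoothing
        for word in doc.split():
            score += math.log((word_counts[c][word] + 1) / denom)
        if score > best_score:
            best, best_score = c, score
    return best

docs = ["stocks rally gains up", "market crash losses down", "gains up strong rally"]
labels = ["pos", "neg", "pos"]
model = train_nb(docs, labels)
print(predict_nb("rally gains", *model))  # pos
```

Working in log-space avoids numerical underflow when the product of many small probabilities is taken.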

                       Actual Positive      Actual Negative
  Predicted Positive   True Positive (TP)   False Positive (FP)
  Predicted Negative   False Negative (FN)  True Negative (TN)
Accuracy is a unitless statistic that ranges from 0 to 1; it can also be stated as a percentage from 0% to 100%. For the binary case it is computed from the confusion matrix as:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
For a k-class problem, the confusion matrix grows to k x k. Multi-class accuracy is calculated by dividing the sum of the diagonal entries of the confusion matrix by the sum of all entries in the matrix. Equivalently, weight each of the k classes' accuracies by the number of instances in that class, then divide by the total number of instances.
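The diagonal-over-total calculation above can be sketched directly (the 3-class matrix below is an invented example, not from the study's data):

```python
def multiclass_accuracy(cm):
    """Overall accuracy: sum of the diagonal over the sum of all entries."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total

# 3-class confusion matrix: rows = predicted class, columns = actual class
cm = [
    [50,  4,  1],
    [ 3, 40,  5],
    [ 2,  6, 39],
]
print(multiclass_accuracy(cm))  # (50 + 40 + 39) / 150 = 0.86
```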
Another common practice is to scale the predicted class scores so that they sum to 1. This normalization lets the predictions be interpreted as probabilities or percentages.
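As a minimal sketch of this normalization (assuming the raw scores are non-negative; log-scores would need exponentiating first):

```python
def normalize(scores):
    """Scale non-negative class scores so they sum to 1."""
    total = sum(scores)
    return [s / total for s in scores]

print(normalize([2.0, 1.0, 1.0]))  # [0.5, 0.25, 0.25]
```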
One of the drawbacks of Naive Bayes is that in real-world datasets the assumption of completely independent conditional probabilities is frequently inaccurate, resulting in poor performance. As a result, feature interaction should be taken into account when utilizing Naive Bayes. The performance of the Naive Bayes classifier can be improved in a variety of ways; one of them is feature selection.

TF-IDF (Term Frequency - Inverse Document Frequency)
Term Frequency is the number of times a term appears in a document divided by the total number of words in that document:

TF(t, d) = (number of occurrences of t in d) / (total number of words in d)

Because documents vary in length, a word may appear more often in a longer document than in a shorter one; dividing by the total number of words corrects for this.
The document is described using term frequencies. To put it another way, the more a term is used, the more it defines the document. However, words like "the" and "and", which carry no substantial information, appear frequently, contradicting this assumption, and the classifier's performance suffers because of their presence. To resolve the problem, stopwords should be removed from the dataset. Alternatively, each term can be weighted by the product of its term frequency and inverse document frequency, which down-weights words that are common across documents.
The Inverse Document Frequency (IDF) is a metric based on how many documents a term appears in, rather than how often it appears within a single document:

IDF(t) = log(N / df(t))

where N is the number of documents and df(t) is the number of documents containing t. A term that is commonly used across documents is given a lesser weight. The prominence of key words can be considerably improved by removing terms with lower weights.
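The study's code was in R; purely as an illustrative sketch (toy corpus and function name are our own), the TF-IDF weighting described above might be computed as:

```python
import math
from collections import Counter

def tf_idf(docs):
    """TF = term count / doc length; IDF = log(N / number of docs containing the term)."""
    n = len(docs)
    tokenized = [doc.split() for doc in docs]
    df = Counter()                       # document frequency of each term
    for tokens in tokenized:
        df.update(set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        length = len(tokens)
        weights.append({t: (tf[t] / length) * math.log(n / df[t]) for t in tf})
    return weights

docs = ["the market is up", "the market is down", "sensex hits record high"]
w = tf_idf(docs)
# "the" appears in 2 of 3 documents, so it gets a lower weight than "up"
print(w[0]["the"] < w[0]["up"])  # True
```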

Document Frequency
Document Frequency is a basic term-reduction technique commonly used in text classification. It is straightforward to implement and, because of its linear complexity, is used for feature selection on large-scale corpora.
Document Frequency refers to the number of documents in the dataset that contain the term. Only terms that appear in sufficiently many documents are retained for subsequent processing. For example, the word 'sensex' is considered a feature in our dataset if it appears in at least 5 documents. Terms with a Document Frequency below a certain threshold are removed, which reduces space and work for the classifier while also increasing accuracy.
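This thresholding step can be sketched as follows (an illustrative Python translation with an invented toy corpus; the study itself used R and a threshold of 5):

```python
from collections import Counter

def filter_by_df(docs, min_df=2):
    """Keep only terms that appear in at least `min_df` documents."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.split()))      # count each term once per document
    return {term for term, count in df.items() if count >= min_df}

docs = ["sensex up today", "sensex down today", "nifty flat"]
print(sorted(filter_by_df(docs, min_df=2)))  # ['sensex', 'today']
```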

Reduction in Sparsity
Sparse data are features with mostly zero values, for example one-hot-encoded word vectors or counts of categorical data. A common challenge is that sparse features tend to increase the space and time complexity of models. If there are too many features, the model fits the noise in the training data; this is referred to as overfitting. Overfitted models are unable to generalize to new data, which has a negative impact on the model's predictive power.
The problem of sparsity is addressed using a variety of approaches. A frequent strategy is to remove sparse features, which can introduce noise and raise the model's memory requirements.
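In R this is typically done with the tm package's removeSparseTerms on a document-term matrix; as a rough Python sketch of the same idea (the matrix and threshold below are invented for illustration):

```python
def remove_sparse_terms(dtm, terms, max_sparsity=0.5):
    """Drop columns (terms) that are zero in more than `max_sparsity` of the documents."""
    n_docs = len(dtm)
    keep = []
    for j, term in enumerate(terms):
        zeros = sum(1 for row in dtm if row[j] == 0)
        if zeros / n_docs <= max_sparsity:
            keep.append(j)
    kept_terms = [terms[j] for j in keep]
    reduced = [[row[j] for j in keep] for row in dtm]
    return reduced, kept_terms

# document-term matrix: 4 documents x 3 terms
dtm = [[1, 0, 2], [0, 0, 1], [3, 1, 0], [1, 0, 1]]
terms = ["market", "rare", "stock"]
reduced, kept = remove_sparse_terms(dtm, terms, max_sparsity=0.5)
print(kept)  # ['market', 'stock'] -- 'rare' is zero in 3 of 4 documents
```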

Chi-Square Statistics
Feature selection involves two variables: one refers to the occurrence of a feature t, the other to the occurrence of a category C. In text classification we are primarily interested in whether t and C are independent. If they are independent, the feature cannot be used to decide whether or not a text belongs to category C. However, determining whether t and C are independent is difficult in practice, so chi-square statistics are used to quantify the association. A feature t and a category C are represented in a two-way contingency table. The higher the chi-square score for category C, the greater the relevance between feature t and category C; when the score is 0, feature t and category C are independent.
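A common form of the chi-square score for feature selection, computed from the 2x2 contingency table of t and C, can be sketched as follows (an illustrative Python version with invented counts; the study's implementation was in R):

```python
def chi_square(a, b, c, d):
    """
    Chi-square score for term t and category C from a 2x2 contingency table:
      a: docs in C containing t        b: docs outside C containing t
      c: docs in C not containing t    d: docs outside C not containing t
    chi2 = N * (a*d - c*b)^2 / ((a + c) * (b + d) * (a + b) * (c + d))
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - c * b) ** 2 / denom

# a term concentrated in category C scores higher than one spread evenly
print(chi_square(8, 1, 2, 9) > chi_square(5, 5, 5, 5))  # True
```

A term distributed evenly across categories has a*d = c*b, giving a score of 0, which matches the independence case described above.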
In our study, features with a weight of less than 0.1584780 were removed. This preserved 2289 features and gave an accuracy of 68.65%, which is lower than the standard classifier. The classifier was then evaluated again using features with a weight of less than 0.04918127, resulting in an accuracy of 70.12% with 2289 features. This demonstrates that some features are critical for a classifier to achieve more accurate results.
The classifier was trained with Document Frequency using word lengths ranging from 5 to 15 and document counts ranging from 5 to 90, yielding an accuracy of 65.45% with 945 features. The graph comparing the accuracies of all the methods used in the experiment shows that the highest accuracy achieved was 74.00%, obtained when features with a frequency of less than 3 were dropped.

Conclusion
Sparsity Reduction, TF-IDF, Document Frequency, Word Frequency, and Chi-Square Statistics were all investigated in this research. Extensive testing revealed that Word Frequency was the most accurate method, achieving 74.00% accuracy when terms with a frequency of less than 3 were dropped.