Sarcasm Detection From Twitter Database Using Text Mining Algorithms

Article History: Received: 10 January 2021; Revised: 12 February 2021; Accepted: 27 March 2021; Published online: 10 May 2021

Abstract
Sarcasm is defined as a cutting, often ironic remark intended to express ridicule or contempt. Sarcasm detection is the task of correctly labeling a text as 'Sarcasm' or 'non-Sarcasm'. It is a challenging task owing to the absence of facial expressions and intonation in text. Social media and micro-blogging websites are extensively explored to extract opinions about a target, because a huge amount of text data is published openly on social media such as Twitter. Such large, openly available text data can be used for a variety of research. Here we used a text data set for classifying sarcasm, and experiments were carried out on textual data extracted from a Twitter data set. The data set was downloaded from Kaggle and includes 1984 tweets collected from Twitter. These data are already labeled. In this paper, we use these data to train classifiers with different algorithms in order to assess the ability of machine learning models to recognize sarcasm and non-sarcasm. The process starts with text pre-processing and feature extraction (TF-IDF), and then applies different classification algorithms: Decision Tree classifier, Multinomial Naïve Bayes classifier, Support Vector Machines, and Logistic Regression classifier. After tuning the models, the best results obtained with TF-IDF were an accuracy of 0.94 for Multinomial NB, 0.93 for the Decision Tree Classifier, 0.97 for Logistic Regression, and 0.42 for Support Vector Machines (SVM). All models performed well except the SVM model, which had the lowest accuracy. The evaluation shows that the results are accurate enough for identifying people's sarcastic expressions.


Introduction
Microblogging platforms are easy tools for people to express their opinions and thoughts, while sarcasm is a sophisticated and informative human means of expressing sentimentally refined viewpoints in an implicit way. Irony appears widely in online textual expressions. Sarcasm is widely considered a linguistic phenomenon that conveys significant sentiment and is found predominantly in online writing, on social media platforms such as Twitter, Facebook, Instagram, and others. Sarcasm is commonly understood as the use of words or expressions to convey the opposite of what they literally mean in context. It is used mainly with two contrasting qualities, comic and mean. Sentimental expression is widely used on Twitter, a social platform that has attracted growing attention from researchers. Twitter is increasingly popular as a place where everyone, from ordinary people to leaders, expresses themselves. Twitter is rich in sarcastic expressions, in which users convey negative feelings using positive words in order to criticize others (Hayat et al., 2019). Sarcasm is rich in features related to sentiment, punctuation, syntax, and other expressive values. These features are the most useful elements of the training data and are used by different classification algorithms (Siddiqui & Alam, 2009). Researchers widely perform automatic sarcasm detection using feature selection (Siddiqui et al., 2019), and have developed various applications that detect sarcasm in data collected from social media. Sarcasm can be detected in online postings through capital letters, emoticons, and exclamation marks, and its detection is mainly used for sentiment analysis (Liu et al., 2014).
In automatic sarcasm detection through extracted features, data processing plays a predominant role. Sarcasm can be identified by the presence of hashtags within a sentence. Once the words preceded by hashtags are identified, they are taken as input data for analysis (Ahmad et al., 2018). The input data can be extracted from different social media platforms (Siddiqui et al., 2019); here, the source is a Twitter data set. Sample data are drawn from this data set, and the analysis is usually performed by separating sarcastic sentences from non-sarcastic sentences (Rajeswari & Shanthibala, 2018). When classifying data from the data sets, word-level polarity is of primary importance. It can be determined with a polarity judgment dictionary. Sentiment summarization can be done by distinguishing the negative and positive meanings of the words preceded by hashtags in the sample data set. A polarity detector can mistakenly assign a positive polarity to a text whose intended meaning is negative, so sarcastic texts affect the classification accuracy of the polarity detector (Yahya et al., 2020). The polarity detector also identifies the emotions placed before the proposed phrases, based on the number of emotion-bearing words in the terms. The main goal of the classification is to identify words that carry a negative expression while expressing a positive meaning in the sentences of the sample data (A. Joshi et al., 2017). The rest of the paper is organized as follows: Section two presents the literature survey. Section three presents supervised classification machine learning techniques.
Section four describes the design and modeling of our proposed methodology: the text data set, pre-processing, feature extraction, the training and test data sets, the equations used to measure performance, and the proposed model and algorithm for extracting sentiment from Twitter. Section five discusses the results of the experiments and the confusion matrix. Finally, Section six presents the conclusion and future research.

Literature Survey
The research work by Peng Liu et al. focused on imbalanced classification for detecting sarcasm in social media information. In this work, a novel multi-strategy ensemble learning method is implemented to handle the imbalance problem in finding sarcastic words in the targeted data. English and Chinese data sets were evaluated, and the results were reported (Liu et al., 2014). Bala Durga Dharmavarapu et al. developed an approach to detect sarcasm in Twitter data through sentiment analysis. In that paper, sarcasm detection is performed using sentiment analysis, Naïve Bayes classification, and AdaBoost algorithms. The tweets are categorized as sarcastic or non-sarcastic, and the approach successfully identified sarcasm in the targeted Twitter data (Dharmavarapu & Bayana, 2019). Shota Suzuki et al. proposed a sarcasm detection method to improve review analysis. This work follows a sequential approach. It starts by applying dependency parsing to the data. The method then classifies the expressions in each sentence into the proposed word classes based on part-of-speech structure. Finally, sentiment phrases are analyzed to determine whether the sentence is sarcastic. The method gave successful results in finding sarcasm in the given data (Suzuki et al., 2017). A prominent IEEE publication on sarcasm detection was presented by Le Hoang Son et al. In this work, sarcasm is detected with a deep learning model built as a hybrid of a soft attention-based bidirectional long short-term memory model and a convolutional network.
The model incorporates a convolutional neural network and applies Global Vectors for Word Representation (GloVe) to build semantic word embeddings. The work illustrates the concept with the help of feature maps for punctuation-based auxiliary features (

Supervised classification Machine Learning Techniques
Supervised learning techniques can be used when sufficient labeled training text corpora are available. Sarcasm detection can be formulated as follows: given a set of training text documents TD = {TD1, TD2, TD3, …, TDn}, each text document is assigned one of two class labels, 'Sarcasm' or 'non-Sarcasm'. A classification model x relates the feature set of a document to its class label. Given a new text document 'td', the model x is then used to predict the class label of that document. Some of the main methods are explained as follows.

Multinomial Naïve Bayes (MNB)
Multinomial Naïve Bayes is a machine learning algorithm for supervised classification. A multinomial distribution is useful for modeling feature vectors in which each value represents the number of occurrences of a term or its relative frequency. If a feature vector has n elements and each of them can assume one of k values with probability p_k, then:
$$P(X_1 = x_1 \cap X_2 = x_2 \cap \dots \cap X_k = x_k) = \frac{n!}{\prod_{i=1}^{k} x_i!} \prod_{i=1}^{k} p_i^{x_i} \tag{1}$$
where x_i is the number of times the i-th value occurs and x_1 + x_2 + … + x_k = n. The conditional probabilities P(x_i | y) are estimated from frequency counts, which corresponds to a maximum likelihood approach (
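As an illustration of the frequency-count estimation described above, the following is a minimal sketch of a Multinomial Naïve Bayes classifier written with the Python standard library only. It is not the implementation used in this study. The function names `train_mnb` and `predict_mnb`, the toy documents, the coding of label 1 as sarcasm, and the additive smoothing parameter `alpha` are all assumptions introduced for the example.

```python
import math
from collections import Counter, defaultdict


def train_mnb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from tokenized documents.

    alpha is an additive smoothing constant (an assumption of this sketch).
    """
    vocab = sorted({tok for doc in docs for tok in doc})
    class_sizes = Counter(labels)
    term_counts = defaultdict(Counter)
    for doc, y in zip(docs, labels):
        term_counts[y].update(doc)  # frequency count of each term per class

    log_prior = {y: math.log(n / len(docs)) for y, n in class_sizes.items()}
    log_lik = {}
    for y in class_sizes:
        total = sum(term_counts[y].values()) + alpha * len(vocab)
        log_lik[y] = {
            t: math.log((term_counts[y][t] + alpha) / total) for t in vocab
        }
    return log_prior, log_lik


def predict_mnb(model, doc):
    """Return the class with the highest posterior log score for a token list."""
    log_prior, log_lik = model
    scores = {}
    for y in log_prior:
        # Tokens never seen during training are ignored.
        scores[y] = log_prior[y] + sum(
            log_lik[y][t] for t in doc if t in log_lik[y]
        )
    return max(scores, key=scores.get)


# Hypothetical toy data: label 1 = sarcasm, label 0 = non-sarcasm.
docs = [
    ["great", "another", "monday"],
    ["love", "waiting", "hours"],
    ["nice", "weather", "today"],
    ["enjoyed", "the", "concert"],
]
labels = [1, 1, 0, 0]
model = train_mnb(docs, labels)
print(predict_mnb(model, ["another", "monday"]))
```

With `alpha = 0` the estimates reduce to plain relative frequencies, but any term absent from a class would then produce a logarithm of zero, which is why the sketch uses a positive value.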

Proposed Methodology
The proposed research methodology builds on previous research papers. The main goal of our research is to classify the data, using secondary data obtained from Kaggle, in order to detect sarcasm in posts from the Twitter social media platform (Bhanap & Kawthekar, n.d.).

Dataset
The data set was downloaded from Kaggle and contains two columns, tweet and label. The tweet column contains a tweet retrieved through the API, and the label column contains a manually assigned binary value, 0 or 1. The CSV file contains 1984 rows of tweets. It consists of 1024 sarcasm tweets and 962 non-sarcasm tweets, which were used to train the model.
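A two-column file of this kind can be read with the Python standard library, as in the minimal sketch below. The column names `tweet` and `label`, the helper `load_tweets`, and the three sample rows are assumptions made for illustration; the actual Kaggle file may use different headers.

```python
import csv
import io
from collections import Counter


def load_tweets(handle):
    """Read (tweet, label) pairs from an open CSV file with a header row."""
    reader = csv.DictReader(handle)
    return [(row["tweet"], int(row["label"])) for row in reader]


# Hypothetical in-memory sample standing in for the downloaded CSV file.
sample_csv = io.StringIO(
    "tweet,label\n"
    '"Oh great, another Monday",1\n'
    "The weather is nice today,0\n"
    "I just love waiting in line,1\n"
)
rows = load_tweets(sample_csv)

# Count how many tweets carry each label to check the class balance.
label_counts = Counter(label for _, label in rows)
print(len(rows), dict(label_counts))
```

For a file on disk, the same function can be called with `open(path, newline="", encoding="utf-8")`, where `path` is the location of the downloaded file.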

Text Preprocessing
Pre-processing and feature extraction are critical steps for text classification applications. In this section, we introduce procedures for cleaning the text data set, removing noise so that the useful information is retained.
Pre-processing comprises text cleaning: identifying mentions, URLs, and symbols such as periods, hashtags, commas, and other kinds of symbols; handling white space; converting words to lower case; correcting words; tokenization; stop-word removal; and text vectorization. Noise is removed in the preprocessing stage. Sarcasm detection relies on three essential kinds of features: lexical, hyperbole, and pragmatic. Lexical features can be unigrams, bigrams, and n-grams. Hyperbole features include interjections, punctuation marks, quotes, and intensifiers (Hayat et al., 2019).
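The cleaning steps listed above can be sketched with regular expressions as follows. This is an illustrative sketch and not the code used in the experiments. The function name `clean_tweet`, the very small stop-word list, and the choice to keep the word of a hashtag while dropping the `#` symbol are assumptions.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "to", "for", "of", "in", "and"}


def clean_tweet(text):
    """Lower-case a tweet, strip noise, and return the remaining tokens."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URLs
    text = re.sub(r"@\w+", " ", text)                   # remove mentions
    text = text.replace("#", " ")                       # drop '#', keep the word
    text = re.sub(r"[^a-z\s]", " ", text)               # remove non-letters
    tokens = text.split()                               # collapses white space
    return [tok for tok in tokens if tok not in STOP_WORDS]


print(clean_tweet("@bob I LOVE waiting 4 hours for the bus!!! #blessed https://t.co/xyz"))
# → ['i', 'love', 'waiting', 'hours', 'bus', 'blessed']
```

Removing punctuation in this way discards the exclamation marks and capital letters that can signal sarcasm, so a pipeline that uses hyperbole features would need to extract them before this step.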

Feature Extraction and Sarcasm Detection
After the preprocessing stage is finished, the data must be prepared for the next step. There are five phases: (1) word tokenization, (2) part-of-speech (POS) tagging, (3) lemmatization, (4) feature extraction, and (5) new representation. The first phase is word tokenization. Tokenization is applied to tweets to split them into words, the smallest meaningful units of a sentence, as follows: {"After", "four", "sleeping", "hours", "for"}.
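To make the feature extraction phase concrete, the sketch below computes TF-IDF weights for tokenized tweets using only the standard library. It assumes the plain definition, term frequency multiplied by log(N / document frequency); library implementations often apply smoothing and normalization, so their values will differ. The function name `tfidf` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dictionary per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append(
            {
                term: (count / len(doc)) * math.log(n_docs / df[term])
                for term, count in counts.items()
            }
        )
    return vectors


# Hypothetical tokenized tweets.
docs = [
    ["after", "four", "sleeping", "hours"],
    ["four", "hours", "of", "traffic"],
    ["love", "sleeping"],
]
vectors = tfidf(docs)
for vec in vectors:
    print({term: round(weight, 3) for term, weight in vec.items()})
```

A term that occurs in a single tweet, such as "after", receives a higher weight than one shared by several tweets, such as "four", which is the property that makes TF-IDF useful as a classifier input.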

Training, Test Data Set
We use 70% of the data set to train the classifier. The data set contains 1984 tweets, of which 1024 are sarcasm tweets and the remainder are non-sarcasm tweets. The test set comprises the remaining 30% of the tweets. The predictions on the test set are compared with the manually created labels in order to evaluate the performance of our proposed classification (Kharde, 2016; Salloum et al., 2018).
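A 70/30 split of this kind can be sketched as below. The helper `split_train_test`, the fixed random seed, and the placeholder tweets are assumptions for illustration; library routines may round the split sizes differently.

```python
import random


def split_train_test(items, train_fraction=0.7, seed=0):
    """Shuffle a copy of the items and cut it into train and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# Placeholder data with the same number of rows as the data set (1984).
data = [(f"tweet {i}", i % 2) for i in range(1984)]
train, test = split_train_test(data)
print(len(train), len(test))
# → 1389 595
```

Shuffling before cutting avoids a split in which one class is concentrated in the test set because of the order of rows in the file.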

Proposed model and Algorithm
Sarcasm in plain text data is detected by following the step-by-step approach shown in Figure 4. The initial step is to import the libraries related to machine learning and natural language processing (NLP) tasks. The next step is to load the CSV file as the data set and preprocess it with steps such as filtering stop words, slang, and abbreviations, removing noise, HTML tags, and non-letters, and stemming each word. The next step is to build the bag-of-words representation and extract features with the TF-IDF vectorizer, and then split the data into training and test sets. After that, a classification model such as Multinomial NB is chosen and fitted, and the confusion matrix and classification report are produced. The last step is to predict and report the accuracy.
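The final evaluation step can be illustrated with a small sketch that builds a binary confusion matrix and derives accuracy and precision from it. The function names, the example label vectors, and the coding of the positive class as 1 are assumptions, not results from this study.

```python
def confusion_matrix(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary task."""
    cm = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            cm["tp" if truth == positive else "fp"] += 1
        else:
            cm["fn" if truth == positive else "tn"] += 1
    return cm


def accuracy(cm):
    """Share of all predictions that are correct."""
    return (cm["tp"] + cm["tn"]) / sum(cm.values())


def precision(cm):
    """Share of predicted positives that are truly positive."""
    predicted_positive = cm["tp"] + cm["fp"]
    return cm["tp"] / predicted_positive if predicted_positive else 0.0


# Hypothetical labels: 1 = sarcasm, 0 = non-sarcasm.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
cm = confusion_matrix(y_true, y_pred)
print(cm, accuracy(cm), precision(cm))
# → {'tp': 3, 'fp': 1, 'tn': 3, 'fn': 1} 0.75 0.75
```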

Experimentation and Discussion of Results
The experiments were run on a Windows system using the Jupyter Notebook included with Anaconda. The code can be run in a Python virtual environment, and Jupyter Notebook can be installed with pip or conda before executing the code that builds the machine learning model. The code is divided into four stages. The first stage imports the libraries; the second loads and preprocesses the data set; the third builds the bag-of-words representation and trains the classifier; and the last stage predicts and evaluates. In detail, the first step imports all libraries related to machine learning and natural language processing (NLP) tasks. The next step loads the CSV file as the data set and preprocesses it with steps such as filtering stop words, slang, and abbreviations, removing noise, HTML tags, and non-letters, and stemming each word. Then the bag-of-words representation is built, features are extracted with the TF-IDF vectorizer, and the data are split into training and test sets. After that, a classification model is chosen, such as Multinomial NB, Logistic Regression, Decision Tree Classifier, or SVM. The model is then fitted, and the confusion matrix and classification report are produced; finally, the accuracy is predicted. The results are presented in Table 1.
These results show an imbalance between accuracy and precision for Logistic Regression and the Decision Tree Classifier, while SVM shows different results. Although SVM appears to be a poor model for the given data, all procedures were the same across the models. Logistic Regression shows better classification accuracy than SVM. This is attributed to feature extraction, which yields better results; the same technique is not effective with SVM and leads to low accuracy. The Decision Tree Classifier and Multinomial NB results are achieved by carrying weights through training and modifying them in a single pass. The SVM fails in this classification.
When the data set is expanded, CountVectorizer could be used to obtain better results.
[Figure: Accuracy and precision (series: accuracy, precision)]

Conclusion and Future Scope of the Study
Sarcasm is a complex form of irony that is widely observed on social media such as the Twitter platform. Detecting such tweets is an essential matter in text classification and has several implications. This paper demonstrated sarcasm detection by classification, including enhanced preprocessing, feature extraction, and text mining techniques, which is valuable in machine learning for investigating social media opinion about a specific organization. The data from Twitter were analyzed to extract sarcastic words from the data set and to label sentences as sarcasm or non-sarcasm. We applied several classification algorithms in our model: the Decision Tree Classifier, Multinomial Naïve Bayes, Logistic Regression, and Support Vector Machines (SVM). We achieved good accuracy when the predictions were compared with the labeled data: 0.97 with Logistic Regression, 0.93 with the Decision Tree Classifier, and 0.94 with Multinomial Naïve Bayes. The model fared worse with Support Vector Machines (SVM), which reached only 0.42. This paper has examined several approaches that can extract sarcasm from the targeted social media. Among these approaches, machine learning algorithms, including Multinomial Naïve Bayes, were implemented to obtain the results. The proposed algorithm proved useful for identifying sarcastic words in social media documents taken from a Twitter data set. This study focused on detecting sarcasm in plain text with the highest possible accuracy. In the future, we recommend further research on a comprehensive hybrid classification method to detect sarcasm with deep learning on massive data. We also suggest applying another classification model with CountVectorizer.