Sentiment based classification of the web texts
Jože Bučar
Abstract: Predicting events in the near or distant future has always been a challenging task. People are interested in forecasting weather, earthquakes and floods, in predicting economic, political and social changes, and in anticipating the development of technology, product sales and sports outcomes. An enormous quantity of data is generated on the web every day. We are practically deluged by all kinds of data: scientific, medical, financial, historical, health care, demographic, business, and others. Usually, there are not enough human resources to examine this data. Nevertheless, we strive to extract valuable information from this chaotic mass of data, information that may significantly affect the future strategic decisions of both businesses and individuals. Predicting future trends and events has become easier and more efficient, especially through collaboration among scientists from various fields.
Sentiment analysis of web texts is an interesting and relevant research topic in this field. The aim of the research described in this dissertation was to create specific language resources for sentiment analysis in Slovene, evaluate the performance of sentiment-based classification techniques, and monitor the dynamics of sentiment, with the broader purpose of improving and contributing to the computational analysis of texts in Slovene.
Here, we introduce the construction of Slovene web-crawled news corpora and a lexicon for sentiment analysis in Slovene. Besides their availability, we describe the methodology and the tools that were required for their construction. The corpora contain more than 250,000 documents with political, business, economic and financial content from five Slovenian web media sources, published between 1 September 2007 and 31 January 2016. They include sentiment annotation at three levels of granularity: sentence, paragraph and document level. More than 10,000 of the documents were manually annotated as positive, negative or neutral. A Slovene sentiment lexicon, which is based on the annotated documents, contains more than 25,000 words with sentiment ratings and is the first of its kind for Slovene. We describe in detail the construction of these language resources, the manual annotation process and its characteristics. All developed resources are publicly available under a Creative Commons license. We used the annotated documents to assess the sentiment classification approaches. Experimental performance evaluation of sentiment-based classification techniques gives encouraging results. When classifying documents, the Multinomial Naive Bayes and Support Vector Machines approaches outperform the other classifiers in terms of both time consumption and performance. Considering smaller text segments, such as sentences, also improves performance. The models achieve an F1-score of 97.85% in two-class (positive and negative) and 77.76% in three-class (positive, negative and neutral) document-level sentiment-based classification. The sentiment analysis methodology was successfully used in real-world applications for estimating the proportions of positive, negative and neutral news in the selected web media, and for monitoring the dynamics of sentiment. Approximately half of the retrieved news is neutral, and in general the proportion of negative news is twice as high as the proportion of positive news. The study of sentiment dynamics shows that sentiment is on average more explicit at the beginning of documents and loses sharpness towards the end.
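As an illustration of the kind of classifier named in the abstract, the following is a minimal sketch of Multinomial Naive Bayes with Laplace smoothing in plain Python. It is not the dissertation's implementation: the function names `train_multinomial_nb` and `predict`, the whitespace tokenizer, the smoothing value `alpha=1.0`, and the four toy English headlines are all assumptions made for this example, and the actual corpora, features and preprocessing are not reproduced here.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and Laplace-smoothed token log-likelihoods per class."""
    class_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
        token_counts[label].update(tokens)
        vocab.update(tokens)

    model = {"vocab": vocab, "log_prior": {}, "log_lik": {}}
    for label, count in class_counts.items():
        # Denominator: all tokens seen in this class plus alpha per vocabulary entry.
        total = sum(token_counts[label].values()) + alpha * len(vocab)
        model["log_prior"][label] = math.log(count / len(docs))
        model["log_lik"][label] = {
            tok: math.log((token_counts[label][tok] + alpha) / total)
            for tok in vocab
        }
    return model


def predict(model, text):
    """Return the class with the highest posterior log score for the text."""
    # Tokens outside the training vocabulary are ignored.
    tokens = [t for t in text.lower().split() if t in model["vocab"]]
    scores = {
        label: model["log_prior"][label]
        + sum(model["log_lik"][label][t] for t in tokens)
        for label in model["log_prior"]
    }
    return max(scores, key=scores.get)


# Hypothetical toy data, for illustration only.
docs = [
    "profits rise strongly",
    "growth exceeds forecast",
    "profits fall sharply",
    "losses deepen amid crisis",
]
labels = ["positive", "positive", "negative", "negative"]
model = train_multinomial_nb(docs, labels)

print(predict(model, "profits rise"))
print(predict(model, "losses deepen"))
```

The same two functions extend to the three-class setting by adding documents labelled neutral; applying `predict` to individual sentences rather than whole documents is one way to experiment with the smaller text segments mentioned in the abstract.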
Found in: keywords
Keywords: news corpus, sentiment analysis, lexicon, corpus linguistics, machine learning, document classification, monitoring sentiment dynamics
Published: 22.08.2018; Views: 2488; Downloads: 166
Fulltext (4.05 MB)