Sentiment based classification of the web texts : doctoral dissertation

Bučar, Jože

Title:	Sentiment based classification of the web texts : doctoral dissertation
Authors:	Bučar, Jože (Author); Povh, Janez (Mentor); Žnidaršič, Martin (Comentor)
Files:	DR_Bucar_Joze_i2017.pdf (4,05 MB) MD5: FFFC63A6D6F0136B572F4E8994B11BC5
Language:	English
Work type:	Doctoral dissertation
Typology:	2.08 - Doctoral Dissertation
Organization:	FIŠ - Faculty of Information Studies in Novo mesto
Abstract:	Predicting events in the near or distant future has always been a challenging task. People are interested in forecasting weather, earthquakes and floods, in predicting economic, political and social changes, and in anticipating the development of technology, product sales and sports outcomes. An enormous quantity of data is generated on the web every day. We are practically deluged by all kinds of data: scientific, medical, financial, historical, health care, demographic, business and other. Usually, there are not enough human resources to examine this data. Nevertheless, we strive to extract valuable information from this chaotic cluster of data, information that may significantly affect the strategic decisions of both businesses and individuals in the future. Predicting future trends and events has become easier and more efficient, especially through collaboration among scientists from various fields. Sentiment analysis of web texts is an interesting and relevant research topic in this field. The aim of the research described in this dissertation was to create specific language resources for sentiment analysis in Slovene, to evaluate the performance of sentiment-based classification techniques and to monitor the dynamics of sentiment, with the purpose of improving and contributing to the computational analysis of texts in Slovene. Here, we introduce the construction of Slovene web-crawled news corpora and a lexicon for sentiment analysis in Slovene. Besides their availability, we describe the methodology and the tools that were required for their construction. The corpora contain more than 250,000 documents with political, business, economic and financial content from five Slovenian web media resources, published between 1 September 2007 and 31 January 2016. They include sentiment annotation at three levels of granularity: sentence, paragraph and document level. More than 10,000 of the documents were manually annotated as positive, negative or neutral.
A Slovene sentiment lexicon, which is based on the annotated documents, contains more than 25,000 words with sentiment ratings and is the first of its kind for Slovene. We describe in detail the construction of these language resources, the manual annotation process and its characteristics. All developed resources are publicly available under a Creative Commons license. We used the annotated documents to assess the sentiment classification approaches. The experimental performance evaluation of sentiment-based classification techniques gives encouraging results. When classifying documents, the Multinomial Naive Bayes and the Support Vector Machines approaches outperform the other classifiers in terms of both time consumption and performance. Considering smaller text segments, such as sentences, also improves the performance. The models achieve an F1-score of 97.85% in two-class (positive and negative) and 77.76% in three-class (positive, negative and neutral) document-level sentiment-based classification. The sentiment analysis methodology was successfully used in real-world applications for estimating the proportions of positive, negative and neutral news in the selected web media, and for monitoring the dynamics of sentiment. When the proportions of positive, negative and neutral news are estimated, approximately half of the retrieved news is neutral. In general, the proportion of negative news is twice as high as the proportion of positive news. The study of sentiment dynamics shows that sentiment is on average more explicit at the beginning of documents and loses sharpness towards the end.
Keywords:	news corpus, sentiment analysis, lexicon, corpus linguistics, machine learning, document classification, monitoring sentiment dynamics, doctoral dissertation
Place of publishing:	Novo mesto
Place of performance:	Novo mesto
Publisher:	[J. Bučar]
Year of publishing:	2017
Year of performance:	2017
Number of pages:	XXXIII, 151 pp.
PID:	20.500.12556/ReVIS-5016
COBISS.SI-ID:	2048474131
UDC:	004.85:004.774:81'322(043.2)
Note:	On cover: Doctoral Dissertation; text in English, extended abstract in Slovenian.
Publication date in ReVIS:	22.08.2018
Views:	5988
Downloads:	220
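
The abstract names Multinomial Naive Bayes as one of the two best-performing approaches for document-level sentiment classification. As a rough illustration of how such a classifier assigns one of the three labels, here is a minimal textbook sketch with add-one smoothing over a bag of words. It is not the dissertation's implementation: the class name MultinomialNB, the whitespace tokenizer and the six English training sentences are all assumptions invented for this example, and none of the data comes from the Slovene corpora.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


class MultinomialNB:
    """Bag-of-words Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)       # documents per class
        self.word_counts = defaultdict(Counter)   # token counts per class
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        self.total_docs = len(labels)
        return self

    def log_scores(self, text):
        # Tokens never seen in training are ignored.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        vocab_size = len(self.vocab)
        scores = {}
        for label, n_docs in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(n_docs / self.total_docs)  # log prior
            for t in tokens:
                count = self.word_counts[label][t]
                score += math.log((count + 1) / (total_words + vocab_size))
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Toy training data, invented for illustration only (not from the corpus).
train_texts = [
    "profits rise as exports grow",
    "strong growth lifts profits",
    "company reports record losses",
    "unemployment and debt rise sharply",
    "parliament meets on tuesday",
    "the committee publishes its schedule",
]
train_labels = ["positive", "positive", "negative", "negative", "neutral", "neutral"]

model = MultinomialNB().fit(train_texts, train_labels)
prediction = model.predict("record losses and debt")
```

With this toy data, the words in "record losses and debt" occur only in the negative training sentences, so the negative class receives the highest log score. A realistic setup would add Slovene-aware tokenization and evaluate on held-out annotated documents.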

Secondary language

Language:	Slovenian
Title:	Klasifikacija spletnih besedil na osnovi izraženosti sentimenta
Abstract:	Predicting events in the near or distant future has always been considered difficult. People are interested in forecasts of the weather, of approaching natural disasters, and of economic, political and social changes, as well as in trends in technology development, product sales and the prediction of sports results. An enormous quantity of data is published on the World Wide Web every day. We are practically buried under various kinds of data from science, health care, finance, business, demography, history and other fields, while we usually lack the human resources to process it. Nevertheless, we strive to extract valuable information from this chaotic mass of data so that the strategic decisions of both individuals and companies can be improved in the future. Predicting future trends and events has become easier and more efficient, especially through collaboration among scientists from different fields. Sentiment analysis of web texts is an interesting and relevant research area. The aim of the research in this dissertation is to build dedicated language resources for sentiment analysis, to evaluate the performance of classification methods and to monitor the dynamics of sentiment, in order to contribute to a better computational understanding of texts in Slovene. The research describes the procedures for building sentiment-annotated news corpora and a lexicon for sentiment analysis in Slovene. Besides the availability of the developed language resources, the methodology and the tools required to build them are also described. The corpora contain more than 250 thousand web texts with political, economic and financial content, published between 1 September 2007 and 31 January 2016 by five web media outlets in Slovenia. The documents were annotated at three levels: the document level, the paragraph level and the sentence level. More than ten thousand documents were manually annotated as positive, negative or neutral.
The lexicon was built from the annotated corpus of texts. It contains more than 25 thousand words with an assigned sentiment. It is the first lexicon for sentiment analysis in Slovene that is based on manual annotation of Slovene texts. The procedures for building the language resources and for manual annotation, as well as their characteristics, are described in detail. All resources are publicly available under a Creative Commons license. Next, a study evaluating the performance of classification methods is presented, and it gives encouraging results. In document classification, the Multinomial Naive Bayes classifier and the Support Vector Machine method prove to be the most effective methods in terms of time complexity and various accuracy measures. Segmenting texts into smaller units, such as sentences, also contributes to better classification results. When classifying documents into two classes (positive and negative) we achieve an F1-score of 97.85%, and when classifying documents into three classes (positive, negative and neutral) 77.76%. The principles of sentiment analysis were also applied successfully to estimating the share of positive, negative and neutral news in selected web media and to monitoring the dynamics of sentiment. The estimation of positive, negative and neutral news showed that approximately half of all retrieved news items are neutral. In general, the share of negative news is twice as large as the share of positive news. The study of sentiment dynamics showed that, on average, sentiment is expressed more strongly at the beginning of documents and weakens towards their end.
Keywords:	news corpus, sentiment analysis, lexicon, corpus linguistics, machine learning, document classification, monitoring sentiment dynamics, doctoral dissertation
