Repository of colleges and higher education institutions

1.
Sentiment based classification of the web texts
Jože Bučar, 2017

Abstract: Predicting events in the near or distant future has always been a challenging task. People are interested in forecasting weather, earthquakes and floods, and in predicting economic, political and social changes, as well as the development of technology, product sales and sports outcomes. An enormous quantity of data is generated on the web every day. We are practically deluged by all kinds of data: scientific, medical, financial, historical, health care, demographic, business and other. Usually, there are not enough human resources to examine this data. Nevertheless, we strive to extract valuable information from this chaotic mass of data, information that may significantly affect the future strategic decisions of both businesses and individuals. Predicting future trends and events has become easier and more efficient, especially through collaboration among scientists from various fields. Sentiment analysis of web texts is an interesting and relevant research topic in this field. The aim of the research described in this dissertation was to create specific language resources for sentiment analysis in Slovene, evaluate the performance of sentiment-based classification techniques and monitor the dynamics of sentiment, with the broader purpose of improving and contributing to the computational analysis of texts in Slovene. We introduce the construction of Slovene web-crawled news corpora and a lexicon for sentiment analysis in Slovene. Besides their availability, we describe the methodology and the tools that were required for their construction. The corpora contain more than 250,000 documents with political, business, economic and financial content from five Slovenian web media sources, published between 1 September 2007 and 31 January 2016. They include sentiment annotation at three levels of granularity: sentence, paragraph and document. More than 10,000 of the documents were manually annotated as positive, negative or neutral.
A Slovene sentiment lexicon, which is based on the annotated documents, contains more than 25,000 words with sentiment ratings and is the first of its kind for Slovene. We describe in detail the construction of these language resources, the manual annotation process and its characteristics. All developed resources are publicly available under a Creative Commons license. We used the annotated documents to assess the sentiment classification approaches. Experimental performance evaluation of sentiment-based classification techniques gives encouraging results. When classifying documents, the Multinomial Naive Bayes and Support Vector Machine approaches outperform the other classifiers in terms of both time consumption and performance. Considering smaller text segments, such as sentences, also improves performance. The models achieve an F1-score of 97.85% in two-class (positive and negative) and 77.76% in three-class (positive, negative and neutral) document-level sentiment-based classification. The sentiment analysis methodology was successfully used in real-world applications for estimating the proportions of positive, negative and neutral news in the selected web media and for monitoring the dynamics of sentiment. Approximately half of the retrieved news is neutral. In general, the proportion of negative news is twice as high as the proportion of positive news. The study of sentiment dynamics shows that sentiment is on average more explicit at the beginning of documents and loses sharpness towards the end.
Found in: keywords
Keywords: news corpus, sentiment analysis, lexicon, corpus linguistics, machine learning, document classification, monitoring sentiment dynamics
Published: 22.08.2018; Views: 3341; Downloads: 204
.pdf Fulltext (4,05 MB)
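
The abstract above reports F1-scores for two-class and three-class sentiment classification. As a rough illustration of how such a score can be computed, the sketch below derives per-class F1 from precision and recall and averages the classes without weighting (macro averaging). The abstract does not state which averaging the dissertation used, so macro averaging is an assumption here; the function name `f1_scores` and the toy labels are hypothetical and not taken from the dissertation.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return per-class F1 and their unweighted (macro) average."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            tp[truth] += 1      # correct prediction for this class
        else:
            fp[guess] += 1      # predicted class was wrong
            fn[truth] += 1      # true class was missed
    per_class = {}
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        per_class[label] = 2 * precision * recall / denom if denom else 0.0
    macro = sum(per_class.values()) / len(per_class)
    return per_class, macro


# Toy labels, invented for illustration only (not data from the corpus).
y_true = ["pos", "neg", "neu", "neu", "neg", "pos"]
y_pred = ["pos", "neg", "neu", "neg", "neg", "neu"]
per_class, macro = f1_scores(y_true, y_pred)
```

With three classes, a single weak class (often the neutral one) pulls the macro average down, which is one plausible reason three-class scores tend to be lower than two-class scores.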

2.
Multilevel complex systems approaches to computational linguistics
Kristina Ban, 2018

Abstract: Complex systems are omnipresent in nature, society and human culture. The last few decades have seen growing interest in their study, particularly using graph-theoretic methodologies. By identifying a system's units as nodes and modelling interactions between the units as links, the study of complex networks has spread to a number of disciplines, including sociology, biology and linguistics, to mention just a few. The research in this doctoral dissertation falls within this context. The core of this doctoral work is the data-driven multilevel analysis of major human languages, which was done in two stages. First, we looked at the speed of growth of Wikipedias in 26 different languages over a span of 15 years. This involved creating and analysing a dataset of 14,962 articles, each of which exists in all 26 languages. We found six well-defined clusters of Wikipedias that share common growth patterns, with their make-up robust against the method used to determine them. Interestingly, the identified clusters were found to have little correlation with the respective language families. Rather, our results suggest that the growth of Wikipedias is primarily governed by an intricate set of other factors, from culture to information literacy. Second, to approach human languages at another, independent level, we gathered a dataset comprising a list of syllables and a list of syllabified words in 10 different languages: English, Dutch, German, Russian, Slovenian, Croatian, French, Spanish, Latin and Basque. These datasets were obtained from recognized repositories for each language and benchmarked in the same way. Syllable networks were created by looking at pairs of syllables that jointly compose at least one word. We then carried out a systematic network analysis, relying on both standard network analysis methods and more recent techniques, such as k-core analysis and graphlet statistics.
The research revealed striking similarities between the architectures of syllable networks that belong to the same language family, along with expected differences between the families. Indeed, the structures of syllable networks were found to quantify well the linguistic similarities among these 10 languages, exactly as known from classical linguistics. Most interestingly, we found that the Basque language, whose classification is still unknown today, bears a strong resemblance to Latin, at least as far as the syllable network representation is concerned. Earlier stages of this doctoral work involved comparing the performance of network alignment algorithms used in bioinformatics for studying protein networks. Several alignment algorithms were compared by scoring their performance on standard protein datasets. It was found that three algorithms, HUBALIGN, L-GRAAL and NATALIE, regularly produce the most topologically and biologically coherent alignments. Due to a change of doctoral adviser, this research topic was abandoned in favour of language/syllable networks. In sum, this doctoral work involved two distinct directions of research in network science, one related to developing the methodology of network analysis (alignment algorithms) and the other devoted to extracting new information from specifically designed datasets (syllable networks). The original contribution of this work to science therefore includes both theory and methodology. Future research avenues include advancement along both directions, the most interesting being the application of network alignment methods to syllable datasets, which could yield a more precise quantification of structural differences among syllable networks.
Found in: keywords
Keywords: computational statistics, biostatistics, bioinformatics, machine learning, computational linguistics
Published: 21.12.2018; Views: 2968; Downloads: 135
.pdf Fulltext (17,51 MB)
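
The abstract above describes syllable networks in which two syllables are linked when they jointly compose at least one word, followed by k-core analysis. The sketch below is one possible reading of that construction, not the dissertation's implementation: it assumes every pair of distinct syllables occurring in the same word is linked (the abstract does not say whether only adjacent syllables count). The function names and the toy syllables are hypothetical and do not represent real words of any of the 10 languages.

```python
from itertools import combinations


def build_syllable_network(syllabified_words):
    """Map each syllable to the set of syllables sharing a word with it."""
    graph = {}
    for syllables in syllabified_words:
        unique = set(syllables)
        for syllable in unique:
            graph.setdefault(syllable, set())
        # Assumption: link every pair of distinct syllables in the word.
        for a, b in combinations(sorted(unique), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def k_core(graph, k):
    """Return the nodes of the k-core by repeatedly dropping nodes of degree < k."""
    core = {node: set(neighbours) for node, neighbours in graph.items()}
    changed = True
    while changed:
        changed = False
        for node in [n for n, neighbours in core.items() if len(neighbours) < k]:
            # Remove the node and detach it from its remaining neighbours.
            for other in core.pop(node):
                core[other].discard(node)
            changed = True
    return set(core)


# Toy syllabified "words", invented for illustration only.
words = [("ka", "ta"), ("ta", "na"), ("ka", "na"), ("na", "mi")]
network = build_syllable_network(words)
two_core = k_core(network, 2)
```

In this toy network the syllable "mi" has a single link and falls out of the 2-core, leaving the densely connected triangle of "ka", "ta" and "na".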
