Title: | Multilevel complex systems approaches to computational linguistics : doctoral dissertation |
---|
Authors: | Ban, Kristina (Author); Levnajić, Zoran (Mentor); Boshkoska, Biljana Mileva (Comentor) |
Files: | DR_2018_Kristina_Ban.pdf (17,51 MB) MD5: FB4C4E62DA4D2522F3C9807F2CB9D764
|
---|
Language: | English |
---|
Work type: | Doctoral dissertation |
---|
Typology: | 2.08 - Doctoral Dissertation |
---|
Organization: | FIŠ - Faculty of Information Studies in Novo mesto
|
---|
Abstract: | Complex systems are omnipresent in nature, in society and in human culture. The last few decades have seen growing interest in their study, particularly through graph-theoretic methodologies. By identifying a system's units as nodes and modelling the interactions between units as links, the study of complex networks has spread to a number of disciplines, including sociology, biology and linguistics, to mention just a few. The research in this doctoral dissertation falls within this context. The core of this doctoral work is a data-driven multilevel analysis of major human languages, carried out in two stages. First, we looked at the speed of growth of Wikipedias in 26 different languages over a span of 15 years. This involved creating and analysing a dataset of 14962 articles, each of which exists in all 26 languages. We found six well-defined clusters of Wikipedias that share common growth patterns, with their make-up robust to the method used to determine them. Interestingly, the identified clusters were found to have little correlation with the respective language families. Rather, our results suggest that the growth of Wikipedias is primarily governed by an intricate set of other factors, from culture to information literacy. Second, to approach human languages at another, independent level, we gathered a dataset comprising a list of syllables and a list of syllabified words in 10 different languages, specifically: English, Dutch, German, Russian, Slovenian, Croatian, French, Spanish, Latin and Basque. These datasets were obtained from recognized repositories for each language and benchmarked in the same way. Syllable networks were created by looking at pairs of syllables that jointly compose at least one word. We then carried out a systematic network analysis, relying on both standard network analysis methods and more recent techniques, such as k-core analysis and graphlet statistics. 
The analysis revealed striking similarities between the architectures of syllable networks that belong to the same language family, along with the expected differences between families. Indeed, the structures of syllable networks were found to quantify well the linguistic similarities among these 10 languages, just as known from classical linguistics. Most interestingly, we found that the Basque language, whose classification remains unknown to this day, bears a strong resemblance to Latin, at least as far as the syllable network representation is concerned. Earlier stages of this doctoral work involved comparing the performance of network alignment algorithms used in bioinformatics for studying protein networks. Several alignment algorithms were compared by scoring their performance on standard protein datasets. Three algorithms, HUBALIGN, L-GRAAL and NATALIE, were found to regularly produce the most topologically and biologically coherent alignments. Due to a change of doctoral adviser, this research topic was abandoned in favour of language/syllable networks. In sum, this doctoral work involved two distinct directions of research in network science: one related to developing the methodology of network analysis (alignment algorithms), and the other devoted to extracting new information from specifically designed datasets (syllable networks). Therefore, the original contribution of this work to science includes both theory and methodology. Future research avenues include advancement along both directions, the most interesting being the application of network alignment methods to syllable datasets, which could yield a more precise quantification of the structural differences among syllable networks. |
---|
Keywords: | computational statistics, biostatistics, bioinformatics, machine learning, computational linguistics |
---|
Place of publishing: | Novo mesto |
---|
Place of performance: | Novo mesto |
---|
Publisher: | [K. Ban] |
---|
Year of publishing: | 2018 |
---|
Year of performance: | 2018 |
---|
Number of pages: | XX, 135 pp. |
---|
PID: | 20.500.12556/ReVIS-5369 |
---|
COBISS.SI-ID: | 297952768 |
---|
UDC: | 004.8:519.765:81'322(043.3) |
---|
Publication date in ReVIS: | 21.12.2018 |
---|
Views: | 3878 |
---|
Downloads: | 139 |
---|