Repository of colleges and higher education institutions

Title:Multilevel complex systems approaches to computational linguistics
Authors:Ban, Kristina (Author)
Levnajić, Zoran (Mentor)
Boshkoska, Biljana Mileva (Co-mentor)
Language:English
Work type:Doctoral dissertation
Typology:2.08 - Doctoral Dissertation
Organization:FIŠ - Faculty of Information Studies in Novo mesto
Abstract:Complex systems are omnipresent in nature, in society and in human culture. The last few decades have seen growing interest in their study, particularly through graph-theoretic methods. By identifying a system's units as nodes and modelling the interactions between units as links, the study of complex networks has spread to a number of disciplines, including sociology, biology and linguistics, to mention just a few. The research in this doctoral dissertation falls within this context. The core of this doctoral work is a data-driven multilevel analysis of major human languages, carried out in two stages. First, we looked at the speed of growth of Wikipedias in 26 different languages over a span of 15 years. This involved creating and analysing a dataset of 14962 articles, each of which exists in all 26 languages. We found six well-defined clusters of Wikipedias that share common growth patterns, and the composition of these clusters is robust to the method used to determine them. Interestingly, the identified clusters show little correlation with the respective language families. Rather, our results suggest that the growth of Wikipedias is governed primarily by an intricate set of other factors, from culture to information literacy. Second, to approach human languages at another, independent level, we gathered datasets comprising a list of syllables and a list of syllabified words in 10 different languages: English, Dutch, German, Russian, Slovenian, Croatian, French, Spanish, Latin and Basque. These datasets were obtained from recognized repositories for each language and benchmarked in the same way. Syllable networks were created by linking pairs of syllables that jointly compose at least one word. We then carried out a systematic network analysis, relying on both standard network analysis methods and more recent techniques, such as k-core analysis and graphlet statistics.
The research revealed striking similarities between the architectures of syllable networks that belong to the same language family, along with the expected differences between families. Indeed, the structures of syllable networks were found to quantify well the linguistic similarities among these 10 languages, in agreement with what is known from classical linguistics. Most interestingly, we found that the Basque language, whose classification is still unknown today, bears a strong resemblance to Latin, at least as far as the syllable network representation is concerned. Earlier stages of this doctoral work involved comparing the performance of network alignment algorithms, which are used in bioinformatics for studying protein networks. Several alignment algorithms were compared by scoring their performance on standard protein datasets. Three algorithms, HUBALIGN, L-GRAAL and NATALIE, were found to regularly produce the most topologically and biologically coherent alignments. Due to a change of doctoral adviser, this research topic was abandoned in favour of language/syllable networks. In sum, this doctoral work involved two distinct directions of research in network science: one related to developing the methodology of network analysis (alignment algorithms), and the other devoted to extracting new information from specifically designed datasets (syllable networks). The original contribution of this work to science therefore includes both theory and methodology. Future research avenues include advances along both directions, the most interesting being the application of network alignment methods to syllable datasets, which could yield a more precise quantification of the structural differences among syllable networks.
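The syllable network construction and k-core analysis described in the abstract can be illustrated with a minimal Python sketch. It is not the dissertation's code: it assumes that "jointly compose at least one word" means co-occurrence of two syllables in the same word, and the function names (`build_syllable_network`, `core_numbers`) and the toy word list are hypothetical, introduced here only for illustration.

```python
from collections import defaultdict
from itertools import combinations


def build_syllable_network(syllabified_words):
    """Return an undirected adjacency map: syllable -> set of linked syllables.

    Assumption: two syllables are linked if they occur together in at
    least one word (the abstract does not specify adjacency vs. co-occurrence).
    """
    adjacency = defaultdict(set)
    for syllables in syllabified_words:
        unique = set(syllables)
        for s in unique:
            adjacency[s]  # register the node even if it stays isolated
        for a, b in combinations(sorted(unique), 2):
            adjacency[a].add(b)
            adjacency[b].add(a)
    return dict(adjacency)


def core_numbers(adjacency):
    """Compute each node's k-core number by repeatedly peeling a minimum-degree node."""
    degree = {node: len(nbrs) for node, nbrs in adjacency.items()}
    remaining = set(adjacency)
    core = {}
    k = 0
    while remaining:
        node = min(remaining, key=lambda n: degree[n])
        k = max(k, degree[node])  # core level never decreases while peeling
        core[node] = k
        remaining.remove(node)
        for nbr in adjacency[node]:
            if nbr in remaining:
                degree[nbr] -= 1
    return core


# Hypothetical toy input: each word is given as a list of its syllables.
words = [
    ["ba", "na", "na"],
    ["na", "no"],
    ["ba", "no"],
    ["ca", "no", "e"],
    ["e", "vo"],
    ["to"],
]
network = build_syllable_network(words)
cores = core_numbers(network)
```

In this toy network the syllables forming the two triangles end up in the 2-core, `vo` in the 1-core, and the single-syllable word `to` remains an isolated node with core number 0. A real analysis would run on the full syllabified word lists for each language, most likely with a dedicated graph library.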
Keywords:computational statistics, biostatistics, bioinformatics, machine learning, computational linguistics
Year of publishing:2018
Publisher:[K. Ban]
Source:Novo mesto
COBISS_ID:297952768
UDC:004.8:519.765:81'322(043.3)
Views:3038
Downloads:136
Files:.pdf DR_2018_Kristina_Ban.pdf (17,51 MB)
 
License:Attribution-NonCommercial-NoDerivatives
  
Secondary language

Language:Slovenian
Title:Večnivojski kompleksni pristopi k računalniški lingvistiki
Abstract:Complex systems are omnipresent in nature, in society and in human culture. Over the last few decades, interest in their study has grown, particularly through the use of graph-theoretic methods. By representing a system's units as nodes and modelling the interactions between units as links, the study of complex networks has spread to numerous disciplines, including sociology, biology and linguistics, to mention just a few. The research in this doctoral dissertation falls within this context. The core of this doctoral work is a data-driven multilevel analysis of the world's major languages, carried out in two stages. First, we looked at the speed of growth of Wikipedias in 26 different languages over a period of 15 years. This involved creating and analysing a dataset of 14962 articles, each of which exists in all 26 languages. We found six clearly defined clusters of Wikipedias that share common growth patterns, and their composition is surprisingly robust to the method used to determine them. Interestingly, the identified clusters correlate very little with the language families of the 26 languages. On the contrary, our results suggest that the growth of Wikipedias is determined primarily by an intricate set of other factors, from culture to information literacy. In the second stage, we approached the world's languages at a new level by gathering datasets with a list of syllables and a list of syllabified words in ten different languages: English, Dutch, German, Russian, Slovenian, Croatian, French, Spanish, Latin and Basque. These datasets were obtained from recognized data repositories for each language and harmonized accordingly. Syllable networks were created by representing pairs of syllables that jointly compose at least one word as linked pairs of nodes. We then carried out a systematic network analysis that relied on standard network analysis methods as well as more recent techniques, such as k-core analysis and graphlet statistics.
The research revealed striking similarities between the architectures of syllable networks that belong to the same language family, along with the expected differences between different language families. Most interestingly, the syllable structure of the Basque language, whose classification is still unknown today, strongly resembles that of Latin.
Keywords:computational statistics, biostatistics, bioinformatics, machine learning, computational linguistics

