A Workshop on Machine Learning in Natural Language Processing

 

Organizers: Shalom Lappin and Ido Dagan

 

Workshop Abstracts

 

Universal scaling of semantic information revealed from IB word clusters

Presentation (pps)


Naftali Tishby
Hebrew University

 


We quantitatively examine the hypothesis that human languages are adaptive structures, in which new words are generated through a trade-off between the complexity of the lexicon and its predictive power. By applying the Information Bottleneck method to the word-topic statistics in 15 different languages, we obtain an essentially universal scaling relation between the word entropy of the lexicon and its accuracy in identifying topics, measured by the mutual information between the lexicon and the topics. This intriguing linguistic scaling law differs from previously studied universal statistics of words (e.g., Zipf's law) in that it truly depends on the usage and semantics of the language. It therefore reflects a new underlying universal cognitive mechanism for word generation and usage. While this type of analysis may seem like a relic from the past, modern machine learning methods and the availability of larger corpora enable us to discover new statistical linguistic regularities, which may provide a deeper understanding of human language acquisition and evolution.
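To make the two quantities in the scaling relation concrete, the following minimal sketch computes the entropy of a clustered lexicon and its mutual information with the topics from a table of co-occurrence counts. It is an illustration only, not the procedure used in the talk: the Information Bottleneck method itself searches for the clustering that optimizes the trade-off, whereas this sketch only evaluates a clustering that is already given. The function names, the `merge_clusters` helper, and the toy counts are all assumptions invented for the example.

```python
from math import log2


def entropy_and_information(joint_counts):
    """Return (H(C), I(C;T)) in bits.

    joint_counts[c][t] is the co-occurrence count of word cluster c
    with topic t (a hypothetical input format for this sketch).
    """
    total = sum(sum(row) for row in joint_counts)
    n_topics = len(joint_counts[0])
    # Marginal distributions over clusters and over topics.
    p_c = [sum(row) / total for row in joint_counts]
    p_t = [sum(row[t] for row in joint_counts) / total for t in range(n_topics)]
    # Entropy of the lexicon: its "complexity".
    h_c = -sum(p * log2(p) for p in p_c if p > 0)
    # Mutual information with the topics: its "predictive power".
    mi = 0.0
    for c, row in enumerate(joint_counts):
        for t, n in enumerate(row):
            if n > 0:
                p = n / total
                mi += p * log2(p / (p_c[c] * p_t[t]))
    return h_c, mi


def merge_clusters(joint_counts, a, b):
    """Merge clusters a and b into one; the merged row is appended last."""
    merged = [x + y for x, y in zip(joint_counts[a], joint_counts[b])]
    rest = [row for i, row in enumerate(joint_counts) if i not in (a, b)]
    return rest + [merged]


# Toy counts, made up for illustration: 4 word clusters x 2 topics.
counts = [[40, 10], [35, 15], [10, 40], [5, 45]]
h_fine, i_fine = entropy_and_information(counts)

# Coarsen the lexicon: merge clusters 0 and 1, then the remaining two.
coarse = merge_clusters(merge_clusters(counts, 0, 1), 0, 1)
h_coarse, i_coarse = entropy_and_information(coarse)

print(f"fine:   H = {h_fine:.3f} bits, I = {i_fine:.3f} bits")
print(f"coarse: H = {h_coarse:.3f} bits, I = {i_coarse:.3f} bits")
```

In this toy case, coarsening the lexicon lowers its entropy from 2 bits to 1 bit, and the mutual information with the topics cannot increase. Plotting such entropy and information pairs over a range of lexicon sizes gives the kind of curve whose shape the abstract compares across languages.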
 
Based on work with Dimitry Davidov, Amir Navot and Josemine Magdalen.

 

 
