Today I read a paper titled “Bayesian Grammar Induction for Language Modeling”.
The abstract is:
We describe a corpus-based induction algorithm for probabilistic context-free grammars.
The algorithm employs a greedy heuristic search within a Bayesian framework, and a post-pass using the Inside-Outside algorithm.
We compare the performance of our algorithm to n-gram models and the Inside-Outside algorithm in three language modeling tasks.
In two of the tasks, the training data is generated by a probabilistic context-free grammar, and in both tasks our algorithm outperforms the other techniques.
The third task involves naturally-occurring data, and in this task our algorithm does not perform as well as n-gram models but vastly outperforms the Inside-Outside algorithm.
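
To get a feel for the core idea, here is a small toy sketch in Python of a greedy search over PCFG rules scored by a Bayesian objective. This is my own illustration, not the paper's actual procedure: the candidate moves, the prior (a crude penalty on grammar size), the rule probabilities, and all function names (inside_log_prob, log_posterior, add_rule, greedy_step) are simplified stand-ins, and the Inside-Outside (EM) post-pass that the paper runs after the search is omitted. The inside (CKY) pass computes the corpus likelihood, and the greedy step accepts a candidate rule only if it improves the approximate log posterior.

    import math

    def inside_log_prob(grammar, lexicon, sentence, start="S"):
        # Inside (CKY) pass for a PCFG in Chomsky normal form.
        # grammar: {(A, B, C): prob} for binary rules A -> B C
        # lexicon: {(A, word): prob} for lexical rules A -> word
        # Returns log P(sentence | grammar), or -inf if the sentence has no parse.
        n = len(sentence)
        chart = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
        for i, w in enumerate(sentence):
            for (A, word), p in lexicon.items():
                if word == w:
                    chart[i][i + 1][A] = chart[i][i + 1].get(A, 0.0) + p
        for span in range(2, n + 1):
            for i in range(n - span + 1):
                j = i + span
                for k in range(i + 1, j):
                    for (A, B, C), p in grammar.items():
                        pb = chart[i][k].get(B, 0.0)
                        pc = chart[k][j].get(C, 0.0)
                        if pb > 0.0 and pc > 0.0:
                            chart[i][j][A] = chart[i][j].get(A, 0.0) + p * pb * pc
        p = chart[0][n].get(start, 0.0)
        return math.log(p) if p > 0.0 else float("-inf")

    def log_posterior(grammar, lexicon, corpus):
        # Approximate log posterior: a crude grammar-size penalty standing in
        # for the paper's prior, plus the corpus log likelihood.
        prior = -2.0 * (len(grammar) + len(lexicon))
        return prior + sum(inside_log_prob(grammar, lexicon, s) for s in corpus)

    def add_rule(grammar, lexicon, lhs, rhs, prob=0.3):
        # Return a copy of (grammar, lexicon) with binary rule lhs -> rhs added
        # at probability `prob`, rescaling the existing rules for `lhs` so that
        # its rule probabilities still sum to one.
        g = {k: p * (1.0 - prob) if k[0] == lhs else p for k, p in grammar.items()}
        l = {k: p * (1.0 - prob) if k[0] == lhs else p for k, p in lexicon.items()}
        g[(lhs,) + tuple(rhs)] = prob
        return g, l

    def greedy_step(grammar, lexicon, corpus, candidates):
        # One greedy move: try each candidate rule and keep the best one, but
        # only if it improves the approximate log posterior.
        best_score, best = log_posterior(grammar, lexicon, corpus), (grammar, lexicon)
        for lhs, rhs in candidates:
            g, l = add_rule(grammar, lexicon, lhs, rhs)
            score = log_posterior(g, l, corpus)
            if score > best_score:
                best_score, best = score, (g, l)
        return best, best_score

    # Toy data: the third sentence is unparsable until an adverb rule is added.
    grammar = {("S", "NP", "VP"): 1.0, ("NP", "Det", "N"): 1.0}
    lexicon = {("Det", "the"): 0.5, ("Det", "a"): 0.5,
               ("N", "dog"): 0.5, ("N", "cat"): 0.5,
               ("VP", "barks"): 0.5, ("VP", "sleeps"): 0.5,
               ("Adv", "loudly"): 1.0}
    corpus = [s.split() for s in
              ["the dog barks", "a cat sleeps", "the dog barks loudly"]]
    candidates = [("NP", ("NP", "Adv")), ("VP", ("VP", "Adv"))]
    (grammar, lexicon), score = greedy_step(grammar, lexicon, corpus, candidates)
    print("rules after one greedy step:", sorted(grammar))  # VP -> VP Adv gets added
    print("approximate log posterior: %.2f" % score)

Running this, the greedy step rejects the NP candidate (the corpus still has an unparsable sentence, so the likelihood stays at negative infinity) and accepts VP -> VP Adv, which is the kind of trade-off between grammar size and data fit that the Bayesian framework is meant to arbitrate.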