Totally, I collect approximate 50 papers including those of basic model description such as HMM, MEMM, and CRFs and those of application oriented papers such as IE and NER, as well those of model selection related papers.
The papers are mainly gathered from ICML, ACL, NIPS, KDD, and some AI/CL related conference or journals.
- Generative model.
- Joint probability p(x, s)
![]()
- Graph structure

Figure 1. Dependency graph structure for first-order HMM for sequences
- Independent assumption
- Labeling sequential data (by Viterbi algorithm)
![]()
Model:
Leonard E. Baum and Ted Petrie. Statistical inference for probabilistic functions of finite state Markov chains. Annual of Mathematical statistics, 37:1554-1563, 1966
Christopher Manning and Hinrich Schutze. Markov Models. Chapter 9. 1999.
Applications:
- POS Tagging
J. Kupiec. Robust part-of-speech tagging using a hidden Markov model. Computer Speech and Language, 6:225.242, 1992.
- Shallow parsing
Antonio Molina and Ferran Pla. Shallow parsing using specialized hmms. Journal of Machine Learning Research, 2:595.613, March 2002.
Ferran Pla, Antonio Molina, and Natividad Prieto. Improving text chunking by means of lexical-contextual information in statistical language models. In Proceedings of CoNLL-2000, Lisbon, Portugal, 2000.
GuoDong Zhou, Jian Su, and TongGuan Tey. Hybrid text chunking. In Proceedings of CoNLL-2000, Lisbon, Portugal, 2000.
- Speech recognition
L.R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257.285, 1989.
L. Rabiner and B.-H. Juang. Fundamentals of Speech Recognition. Prentice Hall Signal Processing Series. Prentice-Hall, Inc., 1993.
- Gene sequence analysis
R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological sequence analysis: Probabilistic models of proteins and nucleic acids. Cambridge University Press, 1998.
- Information extraction
Many papers…
Limitation:
- Joint probability distribution p(x, s). It is useful if the trained model is to be used to generate data. However, defining a joint distribution over label and observation sequences means that all possible observation sequences must be enumerated. (A task is intractable when considering long distance dependencies.)
- Cannot represent overlapping features. These features are not necessary independent.
- Conditional model.
- Joint probability p(x, y)
![]()
- Graph structure

Figure 2. Dependency graph structure for first-order MEMM for sequences
- Probability of each transition may depend on non-independent, interacting features of the observation sequence.
- State-observation transition function
![]()
- Parameter estimation by using GIS and IIS.
- Labeling sequential data (by dynamic programming algorithm)
![]()
Model:
A. Berger, S. Della Pietra, and V. Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39-71, 1996.
A. Ratnaparkhi. A simple introduction to maximum entropy models for natural language processing, 1997b.
A. Ratnaparkhi. Maximum Entropy Models for Natural Language Ambiguity Resolution. PhD thesis, University of Pennsylvania, 1998.
Applications:
- Information extraction
Andrew McCallum, Dayne Freitag, and Fernando Pereira. Maximum entropy Markov models for information extraction and segmentation. In Proc. ICML 2000, 2000.
- Text segmentation
McCallum, A., Freitag, D., & Pereira, F. (2000). Maximum entropy Markov models for information extraction and segmentation. Proc. 17th International Conf. on Machine Learning (pp. 591–598). Morgan Kaufmann, San Francisco, CA.
Limitation:
The Label Bias Problem

Figure 3. Finite-state acceptor for shallow parsing two sentences
The finite-state acceptor is designed to shallow parse the sentences (chunk/phrase parsing)
1) the robot wheels Fred round
2) the robot wheels are round
Decoding it by:
![]()
0123456
0127896
Assuming the probabilities of each of the transitions out of state 2 are approximately equal, the label bias problem means that the probability of each of these chunk sequences given an observation sequence x will also be roughly equal irrespective of the observation sequence x.
On the other hand, had one of the transitions out of state 2 occurred more frequently in the training data, the probability of that transition would always be greater. This situation would result in the sequence of chunk tags associated with that path being preferred irrespective of the observation sentence.
- Generative models such as HMM do not suffer from the label bias problem.
- Conditional model.
- Conditional independence.
![]()
The two variables A and B are conditional independent given C.
In the undirected graph, conditional independence means that every path from a node in A to a node in B must pass through at least one node in C.
- Graph structure

Figure 1. Graph structure of the first-order chain-structure case of CRFs for sequences
- Local function
- Potential function.
- CRFs definition. Let X and Y be jointly distributed random variables respectively ranging over observation sequences to be labeled and their corresponding label sequences, a conditional random field (X, Y) is an undirected graphical model globally conditioned on X.
- Given an undirected graph G=(V, E) such that Y={Yv|v∈V}, if
![]()
is equal to the probability
, the
probability of Yv given X and those random variables
corresponding to nodes neighboring v in G. Then (X, Y)
is a conditional random field.
- Labeling sequential data (by dynamic programming algorithm)
![]()
Model:
John Lafferty, Andrew McCallum, and Fernando Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. ICML 2001, 2001.
University of Massachusetts
Hanna M. Wallach. Conditional Random Fields: An Introduction. Technique Report
University of Pennsylvania
- Efficient training
Hanna Wallach. Efficient Training of Conditional Random Field. Master Thesis.
University of Edinburgh
Limited memory variable-metric methods for training
University of Edinburgh
Thomas G. Dietterich. Training Conditional Random Fields via Gradient Tree Boosting. ICML 2004.
Oregon State University
The paper applies the Friedman’s gradient tree boosting method for training CRFs. In tree boosting, the CRF potential functions are represented as weighted sums of regression trees. Regression trees are learned by stage-wise optimizations similar to Adaboost, but with the objective of maximizing the conditional likelihood P(Y|X) of the CRF model.
By growing regression trees, interactions among features are introduced only as needed. As a result, gradient tree boosting scales linearly in the order of the Markov model and in the order of the feature interactions.
Yuan Qi, Martin Szummer, Thomas P. Minka. Bayesian Conditional Random Fields. AISTATS 2005.
MIT, MICROSOFT, MICROSOFT.
The BCRF is proposed for classifying interdependent and structured data, such as sequences, images, or webs. ML has the problem of overfitting. The paper proposes a method to train and inference on CRFs. In training, BCRFs approximate the posterior distribution of the parameters using a variant of the power EP model. BCRFs flatten approximation structures to increase the algorithmic stability, efficiency, and prediction accuracy. In testing, BCRFs use approximate model averaging.
Applications:
- Name entity
Andrew McCallum and Wei Li. Early Results for Named Entity Recognition with Conditional Random Fields, Feature Induction and Web-Enhanced Lexicons. Seventh Conference on Natural Language Learning (CoNLL), 2003.
The paper has investigated named entity extraction with CRFs.
- Shallow parsing
Fei Sha and Fernando Pereira. Shallow Parsing with Conditional Random Fields. Proceedings of HLT-NAACL.
University of Pennsylvania
Using CRFs for shallow parsing.
Using voted perceptron for training.
- Table extraction
Pinto, D., McCallum, A., Wei, X., & Croft, W. B. (2003). Table extraction using conditional random fields. Proceedings of the ACM SIGIR.
University of Massachusetts
Objectives:
1) Locate the table
2) Identify the row positions and types
3) Identify the column positions and types
4) Segment the table into cells
5) Tag the cells as data or headers
6) Associate data cells with their corresponding headers
- Information extraction
Trausti Kristjansson, Aron Culotta, Paul Viola, and Andrew McCallum. Interactive Information Extraction with Constrained Conditional Random Fields.
Microsoft
University of Massachusetts
1) Interactive information extraction system: the user can select the correctly extracted data, correct the mistakenly extracted data, and fill the missing data.
2) Using linear-chain random fields with two extension: a) a constrained Viterbi decoding which finds the optimal field assignments consistent with the fields explicitly specified or corrected by the user; b) a mechanism for estimating the confidence of each extracted field, so that low-confidence extractions can be highlight.
Fuchun Peng and Andrew McCallum. Accurate Information Extraction from Research Papers using Conditional Random Fields. Proceedings of Human Language Technology Conference and North American Chapter of the Association for Computational Linguistics (HLT-NAACL), 2004.
University of Massachusetts
Applies CRFs to extraction from research paper headers and reference sections, to obtain current best-in-the-world accuracy. Also compares some simple regularization methods.
- Object Recognition
Ariadna Quattoni, Michael Collins, and Trevor Darrell. Conditional Random Fields for Object Recognition. NIPS 2004
MIT
The authors present a discriminative part-based approach for the recognition of object classes from unsegmented cluttered scenes. Objects are modeled as flexible constellations of parts conditioned on local observations found by an interest operator. For each object class the probability of a given assignment of parts to local features is modeled by a Conditional Random Field (CRF). They propose an extension of the CRF framework that incorporates hidden variables and combines class conditional CRFs into a unified framework for part-based object recognition. The parameters of the CRF are estimated in a maximum likelihood framework and recognition proceeds by finding the most likely class under our model. The main advantage of the proposed CRF framework is that it allows us to relax the assumption of conditional independence of the observed data (i.e. local features) often used in generative approaches, an assumption that might be too restrictive for a considerable number of object classes. They illustrate the potential of the model in the task of recognizing cars from rear and side views.
Given a training set of n pairs (xi, yi), where xi is the ith image and yi is the category of the object present in xi, the goal is to learn a model that maps images to object categories.
They add hidden variables h into the CRFs model.
![]()
![]()
Parameter estimation using belief propagation.
- Biomedical Named Entities identification.
Tzong-han Tsai, Wen-Chi Chou, Shih-Hung Wu, Ting-Yi Sung, Sunita Sarawagi, Jieh Hsiang, and Wen-Lian Hsu. Integrating Linguistic Knowledge into a Conditional Random Field Framework to Identify Biomedical Named Entities. Journal of Expert Systems with Applications. 2005
Institute of Information Science, Acdemia Sinica, TaiPei.
The paper makes use of CRFs for solving biomedical named entities identification. In this work, they try to utilize available resources including dictionaries, web corpora, and lexical analyzers, and represent them as linguistic features in the CRFs model.
Limitation:
Huge computational cost in parameter estimation
1.Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In Proc. of HLT-NAACL. 2003, pages 252-259.
The paper proposes a new POS tagger. In the tagger, the authors make use of both preceding and following tag context via a dependency network representation. By dependency network, they mean bidirectional dependency relationships. That is, the current state can be dependent on the preceding states and the following states.
For example, in the dependency network, the state TO is dependent on t1 and t3.

They also use broad lexical features. For example, the current state is not only predicted by the current word, but also the preceding words and the following words.
For example: t denotes the state, and w denotes the word.

In experiments, they tested kinds of features combination
Their experiments show that rich features based MEMM outperforms the traditional MEMM for the task of POS tagging.
For example:


|
Model |
Features |
Token |
Unknown |
Sentence |
|
L+L2 |
32,935 |
96.05% |
85.92% |
44.04% |
|
R+R2 |
33,423 |
95.25% |
84.49% |
37.20% |
|
L+R |
32,610 |
96.57% |
87.15% |
49.50% |
|
Model |
Features |
Token |
Unknown |
Sentence |
|
L+LL+LLL |
118,752 |
96.20% |
86.52% |
45.14% |
|
L+LL+LR+R+RR |
81,049 |
96.92% |
87.91% |
53.23% |
2. Bidirectional Inference with the Easiest-First Strategy for Tagging Sequence Data
The paper follows the above work. It tries all possible feature combination (so called decomposition structure) for the local classifier and finds the ‘best’ one.
Except for finding the best decomposition structure, the idea of the paper is similar to that in the above paper.
2.1 The paper extends the decoding algorithm in dependency networks so as to support the structure decomposition. (However, there are some faults in its decomposition formula and the algorithms for bidirectional inference for the first-order conditional markov models.)
2.2 The paper also proposes a decoding method by using the easiest-first strategy. The algorithm can be described as follows:
1) Find the “easiest” word to tag.
2) Tag the word.
3) Go back to step 1), until all the words are tagged.
The “easiest” word to tag is the word for which the local classifier outputs the highest probability.
The algorithm, in its first iteration, selects the word by P(s|xi). And then, in the second iteration, it selects the second “easiest” word by P(s|s’xi), where s’ is the tag to the first “easiest” word.
The algorithm uses maximum entropy classifier as the local classifier.
Thinking:
1) Whether it is useful to replace the local classifier (ME) by the SVMs.
1. Dana Ron, Yoram Singer, and Naftali Tishby. Learning Probabilistic Automata with Variable Memory Length. COLT’94.
Hebrew University, Israel.
2. Gill Bejerano. Algorithm for variable length Markov chain modeling.
3. Gill Bejerano. Automata Learning and Stochastic Modeling for Biosequence Analysis. Ph.D. Thesis.
Hebrew University, Israel.

For probability prediction, the VMM model matches the longest memorized context at every point in the observation sequence.
![]()
For learning models, there are two phrases:
1) choosing the high-confidence symbol, and add it to the string set.
2) constructing the PST by the string set.
4. Yevgeny Seldin, Gill Bejerano, and Naftali Tishby. Unsupervised Sequence Segmentation by a Mixture of Switching Variable Memory Markov Sources.
Hebrew University, Israel.
The paper presents an application of VMM.
The authors make use of MDL for learning the PSTs (Prediction Suffix Trees).

Then the goal is to minimize TotalSize(λ) which is the total description length of the whole tree together with all coded data (as all data passes through the root node λ, also denoted as e in some other papers).
The algorithm works in two steps:
1) Extending all the nodes that are potentially beneficial, i.e. by using them, the description length may be decreased. In this way, those nodes whose description size is smaller than the code length of data passing through them when that data is coded using the parent node distribution are of interest.
2) Pruning nodes. If a child subtree Ts of some node s gives better compression than that of its parent nodes, keep the subtree, otherwise prune it.
Sequence Segmentation Algorithm
Given a string x generated by repeatedly switching between several different k PST models with some upper bound on the alternation rate. The task it to segment the string x into l>=k contiguous segments, with length of each segment greater than some constant value L.
They aim to find the optimal segmentation via minimizing the mutual information between x and T
![]()
![]()
Repeat until convergence:
![]()

5. Applications
Hinrich Schutze and Yoram Singer. PART-OF-SPEECH TAGGING USING A VARIABLE MEMORY MARKOV MODEL. 1995
Stanford, and Hebrew, Israel
The paper makes use of VMM for POS tagging.
1. Charles Sutton, Khashayar Rohanimanesh, and Andrew McCallum. Dynamic conditional Random Fields: Factorized Probabilistic Models for Labeling and Segmenting Sequence Data. ICML 2004
University of Massachusetts
The paper presents a generalization of linear-chain conditional random fields in which each time slice contain
The proposed sequence model can be used to represent complex interaction between labels, such as when performing multiple, cascaded labeling tasks on the same sequence, or long-range dependencies exist. The proposed model is called dynamic conditional random fields (DCRFs), a generalization of linear-chain conditional random fields (CRFs) in which each time slice contains a set of state variables and edges—a distributed state representation as in dynamic Bayesian networks (DBNs)—and parameters are tied across slices.
What is DCRF?


Figure 5. Examples of DCRFs.
DCRF is a joint model (also can be viewed as an unified model), which is different from the cascaded model. For example, in cascaded model, the two tasks: POS Tagging and Noun-Phrase Chunking are performed in sequence, i.e. first POS Tagging and then Noun-Phrase Chunking. While in DCRF, the two tasks are carried out in a single factorial CRF.
1. Yasemin Altun, Alex J. Smola, and Thomas Hofmann. Exponential Families for Conditional Random Fields. ICML 2004
Brown University
The paper has investigated the relationship between kernelized exponential families and graphical models. It also proposes an efficient estimation algorithm for kernelized CRFs.
2. Yasemin Altun, Ioannis Tsochantaridis, and Thomas Hofmann. Hidden markov support vector machines. In International Conference on Machine Learning (ICML), pages 3-10, 2003.
Brown University
The paper combines the SVM and HMM algorithms. The proposed architecture handles dependencies between neighboring labels using Viterbi decoding. The learning procedure is based on the maximum margin criterion.
It can be used to learn non-linear discriminant function via kernel functions. It also can have the overlapping features.
3. B. Taskar, C. Guestrin, and D. Koller. Max-margin markov networks. In Neural Information Processing Systems 2003, 2003.
Stanford University
The paper is similar to the above paper.
4. John Lafferty, Xiaojin Zhu, and Yan Liu. Kernel Conditional Random Fields: Representation and Clique Selection. ICML 2004.
CMU
Graph kernel. Here the graph is different from the graph representing the explicit dependencies between labels in a CRF.
The paper defines risk minimization procedures using Mercer kernels on labeled graphs.
Supporting semi-supervised learning.
Goal is to annotate structured data, where the structure is represented by a graph. Labels are to be assigned to the nodes in the graph in order to minimize a loss function.
The main idea is as follows:
Initialize with loss f=0, and iterate: (first, represent each clique by a base function)
1) for each candidate h∈Hk, supported by a single labeled clique, calculate the functional derivative dR(f, h).
To evaluate a candidate h, one strategy is to compute the gain sup
, and to
choose the candidate h having the largest gain.
Another strategy is the functional gradient descent approach, which evaluates a
small change to the current function. For a given candidate h, consider adding
h to the current model with small weight ε; thus
. Then
. Here, the
functional derivative of R at f in the direction h is computed as
![]()
where
is
the empirical expectation and
is
the model expectation conditioned on x, combined with the
empirical distribution on x. The idea is that in directions h
where the functional gradient dR(f, h) is large, the model
is mismatched with the labeled data; this direction should be added to the
model to make a correction.
2) Select the
candidate
having
the largest gradient direction. Set
.
3) Estimate
parameters
for
each active f by minimizing R(f) by using quasi-Newton
method.
Prediction is carried out using the forward-backward algorithm to compute marginals rather than using the Viterbi algorithm.
Semi-Markov chain models extend Hidden Markov Model by allowing each state si to persist for a non-unit length of time di. After the time di, the system will transit to a new state s’, which depends only on si. And during the “segment” of time between i to i+di, the behavior of the system may be non-Markovian.
1. Sunita Sarawagi and William W. Cohen. Semi-Markov Conditional Random Fields for Information Extraction. NIPS 2004.
CMU
The authors extend the CRF, and give a semi-Markov version of the CRF models. The semi-model is used for NER and the experimental results find benefits.
Many other papers about Semi-HMM
X. Ge. Segmental Semi-Markov Models and Applications to Sequence Analysis. PhD thesis, University of California, Irvine, December 2002.
J. Janssen and N. Limnios. Semi-Markov Models and Applications. Kluwer Academic, 1999.
R. Sutton, D. Precup, and S. Singh. BetweenMDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112:181–211, 1999.
1. Andrew McCallum. Efficiently Inducing Features of Conditional Random Fields. Conference on Uncertainty in Artificial Intelligence (UAI), 2003.
University of Massachusetts Amherst
This paper proposes a feature induction method for CRFs. The method is founded on the principle of iteratively constructing feature conjunctions that would significantly increase conditional log-likelihood if added to the model.
If using standard approaches to build large set of feature conjunctions according to hand-built, general patterns, it can lead to extremely large feature sets, with million of features. For example, sometimes, a word tri-gram is important, but there is not sufficient memory or computation to include all word tri-grams.
The algorithm can be described:
Input: (1)training set: (2) a finite state machine
Algorithm: (1) Begin with no features in the model, K=0.
(2) Create a list of candidate features.
(3) Evaluate all candidate features, and add to the model some subset of candidates with highest gain, thereby increasing K.
(4) Use quasi-Newton method to estimate the parameters.
(5) Goto step (2) until some convergence criteria is met.
Output: A finite state CRF model that finds the most likely label sequence given an input sequence by using its induced features, learned weights and the Viterbi algorithm.
The gain is calculated by:
For arbitrarily-structured CRFs
![]()
For Linear-chain CRFs

1. Razvan C. Bunescu and Raymond J. Mooney. Relational Markov Networks for Collective Information Extraction, SRL2004: Statistical Relational Learning and its Connections to Other Fields of ICML 2004 Workshop. 2004
University of Texas at Austin
The paper presents a IE model that employs Relational Markov Networks, which can represent arbitrary dependencies between extractions. This allows for “collective information extraction” that exploits the mutual influence between possible extractions.
2. Taskar, B., Abbeel, P., & Koller, D. (2002). Discriminative probabilistic models for relational data. Proceedings of 18th Conference on Uncertainty in Arti[1]cial Intelligence (UAI-02) (pp. 485-492). Edmonton, Canada.
Stanford
In many supervised learning tasks, the entities to be labeled are related to each other in complex ways and their labels are not independent. For example, in hypertext classification, the labels of linked pages are highly correlated.
In this paper, the authors present an alternative framework that builds on (conditional) Markov networks and addresses two limitations of the previous approach. First, undirected models do not impose the acyclicity constraint that hinders representation of many important relational dependencies in directed models. Second, undirected models are well suited for discriminative training, where we optimize the conditional likelihood of the labels given the features, which generally improves classification accuracy. They show how to train these models effectively, and how to use approximate probabilistic inference over the learned model for collective classification of multiple related entities.
3. Tapani Raiko. Nonlinear Relational Markov Networks with an Application to the Game of Go. International Conference on Artificial Neural Networks. 2005
Helsinki University of Technology
This paper combines two kinds of models: relational Markov networks and hierarchical nonlinear factor analysis, resulting in joining nonlinear models in a structure determined by the relations in the data. The experiments on collective regression in the board game go suggest that regression accuracy can be improved by taking into account both relations and nonlinearities.
Kristina toutanova, Aria Haghighi, and Christopher D. Manning. Joint Learning Improves Semantic Role Labeling. ACL 2005
Stanford University
Local classifiers
![]()
The method separates the task of semantic role labeling into identification and classification phrases. In identification, the task is to classify nodes of t as either core argument labels ARG or NONE. (ARG denotes) in classification, the task is to assign each argument in t with one appropriate semantic role.
Joint classifiers:
Main idea: adding a dependency from each node label to the labels of all other nodes.
1) Top N selections by using local classifiers.
2) Parametric models:

3) Final pipeline
![]()
Sugato Basu, Mikhail Bilenko, and Raymod J. Mooney. A Probabilistic Framework for Semi-Supervised Clustering. KDD’04. Best Paper.
University of Texas at Austin.
The authors aim to improve unsupervised clustering by using supervision in the form of pairwise constraints, e.g. pairs of instances labeled as belonging to same or different clusters. They propose a probabilistic model for semi-supervised clustering based on Hidden Markov Random Fields (HMRFs) that provides a principled framework for incorporating supervision into prototype-based clustering.

Figure 6. A Hidden Markov Random Field
Given
observable set
, and the
hidden variables
which
take values from {1, …, K}, which are the indices of the clusters, the
goal is to search for
having
the maximum probability.
The paper defines a objective function for semi-supervised clustering, and proposes a EM-based partitional clustering algorithm to find the minimum of this objective function.
These two papers can be viewed as further reading. By roughly reviewing, they are not that relevant to state selection. But they seem helpful to understand Markov Random Fields, so I list them here.
Artur Yakovlevich Fridman. Mixed Markov Fields. PhD thesis. 2000
University of Brown
Brani Vidakovic. Markov Random Fields
1. MDL
![]()
![]()
where, xn denotes a data sequence, M denotes all the models, and θ are parameters in the model m. In the second formula, the k denotes the number of parameters. [Li, 1998]
2. Maximizing log-likehood (ML)
Problems: overfitting

3. Maximum a posteriori (MAP)
It can reduce overfitting, but provides no guidance on the choice of parameter prior.
![]()
Actually, MAP would reduce to maximum likelihood (ML) estimation if Pr(Y) is constant.
4. KL
![]()
![]()
One paper used the KL as criteria for model selection. Recently, seldom use it.
5. Error-Minimization
MLE does not minimize error rate
Directly minimize error rate
It depends on how to define the error function (loss function).
6. Maximum margin criteria to kernelize CRFs.
Modeling the whole network as an optimization problem. And then
1) The naïve method is to iteratively try increase the state length as much as possible for each state, until some stop condition is met.
2) Iteratively selecting the state that has the largest gradient descent.