Calibration of generative (TrOCR) models
Teaching a generative TrOCR model to know what it doesn't know
Introduction to model calibration
Calibrating a model means aligning its confidence (the output probabilities) with its actual accuracy. Essentially, when the model’s softmax output for a certain class (or a certain token in the generative scenario) is high, we want the model to be correct most of the time. Likewise, when the confidence is low, the model should get the prediction wrong more often. Often, this is not the case: simply training the model with the cross-entropy loss for generative next-token prediction does not in any way guarantee that the model will be calibrated. In short, a calibrated model has its confidence and correctness aligned, i.e. high-confidence examples are mostly classified correctly and vice versa.
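To make this concrete, below is a minimal sketch of the expected calibration error (ECE), which measures exactly this alignment by comparing average confidence to empirical accuracy inside confidence bins. The function and the toy numbers are our own illustration, not part of the experiments described here.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by confidence and compare average confidence to empirical
    accuracy inside each bin; lower ECE means better calibration."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.sum() == 0:
            continue
        avg_conf = confidences[mask].mean()
        accuracy = correct[mask].mean()
        ece += mask.mean() * abs(avg_conf - accuracy)
    return ece

# A calibrated model: among predictions made with ~0.9 confidence, ~90% are correct.
print(expected_calibration_error([0.9] * 10, [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]))  # -> 0.0
```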
There are numerous simple techniques for calibrating models on classification problems, such as text classification or named entity recognition: histogram binning, isotonic regression, and temperature scaling are just some of them (a temperature-scaling sketch follows the list of reasons below). However, such approaches cannot be used for generative models for three reasons:
the number of classes is the size of the vocabulary, which is usually ~50k tokens (classes);
the ground truth text can be tokenized differently from the output tokens of our model, even though, when combined, they make up the same text. Example: the ground truth "SCHMITZ" can be obtained via different combinations of generated tokens, for instance S, CH, MIT, Z and S, CH, M, IT, Z;
the output tokens need to be aligned with the tokenized ground truth. Example i): ground truth: S, CH, MIT, Z; obtained: SCH, N, IT, Z (only 1 letter differs but 3/4 tokens are incorrect). Example ii): ground truth: S, CH, MIT, Z; obtained: S, CH, N, IT, Z (the lengths differ and it is unclear how to align the two).
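For contrast with the generative setting, here is roughly what one of the classification techniques listed above looks like: a minimal temperature-scaling sketch in PyTorch. The function name and optimizer settings are our own illustration.

```python
import torch

def fit_temperature(logits, labels, max_iter=100):
    """Learn a single scalar T on a held-out set so that softmax(logits / T) is
    better calibrated; the classifier itself is left untouched."""
    temperature = torch.nn.Parameter(torch.ones(1))
    optimizer = torch.optim.LBFGS([temperature], lr=0.01, max_iter=max_iter)
    nll = torch.nn.CrossEntropyLoss()

    def closure():
        optimizer.zero_grad()
        loss = nll(logits / temperature, labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return temperature.item()

# Usage: logits of shape (N, num_classes) and integer labels from a validation set.
# T > 1 softens over-confident predictions; T < 1 sharpens under-confident ones.
```

This works because the number of classes is small and fixed; for a generative model with a ~50k-token vocabulary and the alignment issues above, such a single rescaling is not applicable.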
Generative calibration
A different approach needs to be taken for generative models. A paper examining this field is Calibrating Sequence Likelihood Improves Conditional Language Generation. The idea is to use beam search from the model’s generate function, normally used only during inference, to obtain multiple candidate outputs for a single image input. The procedure is as follows:
After fine-tuning your generative model, introduce a calibration stage, which is also done on the training data.
Choose a similarity function that will later be used either to rank the candidates for the loss calculation (rank loss) or to determine the scale of the loss (margin loss). Character Error Rate (CER) was used as this similarity function.
Using the input pixel values, the ground truth y, and the candidate target sequences output by the model's generate function, calculate the calibration loss (see the sketch after this list). This loss aims to align the sequence likelihoods of the model's decoded candidates with their similarity to the ground truth target sequence, measured by CER. Rank loss optimizes the ranking order of uniformly sampled positive and negative candidate pairs y+, y− for which s(y+, y; x) > s(y−, y; x). Margin loss maximizes the sequence probability gap between positive and negative candidate pairs. List-rank and reward losses gave the worst results in the paper, so they weren't considered.
(Optional) Apply a regularization loss to prevent the model from deviating significantly from its fine-tuned MLE objective. Cross entropy is the standard fine-tuning MLE objective. KL divergence directly minimizes the distance between the probability distributions of the calibrated model and the fine-tuned model at each token of the observed target sequence. The main difference is that the cross-entropy loss regularizes the model toward the gold reference, while KL divergence regularizes it toward the fine-tuned-only model. Aligning the token probabilities of the initial (fine-tuned) and the calibrated model is difficult because the two models frequently produce outputs of different lengths, yet the KL divergence loss requires probability tensors of equal length. Thus, the KL divergence loss wasn't considered.
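To make these steps concrete, below is a minimal sketch assuming a fine-tuned HuggingFace VisionEncoderDecoderModel (TrOCR) with its TrOCRProcessor and a single image per call. The helper names (character_error_rate, sequence_log_prob, calibration_losses), the best-vs-worst candidate pairing, the beam count, and the β value are our own illustration; the paper samples positive/negative pairs uniformly, and the exact loss definitions are given there.

```python
import torch
import torch.nn.functional as F

def character_error_rate(pred: str, truth: str) -> float:
    """Levenshtein distance between the strings, normalised by ground-truth length."""
    d = list(range(len(truth) + 1))
    for i, p in enumerate(pred, 1):
        prev, d[0] = d[0], i
        for j, t in enumerate(truth, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (p != t))
    return d[-1] / max(len(truth), 1)

def sequence_log_prob(model, pixel_values, candidate_ids, pad_token_id):
    """Teacher-forced log P(candidate | image): feed the candidate as decoder input
    and sum the log-probabilities of the tokens it contains."""
    out = model(pixel_values=pixel_values, decoder_input_ids=candidate_ids[:, :-1])
    log_probs = F.log_softmax(out.logits, dim=-1)
    targets = candidate_ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    mask = (targets != pad_token_id).float()
    return (token_lp * mask).sum(-1)

def calibration_losses(model, processor, pixel_values, ground_truth,
                       num_beams=4, beta=1.0):
    """Generate candidates with beam search, score them with CER against the ground
    truth, and compute rank and margin losses on the most/least similar pair."""
    with torch.no_grad():
        cand_ids = model.generate(pixel_values, num_beams=num_beams,
                                  num_return_sequences=num_beams, max_length=64)
    texts = processor.batch_decode(cand_ids, skip_special_tokens=True)
    cers = [character_error_rate(t, ground_truth) for t in texts]
    pos, neg = cers.index(min(cers)), cers.index(max(cers))

    pad_id = processor.tokenizer.pad_token_id
    log_p = sequence_log_prob(model, pixel_values.expand(len(texts), -1, -1, -1),
                              cand_ids, pad_id)

    # Rank loss: push the more similar candidate's sequence likelihood above the
    # less similar one's by at least beta.
    rank_loss = torch.clamp(beta - log_p[pos] + log_p[neg], min=0.0)
    # Margin loss: scale the required likelihood gap by the pair's CER difference.
    margin_loss = torch.clamp(beta * (cers[neg] - cers[pos])
                              - log_p[pos] + log_p[neg], min=0.0)
    return rank_loss, margin_loss
```

During the calibration stage, one of these losses would be backpropagated for each training image, optionally combined with a regularization term as described in the optional step above.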
The main variation across our experiments was the calibration loss. Initial experimentation showed that the CE regularization loss is not beneficial for the TrOCR task. In total, 6 model variants were evaluated: the initial (fine-tuned) model, rank loss, rank loss with if-clause variants 1, 2, and 3, and margin loss.
Rank loss and margin loss are calculated as shown in the equations pictured above. The calibration loss for rank loss with if-clause variants 1, 2, and 3 is calculated as follows (an illustrative sketch of their behavior follows the next paragraph):
Variant 1
Variant 2
Variant 3
These loss variations are not mentioned in the paper; they are modifications of the paper's rank loss that we examined ourselves, for several reasons. First, when both y candidates are correct (have a CER of 0), we don’t necessarily want to increase the generation probability of one of them while decreasing the other, which is what the default rank loss does. Instead, we encourage an increase in the generation probability of either the first beam (variants 2 and 3) or both beams (variant 1). Second, when both y candidates are incorrect, we don’t necessarily want to increase the probability of either of them. Instead, we reduce both beams (variants 1 and 3) or just the first one (variant 2). Third, when the candidates are incorrect and the improvement between them, i.e. the absolute difference of their CERs, is minimal, we again might not want to increase one while decreasing the other (variants 1 and 2). These rank loss variations were added to mitigate these effects.
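Since the exact variant equations are given in the pictures above, the following is only an illustrative PyTorch-style sketch of the case handling described in this paragraph. The function name, the ε threshold for a "minimal" CER difference, and the zero-loss fallback in the small-gap case are our own assumptions rather than the implemented equations.

```python
import torch

def rank_loss_if_clause(log_p1, log_p2, cer1, cer2, variant, beta=1.0, eps=0.05):
    """Illustrative behaviour of the if-clause variants: log_p1/log_p2 are the
    sequence log-probabilities of beams 1 and 2, cer1/cer2 their CERs against the
    ground truth, variant is 1, 2, or 3."""
    if cer1 == 0 and cer2 == 0:
        # Both beams correct: raise both (variant 1) or just the first (variants 2, 3)
        # instead of ranking one correct answer above the other.
        return -(log_p1 + log_p2) if variant == 1 else -log_p1
    if cer1 > 0 and cer2 > 0:
        # Both beams wrong: lower both (variants 1 and 3) or just the first (variant 2).
        return (log_p1 + log_p2) if variant in (1, 3) else log_p1
    if abs(cer1 - cer2) < eps and variant in (1, 2):
        # Negligible improvement between the candidates: skip the ranking update
        # (zero loss here is a placeholder for the exact handling in the figures).
        return log_p1 * 0.0
    # Otherwise fall back to the default rank loss on the (positive, negative) pair.
    pos, neg = (log_p1, log_p2) if cer1 <= cer2 else (log_p2, log_p1)
    return torch.clamp(beta - pos + neg, min=0.0)
```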
Results
Below is an overview of the performance of each model. All models except the first (uncalibrated) one are calibrated on the train set, and all models are tested on the test set. The originally published trocr-base-handwritten model and the IAM dataset were used. An example is skipped during evaluation if any of its predicted tokens has a probability below 50%. Such examples are still included in the graphs, but are denoted below as skipped. Examples that aren't skipped and have a CER > 0 are classified as incorrect. Although TrOCR is a generative model with a vision encoder and a text decoder, its behavior differs from the tasks used in the paper, namely abstractive summarization, question answering, and structured data-to-text, which are more open-ended than the TrOCR task with its strict ground truth. Thus, a different calibration loss proved to work better in our scenario.
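The bucketing rule used in the tables below can be summarized in a few lines; the function name and the example values are our own illustration of the rule just described.

```python
def classify_example(token_probs, cer, min_token_prob=0.5):
    """Skip the example if any predicted token's probability is below 50%;
    otherwise it is correct iff its CER is 0, and incorrect otherwise."""
    if any(p < min_token_prob for p in token_probs):
        return "skipped"
    return "correct" if cer == 0 else "incorrect"

# A confident but slightly wrong prediction counts as incorrect.
print(classify_example([0.98, 0.91, 0.87], cer=0.12))  # -> "incorrect"
```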
Initial (fine-tuned) model
Average CER: 0.0385
Average sequence probability: 0.5496
Correct : 902
Incorrect: 408
Skipped : 551
Calibrated - rank loss
Average CER: 0.0334
Average sequence probability: 0.6627
Correct : 1001
Incorrect: 451
Skipped : 409
Calibrated - rank loss if clause variant 1
Average CER: 0.0313
Average sequence probability: 0.4965
Correct : 972
Incorrect: 375
Skipped : 514
Calibrated - rank loss if clause variant 2
Average CER: 0.0338
Average sequence probability: 0.6724
Correct : 1001
Incorrect: 480
Skipped : 380
Calibrated - rank loss if clause variant 3
Average CER: 0.0385
Average sequence probability: 0.6386
Correct : 972
Incorrect: 461
Skipped : 428
Calibrated - margin loss
Average CER: 0.0299
Average sequence probability: 0.6321
Correct : 1059
Incorrect: 381
Skipped : 421
Conclusion
Generative models also proved to be suitable for calibration. For the TrOCR problem specifically, margin loss yielded the best results. Compared to the initial, fine-tuned-only model, the test set metrics improved by a large margin: the average CER decreased from 0.0385 to 0.0299 (a 22.3% decrease), the average sequence probability increased from 0.5496 to 0.6321, the number of correctly classified examples increased by 17.4%, the number of incorrect ones decreased by 6.6%, and the number of skipped examples dropped by 23.6%. Its graph also shows a more desirable 2-D CER-probability example distribution than that of the fine-tuned-only model. Thus, the method proved useful not only for calibrating a generative model but also for increasing its performance in general.