The human race produces copious amounts of documents daily to support its effective functioning. Even though digitalization allows documents to be processed automatically when they are stored in an appropriate format, this is not the case for documents that are printed or handwritten. The introduction of Large Language Models (LLMs), used in conjunction with Optical Character Recognition (OCR) systems, allows us to effectively extract the relevant information from such documents. However, these models are quite expensive to train from the ground up (e.g., training GPT-4 reportedly cost about $100M). For this reason, the standard paradigm for building and using such large, expensive models consists of two distinct phases: pre-training and fine-tuning. In the pre-training phase, the model is trained with a self-supervised objective such as masked language modeling, sequence reshuffling, or next-token prediction. During this phase, the model is presented with a large corpus (e.g., LLaMA 2 was pre-trained on ~2T tokens) from which it acquires general knowledge. During the fine-tuning phase, the model is further specialized to perform a specific task such as document classification or Named Entity Recognition (NER).
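To make the two phases concrete, the minimal sketch below contrasts a self-supervised masked language modeling prediction with a single supervised fine-tuning step for document classification. It assumes the Hugging Face transformers library and the bert-base-uncased checkpoint purely for illustration; the example text and the three-class label set are hypothetical and not taken from this work.

```python
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    AutoModelForSequenceClassification,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Pre-training objective (masked language modeling): the model predicts the
# token hidden behind [MASK] from its surrounding context, with no labels.
mlm_model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
inputs = tokenizer("The invoice is due on [MASK] 15.", return_tensors="pt")
with torch.no_grad():
    logits = mlm_model(**inputs).logits
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
print(tokenizer.decode(logits[0, mask_pos].argmax()))  # the model's guess

# Fine-tuning: the same pre-trained backbone receives a new classification
# head and is trained on labeled examples for a downstream task such as
# document classification (hypothetical 3-class label set).
clf_model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3
)
labels = torch.tensor([0])  # e.g., 0 = invoice, 1 = contract, 2 = letter
loss = clf_model(**inputs, labels=labels).loss
loss.backward()  # one supervised gradient step (optimizer loop omitted)
```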