

LLMs Explained,
OPT
Meta AI introduced the OPT (Open Pre-trained Transformer) language models and released them in the metaseq repository on May 3rd, 2022. OPT is a decoder-only model in the same family as GPT-3 and was trained with a self-supervised causal language modeling (CLM) objective. It was pre-trained primarily on English text, although a small amount of non-English data remains in the training corpus via CommonCrawl. All models with 125M to 66B parameters are released openly, and full research access to OPT-175B is granted upon request to academic researchers, those affiliated with government and civil society organizations, and those working in industry research laboratories. The model creation logbook and the metaseq codebase are also released; using this codebase, OPT-175B was trained on 992 80GB A100 GPUs, reaching up to 147 TFLOP/s utilization per GPU.
An Overview of OPT
OPT models are useful for many natural language processing tasks and can potentially advance the field significantly. The OPT models are centered on sustainability and responsibility, and their creators intend to share them fully and responsibly with interested researchers.

1/7th the carbon footprint to develop
Researchers show that OPT-175B is comparable to GPT-3 in terms of performance while requiring only one-seventh of the carbon footprint to develop.

992 80GB A100 GPUs
OPT-175B was trained on 992 80GB A100 GPUs using Fully Sharded Data Parallel training with Megatron-LM Tensor Parallelism, achieving up to 147 TFLOP/s utilization per GPU.

180B tokens of training data
The training data for OPT contains roughly 180B tokens, corresponding to about 800 GB of data. It is a collection of the data used in RoBERTa and the Pile, together with the PushShift.io Reddit dataset.
About OPT Model
The OPT (Open Pre-trained Transformer) model, developed by Meta AI (Facebook AI Research), is a transformer-based language model. It has been pre-trained on massive amounts of text from various domains and can be fine-tuned to perform a wide range of natural language processing (NLP) tasks, including text classification, question answering, and language generation.
The OPT models are released openly for research use, so researchers can use and adapt them, and the architecture is highly scalable: it can be trained on large datasets and applied to massive amounts of text. OPT-175B, the largest model in the suite, has been shown to perform similarly to GPT-3 while requiring only one-seventh of the carbon footprint to develop. Training followed a linear learning rate schedule, warming up from 0 to the maximum learning rate over the first 2000 steps for OPT-175B (or over 375M tokens for the smaller baselines) and then decaying to 10% of the maximum learning rate over 300B tokens.
Model type: Decoder-only pre-trained transformer
Languages: Primarily English; limited success with Chinese, German, French, and Spanish
License: The metaseq codebase is MIT-licensed; the pre-trained model weights are distributed under a separate non-commercial research license
Model highlights
Open Pre-trained Transformers (OPT) is a suite of decoder-only pre-trained transformers. The key highlights of the OPT models are:
- OPT-175B is comparable to GPT-3 in terms of performance while requiring only one-seventh the carbon footprint to develop.
- The OPT suite is freely and responsibly distributed to interested researchers. All released models' code is being made available for experimentation.
- OPT-175B is evaluated on 16 standard prompting-based NLP tasks, including HellaSwag, StoryCloze, ARC (Easy and Challenge), OpenBookQA, Winograd, WinoGrande, and SuperGLUE.
- OPT-175B is trained using Fully Sharded Data Parallel (FSDP) training, in which the model's parameters, gradients, and optimizer state are sharded across accelerator devices, enabling efficient parallel training of very large transformer models.

Training Details
Training data
The training dataset consists of five filtered datasets: BookCorpus, CC-Stories, The Pile (which includes Pile-CC, OpenWebText2, USPTO, Project Gutenberg, OpenSubtitles, Wikipedia, DM Mathematics, and HackerNews), Pushshift.io Reddit dataset, and CCNewsV2.


Training Procedure
Texts are tokenized with the GPT-2 byte-level version of Byte Pair Encoding (BPE), which handles arbitrary Unicode characters, using a vocabulary size of 50272. Inputs are sequences of 2048 consecutive tokens. The 175B model was trained on 992 80GB A100 GPUs. A small tokenizer sketch follows.
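As a rough illustration (assuming the Hugging Face transformers library and the public facebook/opt-350m checkpoint, which ships the same GPT-2 style BPE tokenizer), the tokenizer can be inspected directly:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")

text = "OPT is a decoder-only transformer trained with a causal language modeling objective."
ids = tokenizer(text)["input_ids"]

# Byte-level BPE ids; the tokenizer prepends a beginning-of-sequence token automatically.
print(ids[:8])
# The reported vocabulary of 50272 corresponds to the model's (padded) embedding matrix;
# the tokenizer itself may report a slightly smaller count.
print(len(tokenizer))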


Training dataset size
The training dataset contains roughly 180 billion tokens, corresponding to about 800 gigabytes of data. The validation split consists of 200MB of pretraining data, sampled proportionally to each dataset's size in the pretraining corpus.


Training time and resources
The CCNewsV2 portion of the corpus contains English news articles collected between September 2016 and September 2021. The OPT-175B model was trained for approximately 33 days of continuous training; training times for the smaller models are not specified.


Model Types
OPT has nine variants with parameters ranging from 125 million to 175 billion. These models have varying strengths and are appropriate for various NLP tasks, and improved performance can be obtained by fine-tuning the pre-trained models on specific downstream tasks. The table below lists the architecture details for each variant: number of layers (#L), number of attention heads (#H), embedding size (dmodel), peak learning rate (LR), and global batch size in number of tokens (Batch). A short loading sketch follows the table.
| Model | #L | #H | dmodel | LR | Batch |
| --- | --- | --- | --- | --- | --- |
| 125M | 12 | 12 | 768 | 6.0e−4 | 0.5M |
| 350M | 24 | 16 | 1024 | 3.0e−4 | 0.5M |
| 1.3B | 24 | 32 | 2048 | 2.0e−4 | 1M |
| 2.7B | 32 | 32 | 2560 | 1.6e−4 | 1M |
| 6.7B | 32 | 32 | 4096 | 1.2e−4 | 2M |
| 13B | 40 | 40 | 5120 | 1.0e−4 | 4M |
| 30B | 48 | 56 | 7168 | 1.0e−4 | 4M |
| 66B | 64 | 72 | 9216 | 0.8e−4 | 2M |
| 175B | 96 | 96 | 12288 | 1.2e−4 | 2M |
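As a minimal sketch (assuming the Hugging Face transformers library; the publicly hosted checkpoints range from facebook/opt-125m up to facebook/opt-66b, while OPT-175B is available only by request), any of the released sizes can be loaded by name:

from transformers import AutoTokenizer, OPTForCausalLM

# Swap the checkpoint name for any released size, e.g. "facebook/opt-1.3b" or "facebook/opt-6.7b".
checkpoint = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = OPTForCausalLM.from_pretrained(checkpoint)

# Larger variants differ only in depth, width, and number of attention heads (see the table above).
print(model.config.num_hidden_layers, model.config.num_attention_heads, model.config.hidden_size)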
Business Applications
OPT performs well on tasks such as sequence classification and question answering. The pretrained-only model can be used for prompting-based downstream task evaluation and text generation, and it supports multiple business applications. Some are listed below, with a small classification sketch after the table:
| Sequence Classification | Question Answering |
| --- | --- |
| Sentiment analysis | Chatbots and virtual assistants |
| Spam detection | Customer support and helpdesk automation |
| Topic classification | E-commerce product recommendations and personalization |
| Intent recognition | Search engines and information retrieval systems |
| Language identification | Medical diagnosis and treatment recommendation |
| Text classification for customer service and support | Educational and training systems |
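As a hedged sketch of the sequence classification use case: transformers provides an OPTForSequenceClassification head, but the public checkpoints do not ship trained classification weights, so the head below is randomly initialized and must be fine-tuned on labeled data before its outputs mean anything.

from transformers import AutoTokenizer, OPTForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
# num_labels is task-specific; 2 here for a binary task such as spam detection.
model = OPTForSequenceClassification.from_pretrained("facebook/opt-350m", num_labels=2)

inputs = tokenizer("Congratulations, you have won a free prize! Click here.", return_tensors="pt")
logits = model(**inputs).logits   # untrained head: fine-tune before relying on these scores
print(logits.argmax(dim=-1))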
Model Features
Open Pre-trained Transformers (OPT) is a family of large language models based on the transformer architecture and pre-trained on large amounts of text data. Here are some of the OPT models' key features.
Multi-layer transformer architecture
The OPT model is based on a multi-layer transformer architecture, a deep neural network that can process sequential data such as text. Transformers are particularly effective in capturing long-range dependencies in sequences, which is important in natural language understanding.
Linear learning rate schedule
A linear learning rate schedule is used to train the OPT models: the learning rate warms up from 0 to the maximum value over the first 2000 steps for OPT-175B (or over 375M tokens for the smaller baselines) and then decays to 10% of the maximum learning rate over 300B tokens. A schedule sketch is shown below.
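A minimal sketch of this warmup-then-linear-decay shape using PyTorch's LambdaLR; the model, step counts, and peak learning rate here are illustrative placeholders rather than the exact values from the OPT training runs:

import torch

model = torch.nn.Linear(16, 16)                      # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=6.0e-4)

warmup_steps = 2000      # OPT-175B warms up over the first 2000 steps
total_steps = 100_000    # decay horizon (300B tokens in the actual runs)
min_ratio = 0.1          # decay down to 10% of the peak learning rate

def lr_lambda(step):
    if step < warmup_steps:
        return step / max(1, warmup_steps)           # linear warmup from 0 to the peak LR
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max(min_ratio, 1.0 - (1.0 - min_ratio) * progress)  # linear decay to 10%

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)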
FSDP training
OPT-175B is trained using Fully Sharded Data Parallel (FSDP) training. In this method, the model's parameters, gradients, and optimizer state are sharded across accelerator devices, and full parameters are gathered only when needed for computation. This allows very large transformer models to be trained efficiently in parallel. A simplified wrapping sketch is shown below.
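A simplified sketch of FSDP wrapping with PyTorch's built-in FullyShardedDataParallel, not the actual metaseq training setup; it assumes a multi-GPU job launched with torchrun so that torch.distributed can initialize:

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from transformers import OPTForCausalLM

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = OPTForCausalLM.from_pretrained("facebook/opt-350m").cuda()

# Each rank keeps only a shard of the parameters, gradients, and optimizer state;
# full parameters are gathered on the fly for the forward and backward passes.
model = FSDP(model)
optimizer = torch.optim.AdamW(model.parameters(), lr=3.0e-4)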
Megatron-LM Tensor Parallelism
OPT-175B is trained using Fully Sharded Data Parallel training combined with Megatron-LM Tensor Parallelism. In tensor parallelism, the large weight matrices inside each layer are split into shards that are placed on different GPUs or accelerator devices, and the devices exchange activations to reconstruct each layer's output. A toy illustration of the idea follows.
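A toy, single-process illustration of the core idea, splitting one projection column-wise; real Megatron-LM places each shard on a different GPU and inserts the necessary communication collectives:

import torch

d_model, d_ff = 512, 2048
x = torch.randn(4, d_model)            # a batch of activations
W = torch.randn(d_model, d_ff)         # one feed-forward projection

# Split the weight matrix into two "tensor-parallel" shards along the output dimension.
W1, W2 = W.chunk(2, dim=1)
y_parallel = torch.cat([x @ W1, x @ W2], dim=1)   # each shard can be computed on its own device

print(torch.allclose(y_parallel, x @ W, atol=1e-4))   # same result as the unsharded projection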
Autoregressive architecture
OPT uses an autoregressive architecture: it generates one token at a time, conditioned on the previously generated tokens. The transformer decoder encodes the context of the preceding tokens and predicts the next token in the sequence. A minimal decoding loop is sketched below.
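A minimal greedy decoding loop (assuming the transformers library and the facebook/opt-350m checkpoint) that makes the autoregressive pattern explicit, with each predicted token appended back onto the context:

import torch
from transformers import AutoTokenizer, OPTForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
model = OPTForCausalLM.from_pretrained("facebook/opt-350m")

input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
for _ in range(10):
    with torch.no_grad():
        logits = model(input_ids).logits
    next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # greedy choice of the next token
    input_ids = torch.cat([input_ids, next_token], dim=-1)       # condition on everything generated so far

print(tokenizer.decode(input_ids[0], skip_special_tokens=True))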
Prompting
OPT-175B is evaluated on 16 standard prompting-based NLP tasks, including HellaSwag, StoryCloze, ARC (Easy and Challenge), OpenBookQA, Winograd, WinoGrande, and SuperGLUE, where tasks are posed to the pretrained model as text prompts rather than through fine-tuning. A rough scoring sketch is shown below.
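A hedged sketch of how prompting-based multiple-choice evaluation works in spirit: score each candidate continuation by its likelihood under the model and pick the best one. This is not the official evaluation harness, and the example prompt and candidates are made up for illustration:

import torch
from transformers import AutoTokenizer, OPTForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
model = OPTForCausalLM.from_pretrained("facebook/opt-350m")

context = "She put the kettle on the stove because"
candidates = [" she wanted to make tea.", " the moon was full."]

def sequence_logprob(text):
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)       # loss is the mean next-token negative log-likelihood
    return -out.loss.item() * (ids.shape[1] - 1)   # approximate total log-probability

scores = [sequence_logprob(context + c) for c in candidates]
print(candidates[scores.index(max(scores))])       # pick the most likely continuation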
Model Tasks
The pretrained OPT-175B can be applied to several NLP tasks without task-specific training. Future experimentation with the model for dialogue should include explicit fine-tuning on curated datasets to improve its safety profile. The main tasks are:
Common-sense inference
The model can be used for common-sense inference, which involves reasoning about real-world knowledge and making inferences based on that knowledge. The model must understand the nuances of language and predict the likely outcomes of events based on prior knowledge and context.
Story completion
The model can be used for story completion, where it is given a partially completed story and tasked with predicting the most likely ending based on context and prior knowledge. This requires a deep understanding of narrative structure and the ability to generate coherent, plausible endings.
Question Answering
The OPT language model can be used for question answering, where it is given a question and a passage of text and must answer the question based on the information in the passage. This requires a strong understanding of language and the ability to reason over textual information to identify the correct answer.
Reasoning
The model can be used for reasoning tasks, which involve making logical deductions from the information provided: identifying relationships between statements and drawing valid inferences from them.
Word-in-context
The model can be used for the word-in-context task, which involves predicting the meaning of a word based on its context within a sentence, using the relationships between the surrounding words to disambiguate the intended sense.
Pronoun reference ambiguity
The model can be used for pronoun reference disambiguation, which involves resolving ambiguous pronouns in a sentence and identifying the correct antecedent.
Multi-Sentence Reading Comprehension
The model can be used for multi-sentence reading comprehension, where it is given a passage of text and must answer questions based on information spread across multiple sentences.
Recognizing Textual Entailment
The model can be used for recognizing textual entailment, which involves determining whether one sentence logically follows from another.
Fine-tuning
One can fine-tune the OPT model on new data, often with a lower learning rate, to adjust the model parameters to better fit the new task. There are several approaches to fine-tuning:
Causal Language Modeling
OPT can be fine-tuned with a causal language modeling (CLM) loss: the model is trained to predict the next token given the sequence of tokens to its left, so it attends only to the left context. Fine-tuning on domain-specific text with this objective adapts the model's generations to the new domain. A minimal sketch is shown below.
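A minimal sketch of CLM fine-tuning with the Hugging Face Trainer; the dataset name, sequence length, and hyperparameters are illustrative assumptions, not values from the OPT paper:

from datasets import load_dataset
from transformers import (AutoTokenizer, OPTForCausalLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

checkpoint = "facebook/opt-350m"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = OPTForCausalLM.from_pretrained(checkpoint)

# Any small text corpus works here; wikitext-2 is just a convenient public example.
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

# mlm=False gives the causal (next-token) objective: labels are the inputs shifted by one.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

args = TrainingArguments(output_dir="opt-clm-finetune", per_device_train_batch_size=2,
                         num_train_epochs=1, learning_rate=5e-5)
trainer = Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator)
trainer.train()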
Parameter-Efficient Fine-Tuning
Parameter-Efficient Fine-Tuning (PEFT) is a method for fine-tuning pre-trained transformer models with limited labeled data. In traditional fine-tuning, all pre-trained model parameters are updated using the labeled data, which can be computationally expensive and data-intensive. PEFT addresses this issue by only updating a subset of the model's parameters during fine-tuning while keeping the remaining ones fixed.
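As a bare-bones illustration of the idea (not a specific published PEFT method): freeze the pretrained weights and leave only a small subset, here the final decoder layer, trainable:

from transformers import OPTForCausalLM

model = OPTForCausalLM.from_pretrained("facebook/opt-350m")

# Freeze everything, then unfreeze only the last decoder layer.
for param in model.parameters():
    param.requires_grad = False
for param in model.model.decoder.layers[-1].parameters():
    param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")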
Low-Rank Adaptation (LoRA)
LoRA (Low-Rank Adaptation) is a recent PEFT method. Instead of updating the full weight matrices, LoRA freezes the pretrained weights and injects small trainable low-rank matrices into selected layers, typically the attention projections. This dramatically reduces the number of trainable parameters and the memory needed for fine-tuning, while often matching the quality of full fine-tuning. A sketch using the Hugging Face peft library follows.
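A minimal sketch assuming the Hugging Face peft library; the rank, alpha, and target modules below are common illustrative choices rather than values prescribed for OPT:

from peft import LoraConfig, TaskType, get_peft_model
from transformers import OPTForCausalLM

model = OPTForCausalLM.from_pretrained("facebook/opt-350m")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                         # rank of the low-rank update matrices
    lora_alpha=16,               # scaling factor applied to the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections in the OPT decoder
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()   # only the LoRA matrices are trainable
# The wrapped model can then be fine-tuned as usual, e.g. with the Trainer sketch above.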
Benchmarking
The table below shows the CrowS-Pairs evaluation results. Lower values indicate less measured bias (greater fairness) in each category. With the exception of religion (and a tie on disability), Davinci (GPT-3) scores lower than OPT-175B, meaning OPT-175B exhibits more stereotypical bias in most categories.
| Category | GPT-3 | OPT-175B |
| --- | --- | --- |
| Gender | 62.6 | 65.7 |
| Religion | 73.3 | 68.6 |
| Race/Color | 64.7 | 68.6 |
| Sexual orientation | 76.2 | 78.6 |
| Age | 64.4 | 67.8 |
| Nationality | 61.6 | 62.9 |
| Disability | 76.7 | 76.7 |
| Physical appearance | 74.6 | 76.2 |
| Socioeconomic status | 73.8 | 76.2 |
| Overall | 67.2 | 69.5 |
On StereoSet evaluations, Davinci and OPT-175B perform similarly across all categories.
Sample Codes
How to use: You can use the OPT model directly with a pipeline for text generation.

from transformers import pipeline

# Load a small OPT checkpoint into a text-generation pipeline.
generator = pipeline('text-generation', model="facebook/opt-350m")
generator("Hello, I am conscious and")
OPT For QuestionAnswering
from transformers import AutoTokenizer, OPTForQuestionAnswering
import torch

torch.manual_seed(4)
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")

# note: we are loading an OPTForQuestionAnswering from the hub here,
# so the QA head will be randomly initialized, hence the predictions will be random
model = OPTForQuestionAnswering.from_pretrained("facebook/opt-350m")

question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
inputs = tokenizer(question, text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Take the most likely start and end positions and decode the answer span.
answer_start_index = outputs.start_logits.argmax()
answer_end_index = outputs.end_logits.argmax()
predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
predicted = tokenizer.decode(predict_answer_tokens)
print(predicted)
OPT For Causal LM
from transformers import AutoTokenizer, OPTForCausalLM

model = OPTForCausalLM.from_pretrained("facebook/opt-350m")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")

prompt = "Hey, are you conscious? Can you talk to me?"
inputs = tokenizer(prompt, return_tensors="pt")

# Generate a continuation of the prompt
generate_ids = model.generate(inputs.input_ids, max_length=30)
print(tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0])
Limitations
The authors note that OPT-175B is premature for commercial deployment. Despite the release of data sheets and model cards, additional data characterization and selection criteria should be applied to the training data in order to use it responsibly. The OPT model has several limitations; here are a few:
Declarative instructions
OPT-175B does not work well with declarative instructions or point-blank interrogatives and tends to simulate a dialogue rather than execute the instruction. This limitation may be alleviated by future work into instruction learning.
Tends to be repetitive
OPT-175B tends to be repetitive and can get stuck in a loop, even with sampling. Future work may wish to incorporate modern strategies for reducing repetition and improving diversity, such as unlikelihood training or best-first decoding.
Can produce factually incorrect statements
OPT-175B can produce factually incorrect statements, which can be particularly harmful in critical applications such as healthcare and scientific discovery. Retrieval-augmented models have been found to improve factual correctness, and OPT-175B may benefit from retrieval-augmentation in future iterations.
High propensity to generate toxic language
OPT-175B has a high propensity to generate toxic language and reinforce harmful stereotypes, even when provided with a relatively innocuous prompt. Mitigations for toxicity and biases exist, but future uses of OPT-175B may need to employ these or novel mitigation approaches, especially before any real-world deployment.