LLMs Explained: Galactica
Galactica is a large-scale language model developed by Meta AI in collaboration with Papers with Code. It was trained on a corpus of 48 million papers, textbooks and lecture notes, millions of compounds and proteins, scientific websites, encyclopedias, and more, drawn from the curated "NatureBook" dataset. Galactica's performance is competitive with other state-of-the-art language models, making it a promising technology for real-world natural language processing applications. With its exceptional efficiency and performance, Galactica represents a significant step forward in developing NLP systems, enabling researchers and practitioners to tackle complex language-related challenges and unlock new opportunities for innovation.
An Overview of Galactica
Galactica is a large-scale language model developed by Meta AI in collaboration with Papers with Code.
Scales up to 120B parameters
120B parameters
The model has multiple architecture variations, ranging from the base architecture with 125M parameters to larger models with up to 120B parameters.
Outperforms the latest GPT-3 on various NLP tasks
Outperforms GPT-3
On technical knowledge probes such as LaTeX equations, the Galactica model outperforms the latest GPT-3 large language model by 68.2% versus 49.0%.
Outperforms BLOOM on various NLP tasks
Outperforms BLOOM
Galactica outperforms BLOOM and OPT-175B on BIG-bench. It also sets a new state-of-the-art on downstream tasks such as PubMedQA and MedMCQA.
- Introduction
- Business Applications
- Model Features
- Model Tasks
- Fine-tuning
- Benchmarking
- Sample Codes
- Limitations
- Other LLMs
About the Model
Galactica uses a Transformer architecture in a decoder-only setup, with multiple self-attention layers that capture long-range dependencies between tokens, and it is trained to predict the next token over a large curated scientific corpus. This design lets it outperform existing models on a range of scientific tasks. On technical knowledge probes such as LaTeX equations, Galactica outperforms the latest GPT-3 by 68.2% versus 49.0%. Galactica also performs well on reasoning, outperforming Chinchilla on mathematical MMLU by 41.3% to 35.7%, and PaLM 540B on MATH with a score of 20.4% versus 8.8%. It also sets a new state-of-the-art on downstream tasks such as PubMedQA and MedMCQA dev, with scores of 77.6% and 52.9% respectively. And despite not being trained on a general corpus, Galactica outperforms BLOOM and OPT-175B on BIG-bench.
Model highlights
The following are the key highlights of the Galactica model.
- Can store, combine, and reason about scientific knowledge.
- Trained on a large scientific corpus of papers, reference material, knowledge bases, and other sources.
- Outperforms existing models on various scientific tasks, including technical knowledge probes, reasoning, and downstream tasks such as PubMedQA and MedMCQA dev.
- Outperforms general-purpose models such as BLOOM and OPT-175B on BIG-bench despite not being trained on a general corpus.
- Demonstrates the potential for language models as a new interface for science.
- Open-sourced for the benefit of the scientific community.
Training Details
Training data
Galactica models are trained on 106 billion tokens of publicly available scientific text and data. Papers, textbooks, scientific websites, encyclopedias, reference material, knowledge bases, and other materials fall into this category.
Training Procedure
Galactica was trained as a self-supervised, left-to-right language model: it learns to predict the next token over its curated scientific corpus. According to the paper, the models were trained for roughly 450 billion tokens, corresponding to about 4.25 epochs over the corpus.
Training dataset size
According to the paper, the Galactica training corpus contains roughly 106 billion tokens of scientific text and data, gathered from publicly accessible sources such as papers, reference material, knowledge bases, scientific websites, and code.
Training time and resources
The training time and resources required for Galactica depend on the model size. All variants were trained on the same scientific corpus, with the largest 120B-parameter model requiring by far the most compute and memory; the smaller variants (125M to 30B parameters) are correspondingly cheaper to train and to deploy.
Model Types
The Galactica language model has several architecture variations with varying numbers of parameters. Here's a brief explanation of each of them:
Model | Parameters | Highlights |
--- | --- | --- |
GAL 125M | 125 million | Smallest variant; fast to run and useful for experimentation |
GAL 1.3B | 1.3 billion | Base model for general scientific NLP tasks |
GAL 6.7B | 6.7 billion | More capacity and improved performance |
GAL 30B | 30 billion | Captures more complex patterns |
GAL 120B | 120 billion | Best performance, but requires the most computational resources |
Business Applications
Galactica is a scientific language model whose capabilities support a range of business applications in research-intensive industries such as biotechnology, pharmaceuticals, chemistry, and academic publishing. The table below lists representative tasks, the industries where they apply, and example use cases.
Task | Business Use Cases | Examples |
--- | --- | --- |
Protein Function Description | Biotechnology, Medical Research | Describing the function of proteins, predicting the impact of genetic mutations, designing new therapies |
Functional Keyword Prediction | Biotechnology, Medical Research | Predicting the function of genes, identifying disease biomarkers, designing new therapies |
Protein Keyword Prediction | Biotechnology, Medical Research | Identifying the function of unknown proteins, predicting protein interactions, designing new treatments |
Drug Discovery Tasks | Pharmaceutical Industry, Biotechnology | Identifying potential drug candidates, predicting drug interactions, optimizing drug delivery systems |
IUPAC Name Prediction | Chemical Industry, Drug Discovery | Generating standardized names for chemicals, identifying potential drug candidates |
Empirical Citation Distribution | Scientific Research, Academic Publishing | Analyzing citation patterns in scientific literature, identifying emerging areas of research, tracking the impact of individual researchers |
Citation Prediction | Scientific Research, Academic Publishing | Predicting which papers are likely to be cited in the future, identifying influential research, recommending articles for readers |
Knowledge Probing | Knowledge Management, Education | Generating summaries of long texts, extracting relevant information from documents, assisting in academic research |
Model Features
The Galactica language model has several unique features contributing to its state-of-the-art performance on various natural language processing tasks. These features include:
GeLU Activation
The model uses GELU activations for all model sizes. GELU (Gaussian Error Linear Unit) is an activation function commonly used in deep learning models. It multiplies the input by the Gaussian cumulative distribution function (CDF) evaluated at that input, giving a smooth approximation of the ReLU (Rectified Linear Unit) that is differentiable for all input values.
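For illustration, here is a minimal PyTorch sketch of GELU computed from the Gaussian CDF; it is a generic reference implementation, not code taken from Galactica itself.

# GELU(x) = x * Phi(x), where Phi is the standard Gaussian CDF.
# Generic reference implementation for illustration only.
import math
import torch

def gelu(x: torch.Tensor) -> torch.Tensor:
    # The Gaussian CDF expressed via the error function.
    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))

x = torch.linspace(-3.0, 3.0, steps=7)
print(gelu(x))                      # smooth and differentiable everywhere
print(torch.nn.functional.gelu(x))  # PyTorch's built-in GELU gives matching values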
Context Window
The model uses a 2048-length context window for all model sizes. Large language models use a context window to consider the surrounding words when predicting the next word or generating text. The context window refers to the number of words or tokens the model considers when making predictions. A larger context window can help the model capture more contextual information.
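If one of the publicly released Hugging Face checkpoints is available (checkpoint names follow the pattern used in the sample code further down this page), the configured context length can be inspected directly; reading it from the OPT-style max_position_embeddings field is an assumption of this sketch.

# Inspect the configured maximum sequence length of a released checkpoint.
# Assumes the Hugging Face checkpoint name and the OPT-style config field.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("facebook/galactica-125m")
print(config.max_position_embeddings)  # expected to report 2048, per the paper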
No Biases
Similar to the PaLM model, Galactica does not use biases in any of the dense kernels or layer norms.
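As a generic illustration (not Galactica's internal code), a dense projection without a bias term looks like this in PyTorch:

# A bias-free dense layer: only a weight matrix is learned, no bias vector.
import torch
import torch.nn as nn

dense = nn.Linear(768, 768, bias=False)
x = torch.randn(1, 16, 768)     # a batch of one 16-token sequence of 768-d vectors
print(dense(x).shape)           # torch.Size([1, 16, 768])
print(dense.bias)               # None: no bias parameter is allocated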
Learned Positional Embeddings
In order to process sequences of text, Galactica uses positional embeddings to encode the position of each token in the sequence. Traditional approaches use fixed embeddings, but newer models use learned positional embeddings, which allow the model to better capture the relationships between tokens.
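A minimal, generic sketch of learned positional embeddings is shown below; the dimensions are illustrative and this is not Galactica's actual implementation.

# Learned positional embeddings: positions index into a trainable table.
import torch
import torch.nn as nn

max_positions, d_model = 2048, 768                      # 2048 matches Galactica's context window
position_table = nn.Embedding(max_positions, d_model)   # updated during training

positions = torch.arange(16).unsqueeze(0)               # positions 0..15 for a 16-token input
pos_vectors = position_table(positions)
print(pos_vectors.shape)  # torch.Size([1, 16, 768]); added to the token embeddings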
Vocabulary
The Galactica team constructed a vocabulary of 50k tokens using byte-pair encoding (BPE). The vocabulary was generated from a randomly selected 2% subset of the training data. Large language models have much larger vocabularies than traditional models, which allows them to capture a wider range of language and produce more natural-sounding text.
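To see how the 50k-token BPE vocabulary splits scientific text, the tokenizer from a released checkpoint can be inspected; the checkpoint name and the example sentence below are illustrative assumptions.

# Inspect Galactica's BPE tokenizer (assumed Hugging Face checkpoint name).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-125m")
print(tokenizer.vocab_size)               # on the order of 50k tokens

text = "The Schwarzschild radius is r_s = 2GM/c^2."
print(tokenizer.tokenize(text))           # the BPE pieces
print(tokenizer(text).input_ids)          # the corresponding token ids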
Model Tasks
Galactica is a language model designed for scientific and technical writing, and it can be fine-tuned for specific domains such as chemistry, biology, or medicine. The model's training datasets include PubMed, CORD-19, ChEMBL, PubChem, UniProt, and Gene Ontology. This training makes Galactica a powerful tool for scientific research, drug discovery, and technical writing.
Knowledge Probing
Knowledge probing refers to evaluating a language model's knowledge and understanding of a particular domain. This task involves presenting the model with various knowledge-based questions, which it should be able to answer accurately based on its understanding of the domain. The model can be trained to perform knowledge-probing tasks across various domains, such as science, history, literature, and more.
Question Answering
It is a task in which a language model is given a question and must provide an accurate answer. The model can be trained to perform question-answering tasks across various domains, such as science, history, literature, and more. It can answer questions accurately based on understanding the underlying domain and available knowledge.
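As an illustration, a question can be posed as plain text completion; the loading code mirrors the sample code later on this page, while the "Question:/Answer:" template and the 1.3B checkpoint name are assumptions chosen for this sketch.

# Zero-shot question answering as text completion (illustrative prompt format).
from transformers import AutoTokenizer, OPTForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-1.3b")
model = OPTForCausalLM.from_pretrained("facebook/galactica-1.3b")

prompt = "Question: What is the function of hemoglobin?\n\nAnswer:"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
outputs = model.generate(input_ids, max_new_tokens=60)
print(tokenizer.decode(outputs[0]))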
Citation Prediction
The model predicts which papers a given piece of text is likely to cite. It can be trained to perform citation prediction tasks using machine learning techniques such as deep learning, neural networks, and natural language processing. This task can be useful for researchers and scientists looking to identify relevant papers to cite in their own work.
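In practice, citation prediction can be prompted with the [START_REF] token used in the sample code later on this page; the sketch below simply asks the model to complete a reference (the 1.3B checkpoint name is an assumption).

# Prompt-based citation prediction: [START_REF] cues the model to emit a reference.
from transformers import AutoTokenizer, OPTForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-1.3b")
model = OPTForCausalLM.from_pretrained("facebook/galactica-1.3b")

prompt = "Multi-head attention was introduced in the Transformer paper [START_REF]"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
outputs = model.generate(input_ids, max_new_tokens=40)
print(tokenizer.decode(outputs[0]))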
Chemical Understanding
The model understands the properties and behavior of different chemicals. Based on available data and knowledge, the model can be trained to perform chemical understanding tasks by analyzing the structure, properties, and behavior of different chemicals. This can be useful in chemistry, materials science, and drug discovery.
IUPAC Name Prediction
The model predicts the correct International Union of Pure and Applied Chemistry (IUPAC) name for a given chemical compound. This model can be trained to perform IUPAC name prediction tasks by analyzing the chemical structure of a given compound and identifying the correct IUPAC name based on available rules and knowledge.
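The paper describes wrapping molecules in special SMILES markers; the prompt wording below (and the checkpoint name) is an illustrative assumption rather than the exact evaluation prompt used by the authors.

# Illustrative IUPAC-name prompt: the SMILES string (aspirin here) is wrapped
# in Galactica's special SMILES tokens; the surrounding wording is assumed.
from transformers import AutoTokenizer, OPTForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-1.3b")
model = OPTForCausalLM.from_pretrained("facebook/galactica-1.3b")

prompt = "[START_I_SMILES]CC(=O)Oc1ccccc1C(=O)O[END_I_SMILES] The IUPAC name of this molecule is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
outputs = model.generate(input_ids, max_new_tokens=30)
print(tokenizer.decode(outputs[0]))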
Drug Discovery Tasks
This task involves identifying and designing new drugs that can be used to treat various diseases. The model can be trained to perform drug discovery tasks by analyzing large amounts of data related to drug compounds, their properties, and their interactions with various biological systems. This task can be useful for pharmaceutical companies and researchers looking to develop new and effective drugs.
Biological Understanding
It is the task of understanding the properties and behavior of living organisms. The model can be trained to perform biological understanding tasks by analyzing data related to various biological systems, such as genetics, physiology, and biochemistry. This task can be useful in biology, medicine, and biotechnology.
Protein Keyword Prediction
The model predicts the keywords or terms associated with a given protein sequence. It can be trained to perform protein keyword prediction tasks by analyzing the structure and properties of different proteins and identifying the relevant keywords based on available data and knowledge.
Functional Keyword Prediction
The model predicts different biological systems' functional properties and behavior based on available data and knowledge. The model can be trained to perform functional keyword prediction tasks across various domains, such as genetics, physiology, and biochemistry. This task can be useful in biology, medicine, and biotechnology.
Fine-tuning
The Galactica language model can be fine-tuned using various techniques to adapt it to specific tasks or domains. Here are some fine-tuning techniques that can be applied to the Galactica language model:
Feature-based fine-tuning
This involves adding task-specific features to the pre-trained model and training it on the task-specific dataset.
Transfer learning
This involves fine-tuning the pre-trained model on a smaller dataset related to the target task.
Domain adaptation
It involves fine-tuning the pre-trained model on a domain-specific dataset to improve its performance on a specific domain.
Multi-task learning
It involves training the pre-trained model on multiple tasks simultaneously to improve performance.
Curriculum learning
This involves gradually increasing the difficulty of the training examples during fine-tuning to help the model learn more effectively.
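Of these techniques, domain adaptation is the most direct to sketch: continue causal language-model training on a domain-specific corpus. The snippet below is a heavily simplified sketch using the Hugging Face Trainer; the corpus file name, block size, and hyperparameters are placeholders, not values from the Galactica paper.

# Simplified domain-adaptive fine-tuning sketch (causal language modeling).
# The corpus file, block size, and training hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (AutoTokenizer, OPTForCausalLM, Trainer,
                          TrainingArguments, default_data_collator)

checkpoint = "facebook/galactica-125m"   # smallest variant, cheapest to fine-tune
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = OPTForCausalLM.from_pretrained(checkpoint)

# Placeholder corpus: one document per line in a plain-text file.
raw = load_dataset("text", data_files={"train": "domain_corpus.txt"})

block_size = 512

def tokenize(batch):
    return tokenizer(batch["text"])

def group_texts(examples):
    # Concatenate token ids and split into fixed-size blocks so that no
    # padding (and therefore no pad token) is needed.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total = (len(concatenated["input_ids"]) // block_size) * block_size
    result = {k: [v[i:i + block_size] for i in range(0, total, block_size)]
              for k, v in concatenated.items()}
    result["labels"] = [ids[:] for ids in result["input_ids"]]
    return result

tokenized = raw["train"].map(tokenize, batched=True, remove_columns=["text"])
lm_dataset = tokenized.map(group_texts, batched=True)

args = TrainingArguments(output_dir="galactica-domain-finetuned",
                         per_device_train_batch_size=2,
                         num_train_epochs=1)
trainer = Trainer(model=model, args=args, train_dataset=lm_dataset,
                  data_collator=default_data_collator)
trainer.train()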
Benchmarking
The table below reports Galactica's accuracy on mathematics-related knowledge and reasoning subjects (abstract algebra, elementary mathematics, high-school mathematics, college mathematics, and formal logic) across model sizes; higher numbers are better. Accuracy improves consistently with scale, and the 120B-parameter model achieves the best average score.
Model | Params (bn) | Abstract Algebra | Elementary Math | High School Math | College Math | Formal Logic | Average |
--- | --- | --- | --- | --- | --- | --- | --- |
GAL 1.3B | 1.3 | 28% | 27.2% | 26.7% | 30% | 24.6% | 27.1% |
GAL 6.7B | 6.7 | 28% | 28.9% | 26.7% | 36% | 31% | 29.2% |
GAL 30B | 30 | 30% | 30.2% | 26.3% | 36% | 31.7% | 29.9% |
GAL 120B | 120 | 33% | 38.1% | 32.6% | 43% | 32.5% | 35.8% |
Sample Code 1
Running the model on a CPU
from transformers import AutoTokenizer, OPTForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-6.7b")
model = OPTForCausalLM.from_pretrained("facebook/galactica-6.7b")

input_text = "The Transformer architecture [START_REF]"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
Sample Code 2
Running the model on a GPU
# pip install accelerate
from transformers import AutoTokenizer, OPTForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-6.7b")
model = OPTForCausalLM.from_pretrained("facebook/galactica-6.7b", device_map="auto")

input_text = "The Transformer architecture [START_REF]"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
Sample Code 3
Running the model on a GPU using different precisions - FP16
# pip install accelerate
import torch
from transformers import AutoTokenizer, OPTForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-6.7b")
model = OPTForCausalLM.from_pretrained("facebook/galactica-6.7b", device_map="auto", torch_dtype=torch.float16)

input_text = "The Transformer architecture [START_REF]"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
Sample Code 4
Running the model on a GPU using different precisions - INT8
# pip install bitsandbytes accelerate
from transformers import AutoTokenizer, OPTForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-6.7b")
model = OPTForCausalLM.from_pretrained("facebook/galactica-6.7b", device_map="auto", load_in_8bit=True)

input_text = "The Transformer architecture [START_REF]"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
Model Limitations
The following points outline the constraints and potential biases associated with the Galactica language model:
- The Galactica model was designed for the scientific domain, which may limit its effectiveness when processing text from other domains.
- The model may have difficulty with rare or novel scientific terminology that is not well represented in its training data.
- The model may also struggle with more nuanced or complex language usage and, like other large language models, can produce confident-sounding but incorrect (hallucinated) statements, so its outputs should be verified.