LLMs Explained: Galactica
Galactica is a large-scale language model developed by Meta AI in collaboration with Papers with Code. It was trained on a corpus of 48 million papers, textbooks and lecture notes, millions of compounds and proteins, scientific websites, encyclopedias, and more, drawn from the curated "NatureBook" dataset. Galactica's performance is competitive with other state-of-the-art language models, making it a promising technology for real-world natural language processing applications. With its exceptional efficiency and performance, Galactica represents a significant step forward in developing NLP systems, enabling researchers and practitioners to tackle complex language-related challenges and unlock new opportunities for innovation.
An Overview of Galactica
Galactica is a large-scale language model developed by Meta AI in collaboration with Papers with Code.
Scales up to 120B parameters
120B parameters
The model has multiple architecture variations, ranging from the base architecture with 125M parameters to larger models with up to 120B parameters.
Outperforms the latest GPT-3 on various NLP tasks
Outperforms GPT-3
On technical knowledge probes such as LaTeX equations, the Galactica model outperforms the latest GPT-3 large language model by 68.2% versus 49.0%.
Outperforms BLOOM on various NLP tasks
Outperforms BLOOM
Galactica outperforms BLOOM and OPT-175B on BIG-bench. It also sets a new state-of-the-art on downstream tasks such as PubMedQA and MedMCQA.
- Introduction
- Business Applications
- Model Features
- Model Tasks
- Fine-tuning
- Benchmarking
- Sample Codes
- Limitations
- Other LLMs
About the Model
Galactica uses a Transformer architecture in a decoder-only setup, with multiple self-attention layers that capture long-range dependencies between tokens, and it is trained to predict the next token over a large curated scientific corpus. This design lets it outperform existing models on a range of scientific tasks. On technical knowledge probes such as LaTeX equations, Galactica outperforms the latest GPT-3 by 68.2% versus 49.0%. Galactica also performs well on reasoning, outperforming Chinchilla on mathematical MMLU by 41.3% to 35.7%, and PaLM 540B on MATH with a score of 20.4% versus 8.8%. It also sets a new state-of-the-art on downstream tasks such as PubMedQA and MedMCQA dev, with scores of 77.6% and 52.9% respectively. And despite not being trained on a general corpus, Galactica outperforms BLOOM and OPT-175B on BIG-bench.
Model highlights
The following are the key highlights of the Galactica model.
- Can store, combine, and reason about scientific knowledge.
- Trained on a large scientific corpus of papers, reference material, knowledge bases, and other sources.
- Outperforms existing models on various scientific tasks, including technical knowledge probes, reasoning, and downstream tasks such as PubMedQA and MedMCQA dev.
- Outperforms general-purpose models such as BLOOM and OPT-175B on BIG-bench despite not being trained on a general corpus.
- Demonstrates the potential for language models as a new interface for science.
- Open-sourced for the benefit of the scientific community.
Training Details
Training data
Galactica models are trained on 106 billion tokens of publicly available scientific text and data. Papers, textbooks, scientific websites, encyclopedias, reference material, knowledge bases, and other materials fall into this category.
Training Procedure
Galactica was trained as a self-supervised, left-to-right language model: it learns to predict the next token over its curated scientific corpus. According to the paper, the models were trained for roughly 450 billion tokens, corresponding to about 4.25 epochs over the corpus.
Training dataset size
According to the paper, the Galactica training corpus contains roughly 106 billion tokens of scientific text and data, gathered from publicly accessible sources such as papers, reference material, knowledge bases, scientific websites, and code.
Training time and resources
The training time and resources required for Galactica depend on the model size. All variants were trained on the same scientific corpus, with the largest 120B-parameter model requiring by far the most compute and memory; the smaller variants (125M to 30B parameters) are correspondingly cheaper to train and to deploy.
Model Types
The Galactica language model has several architecture variations with varying numbers of parameters. Here's a brief explanation of each of them:
Model | Parameters | Highlights |
--- | --- | --- |
GAL 125M | 125 million | Smallest variant; fast to run and useful for experimentation |
GAL 1.3B | 1.3 billion | Base model for general scientific NLP tasks |
GAL 6.7B | 6.7 billion | More capacity and improved performance |
GAL 30B | 30 billion | Captures more complex patterns |
GAL 120B | 120 billion | Best performance, but requires the most computational resources |
Business Applications
Galactica is a scientific language model whose capabilities support a range of business applications in research-intensive industries such as biotechnology, pharmaceuticals, chemistry, and academic publishing. The table below lists representative tasks, the industries where they apply, and example use cases.
Task | Business Use Cases | Examples |
--- | --- | --- |
Protein Function Description | Biotechnology, Medical Research | Describing the function of proteins, predicting the impact of genetic mutations, designing new therapies |
Functional Keyword Prediction | Biotechnology, Medical Research | Predicting the function of genes, identifying disease biomarkers, designing new therapies |
Protein Keyword Prediction | Biotechnology, Medical Research | Identifying the function of unknown proteins, predicting protein interactions, designing new treatments |
Drug Discovery Tasks | Pharmaceutical Industry, Biotechnology | Identifying potential drug candidates, predicting drug interactions, optimizing drug delivery systems |
IUPAC Name Prediction | Chemical Industry, Drug Discovery | Generating standardized names for chemicals, identifying potential drug candidates |
Empirical Citation Distribution | Scientific Research, Academic Publishing | Analyzing citation patterns in scientific literature, identifying emerging areas of research, tracking the impact of individual researchers |
Citation Prediction | Scientific Research, Academic Publishing | Predicting which papers are likely to be cited in the future, identifying influential research, recommending articles for readers |
Knowledge Probing | Knowledge Management, Education | Generating summaries of long texts, extracting relevant information from documents, assisting in academic research |
Model Features
The Galactica language model has several unique features contributing to its state-of-the-art performance on various natural language processing tasks. These features include:
GeLU Activation
The model uses GELU activations for all model sizes. GELU (Gaussian Error Linear Unit) is an activation function commonly used in deep learning models. It multiplies the input by the Gaussian cumulative distribution function (CDF) evaluated at that input, giving a smooth approximation of the ReLU (Rectified Linear Unit) that is differentiable for all input values.
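For illustration, here is a minimal PyTorch sketch of GELU computed from the Gaussian CDF; it is a generic reference implementation, not code taken from Galactica itself.

# GELU(x) = x * Phi(x), where Phi is the standard Gaussian CDF.
# Generic reference implementation for illustration only.
import math
import torch

def gelu(x: torch.Tensor) -> torch.Tensor:
    # The Gaussian CDF expressed via the error function.
    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))

x = torch.linspace(-3.0, 3.0, steps=7)
print(gelu(x))                      # smooth and differentiable everywhere
print(torch.nn.functional.gelu(x))  # PyTorch's built-in GELU gives matching values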
Context Window
The model uses a 2048-length context window for all model sizes. Large language models use a context window to consider the surrounding words when predicting the next word or generating text. The context window refers to the number of words or tokens the model considers when making predictions. A larger context window can help the model capture more contextual information.
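If one of the publicly released Hugging Face checkpoints is available (checkpoint names follow the pattern used in the sample code further down this page), the configured context length can be inspected directly; reading it from the OPT-style max_position_embeddings field is an assumption of this sketch.

# Inspect the configured maximum sequence length of a released checkpoint.
# Assumes the Hugging Face checkpoint name and the OPT-style config field.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("facebook/galactica-125m")
print(config.max_position_embeddings)  # expected to report 2048, per the paper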
No Biases
Similar to the PaLM model, Galactica does not use biases in any of the dense kernels or layer norms.
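As a generic illustration (not Galactica's internal code), a dense projection without a bias term looks like this in PyTorch:

# A bias-free dense layer: only a weight matrix is learned, no bias vector.
import torch
import torch.nn as nn

dense = nn.Linear(768, 768, bias=False)
x = torch.randn(1, 16, 768)     # a batch of one 16-token sequence of 768-d vectors
print(dense(x).shape)           # torch.Size([1, 16, 768])
print(dense.bias)               # None: no bias parameter is allocated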
Learned Positional Embeddings
In order to process sequences of text, Galactica uses positional embeddings to encode the position of each token in the sequence. Traditional approaches use fixed embeddings, but newer models use learned positional embeddings, which allow the model to better capture the relationships between tokens.
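A minimal, generic sketch of learned positional embeddings is shown below; the dimensions are illustrative and this is not Galactica's actual implementation.

# Learned positional embeddings: positions index into a trainable table.
import torch
import torch.nn as nn

max_positions, d_model = 2048, 768                      # 2048 matches Galactica's context window
position_table = nn.Embedding(max_positions, d_model)   # updated during training

positions = torch.arange(16).unsqueeze(0)               # positions 0..15 for a 16-token input
pos_vectors = position_table(positions)
print(pos_vectors.shape)  # torch.Size([1, 16, 768]); added to the token embeddings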
Vocabulary
The Galactica team constructed a vocabulary of 50k tokens using byte-pair encoding (BPE). The vocabulary was generated from a randomly selected 2% subset of the training data. Large language models have much larger vocabularies than traditional models, which allows them to capture a wider range of language and produce more natural-sounding text.
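To see how the 50k-token BPE vocabulary splits scientific text, the tokenizer from a released checkpoint can be inspected; the checkpoint name and the example sentence below are illustrative assumptions.

# Inspect Galactica's BPE tokenizer (assumed Hugging Face checkpoint name).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-125m")
print(tokenizer.vocab_size)               # on the order of 50k tokens

text = "The Schwarzschild radius is r_s = 2GM/c^2."
print(tokenizer.tokenize(text))           # the BPE pieces
print(tokenizer(text).input_ids)          # the corresponding token ids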
Model Tasks
Galactica is a language model designed for scientific and technical writing, and it can be fine-tuned for specific domains such as chemistry, biology, or medicine. The model's training datasets include PubMed, CORD-19, ChEMBL, PubChem, UniProt, and Gene Ontology. This training makes Galactica a powerful tool for scientific research, drug discovery, and technical writing.
Knowledge Probing
Knowledge probing refers to evaluating a language model's knowledge and understanding of a particular domain. This task involves presenting the model with various knowledge-based questions, which it should be able to answer accurately based on its understanding of the domain. The model can be trained to perform knowledge-probing tasks across various domains, such as science, history, literature, and more.
Question Answering
It is a task in which a language model is given a question and must provide an accurate answer. The model can be trained to perform question-answering tasks across various domains, such as science, history, literature, and more. It can answer questions accurately based on understanding the underlying domain and available knowledge.
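As an illustration, a question can be posed as plain text completion; the loading code mirrors the sample code later on this page, while the "Question:/Answer:" template and the 1.3B checkpoint name are assumptions chosen for this sketch.

# Zero-shot question answering as text completion (illustrative prompt format).
from transformers import AutoTokenizer, OPTForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-1.3b")
model = OPTForCausalLM.from_pretrained("facebook/galactica-1.3b")

prompt = "Question: What is the function of hemoglobin?\n\nAnswer:"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
outputs = model.generate(input_ids, max_new_tokens=60)
print(tokenizer.decode(outputs[0]))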
Citation Prediction
The model predicts which papers a given piece of text is likely to cite. It can be trained to perform citation prediction tasks using machine learning techniques such as deep learning, neural networks, and natural language processing. This task can be useful for researchers and scientists looking to identify relevant papers to cite in their own work.
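In practice, citation prediction can be prompted with the [START_REF] token used in the sample code later on this page; the sketch below simply asks the model to complete a reference (the 1.3B checkpoint name is an assumption).

# Prompt-based citation prediction: [START_REF] cues the model to emit a reference.
from transformers import AutoTokenizer, OPTForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-1.3b")
model = OPTForCausalLM.from_pretrained("facebook/galactica-1.3b")

prompt = "Multi-head attention was introduced in the Transformer paper [START_REF]"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
outputs = model.generate(input_ids, max_new_tokens=40)
print(tokenizer.decode(outputs[0]))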
Chemical Understanding
The model understands the properties and behavior of different chemicals. Based on available data and knowledge, the model can be trained to perform chemical understanding tasks by analyzing the structure, properties, and behavior of different chemicals. This can be useful in chemistry, materials science, and drug discovery.
IUPAC Name Prediction
The model predicts the correct International Union of Pure and Applied Chemistry (IUPAC) name for a given chemical compound. This model can be trained to perform IUPAC name prediction tasks by analyzing the chemical structure of a given compound and identifying the correct IUPAC name based on available rules and knowledge.
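The paper describes wrapping molecules in special SMILES markers; the prompt wording below (and the checkpoint name) is an illustrative assumption rather than the exact evaluation prompt used by the authors.

# Illustrative IUPAC-name prompt: the SMILES string (aspirin here) is wrapped
# in Galactica's special SMILES tokens; the surrounding wording is assumed.
from transformers import AutoTokenizer, OPTForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-1.3b")
model = OPTForCausalLM.from_pretrained("facebook/galactica-1.3b")

prompt = "[START_I_SMILES]CC(=O)Oc1ccccc1C(=O)O[END_I_SMILES] The IUPAC name of this molecule is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
outputs = model.generate(input_ids, max_new_tokens=30)
print(tokenizer.decode(outputs[0]))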
Drug Discovery Tasks
This task involves identifying and designing new drugs that can be used to treat various diseases. The model can be trained to perform drug discovery tasks by analyzing large amounts of data related to drug compounds, their properties, and their interactions with various biological systems. This task can be useful for pharmaceutical companies and researchers looking to develop new and effective drugs.
Biological Understanding
It is the task of understanding the properties and behavior of living organisms. The model can be trained to perform biological understanding tasks by analyzing data related to various biological systems, such as genetics, physiology, and biochemistry. This task can be useful in biology, medicine, and biotechnology.
Protein Keyword Prediction
The model predicts the keywords or terms associated with a given protein sequence. It can be trained to perform protein keyword prediction tasks by analyzing the structure and properties of different proteins and identifying the relevant keywords based on available data and knowledge.
Functional Keyword Prediction
The model predicts different biological systems' functional properties and behavior based on available data and knowledge. The model can be trained to perform functional keyword prediction tasks across various domains, such as genetics, physiology, and biochemistry. This task can be useful in biology, medicine, and biotechnology.
Fine-tuning
The Galactica language model can be fine-tuned using various techniques to adapt it to specific tasks or domains. Here are some fine-tuning techniques that can be applied to the Galactica language model:
Feature-based fine-tuning
This involves adding task-specific features to the pre-trained model and training it on the task-specific dataset.
Transfer learning
This involves fine-tuning the pre-trained model on a smaller dataset related to the target task.
Domain adaptation
It involves fine-tuning the pre-trained model on a domain-specific dataset to improve its performance on a specific domain.
Multi-task learning
It involves training the pre-trained model on multiple tasks simultaneously to improve performance.
Curriculum learning
This involves gradually increasing the difficulty of the training examples during fine-tuning to help the model learn more effectively.
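Of these techniques, domain adaptation is the most direct to sketch: continue causal language-model training on a domain-specific corpus. The snippet below is a heavily simplified sketch using the Hugging Face Trainer; the corpus file name, block size, and hyperparameters are placeholders, not values from the Galactica paper.

# Simplified domain-adaptive fine-tuning sketch (causal language modeling).
# The corpus file, block size, and training hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (AutoTokenizer, OPTForCausalLM, Trainer,
                          TrainingArguments, default_data_collator)

checkpoint = "facebook/galactica-125m"   # smallest variant, cheapest to fine-tune
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = OPTForCausalLM.from_pretrained(checkpoint)

# Placeholder corpus: one document per line in a plain-text file.
raw = load_dataset("text", data_files={"train": "domain_corpus.txt"})

block_size = 512

def tokenize(batch):
    return tokenizer(batch["text"])

def group_texts(examples):
    # Concatenate token ids and split into fixed-size blocks so that no
    # padding (and therefore no pad token) is needed.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total = (len(concatenated["input_ids"]) // block_size) * block_size
    result = {k: [v[i:i + block_size] for i in range(0, total, block_size)]
              for k, v in concatenated.items()}
    result["labels"] = [ids[:] for ids in result["input_ids"]]
    return result

tokenized = raw["train"].map(tokenize, batched=True, remove_columns=["text"])
lm_dataset = tokenized.map(group_texts, batched=True)

args = TrainingArguments(output_dir="galactica-domain-finetuned",
                         per_device_train_batch_size=2,
                         num_train_epochs=1)
trainer = Trainer(model=model, args=args, train_dataset=lm_dataset,
                  data_collator=default_data_collator)
trainer.train()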
Benchmarking
The table below reports Galactica's accuracy on mathematics-related knowledge and reasoning subjects (abstract algebra, elementary mathematics, high-school mathematics, college mathematics, and formal logic) across model sizes; higher numbers are better. Accuracy improves consistently with scale, and the 120B-parameter model achieves the best average score.
Model | Params (bn) | Abstract Algebra | Elementary Math | High School Math | College Math | Formal Logic | Average |
--- | --- | --- | --- | --- | --- | --- | --- |
GAL 1.3B | 1.3 | 28% | 27.2% | 26.7% | 30% | 24.6% | 27.1% |
GAL 6.7B | 6.7 | 28% | 28.9% | 26.7% | 36% | 31% | 29.2% |
GAL 30B | 30 | 30% | 30.2% | 26.3% | 36% | 31.7% | 29.9% |
GAL 120B | 120 | 33% | 38.1% | 32.6% | 43% | 32.5% | 35.8% |
Sample Code 1
Running the model on a CPU
from transformers import AutoTokenizer, OPTForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-6.7b")
model = OPTForCausalLM.from_pretrained("facebook/galactica-6.7b")

input_text = "The Transformer architecture [START_REF]"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
Sample Code 2
Running the model on a GPU
# pip install accelerate
from transformers import AutoTokenizer, OPTForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-6.7b")
model = OPTForCausalLM.from_pretrained("facebook/galactica-6.7b", device_map="auto")

input_text = "The Transformer architecture [START_REF]"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
Sample Code 3
Running the model on a GPU using different precisions - FP16
# pip install accelerate
import torch
from transformers import AutoTokenizer, OPTForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-6.7b")
model = OPTForCausalLM.from_pretrained("facebook/galactica-6.7b", device_map="auto", torch_dtype=torch.float16)

input_text = "The Transformer architecture [START_REF]"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
Sample Code 4
Running the model on a GPU using different precisions - INT8
# pip install bitsandbytes accelerate
from transformers import AutoTokenizer, OPTForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-6.7b")
model = OPTForCausalLM.from_pretrained("facebook/galactica-6.7b", device_map="auto", load_in_8bit=True)

input_text = "The Transformer architecture [START_REF]"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
Model Limitations
The following points outline the constraints and potential biases associated with the Galactica language model:
- The Galactica model was designed for the scientific domain, which may limit its effectiveness when processing text from other domains.
- The model may have difficulty with rare or novel scientific terminology that is not well represented in its training data.
- The model may also struggle with more nuanced or complex language usage and, like other large language models, can produce confident-sounding but incorrect (hallucinated) statements, so its outputs should be verified.