
LLMs Explained,
GLM 130B
GLM is a General Language Model pretrained with an autoregressive blank-infilling objective that can be fine-tuned on a variety of natural language understanding and generation tasks. The model is trained on a diverse and extensive corpus of text data. Its largest variant, GLM-130B, with 130 billion parameters, has demonstrated cutting-edge performance on tasks such as question answering, sentiment analysis, and machine translation. Across a wide range of natural language understanding and conditional and unconditional generation tasks, GLM outperforms BERT, T5, and GPT given comparable model sizes and data.
An Overview of GLM
GLM's largest variant, GLM-130B, is a large-scale language model with 130 billion parameters, trained on a diverse and extensive corpus of text data.

130B parameters
GLM variants range from 110 million to 130 billion parameters. The largest, GLM-130B, is a pre-trained language model with 130 billion parameters, capable of capturing complex linguistic patterns and nuances.

Bilingual language model
GLM-130B is trained on both English and Chinese and is evaluated on Chinese benchmarks as a bilingual LLM. Its Chinese pre-training data comprise 1.0 TB from WudaoCorpora and 250 GB of additional Chinese corpora.

Outperforms ERNIE TITAN 3.0
GLM-130B consistently outperforms ERNIE Titan 3.0 260B, the largest existing Chinese language model, performing at least 260% better on two abstractive MRC datasets.
Model Details
GLM is a General Language Model pretrained with an autoregressive blank-infilling objective that can be fine-tuned on a variety of natural language understanding and generation tasks. Its largest variant, GLM-130B, with 130 billion parameters, is trained on a diverse and extensive corpus of text data and has achieved state-of-the-art performance on tasks such as question answering, sentiment analysis, and machine translation. GLM-130B uses a Transformer-based architecture similar to other large-scale language models such as GPT-3. It is a dense bidirectional model pre-trained on over 400 billion tokens using a cluster of 96 NVIDIA DGX-A100 (8×40G) GPU nodes. GLM-130B outperforms GPT-3 175B on a wide range of popular English benchmarks, an advantage not observed for OPT-175B and BLOOM-176B. It also consistently and significantly outperforms ERNIE TITAN 3.0 260B, the largest Chinese language model, across related benchmarks.
Model Type: Transformer-based language model with autoregressive blank infilling
Language(s): English, Chinese
License: GLM: MIT license, GLM-130B: Apache-2.0
Model highlights
GLM-130B has demonstrated cutting-edge performance on a variety of language tasks, including question answering, sentiment analysis, and machine translation. It has also been used to generate high-quality text in a variety of contexts, from news article generation to poetry composition. The following are the key highlights of the GLM-130B language model:
- Bilingual (English and Chinese) pre-trained language model with 130 billion parameters
- Performance is better than GPT-3 175B davinci (+5.0%), OPT-175B (+6.5%), and BLOOM-176B (+13.0%) on LAMBADA and slightly better than GPT-3 175B (+0.9%) on MMLU.
- Performance for Chinese is significantly better than ERNIE TITAN 3.0 260B on 7 zero-shot CLUE datasets (+24.26%) and 5 zero-shot FewCLUE datasets (+12.75%).
- The model supports fast inference on both SwissArmyTransformer (SAT) and FasterTransformer (up to 2.5X faster) with a single A100 server.
- Consistently and significantly outperforms ERNIE TITAN 3.0 260B - the largest Chinese language model - across related benchmarks
- Effective inference on 4×RTX 3090 or 8×RTX 2080 Ti GPUs, the most affordable GPUs ever required for using 100B-scale models
- GLM-130B model weights are publicly accessible. Code, training logs, related toolkits, and lessons learned are open-sourced.

Training Details
Training data
The pre-training data comprise 1.2 TB of the English Pile corpus, 1.0 TB of the Chinese WudaoCorpora, and 250 GB of additional Chinese corpora (online forums, encyclopedias, and QA data) crawled from the web, forming a balanced composition of English and Chinese content.


Training infrastructure
GLM-130B was trained on a cluster of 96 DGX-A100 GPU (8×40G) servers with 60-day access. During this period, the team trained GLM-130B on 400 billion tokens with a fixed sequence length of 2,048 tokens per sample.


Training Objective
GLM-130B's pre-training objective combines self-supervised GLM autoregressive blank infilling with multi-task instruction pre-training on a small portion of tokens, which is expected to boost its downstream zero-shot performance.


Training Observation
The training period spanned two months. Notably, GLM-130B reaches INT4 weight quantization without any post-training quantization, and the INT4 version shows negligible performance degradation compared to its uncompressed original.
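As a rough illustration of weight-only INT4 quantization (not the actual kernels in the GLM-130B repository, which pack two 4-bit values per byte and dequantize inside the matmul), the sketch below applies per-row absmax round-to-nearest quantization to a weight matrix and reports the reconstruction error; helper names such as `quantize_int4` are ours.

```python
import torch

def quantize_int4(weight: torch.Tensor):
    """Per-row absmax round-to-nearest quantization to signed 4-bit integers.

    Illustrative only: real INT4 kernels pack two 4-bit values per byte and
    dequantize on the fly inside the matrix multiplication.
    """
    qmax = 7  # symmetric part of the signed int4 range [-8, 7]
    scale = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(weight / scale), -8, 7).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

if __name__ == "__main__":
    w = torch.randn(4096, 4096)          # stand-in for one weight matrix
    q, scale = quantize_int4(w)
    w_hat = dequantize(q, scale)
    rel_err = (w - w_hat).norm() / w.norm()
    print(f"relative reconstruction error: {rel_err.item():.4f}")
```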


GLM Model Types
The GLM language model has several architectural variants with different numbers of parameters. They are listed below, followed by a short loading sketch for the smaller, publicly hosted checkpoints.
| Model | Parameters |
| --- | --- |
| GLM-Base | 110 Million |
| GLM-Large | 335 Million |
| GLM-Large-Chinese | 335 Million |
| GLM-Doc | 335 Million |
| GLM-410M | 410 Million |
| GLM-515M | 515 Million |
| GLM-2B | 2 Billion |
| GLM-10B | 10 Billion |
| GLM-10B-Chinese | 10 Billion |
| GLM-130B | 130 Billion |
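The smaller GLM checkpoints are hosted on Hugging Face under the THUDM organization and can be used for blank infilling directly. The sketch below follows the published model cards for checkpoints such as `THUDM/glm-10b`; helpers like `build_inputs_for_generation` and `eop_token_id` come from the repository's custom code loaded via `trust_remote_code=True`, so exact names and arguments may differ between versions.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# GLM checkpoints ship custom modeling/tokenization code, hence trust_remote_code.
tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-10b", trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained("THUDM/glm-10b", trust_remote_code=True)
model.eval()  # in practice you would also move the model to GPU in half precision

# [MASK] marks the blank that GLM fills in autoregressively.
text = "GLM-130B is a bilingual language model trained on [MASK] and Chinese text."
inputs = tokenizer(text, return_tensors="pt")
# Helper from the repo's custom tokenizer code that appends the generation span.
inputs = tokenizer.build_inputs_for_generation(inputs, max_gen_length=64)
outputs = model.generate(**inputs, max_length=256, eos_token_id=tokenizer.eop_token_id)
print(tokenizer.decode(outputs[0].tolist()))
```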
Business Applications
GLM-130B can be used in a variety of business applications that require natural language processing (NLP) capabilities, such as chatbots, virtual assistants, and sentiment analysis.
| Use Case | Description |
| --- | --- |
| Customer Service | The model can understand customer queries and respond to them accurately and efficiently, reducing the need for human intervention. |
| Content Generation | The model can generate content such as articles, reports, and summaries, saving time and effort for content creators. |
| Personalization | The model can personalize user experiences by understanding preferences and interests, making recommendations, and suggesting relevant content. |
| Market Research | The model can analyze large volumes of data from various sources, identify trends, and make predictions, helping businesses make informed decisions. |
| Language Translation | The model can translate content accurately and quickly, making it easier for businesses to communicate with customers and partners around the world. |
Model Features
GLM-130B is a highly modular and flexible language model architecture with a number of technical features designed to improve training stability and model quality. Here are some of its key technical features:
Layer Normalization
To stabilize training, GLM-130B adopts a post-LN configuration, which showed favorable downstream results in preliminary experiments; specifically, post-LN initialized with the newly proposed DeepNorm yields promising training stability.
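Conceptually, DeepNorm replaces the usual post-LN residual `LayerNorm(x + sublayer(x))` with `LayerNorm(alpha * x + sublayer(x))`, where `alpha` (together with a matching down-scaling of sublayer initialization) is a depth-dependent constant from the DeepNorm recipe. A minimal sketch, with `alpha` left as a parameter rather than the exact constant GLM-130B uses:

```python
import torch
import torch.nn as nn

class DeepNormResidual(nn.Module):
    """Post-LN residual block with the DeepNorm residual scaling.

    alpha is a depth-dependent constant from the DeepNorm recipe; the exact
    value used by GLM-130B is set in its training code, so it is a plain
    argument here.
    """
    def __init__(self, sublayer: nn.Module, hidden_size: int, alpha: float):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(hidden_size)
        self.alpha = alpha

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # DeepNorm: LayerNorm(alpha * x + sublayer(x)) instead of LayerNorm(x + sublayer(x))
        return self.norm(self.alpha * x + self.sublayer(x))
```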
Positional Encoding and FFNs
For training stability and downstream performance, GLM-130B uses Rotary Positional Encoding (RoPE) rather than ALiBi. To improve the feed-forward networks (FFNs) in the Transformer, it adopts GLU with the GeLU activation.
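For reference, a GLU feed-forward block with GeLU activation ("GeGLU") computes FFN(x) = (GeLU(x·W1) ⊙ x·W2)·W3. A minimal sketch with illustrative layer sizes, not GLM-130B's actual dimensions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeGLUFeedForward(nn.Module):
    """Feed-forward network using GLU with GeLU activation (GeGLU)."""
    def __init__(self, hidden_size: int, ffn_size: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, ffn_size)  # GeLU-activated gate branch
        self.up_proj = nn.Linear(hidden_size, ffn_size)    # linear branch
        self.down_proj = nn.Linear(ffn_size, hidden_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Elementwise product of the gated and linear projections, then project back down.
        return self.down_proj(F.gelu(self.gate_proj(x)) * self.up_proj(x))
```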
Multi-Task Instruction Pretraining
GLM-130B's pre-training includes a variety of instruction-prompted datasets covering language understanding, generation, and information extraction for multi-task learning. Exposing the model to such data during pre-training can be more helpful than introducing it only at fine-tuning time. Multi-task instruction pre-training (MIP) accounts for only 5% of the tokens and is confined to the pre-training stage to avoid spoiling the model's general abilities.
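As a rough illustration of that token budget (not GLM-130B's actual data pipeline), the toy scheduler below interleaves instruction-prompted samples into a self-supervised stream so that they make up roughly 5% of the tokens seen:

```python
import random

def mix_pretraining_stream(ssl_samples, mip_samples, mip_token_fraction=0.05, seed=0):
    """Yield samples so that ~mip_token_fraction of tokens come from MIP data.

    ssl_samples / mip_samples are lists of token lists; this is a toy scheduler,
    not GLM-130B's actual data loader.
    """
    rng = random.Random(seed)
    ssl_tokens = mip_tokens = 0
    while True:
        total = ssl_tokens + mip_tokens
        # Draw from MIP data whenever its share has fallen below the target.
        if total > 0 and mip_tokens / total < mip_token_fraction:
            sample = rng.choice(mip_samples)
            mip_tokens += len(sample)
        else:
            sample = rng.choice(ssl_samples)
            ssl_tokens += len(sample)
        yield sample
```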
Embedding Layer Gradient Shrink
The model applies a gradient shrink on the embedding layer to overcome loss spikes and thus stabilize GLM-130B training. This technique was first used in the multi-modal transformer CogView.
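In the paper this is a one-line reparameterization of the embedding output: the forward value is unchanged, but the gradient reaching the embedding layer is scaled by a factor alpha (0.1 in GLM-130B):

```python
import torch

def embedding_gradient_shrink(emb: torch.Tensor, alpha: float = 0.1) -> torch.Tensor:
    """Scale the gradient flowing into the embedding layer by alpha.

    The forward value is unchanged; in the backward pass only the `emb * alpha`
    term carries gradient, so the embedding gradient is shrunk by alpha.
    """
    return emb * alpha + emb.detach() * (1.0 - alpha)
```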
Autoregressive Blank Infilling
GLM-130B uses autoregressive blank infilling as its primary pre-training objective: random contiguous spans of the input are masked and then predicted autoregressively.
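At the data level, the idea looks roughly like the sketch below: contiguous spans are sampled, each is replaced by a [MASK] placeholder in "Part A", and the span contents are appended to "Part B", which the model predicts left to right. This is a simplified illustration, not GLM's actual preprocessing code (which, among other things, shuffles the span order in Part B).

```python
import random

def make_blank_infilling_example(tokens, num_spans=2, max_span_len=4, seed=0):
    """Build a toy GLM-style blank-infilling example.

    Part A: the corrupted input, with each masked span replaced by [MASK].
    Part B: the masked spans, each introduced by [sop] and terminated by [eop];
    the model predicts Part B autoregressively.
    """
    rng = random.Random(seed)
    spans, taken = [], set()
    while len(spans) < num_spans:
        length = rng.randint(1, max_span_len)
        start = rng.randint(0, len(tokens) - length)
        if any(i in taken for i in range(start, start + length)):
            continue  # avoid overlapping spans
        taken.update(range(start, start + length))
        spans.append((start, start + length))
    spans.sort()

    part_a, part_b, cursor = [], [], 0
    for start, end in spans:
        part_a.extend(tokens[cursor:start])
        part_a.append("[MASK]")
        # The real objective shuffles span order in Part B; kept in order here for readability.
        part_b.extend(["[sop]"] + tokens[start:end] + ["[eop]"])
        cursor = end
    part_a.extend(tokens[cursor:])
    return part_a, part_b

if __name__ == "__main__":
    sentence = "GLM is pretrained with an autoregressive blank filling objective".split()
    a, b = make_blank_infilling_example(sentence)
    print("Part A:", " ".join(a))
    print("Part B:", " ".join(b))
```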
Model Tasks
GLM-130B, being a large and powerful language model, can perform a wide range of natural language processing tasks, including:
Abstractive summarization
The GLM language model can generate concise summaries of long text by selectively extracting key information and rephrasing it in a coherent and fluent manner.
Question generation
The model can generate questions from a given passage, which can help improve the quality of automated question-answering and information retrieval systems.
Text infilling
The GLM language model can fill in missing words or phrases in a given text, which can be useful for language understanding and completion tasks.
Language Modeling
The model can predict the probability of a sequence of words given the context, which is useful for various downstream tasks such as speech recognition and machine translation.
Question answering
The GLM language model can answer questions based on a given passage, which can be useful for information retrieval and dialogue systems.
Coreference resolution
The model can identify and resolve coreference ambiguities, which is important for natural language understanding and generating coherent text.
Textual entailment
The GLM language model can determine whether one piece of text logically entails another, which can be useful for tasks such as question answering and text classification.
Text generation
The GLM language model can generate coherent and fluent text, which can be useful for various applications such as chatbots and language translation.
Multi-Genre Natural Language Inference
The model can determine whether a premise sentence entails, contradicts, or is neutral with respect to a hypothesis, across text drawn from many different genres and domains.
Semantic Textual Similarity
The GLM language model can determine the degree of similarity between two pieces of text based on their meaning, which is important for various downstream tasks such as text classification and machine translation.
Common sense reasoning
The GLM language model can reason about common sense knowledge and make inferences based on that knowledge, which is important for various downstream tasks such as text classification and question answering.
Pronoun reference ambiguity
The model can resolve ambiguities related to pronoun reference, which is important for natural language understanding and generating coherent text.
Fine-tuning Methods
Fine-tuning is the process of training a pre-trained model on a specific task or dataset to adapt it for that task.
Parameter-Efficient P-Tuning
P-Tuning is a parameter-efficient fine-tuning technique for adapting GLM to downstream tasks. Rather than updating all of the model's weights, the pre-trained backbone is kept frozen and a small set of trainable continuous prompt vectors is learned for the task; the prompts are prepended to the input, and P-Tuning v2 additionally injects them at every Transformer layer. Because only a tiny fraction of the parameters are trained, adapting a 100B-scale model becomes far cheaper in compute and storage while performance stays close to full fine-tuning.
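Below is a minimal sketch of the idea for a generic decoder-style model: trainable prompt embeddings are prepended to the token embeddings while the backbone stays frozen. The wrapper name `PTuningWrapper` and the assumption that the backbone accepts pre-computed embeddings (as Hugging Face models do via `inputs_embeds`) are ours, not part of the GLM codebase.

```python
import torch
import torch.nn as nn

class PTuningWrapper(nn.Module):
    """Freeze a pre-trained model and learn a short continuous prompt.

    Sketch of (shallow) P-Tuning: trainable prompt vectors are prepended to
    the input embeddings; P-Tuning v2 instead injects prompts at every layer.
    """
    def __init__(self, model: nn.Module, embedding: nn.Embedding, prompt_len: int = 16):
        super().__init__()
        self.model = model
        self.embedding = embedding
        for p in self.model.parameters():      # backbone stays frozen
            p.requires_grad = False
        hidden = embedding.embedding_dim
        self.prompt = nn.Parameter(torch.randn(prompt_len, hidden) * 0.02)

    def forward(self, input_ids: torch.Tensor):
        token_emb = self.embedding(input_ids)                      # (B, T, H)
        prompt = self.prompt.unsqueeze(0).expand(input_ids.size(0), -1, -1)
        inputs_embeds = torch.cat([prompt, token_emb], dim=1)      # (B, P+T, H)
        # Assumes the backbone accepts pre-computed embeddings, e.g. via `inputs_embeds`.
        return self.model(inputs_embeds=inputs_embeds)
```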
Benchmark Results
Benchmarking is an important process for evaluating the performance of any language model, including GLM-130B.

[Figure: A comparison between GLM-130B and other 100B-scale LLMs and PaLM 540B. SSL: self-supervised learning; MIP: multi-task instruction pre-training; (A)PE: (absolute) positional encoding; FFN: feed-forward network.]
Limitations
Although GLM-130B is a highly advanced language model, it still has some limitations:
Performance Growth
On BIG-bench benchmark tasks, GLM-130B's performance improves less with additional few-shot examples than GPT-3's does.