

Code LLMs Explained: CodeBERT
CodeBERT is a pre-trained model developed by Microsoft Research, designed to understand and generate code in multiple programming languages as well as natural language text. It is based on the BERT (Bidirectional Encoder Representations from Transformers) architecture, which is a transformer-based model known for its success in natural language understanding tasks.
An Overview of CodeBERT

Achieved 0.428 MAP
On a dataset of 100K Java methods, CodeBERT achieved a mean average precision (MAP) of 0.428, significantly higher than the previous state-of-the-art model.

SOTA performance
CodeBERT achieves state-of-the-art (SOTA) performance in both natural language code search and code documentation generation.

6 programming languages
CodeBERT is the first large NL-PL pre-trained model covering six programming languages. On NL-PL probing, the results show that CodeBERT outperforms previous pre-trained models.


- About Model
- Model Highlights
- Training Details
- Model Types
- Key Results
- Model Features
- Model Tasks
- Fine-tuning
- Benchmark Results
- Sample Codes
- Limitations
- Other LLMs
About Model
CodeBERT is an advanced model created by Microsoft Research that comes pre-trained to comprehend and generate code written in various programming languages and natural language text. It leverages the BERT architecture, a transformer-based model with exceptional performance in natural language processing tasks. This enables CodeBERT to process code in a way that is similar to how humans read and understand natural language text. This means that it can perform tasks such as code retrieval, code generation, and code summarization with high accuracy and efficiency. With its ability to process multilingual code, CodeBERT is an invaluable tool for software developers, researchers, and practitioners who work with code in different languages.
Model Highlights
CodeBERT is a pre-trained model based on the successful transformer-based BERT architecture, adapted for programming and natural language tasks.
- CodeBERT is a bimodal pre-trained model for natural language (NL) and programming language (PL).
- It is useful for downstream NL-PL applications such as natural language code search and code documentation generation.
- CodeBERT employs a Transformer-based neural architecture and a hybrid objective function that incorporates the replaced token detection (RTD) pre-training task.
- Replaced token detection lets the model utilize both "bimodal" data of NL-PL pairs and "unimodal" code-only or text-only data, with the unimodal data helping to learn better generators.
- CodeBERT achieves state-of-the-art performance in natural language code search and code documentation generation.
- On NL-PL probing, the results show that CodeBERT outperforms previous pre-trained models.

Training Details
Training data
CodeBERT is trained on a large-scale dataset with 2.1 million bimodal data points (NL-PL pairs) and 6.4 million unimodal code functions across six programming languages (Python, Java, JavaScript, PHP, Ruby, and Go). The data is derived from publicly accessible open-source GitHub repositories.
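As a concrete illustration of how a bimodal data point looks to the model, the sketch below encodes a docstring and its code as a single sequence; the use of the publicly released microsoft/codebert-base checkpoint and the example pair are assumptions for illustration.

# Minimal sketch: encoding one bimodal NL-PL data point as a single input sequence.
# The "microsoft/codebert-base" checkpoint and the example pair are assumptions.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")

nl = "Return the factorial of n."  # natural language (docstring)
pl = "def factorial(n): return 1 if n == 0 else n * factorial(n - 1)"  # code

# Passing both segments yields one sequence with separator tokens between NL and PL
encoded = tokenizer(nl, pl, truncation=True, max_length=512, return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"][0])[:12])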


Training Procedure
CodeBERT is trained in a cross-modal fashion, leveraging both bimodal NL-PL data and unimodal PL/NL data; it is pre-trained on six programming languages with a new learning objective based on replaced token detection.


Training configuration and hardware
CodeBERT was trained with a batch size of 2,048 and a learning rate of 5e-4. FP16 (mixed precision) was used to train CodeBERT on a single NVIDIA DGX-2 machine, which combines 16 interconnected NVIDIA Tesla V100 GPUs with 32 GB of memory each.


Training time and resources
Training 1,000 batches of data takes 600 minutes with the MLM objective and 120 minutes with the RTD objective. The maximum sequence length is set to 512 tokens, and the maximum number of training steps is set to 100K.


Model Types
CodeBERT is a pre-trained model designed to process and generate source code in multiple programming languages. It leverages several pre-training techniques to improve its code and natural language understanding.
Model | Pre-training objective | Highlight |
CodeBERT (RTD) | Replaced Token Detection (RTD) | Replaces tokens with plausible alternatives from the same vocabulary and trains the model to detect them, capturing contextual information and relationships. |
CodeBERT (MLM) | Masked Language Modeling (MLM) | Randomly masks some tokens and predicts them from the context of the remaining unmasked tokens. |
CodeBERT (MLM+RTD) | MLM + RTD | Combines the MLM and RTD objectives for more effective learning and better performance on downstream tasks. |
Key Results
The researchers constructed a dataset for NL-PL probing to investigate the knowledge learned by CodeBERT. The results show that CodeBERT performs better than previous pre-trained models on NL-PL probing.
Task | Dataset / Metric | Score |
Natural language code retrieval | CodeSearchNet | 76 |
PL probing | CodeSearchNet | 85.66 |
PL probing (preceding context only) | CodeSearchNet | 59.12 |
NL probing | CodeSearchNet | 74.53 |
Code-to-documentation generation | BLEU-4 | 17.83 |
Code-to-NL generation | BLEU | 22.36 |
Model Features
CodeBERT is a bimodal pre-trained model for programming language (PL) and natural language (NL). CodeBERT learns general-purpose representations that support downstream NL-PL applications such as natural language code search and code documentation generation.
Contextual representations
CodeBERT has been pre-trained on a large corpus of programming language code and natural language text, allowing it to learn contextual representations that capture the underlying meaning and syntax of both. This allows the model to perform well on downstream tasks without needing a large amount of task-specific training data.
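As a minimal sketch of how these contextual representations can be extracted in practice, the example below runs a code snippet through the publicly released microsoft/codebert-base checkpoint; the checkpoint choice and the first-token pooling are assumptions, not part of the paper.

# Minimal sketch: extracting contextual representations from CodeBERT.
# The "microsoft/codebert-base" checkpoint and first-token pooling are assumptions.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")
model.eval()

code = "def add(a, b): return a + b"
inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)

token_embeddings = outputs.last_hidden_state   # per-token vectors, shape (1, seq_len, 768)
snippet_embedding = token_embeddings[:, 0, :]  # first-token vector for the whole snippet
print(snippet_embedding.shape)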
Masked Language Modeling
CodeBERT is trained to predict the original word masked in an input sequence as one of its pre-training tasks. The model is given a sequence in which a random subset of the tokens has been replaced with [MASK] tokens, and it is required to predict the original tokens. This task assists the model in learning context-specific representations of code and natural language.
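A minimal sketch of this objective is shown below; the microsoft/codebert-base-mlm checkpoint and the hand-picked masked position are assumptions, since actual pre-training masks a random subset of tokens.

# Minimal sketch of the masked language modeling (MLM) objective.
# The checkpoint and the hand-picked masked position are assumptions;
# real pre-training masks a random subset of tokens.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base-mlm")
model = AutoModelForMaskedLM.from_pretrained("microsoft/codebert-base-mlm")

code = "def add(a, b): return a + b"
inputs = tokenizer(code, return_tensors="pt")

labels = inputs["input_ids"].clone()
masked_position = 5                                             # arbitrary token to hide
inputs["input_ids"][0, masked_position] = tokenizer.mask_token_id
labels[inputs["input_ids"] != tokenizer.mask_token_id] = -100   # score only the masked token

outputs = model(**inputs, labels=labels)
print("MLM loss:", outputs.loss.item())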
Replaced Token Detection (RTD)
CodeBERT uses RTD as another pre-training task to learn contextual representations. Some tokens in the input sequence are replaced with plausible alternatives produced by generator models, and CodeBERT is trained to detect which tokens have been replaced. Because it must distinguish original from replaced tokens across the input sequence, this task helps the model learn to reason about code and natural language in a more nuanced way.
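The released checkpoints expose only the encoder, so the example below is a conceptual sketch of the RTD objective: a per-token binary head on top of the encoder learns to flag replaced tokens. The head, the toy corruption step, and the checkpoint are assumptions; the paper uses small generator models to propose plausible replacements.

# Conceptual sketch of replaced token detection (RTD): a per-token binary head
# over the encoder flags which tokens were replaced. The head and the toy
# corruption step are illustrative assumptions, not the released training code.
import torch
import torch.nn.functional as F
from torch import nn
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
encoder = AutoModel.from_pretrained("microsoft/codebert-base")
rtd_head = nn.Linear(encoder.config.hidden_size, 1)  # one logit per token: replaced or not

code = "def add(a, b): return a + b"
inputs = tokenizer(code, return_tensors="pt")

# Toy corruption: swap one token for a random vocabulary token
# (the paper uses generators to propose plausible replacements instead)
corrupted = inputs["input_ids"].clone()
corrupted[0, 4] = torch.randint(low=5, high=tokenizer.vocab_size, size=(1,))
labels = (corrupted != inputs["input_ids"]).float()   # 1.0 where a token was replaced

hidden = encoder(input_ids=corrupted, attention_mask=inputs["attention_mask"]).last_hidden_state
logits = rtd_head(hidden).squeeze(-1)                 # shape: (1, seq_len)
loss = F.binary_cross_entropy_with_logits(logits, labels)
print("RTD loss:", loss.item())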
Model Tasks
CodeBERT is an advanced pre-trained model developed by Microsoft Research that can perform a wide range of natural language processing (NLP) tasks related to source code.
Natural Language Code Retrieval
CodeBERT can be configured to retrieve relevant code snippets based on a natural language query. This task entails teaching the model to recognize natural language queries and match them to relevant code snippets, allowing it to provide accurate and relevant search results.
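The paper fine-tunes a classifier on top of CodeBERT for code search; as a lighter-weight illustration, the sketch below ranks candidate snippets by cosine similarity between CodeBERT embeddings of the query and the code. The checkpoint, pooling choice, and toy candidates are assumptions, and this is an approximation of the fine-tuned setup rather than a reproduction of it.

# Simplified sketch of natural language code retrieval: rank candidate snippets
# by cosine similarity between CodeBERT embeddings of the query and the code.
# This approximates, rather than reproduces, the paper's fine-tuned classifier.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")
model.eval()

def embed(text: str) -> torch.Tensor:
    """Return the first-token embedding for a query or a code snippet."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        return model(**inputs).last_hidden_state[:, 0, :]

query = "compute the factorial of a number"
candidates = [
    "def factorial(n): return 1 if n == 0 else n * factorial(n - 1)",
    "def read_file(path): return open(path).read()",
]

query_emb = embed(query)
scores = [torch.cosine_similarity(query_emb, embed(code)).item() for code in candidates]
for code, score in sorted(zip(candidates, scores), key=lambda pair: -pair[1]):
    print(round(score, 3), code)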
PL Probing
CodeBERT's ability to perform programming language (PL) tasks, such as predicting variable names, types, or control flow structures, can be assessed. This task entails probing the learned representations of the model to determine how well they capture the syntactic and semantic structure of programming language code.
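In the paper, probing is cast as a cloze-style test: a token in the code is masked and the frozen model must prefer the correct token over a distractor. The sketch below illustrates the idea with the microsoft/codebert-base-mlm checkpoint; the specific code snippet and candidate pair are made-up examples.

# Sketch of cloze-style PL probing: mask one code token and check whether the
# frozen MLM head prefers the correct token over a distractor. The snippet and
# the candidate pair ("max" vs. "min") are made-up examples.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base-mlm")
model = AutoModelForMaskedLM.from_pretrained("microsoft/codebert-base-mlm")
model.eval()

code = f"def largest(values): return {tokenizer.mask_token}(values)"
inputs = tokenizer(code, return_tensors="pt")
mask_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()

with torch.no_grad():
    logits = model(**inputs).logits[0, mask_index]

for candidate in [" max", " min"]:  # correct token vs. distractor (leading space for the BPE vocab)
    token_id = tokenizer(candidate, add_special_tokens=False)["input_ids"][0]
    print(candidate.strip(), logits[token_id].item())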
NL Probing
CodeBERT's understanding of natural language (NL) can also be assessed. This task involves probing the model's learned representations, with the model parameters kept fixed, to see how well they capture the meaning and structure of natural language text such as code documentation.
Code-to-Documentation Generation
CodeBERT can be fine-tuned to generate comments or documentation based on a code snippet. This task entails training the model to generate natural language descriptions of code, which will be useful for developers working with the code.
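The paper pairs the CodeBERT encoder with a Transformer decoder for this task. As a rough sketch of how such a model could be assembled with the transformers library, the example below wires two CodeBERT checkpoints into an EncoderDecoderModel and computes one sequence-to-sequence loss; this construction and the toy code/doc pair are assumptions for illustration, not the paper's released setup, and the model still needs fine-tuning before it can generate useful documentation.

# Rough sketch: assembling an encoder-decoder for code-to-documentation generation
# from CodeBERT weights. An illustrative assumption, not the paper's released
# setup; the decoder cross-attention is newly initialized and needs fine-tuning.
from transformers import AutoTokenizer, EncoderDecoderModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "microsoft/codebert-base", "microsoft/codebert-base"
)
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id

code = "def add(a, b): return a + b"
doc = "Add two numbers and return the result."

inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
labels = tokenizer(doc, return_tensors="pt", truncation=True, max_length=64)["input_ids"]

# One fine-tuning step would minimize this sequence-to-sequence loss
outputs = model(input_ids=inputs["input_ids"],
                attention_mask=inputs["attention_mask"],
                labels=labels)
print("seq2seq loss:", outputs.loss.item())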
Code-to-NL generation
CodeBERT can also be fine-tuned to generate natural language descriptions of a given code snippet. This task entails teaching the model to translate programming language code into natural language, which is useful for tasks such as code summarization and code explanation.
Fine-tuning
CodeBERT was evaluated on two NL-PL applications by fine-tuning its model parameters. The results demonstrate that CodeBERT achieves state-of-the-art performance in both natural language code search and code documentation generation. Additionally, to investigate the type of knowledge learned by CodeBERT, a dataset for NL-PL probing was created, and the model was evaluated in a zero-shot setting with fixed pre-trained model parameters. The results indicate that CodeBERT outperforms previous pre-trained models on NL-PL probing. Fine-tuning methods for CodeBERT will be updated in this section soon.
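As a rough sketch of what such fine-tuning can look like in practice, the example below frames code search as NL-code pair classification and runs a couple of toy training steps; the task framing, checkpoint, toy data, and hyperparameters are all assumptions for illustration.

# Rough sketch of fine-tuning CodeBERT for NL-code relevance classification.
# The task framing, toy data, and hyperparameters are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained("microsoft/codebert-base", num_labels=2)

# Toy (query, code, label) examples: 1 = relevant pair, 0 = irrelevant pair
examples = [
    ("compute factorial", "def factorial(n): return 1 if n == 0 else n * factorial(n - 1)", 1),
    ("compute factorial", "def read_file(path): return open(path).read()", 0),
]

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for query, code, label in examples:
    inputs = tokenizer(query, code, return_tensors="pt", truncation=True, max_length=512)
    outputs = model(**inputs, labels=torch.tensor([label]))
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print("loss:", round(outputs.loss.item(), 4))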
Benchmark Results
Benchmarking is an important process for evaluating the performance of any language model, including CodeBERT. The key results are:
Table 4 of the CodeBERT paper reports results for different models on the code-to-documentation generation task. Models pre-trained on programming language outperform RoBERTa, which illustrates that pre-training on programming language improves code-to-NL generation. Moreover, CodeBERT pre-trained with both the RTD and MLM objectives brings an overall gain of 1.3 BLEU points over RoBERTa and achieves state-of-the-art performance.
Sample Codes
Running the model on a CPU
# Import necessary libraries
from transformers import pipeline

# CodeBERT checkpoints are encoder-only, so this example demonstrates masked-token
# prediction with the MLM-trained checkpoint; code summarization requires an
# encoder-decoder fine-tuned on top of CodeBERT.
model_name = "microsoft/codebert-base-mlm"
tokenizer_name = "microsoft/codebert-base-mlm"

# Create a fill-mask pipeline on the CPU (device=-1)
fill_mask = pipeline("fill-mask", model=model_name, tokenizer=tokenizer_name, device=-1)

# Input code snippet (Python code) with one token masked
code_snippet = "def factorial(n):\n    if n == 0:\n        <mask> 1\n    return n * factorial(n - 1)"

# Predict the most likely tokens for the masked position
predictions = fill_mask(code_snippet)

# Print the top predictions and their scores
for prediction in predictions:
    print(prediction["token_str"], round(prediction["score"], 4))
Running the model on GPU
# Import necessary libraries
import torch
from transformers import pipeline

# Check for GPU availability and set the device
if torch.cuda.is_available():
    device = 0  # index of the first CUDA device
else:
    raise RuntimeError("No GPU found. Please ensure you have a compatible GPU and CUDA libraries installed.")

# CodeBERT checkpoint trained with the masked language modeling (MLM) objective
model_name = "microsoft/codebert-base-mlm"
tokenizer_name = "microsoft/codebert-base-mlm"

# Create a fill-mask pipeline on the GPU
fill_mask = pipeline("fill-mask", model=model_name, tokenizer=tokenizer_name, device=device)

# Input code snippet (Python code) with one token masked
code_snippet = "def factorial(n):\n    if n == 0:\n        <mask> 1\n    return n * factorial(n - 1)"

# Predict the most likely tokens for the masked position
predictions = fill_mask(code_snippet)

# Print the top predictions and their scores
for prediction in predictions:
    print(prediction["token_str"], round(prediction["score"], 4))
Model Limitations
- CodeBERT, like many other transformer-based language models, has a limit on the number of tokens it can process in one pass. The default maximum length of input sequences for CodeBERT is 512 tokens. Any input sequence longer than 512 tokens must be truncated or split into multiple sequences for processing, as sketched below.
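A simple way to respect this limit is to let the tokenizer truncate, or to split long inputs into overlapping chunks; the sketch below shows both options, with the chunk size and stride chosen arbitrarily.

# Sketch of handling the 512-token limit: truncate, or split long inputs into
# overlapping chunks. The chunk size and stride below are arbitrary choices.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
long_code = "\n".join(f"def f{i}(x): return x + {i}" for i in range(400))

# Option 1: truncate anything beyond the maximum sequence length
truncated = tokenizer(long_code, truncation=True, max_length=512)
print("truncated length:", len(truncated["input_ids"]))

# Option 2: split into overlapping 512-token windows with a 64-token stride
chunks = tokenizer(long_code, truncation=True, max_length=512,
                   stride=64, return_overflowing_tokens=True)
print("number of chunks:", len(chunks["input_ids"]))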
Other LLMs

Polycoder
PolyCoder is a large language model for code generation, trained on source code written in multiple programming languages.


CodeGeex
CodeGeeX is a large-scale multilingual code generation model with 13 billion parameters, pre-trained on a large corpus of code in multiple programming languages.


CodeRL
CodeRL is a novel framework for program synthesis tasks that combines pretrained language models (LMs) with deep reinforcement learning.
