

Code LLMs Explained,
InCoder
InCoder is a large-scale generative code model that can synthesize and edit programs by infilling masked code. Trained on permissively licensed code, it can infill arbitrary regions of code, which improves performance on tasks such as type inference and variable renaming. Because it conditions on bidirectional context, the model also performs well on challenging tasks such as comment generation in zero-shot settings. On standard program synthesis benchmarks, it performs comparably to left-to-right models.
An Overview of InCoder
InCoder is a generative model for code infilling and synthesis designed to assist developers in writing and completing code by automatically generating missing or required code segments.

82.43% Accuracy on CodeXGLUE
InCoder achieves 82.43% accuracy on the CodeXGLUE benchmark and is widely used for code-infilling tasks, demonstrating the effectiveness of its neural language modeling approach.

Trained on 159 GB of code in 28 languages
InCoder is trained on a corpus of 159 GB of code spanning 28 programming languages, including 52 GB of Python, along with 57 GB of StackOverflow content.

Trained on 248 NVIDIA V100 GPUs
InCoder achieves strong performance on code infilling and synthesis tasks; the 6.7B-parameter model was trained on 248 NVIDIA V100 GPUs using the PyTorch deep learning framework.


- About Model
- Model Highlights
- Training Details
- Model Types
- Key Results
- Model Features
- Model Tasks
- Fine-tuning
- Benchmark Results
- Sample Codes
- Limitations
- Other LLMs
About Model
InCoder is a generative model for code infilling and synthesis designed to assist developers in writing and completing code by automatically generating missing or required code segments. Based on advanced machine learning algorithms and trained on vast source code repositories, InCoder leverages state-of-the-art natural language processing techniques, such as transformer-based models, to comprehend the underlying structure and semantics of programming languages. This enables the model to perform code infilling and synthesis, allowing developers to quickly prototype new ideas, explore alternative implementations, and learn new programming techniques. InCoder's language-agnostic design and integration with popular integrated development environments (IDEs) and code-editing tools provide a seamless experience for developers, enhancing productivity and code quality across different projects and platforms.
Model Highlights
InCoder is a powerful code infilling and synthesis model that utilizes advanced machine learning techniques and is built on state-of-the-art natural language processing and machine learning models. It is trained on a large dataset of code snippets from various open-source repositories, making it highly accurate and useful for generating high-quality code suggestions.
- InCoder is a unified generative model that can perform program synthesis and editing.
- The model adopts the recently proposed causal masking objective, which combines the strengths of causal and masked language models.
- InCoder is trained on a large corpus of permissively licensed code in which random code regions are masked and moved to the end of each file, enabling bidirectional code infilling.
- The model is the first large generative code model capable of infilling arbitrary regions of code.
- In a zero-shot setting, the model is evaluated on tasks such as type inference, comment generation, and variable renaming.
- The ability to condition on bidirectional context substantially improves performance on these tasks.
- The model performs comparably to left-to-right-only models pretrained at a similar scale on standard program synthesis benchmarks.

Training Details
Training data
The model was trained on public open-source repositories with a permissive, non-copyleft license (Apache 2.0, MIT, BSD-2, or BSD-3) from GitHub and GitLab, as well as on StackOverflow content. The repositories primarily contain Python and JavaScript, but the corpus includes code from 28 languages in total.


Training Procedure
During training, contiguous token spans are randomly masked in each document. The number of spans is sampled from a Poisson distribution with a mean of one and truncated to [1, 256]. Span boundaries are sampled uniformly over the document, and overlapping spans are rejected and resampled. The masked spans are replaced with sentinel tokens and moved to the end of the file, so the model learns to generate them conditioned on both the preceding and following code.
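To make the procedure concrete, the sketch below applies the span-masking transform to a tokenized document. The sentinel strings and sampling details follow the description above; they are illustrative rather than InCoder's released training code.

import numpy as np

# Illustrative sketch of the causal-masking data transform described above.
def causal_mask_document(tokens, max_spans=256, seed=0):
    rng = np.random.default_rng(seed)
    n = len(tokens)
    # Number of spans ~ Poisson(mean=1), truncated to [1, max_spans].
    num_spans = int(min(max(rng.poisson(1), 1), max_spans))

    spans, covered, attempts = [], set(), 0
    while len(spans) < num_spans and attempts < 100:
        attempts += 1
        # Span boundaries are sampled uniformly over the document.
        i, j = sorted(rng.integers(0, n, size=2))
        j = max(j, i + 1)
        if covered.intersection(range(i, j)):
            continue  # overlapping span: reject and resample
        covered.update(range(i, j))
        spans.append((i, j))
    spans.sort()

    # Replace each span with a sentinel and move its contents to the end of the
    # document, so the model can learn to infill it with bidirectional context.
    body, tail, prev = [], [], 0
    for k, (i, j) in enumerate(spans):
        body += tokens[prev:i] + [f"<|mask:{k}|>"]
        tail += [f"<|mask:{k}|>"] + tokens[i:j] + ["<|endofmask|>"]
        prev = j
    body += tokens[prev:]
    return body + tail

print(causal_mask_document("def add ( a , b ) : return a + b".split()))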


Training dataset size
After filtering and deduplication, the training corpus contains 159 GB of code, including 52 GB of Python, plus 57 GB of StackOverflow content.


Training time and resources
The InCoder-6.7B model was trained on 248 V100 GPUs for 24 days, performing a single epoch over the training data (each training document used exactly once). The per-GPU batch size was 8, with a maximum token sequence length of 2048.


Model Types
InCoder is a generative model for code infilling and synthesis that has two pre-trained models with different parameter sizes: incoder-6B and incoder-1B. Both models are decoder-only transformer models trained on code using a causal-masked objective, allowing for both code infilling and standard left-to-right code generation.
Model | Parameters
incoder-6B | 6.7B
incoder-1B | 1B
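Both checkpoints are published on the Hugging Face Hub. The sketch below shows one way to load them, assuming the model IDs facebook/incoder-1B and facebook/incoder-6B, with the larger model loaded in half precision on a GPU to keep memory use manageable.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Smaller checkpoint: convenient for CPU experiments and quick tests.
tokenizer_1b = AutoTokenizer.from_pretrained("facebook/incoder-1B")
model_1b = AutoModelForCausalLM.from_pretrained("facebook/incoder-1B")

# Larger checkpoint: load in float16 on a GPU to reduce memory use.
tokenizer_6b = AutoTokenizer.from_pretrained("facebook/incoder-6B")
model_6b = AutoModelForCausalLM.from_pretrained(
    "facebook/incoder-6B", torch_dtype=torch.float16
).to("cuda")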
Key Results
InCoder is a unified generative model that can synthesize code from scratch (via left-to-right generation) or edit existing code blocks (via infilling). The model is trained on a vast corpus of permissively licensed code, where random code regions are masked and shifted to the end of each file. This approach enables InCoder to generate code infillings with bidirectional context, making it the first generative code model capable of infilling arbitrary code regions.
Task | Dataset | Score
Single-line infilling (L-R single) | HumanEval | 48.2
Single-line infilling (L-R reranking) | HumanEval | 54.9
Single-line infilling (CM infilling) | HumanEval | 69.0
Multi-line infilling (L-R single) | HumanEval | 24.9
Multi-line infilling (L-R reranking) | HumanEval | 28.2
Multi-line infilling (CM infilling) | HumanEval | 38.6
Python docstring generation (avg BLEU) | CodeXGLUE | 17.15
Code generation (pass@100) | HumanEval | 47.0
Code generation (pass@100) | MBPP | 19.4
Model Features
These technical features make InCoder a powerful tool for software development tasks, as it can assist in writing complex code and prototyping new ideas with little additional training data.
Causal masking objective
During training, InCoder masks random contiguous token spans in each document, with the number of spans drawn from a Poisson distribution and overlapping spans rejected and resampled. The masked spans are replaced by sentinel tokens and moved to the end of the document, so the model is still trained left to right yet learns to generate each masked span conditioned on the code both before and after it.
Program synthesis
InCoder can synthesize code from a high-level task description, such as natural language input. This task involves understanding the task semantics and generating a corresponding code block to perform the desired task.
Infilling
InCoder can generate code to fill gaps in an existing code block based on the surrounding context and syntax. This task involves inferring the purpose and functionality of the missing code and generating a corresponding code block to fill the gap. InCoder is the first generative code model capable of infilling arbitrary regions of code.
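As a concrete sketch, the snippet below builds an infilling prompt with sentinel tokens and asks the model to complete the masked region. It assumes the public facebook/incoder-1B checkpoint and the prompt layout used in InCoder's examples (left context, a <|mask:0|> sentinel, right context, a <|mask:1|> sentinel, then <|mask:0|> again to request the infill); treat it as an illustration rather than the exact released interface.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/incoder-1B")
model = AutoModelForCausalLM.from_pretrained("facebook/incoder-1B")

# Code with a gap: the body of the with-block is missing.
left = "def count_lines(path):\n    with open(path) as f:\n        "
right = "\n    return n\n"

# Sentinel-based infilling prompt (layout assumed as described above).
prompt = left + "<|mask:0|>" + right + "<|mask:1|>" + "<|mask:0|>"

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=32,
                            do_sample=True, top_p=0.95, temperature=0.2)

# Keep only the newly generated tokens; the infill ends at <|endofmask|>.
new_tokens = output[0, inputs.input_ids.shape[1]:]
infill = tokenizer.decode(new_tokens, skip_special_tokens=False)
infill = infill.split("<|endofmask|>")[0]
print(left + infill + right)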
Model Tasks
InCoder, as a generative model for code infilling and synthesis, can perform a range of tasks to assist developers and programmers during the coding process:
Single-line infilling
InCoder can generate code to fill a single missing line in an existing code block. This task involves infilling the missing line based on the code's surrounding context and syntactic structure.
Multi-line infilling
InCoder can also generate code to fill multiple missing lines in an existing code block. This task involves infilling a sequence of missing lines in a code block based on the context and structure of the surrounding code.
Python Docstring generation
InCoder can generate Python docstrings based on the function signature and surrounding code. This task involves inferring the purpose and functionality of a function and generating a corresponding docstring.
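As a small sketch of how this can be framed as infilling, the prompt below places the sentinel where the docstring body should go; the example function and the sentinel layout (matching the infilling sketch above) are illustrative assumptions.

# Treat the missing docstring as the masked region of an infilling prompt.
left = 'def count_vowels(text):\n    """'
right = '"""\n    return sum(ch in "aeiou" for ch in text.lower())\n'
prompt = left + "<|mask:0|>" + right + "<|mask:1|>" + "<|mask:0|>"
print(prompt)  # feed this prompt to the model as in the infilling sketch above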
Code generation
InCoder can generate code from a high-level natural language description of a task. This task involves understanding the semantics of the task and translating it into executable code.
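A minimal left-to-right generation sketch under the same assumptions as the earlier examples (the public facebook/incoder-1B checkpoint); passing the task description as a leading comment is one simple prompting pattern, not the only one.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/incoder-1B")
model = AutoModelForCausalLM.from_pretrained("facebook/incoder-1B")

# Natural-language task description followed by a function signature.
prompt = "# Return the n-th Fibonacci number.\ndef fibonacci(n):\n"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=64, do_sample=True,
                            top_p=0.95, temperature=0.2)

print(tokenizer.decode(output[0], skip_special_tokens=True))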
Left-to-right reranking
InCoder reranks left-to-right generation outputs to select the most likely correct output. This task involves generating multiple possible outputs and choosing the most probable one based on the context and syntax.
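The sketch below shows one simple reranking loop under the same assumptions: sample several candidate completions from the left context, score each candidate by the model's log-likelihood of the full document with that candidate inserted, and keep the best one. The helper function and scoring rule are illustrative choices, not the paper's exact reranking implementation.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/incoder-1B")
model = AutoModelForCausalLM.from_pretrained("facebook/incoder-1B")
model.eval()

left = "def square(x):\n    "
right = "\n\nprint(square(4))\n"

def log_likelihood(text):
    # Total log-probability the model assigns to `text`.
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    return -out.loss.item() * (ids.shape[1] - 1)  # loss is mean NLL per token

# 1) Sample several candidate lines using only the left context.
inputs = tokenizer(left, return_tensors="pt")
with torch.no_grad():
    samples = model.generate(**inputs, max_new_tokens=16, do_sample=True,
                             temperature=0.8, num_return_sequences=5)
candidates = set()
for s in samples:
    text = tokenizer.decode(s[inputs.input_ids.shape[1]:], skip_special_tokens=True)
    candidates.add(text.split("\n")[0])  # keep one line per candidate

# 2) Rerank candidates by the likelihood of the completed document.
best = max(candidates, key=lambda c: log_likelihood(left + c + right))
print(best)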
Infilling
InCoder can generate code to fill arbitrary gaps in an existing code block. This task involves inferring the purpose and functionality of the missing code and generating a corresponding code block to fill the gap.
Fine-tuning
On CodeXGLUE Python docstring generation (BLEU), the InCoder model is evaluated in a zero-shot setting, with no fine-tuning for docstring generation, yet it approaches the performance of pretrained code models that are fine-tuned on the task's 250K examples. Fine-tuning would allow InCoder models to better condition on natural language instructions and other indications of human intent. The model also lays a foundation for future work on supervised infilling and editing via fine-tuning, as well as on iterative decoding, where the model is used to refine its own output. Fine-tuning methods for InCoder will be added to this section soon.
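No official fine-tuning recipe is reproduced here, but a minimal causal-language-model fine-tuning loop with the Hugging Face Trainer would look roughly like the sketch below; the checkpoint name, the toy in-memory dataset, and the hyperparameters are illustrative assumptions.

import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("facebook/incoder-1B")
model = AutoModelForCausalLM.from_pretrained("facebook/incoder-1B")
if tokenizer.pad_token is None:
    # Assumption: reuse the end-of-text token for padding during batching.
    tokenizer.pad_token = tokenizer.eos_token or "<|endoftext|>"

# Toy in-memory dataset; in practice this would hold task-specific examples,
# e.g. functions paired with docstrings in the target output format.
snippets = [
    "def add(a, b):\n    return a + b\n",
    "def is_even(n):\n    return n % 2 == 0\n",
]
encodings = [tokenizer(s, truncation=True, max_length=512) for s in snippets]

class CodeDataset(torch.utils.data.Dataset):
    def __len__(self):
        return len(encodings)
    def __getitem__(self, idx):
        return encodings[idx]

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="incoder-finetuned",
                           per_device_train_batch_size=1,
                           num_train_epochs=1,
                           learning_rate=5e-5),
    train_dataset=CodeDataset(),
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()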
Benchmark Results
Benchmarking is an important step in evaluating the performance of any language model, including InCoder. The key results are summarized below.
The InCoder-6.7B model is compared to published code generation systems using pass rates @ K sampled candidates on the HumanEval and MBPP benchmarks. All compared models are decoder-only transformer models. A "Permissive" code license indicates models trained only on open-source repositories with non-copyleft licenses. The GPT-J, GPT-NeoX, and CodeGen models are pre-trained on The Pile, which contains a portion of GitHub code without any license filtering, including 6 GB of Python. Although the LaMDA model does not train on code repositories, its training corpus includes roughly 18B tokens of code from web documents. The total file size of the LaMDA corpus was not reported, but it contains 2.8T tokens in total. The corpus size for PaLM is estimated from the reported size of the code data and the token counts per section of the corpus.
Sample Codes
Running the model on a CPU
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the pre-trained InCoder model and tokenizer (public 1B checkpoint on the Hugging Face Hub)
tokenizer = AutoTokenizer.from_pretrained("facebook/incoder-1B")
model = AutoModelForCausalLM.from_pretrained("facebook/incoder-1B")

# Set the device to CPU
device = torch.device("cpu")
model.to(device)

# Create input data: a code prompt for the model to complete
prompt = "def hello_world():\n"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Generate a completion with the InCoder model
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=32, do_sample=False)

# Print the generated code
print(tokenizer.decode(output[0], skip_special_tokens=True))
Running the model on GPU
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the pre-trained InCoder model and tokenizer (public 1B checkpoint on the Hugging Face Hub)
tokenizer = AutoTokenizer.from_pretrained("facebook/incoder-1B")
model = AutoModelForCausalLM.from_pretrained("facebook/incoder-1B")

# Set the device to GPU if available, else CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Create input data: a code prompt for the model to complete
prompt = "def hello_world():\n"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Generate a completion with the InCoder model
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=32, do_sample=False)

# Print the generated code
print(tokenizer.decode(output[0], skip_special_tokens=True))
Model Limitations
- When the masked region is a single token (e.g., the min/max cloze setting), InCoder's causal-masked infill format (CM infill-token) performs better than using only the left context, but not as well as scoring the entire sequence left to right.
- For the cloze task, infilling with the original tokenization improves performance slightly, but still does not match full left-to-right scoring.
Other LLMs

PolyCoder
PolyCoder is a 2.7B-parameter open-source code generation model trained on source code from 12 programming languages.


CodeGeeX
CodeGeeX is a large-scale multilingual code generation model with 13 billion parameters, pre-trained on a large code corpus covering more than 20 programming languages.


CodeRL
CodeRL is a framework for program synthesis that combines pretrained language models with deep reinforcement learning.
