LLMs Explained: Bloom

The BigScience research workshop unveiled the BigScience Large Open-science Open-access Multilingual Language Model, better known as Bloom. The model follows the GPT-3 architecture and was trained on a dataset spanning 46 natural languages and 13 programming languages. Bloom is the first language model with over 100 billion parameters for many of the languages in its training set. Although BigScience is still evaluating the model, early results suggest that Bloom can perform a range of natural language processing (NLP) tasks via zero-shot learning.


An Overview of Bloom

Bloom is an autoregressive language model trained on a dataset spanning 46 natural languages and 13 programming languages. It was developed through the BigScience workshop, a collaboration of hundreds of researchers from academic and industry organizations worldwide, coordinated by Hugging Face.

It is one of the largest open-access language models available.

  • 176B parameters: Bloom is one of the largest language models with 176B parameters, publicly released under the Responsible AI License and freely available to the public.
  • Trained on 59 languages: the model was trained on the ROOTS corpus, a dataset drawing on hundreds of sources across 46 natural and 13 programming languages.
  • 25 tonnes CO2eq emissions: Bloom was trained on a low-carbon-intensity energy grid, resulting in roughly 25 tonnes of CO2 emissions, very low compared to similar models.


Key highlights

  • Bloom is a large language model with 176B parameters, making it one of the largest language models available.
  • It is an open-access language model whose weights are freely available to the public.
  • It was trained on the ROOTS corpus, a dataset drawing on hundreds of sources in 46 natural and 13 programming languages.
  • It achieves competitive performance on a wide variety of benchmarks.
  • The model and the code used to build it are publicly released under the Responsible AI License, promoting the ethical and responsible use of AI technologies.

Training Details

Training data

Bloom is trained on 46 natural languages and 13 programming languages. The dataset comprised 1.6TB of pre-processed text, amounting to roughly 350B tokens.
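Taken together, these figures imply an average of roughly 4.6 bytes of pre-processed text per training token (a quick sketch, treating TB as decimal terabytes):

```python
# Average bytes of pre-processed text per training token,
# using the 1.6 TB and ~350B-token figures quoted above.
dataset_bytes = 1.6e12   # 1.6 TB, decimal
tokens = 350e9           # ~350B tokens
bytes_per_token = dataset_bytes / tokens
print(round(bytes_per_token, 2))  # about 4.6 bytes per token
```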

Model and checkpoint size

Bloom's bf16 weights occupy 329GB, and the full checkpoint with optimizer states is 2.3TB. The tokenizer vocabulary size is 250,680.
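The weight-file figure can be sanity-checked from the parameter count, since bfloat16 stores each weight in 2 bytes:

```python
# 176B parameters at 2 bytes each (bfloat16), expressed in GiB
params = 176e9               # approximate parameter count
gib = params * 2 / 2**30     # bytes -> GiB
print(round(gib))            # about 328, matching the ~329GB figure
```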

Training Procedure

Bloom's learned subword tokenizer is trained using a byte-level Byte Pair Encoding (BPE) algorithm and a simple pre-tokenization rule, with no normalization step.
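The core of byte-level BPE is simple: start from raw bytes, repeatedly count adjacent symbol pairs, and merge the most frequent pair into a new symbol. A minimal toy sketch of one merge step (not Bloom's actual tokenizer code):

```python
from collections import Counter

def most_frequent_pair(ids):
    """Return the most frequent adjacent pair of symbols."""
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]

def merge(ids, pair, new_id):
    """Replace each occurrence of `pair` with the single symbol `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

# Byte-level BPE starts from raw bytes, so no text normalization is needed.
ids = list(b"low lower lowest")
pair = most_frequent_pair(ids)   # most common adjacent byte pair
ids = merge(ids, pair, 256)      # 256 is the first id beyond the 256 raw bytes
```

Repeating this merge step until the desired vocabulary size is reached (250,680 for Bloom) yields the full tokenizer.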

Training time and resources

Training the model took about four months. Training throughput was about 150 TFLOPs per GPU, and the estimated cost of model training was $2-5M.
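These figures imply a very large total compute budget. A back-of-envelope estimate, using the 384 A100 GPUs reported for the Bloom training run (a figure from the BigScience team, not stated above):

```python
# Rough total training compute from throughput, GPU count, and duration
tflops_per_gpu = 150e12        # FLOPs per second per GPU (from the text)
gpus = 384                     # A100 GPUs reported for the run
seconds = 4 * 30 * 24 * 3600   # ~4 months
total_flops = tflops_per_gpu * gpus * seconds
print(f"{total_flops:.1e}")    # roughly 6e23 FLOPs in total
```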

Model Parameters

Bloom is released in six sizes:

  • bloom-560m: 560 million parameters
  • bloom-1b1: 1.1 billion parameters
  • bloom-1b7: 1.7 billion parameters
  • bloom-3b: 3 billion parameters
  • bloom-7b1: 7.1 billion parameters
  • bloom (176B): 176 billion parameters
Model Tasks

  • Language modeling
  • Text completion and prediction
  • Sentiment analysis
  • Text classification
  • Language translation
  • Content generation and summarization
  • Speech recognition and transcription
  • Personalization and recommendation systems
  • Information retrieval and search engines
  • Fraud detection and spam filtering

Business Applications

  • Multilingual NLP
  • Multilingual customer support
  • Multilingual chatbots and virtual assistants
  • Multilingual sentiment analysis
  • Multilingual social media monitoring
  • Multilingual search engines
  • Multilingual voice assistants and speech recognition
  • Multilingual text summarization and classification
  • Multilingual data analysis and visualization
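For text completion tasks, the model can be loaded through the Hugging Face transformers library. A minimal sketch using the smallest public checkpoint, bloom-560m (assumes transformers and torch are installed; downloads roughly 1GB of weights on first run):

```python
from transformers import pipeline

# Load the 560M-parameter checkpoint; the larger variants use the same API.
generator = pipeline("text-generation", model="bigscience/bloom-560m")

result = generator("The BigScience workshop released", max_new_tokens=20)
print(result[0]["generated_text"])
```

The larger checkpoints, up to the full 176B model, live under the same `bigscience/` namespace on the Hugging Face Hub but require substantially more memory.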

Key Benchmark Results

Bloom achieves competitive performance on a wide variety of NLP benchmarks; detailed per-task results are published in the official model card.

Model Limitations

The model may:

  • Overrepresent some viewpoints and underrepresent others
  • Contain stereotypes
  • Contain personal information
  • Generate hateful, abusive, or violent language
  • Generate discriminatory or prejudicial language
  • Generate content that may not be appropriate for all settings, including sexual content
  • Make errors, including producing incorrect information as if it were factual
  • Generate irrelevant or repetitive outputs
  • Induce users into attributing human traits to it, such as sentience or consciousness