An Overview of Bloom
Bloom is an autoregressive language model trained on a dataset covering 46 natural languages and 13 programming languages. It was developed through the BigScience research workshop, a collaboration of hundreds of researchers from various organizations, including Facebook AI Research, Stanford University, and New York University.
With 176B parameters, it is one of the largest open-access language models available. The model is publicly released under the Responsible AI License and is freely available to the public.
Trained on 59 Languages
Bloom was trained on the ROOTS corpus, a dataset drawn from hundreds of sources in 46 natural and 13 programming languages.
25 tons CO2eq Emissions
Bloom was trained on a low-carbon-intensity energy grid, which kept training emissions to about 25 tons of CO2eq. That makes it one of the greenest models of its size.
- Bloom is a large language model with 176B parameters, making it one of the largest language models available.
- It is an open-access language model that is freely available to the public.
- Trained on the ROOTS corpus, a dataset that includes hundreds of sources in 46 natural and 13 programming languages.
- Achieves competitive performance on a wide variety of benchmarks.
- The models and code used to build Bloom are publicly released under the Responsible AI License, which is intended to promote the ethical and responsible use of AI technologies.
The training dataset contained 1.6TB of pre-processed text, converted into 350B unique tokens.
Training dataset size
The model's bf16 weights take up 329GB, and the full checkpoint with optimizer states was 2.3TB. The tokenizer vocabulary size is 250,680.
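The weight size quoted above can be sanity-checked with simple arithmetic: bf16 stores each parameter in 2 bytes, so 176B parameters need roughly 352 GB in decimal units, or about 328 GiB in binary units. The sketch below assumes the rounded parameter count of 176B from the text; the function and constant names are illustrative only. That the reported 329GB is closer to the binary figure is an inference, not something the source states.

```python
# Back-of-the-envelope estimate of the bf16 weight size.
# Assumption: 176e9 is the rounded parameter count quoted in the text,
# not the exact count.
PARAMS = 176e9
BYTES_PER_BF16 = 2  # bfloat16 uses 16 bits per value


def weights_size_bytes(num_params: float, bytes_per_param: int = BYTES_PER_BF16) -> float:
    """Return the raw storage needed for the weights, in bytes."""
    return num_params * bytes_per_param


size_bytes = weights_size_bytes(PARAMS)
size_gb = size_bytes / 10**9    # decimal gigabytes
size_gib = size_bytes / 2**30   # binary gibibytes

print(f"{size_gb:.0f} GB, {size_gib:.0f} GiB")
```

The full 2.3TB checkpoint is several times larger because it also holds optimizer states, which this estimate does not cover.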
BLOOM's learned subword tokenizer is trained using a byte-level Byte Pair Encoding (BPE) algorithm and a simple pre-tokenization rule with no normalization.
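To make the idea of byte-level BPE concrete, here is a minimal toy sketch of the general technique, not BLOOM's actual tokenizer. It starts from raw UTF-8 bytes and repeatedly merges the most frequent adjacent pair. It skips the pre-tokenization rule mentioned above for brevity, and all function names are hypothetical.

```python
from collections import Counter


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_byte_level_bpe(text, num_merges):
    """Learn up to `num_merges` merges over the UTF-8 bytes of `text`."""
    ids = list(text.encode("utf-8"))  # base vocabulary: byte values 0..255
    merges = {}                       # (left_id, right_id) -> new_id
    next_id = 256
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        best, count = pairs.most_common(1)[0]
        if count < 2:                 # nothing left worth merging
            break
        merges[best] = next_id
        ids = merge_pair(ids, best, next_id)
        next_id += 1
    return merges


def encode(text, merges):
    """Tokenize `text` by applying the merges in the order they were learned."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():  # dicts preserve insertion order
        ids = merge_pair(ids, pair, new_id)
    return ids


def decode(ids, merges):
    """Expand token ids back into bytes and decode them as UTF-8."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8")
```

Because the base vocabulary covers every possible byte, any string can be encoded without an unknown-token fallback, and because no normalization is applied, decoding returns the input text exactly.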
Training time and resources
Training the model took about 4 months. Training throughput was about 150 TFLOPS per GPU, and the estimated cost of training was $2-5M.
|Model|Parameters|
|---|---|
|bloom-1b1|1 billion|
|bloom-1b7|1.7 billion|
|bloom 176B|176 billion|

|Language Modeling|Multilingual NLP|
|---|---|
|Text completion and prediction|Multilingual customer support|
|Sentiment analysis|Multilingual chatbots and virtual assistants|
|Text classification|Multilingual sentiment analysis|
|Language translation|Multilingual social media monitoring|
|Content generation and summarization|Multilingual search engines|
|Speech recognition and transcription|Multilingual voice assistants and speech recognition|
|Personalization and recommendation systems|Multilingual voice assistants and speech recognition|
|Information retrieval and search engines|Multilingual text summarization and classification|
|Fraud detection and spam filtering|Multilingual data analysis and visualization|
Key Benchmark Results
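Risks and limitations
Like other large language models, Bloom has known risks and limitations. The model may:
- Overrepresent some viewpoints and underrepresent others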
- Contain stereotypes
- Contain personal information
- Generate hateful, abusive, or violent language
- Generate discriminatory or prejudicial language
- Generate content that may not be appropriate for all settings, including sexual content
- Make errors, including producing incorrect information as if it were factual
- Generate irrelevant or repetitive outputs
- Induce users into attributing human traits to it, such as sentience or consciousness