The Allen Institute for Artificial Intelligence (AI2) has released OLMo, an open large language model designed to help researchers better understand how language models work and to advance the science of language modeling.
AI2's collaboration with the Kempner Institute for the Study of Natural and Artificial Intelligence at Harvard University, along with partners including AMD, the CSC-IT Center for Science (Finland), the Paul G. Allen School of Computer Science & Engineering at the University of Washington, and Databricks, is making the OLMo project a reality.
OLMo is being released along with its pre-training data and training code, something that, as the institute's announcement puts it, "is not available today in any open model of this scale."
The framework's development tools include the pre-training data, built on AI2's Dolma dataset of three trillion tokens, along with the code that produces that training data.
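For readers who want to try the release themselves, the sketch below shows one way the open weights might be loaded for inspection. It assumes the checkpoints are published on the Hugging Face Hub under an identifier such as allenai/OLMo-7B and that they load through the standard transformers AutoModel classes; the exact repository names, required packages, and flags may differ from what AI2 ships.

    # Hypothetical sketch: loading an open OLMo checkpoint for inspection.
    # The model identifier "allenai/OLMo-7B" and the trust_remote_code flag are
    # assumptions about how the release is hosted, not confirmed details.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_ID = "allenai/OLMo-7B"  # assumed Hugging Face Hub identifier

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, trust_remote_code=True)

    # Generate a short continuation to confirm the weights load and run.
    inputs = tokenizer("Language models are", return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=20)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))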
“Many language models today are published with limited transparency. Without access to training data, researchers cannot scientifically understand how a model works. This is akin to discovering drugs without clinical trials or studying the solar system without a telescope,” says Hanna Hajishirzi, OLMo project leader, senior director of NLP Research at AI2, and professor at UW’s Allen School.
She adds that, thanks to OLMo, researchers “will finally be able to study the science of LLMs, which is critical to building the next generation of safe and reliable artificial intelligence.”
The Allen Institute for Artificial Intelligence noted that OLMo lets researchers and developers work with greater precision by offering insight into the training data behind the model, removing the need to rely on assumptions about how the model works. And because the models and datasets are open, researchers can learn from and build on previous models and work.
In the coming months, AI2 will continue to iterate on OLMo and will bring different model sizes, modalities, datasets, and capabilities into the OLMo family.