Thesis
As artificial intelligence (AI) has become more prevalent, one segment has become increasingly important: natural language processing (NLP). From 2020 to 2021, 60% of tech leaders increased their NLP budgets by at least 10%, with almost a fifth of them doubling it. While the relevance of the technology has exploded, it has remained somewhat inaccessible. Hugging Face is focused on improving accessibility.
Hugging Face is an open-source hosting platform for NLP and other machine learning (ML) domains, such as computer vision and reinforcement learning. Large tech companies like Google, Facebook, and Microsoft were traditionally the only ones developing and using large NLP models because of the cost and processing resources they require. The cost of training one large language model can be as high as $1.6 million.
Hugging Face is trying to change how companies use NLP models by making them accessible to everyone. They are reproducing and distributing pre-trained models as open-source software. Beyond NLP, the company aims to become the GitHub of machine learning. Hugging Face is creating a repository for machine learning and a marketplace for researchers, data scientists, and engineers to share and exchange their work. The company’s mission is to:
“Democratize good machine learning and maximize its positive impact across industries and society.”
Founding Story
Clément Delangue (CEO) and Julien Chaumond (CTO) founded Hugging Face in 2016 as an “AI best friend forever (BFF)” chatbot via a mobile app for teenagers. The chatbot could interact with teens and provide emotional support and entertainment based on their needs.
Over time, the founders started to use open-source AI models to power their chatbot. When they open-sourced their own NLP models, the release quickly became popular within the broader AI community, particularly among AI developers. They ditched the chatbot app and began collecting large NLP models, such as BERT and GPT, and reproducing and distributing them as open-source software.
Delangue began his career in the product team of Moodstocks, a machine learning startup for image classification. (Google acquired Moodstocks in July 2016 for an undisclosed figure.) He was also previously the CMO of Mention, a media monitoring and social listening startup acquired by Mynewsdesk.
Before cofounding Hugging Face, Chaumond was a software engineer at Stupeflix, a video editing startup for developers. At Stupeflix, he spent most of his time building an ML-powered movie creation mobile app. In February 2016, GoPro acquired Stupeflix and another video editing company named Vemory for $105M.
Product
Transformers Library
In 2017, researchers at Google and the University of Toronto released a ground-breaking paper introducing a new architecture named the 'transformer'. Transformers, which underpin what are now often called foundation models, are neural networks built around an attention mechanism that weighs the relationships between all the words in a sentence and ranks how important each element is to the others. By weighting every element against the rest, transformers can interpret complex and ambiguous parts of a sentence much faster than traditional sequential NLP models. Because the architecture processes all the words in a sentence in parallel rather than one at a time, it makes full use of the parallel processing units found in GPUs for NLP training. Not all NLP training pipelines use GPUs, but the industry is working toward making better use of them for that purpose.
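For intuition, the core operation from the 2017 paper is scaled dot-product attention, in which every token's query is compared against every other token's key in a single matrix multiplication. Below is a minimal NumPy sketch of that operation; the array shapes and random inputs are illustrative only.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Score every token against every other token and mix their values.

    Q, K, V: arrays of shape (sequence_length, d_k).
    The single matrix product Q @ K.T scores all token pairs at once,
    which is what lets GPUs process the whole sentence in parallel.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # (seq_len, seq_len) pairwise scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax over each row
    return weights @ V                                          # weighted mix of value vectors

# Illustrative example: a 5-token "sentence" with 8-dimensional embeddings.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((5, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (5, 8)
```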
Companies like Google, Facebook, and OpenAI immediately built large language models (LLMs) like BERT, RoBERTa, GPT-2, and GPT-3 based on transformer technology. However, not every enterprise can develop transformer-based models from scratch, as they are expensive to build and require significant processing and computational resources.
Hugging Face created the Transformers library to help people and organizations overcome the cost of building transformer models from scratch. The Transformers library is an open-source repository hosting some of the most popular language models. Developers use an application programming interface (API) to access thousands of off-the-shelf NLP and computer vision models offered through Hugging Face. After downloading a model, developers can fine-tune it for specific use cases.
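To illustrate how little code the library requires, the sketch below loads an off-the-shelf sentiment model through the Transformers pipeline API; the input sentence and the default checkpoint the call downloads are illustrative, and a production system would swap in a model fine-tuned for its own use case.

```python
# A minimal sketch using the open-source transformers library (pip install transformers).
from transformers import pipeline

# Downloads a pre-trained sentiment-analysis model from the Hugging Face Hub on first use.
classifier = pipeline("sentiment-analysis")

print(classifier("Hugging Face makes large NLP models easy to use."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```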
Source: Brandon Reeves
Hugging Face Hub
The Hugging Face Hub is the flagship open-source platform offered by the company. It consists of pre-trained ML models, datasets, and Spaces. As of August 2022, the Hub hosts over 68,000 models covering various tasks, including text, audio, and image classification, translation, segmentation, speech recognition, and object detection.
As of August 2022, the Datasets Library contains over 9,100 datasets. Users can access and interact with the Datasets Library using a few lines of code. For instance, it takes only 2 lines of code (see image below) to access RedCaps, a dataset of 12 million image-text pairs scraped from Reddit.
Source: Hugging Face
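The two-line load shown in the image would look roughly like the sketch below; the dataset identifier follows how RedCaps is listed on the Hub, and the configuration name is an assumption.

```python
from datasets import load_dataset  # pip install datasets

# Downloads the RedCaps image-text annotations from the Hugging Face Hub.
# The "red_caps" identifier and "all" config name are assumptions based on the Hub listing.
dataset = load_dataset("red_caps", "all")
```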
Spaces
Hugging Face also provides a platform named Spaces for developers to build, host, and share their models with the machine learning community. Developers can use Spaces to help build their portfolios by sharing their NLP, computer vision, and audio models. Spaces supports interactive demos (limited to 16GB of RAM and 8 CPU cores) for hosted models and version control with a git-based workflow.
As of September 2022, Spaces hosts over 6,500 apps, and there is no limit on the number a developer can host. Users can host an unlimited number of Streamlit, Gradio, and Static apps using Spaces. Notable developer Spaces on the platform include DALL-E mini by dalle-mini, Stable Diffusion by stabilityai, AnimeGANv2 and ArcaneGAN by akhaliq, and Latent Diffusion by multimodalart.
Source: Hugging Face Spaces
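To show what a Space typically contains, here is a minimal Gradio demo of the kind Spaces hosts: a Space is essentially a git repository holding an app.py like this plus its requirements. The model choice and interface here are illustrative assumptions, not a real Space.

```python
# app.py - a minimal Gradio demo of the kind hosted on Hugging Face Spaces.
import gradio as gr
from transformers import pipeline

# A small open model as a stand-in; a real Space would point at the author's own model.
generator = pipeline("text-generation", model="gpt2")

def complete(prompt):
    # Return the model's continuation of the user's prompt.
    return generator(prompt, max_length=50)[0]["generated_text"]

demo = gr.Interface(fn=complete, inputs="text", outputs="text")
demo.launch()
```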
Inference API
The Inference API enables organizations to integrate thousands of ML models through a fully hosted API. It also offers the infrastructure to support large models requiring over 10GB of space. Hugging Face built the Inference API for enterprise customers, and it can support up to 1,000 API requests per second.
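A hosted model is called with a plain HTTPS request. The sketch below shows the general pattern, assuming the api-inference.huggingface.co endpoint format; the model id and access token are placeholders.

```python
import requests

# Call a hosted model through the Inference API. The endpoint pattern and JSON payload
# follow Hugging Face's documented format; the model id and token below are placeholders.
API_URL = "https://api-inference.huggingface.co/models/distilbert-base-uncased-finetuned-sst-2-english"
headers = {"Authorization": "Bearer hf_your_access_token"}

response = requests.post(
    API_URL,
    headers=headers,
    json={"inputs": "Hugging Face makes NLP accessible."},
)
print(response.json())  # e.g. [[{'label': 'POSITIVE', 'score': 0.99...}, ...]]
```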
Autotrain
Using Autotrain, users can upload their data onto the Hugging Face platform. Autotrain then automatically finds the best model for the data, trains it, evaluates it, and deploys it at scale. The video below summarizes how Autotrain works.
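Autotrain itself is a hosted service, so rather than guess at its interface, the sketch below shows the kind of fine-tuning loop it automates, written with the standard transformers Trainer API on a public dataset. The dataset, base model, and hyperparameters are illustrative assumptions.

```python
# A sketch of the workflow Autotrain automates: pick a model, fine-tune it on
# uploaded data, then evaluate. Written with the open-source Trainer API, not Autotrain.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

dataset = load_dataset("imdb")  # stand-in for a user's uploaded dataset
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")

tokenized = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1),
    train_dataset=tokenized["train"].shuffle(seed=42).select(range(2000)),  # small slice for the sketch
    eval_dataset=tokenized["test"].select(range(500)),
)
trainer.train()        # Autotrain would also choose the architecture and hyperparameters automatically
print(trainer.evaluate())
```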
Market
Customer Profile
Hugging Face’s paying customers are primarily large enterprises seeking expert support, additional security, Autotrain features, private cloud, SaaS, and on-premise model hosting. In its early days, the company’s target market was indie researchers, machine learning enthusiasts, and small and midsize businesses (SMBs) with lower security requirements for deploying their NLP models.
The company released a series of products and services in 2020, such as Autotrain, Inference API, and on-premise and private cloud hosting options focused on enterprise solutions. As of June 2022, the company has over 1,000 customers, including Intel, Qualcomm, Pfizer, Bloomberg, and eBay.
Market Size
The global NLP market is projected to grow at a 6-year CAGR of 20%, from $11 billion in 2020 to over $35 billion in 2026. Growth will be driven by the availability of a large volume of datasets, increased business interests in AI, advanced AI research, and frequent releases of more powerful language models with more model parameters.
In a TechCrunch article titled “The emerging types of language models and why they matter”, published in April 2022, the author defines language model parameters as:
"The parts of the model are learned from historical training data and essentially define the skill of the model on a problem, such as generating text."
If the size of the NLP market is driven by the growth in NLP model parameters, it could grow faster than expected. From 2018 to 2021, the number of parameters in notable NLP models increased from 340 million to 530 billion, an increase of more than 1,500x.
The number and size of language models exploded in 2018 after Google open-sourced BERT with 340 million model parameters. In 2019, OpenAI's GPT-2 debuted with 1.5 billion parameters, Nvidia’s Megatron-LM with 8.3 billion parameters, and Google’s T5 with 11 billion parameters. Microsoft introduced Turing-NLG with 17.2 billion parameters in early 2020. OpenAI then released GPT-3 in June 2020 with 175 billion parameters; at the time, it was considered the largest language model ever made. However, Nvidia and Microsoft broke the record in 2021 when they unveiled Megatron-Turing NLG with 530 billion parameters. Hugging Face joined the fray in July 2022 when it released BLOOM with 176 billion parameters.
Source: Hugging Face
Competition
Direct Competitors
Hugging Face directly competes with companies like H2O.ai, spaCy, AllenNLP, Gensim, and Fast.ai. Their most notable direct competitor is H2O.ai.
H2O.ai offers an automated machine learning (autoML) platform for use cases in financial services, healthcare, insurance, manufacturing, marketing, retail, and telecommunications. Over 18,000 organizations use H2O.ai. The company’s most popular product offerings serve R and Python developers in the corporate sector.
Source: H2O.ai
The main difference between H2O.ai and Hugging Face is their business models. Hugging Face is community-oriented and offers a two-sided platform for data scientists and engineers to build on each other’s work by sharing their models. H2O.ai, on the other hand, sells a product focused on enterprise solutions rather than a general open-source AI community of scientists and engineers.
The strong Hugging Face community generates network effects for the company, giving it an advantage against H2O.ai in the long run. For example, Hugging Face’s second most popular open-source repository (datasets) has over double the number of stars of H2O.ai’s most popular repository (h2o-3). Also, Hugging Face’s most popular open-source repository, transformers, has been forked over 8x as many times as H2O.ai’s h2o-3 as of September 2020. The Hugging Face ecosystem clearly has a much more active and passionate community than H2O.ai’s.
Indirect Competitors
Indirectly, Hugging Face competes with companies such as OpenAI, Cohere, and AI21 Labs. Any company building and training large language models represents competition. Hugging Face reproduces some of the pre-trained AI models of other organizations, making them open-source and available to the public for free.
Hugging Face made a significant move in July 2022 when it released its own pre-trained large language model, BLOOM, with a similar architecture to OpenAI’s GPT-3. The release marks a transition for the company toward competing directly with previously indirect competitors that offer large language models.
Business Model
Hugging Face employs the open core business model. The company offers some features for free and other features to paying customers. Their core products, such as the Transformers library and Hugging Face Hub, are open-source and free.
Hugging Face charges users for extra features such as the Inference API, Autotrain, expert support, and advanced security. They offer subscription and consumption-based plans for paid platform features. The contributor and pro plans serve the open-source community, and the lab and enterprise plans serve large organizations and research labs.
Source: Hugging Face
The pro and lab plans allow users to train their first project for free, then pay $9/month (pro plan) or pay as they go (lab plan). The pro plan limits access to the Inference API to 1 million input characters per month and 2 hours of audio. The lab plan has no limits because users pay for what they use. Enterprise customers get customized quotes for their Hugging Face usage, plus more support and security features than the other plans.
Traction
Over 10,000 organizations use Hugging Face products as of June 2022. The Hugging Face transformers repository has over 60,000 stars on GitHub. Hugging Face only began offering paid features in 2021 and ended that year with about $10 million in revenue.
The traction of the company’s products can be attributed to its initial focus on building a community instead of monetizing users. Pat Grady, a partner at Sequoia Capital, had the following to say about Hugging Face prioritizing a strong community early on:
"They prioritized adoption over monetization, which I think was correct. They saw that transformer-based models working their way outside of NLP and saw a chance to be the GitHub not just for NLP, but for every domain of machine learning."
Key Opportunities
Expand Product Offerings to MLOps
Hugging Face already provides autoML solutions that enable organizations to build AI models without writing code. A logical next step would be to enter the MLOps market by serving enterprise customers with model management, deployment, and monitoring.
The MLOps market is expected to reach $6.2 billion in 2028, up from $612 million in 2021. Several well-funded startups exist in the space, notably Weights & Biases, DataRobot, Comet, and Dataiku. Hugging Face could develop and cross-sell MLOps model monitoring, observability, and management products to its existing enterprise customers.
Build Large Language Models
Hugging Face released its first open-source large language model, BLOOM, in July 2022. The model has a similar architecture to OpenAI’s widely popular GPT-3, and was trained on 46 spoken languages and 13 programming languages. In the past, Hugging Face relied on large language model creators to open-source their models—or for external researchers to replicate them—before Hugging Face added them to their library. However, large language model creators like OpenAI are increasingly keeping their models proprietary and commercializing them.
The unveiling of BLOOM shows a clear intent from Hugging Face to pursue creating its own large language models. BLOOM was downloaded over 40,000 times as of August 2022. If successful, BLOOM could present Hugging Face with an opportunity to become a formidable player in the large language model space rather than just a marketplace for open-source NLP models.
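For reference, BLOOM checkpoints are loaded through the same transformers interface as any other Hub model. The sketch below uses one of the smaller published BLOOM checkpoints so it can run on modest hardware; the prompt and generation settings are illustrative.

```python
from transformers import pipeline

# Load a smaller published BLOOM checkpoint from the Hub; the full 176B-parameter
# model uses the same interface but requires hundreds of gigabytes of memory.
generator = pipeline("text-generation", model="bigscience/bloom-560m")

print(generator("Hugging Face released BLOOM to", max_new_tokens=30)[0]["generated_text"])
```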
Key Risks
Biases and Limitations in Datasets
AI models, particularly NLP models, have long struggled with biases in the datasets used to build them. Human biases such as overgeneralization, stereotypes, prejudice, selection bias, and the halo effect are prevalent in the real world. Large language models are trained on vast volumes of data, often scraped from the internet, that can contain some of these biases. For instance, researchers found men are over-represented in online news articles and in Twitter conversations, so machine learning models trained on such datasets could have implicit gender biases.
Other researchers discovered that NLP models underrepresent women in certain career fields like computer programming and medicine. In a task determining the relationships between words, the models associated occupations such as nursing and homemaking with women, and occupations such as doctor and computer programmer with men.
NLP models are used across all sectors of the economy, from banking and insurance to law enforcement. Organizations could use some of Hugging Face’s popular models to make business decisions such as credit approval and insurance premium calculations that may impact marginalized and underrepresented groups.
Hugging Face has acknowledged the issue and even shown how some models in its library, such as BERT, contain implicit biases. It has put some checks in place, including the Model Card feature, which is intended to accompany every model on the platform and highlight its potential biases. However, these measures may not be enough, since they warn users about the biases but do not remove them.
Trends to Commercialize Language Models
Hugging Face hosts over 130 architectures. However, some popular architectures like GPT-3, Jurassic-1, and Megatron-Turing NLG are not available in the company’s library because companies such as OpenAI and AI21 Labs began commercializing their proprietary models.
Commercialized models usually contain more parameters (175 billion, 178 billion, and 530 billion, respectively) than open-source models and can perform more advanced tasks. If the commercialization trend continues, some of the content in Hugging Face’s library could become obsolete. Hugging Face would no longer be able to host the commercialized models, and the models it could host would have fewer parameters, be less accurate, and perform advanced tasks less well, driving users away from the platform.
Valuation
Source: Maginative
Hugging Face has raised a total of $161 million as of June 2022 from investors including Lux Capital, Sequoia, Coatue, Addition, A_capital, SV Angel, Betaworks, AIX Ventures, Thirty-Five Ventures, and Olivier Pomel (co-founder & CEO at Datadog). The last funding round was a $100 million Series C in May 2022 led by Lux Capital at a post-money valuation of $2 billion.
Hugging Face has plans to become a publicly traded company. Brandon Reeves of Lux Capital, an investor in Hugging Face since 2019, believes Hugging Face could be a $50-100 billion company. The company’s CEO, Clément Delangue, says he has turned down multiple acquisition offers. In a 2022 Forbes interview, Delangue made the following comments about his plans to take Hugging Face public:
"We want to be the first company to go public with an emoji, rather than a three-letter ticker. We have to start doing some lobbying to the Nasdaq to make sure it can happen."
Summary
Hugging Face’s open-source libraries and model repositories are helping to drive access to ML models as the field explodes, and they have become very popular among ML engineers, data scientists, and researchers. For the first five years of its existence, Hugging Face prioritized gaining free users and building a community over monetization. In 2021, the company began focusing on the enterprise market, developing enterprise-grade products and services with more security features and processing capacity.
Hugging Face is well capitalized. In May 2022, after its $100 million Series C round, the company also had roughly $40 million in the bank from previous rounds, bringing its cash reserves to about $140 million. This cash position will allow it to continue growing its library while expanding beyond NLP into MLOps, accelerating its commercialization efforts, and fighting the inherent bias in ML models.