There’s an exciting new open-source Large Language Model (LLM) called FreeWilly that is outperforming LLaMA-2, the recently released 70 billion parameter model from Meta. FreeWilly comes from Stability AI, the creators of Stable Diffusion. So how did they pull off an LLM that beats LLaMA-2? The answer lies in Microsoft’s Orca paper.
Although Microsoft hasn’t released their Orca model or dataset yet, open-source projects have replicated the data generation process outlined in the paper, resulting in powerful models like FreeWilly. Stability AI took a similar approach, fine-tuning a LLaMA-2 foundation model using the Orca methodology.
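To make the Orca methodology a little more concrete, here is a minimal sketch of what explanation-style data generation can look like: a teacher model is prompted with a system message that asks for step-by-step reasoning, and the resulting (instruction, explanation) pairs become the fine-tuning dataset. The client setup, teacher model choice, and example instructions below are illustrative assumptions, not Stability AI's actual pipeline.

```python
# Minimal sketch of Orca-style "explanation tuning" data generation.
# Assumptions: the OpenAI Python SDK is installed, OPENAI_API_KEY is set, and
# GPT-4 stands in as the teacher model. This is not Stability AI's pipeline,
# just an illustration of the idea described in the Orca paper.
from openai import OpenAI

client = OpenAI()

# Orca-style system prompts push the teacher to produce reasoning traces,
# so the student model learns explanations rather than bare answers.
SYSTEM_PROMPT = (
    "You are a helpful assistant. Think step by step and explain your "
    "reasoning before giving the final answer."
)

instructions = [
    "If a train travels 120 km in 1.5 hours, what is its average speed?",
    "Explain why the sky appears blue during the day.",
]

dataset = []
for instruction in instructions:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": instruction},
        ],
    )
    dataset.append(
        {
            "system_prompt": SYSTEM_PROMPT,
            "instruction": instruction,
            "response": response.choices[0].message.content,
        }
    )
```

Scaled up to many more prompts and a variety of system messages, this captures the general shape of the Orca-style datasets the open-source community has been building.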
There are two versions: FreeWilly 1 and FreeWilly 2. FreeWilly 1 fine-tunes the original 65B LLaMA model, while FreeWilly 2 builds on the 70B LLaMA-2 model. Tests show FreeWilly 2 comparing favourably to ChatGPT on certain benchmarks. Of course, real-world performance remains to be seen.
Understanding FreeWilly 1 and 2: Models with a Twist
Stability AI didn't just release one model but two. The first, FreeWilly 1, is essentially a fine-tuned version of the original Llama 65 billion parameter foundation model, trained on a synthetically generated dataset. FreeWilly 2, on the other hand, is a fine-tuned version of the Llama 2 70 billion parameter foundation model.
Interestingly, these models are putting up a tough fight against the likes of ChatGPT on some tasks. However, take these results with a grain of salt, as real-world performance may differ from benchmark scores. That's always worth remembering as you navigate the world of AI.
Data Generation and Training Process
The key to FreeWilly 2's success lies in its approach to fine-tuning. Stability AI took the 70 billion parameter Llama 2 model and curated a dataset using the approach suggested in the original Orca paper. This combination has proven to be a winning formula, producing a model that excels in various areas, including intricate reasoning and linguistic comprehension.
The training data totals 600,000 examples - around 10% of Orca’s dataset size. But the examples consist of high-quality instructions sourced from existing datasets. This illustrates the importance of curating your training data, not just having a massive quantity.
Despite using just 10% of Orca’s data, FreeWilly still delivers exceptional results. This highlights how smaller, carefully-curated datasets can train performant models with a lower carbon footprint.
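To illustrate the quality-over-quantity idea in code, here is a minimal sketch of the kind of filtering a curated instruction dataset might go through, using the Hugging Face datasets library. The source dataset and the specific thresholds are placeholders chosen for illustration, not Stability AI's actual curation criteria.

```python
# Illustrative sketch of curating a smaller, higher-quality instruction dataset.
# The source dataset and filter thresholds are placeholders, not FreeWilly's recipe.
from datasets import load_dataset

raw = load_dataset("databricks/databricks-dolly-15k", split="train")

def looks_high_quality(example):
    """Keep examples with a non-trivial instruction and a substantive response."""
    instruction = example["instruction"].strip()
    response = example["response"].strip()
    return len(instruction) > 20 and len(response) > 100

curated = raw.filter(looks_high_quality)

# Deduplicate on the instruction text so near-identical prompts don't dominate.
seen = set()

def is_new(example):
    key = example["instruction"].strip().lower()
    if key in seen:
        return False
    seen.add(key)
    return True

curated = curated.filter(is_new)
print(f"Kept {len(curated)} of {len(raw)} examples")
```

The point is not the particular filters but the habit: a smaller dataset where every example earns its place tends to beat a much larger, noisier one.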
Evaluating FreeWilly's Performance
According to Stability AI, FreeWilly excels at reasoning, understanding linguistics, and tackling complex questions in law, math, and other domains.
Some key benchmark comparisons:
- On the HellaSwag leaderboard, FreeWilly 2 surpasses ChatGPT. On others like MC-MMLU, it lags slightly behind.
- For AGIEval, FreeWilly 2 achieves impressive parity with ChatGPT overall, outperforming it on 6 of 8 datasets.
- Surprisingly, FreeWilly 1 and 2 have comparable results, despite the 5B parameter difference. This further emphasizes the impact of training data.
While benchmarks don’t always reflect real-world use, these results are incredibly promising for FreeWilly and the open-source community.
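If you want to sanity-check numbers like these yourself, a common route is EleutherAI's lm-evaluation-harness. The snippet below is a hedged sketch that assumes its v0.4-style Python entry point (lm_eval.simple_evaluate); argument names can differ between harness versions, and the model id and task list are just examples.

```python
# Hedged sketch: reproducing leaderboard-style scores with lm-evaluation-harness.
# Assumes the v0.4-style Python API (lm_eval.simple_evaluate); check your
# installed version, as argument names have changed over time.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # Hugging Face causal-LM backend
    model_args="pretrained=stabilityai/FreeWilly2,dtype=float16",
    tasks=["hellaswag", "arc_challenge"],
    num_fewshot=0,
    batch_size=4,
)

for task, metrics in results["results"].items():
    print(task, metrics)
```

Bear in mind that a 70B model needs serious GPU memory even for evaluation, which is part of why quantized versions matter (more on that below).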
The Power of Carefully Curated Data Sets
Quality Over Quantity: The Data Set Creation Process 📊
The FreeWilly models were developed with meticulous attention to quality over quantity. While the dataset was relatively small, around 10 percent of the size proposed in the original Orca paper, the examples were of high quality. This highlights the importance of choosing the right dataset for training or fine-tuning language models, as it directly impacts the model's overall performance.
A Tale of Two FreeWilly Models: Comparing FreeWilly 1 and FreeWilly 2 🔄
Interestingly, the comparison between FreeWilly 1 and FreeWilly 2 shows that dataset quality is more influential than model size. Despite FreeWilly 2 being based on the 70 billion parameter Llama 2 model and FreeWilly 1 on the 65 billion parameter foundation model, their performance is comparable. This underscores the significance of dataset curation and its impact on a language model's effectiveness.
Open-Source but Not for Commercial Use 💼
Despite their open-source nature, the FreeWilly models come with certain restrictions. They cannot be used for commercial purposes, likely due to the specific data creation process followed by Stability AI. In contrast, the base Llama 2 model, with 70 billion parameters, is open-source and free to use for commercial applications.
The Future of FreeWilly (Orca) Models
Hurdles to Overcome: Long Wait Times and Model Accessibility ⏳
While the excitement and interest in FreeWilly 2 are immense, there are some challenges to overcome. Currently, the wait times to access these models can be quite lengthy, hindering their immediate applicability. Efforts are underway to create quantized versions of the model to improve accessibility, but further development is needed.
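As an illustration of what a quantized setup could look like, here is a minimal sketch that loads FreeWilly 2 in 4-bit with Hugging Face transformers and bitsandbytes. The repository id and prompt template are assumptions based on the public model card, and a 70B model still needs substantial GPU memory even at 4-bit.

```python
# Hedged sketch: loading FreeWilly 2 with 4-bit quantization to cut memory use.
# Assumes transformers, bitsandbytes, and accelerate are installed, and that the
# model is published at stabilityai/FreeWilly2 on the Hugging Face Hub.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "stabilityai/FreeWilly2"  # assumed repository id

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # spread layers across available GPUs and CPU RAM
)

# Prompt template assumed from the model card; adjust if it differs.
prompt = (
    "### System:\nYou are a helpful assistant.\n\n"
    "### User:\nExplain the difference between FreeWilly 1 and FreeWilly 2.\n\n"
    "### Assistant:\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Quantization like this is exactly the kind of work that could shorten those wait times and put the model within reach of more modest hardware.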
A Thriving Community: Open-Source Advancements and the Need for Smaller Models 🤝
The FreeWilly models have sparked a wave of innovation in the open-source community. The dedication and passion of contributors are impressive, but it is becoming increasingly challenging to keep up with the rapid advancements. The call for Stability AI to release smaller versions of the model reflects the need to make this technology accessible to a wider audience.
The Future of Open Source LLMs
FreeWilly represents remarkable progress in open-source natural language research. Although wait times to access the models are currently impractical, smaller quantized versions could enable broader experimentation.
As the open-source ecosystem continues innovating at this rapid pace, the future looks bright. FreeWilly is the latest example of open source matching or exceeding proprietary models with minimal resources. We can expect even more exciting developments as research forges ahead.