LLM Scaling Laws
Mar 22, 2024
The performance of an LLM is primarily a function of two quantities (a rough sketch of this relationship follows the list):
- N — the number of parameters in the network (weights and biases)
- D — the size of the training dataset, i.e. the amount of text we train on (typically measured in tokens)
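Concretely, scaling-law papers (e.g. the Chinchilla work by Hoffmann et al.) fit the loss as a sum of power laws in N and D. The sketch below is a minimal illustration of that functional form only; the constants and the (N, D) pairs are illustrative assumptions, not fitted values.

```python
# Minimal sketch of a Chinchilla-style scaling law:
# loss falls as a power law in both parameter count N and training tokens D.
# All constants below are illustrative placeholders, NOT published fits.

def scaling_loss(n_params: float, n_tokens: float,
                 e: float = 1.7,      # irreducible loss (assumed)
                 a: float = 400.0,    # parameter-count coefficient (assumed)
                 b: float = 410.0,    # data coefficient (assumed)
                 alpha: float = 0.34, # parameter exponent (assumed)
                 beta: float = 0.28   # data exponent (assumed)
                 ) -> float:
    """L(N, D) = E + A / N**alpha + B / D**beta."""
    return e + a / n_params**alpha + b / n_tokens**beta

if __name__ == "__main__":
    # Growing parameters and data lowers the predicted loss, with
    # diminishing returns -- the "intelligence for free" from scaling.
    for n, d in [(1e9, 2e10), (7e9, 1.4e11), (70e9, 1.4e12)]:
        print(f"N={n:.0e} params, D={d:.0e} tokens -> loss ~ {scaling_loss(n, d):.3f}")
```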
This is why we see ever-larger LLMs released over time: we get more “intelligence” essentially for free just by scaling up!
This trend gives an outsized advantage to large corporations like OpenAI, Google, Microsoft, Meta, etc., since they can spend billions of dollars on computing infrastructure to train very large LLMs.
Can smaller models like Mixtral (which uses a mixture-of-experts architecture) break this trend?
I will blog about this in a follow-up post. Stay tuned!