LLM Scaling Laws
Mar 22, 2024
The performance of an LLM is primarily a function of two quantities (a rough sketch of this relationship follows the list):
- N — the number of parameters in the network (weights and biases)
- D — the size of the training dataset, i.e. the amount of text we train on (typically measured in tokens)
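Concretely, scaling-law papers (e.g. the Chinchilla work by Hoffmann et al.) fit the loss as a sum of power laws in N and D. The sketch below is a minimal illustration of that functional form only; the constants and the (N, D) pairs are illustrative assumptions, not fitted values.

```python
# Minimal sketch of a Chinchilla-style scaling law:
# loss falls as a power law in both parameter count N and training tokens D.
# All constants below are illustrative placeholders, NOT published fits.

def scaling_loss(n_params: float, n_tokens: float,
                 e: float = 1.7,      # irreducible loss (assumed)
                 a: float = 400.0,    # parameter-count coefficient (assumed)
                 b: float = 410.0,    # data coefficient (assumed)
                 alpha: float = 0.34, # parameter exponent (assumed)
                 beta: float = 0.28   # data exponent (assumed)
                 ) -> float:
    """L(N, D) = E + A / N**alpha + B / D**beta."""
    return e + a / n_params**alpha + b / n_tokens**beta

if __name__ == "__main__":
    # Growing parameters and data lowers the predicted loss, with
    # diminishing returns -- the "intelligence for free" from scaling.
    for n, d in [(1e9, 2e10), (7e9, 1.4e11), (70e9, 1.4e12)]:
        print(f"N={n:.0e} params, D={d:.0e} tokens -> loss ~ {scaling_loss(n, d):.3f}")
```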
This is why we see ever-larger LLMs released over time: we get more “intelligence” essentially for free just by scaling up!
This trend gives an outsized advantage to large corporations like OpenAI, Google, Microsoft, Meta, etc., since they can spend billions of dollars on computing infrastructure to train very large LLMs.
Can smaller models like Mixtral (which uses a mixture-of-experts architecture) break this trend?
I will blog about this in a follow-up post. Stay tuned!