How to tweak your model in Ollama or LMStudio or anywhere else
Summary
TL;DR: This video script delves into the intricacies of Large Language Models (LLMs), focusing on the parameters that influence their output. It explains concepts like temperature, context size (num_ctx), and the importance of setting a seed for consistent results. The script also covers advanced parameters for controlling text generation, such as stop words, repeat penalty, and top k/p. Additionally, it touches on the less common but equally important mirostat sampling parameters, which can affect the diversity and coherence of the model's output. The aim is to guide users on how to fine-tune these parameters for better control over LLMs running in tools like Ollama.
Takeaways
- Temperature affects the randomness of a model's output; lower temperatures make the most probable options even more probable, while higher temperatures increase the chances of less probable options.
- The 'num_ctx' parameter sets the context size for the model, influencing how much information it can remember and process during a conversation.
- Ollama models start with a default context size of 2K tokens due to memory constraints, which can be adjusted to support larger contexts like 128K tokens.
- 'Stop' words and phrases tell the model to halt output when a particular word or symbol appears, which helps cut off repetitive text.
- 'Repeat penalty' and 'repeat last n' parameters help manage the repetition of tokens by adjusting their probabilities based on recent usage.
- 'Top k' limits the number of tokens considered for the next prediction, while 'top p' focuses on tokens that sum up to a certain probability threshold.
- 'Min p' sets a minimum logit value for tokens to be considered, ensuring only sufficiently probable tokens are included in the prediction.
- Tail free sampling (tfs_z) cuts off the tail of probabilities, influencing the diversity of the model's output by adjusting the range of considered tokens.
- The 'seed' parameter ensures consistent output by making the random number generator predictable, which is useful for testing scenarios.
- Mirostat parameters like 'tau' and 'eta' offer an alternative method for generating the list of next possible tokens, focusing on perplexity and surprise.
- 'Num_predict' determines the maximum number of tokens to predict, with -1 allowing generation until completion and -2 filling the context.
Q & A
What are the common parameters used when working with Large Language Models (LLMs)?
-Common parameters include temperature, num_ctx, stop words and phrases, repeat penalty, repeat last n, top k, top p, min p, tail free sampling (tfs_z), seed, and mirostat parameters such as mirostat tau and eta.
How does the temperature parameter affect the model's output?
-Temperature scales the logits before they become probabilities. A lower temperature makes the most probable option even more probable, while a temperature greater than 1 reduces differences between logits, leading to a more creative output.
What is the purpose of the num_ctx parameter?
-Num_ctx sets the context size for the model, determining how many tokens are in its context. A larger context size can remember more information but requires more memory.
Why might Ollama models start with a default context size of 2k tokens?
-Ollama models start with a default context size of 2k tokens because supporting more tokens requires more memory, and many users have GPUs with limited memory, such as 8GB.
How can you increase the context size of an Ollama model?
-You can increase the context size by creating a new modelfile with the desired num_ctx value and then running 'ollama create' with the new modelfile.
What is the role of stop words and phrases in controlling model output?
-Stop words and phrases tell the model to stop outputting text when it encounters a specific word or symbol, preventing repetition in the generated text.
How does the repeat penalty parameter work to prevent repetition?
-Repeat penalty adjusts the probability of a token if it has been used recently. If the logit is negative, it multiplies by the penalty, and if positive, it divides by the penalty, usually reducing the token's likelihood of being used again.
What is the purpose of the top k parameter?
-Top k determines the length of the list of tokens to be generated for potential next tokens, limiting the list to only the most likely k tokens.
Can you explain the top p parameter and how it differs from top k?
-Top p builds a list of the most likely tokens whose probabilities add up to top p. Unlike top k, which keeps a fixed number of the most likely tokens, top p drops the low-probability tail: the tokens whose combined probability is 1 minus top p (0.05 for a top p of 0.95).
What is the seed parameter used for in LLMs?
-The seed parameter is used to make the random number generator predictable, ensuring that the model generates the same output every time when given the same input and seed.
How do mirostat tau and eta parameters influence the model's output?
-Mirostat tau controls the balance between coherence and diversity, with a higher value resulting in more diverse outputs. Mirostat eta acts as a learning rate, with a higher rate causing the model to adapt faster to changes in the generated text.
What is the num_predict parameter and how does it affect text generation?
-Num_predict is the maximum number of tokens to predict when generating text. Setting it to -1 allows the model to generate until completion, while -2 will fill the context, potentially cutting off at that point.
Outlines
Introduction to LLM Parameters
This paragraph introduces various parameters used in Large Language Models (LLMs) such as temperature, seed, num_ctx, and more, explaining their significance and how they influence the model's output. It emphasizes the importance of understanding these parameters for effective utilization of LLMs like Ollama. The temperature parameter is highlighted for its role in adjusting the model's creativity by scaling logits into probabilities, while num_ctx sets the context size, which is crucial for the model's memory capacity and performance. The paragraph also touches on the default context size in Ollama and how to modify it for models like llama 3.1.
Controlling Text Generation with Parameters
The second paragraph delves into the parameters that control text generation in LLMs, focusing on stop words, repeat penalty, repeat last n, top k, top p, min p, and tail free sampling. It explains how these parameters can be used to manage the model's output, prevent repetition, and influence the selection of tokens based on their probabilities. The discussion includes the impact of the repeat penalty on token probabilities and the role of repeat last n in defining the window for detecting repetitions. Additionally, it covers the function of top k and top p in narrowing down the list of potential next tokens and the use of min p and tail free sampling to further refine the token selection process.
Advanced Parameters for Consistency and Control
This paragraph discusses advanced parameters such as seed and mirostat settings, which are used to control the randomness and consistency of LLM outputs. The seed parameter is crucial for generating consistent results, especially in testing scenarios, by making the random number generator predictable. The mirostat parameters, including tau and eta, are introduced for their role in balancing coherence and diversity in the model's text generation. The paragraph also explains the concepts of perplexity and surprise in the context of LLMs and how they relate to the model's output. Finally, it mentions the num_predict parameter, which determines the maximum number of tokens to predict during text generation.
Configuring Parameters for Model Customization
The final paragraph provides guidance on configuring LLM parameters, both in the modelfile and through the default user interface. It clarifies that while some parameters can be adjusted mid-conversation using the command line, others like num_ctx cannot. The paragraph also notes that certain parameters are not yet documented due to their infrequent use. The video script concludes by inviting viewers to share their experiences with these parameters and thanking them for watching.
Keywords
Temperature
Num_ctx
Logits
Softmax Function
Stop Words and Phrases
Repeat Penalty
Top K
Top P
Min P
Tail Free Sampling (TFS_Z)
Seed
Highlights
Introduction to various parameters used in Large Language Models (LLMs) and their significance.
Explanation of temperature and its role in scaling logits to probabilities, affecting model creativity.
Discussion on num_ctx, the context size setting, and its impact on memory requirements.
How to customize context size in Ollama models beyond the default 2K tokens.
Use of stop words and phrases to control repetitive text generation in models.
Repeat penalty and repeat last n parameters to manage token repetition in text generation.
Top k parameter to determine the length of the list of potential next tokens.
Top p parameter for controlling the sum of probabilities in the list of potential tokens.
Min p parameter as an alternative to top p for setting a minimum logit value for token consideration.
Tail free sampling (tfs_z) for cutting off the tail of probabilities, affecting token selection.
The importance of seed in ensuring consistent model responses for testing or other purposes.
Introduction to mirostat parameters and their use in generating a list of next possible tokens.
Mirostat tau parameter for balancing coherence and diversity in generated text.
Mirostat eta parameter as the learning rate influencing model adaptation speed.
Num_predict parameter for setting the maximum number of tokens to predict in text generation.
Configuring parameters in the modelfile and some parameters' availability in the default UI.
Invitation for viewers to share their experiences with these parameters and to provide feedback.
Transcripts
There are a lot of parameters available when working with LLMs, like temperature and seed and num_ctx and more, but do you know how to use them all and what they mean? Well, let's go through them all now. And although I tend to focus on Ollama, the contents of this video apply just as much to any other tool as they do to Ollama, except for the parts where I talk about the implementation.
You can find the list of parameters in the Ollama documentation. Just go to the modelfile docs and then parameters. The first three talk about mirostat sampling, which is a strange place to start since it's probably not the most common one that you'll use. So instead let's start our list with what are probably the most common ones.
I think the first one in our list should be temperature. The way models work is that they guess the first word of the answer and then try to figure out what is the most likely next word of the answer. And they just keep repeating that over and over and over again. When coming up with that most likely next word, or actually token, the model creates a list of words or tokens that could potentially go in that next spot. And they all have a probability assigned to them that shows how probable each option is as the next token.
But the model doesn't store probabilities. It actually works with what are called logistic units, or logits. Logits are unscaled when they are first generated, but they tend to be, especially with llama.cpp, which Ollama uses, between -10 and 10. These logits are converted with a softmax function into a series of numbers between 0 and 1, and if you add up all the numbers they add up to 1. So essentially they have become probabilities. Temperature scales the logits before they become probabilities. Lower temperatures will spread the logits out, making the smaller numbers smaller still and the larger numbers even larger. This means what was most probable before has become even more probable now. But a temperature greater than 1 reduces the differences between logits, resulting in probabilities that are closer together. So tokens that had a lower probability before now have a higher chance of being chosen than they did before. The result is that the model feels more creative in the way it answers a question.
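Here is a minimal pure-Python sketch of that scaling; the logit values are made up purely for illustration:

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide the logits by the temperature, then convert them to probabilities."""
    scaled = [l / temperature for l in logits]
    exps = [math.exp(s) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 2.0, 0.5]                       # made-up logits for three candidate tokens
print(softmax_with_temperature(logits, 1.0))   # baseline distribution
print(softmax_with_temperature(logits, 0.5))   # low temperature: the top token dominates even more
print(softmax_with_temperature(logits, 1.5))   # high temperature: probabilities move closer together
```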
Next in our list is num_ctx. This sets the context size for the model. When you look around at different models, or when you see the announcement for a brand new model, you might get excited when it says a context size of 128k tokens, or 8k tokens, or a million tokens. But then you start to have a long conversation with, let's say, llama 3.1 and wonder why it's forgetting information that was actually pretty recent.
In Ollama, every model starts out with a 2K context size. That means 2,048 tokens are in its context and anything older may get forgotten. The reason Ollama does this is that supporting more tokens in that context requires more memory, and a 128k-token context is going to require a lot of memory. From what we've seen in the Ollama Discord, a lot of people are starting out with GPUs with only 8GB of memory, and some are even smaller than that. That means those systems just can't possibly support 128k tokens, or even 8k tokens. For that reason, all Ollama models start out with a default context size of 2k, or 2,048 tokens.
So if you are excited to use llama 3.1 for its 128k context size, grab llama3.1 and then create a new modelfile that looks like this, with a FROM line pointing to llama 3.1 and a single parameter of num_ctx with a value of 131,072. Then run ollama create mybiggerllama3.1, or whatever you want to call it, then -f and point to the modelfile. This will create a brand new model that has a max context size of 128k tokens. Now you can run ollama run mybiggerllama3.1 and you will be in your new model.
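Reconstructing what is described above, the modelfile is just two lines (saving it as ./Modelfile is an assumption; any filename works with -f):

```
FROM llama3.1
PARAMETER num_ctx 131072
```

Then build and run the new model:

```bash
ollama create mybiggerllama3.1 -f ./Modelfile
ollama run mybiggerllama3.1
```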
But let's say you are playing around with a new model and can't find the max supported size for the context. The easiest way to figure this out is to run ollama show llama3.1. Near the top we see the context length, which is the max supported length of the model. The fact that we don't see a parameter of num_ctx defined tells us that it is set to use the Ollama default of 2048 tokens. That can be a bit confusing.
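For example:

```bash
# The "context length" line near the top of the output is the model's maximum supported
# context; if no num_ctx parameter is listed, the Ollama default of 2048 is what is used.
ollama show llama3.1
```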
Now let's say you want to use orca2 with a context length of 10,000 tokens. If you tell it to summarize something that is much longer than that, you will probably not get anything useful, because the model doesn't know how to handle it.
What do you think of this video so far? Click the like button if you find it interesting and be sure to subscribe to see more videos like this in the future. I post another video in my free Ollama course every Tuesday and a more in-depth video like this every Thursday. Subscribing means you won't miss any future videos.
Now we move on to the next thing in the list, stop words and phrases. Sometimes you will ask a model to generate something and you see it starts to repeat itself, often using one strange word or symbol at the beginning of each repeat. So you can tell the model to stop outputting text when it sees that symbol. All the rest of the parameters only accept a single value, but stop allows for multiple stop words to be used.
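As a sketch using the official ollama Python client (the stop strings here are arbitrary examples, not values from the video):

```python
import ollama

response = ollama.generate(
    model="llama3.1",
    prompt="List three facts about otters.",
    options={
        # Generation halts as soon as any of these strings is produced.
        "stop": ["###", "<|user|>"],
    },
)
print(response["response"])
```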
There are two other parameters that deal with repeats: repeat penalty and repeat last n. We talked before about how a list of potential tokens is generated along with the probabilities that they are the most likely next token. The penalty will adjust that probability if the token or word or phrase has been used recently. If the logit for the token is negative, then the logit will be multiplied by the penalty. If the logit is positive, then it will be divided by the penalty. The penalty is usually greater than 1, resulting in that token being used less. But it's also possible to set it below 1, meaning that token will be used more often.
At this point you are probably wondering how large the window is for finding repeats. Well, that is what repeat last n is for. This defines the window. The default is 64, meaning it looks at the last 64 tokens. But you can set it to a larger or smaller window. If you set it to 0, it disables the window, and if you set it to -1, then the window is the full context of the model.
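Here is a toy sketch of that adjustment rule (illustrative numbers only; this is not the llama.cpp implementation):

```python
def apply_repeat_penalty(logit, penalty=1.1, recently_used=True):
    """Adjust a token's logit if it appeared within the repeat_last_n window
    (default 64 tokens; 0 disables the check, -1 uses the whole context)."""
    if not recently_used:
        return logit
    # Negative logits are multiplied by the penalty, positive logits are divided
    # by it, so a penalty > 1 makes a recently used token less likely either way.
    return logit * penalty if logit < 0 else logit / penalty

print(apply_repeat_penalty(6.0))    # 6.0 / 1.1 ~= 5.45 -> less likely to repeat
print(apply_repeat_penalty(-3.0))   # -3.0 * 1.1 = -3.3  -> also less likely
```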
Top k determines the length of the list of tokens to be generated for the potential next token. This defaults to 40, but can be anything you like. This is pretty simple, saying that only the most likely 40 will be in the list.
Top p is a little more complicated. When you add up all the probabilities in the list, you should end up with 1. But when using top p, it will create the list of the most likely tokens whose probabilities add up to top p. So a top p of 0.95 will exclude the least likely tokens, the ones whose probabilities together sum to 0.05.
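A toy sketch of both cutoffs on a made-up, already-sorted probability list (not the actual sampler code):

```python
def top_k_filter(probs, k=40):
    """Keep only the k most likely tokens (probs sorted descending)."""
    return probs[:k]

def top_p_filter(probs, top_p=0.95):
    """Keep the most likely tokens until their probabilities sum to top_p."""
    kept, running = [], 0.0
    for p in probs:
        kept.append(p)
        running += p
        if running >= top_p:
            break
    return kept

probs = [0.50, 0.25, 0.12, 0.06, 0.04, 0.02, 0.01]   # made-up values, sum to 1.0
print(top_k_filter(probs, k=3))      # [0.50, 0.25, 0.12]
print(top_p_filter(probs, 0.95))     # [0.50, 0.25, 0.12, 0.06, 0.04] -> sums to 0.97
```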
Min p is an alternative to using top p. This looks at the source logits. It takes the value of the largest logit in the list, then figures out the value of min p percent of that largest value. All next tokens must have a logit value greater than that minimum value. So if the largest logit is 8 and min p is 0.25, then all logits must be greater than 2 to be considered, since 25% of 8 is 2.
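The same worked example as a tiny sketch. (This follows the video's logit-based description; the Ollama docs describe min_p as a minimum probability relative to the most likely token, but the cutoff idea is the same.)

```python
def min_p_cutoff(logits, min_p=0.25):
    """Keep only tokens whose logit exceeds min_p times the largest logit."""
    threshold = max(logits) * min_p     # e.g. 0.25 * 8 = 2
    return [l for l in logits if l > threshold]

print(min_p_cutoff([8.0, 5.0, 2.5, 1.0], min_p=0.25))   # threshold 2.0 -> [8.0, 5.0, 2.5]
```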
OK, tail free sampling, or tfs_z. If you create a chart of all the probabilities, you will see it slowly approaching zero. Tail free sampling cuts off that tail at some point. As the number approaches 0, more of the tail gets cut off. If you use this, you want to start really close to 1 and gradually come down. A value of 1 means none of the tail is cut off; 0.99 to 0.95 is a good starting range. The docs for this one I think are wrong, and I have an issue open to fix it.
Next one on our list is seed. One of the strengths of large language models is that they generate text much like the way we humans generate text. They have a word, and then they think of the next word, and the next, and the next, spitting out the words that make the most sense. This means that at the beginning of the sentence, they don't really know how the sentence is going to end. And because they're dealing with probabilities, there's a decent chance that when you ask the same question twice, the answer isn't always going to be the same. This really is one of the benefits of working with large language models. But sometimes it's not a benefit. Sometimes you want the model to answer the same way every single time. Maybe in those cases, large language models aren't really the right tool to use. But if you want to use an LLM and you want the answer to be the same every single time, then you need to set the seed for the random number generator. Large language models use random number generators to help figure out that next token, and setting the seed makes the random number generator predictable: a sequence of numbers generated one time is going to be the same sequence every other time if the seed is consistent. One situation where it makes sense to use this is testing, where you want to ensure the model answers the same way in your test cases.
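For example, with the ollama Python client (the seed value 42 is arbitrary; any fixed integer works):

```python
import ollama

options = {"seed": 42, "temperature": 0.8}

first = ollama.generate(model="llama3.1", prompt="Name a fruit.", options=options)
second = ollama.generate(model="llama3.1", prompt="Name a fruit.", options=options)

# Same model, same prompt, same seed -> the two responses should match exactly.
print(first["response"] == second["response"])
```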
Now let's cover the mirostat parameters. These are used by a different method of coming up with the list of next possible tokens. With mirostat, there are essentially three modes to choose from. If you don't set mirostat, then top p, top k, tfs, and the rest of the parameters are used. But you can set mirostat to 1 or 2 to use the mirostat tau and eta parameters, resulting in more or less proportional probabilities.
When working with mirostat, you often see the terms perplexity and surprise come up. They are related but different. Perplexity is a statistical measure derived from the probability of the next word appearing in a sequence. A lower perplexity indicates the model assigns higher probabilities to more correct words in a sequence. Higher perplexity indicates that the model assigns higher probabilities to less correct words.
Surprise is a human emotion used as a metaphor to help understand this. A model with a higher perplexity is said to be more surprised by its own output, which I think is a really strange way of talking about it. It feels like the folks who came up with that don't interact with other humans much... which may or may not be true.
But mirostat tau controls the balance between coherence and diversity of the output. Tau sets the desired level of perplexity in the generated text. A higher value of tau will result in more diverse outputs. It usually has a range of 3 to 5 and defaults to 5.
Mirostat eta is the learning rate used. A higher rate means that the model will react and adapt faster than a lower rate. This means that if the model chooses a less probable token and eta is high, the model will continue to stick with the newer style of text. A low eta is more stable and less likely to overreact to temporary changes.
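Putting those together in request options with the ollama Python client (a sketch; tau 5.0 and eta 0.1 are the commonly cited defaults):

```python
import ollama

response = ollama.generate(
    model="llama3.1",
    prompt="Write a short poem about rivers.",
    options={
        "mirostat": 2,        # 0 = off (top_k/top_p/tfs are used instead); 1 or 2 enable mirostat
        "mirostat_tau": 5.0,  # target perplexity: higher -> more diverse output
        "mirostat_eta": 0.1,  # learning rate: higher -> adapts faster to the text so far
    },
)
print(response["response"])
```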
The final parameter we will cover is num_predict. This is the maximum number of tokens to predict when generating text. If you set this to -1, it will keep generating until it's done, and -2 will fill the context. This doesn't mean it will always consume the entire amount, but rather that's the point where it will get cut off.
Earlier I showed how you can configure these parameters in the modelfile. Some of the parameters can also be configured in the default UI, by entering /set parameter and then the parameter name and value. But not all parameters can be set that way. For instance, setting num_ctx that way won't work. Basically, the ones that make sense to change mid-conversation can be set at the command line.
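For example, inside an ollama run session (a sketch; the parameter names match the modelfile docs and the values are arbitrary):

```
>>> /set parameter temperature 0.3
>>> /set parameter num_predict 256
>>> /set parameter seed 42
```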
And that's pretty much all the parameters that folks tend to use. There are a few others that aren't documented yet, but they are so rarely used that it doesn't make sense to cover them.
What do you think? Do you use any of these parameters when you use large language models? Let me know in the comments below. Thanks so much for watching. Goodbye.