How to tweak your model in Ollama or LMStudio or anywhere else

Matt Williams
22 Aug 2024 · 11:42

Summary

TL;DR: This video delves into the intricacies of Large Language Models (LLMs), focusing on the parameters that influence their output. It explains concepts like temperature, context size (num_ctx), and the importance of setting a seed for consistent results. It also covers advanced parameters for controlling text generation, such as stop words, repeat penalty, and top k/p. Additionally, it touches on the less common but equally important mirostat sampling parameters, which can affect the diversity and coherence of the model's output. The aim is to guide users on how to fine-tune these parameters for better control over LLMs in tools like Ollama.

Takeaways

  • 🔥 Temperature affects the randomness of a model's output; lower temperatures make the most probable tokens even more likely, while higher temperatures give less probable tokens a better chance of being chosen.
  • 📚 The 'num_ctx' parameter sets the context size for the model, influencing how much information it can remember and process during a conversation.
  • 💾 Ollama models start with a default context size of 2K tokens due to memory constraints, which can be adjusted to support larger contexts like 128K tokens.
  • 🚫 'Stop' words and phrases halt generation when the model emits a specific word or symbol, which is useful for cutting off output that has started to repeat.
  • 🔄 'Repeat penalty' and 'repeat last n' parameters help manage the repetition of tokens by adjusting their probabilities based on recent usage.
  • 🔑 'Top k' limits the number of tokens considered for the next prediction, while 'top p' focuses on tokens that sum up to a certain probability threshold.
  • 📈 'Min p' sets a minimum threshold, defined as a fraction of the largest logit, that a token must exceed to be considered for the next prediction.
  • 📊 Tail free sampling (tfs_z) cuts off the tail of probabilities, influencing the diversity of the model's output by adjusting the range of considered tokens.
  • 🌱 The 'seed' parameter ensures consistent output by making the random number generator predictable, which is useful for testing scenarios.
  • 🔍 Mirostat parameters like 'tau' and 'eta' offer an alternative method for generating the list of next possible tokens, focusing on perplexity and surprise.
  • ✂️ 'Num_predict' determines the maximum number of tokens to predict, with -1 allowing continuous generation until completion and -2 filling the context.

Q & A

  • What are the common parameters used when working with Large Language Models (LLMs)?

    -Common parameters include temperature, num_ctx, stop words and phrases, repeat penalty, repeat last n, top k, top p, min p, tail free sampling (tfs_z), seed, and mirostat parameters such as mirostat tau and eta.

  • How does the temperature parameter affect the model's output?

    -Temperature scales the logits before they become probabilities. A lower temperature makes the most probable option even more probable, while a temperature greater than 1 reduces differences between logits, leading to a more creative output.

  • What is the purpose of the num_ctx parameter?

    -Num_ctx sets the context size for the model, determining how many tokens are in its context. A larger context size can remember more information but requires more memory.

  • Why might Ollama models start with a default context size of 2k tokens?

    -Ollama models start with a default context size of 2k tokens because supporting more tokens requires more memory, and many users have GPUs with limited memory, such as 8GB.

  • How can you increase the context size of an Ollama model?

    -You can increase the context size by creating a new modelfile with the desired num_ctx value and then running 'ollama create' with the new modelfile.

  • What is the role of stop words and phrases in controlling model output?

    -Stop words and phrases tell the model to stop outputting text when it encounters a specific word or symbol, preventing repetition in the generated text.

  • How does the repeat penalty parameter work to prevent repetition?

    -Repeat penalty adjusts the probability of a token if it has been used recently. If the logit is negative, it multiplies by the penalty, and if positive, it divides by the penalty, usually reducing the token's likelihood of being used again.

  • What is the purpose of the top k parameter?

    -Top k limits the list of candidate next tokens to the k most likely tokens (40 by default).

  • Can you explain the top p parameter and how it differs from top k?

    -Top p keeps the smallest set of most likely tokens whose probabilities add up to top p. Unlike top k, which keeps a fixed number of tokens, top p drops the low-probability tail whose combined probability is 1 minus top p.

  • What is the seed parameter used for in LLMs?

    -The seed parameter is used to make the random number generator predictable, ensuring that the model generates the same output every time when given the same input and seed.

  • How do mirostat tau and eta parameters influence the model's output?

    -Mirostat tau controls the balance between coherence and diversity, with a higher value resulting in more diverse outputs. Mirostat eta acts as a learning rate, with a higher rate causing the model to adapt faster to changes in the generated text.

  • What is the num_predict parameter and how does it affect text generation?

    -Num_predict is the maximum number of tokens to predict when generating text. Setting it to -1 allows the model to generate until completion, while -2 will fill the context, potentially cutting off at that point.

Outlines

00:00

🔍 Introduction to LLM Parameters

This paragraph introduces various parameters used in Large Language Models (LLMs) such as temperature, seed, num_ctx, and more, explaining their significance and how they influence the model's output. It emphasizes the importance of understanding these parameters for effective use of LLMs in tools like Ollama. The temperature parameter is highlighted for its role in adjusting the model's creativity by scaling logits before they are converted into probabilities, while num_ctx sets the context size, which is crucial for the model's memory capacity and performance. The paragraph also touches on the default context size in Ollama and how to modify it for models like llama 3.1.

05:05

🔧 Controlling Text Generation with Parameters

The second paragraph delves into the parameters that control text generation in LLMs, focusing on stop words, repeat penalty, repeat last n, top k, top p, min p, and tail free sampling. It explains how these parameters can be used to manage the model's output, prevent repetition, and influence the selection of tokens based on their probabilities. The discussion includes the impact of the repeat penalty on token probabilities and the role of repeat last n in defining the window for detecting repetitions. Additionally, it covers the function of top k and top p in narrowing down the list of potential next tokens and the use of min p and tail free sampling to further refine the token selection process.

10:07

🌡️ Advanced Parameters for Consistency and Control

This paragraph discusses advanced parameters such as seed and mirostat settings, which are used to control the randomness and consistency of LLM outputs. The seed parameter is crucial for generating consistent results, especially in testing scenarios, by making the random number generator predictable. The mirostat parameters, including tau and eta, are introduced for their role in balancing coherence and diversity in the model's text generation. The paragraph also explains the concepts of perplexity and surprise in the context of LLMs and how they relate to the model's output. Finally, it mentions the num_predict parameter, which determines the maximum number of tokens to predict during text generation.

🛠️ Configuring Parameters for Model Customization

The final paragraph provides guidance on configuring LLM parameters, both in the modelfile and through the default user interface. It clarifies that while some parameters can be adjusted mid-conversation using the command line, others like num_ctx cannot. The paragraph also notes that certain parameters are not yet documented due to their infrequent use. The video script concludes by inviting viewers to share their experiences with these parameters and thanking them for watching.


Keywords

💡Temperature

In the context of the video, 'temperature' refers to a parameter in large language models (LLMs) that influences the randomness of the model's output. A lower temperature makes the model's predictions more certain, favoring the most probable tokens, while a higher temperature increases randomness, allowing less probable tokens to have a chance of being selected. This is crucial for controlling the creativity and diversity of the model's responses, as exemplified when discussing how a higher temperature can make the model 'more creative in the way it answers a question.'

💡Num_ctx

'Num_ctx' stands for the context size parameter in LLMs, which determines how much of the conversation history the model takes into account when generating a response. The video explains that a larger context size allows the model to remember more information, but it also requires more memory. The default context size in Ollama is 2K tokens, which is a balance between memory constraints and the need for context. This parameter is important for ensuring the model's responses are relevant and coherent within a conversation.

💡Logits

Logits are the raw, unscaled scores that a model generates for each possible token before converting them into probabilities. The video describes how logits typically fall between -10 and 10 and are then scaled using a softmax function to become probabilities that sum to 1. Understanding logits is key to grasping how models decide which tokens are more likely to be the next word in a sequence.

💡Softmax Function

The softmax function is a mathematical formula used to convert logits into probabilities. As described in the video, after a model generates logits, the softmax function is applied to these logits to produce a probability distribution where the sum of all probabilities equals 1. This is essential for the model to make decisions about the most likely next token in a sequence.

💡Stop Words and Phrases

Stop words and phrases are user-defined terms that cause the model to stop generating text as soon as it outputs them. The video mentions that these can be used to cut off the output when the model emits a word or symbol that signals it is entering a repetition loop. This feature is useful for controlling output quality and keeping the generated text varied and engaging.

💡Repeat Penalty

The 'repeat penalty' is a parameter that adjusts the probability of a token being selected if it has been used recently. As explained in the video, if the logit for a token is negative, it is multiplied by the penalty, and if positive, it is divided by the penalty. This mechanism helps in reducing repetition in the model's output, making the conversation flow more naturally.

💡Top K

Top K is a parameter that determines the number of most probable tokens to consider when predicting the next token in a sequence. The video clarifies that by default, only the top 40 tokens are considered, but this number can be adjusted. This parameter helps in controlling the diversity of the model's responses by limiting the pool of potential tokens.

💡Top P

Top P is a parameter that filters the list of potential tokens based on a cumulative probability threshold. The video explains that a top P of 0.95, for instance, keeps only the most probable tokens whose cumulative probability reaches 0.95, excluding the low-probability tail that sums to 0.05. This helps in focusing the model's predictions on a narrower set of more probable tokens, influencing the coherence and focus of the generated text.

💡Min P

Min P is a parameter that sets a minimum threshold a token must clear to be considered in the prediction process. The video defines this threshold as a percentage of the highest logit value in the candidate list. This ensures that only tokens above a certain relative probability are considered, which can help in maintaining the quality and relevance of the model's output.

💡Tail Free Sampling (TFS_Z)

Tail Free Sampling, or TFS_Z, is a parameter that truncates the lower tail of the probability distribution. The video suggests starting with values close to 1 and adjusting downwards to control how much of the tail is cut off. This parameter affects the exploration of less probable tokens and can influence the creativity and diversity of the model's responses.

💡Seed

The 'seed' parameter is used to initialize the random number generator in a predictable way. By setting a seed, the model's output can be made deterministic, ensuring the same input leads to the same output every time. The video mentions this as useful for testing scenarios where consistent model behavior is required.

Highlights

Introduction to various parameters used in Large Language Models (LLMs) and their significance.

Explanation of temperature and its role in scaling logits to probabilities, affecting model creativity.

Discussion on num_ctx, the context size setting, and its impact on memory requirements.

How to customize context size in Ollama models beyond the default 2K tokens.

Use of stop words and phrases to control repetitive text generation in models.

Repeat penalty and repeat last n parameters to manage token repetition in text generation.

Top k parameter to determine the length of the list of potential next tokens.

Top p parameter for controlling the sum of probabilities in the list of potential tokens.

Min p parameter as an alternative to top p for setting a minimum logit value for token consideration.

Tail free sampling (tfs_z) for cutting off the tail of probabilities, affecting token selection.

The importance of seed in ensuring consistent model responses for testing or other purposes.

Introduction to mirostat parameters and their use in generating a list of next possible tokens.

Mirostat tau parameter for balancing coherence and diversity in generated text.

Mirostat eta parameter as the learning rate influencing model adaptation speed.

Num_predict parameter for setting the maximum number of tokens to predict in text generation.

Configuring parameters in the modelfile and some parameters' availability in the default UI.

Invitation for viewers to share their experiences with these parameters and to provide feedback.

Transcripts

00:00

There are a lot of parameters available when working with LLMs, like temperature and seed and num_ctx and more, but do you know how to use them all and what they mean? Well, let's go through them all now. And although I tend to focus on Ollama, the contents of this video apply just as much to any other tool as they do to Ollama, except for the parts where I talk about the implementation.

00:22

You can find the list of parameters in the Ollama documentation. Just go to the modelfile docs and then parameters. The first three talk about mirostat sampling, which is a strange place to start since it's probably not the most common one that you'll use. So instead let's start our list with what are probably the most common ones.

00:40

I think the first one in our list should be temperature. The way models work is that they guess the first word of the answer and then try to figure out what is the most likely next word of the answer. And they just keep repeating that over and over and over again. When coming up with that most likely next word, or actually token, it creates a list of words or tokens that could potentially go in that next spot. And they all have a probability assigned to them that shows how probable this option is as the next token.

01:08

But the model doesn't store probabilities. It actually works with what are called logistic units, or logits. Logits are unscaled when they are first generated, but they tend to be, especially with llama.cpp, which Ollama uses, between -10 and 10. These logits are converted with a softmax function to a series of numbers between 0 and 1, and if you add up all the numbers they add up to 1. So essentially they have become probabilities.

01:30

Temperature helps scale the logits before they become probabilities. Lower temperatures will spread the logits out, making the smaller numbers smaller still and the larger numbers even larger. This means what was most probable before has become even more probable now. But a temperature greater than 1 reduces the differences between logits, resulting in probabilities that are closer together. So tokens that had a lower probability before now have a higher chance of being chosen than they did before. The result of this is that it feels like the model becomes more creative in the way it answers a question.
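
To make the scaling concrete, here is a minimal Python sketch (not from the video) that divides a handful of made-up logits by the temperature and then runs them through softmax; the logit values are invented purely for illustration.

    import math

    def softmax_with_temperature(logits, temperature=1.0):
        """Scale logits by the temperature, then convert them to probabilities."""
        scaled = [l / temperature for l in logits]
        exps = [math.exp(s) for s in scaled]
        total = sum(exps)
        return [e / total for e in exps]

    # Hypothetical logits for four candidate tokens
    logits = [6.0, 4.0, 2.0, -1.0]

    for t in (0.5, 1.0, 1.5):
        probs = softmax_with_temperature(logits, t)
        print(t, [round(p, 3) for p in probs])

    # A temperature below 1 concentrates probability on the top token;
    # a temperature above 1 flattens the distribution, giving the
    # less likely tokens a better chance of being picked.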

02:10

Next in our list is num_ctx. This sets the context size for the model. When you look around at different models, or when you see the announcement for a brand new model, you might get excited when it says a context size of 128k tokens, or 8k tokens, or a million tokens. But then you start to have a long conversation with, let's say, llama 3.1 and wonder why it's forgetting information that was actually pretty recent.

02:33

In Ollama, every model starts out with a 2K context size. That means 2,048 tokens are in its context and anything older may get forgotten. The reason Ollama does this is that supporting more tokens in that context requires more memory, and a 128k context is going to require a lot of memory. From what we've seen in the Ollama Discord, a lot of people are starting out with GPUs with only 8GB of memory, and some are even smaller than that. And so that means it just can't possibly support the 128k tokens or even 8k tokens. For that reason, all Ollama models start out with a default context size of 2k, or 2,048 tokens.

03:12

So if you are excited to use llama 3.1 for its 128k context size, grab llama3.1 and then create a new modelfile that looks like this, with a FROM line pointing to llama 3.1 and a single parameter of num_ctx with a value of 131,072. Then run ollama create mybiggerllama3.1, or whatever you want to call it, then -f and point to the modelfile. This will create a brand new model that has a max context size of 128k tokens. Now you can run ollama run mybiggerllama3.1 and you will be in your new model.
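
If you prefer to script that step, here is a minimal Python sketch of the same workflow: it writes out the two-line modelfile described above and shells out to the ollama CLI. The file name and model name are placeholders, and it assumes the ollama CLI is installed and on your PATH.

    import subprocess
    from pathlib import Path

    # The modelfile the video describes: a FROM line plus one PARAMETER line.
    modelfile = Path("Modelfile.bigctx")
    modelfile.write_text(
        "FROM llama3.1\n"
        "PARAMETER num_ctx 131072\n"
    )

    # Build the new model from that modelfile.
    subprocess.run(
        ["ollama", "create", "mybiggerllama3.1", "-f", str(modelfile)],
        check=True,
    )

    # Then start an interactive session with it:
    #   ollama run mybiggerllama3.1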

03:47

But let's say you are playing around with a new model and can't find the max supported size for the context. The easiest way to figure this out is to run ollama show llama3.1. Near the top we see the context length, which is the max supported length of the model. The fact that we don't see a parameter of num_ctx defined tells us that it is set to use the Ollama default of 2,048 tokens. That can be a bit confusing.

04:13

Now let's say you want to use orca2 with a context length of 10,000 tokens. If you tell it to summarize something that is much longer than that, you will probably not get anything useful, because the model doesn't know how to handle it.

04:25

What do you think of this video so far? Click the like button if you find it interesting, and be sure to subscribe to see more videos like this in the future. I post another video in my free Ollama course every Tuesday and a more in-depth video like this every Thursday. Subscribing means you won't miss any future videos.

04:42

Now we move on to the next thing in the list: stop words and phrases. Sometimes you will ask a model to generate something and you see it starts to repeat itself, often using one strange word or symbol at the beginning of each repeat. So you can tell the model to stop outputting text when it sees that symbol. All the rest of the parameters only accept a single value, but stop allows for multiple stop words to be used.

05:04

There are two other parameters that deal with repeats: repeat penalty and repeat last n. We talked before about how a list of potential tokens is generated along with the probabilities that they are the most likely next token. The penalty will adjust that probability if the token or word or phrase has been used recently. If the logit for the token is negative, then the logit will be multiplied by the penalty; if the logit is positive, then it will be divided by the penalty. The penalty is usually greater than 1, resulting in that token being used less. But it's also possible to set it below 1, meaning that token will be used more often.

05:40

At this point you are probably wondering how large the window is for finding repeats. Well, that is what repeat last n is for. This defines the window. The default is 64, meaning it looks at the last 64 tokens. But you can set it to be a larger or smaller window. If you set it to 0, it disables the window, and if you set it to -1, then the window is the full context of the model.
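
As a rough illustration of the arithmetic described above, here is a small Python sketch (not Ollama's actual code) that penalizes the logits of any token seen in the last repeat_last_n tokens; the token IDs and logit values are made up.

    def apply_repeat_penalty(logits, recent_tokens, penalty=1.1, repeat_last_n=64):
        """Penalize tokens seen in the recent window, as described in the video:
        negative logits are multiplied by the penalty, positive logits divided by it."""
        if repeat_last_n == 0:
            return dict(logits)            # 0 disables the repeat window
        window = recent_tokens if repeat_last_n == -1 else recent_tokens[-repeat_last_n:]
        adjusted = dict(logits)
        for token_id in set(window):
            if token_id in adjusted:
                logit = adjusted[token_id]
                adjusted[token_id] = logit * penalty if logit < 0 else logit / penalty
        return adjusted

    # Hypothetical logits for three candidate token IDs
    logits = {101: 6.2, 207: 3.5, 309: -1.0}
    recent = [207, 42, 309, 207]           # tokens generated recently

    print(apply_repeat_penalty(logits, recent, penalty=1.3))
    # 207 and 309 become less likely: 3.5 / 1.3 ≈ 2.69 and -1.0 * 1.3 = -1.3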

06:01

Top k determines the length of the list of tokens to be generated as the potential next token. This defaults to 40, but can be anything you like. This is pretty simple, saying that only the 40 most likely tokens will be in the list.

06:13

Top p is a little more complicated. When you add up all the probabilities in the list, you should end up with 1. But when using top p, it will create the list of all the tokens whose probabilities add up to top p. So a top p of 0.95 will exclude all the tokens that, when you add up their probabilities, sum to 0.05.

06:35

Min p is an alternative to using top p. This looks at the source logits. It takes the value of the largest logit in the list, then figures out the value of min p percent of that largest value. All next tokens must have a logit value greater than that minimum value. So if the largest logit is 8 and min p is 0.25, then all logits must be greater than 2 to be considered, since 25% of 8 is 2.
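
Here is a compact Python sketch, with invented numbers, of how these three filters narrow the candidate list. It follows the descriptions above (top p on cumulative probability, min p on logits relative to the largest logit) rather than any particular library's exact implementation.

    def top_k_filter(candidates, k=40):
        """Keep only the k most probable candidates; `candidates` is a list of (token, prob)."""
        return sorted(candidates, key=lambda c: c[1], reverse=True)[:k]

    def top_p_filter(candidates, p=0.95):
        """Keep the most probable tokens until their cumulative probability reaches p."""
        kept, total = [], 0.0
        for token, prob in sorted(candidates, key=lambda c: c[1], reverse=True):
            kept.append((token, prob))
            total += prob
            if total >= p:
                break
        return kept

    def min_p_filter(logits, min_p=0.25):
        """Keep tokens whose logit exceeds min_p times the largest logit."""
        threshold = max(logits.values()) * min_p
        return {tok: lg for tok, lg in logits.items() if lg > threshold}

    candidates = [("cat", 0.60), ("dog", 0.25), ("fish", 0.10), ("bird", 0.04), ("rock", 0.01)]
    print(top_k_filter(candidates, k=3))     # the three most likely tokens
    print(top_p_filter(candidates, p=0.90))  # cat + dog + fish reach 0.95, the rest are dropped

    logits = {"cat": 8.0, "dog": 4.0, "fish": 1.5}
    print(min_p_filter(logits, min_p=0.25))  # threshold is 2, so only cat and dog survive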

07:01

OK, tail free sampling, or tfs_z. If you create a chart of all the probabilities, you will see it slowly approaching zero. Tail free sampling cuts off that tail at some point. As the number approaches 0, more of the tail gets cut off. If using this, you want to start really close to 1 and gradually come down. A value of 1 means none of the tail is cut off; 0.99 to 0.95 is a good starting range. The docs for this one I think are wrong, and I have an issue open to fix it.
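
For the curious, here is a rough Python sketch of the tail free sampling idea, based on the commonly described algorithm: take the sorted probability curve, look at the absolute second differences, normalize them, and cut the tail where their cumulative sum exceeds z. The real llama.cpp implementation may differ in its details, so treat this only as an illustration.

    def tail_free_filter(probs, z=0.95):
        """Rough sketch: drop the flat tail of the sorted probability curve."""
        probs = sorted(probs, reverse=True)
        if len(probs) <= 2 or z >= 1.0:
            return probs                                   # nothing to cut
        first = [probs[i] - probs[i + 1] for i in range(len(probs) - 1)]
        second = [abs(first[i] - first[i + 1]) for i in range(len(first) - 1)]
        total = sum(second) or 1.0
        weights = [s / total for s in second]

        cutoff = len(probs)                                # default: keep everything
        cum = 0.0
        for i, w in enumerate(weights):
            cum += w
            if cum > z:
                cutoff = i + 1                             # cut once the curvature flattens out
                break
        return probs[:max(cutoff, 1)]

    print(tail_free_filter([0.40, 0.25, 0.15, 0.08, 0.05, 0.04, 0.02, 0.01], z=0.90))
    # Keeps the head of the distribution and drops the long, flat tail.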

07:32

Next one on our list is seed. One of the strengths of large language models is that they generate text much like the way that we humans generate text. They have a word, and then they think of the next word, and the next word, and the next word. And they spit out words that make the most sense as the next word. And this means that at the beginning of the sentence, they don't really know how the sentence is going to end. And because they're dealing with probabilities, there's a decent chance that when you ask the same question twice, the answer isn't always going to be the same. This really is one of the benefits of working with large language models. But sometimes it's not a benefit.

08:07

Sometimes you want the model to answer the same way every single time. Maybe in those cases, large language models aren't really the right tool to use. But if you want to use an LLM and you want the answer to be the same every single time, then you need to set the seed for the random number generator. Large language models use random number generators to help figure out that next token, and setting the seed makes the random number generator predictable. A sequence of numbers generated one time is going to be the same sequence of numbers every other time if that seed is consistent. One situation where it makes sense to use this is testing, where you want to ensure the model answers the same in your test cases.
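
A quick Python sketch of the underlying idea: a seeded random number generator produces the same sequence every run, so sampling from the same next-token probabilities with the same seed picks the same tokens. The token names and probabilities here are invented.

    import random

    tokens = ["cat", "dog", "fish"]
    probs = [0.6, 0.3, 0.1]            # hypothetical next-token probabilities

    def sample_sequence(seed, n=5):
        rng = random.Random(seed)      # seeded generator: a predictable sequence
        return [rng.choices(tokens, weights=probs, k=1)[0] for _ in range(n)]

    print(sample_sequence(seed=42))
    print(sample_sequence(seed=42))    # identical to the first run
    print(sample_sequence(seed=7))     # a different seed gives a different sequence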

08:46

Now let's cover the mirostat parameters. These are used by a different method of coming up with the list of next possible tokens. With mirostat, there are essentially three modes to choose from. If you don't set mirostat, then top p, top k, tfs, and the rest of the parameters are used. But you can set mirostat to 1 or 2 to use the mirostat tau and eta parameters instead, resulting in more or less proportional probabilities.

09:12

When working with mirostat, you often see the terms perplexity and surprise come up. They are related but different. Perplexity is a statistical measure derived from the probability of the next word appearing in a sequence. A lower perplexity indicates the model assigns higher probabilities to more correct words in a sequence. A higher perplexity indicates that the model assigns higher probabilities to less correct words.
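
To ground the term, here is a short Python sketch of the standard perplexity calculation, the exponential of the average negative log probability the model assigned to the tokens it produced; the probabilities are invented.

    import math

    def perplexity(token_probs):
        """Perplexity = exp of the average negative log-probability of each token."""
        avg_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
        return math.exp(avg_neg_log)

    confident = [0.90, 0.80, 0.95, 0.85]    # high probabilities assigned to the chosen words
    unsure = [0.20, 0.10, 0.30, 0.15]

    print(round(perplexity(confident), 2))  # low perplexity, close to 1
    print(round(perplexity(unsure), 2))     # much higher perplexity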

09:40

Surprise is a human emotion used as a metaphor to help understand this. A model with a higher perplexity is said to be more surprised by its own output, which I think is a really strange way of talking about it. It feels like the folks who came up with that don't interact with other humans much... which may or may not be true.

10:01

Mirostat tau controls the balance between coherence and diversity of the output. Tau sets the desired level of perplexity in the generated text. A higher value of tau will result in more diverse outputs. It usually has a range of 3 to 5 and defaults to 5.

10:18

Mirostat eta is the learning rate used. A higher rate means that the model will react and adapt faster than a slower rate. This means that if the model chooses a less probable token and eta is high, the model will continue to stick with the newer style of the text. A lower eta is more stable and less likely to overreact to temporary changes.
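
For readers who want to see what tau and eta actually do, here is a heavily simplified Python sketch modeled on the mirostat 2.0 loop as described in the mirostat paper and llama.cpp: tau is the target surprise, and eta controls how aggressively the running threshold mu is corrected after each sampled token. Treat it as an illustration, not Ollama's exact code, and note that the candidate probabilities here are static and invented.

    import math
    import random

    def mirostat_v2_step(probs, mu, tau=5.0, eta=0.1, rng=random.Random(0)):
        """One sampling step: drop tokens whose surprise (-log2 p) exceeds mu,
        sample from the rest, then nudge mu toward the target surprise tau."""
        candidates = {tok: p for tok, p in probs.items() if -math.log2(p) <= mu}
        if not candidates:                       # always keep at least the most likely token
            tok, p = max(probs.items(), key=lambda kv: kv[1])
            candidates = {tok: p}
        total = sum(candidates.values())
        tokens = list(candidates)
        weights = [candidates[t] / total for t in tokens]
        choice = rng.choices(tokens, weights=weights, k=1)[0]

        surprise = -math.log2(probs[choice])     # how surprising the chosen token was
        mu = mu - eta * (surprise - tau)         # learning-rate-style correction toward tau
        return choice, mu

    probs = {"cat": 0.50, "dog": 0.30, "fish": 0.15, "rock": 0.05}
    mu = 2 * 5.0                                 # mu conventionally starts at 2 * tau
    for _ in range(3):
        token, mu = mirostat_v2_step(probs, mu, tau=5.0, eta=0.1)
        print(token, round(mu, 3))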

10:39

The final parameter we will cover is num_predict. This is the maximum number of tokens to predict when generating text. If you set this to -1, it will keep generating until it's done, and -2 will fill the context. This doesn't mean it will always consume the entire amount, but rather that's the point where it will get cut off.

10:58

Earlier I showed how you can configure these parameters in the modelfile. Some of the parameters can also be configured in the default command-line interface, by entering /set parameter and then the parameter name and value. But not all parameters can be set that way. For instance, setting num_ctx that way won't work. Basically, the ones that make sense to change mid-conversation can be set at the command line.
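
If you drive Ollama from code rather than the interactive prompt, the same parameters can be passed per request. Below is a hedged sketch that posts to Ollama's local REST API using only the Python standard library; it assumes an Ollama server is running on the default port 11434 and that a llama3.1 model has been pulled. The option names mirror the modelfile parameters discussed above.

    import json
    import urllib.request

    payload = {
        "model": "llama3.1",
        "prompt": "Name three uses for a large context window.",
        "stream": False,
        "options": {
            "temperature": 0.7,     # creativity vs. determinism
            "num_ctx": 8192,        # context size in tokens
            "top_k": 40,
            "top_p": 0.95,
            "repeat_penalty": 1.1,
            "seed": 42,             # same seed + same input -> same output
            "num_predict": 256,     # cap on the number of generated tokens
        },
    }

    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["response"])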

11:20

And that's pretty much all the parameters that folks tend to use. There are a few others that aren't documented yet, but they are so rarely used that it doesn't make sense to cover them.

11:32

What do you think? Do you use any of these parameters when you use large language models? Let me know in the comments below. Thanks so much for watching. Goodbye.

