They made Python faster with this compiler option

Hussein Nasser
7 May 2024 · 26:42

Summary

TL;DR: The video discusses the impact of compiler optimizations on Python's performance, focusing on Fedora Linux's decision to compile Python with the '-O3' optimization level. The change yields a performance boost, with some cases showing up to a 4% increase in speed. The video delves into the compiler optimization levels -O1, -O2, and -O3, explaining how they affect code execution, memory usage, and CPU cache efficiency. It also covers function inlining and its role in enhancing performance at the cost of a larger binary. The goal is to give viewers a deeper understanding of how compiler optimizations can significantly influence software performance.

Takeaways

  • 🐍 Fedora Linux's decision to compile Python with the -O3 optimization option has made Python run significantly faster on the platform.
  • βš™οΈ The -O3 optimization level is known to enhance performance, with speed improvements ranging from 1.6% to 4% in various cases.
  • πŸ” The discussion opens up the topic of compiler optimizations, particularly focusing on function inlining, which is a key aspect of the -O3 optimization.
  • πŸ“š The script explains the basics of compiling, which is the process of converting high-level language code into machine-level instructions.
  • πŸ’Ύ The script touches on the importance of registers and memory in the compiling process, highlighting the trade-offs between using limited register resources and memory.
  • πŸ”§ Compiler optimization levels -O1, -O2, and -O3 are explained, with each level offering different levels of optimization and performance gains.
  • πŸ“ˆ The script provides a comparison of binary sizes and performance between Python compiled with -O2 and -O3, showing a larger binary size with -O3 but with improved performance.
  • πŸš€ Function inlining, a part of -O2 and -O3 optimizations, can significantly speed up code execution by reducing the overhead of function calls.
  • πŸ’₯ The -O3 optimization includes aggressive function inlining and Single Instruction, Multiple Data (SIMD) optimizations, which can further boost performance.
  • πŸ“Š Benchmark results indicate that the -O3 optimization generally provides a performance boost, with improvements shown in various tests and workloads.

Q & A

  • What is the significance of the -O3 optimization option in compiling Python on Fedora Linux?

    -The -O3 optimization option in GCC significantly improves the performance of Python on Fedora Linux by enabling more aggressive compiler optimizations. This can result in speed improvements ranging from 1.6% to 4% in various benchmarks and workloads.

  • Why did Fedora switch from using -O2 to -O3 optimization for Python?

    -Fedora switched to -O3 optimization for Python to align with Upstream Python's release builds, which are known to be faster due to this more aggressive optimization level.

  • What are the trade-offs when using the -O3 optimization level?

    -While -O3 can significantly improve performance, it also increases the size of the binary due to aggressive function inlining, which can lead to higher memory usage and potential performance deterioration on systems with limited memory.

  • How does function inlining as part of -O2 and -O3 optimization work?

    -Function inlining replaces function calls with the actual function code, reducing the overhead of function calls and improving cache utilization. -O2 performs inlining for small functions, while -O3 does more aggressive inlining, potentially inlining almost all functions.

  • What is the difference between the compiler optimization levels -O1, -O2, and -O3?

    -The optimization levels -O1, -O2, and -O3 in GCC represent different levels of compiler optimizations. -O1 enables basic optimizations, -O2 includes further optimizations like function inlining for small functions, and -O3 includes even more aggressive optimizations, often resulting in the largest performance gains but also larger binary sizes.

  • How does the compilation process translate high-level code into machine-level instructions?

    -The compilation process translates high-level code into machine-level instructions by going through several stages, including parsing the code, optimizing it, and then generating the assembly code that corresponds to the machine-level instructions the CPU can execute.

  • What is the role of registers in the compilation process?

    -Registers play a crucial role in the compilation process as they are fast storage locations within the CPU used for holding temporary values during computation. Compilers aim to use registers efficiently to minimize memory access, which can slow down the execution.

  • Why might aggressive function inlining in -O3 optimization not always result in performance improvements?

    -Aggressive function inlining in -O3 optimization might not always result in performance improvements because it can significantly increase the binary size, leading to more memory usage and potential cache misses, which can offset the benefits of reduced function call overhead.

  • What is the impact of the -O3 optimization on the binary size of Python?

    -The -O3 optimization increases the binary size of Python due to the inclusion of more aggressive function inlining, which can result in a larger executable size compared to the -O2 optimization level.

  • How do compiler optimizations like -O3 affect the development and deployment of software?

    -Compiler optimizations like -O3 can affect software development and deployment by potentially increasing the performance of the software, but also by increasing the size of the binaries, which can impact the distribution and memory usage of the software.

Outlines

00:00

🐍 Python Performance Boost on Fedora Linux

The script discusses the performance improvement of Python on Fedora Linux due to the adoption of the -O3 compiler optimization option. This option enhances the speed of Python, with improvements ranging from 1.6% to 4%. The speaker uses this news as a springboard to delve into compiler optimizations, specifically function inlining. The script mentions that Fedora 41 will come with Python compiled using the -O3 optimization, a change approved by the Fedora Engineering and Steering Committee. The speaker contrasts this with previous versions that used -O2, which is considered more stable. The script also touches on the speaker's personal preference for Red Hat, Ubuntu, and macOS over Fedora, highlighting the ease of use and the tools that Linux distributions bundle on top of the kernel.

05:01

πŸ› οΈ Compiler Optimizations and Function Inlining

This section provides an in-depth look into compiler optimizations, focusing on the GCC options -O1, -O2, and -O3. The script explains that these options dictate how the compiler translates high-level code into machine-level instructions. The speaker clarifies the process of compiling, which involves translating code into machine language that the CPU can execute. The script also delves into the concept of registers and memory, highlighting registers as a scarce resource that the compiler must manage efficiently. The discussion then moves to the first level of optimization, -O1, which involves local register substitution to reduce unnecessary memory writes. The script also introduces the idea of function inlining, which is a more aggressive optimization technique used in higher optimization levels.
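
To make the register-substitution idea concrete, here is the three-line example the speaker walks through, written as C, with comments sketching roughly what an unoptimized build does versus what -O1 is allowed to do. The pseudo-assembly is illustrative only; real GCC output depends on the target CPU and compiler version.

    int compute(void) {
        int a = 1;      /* -O0, roughly: mov r0, 1  ; store r0 -> [a]   (memory write)    */
        int b = 2;      /* -O0, roughly: mov r1, 2  ; store r1 -> [b]   (memory write)    */
        int c = a + b;  /* -O0, roughly: load [a], load [b], add, store result -> [c]     */
        return c;       /* -O1, roughly: keep a and b in registers, drop the dead stores, */
    }                   /*               or simply fold the whole function to "return 3"  */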

10:02

🔧 Deep Dive into Compiler Optimization Techniques

The script continues the exploration of compiler optimizations, particularly focusing on the -O2 and -O3 levels. It explains that -O2 optimization includes function inlining but is limited to small functions to maintain stability. The speaker then discusses the -O3 level, which is more aggressive, performing function inlining on almost all functions, leading to a significant increase in binary size. This results in better cache hits and performance but at the cost of increased memory usage. The script also touches on the potential downsides of aggressive inlining, such as decreased performance on devices with limited memory due to increased swapping. Additionally, the script mentions the use of Single Instruction Multiple Data (SIMD) optimizations in -O3, which can further enhance performance if the CPU supports it.
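
A sketch of what inlining effectively does to a caller (the real transformation happens on the compiler's intermediate representation, not on your source; the "after" function below only shows what the generated code behaves like):

    /* Before: each call to add() jumps away, sets up a stack frame, and returns. */
    static int add(int a, int b) { return a + b; }

    int sum_calls(void) {
        return add(1, 2) + add(3, 4) + add(5, 6);
    }

    /* After inlining -- what -O2 does for small functions and -O3 does far more
     * aggressively: the calls disappear, the bodies are pasted in place, the
     * instructions stay contiguous (better cache behavior), and the binary grows. */
    int sum_inlined(void) {
        return (1 + 2) + (3 + 4) + (5 + 6);   /* likely folded further, to just 21 */
    }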

15:03

📊 Benchmarks and Impact of Optimization Levels

This part of the script presents benchmark results comparing the performance of Python compiled with -O2 and -O3 optimization levels. The speaker shows that the -O3 level generally provides a modest performance improvement, with speed increases ranging from 1.04 to 1.09 times across various benchmarks. The script also compares the binary sizes of Python compiled with the two optimization levels, with -O3 resulting in a larger binary due to aggressive function inlining. The speaker concludes by emphasizing that these optimizations are beneficial for performance but may not be necessary for all users, especially those already using upstream Python versions that are compiled with the -O3 option.

20:05

📖 Conclusion and Learning Resources

In the final paragraph, the speaker wraps up the discussion on compiler optimizations for the CPython interpreter and reflects on the learning experience. They express enthusiasm for understanding the intricacies of compiler optimization levels -O1, -O2, and -O3. The script also includes a call to action for the audience to check out the speaker's operating system course for further learning. The speaker provides a discount link for the course and encourages the audience to explore the content, highlighting the comprehensive nature of the course and the passion behind its creation.

Keywords

💡Compiler Optimizations

Compiler optimizations refer to the techniques used by a compiler to improve the performance, size, or reliability of the compiled code. In the video, the speaker discusses the impact of different optimization levels (O1, O2, O3) on the performance of Python compiled on Fedora Linux. The theme revolves around how these optimizations can lead to significant speed improvements, with real-world examples provided from the Fedora project's decision to use the -O3 optimization level for compiling Python packages.

💡Function Inlining

Function inlining is an optimization technique where the body of a function is directly inserted into its caller, eliminating the function call overhead. This concept is central to the video's discussion on compiler optimizations, particularly under the O2 and O3 optimization levels. The speaker explains how inlining can improve performance by reducing the overhead associated with function calls, although it may increase the size of the compiled binary.
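
Inlining decisions are normally made by the compiler based on the optimization level, but GCC also exposes per-function hints. The snippet below is a small illustration using GCC-specific attributes; it is an aside, not something the video covers, which only discusses how -O2 and -O3 decide automatically.

    /* Ask GCC to inline this call even at lower optimization levels. */
    static inline __attribute__((always_inline)) int square(int x) {
        return x * x;
    }

    /* Forbid inlining, e.g. to keep the function visible in a profiler. */
    static __attribute__((noinline)) int cube(int x) {
        return x * x * x;
    }

    int demo(int n) {
        return square(n) + cube(n);  /* square() is pasted in place; cube() stays a real call */
    }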

💡Fedora Linux

Fedora Linux is a Linux distribution developed by the Fedora Project. It is mentioned in the script as the operating system that has decided to compile Python with the -O3 optimization level. This decision is significant as it showcases Fedora's commitment to performance improvements, and it serves as the backdrop against which the discussion of compiler optimizations is framed.

💡GCC (GNU Compiler Collection)

The GNU Compiler Collection, commonly known as GCC, is a compiler system produced by the GNU Project. In the video, GCC is the tool used to compile Python with different optimization levels. The speaker discusses how GCC's optimization options, such as -O1, -O2, and -O3, influence the performance and size of the compiled Python interpreter.
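
As a concrete, minimal way to see these levels in action (the file and function names here are hypothetical, not from the video), you can compile the same C file at each level and diff the generated assembly:

    /* levels.c -- compile with, for example:
     *   gcc -O0 -S levels.c -o levels_O0.s   (no optimization)
     *   gcc -O1 -S levels.c -o levels_O1.s   (basic optimizations, register substitution)
     *   gcc -O2 -S levels.c -o levels_O2.s   (adds inlining of small functions, and more)
     *   gcc -O3 -S levels.c -o levels_O3.s   (aggressive inlining, auto-vectorization)
     * The -S flag stops after code generation and writes readable assembly. */
    static int add(int a, int b) { return a + b; }

    int main(void) {
        int c = add(1, 2);
        return c;
    }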

💡Benchmark

A benchmark in the context of the video refers to a test or a set of tests designed to measure the performance of a system or component. The speaker references benchmarks to demonstrate the performance improvements achieved by compiling Python with the -O3 optimization level on Fedora Linux, indicating that certain workloads ran roughly 1.04x to 1.09x faster.

💡CPython

CPython is the canonical implementation of the Python programming language. It is written in C and is what most Python users interact with when they use the language. The video discusses how CPython is compiled with different optimization levels, which affects its performance. The script uses CPython as an example to illustrate the practical implications of compiler optimizations.

💡Instruction Set

An instruction set refers to the basic set of commands that a processor can understand and execute. The video touches on the concept when explaining how compilers translate high-level code into machine-level instructions that the CPU can execute. The speaker uses this to highlight the complexity of the compilation process and the importance of optimizations in generating efficient code.

💡Memory Access

Memory access in the video pertains to how a program retrieves data from memory, which can be a performance bottleneck if not managed efficiently. The speaker discusses how compiler optimizations, like function inlining, can reduce memory access overhead by minimizing the need to jump between different parts of memory during function calls.

💡Cache Miss

A cache miss occurs when a processor needs to access data that is not in its cache, leading to a slower retrieval from main memory. The video explains how certain compiler optimizations can reduce cache misses by keeping related code and data closer together in memory, thus improving performance.
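
A classic way to observe the effect outside of function calls (this example is an illustration of the general idea, not something from the video): walking a C matrix row by row touches consecutive addresses and stays in cache, while walking it column first jumps a full row's worth of bytes per step and misses far more often.

    #define N 1024

    /* Cache-friendly: the inner loop reads consecutive memory (row-major order in C). */
    long sum_by_rows(int m[N][N]) {
        long s = 0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += m[i][j];
        return s;
    }

    /* Cache-hostile: the inner loop strides N * sizeof(int) bytes each step,
     * so many reads miss the cache and pay the trip to main memory. */
    long sum_by_cols(int m[N][N]) {
        long s = 0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += m[i][j];
        return s;
    }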

💡Single Instruction, Multiple Data (SIMD)

SIMD is an execution paradigm that allows multiple data elements to be processed in parallel with a single instruction. The video mentions SIMD in the context of the -O3 optimization level, where the compiler can take advantage of SIMD instructions to perform operations on multiple data points simultaneously, thus speeding up execution times for certain types of workloads.
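
A loop shaped like the one below is the textbook candidate for this: at -O3, GCC may auto-vectorize it so that several elements are added per instruction, provided the target CPU has the relevant SIMD support. Whether it actually vectorizes depends on the compiler version and target flags; the file name is hypothetical and the -fopt-info flag shown is just one way to check what happened.

    /* gcc -O3 -march=native -fopt-info-vec-optimized -c vec_add.c
     * (-fopt-info-vec-optimized asks GCC to report which loops it vectorized) */
    void vec_add(const float *restrict a, const float *restrict b,
                 float *restrict out, int n) {
        for (int i = 0; i < n; i++) {
            out[i] = a[i] + b[i];  /* one scalar add per iteration in the source;
                                      several additions per SIMD instruction once vectorized */
        }
    }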

💡Upstream Python

Upstream Python refers to the original, unmodified version of Python as distributed by the Python Software Foundation. The speaker contrasts this with the version of Python that comes with Fedora, which has traditionally been compiled with different optimization levels. Upstream Python is mentioned to highlight the default optimizations used in the official Python releases, which align with those now being adopted by Fedora.

Highlights

Fedora Linux's Python performance has improved due to a small but powerful compiler option.

The performance increase varies, with some cases showing a 4% improvement and others as low as 1.6%.

The news serves as a springboard into compiler optimizations, specifically function inlining.

Fedora 41 will come with a compiled version of Python using the -O3 optimization option.

The -O3 optimization is already used by Upstream Python for its release builds, known to make Python significantly faster.

Previous versions of Fedora compiled Python with the -O2 optimization, which is considered more stable.

Function inlining is a compiler optimization technique that can improve performance by reducing function call overhead.

The -O1 optimization level performs local register substitutions to optimize code.

The -O2 optimization level includes function inlining for small functions, reducing memory access and improving cache hits.

The -O3 optimization level is more aggressive with function inlining, potentially increasing binary size but improving performance.

Aggressive function inlining can lead to larger binary sizes, which might impact memory usage and cache efficiency.

Single Instruction Multiple Data (SIMD) is utilized by -O3 to perform multiple data operations in one instruction, if the CPU supports it.

The benchmarks show that -O3 optimization can lead to speedups of up to 1.09x in certain cases.

Users of Upstream Python are already benefiting from -O3 optimizations as it's the default compilation option.

The video discusses the trade-offs between different optimization levels and their impact on performance and binary size.

The presenter expresses enthusiasm for learning about compiler optimizations and their practical applications.

The video concludes with a call to action for viewers to check out the presenter's operating system course for more in-depth knowledge.

Transcripts

Running Python on Fedora Linux is now "significantly" faster thanks to this small but powerful compiler option. I put air quotes on "significantly" because that really depends on what you think is significant; in this particular case it's 4%, and in some cases 1.6%. But I want to use this particular piece of news to open the gate to a very important topic, and that topic is compiler optimizations, specifically something known as function inlining. I want to dive deep into that. How about we dive into this news? It's news time for the Backend Engineering Show.

This comes from Phoronix, my favorite website for operating systems, hardware, and optimization news. Let's read the blurb and discuss: Fedora cleared to build the Python package with -O3 optimizations. Some of you might know what this is, some of you might not; I'll explain all of it. The Fedora Engineering and Steering Committee (FESCo) has signed off on the plans for Fedora 41 as the upcoming release. Honestly, I didn't use Fedora much; I used Red Hat, and Ubuntu is my main operating system. I know it's not cool to say that, but it is what it is. The other operating system I use, as you know, is a Mac, and at work it's primarily Windows, uncool as that is.

Fedora 41 is going to come with a version of Python compiled with the -O3 option. Fedora, as a Linux distro, comes with shipped tools, because that's what a distribution is: a nice bouquet of tools on top of the kernel. The kernel is the core of the whole thing (I talk about this in my OS course), and the distro says: let's make a nice UI, let's make partitioning easy, let's make formatting easy. You want to get started? You don't have to run fdisk and figure out which sector to start with or what the beginning and end logical sectors are; the distro takes care of all that with a nice UI, and it includes a version of curl for you, a version of GCC, and a version of Python. So we're looking at a version of Fedora that ships Python, but until now that Python was compiled with a different option.

We're talking about CPython here; the C means it's written in C. The interpreter, the logic that takes your Python scripts and interprets them, is itself written in C, and compiling that C code was done with the -O2 optimization option. We'll talk about what -O1, -O2, and -O3 mean in a minute. The plan approved for Fedora 41 is to use -O3 instead of -O2 for better optimizations. Quoting the article: the -O3 optimization level is what upstream Python uses for its release builds, and it's proven that it makes Python "significantly faster" (the "significantly faster" is in quotes as I read it) across a range of benchmarks and workloads. So it's already being used in upstream Python: if you're on Linux with no Python installed and you install Python from the official source, you get a Python that is already compiled with -O3. You get the fast version; the vanilla, out-of-the-box upstream Python comes with that. The version that ships with Fedora did not; it was compiled with -O2, which is considered more stable.

The article talks about how it's faster, but what does that actually mean? That's what I want to talk about here. -O1, -O2, and -O3 are GCC compiler options: give the compiler C code and it will compile it with a certain level of optimization. First, though, what is compiling? Compiling is taking a language written in a high-level, human-readable form and translating it into the specific machine-level instructions that a CPU executes; one line of code can produce a hundred machine instructions. The compiler is the intermediary here (there is also the linker, which actually produces the executable, but it's out of the equation for this discussion). You need to specify the CPU you're compiling for, because compiling for ARM is different from Intel, which is different from AMD, and you need to know whether the target is 32-bit or 64-bit, since that determines the instruction width: 4 bytes versus 8 bytes.

This translation gives the compiler a lot of freedom. Say I wrote: a equals 1, b equals 2, c equals a plus b. Those are the instructions I wrote. The compiler can translate them however it wants as long as the final output is correct; it does not have to translate line by line. That is the job of the compiler, and that's where the art of building compilers comes in. The compiler produces sets of instructions, and CPUs work with lightning-fast local scratch pads called registers. The CPU cannot operate on anything outside a register: even if a value is in memory, it must be loaded into a register before the control unit can hand it to the ALU to do the addition or multiplication or whatever the operation is. So registers are the scarce resource, and memory, to the compiler, is the almost unlimited resource. (Flip the equation and memory becomes the scarce resource, as in a database application, while the disk becomes the "unlimited" one. Take "unlimited" with a grain of salt; it just means more plentiful.)

So let's compile that little program a few ways. First, no optimization: the easiest way is a one-to-one translation. This is assembly, but you can think of each assembly line as an instruction. Move 1 into register r0, then store r0 to the address of a, because the user said "int a = 1". That store is a hit to memory: you are actually going to memory, and the average cost of a memory access, depending on whether you hit the same row in DRAM, is around 100 nanoseconds (I talk about that in my OS course). Then move 2 into another register, r1, and store r1 to the address of b. That's another memory access, somewhat cheaper, because those two values probably land in the same DRAM row and the same 4K virtual memory page. Then for c: add r0 and r1 into a third register, r3, and store r3 to the address of c. Without optimization, three lines of code became about six instructions, plus a lot of boilerplate instructions I skipped. For those listening, the program is simply: int a = 1, int b = 2, int c = a + b.

The first level of optimization, -O1, looks at this and says: wait a minute, you stored 1 in a, but a is never really used from memory; technically it's used in the addition, but I know what you're trying to do, and the output of this whole thing is just c, the sum of 1 and 2. So it strikes out the stores: we still put 1 in r0, but the value never needs to live in memory at all. Same for 2 in r1; there's no point in storing b because b is not used anywhere else. Then we just add the registers and store only c. With that, two instructions are gone from a very simple program: we don't write to memory if we don't need to. This is called local register substitution, or register allocation: use the registers as much as possible. The catch is that it only works if you have enough registers; the compiler reasons about how many registers it has and what the functions need, and if it runs out it has to spill values back to memory, like working with temporary variables that you store, set aside, and pick up again. There are other optimizations in this family too, such as common subexpression elimination: if you evaluated a + b and later wrote a + b again, the compiler notices it already computed that value and avoids doing an extra add.

Then there's -O2. I'm not going to go through everything it does, or this video would get very long, but essentially it performs function inlining, and only for small functions. Function inlining works like this: say I put that addition in a function called add, and in main I call add(1, 2), add(3, 4), add(5, 6), and so on. The compiler will replace those add calls with the content of the function itself, with the parameters substituted, if the function is small enough. Why? Because otherwise the function and its caller are compiled into two separate sets of instructions loaded into different parts of memory, and the CPU keeps jumping back and forth: go execute the function, come back, execute the rest, jump again. Every function call carries a slight overhead: you set up a stack frame, you save the work you've been doing, the old base pointer, the old stack pointer, the return address if needed. That saving is, to me, the first cost: you're going back to memory, writing to memory, another 100 nanoseconds or so. The other cost, the one that really hurts, is that you're jumping between different portions of your process memory, and that introduces cache misses. The CPU loves sequential reads: when it reads, it pulls in a beautiful 64 bytes at a time, and all of this lives in 4K virtual memory pages. But if you execute one line and then jump all the way to line 9,901 because that's where the function happens to live, you lose the beautiful L1 and L1i cache you had built up, and you keep bouncing back and forth. Inlining literally copies the function's code into the caller. Yes, it bloats your code, but now when you read, everything is right there: while executing main you don't jump anywhere else, you don't set up a stack frame, you don't save anything, there is no function call at all, no parameters to pass, none of that overhead. Each individual overhead is tiny, but it adds up.

That's -O2. What does -O3 do? Aggressive inlining: it inlines almost every function (I don't know exactly how it decides), which significantly increases the size of the binary. That's the cost, but you get those beautiful cache hits in return. Of course, aggressive inlining can also deteriorate performance, because it puts pressure on memory, especially on smaller devices: if the process gets too large, you start swapping. It's not really a problem for Python per se, because the text area, the code, is mapped once and is probably marked as not swappable (that's how I would do it: don't swap code, let the code stay hot in memory), and every Python process shares that memory. That's the beauty of shared memory, which you can't easily do without virtual memory. Another thing -O3 enables is SIMD, single instruction multiple data, which I've talked about on this channel. Say you want to sum the numbers 1 through 8. Because add takes two operands, you'd normally issue one add after another, and as we've seen, each of those is really more than one instruction. SIMD uses vectors instead: put 1, 2, 3, 4 in one vector register, 5, 6, 7, 8 in another, and add the two vectors with a single instruction, using special registers, which is only applicable if the CPU supports SIMD. Python does a lot of this kind of work, so where it's applicable, for example when adding arrays, the interpreter will use the SIMD path and take advantage of it. That's why -O3 is faster.

Now, how much faster? Before we get to that, look at the binary sizes. For those listening, Python compiled with -O2 is about 16,000 kilobytes, call it 16 or 17 megabytes. This comparison is old, from July 2019, by CPython core developer Inada Naoki (shout-out), so who knows, it may be bigger now. The -O3 build is, expectedly, larger: around 20 megabytes. So you're talking about 3 or 4 extra megabytes; whether that's worth it is up to you. But the benchmarks: the Fedora Project's benchmark list, 2to3, async generators, and all of that, shows most workloads running about 1.04x to 1.09x faster. You're saving a few milliseconds here, 100 milliseconds there, 200 milliseconds there; it all adds up. And by the way, if you're not on Fedora, this doesn't affect you at all, because you're probably using an upstream build that is already compiled with that option, so you're good.

I thought I'd talk about this because it's just interesting to understand these compile options. That's what I wanted to cover: -O1, -O2, and -O3 for the CPython interpreter. I'm going to link all the references for you to read, because I have to credit everybody; the news isn't mine. I've seen the comments, and people don't seem to like Fedora for some reason; it seems they feel Fedora wasn't innovating enough and they were making fun of this article. But I just loved learning something new. I knew about these optimizations themselves, but honestly I didn't know about the levels, so I spent some time researching and I loved it; anything engineering, I love it. Check out my operating systems course at oscourse.win; that's a shortcut that takes you straight to Udemy with a coupon. It's over 20 hours (21 or 22 now, I've added more lectures), and I spent two years working on that course. I learned so much, and you'll see my enthusiasm in it, probably to the point of getting sick of it, but I hope you check it out. Thank you so much, and see you on the next one.

