tree-sitter explained
Summary
TLDR: This video script demystifies Tree-sitter, a parser generator tool that powers various editor functionalities by focusing on incremental parsing and syntax tree analysis. It clarifies that Tree-sitter is not an interpreter or compiler but a library for text interaction, enabling quick feedback and error recovery. The script introduces the concept of writing grammars that transform into parsers, the incremental and error recovery benefits for editors, and the powerful query engine that allows asking complex questions about code structure. It showcases Tree-sitter's capabilities in enhancing editor features like highlighting, navigation, and structural editing, and its potential for broader code analysis and tooling across multiple languages.
Takeaways
- Tree-sitter is a parser generator tool that can transform a grammar into a parser, which is then used by applications for syntax analysis.
- Unlike LSP (Language Server Protocol), Tree-sitter focuses on parsing one file at a time for quick and incremental feedback, rather than understanding the whole project semantically.
- Tree-sitter is not a compiler or interpreter; it is a library for interacting with text to create a syntax tree and perform queries on it.
- The incremental aspect of Tree-sitter allows for efficient parsing as text is edited, without needing to recompute the entire syntax tree with each keystroke.
- Error recovery is a key feature of Tree-sitter, enabling it to handle broken code by isolating minimal errors while keeping the rest of the file functional.
- The query engine in Tree-sitter is powerful for asking complex questions about the syntax tree, which can be used for various editor features like highlighting and structural editing.
- To utilize Tree-sitter, one must write a grammar in a JavaScript-like DSL, which then generates a parser in C that can be compiled and used by the application.
- The generated parser is C source that compiles into a shared library, which can be easily embedded into other applications, making Tree-sitter highly portable and efficient.
- The visual representation of the syntax tree in the script demonstrates how Tree-sitter can be used for detailed code analysis and manipulation.
- The script warns that Tree-sitter uses Lisp-like code for queries, which may be a point of contention for some developers but is essential for its functionality.
- Tree-sitter's wide adoption and support for many languages make it a valuable tool for building cross-language tooling for code analysis, linting, and more.
Q & A
What is Tree-sitter and how does it differ from LSP?
- Tree-sitter is a parser generator tool that provides incremental parsing and querying capabilities for a single file at a time. LSP (Language Server Protocol), by contrast, focuses on understanding the whole project semantically and provides features like definitions, references, and completions. Tree-sitter is concerned only with the text of a single file and does not know about types or packages outside of that file.
What are the main features of Tree-sitter?
- The main features of Tree-sitter are its ability to generate parsers from grammars, incremental parsing that stays efficient while editing, error recovery that lets it handle incomplete code, and a powerful query engine for asking questions about the syntax tree.
How does Tree-sitter handle incremental parsing?
- Tree-sitter handles incremental parsing by updating the syntax tree as the user types, rather than recomputing the entire tree with each keystroke. This makes it efficient for providing real-time feedback while editing code.
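The principle can be sketched with a toy example. This is not Tree-sitter's actual algorithm, and it assumes a made-up line-oriented "language" in which every line parses on its own; the names `LineCache` and `parse_line` are hypothetical. It only illustrates that an edit should cost work proportional to what changed.

```python
from typing import Dict, List, Tuple

Node = Tuple[str, str]  # (node kind, source text)


def parse_line(line: str) -> Node:
    """Classify one line; stands in for real parsing work."""
    text = line.strip()
    if text.isdigit():
        return ("int_literal", text)
    if len(text) >= 2 and text[0] == '"' and text[-1] == '"':
        return ("string_literal", text)
    return ("identifier", text)


class LineCache:
    """Reparse only lines whose text was not seen in the previous parse."""

    def __init__(self) -> None:
        self.cache: Dict[str, Node] = {}
        self.parsed_count = 0  # how many lines actually needed parsing

    def parse(self, source: str) -> List[Node]:
        new_cache: Dict[str, Node] = {}
        tree: List[Node] = []
        for line in source.splitlines():
            # Reuse a node from this pass or from the previous parse.
            node = new_cache.get(line) or self.cache.get(line)
            if node is None:
                node = parse_line(line)
                self.parsed_count += 1
            new_cache[line] = node
            tree.append(node)
        self.cache = new_cache
        return tree


editor = LineCache()
editor.parse('1\n"a"\nfoo')            # first parse: all three lines are new
tree = editor.parse('1\n"ab"\nfoo')    # one line edited: only that line is reparsed
print(editor.parsed_count)             # 4
```

After the second call the counter has grown by one, not three, which is the behaviour an editor wants on every keystroke.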
What is error recovery in the context of Tree-sitter?
- Error recovery in Tree-sitter refers to its ability to continue parsing and understanding the structure of code even when errors are present. It tries to find the smallest erroneous region and enclose it, so the rest of the file remains well-highlighted and functional.
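A rough illustration of the outcome, not of Tree-sitter's mechanism: the sketch below assumes one statement per line and uses Python's built-in `ast` module as a stand-in parser. The function name `parse_with_recovery` is hypothetical. A line that fails to parse is wrapped in an `ERROR` entry, while the surrounding lines still produce usable nodes.

```python
import ast
from typing import List, Tuple


def parse_with_recovery(source: str) -> List[Tuple[str, str]]:
    """Parse line by line; isolate broken lines instead of giving up."""
    result: List[Tuple[str, str]] = []
    for line in source.splitlines():
        if not line.strip():
            continue
        try:
            module = ast.parse(line)
        except SyntaxError:
            # Enclose only the broken line; keep going with the rest.
            result.append(("ERROR", line))
            continue
        if not module.body:
            continue  # comment-only line: nothing to report
        result.append((type(module.body[0]).__name__, line))
    return result


# The second line is still being typed, so its parenthesis is unclosed.
nodes = parse_with_recovery("x = 1\ny = (2 +\nprint(x)")
print(nodes)
```

Only the middle entry is marked `ERROR`; the first and last lines are still recognised as an assignment and an expression statement.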
What is the role of the query engine in Tree-sitter?
- The query engine in Tree-sitter allows users to ask questions about the syntax tree generated from the code. It enables the retrieval of specific information from the tree, which can be used to power features in editors or other applications.
How is Tree-sitter integrated into code editors?
- Tree-sitter is embedded in code editors as a library that powers features such as syntax highlighting, code folding, and structural editing. It is already used in editors like Neovim, Helix, Zed, and Emacs.
Why are incremental parsing and error recovery important in code editors?
- Incremental parsing and error recovery are important because they allow the editor to provide immediate feedback and maintain a good editing experience even as the user types and potentially introduces errors into the code.
How does Tree-sitter generate a parser from a grammar?
- Users write a grammar in a JavaScript-like Domain-Specific Language (DSL). The 'tree-sitter' command-line tool then turns that grammar into a parser, producing a 'parser.c' file.
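To make "a grammar becomes a parser" concrete, here is a miniature in Python. It is an assumption-laden sketch: real Tree-sitter grammars are written in grammar.js and the generated parser is C, whereas this version only borrows the rule names `seq` and `choice` and builds simple matching functions. The `token` helper and the tiny expression grammar are hypothetical.

```python
import re
from typing import Callable, Optional, Tuple

Result = Optional[Tuple[object, int]]  # (node, next position) or None
Parser = Callable[[str, int], Result]


def token(kind: str, pattern: str) -> Parser:
    """Match a regular expression at the current position, skipping spaces."""
    regex = re.compile(r"\s*(" + pattern + ")")

    def parse(text: str, pos: int) -> Result:
        m = regex.match(text, pos)
        if m is None:
            return None
        return (kind, m.group(1)), m.end()

    return parse


def seq(kind: str, *parts: Parser) -> Parser:
    """Match every part in order and group the results under one node."""
    def parse(text: str, pos: int) -> Result:
        children = []
        for part in parts:
            result = part(text, pos)
            if result is None:
                return None
            node, pos = result
            children.append(node)
        return (kind, children), pos

    return parse


def choice(*options: Parser) -> Parser:
    """Return the result of the first option that matches."""
    def parse(text: str, pos: int) -> Result:
        for option in options:
            result = option(text, pos)
            if result is not None:
                return result
        return None

    return parse


# The "grammar": an expression is `operand + operand` or a single operand.
int_literal = token("int_literal", r"\d+")
string_literal = token("string_literal", r'"[^"]*"')
operand = choice(int_literal, string_literal)
binary_expression = seq("binary_expression", operand, token("operator", r"\+"), operand)
expression = choice(binary_expression, operand)

print(expression('"a" + 1', 0))
```

The grammar is just the last five assignments; everything above them plays the role of the generator. Changing the rules changes the parser without touching the matching machinery.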
What is the significance of the query language used in Tree-sitter?
- The query language in Tree-sitter is significant because it allows specific parts of the syntax tree to be selected. It uses a Lisp-like syntax and enables complex queries that filter and capture nodes based on their type and their position within a parent node.
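The video's example of "strings on the left side of a binary expression" can be mimicked on a hand-built tree. In Tree-sitter's query syntax this would look roughly like `(binary_expression left: (interpreted_string_literal) @string)`, assuming the Go grammar's node and field names. The Python below is not the query engine; the `Node` class, the `capture_field` helper, and the shortened node kinds are hypothetical stand-ins.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List


@dataclass
class Node:
    kind: str
    text: str = ""
    fields: Dict[str, "Node"] = field(default_factory=dict)
    children: List["Node"] = field(default_factory=list)

    def walk(self) -> Iterator["Node"]:
        """Yield this node and every descendant, depth first."""
        yield self
        for child in list(self.fields.values()) + self.children:
            yield from child.walk()


def capture_field(root: Node, parent_kind: str, field_name: str, child_kind: str) -> List[Node]:
    """Capture nodes of child_kind that sit in field_name of a parent_kind node."""
    captures = []
    for node in root.walk():
        if node.kind != parent_kind:
            continue
        child = node.fields.get(field_name)
        if child is not None and child.kind == child_kind:
            captures.append(child)
    return captures


def binary(left: Node, right: Node) -> Node:
    return Node("binary_expression", fields={"left": left, "right": right})


# Hand-built tree for the two expressions:  "a" + 1   and   2 + "b"
tree = Node("source_file", children=[
    binary(Node("string", '"a"'), Node("int", "1")),
    binary(Node("int", "2"), Node("string", '"b"')),
])

left_strings = capture_field(tree, "binary_expression", "left", "string")
print([n.text for n in left_strings])  # ['"a"']
```

The string `"b"` is skipped because it sits in the `right` field, which is the kind of structural condition a regular expression over raw text cannot express reliably.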
How can Tree-sitter be used for syntax highlighting in editors?
- Tree-sitter can be used for syntax highlighting by writing queries that capture specific nodes in the syntax tree, such as integer literals or function declarations. The editor assigns a meaning to each capture name and applies the corresponding style or highlight color to the captured parts of the code.
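A small sketch of that division of labour, assuming the captures have already been produced by a query. The capture names, the byte offsets, and the theme are invented for illustration, and the tag-style output is only a stand-in for real editor styling.

```python
from typing import Dict, List, Tuple

Capture = Tuple[str, int, int]  # (capture name, start offset, end offset)


def highlight(source: str, captures: List[Capture], theme: Dict[str, str]) -> str:
    """Wrap each captured span in a tag named after the theme's style."""
    out = []
    pos = 0
    for name, start, end in sorted(captures, key=lambda c: c[1]):
        style = theme.get(name)
        if style is None or start < pos:
            continue  # unknown capture or overlapping span: leave unstyled
        out.append(source[pos:start])
        out.append(f"<{style}>{source[start:end]}</{style}>")
        pos = end
    out.append(source[pos:])
    return "".join(out)


source = 'x := 42 + len("hi")'
# Hypothetical query results for the line above.
captures = [("number", 5, 7), ("function.builtin", 10, 13), ("string", 14, 18)]
theme = {"number": "blue", "function.builtin": "purple-bold", "string": "green"}

print(highlight(source, captures, theme))

# Re-mapping a capture name changes the styling without touching the query.
print(highlight(source, captures, {**theme, "number": "purple-bold"}))
```

Because the styling depends only on capture names, queries written for different grammars can share one theme as long as they emit the same names.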
What are some benefits of using Tree-sitter for code analysis or linting?
- Grammars, and therefore parsers, already exist for many languages, so developers can write custom queries and captures without creating a new parser for each language. This makes it easier to build tools that analyze and lint code across different programming languages.
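The video's example question, "where are all the imports in this file?", can be sketched as follows. This is a stand-in, not Tree-sitter: it uses Python's own `ast` module on Python source, and the function name `find_imports` is hypothetical. With Tree-sitter the parser and the query would change per language while the surrounding tool stays the same.

```python
import ast
from typing import List, Tuple


def find_imports(source: str) -> List[Tuple[int, str]]:
    """Return (line number, module name) for every import in the source."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                found.append((node.lineno, alias.name))
        elif isinstance(node, ast.ImportFrom):
            # Relative imports such as `from . import x` have no module name.
            found.append((node.lineno, node.module or "."))
    return sorted(found)


example = "import os\nfrom typing import List\n\ndef f():\n    import json\n"
print(find_imports(example))  # [(1, 'os'), (2, 'typing'), (5, 'json')]
```

The import nested inside the function is found as well, since the search walks the tree rather than scanning the top of the file.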
Outlines
Understanding Tree-Sitter: The Parser Generator Tool
This paragraph introduces the concept of Tree-Sitter as a parser generator tool that can demystify the process of parsing code within an editor. It clarifies that Tree-Sitter is not a compiler or interpreter and emphasizes its focus on parsing individual files incrementally for quick feedback. The paragraph explains that Tree-Sitter includes automatic incremental parsing and error recovery, which are challenging to implement in custom parsing libraries. It also introduces the powerful query engine that allows users to ask questions about the syntax tree generated from the text. The purpose of Tree-Sitter in editors is highlighted, including its adoption by various editors for its incremental parsing, error recovery, and querying capabilities.
Building and Using Tree-Sitter Parsers
The second paragraph delves into the technical process of creating a Tree-Sitter parser. It describes how to write a grammar in a JavaScript-like DSL, which is then transformed into a C parser file by the tree-sitter CLI. The resulting files, including parser.c, are discussed, along with their benefits: the only dependency is a C compiler, and C's foreign function interface makes the parser easy to embed in other languages. The paragraph also touches on compiling the parser into a shared library that can be loaded at runtime, allowing languages to be added or updated without recompiling the editor. A small Go file example is presented to illustrate the syntax tree that Tree-Sitter generates, and a query editor is introduced as a tool for selecting and highlighting specific elements within the code.
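The foreign function interface point can be illustrated without a grammar at hand. The sketch below loads C symbols at runtime with Python's `ctypes` and calls the C library's `strlen`. It assumes a Linux or macOS process where `ctypes.CDLL(None)` exposes the C library. A compiled grammar would be loaded the same way from its shared library file, where it is expected to export a function named after the language, such as `tree_sitter_go`; that part is not exercised here.

```python
import ctypes

# Load the symbols already linked into this process (assumption: Linux or
# macOS). A compiled grammar would instead be opened by its file path.
libc = ctypes.CDLL(None)

# Declare the C signature: size_t strlen(const char *s);
strlen = libc.strlen
strlen.argtypes = [ctypes.c_char_p]
strlen.restype = ctypes.c_size_t

# Call straight into C: no separate process, no serialization.
length = strlen(b"tree-sitter")
print(length)  # 11
```

The call crosses into C directly inside the host process, which is why embedding a C parser this way is cheap compared with talking to an external tool.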
The Power of Queries in Tree-Sitter
This paragraph explores the capabilities of Tree-Sitter's query system, demonstrating how it can be used to perform complex selections and manipulations within the syntax tree. It shows how queries can filter and select specific elements, such as strings on the left side of binary expressions, which would be difficult or impossible with regex. The paragraph also discusses the practical applications of queries in editors, such as syntax highlighting, language embedding, indenting, and structural editing. The potential for Tree-Sitter to enhance editor functionality with accurate and efficient parsing is emphasized, as well as its utility beyond just editors for tasks like code analysis, linting, and highlighting due to its wide language support.
Keywords
Tree-sitter
LSP (Language Server Protocol)
Incremental Parsing
Error Recovery
Query Engine
Syntax Tree
Grammar
Highlighting
Embedding
Foreign Function Interface (FFI)
Capture Groups
Highlights
Tree-sitter demystifies the process of understanding code within an editor by focusing on quick, incremental feedback from a single file.
Tree-sitter is distinct from LSP (Language Server Protocol), which provides semantic understanding of a whole project rather than individual files.
Tree-sitter operates as a parser generator tool, allowing users to write a grammar that is transformed into a parser for specific applications.
The parsers created by Tree-sitter come with incremental parsing and error recovery built in, with no extra work in the grammar.
A powerful aspect of Tree-sitter is its query engine, enabling users to ask complex questions about the syntax tree generated from the text.
Tree-sitter is integrated into various editors like Neovim, Helix, and Zed, enhancing their capabilities for code understanding and manipulation.
Incremental parsing means Tree-sitter does not have to recompute the entire syntax tree on each keystroke.
Error recovery in Tree-sitter allows parsing to continue even when the code is incomplete or contains errors, so the rest of the file stays usable.
Queries in Tree-sitter enable editors to retrieve information from the syntax tree, facilitating features like highlighting and code analysis.
Grammars for Tree-sitter are written in a JavaScript-like DSL, which is then turned into C files for efficient embedding in applications.
The use of C for Tree-sitter's generated parsers ensures minimal dependencies and broad compatibility with various development environments.
Tree-sitter's parsers can be loaded dynamically at runtime, allowing for flexible language support without recompiling the editor.
A demonstration of the query editor shows how to select and filter specific elements within a code file, such as binary expressions.
Tree-sitter's capabilities extend beyond simple text selection, allowing for complex queries that can identify patterns within code structures.
In Neovim, Tree-sitter powers features like syntax highlighting, which can be customized based on user-defined queries and captures.
Tree-sitter enables advanced editor features such as structural editing, code folding, and navigation, all based on the syntax tree.
Beyond editors, Tree-sitter can be used for code analysis and linting tools across multiple languages due to its extensive grammar support.
Tree-sitter's focus on the current file's syntax tree allows developers to build applications on top of it for various practical coding solutions.
Transcripts
treesitter before but you may not
understand how it works and the whole
process seems somewhat mystical and
confusing and interlaced with a lot of
different tooling that might exist
inside your editor but today I hope that
I can demystify that for you and show a
few pertinent examples that really give
a basis for understanding the power of
what treesitter brings to the table but
before we go any further we need to stop
and if you're thinking wow I'm finally
going to understand more stuff about LSP
you're in the wrong video I did that
video already in fact treesitter and LSP
are unrelated LSP is focused on
understanding your code at a semantic
level and what I mean by that is it
doesn't just look at the current file
that you're editing or even the current
directory but it's trying to understand
your whole project it's trying to
understand the packages that you've
installed and everything else along
those lines and provide definitions and
references and completions that are
accurate that's not the purpose of
treesitter treesitter is focused on one
file at a time and getting quick and
incremental feedback back from that file
so along that line as well treesitter is
not a compiler it doesn't know about the
types it doesn't know about the packages it
it knows nothing about the rest of those
aspects of your project it only knows
about the text and beyond that as well
sometimes people sort of get this in
their mind that treesitter is like doing or
running other things no it's not an
interpreter it's just a library for
interacting with this text so what is
treesitter then the first aspect of
treesitter is that it's a parser
generator tool which means you have the
ability to write a grammar and that
grammar will be able to be transformed
into a parser that parser is then loaded
up by whatever application is embedding
treesitter and then used inside of that
application beyond that the parsers
though automatically get this
incremental and error recovering part of
the grammar you don't have to write that
extra as part of the grammar making
process instead you get that for free
which is generally quite difficult to do
when you're writing a parsing library
for a particular language and then
lastly and probably the most powerful
part of the whole thing is the query
engine and in short what the query
engine does is part of this idea of what
tree sitter is as this framework of
generating a syntax tree or sometimes
called an AST or CST and then allowing
you to use queries to ask questions
about that tree that's sort of the high
level thing that you need to keep in
your mind as you're understanding
treesitter it just cares about text
right and it helps us get a tree from
that text and then ask questions about
that
text so why tree sitter in an editor
well you might actually already be using
treesitter you don't know it's inside
neovim or helix or Zed or emacs and
some other editors as well and so why
are all of these editors adopting
treesitter as sort of this underlying
library to power a bunch of different
aspects well there's three main things
the first one is the incremental aspect
that I mentioned before generally when
you're editing text you're not
completely deleting the file and
rewriting it on every keystroke usually
you're adding just one or two more
characters to the file at a time and
what you'd like to do is you'd like the
thing that understands the structure of
the text to not have to recompute the
entire tree every time you type so
that's the incremental aspect of
treesitter the second aspect that's
really important is related to error
recovery the error recovery isn't just
about your code being broken because you
wrote it I mean that's somewhat of a
given the error recovery we're talking
about here is that as you're writing a
particular line of code it's broken all
the way until you type the last
semicolon or close the parentheses or
close the end of your list right it's
broken all the way until you're done
editing well what would kind of suck is
as you're typing and editing treesitter
just says oh there's an error in the
file I can't parse it I'm not going to
do anything that would be really bad
instead what treesitter does is it tries
its best to find the minimal amount of
error and then enclose that there and
the rest of your file still stays well
highlighted and working well and then
the third and final part that's very
important is queries and queries that's
the part where we're able to ask
questions to the tree and retrieve that
information out which allows your editor
or other application to do useful work
with the knowledge that's in your syntax
tree so how does this happen right we
got to start at the beginning we have to
write our grammar so we write a grammar
in JavaScript and I put quotes here not
because you know JavaScript isn't a
language although you know an
alternative timeline where that was the
case would be kind of interesting no
because what this actually is is it's
really a JavaScript-like DSL it takes
the JavaScript and then generates a
parser.c file based on that
JavaScript file and that happens through
the tree-sitter CLI so you run a command
and you take your grammar.js and out
comes a parser.c in fact the
structure looks something like this
where there's a couple JSON files with
metadata a parser.c sometimes a
scanner.c if you had to write custom C
code and then the parser.h file to
include so you're probably wondering
why C isn't that like outdated and
illegal now well yes but sometimes
projects still can use it for a good
reason what's really nice about this is
the only dependency you have for these
files is a C compiler which is really
good most places where you're building
and using an editor have a C compiler
available beyond that as well there's a
really good FFI for those who don't know
foreign function interface story with C
for lots of different languages and what
this means is that it's very easy to
embed treesitter inside of another language
and directly access those bindings
efficiently so you don't have to
serialize everything and send it over
some multiple different processes you
can actually include it wholesale inside
of your application this makes it really
fast and really powerful as well as
being very very portable right like I
said if you have an editor being built
you probably have a C compiler somewhere
this makes it really easy to use tree
sitter wherever you're building your
project but okay so that's what we have
so once you've done the tree sitter CLI
and you've generated this parser.c you
can tell whatever C compiler you like
using hey can you please compile this
for me
and sure enough it'll work just fine
and you get out this parser.so or
whatever file you called it and now with
this shared Library we can load this
into the tree sitter runtime if you will
as part of the application that is using
treesitter so if you're not super
familiar with shared libraries or you
don't know that that's okay right but
basically what I'm saying is you can
kind of take this later and load it
almost like a plugin okay right you can
take another grammar of another language
you can load it up you can link it and
you can use that at runtime very very
powerful to be able to easily add or
remove or update languages without
recompiling your whole editor or getting
an editor update or anything like
that so before we do a little bit of
exploring I do have to warn you I know
some of you are like allergic to lisp
code and I would say uh you're probably
just wrong scheme's pretty cool I know
all the lisp fans are going to comment
right now and agree with me but just as
a warning that's what's coming up so
let's take a quick look at a small
example go file what we have here on the
left of the screen is a small go file
that just has a few expressions and some
other stuff going on and what we have on
the right is our tree this tree is
generated by tree sitter okay and you
can see that if we were to click around
inside here when I click to a function
declaration the entire function
declaration is highlighted if I click on
the identifier just the identifier of
that function is highlighted if I click
on the parameter list that's highlighted
although there's no parameters in there
right now and then I have the block
inside as well and treesitter is what
is giving us this tree okay and neovim
knows what the tree looks like and can
interact with it and then print this out
for us for easy viewing and some sort of
debugging but this is where the power
starts to be shown what we have here now
is the query editor and this will be
something that I think you'll
be able to pick up as we go even if
you're not super familiar with lisp so
what we have here is let's say select
something like an integer literal and I
want to say that this looks like an
integer you see that the only two things
that are highlighted in the whole file
are the integers we could do the same
thing for something like an interpreted
string literal we can call that a string
and you'll see just the two strings are
highlighted but that part is impressive
but not incredible where it starts
becoming really powerful is our ability
to ask more complicated queries to the
tree so for example let's say we were
looking for all the binary expressions
like this so you'll see we have two
binary Expressions which in this case
just means binary two right expression
so that's like a left thing and a right
thing and we're adding them together but
we can actually ask something more
interesting than this we can ask
something like what about what is on the
left side of this here and let's say we
just say I want this to be called
a node well now we see we've only
selected the left side of the binary
expression but we can even filter this
down more we can use the same sort of
idea that we had here before of
selecting strings and we can say okay
well I want to find all the strings that
are on the left side of a binary
expression and we can do that just like
this now this would
be anywhere from very difficult to
impossible with regex right I know I have
some HTML parsing regex enjoyers in the
audience but this is very very
simple and easy to do now that we're
able to interact with the code as a tree
okay so that's the basics of queries but
what do we actually do with the queries
what what is neovim going to do with the
responses and the captures that we've
been getting let's take a quick look if
we go to go highlights here you can see
that if I am going to select this file
here this is basically the file that
powers the highlighting for go inside of
neovim and if what I do is I say I
actually want to say that int literals
should be highlighted instead like a
function when I save this file and
re-edit you'll notice the highlighting for
the integers changed from integer to
function but if we do something instead
like we say we want it to be a built-in
function and we open this again my
built-in functions are highlighted
purple and
bold so the important bit to understand
here is that this part of the query this
aspect of the query the capture group
and the name neovim is sort of providing
meaning to these captures so we have
different grammars which turn into
different parsers we have to write
multiple different queries but as long
as those queries return the same
captures we can use those to power
things like
highlighting so what is it used for in
an editor I just gave you an example of
how you could do highlighting it doesn't
just do the highlighting it can also
manage things like finding where to
embed other languages inside right if
you're writing HTML and you enter a
script tag it should embed JavaScript it
also can do things like indenting you
can do structural editing if you're not
familiar with that that's things like
being able to say daf in neovim and go
delete around function you could select
a particular scope or you could tell
your editor to switch to parameters all
of those can then instead of being
powered by regex or best guess you can
actually edit the tree or select from
the tree directly which is really
powerful and powered just by those
queries right you would have a separate
query to tell you which things are text
objects you would have a separate query
to tell you about selections you could
do all of those different things they're
all powered by the same mechanism of
writing queries with a particular set of
captures you could do things like tell
you if you're inside of a class or
inside of a function you can fold based
on where functions start a lot of these
things before were all powered just by
regexes or best guess efforts or naming
conventions whereas now when you use
tree sitter you can actually ask those
questions of the tree itself which is
really powerful beyond that you can use
treesitter outside of editors because of
the wide adoption that treesitter has been
getting there are grammars for many many
many different languages which lets you
write tooling around code analysis or
linting or highlighting because the
grammars already exist which means you
already have parsers for all the
languages so you can just make up your
own set of queries and captures and do
something like I want to find where all
of the Imports are in a particular file
okay well you use the Imports you write
a query for that and then you build this
framework the same way that neovim is a
framework around tree sitter and
highlighting you could do the same thing
for your own types of problems this is
really great for writing stuff that you
want to expand to a bunch of languages
but you don't want to write a custom
parser or custom runtime or a custom
whatever in each of those
languages and so that's really what tree
sitter is about and I want to remind you
the primary thing treesitter is dealing
with is the code in your particular file
that's all the information that it has
and it will give you a tree and let
you ask questions of those trees but
because we're software developers we can
build things on top of that like
highlighting or navigation or folding or
whatever the particular application that
you're looking to do but tree sitter is
focused explicitly on that parser
incremental compilation error recovery and
queries thanks everybody I hope you
really like the video feel free to leave
a like or subscribe or uh come hang out
on Twitch with us bye everyone