CodeSearch.jl
CodeSearch.jl is a package for semantically searching Julia code. Unlike plain string search and regex search, CodeSearch performs search operations after parsing. Thus the search patterns j"a + b" and j"a+b" are equivalent, and both match the code a +b.
julia> using CodeSearch
julia> j"a + b" == j"a+b"
true
julia> findfirst(j"a+b", "sqrt(a +b)/(a+ b)")
6:9
The other key feature in this package is wildcard matching. You can use the character * to match any expression. For example, the pattern j"a + *" matches both a + b and a + (b + c).
julia> Expr.(eachmatch(j"a + *", "a + (a + b), a + sqrt(2)"))
3-element Vector{Expr}:
:(a + (a + b))
:(a + b)
:(a + sqrt(2))
Here we can see that j"a + *" matches in multiple places, including some that nest within each other.
Finally, it is possible to extract the "captured values" that match the wildcards.
julia> m = match(j"a + *", "a + (a + b), a + sqrt(2)")
CodeSearch.Match((call-i a + (call-i a + b)), captures=[(call-i a + b)])
julia> m.captures
1-element Vector{JuliaSyntax.SyntaxNode}:
(call-i a + b)
julia> Expr(only(m.captures))
:(a + b)
How to use this package
- Create Patterns with the @j_str macro or the CodeSearch.pattern function.
- Search an AbstractString or a JuliaSyntax.SyntaxNode for whether and where that pattern occurs with generic functions like occursin, findfirst, findlast, or findall, or extract the actual Matches with generic functions like eachmatch and match.
- If you extracted an actual match, access relevant information using the public syntax_node and captures fields, convert to a SyntaxNode, Expr, or AbstractString via constructors, index into the captures directly with getindex, or extract the indices in the original string that match the capture with indices.
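The three steps above can be strung together as in the following minimal sketch. It uses only the calls described in this document; the sample string and the variable names are illustrative assumptions.

```julia
using CodeSearch

# Illustrative input; any Julia source string works here.
code = "sqrt(a + b) / (a + c)"

# Step 1: create a pattern. The `*` wildcard matches any expression.
pat = j"a + *"

# Step 2: ask whether and where the pattern occurs.
found = occursin(pat, code)
ranges = findall(pat, code)

# Step 3: extract the matches and inspect each one.
for m in eachmatch(pat, code)
    whole = Expr(m)               # the matched expression
    captured = Expr(m[1])         # what the wildcard matched
    where = CodeSearch.indices(m) # position in the original string
    println(whole, " captured ", captured, " at ", where)
end
```

If only the yes/no answer or the positions are needed, steps 1 and 2 suffice and no Match objects have to be constructed.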
Reference
The following are manually selected docstrings
CodeSearch.@j_str — Macro
j"str" -> Pattern
Construct a Pattern, such as j"a + (b + *)", that matches Julia code.
The * character is a wildcard that matches any expression, and matching is insensitive to whitespace and comments. Only the characters " and * must be escaped, and interpolation is not supported.
See pattern for the function version of this macro if you need interpolation.
Examples
julia> j"a + (b + *)"
j"a + (b + *)"
julia> match(j"(b + *)", "(b + 6)")
CodeSearch.Match((call-i b + 6), captures=[6])
julia> findall(j"* + *", "(a+b)+(d+e)")
3-element Vector{UnitRange{Int64}}:
1:11
2:4
8:10
julia> match(j"(* + *) \* *", "(a-b)*(d+e)") # no match -> returns nothing
julia> occursin(j"(* + *) \* *", "(a-b)*(d+e)")
false
julia> eachmatch(j"*(\"hello world\")", "print(\"hello world\"), display(\"hello world\")")
2-element Vector{CodeSearch.Match}:
Match((call print (string "hello world")), captures=[print])
Match((call display (string "hello world")), captures=[display])
julia> count(j"*(*)", "a(b(c))")
2
julia> match(j"(* + *) \* *", "(a+b)*(d+e)")
CodeSearch.Match((call-i (call-i a + b) * (call-i d + e)), captures=[a, b, (call-i d + e)])
CodeSearch.pattern — Function
pattern(str::AbstractString) -> Pattern
Function version of the j"str" macro. See @j_str for documentation.
Examples
julia> using CodeSearch: pattern
julia> pattern("a + (b + *)")
j"a + (b + *)"
julia> match(pattern("(b + *)"), "(b + 6)")
CodeSearch.Match((call-i b + 6), captures=[6])
julia> findall(pattern("* + *"), "(a+b)+(d+e)")
3-element Vector{UnitRange{Int64}}:
1:11
2:4
8:10
julia> match(pattern("(* + *) \\* *"), "(a-b)*(d+e)") # no match -> returns nothing
julia> occursin(pattern("(* + *) \\* *"), "(a-b)*(d+e)")
false
julia> eachmatch(pattern("*(\"hello world\")"), "print(\"hello world\"), display(\"hello world\")")
2-element Vector{CodeSearch.Match}:
Match((call print (string "hello world")), captures=[print])
Match((call display (string "hello world")), captures=[display])
julia> count(pattern("*(*)"), "a(b(c))")
2
julia> match(pattern("(* + *) \\* *"), "(a+b)*(d+e)")
CodeSearch.Match((call-i (call-i a + b) * (call-i d + e)), captures=[a, b, (call-i d + e)])
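Because pattern takes an ordinary string, it can be combined with string interpolation, which the j"str" macro does not support. The sketch below is one way to do that; call_pattern is a hypothetical helper written for this example, not part of CodeSearch.

```julia
using CodeSearch: pattern

# Hypothetical helper (not part of CodeSearch): build a pattern for a call to
# the function named `fname` with a single wildcard argument.
call_pattern(fname::AbstractString) = pattern("$(fname)(*)")

# Search for a call to `sqrt`; the wildcard captures its argument.
m = match(call_pattern("sqrt"), "a + sqrt(2)")

matched = Expr(m)      # the whole matched call as an Expr
argument = Expr(m[1])  # the captured argument
```

As in the examples above, a literal * inside an ordinary string still has to be escaped as \\*.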
CodeSearch.Pattern — Type
Pattern <: AbstractPattern
A struct that represents a Julia expression with wildcards. When matching Patterns, it is possible for multiple matches to nest within one another.
The fields and constructor of this struct are not part of the public API. See @j_str and pattern for the public API for creating Patterns.
Methods accepting Pattern objects are defined for eachmatch, match, findall, findfirst, findlast, occursin, and count.
Extended Help
The following are implementation details:
The expression is stored as an ordinary SyntaxNode in the internal syntax_node field. Wildcards in that expression are represented by the symbol stored in the internal wildcard_symbol field. For example, the expression a + (b + *) might be stored as Pattern((call-i a + (call-i b + wildcard)), :wildcard).
CodeSearch.Match — Type
struct Match <: AbstractMatch
    syntax_node::JuliaSyntax.SyntaxNode
    captures::Vector{JuliaSyntax.SyntaxNode}
end
Represents a single match to a Pattern, typically created from the eachmatch or match function.
The syntax_node field stores the SyntaxNode that matched the Pattern, and the captures field stores the SyntaxNodes that match each wildcard in the Pattern, indexed in the order they appear.
Methods that accept Match objects are defined for Expr, SyntaxNode, AbstractString, indices, and getindex.
Examples
julia> m = match(j"√*", "2 + √ x")
CodeSearch.Match((call-pre √ x), captures=[x])
julia> m.captures
1-element Vector{JuliaSyntax.SyntaxNode}:
x
julia> m[1]
line:col│ tree │ file_name
1:7 │x
julia> Expr(m)
:(√x)
julia> AbstractString(m)
" √ x"
julia> CodeSearch.indices(m)
4:9
CodeSearch.indices — Function
indices(m)
Return the indices into the source data structure that a view is derived from.
Examples
julia> m = match(j"x/*", "4 + x/2")
CodeSearch.Match((call-i x / 2), captures=[2])
julia> indices(m)
4:7
julia> c = m[1]
line:col│ tree │ file_name
1:7 │2
julia> indices(c)
7:7
Generic functions
Many functions that accept Regexes also accept CodeSearch.Patterns and behave according to their generic docstrings. Here are some of those supported functions:
findfirst
findlast
findall
eachmatch
match
occursin
Performance
The code search performance bottleneck is parsing. The search itself is about 20x faster than parsing and similar in performance to an optimized regex library. Consequently, if you want high performance repeated code search, you should cache parsed SyntaxNodes and pass them directly to search functions.
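A minimal sketch of that caching approach follows. It assumes JuliaSyntax is installed alongside CodeSearch and parses with the same parseall call used in the benchmark setup; the input string and the pair of patterns are illustrative.

```julia
using CodeSearch, JuliaSyntax

# Illustrative source text; in practice this might come from read(path, String).
str = "x !== nothing && foo(y !== nothing)"

# Parse once. This is the expensive step.
node = JuliaSyntax.parseall(JuliaSyntax.SyntaxNode, str, ignore_errors=true)

# Reuse the parsed tree for every search instead of passing `str` again.
patterns = [j"* !== nothing", j"foo(*)"]
counts = [length(collect(eachmatch(p, node))) for p in patterns]
```

Passing str to each search would instead re-parse the text once per pattern, which is where most of the time goes according to the measurements in this section.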
Benchmarks
Using the 395 lines of source code of this package as of 6820e64232 as a test case, on a 2022 M2 mac running Asahi Linux, we can see the following performance:
Operation | Time | Time per line | Benchmark |
---|---|---|---|
Searching a string | 541.0 μs | 1.37 μs | @b collect(eachmatch(j"* !== nothing", str)) seconds=1 |
Parsing a string | 516.8 μs | 1.31 μs | @b parseall(SyntaxNode, str, ignore_errors=true) seconds=1 |
Searching a SyntaxNode | 20.9 μs | 53.0 ns | @b collect(eachmatch(j"* !== nothing", node)) seconds=1 |
Regex search | 22.7 μs | 57.5 ns | @b collect(eachmatch(r".* !== nothing", str)) seconds=1 |
Setup for benchmarks
shell> git clone https://github.com/LilithHafner/CodeSearch.jl CodeSearch
[...]
shell> cd CodeSearch
shell> git checkout 6820e642320f803407bcbc07e691277dc4d91ae4
[...]
julia> using CodeSearch, JuliaSyntax, Chairmarks
julia> str = read("src/CodeSearch.jl", String);
julia> node = parseall(SyntaxNode, str, ignore_errors=true);
Credits
Lilith Hafner is the original author of this package. CodeSearch.jl would not exist without Claire Foster's JuliaSyntax, which does all the parsing and provides appropriate data structures to represent parsed code.