CodeSearch.jl
CodeSearch.jl is a package for semantically searching Julia code. Unlike plain string search and regex search, CodeSearch performs search operations after parsing. Thus the search patterns j"a + b" and j"a+b" are equivalent, and both match the code a +b.
julia> using CodeSearch
julia> j"a + b" == j"a+b"
true
julia> findfirst(j"a+b", "sqrt(a +b)/(a+ b)")
6:9
The other key feature in this package is wildcard matching. You can use the character * to match any expression. For example, the pattern j"a + *" matches both a + b and a + (b + c).
julia> Expr.(eachmatch(j"a + *", "a + (a + b), a + sqrt(2)"))
3-element Vector{Expr}:
:(a + (a + b))
:(a + b)
:(a + sqrt(2))
Here we can see that j"a + *" matches multiple places, even some that nest within each other!
Finally, it is possible to extract the "captured values" that match the wildcards.
julia> m = match(j"a + *", "a + (a + b), a + sqrt(2)")
CodeSearch.Match((call-i a + (call-i a + b)), captures=[(call-i a + b)])
julia> m.captures
1-element Vector{JuliaSyntax.SyntaxNode}:
(call-i a + b)
julia> Expr(only(m.captures))
:(a + b)
How to use this package
- Create Patterns with the @j_str macro or the CodeSearch.pattern function.
- Search an AbstractString or a JuliaSyntax.SyntaxNode for whether and where that pattern occurs with generic functions like occursin, findfirst, findlast, or findall OR extract the actual Matches with generic functions like eachmatch and match.
- If you extracted an actual match, access relevant information using the public syntax_node and captures fields, convert to a SyntaxNode, Expr, or AbstractString via constructors, index into the captures directly with getindex, or extract the indices in the original string that match the capture with indices.
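The three steps above can be sketched end to end as follows. This is a minimal illustration that reuses the pattern and input string from the introduction; the variable names (code, pat, found, and so on) are chosen here for illustration, and only calls shown elsewhere in this document are used.

```julia
using CodeSearch

code = "a + (a + b), a + sqrt(2)"   # illustrative input, same string as in the introduction

# Step 1: create a Pattern (the * is a wildcard).
pat = j"a + *"

# Step 2: ask whether the pattern occurs, then extract the first Match.
found = occursin(pat, code)
m = match(pat, code)

# Step 3: inspect the match.
whole = Expr(m)                 # the matched expression, :(a + (a + b))
first_capture = Expr(m[1])      # what the wildcard matched, :(a + b)
span = CodeSearch.indices(m)    # indices of the match within `code`

println(found)
println(whole)
println(first_capture)
println(span)
```

If only a yes/no answer or a position is needed, step 2 can stop at occursin or findfirst and step 3 can be skipped.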
Reference
The following are manually selected docstrings:
CodeSearch.@j_str — Macro
j"str" -> Pattern
Construct a Pattern, such as j"a + (b + *)", that matches Julia code.
The * character is a wildcard that matches any expression, and matching ignores whitespace and comments. Only the characters " and * must be escaped, and interpolation is not supported.
See pattern for the function version of this macro if you need interpolation.
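Because j"..." strings do not interpolate, a pattern built from a runtime value has to go through pattern. A small sketch, assuming a hypothetical list of function names to look for (function_names, code, and results are illustrative names, not part of the package):

```julia
using CodeSearch
using CodeSearch: pattern

function_names = ["sqrt", "print"]   # hypothetical names to search for
code = "print(sqrt(2))"              # illustrative input

# Ordinary string interpolation builds the pattern text; the * stays a wildcard.
results = Dict(name => occursin(pattern("$name(*)"), code) for name in function_names)

for name in function_names
    println(name, " called: ", results[name])
end
```

Note that inside an ordinary string literal the escape for a literal * is written \\*, as in the pattern examples in this reference.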
Examples
julia> j"a + (b + *)"
j"a + (b + *)"
julia> match(j"(b + *)", "(b + 6)")
CodeSearch.Match((call-i b + 6), captures=[6])
julia> findall(j"* + *", "(a+b)+(d+e)")
3-element Vector{UnitRange{Int64}}:
1:11
2:4
8:10
julia> match(j"(* + *) \* *", "(a-b)*(d+e)") # no match -> returns nothing
julia> occursin(j"(* + *) \* *", "(a-b)*(d+e)")
false
julia> eachmatch(j"*(\"hello world\")", "print(\"hello world\"), display(\"hello world\")")
2-element Vector{CodeSearch.Match}:
Match((call print (string "hello world")), captures=[print])
Match((call display (string "hello world")), captures=[display])
julia> count(j"*(*)", "a(b(c))")
2
julia> match(j"(* + *) \* *", "(a+b)*(d+e)")
CodeSearch.Match((call-i (call-i a + b) * (call-i d + e)), captures=[a, b, (call-i d + e)])
CodeSearch.pattern — Function
pattern(str::AbstractString) -> Pattern
Function version of the j"str" macro. See @j_str for documentation.
Examples
julia> using CodeSearch: pattern
julia> pattern("a + (b + *)")
j"a + (b + *)"
julia> match(pattern("(b + *)"), "(b + 6)")
CodeSearch.Match((call-i b + 6), captures=[6])
julia> findall(pattern("* + *"), "(a+b)+(d+e)")
3-element Vector{UnitRange{Int64}}:
1:11
2:4
8:10
julia> match(pattern("(* + *) \\* *"), "(a-b)*(d+e)") # no match -> returns nothing
julia> occursin(pattern("(* + *) \\* *"), "(a-b)*(d+e)")
false
julia> eachmatch(pattern("*(\"hello world\")"), "print(\"hello world\"), display(\"hello world\")")
2-element Vector{CodeSearch.Match}:
Match((call print (string "hello world")), captures=[print])
Match((call display (string "hello world")), captures=[display])
julia> count(pattern("*(*)"), "a(b(c))")
2
julia> match(pattern("(* + *) \\* *"), "(a+b)*(d+e)")
CodeSearch.Match((call-i (call-i a + b) * (call-i d + e)), captures=[a, b, (call-i d + e)])
CodeSearch.Pattern — Type
Pattern <: AbstractPattern
A struct that represents a Julia expression with wildcards. When matching Patterns, it is possible for multiple matches to nest within one another.
The fields and constructor of this struct are not part of the public API. See @j_str and pattern for the public API for creating Patterns.
Methods accepting Pattern objects are defined for eachmatch, match, findall, findfirst, findlast, occursin, and count.
Extended Help
The following are implementation details:
The expression is stored as an ordinary SyntaxNode in the internal syntax_node field. Wildcards in that expression are represented by the symbol stored in the internal wildcard_symbol field. For example, the expression a + (b + *) might be stored as Pattern((call-i a + (call-i b + wildcard)), :wildcard).
CodeSearch.Match — Type
struct Match <: AbstractMatch
syntax_node::JuliaSyntax.SyntaxNode
captures::Vector{JuliaSyntax.SyntaxNode}
end
Represents a single match to a Pattern, typically created from the eachmatch or match function.
The syntax_node field stores the SyntaxNode that matched the Pattern, and the captures field stores the SyntaxNodes that match each wildcard in the Pattern, indexed in the order they appear.
Methods that accept Match objects are defined for Expr, SyntaxNode, AbstractString, indices, and getindex.
Examples
julia> m = match(j"√*", "2 + √ x")
CodeSearch.Match((call-pre √ x), captures=[x])
julia> m.captures
1-element Vector{JuliaSyntax.SyntaxNode}:
x
julia> m[1]
line:col│ tree │ file_name
1:7 │x
julia> Expr(m)
:(√x)
julia> AbstractString(m)
" √ x"
julia> CodeSearch.indices(m)
4:9
CodeSearch.indices — Function
indices(m)
Return the indices into a source data structure that a view is derived from.
Examples
julia> m = match(j"x/*", "4 + x/2")
CodeSearch.Match((call-i x / 2), captures=[2])
julia> indices(m)
4:7
julia> c = m[1]
line:col│ tree │ file_name
1:7 │2
julia> indices(c)
7:7
Generic functions
Many functions that accept Regexs also accept CodeSearch.Patterns and behave according to their generic docstrings. Here are some of those supported functions:
- findfirst
- findlast
- findall
- eachmatch
- match
- occursin
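As a rough side-by-side, the sketch below runs the same generic calls with a Regex and with a Pattern. The input string is the one from the introduction; the regex is a hand-written approximation that tolerates optional whitespace, included only for comparison.

```julia
using CodeSearch

code = "sqrt(a +b)/(a+ b)"

re  = r"a\s*\+\s*b"   # hand-written regex approximation (illustrative)
pat = j"a+b"          # CodeSearch pattern; whitespace is ignored after parsing

# The same generic functions accept either kind of pattern.
println(occursin(re, code), " ", occursin(pat, code))
println(findfirst(re, code), " ", findfirst(pat, code))
println(length(findall(re, code)), " ", length(findall(pat, code)))
```

The regex has to spell out where whitespace may appear, while the Pattern gets that behavior from parsing.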
Performance
The code search performance bottleneck is parsing. The search itself is about 20x faster than parsing and similar in performance to an optimized regex library. Consequently, if you want high-performance repeated code search, you should cache parsed SyntaxNodes and pass them directly to search functions.
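One way to apply this advice is sketched below: parse the source once, then run several patterns against the resulting SyntaxNode. The source snippet and the list of patterns are made up for illustration; the parseall call mirrors the one in the benchmark setup, written here with an explicit JuliaSyntax prefix.

```julia
using CodeSearch
import JuliaSyntax

# Illustrative source text; in practice this might come from read(path, String).
src = """
function f(x)
    x !== nothing && return x + 1
    return 0
end
"""

# Parse once and keep the result around.
node = JuliaSyntax.parseall(JuliaSyntax.SyntaxNode, src, ignore_errors=true)

# Reuse the parsed node for every search instead of re-parsing the string.
patterns = [j"* !== nothing", j"* + 1"]
counts = [length(collect(eachmatch(pat, node))) for pat in patterns]

for (pat, n) in zip(patterns, counts)
    println(pat, " => ", n, " match(es)")
end
```

Passing the string src to eachmatch instead would presumably repeat the parse on every call, which is the cost the advice above is meant to avoid.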
Benchmarks
Using the 395 lines of source code of this package as of 6820e64232 as a test case, on a 2022 M2 Mac running Asahi Linux, we can see the following performance:
| Operation | Time | Time per line | Benchmark |
|---|---|---|---|
| Searching a string | 541.0 μs | 1.37 μs | @b collect(eachmatch(j"* !== nothing", str)) seconds=1 |
| Parsing a string | 516.8 μs | 1.31 μs | @b parseall(SyntaxNode, str, ignore_errors=true) seconds=1 |
| Searching a SyntaxNode | 20.9 μs | 53.0 ns | @b collect(eachmatch(j"* !== nothing", node)) seconds=1 |
| Regex search | 22.7 μs | 57.5 ns | @b collect(eachmatch(r".* !== nothing", str)) seconds=1 |
Setup for benchmarks
shell> git clone https://github.com/LilithHafner/CodeSearch.jl CodeSearch
[...]
shell> cd CodeSearch
shell> git checkout 6820e642320f803407bcbc07e691277dc4d91ae4
[...]
julia> using CodeSearch, JuliaSyntax, Chairmarks
julia> str = read("src/CodeSearch.jl", String);
julia> node = parseall(SyntaxNode, str, ignore_errors=true);
Credits
Lilith Hafner is the original author of this package. CodeSearch.jl would not exist without Claire Foster's JuliaSyntax which does all the parsing and provides appropriate data structures to represent parsed code.