CodeSearch.jl

CodeSearch.jl is a package for semantically searching Julia code. Unlike plain string search and regex search, CodeSearch performs search operations after parsing. Thus the search patterns j"a + b" and j"a+b" are equivalent, and both match the code a +b.

julia> using CodeSearch

julia> j"a + b" == j"a+b"
true

julia> findfirst(j"a+b", "sqrt(a +b)/(a+ b)")
6:9

The other key feature of this package is wildcard matching. You can use the character * to match any expression. For example, the pattern j"a + *" matches both a + b and a + (b + c).

julia> Expr.(eachmatch(j"a + *", "a + (a + b), a + sqrt(2)"))
3-element Vector{Expr}:
 :(a + (a + b))
 :(a + b)
 :(a + sqrt(2))

Here we can see that j"a + *" matches in multiple places, including some that nest within each other.

Finally, it is possible to extract the "captured values" that match the wildcards.

julia> m = match(j"a + *", "a + (a + b), a + sqrt(2)")
CodeSearch.Match((call-i a + (call-i a + b)), captures=[(call-i a + b)])

julia> m.captures
1-element Vector{JuliaSyntax.SyntaxNode}:
 (call-i a + b)

julia> Expr(only(m.captures))
:(a + b)

How to use this package

  1. Create Patterns with the @j_str macro or the CodeSearch.pattern function.
  2. Search an AbstractString or a JuliaSyntax.SyntaxNode. To check whether and where the pattern occurs, use generic functions like occursin, findfirst, findlast, or findall. To extract the actual Matches, use generic functions like eachmatch and match.
  3. If you extracted a Match, you can access its public syntax_node and captures fields, convert it to a SyntaxNode, Expr, or AbstractString via the constructors, index into the captures directly with getindex, or use indices to get the range in the original string covered by the match or by a capture.
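The three steps above can be sketched end to end as follows. This is a minimal illustration rather than canonical usage: the input string `code` and the variable names are made up for the example, and only functions mentioned in this document are called.

```julia
using CodeSearch

# Hypothetical input; any string of Julia code works here.
code = "sqrt(a + b) / (a + c)"

# Step 1: create a Pattern. The * is a wildcard that matches any expression.
pat = j"a + *"

# Step 2a: ask whether and where the pattern occurs.
has_match   = occursin(pat, code)
first_range = findfirst(pat, code)
all_ranges  = findall(pat, code)

# Step 2b: or extract the Match objects themselves.
matches = collect(eachmatch(pat, code))

# Step 3: inspect each match via the conversions and accessors listed above.
for m in matches
    println(Expr(m), " at ", CodeSearch.indices(m),
            " with ", length(m.captures), " capture(s)")
end
```

Steps 2a and 2b are alternatives: the find* functions are enough when only positions are needed, while eachmatch and match are the way to get at captures.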

Reference

The following are manually selected docstrings.

CodeSearch.@j_str (Macro)
j"str" -> Pattern

Construct a Pattern, such as j"a + (b + *)" that matches Julia code.

The * character is a wildcard that matches any expression, and matching ignores whitespace and comments. Only the characters " and * must be escaped, and interpolation is not supported.

See pattern for the function version of this macro if you need interpolation.

Examples

julia> j"a + (b + *)"
j"a + (b + *)"

julia> match(j"(b + *)", "(b + 6)")
CodeSearch.Match((call-i b + 6), captures=[6])

julia> findall(j"* + *", "(a+b)+(d+e)")
3-element Vector{UnitRange{Int64}}:
 1:11
 2:4
 8:10

julia> match(j"(* + *) \* *", "(a-b)*(d+e)") # no match -> returns nothing

julia> occursin(j"(* + *) \* *", "(a-b)*(d+e)")
false

julia> eachmatch(j"*(\"hello world\")", "print(\"hello world\"), display(\"hello world\")")
2-element Vector{CodeSearch.Match}:
 Match((call print (string "hello world")), captures=[print])
 Match((call display (string "hello world")), captures=[display])

julia> count(j"*(*)", "a(b(c))")
2

julia> match(j"(* + *) \* *", "(a+b)*(d+e)")
CodeSearch.Match((call-i (call-i a + b) * (call-i d + e)), captures=[a, b, (call-i d + e)])

CodeSearch.pattern (Function)
pattern(str::AbstractString) -> Pattern

Function version of the j"str" macro. See @j_str for documentation.
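Because pattern takes an ordinary string, ordinary string interpolation can be used to build patterns at run time. The sketch below assumes a hypothetical helper, call_pattern, which is not part of CodeSearch and exists only for illustration. Note that in a regular string literal a literal * in the pattern must be written as \\*.

```julia
using CodeSearch: pattern

# Hypothetical helper (not part of CodeSearch): build a pattern matching
# one-argument calls to the function named `fname`.
call_pattern(fname::AbstractString) = pattern("$(fname)(*)")

pat = call_pattern("sqrt")

# The wildcard captures the argument of the call.
m = match(pat, "a + sqrt(2)")

# A call to a different function is not matched.
other = occursin(pat, "a + cbrt(2)")
```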

Examples

julia> using CodeSearch: pattern

julia> pattern("a + (b + *)")
j"a + (b + *)"

julia> match(pattern("(b + *)"), "(b + 6)")
CodeSearch.Match((call-i b + 6), captures=[6])

julia> findall(pattern("* + *"), "(a+b)+(d+e)")
3-element Vector{UnitRange{Int64}}:
 1:11
 2:4
 8:10

julia> match(pattern("(* + *) \\* *"), "(a-b)*(d+e)") # no match -> returns nothing

julia> occursin(pattern("(* + *) \\* *"), "(a-b)*(d+e)")
false

julia> eachmatch(pattern("*(\"hello world\")"), "print(\"hello world\"), display(\"hello world\")")
2-element Vector{CodeSearch.Match}:
 Match((call print (string "hello world")), captures=[print])
 Match((call display (string "hello world")), captures=[display])

julia> count(pattern("*(*)"), "a(b(c))")
2

julia> match(pattern("(* + *) \\* *"), "(a+b)*(d+e)")
CodeSearch.Match((call-i (call-i a + b) * (call-i d + e)), captures=[a, b, (call-i d + e)])

CodeSearch.Pattern (Type)
Pattern <: AbstractPattern

A struct that represents a Julia expression with wildcards. When matching Patterns, it is possible for multiple matches to nest within one another.

The fields and constructor of this struct are not part of the public API. See @j_str and pattern for the public API for creating Patterns.

Methods accepting Pattern objects are defined for eachmatch, match, findall, findfirst, findlast, occursin, and count.

Extended Help

The following are implementation details:

The expression is stored as an ordinary SyntaxNode in the internal syntax_node field. Wildcards in that expression are represented by the symbol stored in the internal wildcard_symbol field. For example, the expression a + (b + *) might be stored as Pattern((call-i a + (call-i b + wildcard)), :wildcard).

CodeSearch.Match (Type)
struct Match <: AbstractMatch
    syntax_node::JuliaSyntax.SyntaxNode
    captures::Vector{JuliaSyntax.SyntaxNode}
end

Represents a single match to a Pattern, typically created from the eachmatch or match function.

The syntax_node field stores the SyntaxNode that matched the Pattern, and the captures field stores the SyntaxNodes that match each wildcard in the Pattern, indexed in the order they appear.

Methods that accept Match objects are defined for Expr, SyntaxNode, AbstractString, indices, and getindex.

Examples

julia> m = match(j"√*", "2 + √ x")
CodeSearch.Match((call-pre √ x), captures=[x])

julia> m.captures
1-element Vector{JuliaSyntax.SyntaxNode}:
 x

julia> m[1]
line:col│ tree        │ file_name
   1:7  │x

julia> Expr(m)
:(√x)

julia> AbstractString(m)
" √ x"

julia> CodeSearch.indices(m)
4:9

CodeSearch.indices (Function)
indices(m)

Return the indices into the source data structure that a view is derived from.

Examples

julia> m = match(j"x/*", "4 + x/2")
CodeSearch.Match((call-i x / 2), captures=[2])

julia> indices(m)
4:7

julia> c = m[1]
line:col│ tree        │ file_name
   1:7  │2


julia> indices(c)
7:7

Generic functions

Many functions that accept Regexes also accept CodeSearch.Patterns and behave according to their generic docstrings. Here are some of the supported functions:

  • findfirst
  • findlast
  • findall
  • eachmatch
  • match
  • occursin
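As a small sketch of how these generic functions compose with the rest of Base, a Pattern can be used wherever a predicate over code strings is needed. The snippet list below is invented for illustration.

```julia
using CodeSearch

# Hypothetical collection of code snippets to filter.
snippets = ["x = a + b", "y = sqrt(a)", "z = (a + c) * 2"]

pat = j"a + *"

# occursin(pattern, code) works as a predicate, just as it does with a Regex.
hits = filter(s -> occursin(pat, s), snippets)

# count includes nested matches: the outer sum and the inner a + b both match.
n = count(pat, "a + (a + b)")
```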

Performance

The performance bottleneck in code search is parsing. The search itself is about 20x faster than parsing and similar in performance to an optimized regex library. Consequently, if you want high-performance repeated code search, you should cache parsed SyntaxNodes and pass them directly to the search functions.
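A minimal sketch of that caching pattern: parse once, then run several searches against the same SyntaxNode. The source text and the pattern list are made up for the example; the parseall call mirrors the one used in the benchmark setup, written here with qualified names so it does not depend on what JuliaSyntax exports.

```julia
using CodeSearch
import JuliaSyntax

# Hypothetical source text; in practice this would be read from a file.
str = """
function f(x)
    x !== nothing || return 0
    y = g(x)
    y !== nothing ? y : 0
end
"""

# Parse once. This is the expensive step.
node = JuliaSyntax.parseall(JuliaSyntax.SyntaxNode, str, ignore_errors=true)

# Reuse the parsed tree for every search instead of re-parsing the string.
patterns = [j"* !== nothing", j"g(*)"]
counts = [length(collect(eachmatch(p, node))) for p in patterns]
```

Passing `str` instead of `node` to eachmatch should give the same matches, but it pays the parsing cost again on every call.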

Benchmarks

Using the 395 lines of source code of this package as of 6820e64232 as a test case, on a 2022 M2 Mac running Asahi Linux, we observe the following performance:

Operation               Time       Time per line   Benchmark
Searching a string      541.0 μs   1.37 μs         @b collect(eachmatch(j"* !== nothing", str)) seconds=1
Parsing a string        516.8 μs   1.31 μs         @b parseall(SyntaxNode, str, ignore_errors=true) seconds=1
Searching a SyntaxNode  20.9 μs    53.0 ns         @b collect(eachmatch(j"* !== nothing", node)) seconds=1
Regex search            22.7 μs    57.5 ns         @b collect(eachmatch(r".* !== nothing", str)) seconds=1

Setup for benchmarks

shell> git clone https://github.com/LilithHafner/CodeSearch.jl CodeSearch
[...]

shell> cd CodeSearch

shell> git checkout 6820e642320f803407bcbc07e691277dc4d91ae4
[...]

julia> using CodeSearch, JuliaSyntax, Chairmarks

julia> str = read("src/CodeSearch.jl", String);

julia> node = parseall(SyntaxNode, str, ignore_errors=true);

Credits

Lilith Hafner is the original author of this package. CodeSearch.jl would not exist without Claire Foster's JuliaSyntax, which does all the parsing and provides appropriate data structures to represent parsed code.