Structural search

Structural search lets you match richer syntax patterns specifically in code and structured data formats like JSON. It can be awkward or difficult to match code blocks or nested expressions with regular expressions. To meet this challenge we’ve introduced a new and easier way to search code that operates more closely on a program’s parse tree. We use Comby syntax for structural matching. Below you’ll find examples and notes for this language-aware search functionality.

Example

The fmt.Sprintf function is a popular print function in Go. Here is a pattern that matches all the arguments in fmt.Sprintf calls in our code:

fmt.Sprintf(...)

See it live on Sourcegraph’s code ↗

The ... part is special syntax that matches all characters inside the balanced parentheses (...). Let’s look at two interesting variants of matches in our codebase. Here’s one:

fmt.Sprintf("must be authenticated as an admin (%s)", isSiteAdminErr.Error())

Note that to match this code we didn’t need any special handling for the parentheses (%s) inside the first string argument, or for the nested parentheses that form part of Error(). Unlike a regular expression, the pattern cannot “overmatch”: a match always respects balanced parentheses. With regular expressions, correctly finding the closing parenthesis of this call would, in general, complicate the pattern considerably.

Here is a second match:

fmt.Sprintf(
		"rest/api/1.0/projects/%s/repos/%s/pull-requests/%d",
		pr.ToRef.Repository.Project.Key,
		pr.ToRef.Repository.Slug,
		pr.ID,
	)

Here we didn’t need any special handling for contents that spread over multiple lines. By default, the ... syntax matches across newlines. Structural search supports balanced syntax like (), [], and {} in a language-aware way. This lets you match large, logical blocks or expressions without the limitations of typical line-based regular expression patterns.

Syntax reference

The syntax ... above is an alias for the canonical syntax :[hole], where hole is a descriptive identifier for the matched content. Identifiers are useful for expressing that matched content should be equal (see the return :[v.], :[v.] example below). Additional syntax is listed below.

Syntax | Alias | Description
... | :[hole] or :[_] | match zero or more characters in a lazy fashion. When :[hole] is inside delimiters, as in {:[h1], :[h2]} or (:[h]), holes match within that group or code block, including newlines.
:[~regexp] | :[hole~regexp] | match an arbitrary regular expression regexp. A descriptive identifier like hole is optional. Avoid regular expressions that match special syntax like ) or .*, otherwise your pattern may fail to match balanced blocks.
:[[_]] or :[[hole]] | :[~\w+] or :[hole~\w+] | match one or more alphanumeric characters and underscore.
:[hole\n] | :[~.*\n] or :[hole~.*\n] | match zero or more characters up to a newline, including the newline.
:[ ] or :[ hole] | :[~[ \t]+] or :[hole~[ \t]+] | match only whitespace characters, excluding newlines.
:[hole.] | :[_.] | match one or more alphanumeric characters and punctuation like ., ;, and - that do not affect balanced syntax. Language dependent.

Note: to match the string ... literally, use regular expression patterns like :[~[.]{3}] or :[~\.\.\.].
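As a rough illustration of the regular expressions in the Alias column of the table above, the following Go sketch runs them on their own with the standard regexp package. The variable names and sample inputs are made up for this example. Inside a real pattern a hole is also constrained by the syntax around it, so this only shows what each expression accepts in isolation.

```go
package main

import (
	"fmt"
	"regexp"
)

// Regular expressions taken from the Alias column of the table above and
// from the note on matching a literal "...". Variable names are illustrative.
var (
	identifier = regexp.MustCompile(`\w+`)    // :[[hole]]
	restOfLine = regexp.MustCompile(`.*\n`)   // :[hole\n]
	blanks     = regexp.MustCompile(`[ \t]+`) // :[ hole]
	threeDots  = regexp.MustCompile(`[.]{3}`) // a literal "..."
)

func main() {
	// %q prints each match quoted, so whitespace and newlines stay visible.
	fmt.Printf("%q\n", identifier.FindString("site_admin2 := true"))
	fmt.Printf("%q\n", restOfLine.FindString("first line\nsecond line\n"))
	fmt.Printf("%q\n", blanks.FindString("a \t\n b"))
	fmt.Printf("%q\n", threeDots.FindString("args = [...]"))
}
```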

Rules. Comby supports rules to express equality constraints or pattern-based matching. Comby rules are not officially supported in Sourcegraph yet. We are in the process of making that happen and are taking care to address stable performance and usability. That said, you can explore rule functionality with an experimental rule: parameter. For example:

buildSearchURLQuery(:[first], ...) rule:'where match :[first] { | " query: string" -> true }'

More examples

Below you’ll find more examples. Also see our blog post for additional examples.

Match stringy data

Taking the original fmt.Sprintf(...) example, let’s modify the pattern slightly to match only if the first argument is a string. We do this by adding string quotes around .... Adding quotes communicates structural context and changes how the hole behaves: it will match the contents of a single string delimited by ". It won’t match multiple strings like "foo", "bar".

fmt.Sprintf("...", ...)

See it live on Sourcegraph’s code ↗

Some matched examples are:

fmt.Sprintf("external service not found: %v", e.id)
fmt.Sprintf("%s/campaigns/%s", externalURL, string(campaignID))

A hole stops matching at the first fragment of syntax that comes after it, similar to lazy regular expression matching. So, we could write:

fmt.Sprintf(:[first], :[second], ...)

to match all fmt.Sprintf calls with three or more arguments, binding the first and second arguments based on their contextual position around the commas.
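The positional behavior of :[first] and :[second] can be approximated by splitting an argument list on commas that are not nested inside delimiters or strings. The Go sketch below does this under simplifying assumptions: splitTopLevel is a hypothetical helper, not Comby’s algorithm, and it only recognizes (), [], {}, and double-quoted strings.

```go
package main

import (
	"fmt"
	"strings"
)

// splitTopLevel is a hypothetical helper. It splits an argument list on
// commas that sit at nesting depth zero and outside double-quoted strings.
func splitTopLevel(args string) []string {
	var parts []string
	depth, inString, last := 0, false, 0
	for i := 0; i < len(args); i++ {
		switch c := args[i]; {
		case inString && c == '\\':
			i++ // skip the escaped character inside a string
		case c == '"':
			inString = !inString
		case inString:
			// everything else inside a string literal is ignored
		case c == '(' || c == '[' || c == '{':
			depth++
		case c == ')' || c == ']' || c == '}':
			depth--
		case c == ',' && depth == 0:
			parts = append(parts, strings.TrimSpace(args[last:i]))
			last = i + 1
		}
	}
	return append(parts, strings.TrimSpace(args[last:]))
}

func main() {
	// Argument list taken from one of the matched examples above.
	args := `"%s/campaigns/%s", externalURL, string(campaignID)`
	parts := splitTopLevel(args)
	if len(parts) >= 3 {
		fmt.Println("first: ", parts[0])
		fmt.Println("second:", parts[1])
		fmt.Println("rest:  ", strings.Join(parts[2:], ", "))
	}
}
```

Commas inside a nested call or a string literal do not end an argument, which mirrors how a hole stays within balanced syntax.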

Match equivalent expressions

Using the same identifier in multiple holes adds a constraint that both of the matched values must be syntactically equal. So, the pattern:

return :[v.], :[v.]

will match code where the two identifier-like values in a return statement are the same. For example, it matches “return true, true”, “return nil, nil”, and “return 0, 0”.

See it live on Sourcegraph’s code ↗
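For comparison, here is a hedged Go sketch of the same equality constraint written by hand. Go’s regexp package does not support backreferences, so the two captured values are compared in code. The name sameReturnPair is hypothetical, and the character class [\w.]+ is only an approximation of :[v.], which accepts more punctuation and is language dependent.

```go
package main

import (
	"fmt"
	"regexp"
)

// pair approximates :[v.] with [\w.]+, an assumption made for this sketch.
var pair = regexp.MustCompile(`\breturn ([\w.]+), ([\w.]+)`)

// sameReturnPair reports whether line contains a two-value return whose
// values are textually identical. The equality check happens in Go because
// the regexp package has no backreferences.
func sameReturnPair(line string) bool {
	m := pair.FindStringSubmatch(line)
	return m != nil && m[1] == m[2]
}

func main() {
	for _, line := range []string{
		"return nil, nil",
		"return 0, 0",
		"return nil, err",
		"return n, nil",
	} {
		fmt.Printf("%-18s %v\n", line, sameReturnPair(line))
	}
}
```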

Match JSON

Structural search also works on structured data, like JSON. Use patterns to declaratively describe pieces of data to match. For example, the pattern:

"exclude": [...]

matches all parts of a JSON document that have a member "exclude" where the value is an array of items.

See it live on Sourcegraph’s code ↗

Current functionality and configuration

Structural search behaves differently to plain text search in key ways. We are continually improving functionality of this new feature, so please note the following:

  • Only indexed repos. Structural search can currently only be performed on indexed repositories. See configuration for more details if you host your own Sourcegraph installation. Our service hosted at sourcegraph.com indexes approximately 200,000 of the most popular repositories on GitHub. Other repositories are currently unsupported. To see whether a repository on your instance is indexed, visit https://<sourcegraph-host>.com/repo-org/repo-name/-/settings/index.

  • The lang keyword is semantically significant. Adding the lang keyword informs the parser about language-specific syntax for comments, strings, and code. This makes structural search more accurate for that language. For example, fmt.Sprintf(...) lang:go. If lang is omitted, we make a best-effort attempt to infer the language from matching file extensions, or fall back to a generic structural matcher.

  • Saved searches are not supported. It is not currently possible to save structural searches.

  • Matching blocks in indentation-sensitive languages. It’s not currently possible to match blocks of code that are indentation-sensitive. This is a feature planned for future work.