Structural search

Structural search lets you match richer syntax patterns specifically in code and structured data formats like JSON. It can be awkward or difficult to match code blocks or nested expressions with regular expressions. To meet this challenge we've introduced a new and easier way to search code that operates more closely on a program's parse tree. We use Comby syntax for structural matching. Below you'll find examples and notes for this language-aware search functionality.

Example

The fmt.Sprintf function is a popular print function in Go. Here is a pattern that matches all the arguments in fmt.Sprintf calls in our code:

fmt.Sprintf(...)

See it live on Sourcegraph's code ↗

The ... part is special syntax that matches all characters inside the balanced parentheses (...). Let's look at two interesting variants of matches in our codebase. Here's one:

fmt.Sprintf("must be authenticated as an admin (%s)", isSiteAdminErr.Error())

Note that to match this code we didn't have to do any special thinking about handling the parentheses (%s) that appear inside the first string argument, or the nested parentheses that form part of Error(). Unlike regular expressions, no "overmatching" can happen and the match will always respect balanced parentheses. With regular expressions, taking care to match the closing parenthesis for this call could, in general, really complicate matters.

Here is a second match:

fmt.Sprintf(
		"rest/api/1.0/projects/%s/repos/%s/pull-requests/%d",
		pr.ToRef.Repository.Project.Key,
		pr.ToRef.Repository.Slug,
		pr.ID,
	)

Here we didn't have to do any special thinking about matching contents that spread over multiple lines. The ... syntax by default matches across newlines. Structural search supports various balanced syntax like (), [], and {} in a language-aware way. This allows it to match large, logical blocks or expressions without the limitations of typical line-based regular expression patterns.
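To make the contrast with regular expressions concrete, here is a minimal Go sketch. It is not how structural search is implemented: balancedArgs is a hypothetical helper that only counts parentheses and, unlike a language-aware matcher, does not skip parentheses inside string literals or comments. The trailing "+ suffix()" in the sample line is also made up, added to give the greedy regular expression something to overmatch.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// balancedArgs returns the text between the parentheses of the first call to
// name in src, found by counting nesting depth. Illustration only: it does
// not skip parentheses inside string literals or comments.
func balancedArgs(src, name string) (string, bool) {
	start := strings.Index(src, name+"(")
	if start < 0 {
		return "", false
	}
	open := start + len(name) // index of the opening parenthesis
	depth := 0
	for i := open; i < len(src); i++ {
		switch src[i] {
		case '(':
			depth++
		case ')':
			depth--
			if depth == 0 {
				return src[open+1 : i], true
			}
		}
	}
	return "", false // unbalanced input
}

func main() {
	src := `msg := fmt.Sprintf("must be authenticated as an admin (%s)", isSiteAdminErr.Error()) + suffix()`

	lazy := regexp.MustCompile(`fmt\.Sprintf\((.*?)\)`)
	greedy := regexp.MustCompile(`fmt\.Sprintf\((.*)\)`)

	fmt.Println("lazy regexp:  ", lazy.FindStringSubmatch(src)[1])
	fmt.Println("greedy regexp:", greedy.FindStringSubmatch(src)[1])

	args, _ := balancedArgs(src, "fmt.Sprintf")
	fmt.Println("balanced scan:", args)
}
```

The lazy expression stops at the first closing parenthesis, which is the one inside the string, and the greedy one runs on to the closing parenthesis of suffix(). Only the balanced scan returns exactly the two arguments. The same scan also handles the multi-line call above unchanged, whereas . in a Go regular expression does not match a newline unless the s flag is set.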

Syntax reference

The syntax ... above is an alias for a canonical syntax :[hole], where hole is a descriptive identifier for the matched content. Identifiers are useful when expressing that matched content should be equal (see the return :[v.], :[v.] example below). See additional syntax below.

| Syntax | Alias | Description |
| --- | --- | --- |
| ... | :[hole], :[_] | match zero or more characters in a lazy fashion. When :[hole] is inside delimiters, as in {:[h1], :[h2]} or (:[h]), holes match within that group or code block, including newlines. |
| :[~regexp] | :[hole~regexp] | match an arbitrary regular expression regexp. A descriptive identifier like hole is optional. Avoid regular expressions that match special syntax like ) or .*, otherwise your pattern may fail to match balanced blocks. |
| :[[_]], :[[hole]] | :[~\w+], :[hole~\w+] | match one or more alphanumeric characters and underscore. |
| :[hole\n] | :[~.*\n], :[hole~.*\n] | match zero or more characters up to a newline, including the newline. |
| :[ ], :[ hole] | :[~[ \t]+], :[hole~[ \t]+] | match only whitespace characters, excluding newlines. |
| :[hole.] | [_.] | match one or more alphanumeric characters and punctuation like ., ;, and - that do not affect balanced syntax. Language dependent. |

Note: to match the string ... literally, use regular expression patterns like :[~[.]{3}] or :[~\.\.\.].
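The regular expressions behind these aliases can be tried in isolation. The sketch below uses Go's regexp package as a stand-in, on the assumption that expressions this simple behave the same way in the engine structural search uses. firstMatch and the sample input are made up for illustration, and the code shows only what each expression matches on its own, not how holes interact with balanced delimiters.

```go
package main

import (
	"fmt"
	"regexp"
)

// holePatterns pairs a hole form from the table (or the note on literal
// dots) with the regular expression it stands for.
var holePatterns = []struct{ hole, pattern string }{
	{`:[[hole]]`, `\w+`},     // alphanumeric characters and underscore
	{`:[hole\n]`, `.*\n`},    // up to and including the newline
	{`:[ hole]`, `[ \t]+`},   // spaces and tabs, no newlines
	{`:[~[.]{3}]`, `[.]{3}`}, // the literal string "..."
}

// firstMatch returns the leftmost match of pattern in input, or "" if the
// pattern does not match.
func firstMatch(pattern, input string) string {
	return regexp.MustCompile(pattern).FindString(input)
}

func main() {
	input := "  foo_bar1 := a...b\nnext"
	for _, h := range holePatterns {
		fmt.Printf("%-12s %-8s %q\n", h.hole, h.pattern, firstMatch(h.pattern, input))
	}
}
```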

Rules. Comby supports rules to express equality constraints or pattern-based matching. Comby rules are not officially supported in Sourcegraph yet. We are in the process of making that happen and are taking care to address stable performance and usability. That said, you can explore rule functionality with an experimental rule: parameter. For example:

buildSearchURLQuery(:[first], ...) rule:'where match :[first] { | " query: string" -> true }'

More examples

Below you'll find more examples. Also see our blog post for additional examples.

Match stringy data

Taking the original fmt.Sprintf(...) example, let's modify the original pattern slightly to match only if the first argument is a string. We do this by adding string quotes around .... Adding quotes communicates structural context and changes how the hole behaves: it will match the contents of a single string delimited by ". It won't match multiple strings like "foo", "bar".

fmt.Sprintf("...", ...)

See it live on Sourcegraph's code ↗

Some matched examples are:

fmt.Sprintf("external service not found: %v", e.id)
fmt.Sprintf("%s/campaigns/%s", externalURL, string(campaignID))

Holes stop matching based on the first fragment of syntax that comes after them, similar to lazy regular expression matching. So, we could write:

fmt.Sprintf(:[first], :[second], ...)

to match all fmt.Sprintf calls with three or more arguments, matching the first and second arguments based on their contextual position around the commas.
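The positional idea can be approximated in ordinary code by splitting an argument list at commas that sit outside any nested delimiters or string literals. The Go sketch below is an illustration under that assumption, not Comby's matching algorithm. splitTopLevel is a hypothetical helper, and it only understands double-quoted strings.

```go
package main

import (
	"fmt"
	"strings"
)

// splitTopLevel splits an argument list at commas that are not nested inside
// (), [], {} or a double-quoted string. Illustration only: raw strings, rune
// literals and comments are not handled.
func splitTopLevel(args string) []string {
	var parts []string
	depth, start := 0, 0
	inString := false
	for i := 0; i < len(args); i++ {
		c := args[i]
		switch {
		case inString:
			if c == '\\' {
				i++ // skip the escaped character
			} else if c == '"' {
				inString = false
			}
		case c == '"':
			inString = true
		case c == '(' || c == '[' || c == '{':
			depth++
		case c == ')' || c == ']' || c == '}':
			depth--
		case c == ',' && depth == 0:
			parts = append(parts, strings.TrimSpace(args[start:i]))
			start = i + 1
		}
	}
	// Keep the final argument; ignore the empty remainder left by a trailing comma.
	if rest := strings.TrimSpace(args[start:]); rest != "" {
		parts = append(parts, rest)
	}
	return parts
}

func main() {
	args := splitTopLevel(`"%s/campaigns/%s", externalURL, string(campaignID)`)
	if len(args) >= 3 {
		fmt.Println("first: ", args[0])
		fmt.Println("second:", args[1])
		fmt.Println("rest:  ", strings.Join(args[2:], ", "))
	}
}
```

Here the first two elements play the role of :[first] and :[second], and everything after them corresponds to the trailing ... in the pattern.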

Match equivalent expressions

Using the same identifier in multiple holes adds a constraint that both of the matched values must be syntactically equal. So, the pattern:

return :[v.], :[v.]

will match code where a pair of identifier-like syntax in the return statement are the same. For example, return true, true, return nil, nil, or return 0, 0.

See it live on Sourcegraph's code ↗
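For comparison, Go's regexp package has no backreferences, so "the same text twice" cannot be expressed in a single Go regular expression. A rough equivalent captures both values and compares them in code. The sketch below assumes identifier-like values made of word characters and dots, which is narrower than what :[v.] accepts, and equalReturnPairs is a hypothetical helper.

```go
package main

import (
	"fmt"
	"regexp"
)

// returnPair captures the two values of a two-value return statement.
// Assumption: values consist only of word characters and dots.
var returnPair = regexp.MustCompile(`return\s+([\w.]+),\s*([\w.]+)`)

// equalReturnPairs returns every two-value return in src whose values are
// textually identical, mimicking the reuse of one identifier in two holes.
func equalReturnPairs(src string) []string {
	var out []string
	for _, m := range returnPair.FindAllStringSubmatch(src, -1) {
		if m[1] == m[2] {
			out = append(out, m[0])
		}
	}
	return out
}

func main() {
	src := "return nil, nil\nreturn a, b\nreturn 0, 0\nreturn true, false\n"
	for _, match := range equalReturnPairs(src) {
		fmt.Println(match)
	}
}
```

The equality check that takes an extra loop here is what the repeated identifier v expresses directly in the pattern.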

Match JSON

Structural search also works on structured data, like JSON. Use patterns to declaratively describe pieces of data to match. For example the pattern:

"exclude": [...]

matches all parts of a JSON document that have a member "exclude" where the value is an array of items.

See it live on Sourcegraph's code ↗
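To show what the pattern describes in data terms, here is a minimal Go sketch that parses a document with encoding/json and reports every member named "exclude" whose value is an array. Structural search works on the file text rather than on a parsed document, so this is only an analogy. excludeArrays, the $-style paths it returns, and the sample document are all made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// excludeArrays returns the locations of all members named "exclude" whose
// value is a JSON array, as sorted $-style paths.
func excludeArrays(doc []byte) ([]string, error) {
	var root interface{}
	if err := json.Unmarshal(doc, &root); err != nil {
		return nil, err
	}
	var found []string
	var walk func(v interface{}, path string)
	walk = func(v interface{}, path string) {
		switch node := v.(type) {
		case map[string]interface{}:
			for key, child := range node {
				childPath := path + "." + key
				if _, isArray := child.([]interface{}); isArray && key == "exclude" {
					found = append(found, childPath)
				}
				walk(child, childPath)
			}
		case []interface{}:
			for i, child := range node {
				walk(child, fmt.Sprintf("%s[%d]", path, i))
			}
		}
	}
	walk(root, "$")
	sort.Strings(found) // map iteration order is random, so sort for stable output
	return found, nil
}

func main() {
	doc := []byte(`{
		"exclude": ["node_modules"],
		"compilerOptions": {"exclude": "not-an-array"},
		"projects": [{"exclude": []}]
	}`)
	paths, err := excludeArrays(doc)
	if err != nil {
		fmt.Println("invalid JSON:", err)
		return
	}
	for _, p := range paths {
		fmt.Println(p)
	}
}
```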

Current functionality and configuration

Structural search behaves differently to plain text search in key ways. We are continually improving functionality of this new feature, so please note the following:

  • Only indexed repos. Structural search can currently only be performed on indexed repositories. See configuration for more details if you host your own Sourcegraph installation. Our service hosted at sourcegraph.com indexes approximately 200,000 of the most popular repositories on GitHub. Other repositories are currently unsupported. To see whether a repository on your instance is indexed, visit https://<sourcegraph-host>.com/repo-org/repo-name/-/settings/index.

  • The lang keyword is semantically significant. Adding the lang keyword informs the parser about language-specific syntax for comments, strings, and code. This makes structural search more accurate for that language. For example, fmt.Sprintf(...) lang:go. If lang is omitted, we make a best-effort attempt to infer the language based on matching file extensions, or fall back to a generic structural matcher.

  • Saved searches are not supported. It is not currently possible to save structural searches.

  • Matching blocks in indentation-sensitive languages. It's not currently possible to match blocks of code that are indentation-sensitive. This is a feature planned for future work.