How can I match overlapping strings with regex?


5 Answers

regex overlapping strings match
90%

The String#match method with a global-flag regex returns an array of matched substrings. The /\d{3}/g regex matches and consumes a 3-digit sequence, that is, it reads the sequence into the buffer and advances its index to the position right after the last matched character. Thus, after "eating up" 123, the index is located after 3, and the only substring left for parsing is 45, which yields no match.

To answer the "how": you can manually change the index of the last match (this requires a loop). With a "regular" consuming \d{3} pattern, that means setting re.lastIndex to m.index+1 after each successful match.

Alternatively, wrap the pattern in a capturing group inside a zero-width lookahead, so that nothing is consumed and a match is attempted at every position:

var re = /(?=(\d{3}))/g;
console.log(Array.from('12345'.matchAll(re), x => x[1]));
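
The loop-based approach described above can be sketched in Python, where passing a start position to Pattern.search plays roughly the role that resetting re.lastIndex plays in JavaScript. The helper name find_overlapping is hypothetical; this is a minimal sketch under that assumption, not the answer's original code.

```python
import re

def find_overlapping(pattern, text):
    """Collect every match, restarting one character after each match start."""
    compiled = re.compile(pattern)
    matches = []
    pos = 0
    while True:
        m = compiled.search(text, pos)
        if m is None:
            break
        matches.append(m.group(0))
        # Advance by one from the match start, not past the match end,
        # so that overlapping matches are still found.
        pos = m.start() + 1
    return matches

print(find_overlapping(r"\d{3}", "12345"))  # ['123', '234', '345']
```

Unlike the lookahead version, this keeps the pattern itself unchanged, at the cost of an explicit loop.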
88%

Lookahead patterns are also very useful for situations where we want to match and capture text from overlapping matches.

Suppose that we need to count the occurrences of the string "that" in the input below, including all overlapping occurrences. A simple search using the regex that will give us a match count of three, because it misses all the overlapping matches. Note that there are three independent "that" substrings in the input string, but there are two additional overlapping matches that we need to match and count. Here are the start-end positions of the overlapping substring "that": ...

Let's consider the following input string as an example:

thathathisthathathatis
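
As a rough illustration of the counts described above, the following Python sketch compares a plain search with a capturing lookahead on the same input. The variable names are illustrative only, and the end positions it reports are exclusive.

```python
import re

text = "thathathisthathathatis"

# A plain search consumes each match, so overlapping occurrences are skipped.
plain = re.findall(r"that", text)

# The lookahead is zero-width: the scan advances one position at a time,
# and the capturing group records the text seen at each position.
overlapping = [(m.start(1), m.end(1)) for m in re.finditer(r"(?=(that))", text)]

print(len(plain))        # 3
print(len(overlapping))  # 5
print(overlapping)       # (start, end) pairs for every occurrence
```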
72%

findall doesn't yield overlapping matches by default. Wrapping the pattern in a capturing group inside a lookahead, as in (?=(\w\w)), does, however.

Here (?=...) is a lookahead assertion: (?=...) matches if ... matches next, but doesn't consume any of the string. For example, Isaac (?=Asimov) will match 'Isaac ' only if it's followed by 'Asimov'.

Example regex: (\d)(?=.\1)

For comparison, the plain pattern returns only the non-overlapping matches:

>>> import re
>>> match = re.findall(r'\w\w', 'hello')
>>> print(match)
['he', 'll']

Since \w\w means two word characters, 'he' and 'll' are expected. But why do 'el' and 'lo' not match the regex?

>>> match1 = re.findall(r'el', 'hello')
>>> print(match1)
['el']
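
A minimal sketch of the lookahead approach applied to the same input follows. It assumes the example regex quoted earlier is (\d)(?=.\1) with its backslashes restored; the sample string "12121" is made up for illustration.

```python
import re

# Zero-width lookahead with a capturing group: the match itself is empty,
# so the scan moves forward one character at a time and group 1 holds
# the two characters seen at each position.
pairs = re.findall(r"(?=(\w\w))", "hello")
print(pairs)  # ['he', 'el', 'll', 'lo']

# Back-reference inside a lookahead: a digit whose copy appears two
# positions later. Only the digit is consumed, so matches may interleave.
repeats = re.findall(r"(\d)(?=.\1)", "12121")
print(repeats)  # ['1', '2', '1']
```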
65%

Both patterns and strings to be searched can be Unicode strings (str) as well as 8-bit strings (bytes). However, Unicode strings and 8-bit strings cannot be mixed: that is, you cannot match a Unicode string with a byte pattern or vice-versa; similarly, when asking for a substitution, the replacement string must be of the same type as both the pattern and the search string.

The flags attribute of a compiled pattern holds the regex matching flags. This is a combination of the flags given to compile(), any (?...) inline flags in the pattern, and implicit flags such as UNICODE if the pattern is a Unicode string. Note that for backward compatibility, the re.U flag still exists (as well as its synonym re.UNICODE and its embedded counterpart (?u)), but these are redundant in Python 3 since matches are Unicode by default for strings (and Unicode matching isn't allowed for bytes).

When one wants to match a literal backslash, it must be escaped in the regular expression. With raw string notation, this means r"\\". Without raw string notation, one must use "\\\\".

A positive lookbehind assertion, (?<=...), matches only if the current position is preceded by a match for ..., again without consuming it. This example looks for 'def' preceded by 'abc':

>>> import re
>>> m = re.search('(?<=abc)def', 'abcdef')
>>> m.group(0)
'def'
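
The type and escaping rules quoted above can be checked with a short sketch. The sample inputs and variable names are made up for illustration; only standard re calls are used.

```python
import re

# str pattern with str input, bytes pattern with bytes input: both are fine.
str_hits = re.findall(r"\d", "a1b2")
bytes_hits = re.findall(rb"\d", b"a1b2")

# Mixing a bytes pattern with a str input raises TypeError.
try:
    re.findall(rb"\d", "a1b2")
    mixed_ok = True
except TypeError:
    mixed_ok = False

# A literal backslash: r"\\" and "\\\\" spell the same two-character pattern.
same_pattern = (r"\\" == "\\\\")
backslash_hits = re.findall(r"\\", "a\\b\\c")

print(str_hits, bytes_hits, mixed_ok, same_pattern, len(backslash_hits))
```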
75%

A regular expression is used to determine whether a string matches a pattern and, if it does, to extract or transform the parts that match. The Regex class compiles a regular expression, supplied as a string, into a pattern that can be matched against inputs.

The canonical way to create a Regex is by using the method r, provided implicitly for strings:

val date = raw"(\d{4})-(\d{2})-(\d{2})".r

To extract the capturing groups when a Regex is matched, use it as an extractor in a pattern match:

"2004-01-20" match {
   case date(year, month, day) => s"$year was a good year for PLs."
}

To check only whether the Regex matches, ignoring any groups, use a sequence wildcard:

"2004-01-20" match {
   case date(_*) => "It's a date!"
}

That works because a Regex extractor produces a sequence of strings. Extracting only the year from a date could also be expressed with a sequence wildcard:

"2004-01-20" match {
   case date(year, _*) => s"$year was a good year for PLs."
}

In a pattern match, Regex normally matches the entire input. However, an unanchored Regex finds the pattern anywhere in the input.

val embeddedDate = date.unanchored
"Date: 2004-01-20 17:25:18 GMT (10 years, 28 weeks, 5 days, 17 hours and 51 minutes ago)" match {
   case embeddedDate("2004", "01", "20") => "A Scala is born."
}

Pattern matching with an unanchored Regex, as in the previous example, can also be accomplished using findFirstMatchIn. The findFirst methods return an Option which is non-empty if a match is found, or None for no match:

val dates = "Important dates in history: 2004-01-20, 1958-09-05, 2010-10-06, 2011-07-15"
val firstDate = date.findFirstIn(dates).getOrElse("No date found.")
val firstYear =
   for (m <- date.findFirstMatchIn(dates)) yield m.group(1)

To find all matches:

val allYears =
   for (m <- date.findAllMatchIn(dates)) yield m.group(1)

To check whether input is matched by the regex:

date.matches("2018-03-01") // true
date.matches("Today is 2018-03-01") // false
date.unanchored.matches("Today is 2018-03-01") // true

To iterate over the matched strings, use findAllIn, which returns a special iterator that can be queried for the MatchData of the last match:

val mi = date.findAllIn(dates)
while (mi.hasNext) {
   val d = mi.next
   if (mi.group(1).toInt < 1960) println(s"$d: An oldie but goodie.")
}

Although the MatchIterator returned by findAllIn is used like any Iterator, with alternating calls to hasNext and next, hasNext has the additional side effect of advancing the underlying matcher to the next unconsumed match. This effect is visible in the MatchData representing the "current match".

val r = "(ab+c)".r
val s = "xxxabcyyyabbczzz"
r.findAllIn(s).start // 3
val mi = r.findAllIn(s)
mi.hasNext // true
mi.start // 3
mi.next() // "abc"
mi.start // 3
mi.hasNext // true
mi.start // 9
mi.next() // "abbc"

Note that findAllIn finds matches that don't overlap. (See findAllIn for more examples.)

val num = raw"(\d+)".r
val all = num.findAllIn("123").toList // List("123"), not List("123", "23", "3")
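
For contrast, the overlapping list mentioned in the comment above can be produced with a capturing lookahead. The sketch below uses Python's re module rather than Scala's Regex, purely as an illustration of the technique; the variable names are made up.

```python
import re

# Default scanning: one greedy, non-overlapping match.
non_overlapping = re.findall(r"\d+", "123")

# Lookahead with a capturing group: a greedy match is attempted
# at every position without consuming any input.
overlapping = re.findall(r"(?=(\d+))", "123")

print(non_overlapping)  # ['123']
print(overlapping)      # ['123', '23', '3']
```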

Text replacement can be performed unconditionally or as a function of the current match:

import java.util.Calendar

val redacted = date.replaceAllIn(dates, "XXXX-XX-XX")
val yearsOnly = date.replaceAllIn(dates, m => m.group(1))
// Abbreviated month names, built by formatting the first day of each month.
val months = (0 to 11).map { i =>
   val c = Calendar.getInstance
   c.set(2014, i, 1)
   f"$c%tb"
}
val reformatted = date.replaceAllIn(dates, _ match {
   case date(y, m, d) => f"${months(m.toInt - 1)} $d, $y"
})

Pattern matching the Match against the Regex that created it does not reapply the Regex. In the expression for reformatted, each date match is computed once. But it is possible to apply a Regex to a Match resulting from a different pattern:

val docSpree = """2011(?:-\d{2}){2}""".r
val docView = date.replaceAllIn(dates, _ match {
   case docSpree() => "Historic doc spree!"
   case _ => "Something else happened"
})