./overlapping_strings

An extension of one of my recent Stack Overflow answers.

I’ll show two pure Python solutions for concatenating an arbitrary list of strings based on suffix-prefix pairs. This means that EVERY possible pair of words must be checked to see if their suffix and prefixes match e.g.: "helloworld" and "worldfoo".

Here’s an example of a (short) list of strings for which we can start working on a solution:

strings = ["lazydog",
           "Thequick",
           "thelazy",
           "overthe",
           "foxjumps",
           "jumpsover",
           "brownfox",
           "quickbrown"]

Notice that if the strings were in order, e.g., "Thequick", "quickbrown", "brownfox", ..., then it’d be pretty straight forward to join the strings and remove the overlaps to get the desired result "Thequickbrownfox...". Of course, we can’t make that assumption about an ordered list, so we’ll have to figure out the ordering manually.

Since we need to check every pair of words, and since we can’t assume any pair is ordered, we also have to take the reverse of every pair:

pairs = set()
for a in strings:
    for b in strings:
        if a != b:
            pairs.add((a, b))
            pairs.add((b, a))

Output:

{('Thequick', 'brownfox'),
 ('Thequick', 'foxjumps'),
 ('Thequick', 'jumpsover'),
 ('Thequick', 'lazydog'),
 ('Thequick', 'overthe'),
 ('Thequick', 'quickbrown'),
 ('Thequick', 'thelazy'),
 ('brownfox', 'Thequick'),
 ('brownfox', 'foxjumps'),
 ('brownfox', 'jumpsover'),
 ('brownfox', 'lazydog'),
 ('brownfox', 'overthe'),
 .
 .
 .

For this task, we can use permutations instead, which tends to be a bit quicker than the cartesian product:

import itertools

set(itertools.permutations(strings, r=2))

Output:

{('Thequick', 'brownfox'),
 ('Thequick', 'foxjumps'),
 ('Thequick', 'jumpsover'),
 ('Thequick', 'lazydog'),
 ('Thequick', 'overthe'),
 ('Thequick', 'quickbrown'),
 ('Thequick', 'thelazy'),
 ('brownfox', 'Thequick'),
 ('brownfox', 'foxjumps'),
 ('brownfox', 'jumpsover'),
 ('brownfox', 'lazydog'),
 ('brownfox', 'overthe'),
 .
 .
 .

Now for every pair, we need to find whether or not they overlap. Take, for example, the pair ('brownfox', 'foxjumps'):

brownfox
       foxjumps

brownfox
      foxjumps

brownfox
     foxjumps  # The end of 'brownfox' overlaps with the start of 'foxjumps'

brownfox
    foxjumps

brownfox
   foxjumps

brownfox
  foxjumps

brownfox
 foxjumps

Now take ('brownfox', 'jumpsover'):

brownfox
       jumpsover

brownfox
      jumpsover

brownfox
     jumpsover

brownfox
    jumpsover

brownfox
   jumpsover

brownfox
  jumpsover

brownfox
 jumpsover

 # No point in checking any further than this

We’ll use this function to find a suffix-prefix match:

def get_match(pair: Tuple[str, str], min_overlap: int = 3) -> str:
    a, b = pair
    for i in range(min_overlap, min(map(len, pair)) + 1):
        if a[-i:] == b[:i]:
            return b[:i]
    return ""

What this function does is essentially what is shown in the diagrams above: the second string b from pair is “slid” over the “top” of the first string a from pair until a match is found at which point the function returns immediately. The min_overlap parameter is optional and allows us to enforce that a match must be at least min_overlap number of characters, e.g., if min_overlap == 2, then "hello", "one" shouldn’t match since the only suffix-prefix match is "o".

Now, we need to iterate over every possible pairing, and record all pairs which match:

for pair in itertools.permutations(strings, r=2):
    if (match := get_match(pair)):
        print(pair)

Output:

('Thequick', 'quickbrown')
('thelazy', 'lazydog')
('overthe', 'thelazy')
('foxjumps', 'jumpsover')
('jumpsover', 'overthe')
('brownfox', 'foxjumps')
('quickbrown', 'brownfox')

Now here’s where the two solutions diverge.

First solution: constructing a DAG

So we know that the solution for our particular example should be Thequickbrownfoxjumpsoverthelazydog. This can be represented by a (boring) Directed Acyclic Graph:

Thequick -> quickbrown -> brownfox -> foxjumps -> jumpsover -> overthe -> thelazy -> lazydog

You may also recognize that as a singly-linked list, which is what it is in essence, save for the fact that we only have the nodes, and the rules for what can be connected. We still have to properly construct that singly-linked list.

This first solution uses a dictionary to track the adjacencies:

def links_joiners(strings: List[str]) -> Tuple[Dict[str, str], Set[str]]:
    links = dict()
    joiners = set()
    for pair in itertools.permutations(strings, r=2):
        if (match := get_match(pair)):
            joiners.add(match)
            links.update((pair,))
    return links, joiners

Output:

>>> links, joiners = links_joiners(strings)
>>> links
{'Thequick': 'quickbrown',
 'thelazy': 'lazydog',
 'overthe': 'thelazy',
 'foxjumps': 'jumpsover',
 'jumpsover': 'overthe',
 'brownfox': 'foxjumps',
 'quickbrown': 'brownfox'}

>>> joiners
{'brown', 'fox', 'jumps', 'lazy', 'over', 'quick', 'the'}

For now, we’ll focus on the links dictionary, which shows us our adjacencies. Recall our target:

Thequick -> quickbrown -> brownfox -> foxjumps -> jumpsover -> overthe -> thelazy -> lazydog

Now pick a random key in links and take its associated value. If that value is also a key, repeat that last step with the new key. If you get a value that isn’t also a key in links, you’ve reached the end of the DAG. For example, consider an intial starting key, "jumpsover".

"jumpsover" points to "overthe"
"overthe" points to "thelazy"
"thelazy" points to "lazydog"
"lazydog" is not a key in links, therefore we’ve reached the end of the DAG

This algorithm gives us the number of steps from any given key in the DAG (or “node”) to the end. Our goal is to connect the nodes in the correct order without having to rerun another get_match() function on all the adjacent pairs of pairs. With that in mind, we can recursively traverse over links, finding the number of steps to the end of the DAG from every node. Since there’s only one sequence which results in the longest list, and that will be achieved by ordering the nodes by their number of steps to the end of the DAG, in descending order:

 nodes:   Thequick -> quickbrown -> brownfox -> foxjumps -> jumpsover -> overthe -> thelazy -> lazydog
 steps:       7           6            5           4            3           2          1          0

Here’s the code:

def sort_nodes(nodes: List[str], links: Dict[str, str]) -> List[str]:
    def dist(node: str) -> int:
        if node not in links:
            return 0
        return 1 + dist(links[node])
    return sorted(nodes, key=dist, reverse=True)

The inner function dist() is what’s known as a closure, which means that dist() encloses or “captures” any variables which are local to the scope of the closure’s parent and which are used inside the closure. Here, the two variables in the parent scope are nodes, and links, the two parameters of the sort_nodes() function.

Notice that inside the closure dist(), the variable links is accessed even though links isn’t within the scope of dist(). This is taking advantage of the fact that a function in Python will jump up the call stack to its parent’s scope for a variable reference if it can’t find it locally first.

The reason for using a closure here is so dist only needs to take one parameter. This is so we can use it as the sorting method for the built-in sorted() function. Under the hood, Python will take each node in the list of nodes, and use dist to determine its ordinality.

Let’s look at the closure more closely:

def dist(node: str) -> int:
    if node not in links:
        return 0
    return 1 + dist(links[node])

Take a minute to convince yourself that this function implements the algorithm outlined above. Let’s do an example. Recall links:

{'Thequick': 'quickbrown',
 'thelazy': 'lazydog',
 'overthe': 'thelazy',
 'foxjumps': 'jumpsover',
 'jumpsover': 'overthe',
 'brownfox': 'foxjumps',
 'quickbrown': 'brownfox'}

Now trace an example call, dist("jumpsover"):

dist("jumpsover") = 1 + dist("overthe")
                  = 1 + (1 + dist("thelazy"))
                  = 1 + (1 + (1 + dist("lazydog")))  # base case; "lazydog" is not a key in `links`
                  = 1 + (1 + (1 + 0))
                  = 3

Now, by passing all of our strings, or “nodes”, to sorted(), we will end up with a properly sorted list:

strings = ["lazydog",
           "Thequick",
           "thelazy",
           "overthe",
           "foxjumps",
           "jumpsover",
           "brownfox",
           "quickbrown"]

nodes_in_order = sort_nodes(strings, links)

Output:

>>> nodes_in_order
['Thequick',
 'quickbrown',
 'brownfox',
 'foxjumps',
 'jumpsover',
 'overthe',
 'thelazy',
 'lazydog']

Now, recall joiners from earlier:

{'brown', 'fox', 'jumps', 'lazy', 'over', 'quick', 'the'}

Our sorted list, together with joiners are all we need in order to finally put the pieces together. However, there’s one last hurdle, when we join nodes_in_order, all of the overlapping strings are present:

>>> "".join(nodes_in_order)
'Thequickquickbrownbrownfoxfoxjumpsjumpsoveroverthethelazylazydog'

So instead of leaving it at that, we iterate over joiners and for each joiner j, we replace one copy of j with the empty string so as to get rid of the duplicates formed by joining:

def join_strings(strings: List[str], joiners: Set[str]) -> str:
    result = "".join(strings)
    for j in joiners:
        result = result.replace(j, "", 1)
    return result

Output:

>>> join_strings(nodes_in_order, joiners)
'Thequickbrownfoxjumpsoverthelazydog'

Done!

Second solution: pure, unfettered recursion

This solution reuses join_strings() and get_match() from above. This solution foregoes collecting pairs, creating a dictionary of adjacency “links”, finding the correct ordering – and skips straight to joining pairs of strings:

def join_overlapping_pairs(strings: Set[str]) -> str:
    if len(strings) == 1:
        return strings.pop()
    matches = set()
    for pair in itertools.permutations(strings, 2):
        if (match := get_match(pair)):
            matches.add(join_strings(pair, (match,)))
    return join_overlapping_pairs(filter_substrings(matches))


def filter_substrings(strings: Set[str]) -> Set[str]:
    if len(strings) == 1:
        return strings
    largest = set()
    for s1 in strings:
        for s2 in strings - {s1,}:
            if (s1 in s2):
                break
        else:
            largest.add(s1)
    if largest == strings:
        return strings
    return largest | filter_substrings(largest)

By adding print(matches) right before the recursive call of join_overlapping_pairs(), we can trace the behavior of the function:

>>> join_overlapping_pairs(strings)
{'brownfoxjumps', 'foxjumpsover', 'jumpsoverthe', 'quickbrownfox', 'overthelazy', 'thelazydog', 'Thequickbrown'}
{'Thequickbrownfoxjumps', 'quickbrownfoxjumpsover', 'jumpsoverthelazydog', 'foxjumpsoverthelazy', 'brownfoxjumpsoverthe'}
{'Thequickbrownfoxjumpsoverthelazydog'}

The filter_substrings() function isn’t wholly necessary, but it does reduce the size of the search space join_overlapping_pairs() has to work with. Without it, here’s what the matches set looks like after the second iteration of join_overlapping_pairs():

{'overthelazydog', 'brownfoxjumpsoverthe', 'quickbrownfoxjumps', 'jumpsoverthelazydog', 'jumpsoverthelazy', 'Thequickbrownfox', 'foxjumpsoverthelazy', 'Thequickbrownfoxjumps', 'quickbrownfoxjumpsover', 'foxjumpsoverthe', 'brownfoxjumpsover'}

But if you look closely, many of these strings are actually substrings of others (e.g., "jumpsoverthelazy" and "foxjumpsoverthelazy"), which means we can get rid of a lot of redundancy. If we print len(matches) instead:

The numbers on the left are how many strings were passed to join_overlapping_pairs(). The numbers on the right are the size of the search space. Notice that the number of strings and the size of the search space (the paired permutations of every string, which is n * (n - 1)) actually increases before it decreases. This isn’t a good sign if we want to use this for larger sets of strings…

By first filtering out the substrings before recursing, we can not only reduce the search space for successive iterations, but even reduce the number of steps it takes to find a solution:

42
20
0

The filter_substrings() function is pretty straightforward. It recursively iterates over the set, checking each string in the set to determine if it’s a substring of any other string in the set.

First solution: source code

Here’s the source code for the first solution:

import itertools
from typing import Set, Tuple, Dict, List


def get_match(pair: Tuple[str, str], min_overlap: int = 3) -> str:
    """
    For `a`, `b` in `pair`, this 'slides' `b` over the end of
    `a` from the left, starting at `min_overlap`. Returns as
    soon as a match is found, or the empty string otherwise.
    """
    a, b = pair
    for i in range(min_overlap, min(map(len, pair)) + 1):
        if a[-i:] == b[:i]:
            return b[:i]
    return ""


def join_strings(strings: List[str], joiners: Set[str]) -> str:
    """
    Joins an ordered sequence of `strings`, removing the extraneous `joiner`
    between each joined pair.
    """
    result = "".join(strings)
    for j in joiners:
        result = result.replace(j, "", 1)
    return result


def links_joiners(strings: List[str]) -> Tuple[Dict[str, str], Set[str]]:
    """
    Links will be key:value pairs of words that have a common
    substring where the end of `s1` overlaps with the start of `s2`.
    """
    links = dict()
    joiners = set()
    for pair in itertools.permutations(strings, r=2):
        if (match := get_match(pair)):
            joiners.add(match)
            links.update((pair,))
    return links, joiners


def sort_nodes(nodes: List[str], links: Dict[str, str]) -> List[str]:
    """
    A recursive closure which determines the length of each sequence
    starting at `node` and traversing the `links` dictionary.
    """
    def dist(node: str) -> int:
        if node not in links:
            return 0
        return 1 + dist(links[node])
    return sorted(nodes, key=dist, reverse=True)

Second solution: source code:

import itertools
from typing import Set, Tuple, List, Union


def get_match(pair: Tuple[str, str], min_overlap: int = 3) -> str:
    """
    For `a`, `b` in `pair`, this 'slides' `b` over the end of
    `a` from the left, starting at `min_overlap`. Returns as
    soon as a match is found, or the empty string otherwise.
    """
    a, b = pair
    for i in range(min_overlap, min(map(len, pair)) + 1):
        if a[-i:] == b[:i]:
            return b[:i]
    return ""


def join_strings(strings: List[str], joiners: Set[str]) -> str:
    """
    Joins an ordered sequence of `strings`, removing the extraneous `joiner`
    between each joined pair.
    """
    result = "".join(strings)
    for j in joiners:
        result = result.replace(j, "", 1)
    return result


def join_overlapping_pairs(strings: Set[str]) -> str:
    if len(strings) == 1:
        return strings.pop()
    matches = set()
    for pair in itertools.permutations(strings, 2):
        if (match := get_match(pair)):
            matches.add(join_strings(pair, (match,)))
    return join_overlapping_pairs(filter_substrings(matches))


def filter_substrings(strings: Set[str]) -> Set[str]:
    if len(strings) == 1:
        return strings
    largest = set()
    for s1 in strings:
        for s2 in strings - {s1,}:
            if s1 in s2:
                break
        else:
            largest.add(s1)
    if largest == strings:
        return strings
    return largest | filter_substrings(largest)