Boyer Moore Algorithm for Pattern Searching

In this discussion, we will discuss everything about the Boyer-Moore algorithm, exploring its fundamental principles and key components. We’ll understand the algorithm’s core concepts, including the left-to-right scan, the bad character heuristic, and the good suffix heuristic. Moreover, the practical applications of Boyer-Moore across various domains, from text search to bioinformatics, will be discussed.

Check out our Data Structures Course video on YouTube

What is Boyer Moore Algorithm?

The Boyer-Moore algorithm is a string-matching algorithm. It finds all occurrences of a pattern string P or a substring in a text string T. The algorithm pre-processes the string pattern P being searched for but does not pre-process the text string T in which the pattern is being searched.

The Boyer Moore algorithm speeds up searching by using preprocessed data to skip parts of the text, making it faster than many other string search algorithms. Its distinct characteristic involves matching from the pattern’s end instead of the beginning. Also, it traverses the text in larger jumps of multiple characters instead of checking each character one by one.

The Boyer–Moore algorithm looks for instances of pattern P in text T by conducting targeted character comparisons at various alignments. Instead of implementing a brute-force approach that checks all alignments, Boyer–Moore uses preprocessed information from pattern P to skip as many alignments as feasible. For example, consider that we have a text string “Hello World!” and we want to locate the pattern string “World!”. The Boyer Moore algorithm will help find the pattern much quicker than the brute force approach.

Input:

Text: “Hello World!”
Pattern: “World!”

Output:

Pattern found at position: 7

Approaches Used in Boyer Moore Algorithm

The following section consists of two different approaches for searching a pattern in any text string:

Get 100% Hike!

Master Most in Demand Skills Now!

Bad Character Heuristic

In this approach, the goal is to identify a bad character, which is a character in the main string that does not match the corresponding position in the pattern. Upon detecting a mismatch, we can divide the scenario into two cases. In the first case, when the mismatched character in text T is present in pattern P. In this case, we shift the entire pattern until the mismatch transforms into a match. In the second case, when a mismatched character is not present in pattern P, the pattern moves past the bad character. We will discuss both cases in the section below, along with examples:

Case 1:

We will see an example where the mismatched character in the text is present in the pattern, and we will shift the entire pattern until the mismatch converts into a match.

Here we are searching for the pattern “FBCGH” in the text “ABCGEFGHIJKLMNOP.” We got a mismatch between C and F. The bad character ‘F’ in the text is present in the pattern. So here we will shift the pattern until it matches the bad character ‘F’ of the text.

Here is what we will get after shifting it:

We performed the shift because there is a possibility that, under certain circumstances, we might find a matching pattern in that adjusted position.

Case 2: In the second case, where the mismatched character in text T is absent in pattern P, the approach is to shift the pattern P until it surpasses the position of that mismatched character in T.

Here we have the mismatch between “H” and “K” so “K” is the bad character here, and it is not present in the pattern. Therefore, there is no point in re-comparing the pattern to any preceding portion; the strategy is to shift pattern P until it extends beyond the position after the character “K” in the text. After doing this, we will get something like this:

The code for the scenarios discussed above is:

#include <iostream>
#include <unordered_map>
using namespace std;
void badCharHeuristic(const string &pattern, unordered_map<char, int> &badCharacter) {
    int patternLength = pattern.length();
    for (int i = 0; i < patternLength - 1; ++i) {
        badCharacter[pattern[i]] = patternLength - 1 - i;
    }
}
void boyermoore(const string &text, const string &pattern) {
    int m = pattern.size();
    int n = text.size();
    unordered_map<char, int> badCharacter;
    badCharHeuristic(pattern, badCharacter);
    int idx = 0; /* idx defines how much pattern is shifted */
    while (idx <= (n - m)) {
        int j = m - 1;
        /* Reducing j of P until we get a mismatch character, at this shift idx */
        while (j >= 0 && pattern[j] == text[idx + j])
            j--;
        /* In case if we get j=-1 which signifies that P is present at the current shift */
        if (j < 0) {
            cout << "Pattern found at shift = " << idx << "\n";
            /* We will shift P such that the next character in T aligns with the last
            occurrence of it in P. To cover the case when P occurs at the end of T. */
            idx += (idx + m < n) ? m - badCharacter[text[idx + m]] : 1;
        }
        // other case
        else
            /* In this case also, We will shift P such that the next character in T aligns
            with the last occurrence of it in P. To ensure that we get a positive
            shift, we are using the max function. The negative shift may occur when the
            last occurrence of a bad character in P is on the right side of the current
            character. */
            idx += max(1, j - badCharacter[text[idx + j]]);
    }
}
int main() {
    string text = "ABCGEFGHIJKFBCGH";
    string pattern = "FBCGH";
    boyermoore(text, pattern);
    return 0;
}

This C++ code implements the Boyer-Moore algorithm with the Bad Character Heuristic. In the “badCharHeuristic” function, a bad character table is constructed, mapping each character in the pattern to its rightmost occurrence position, excluding the last character.

The “boyermoore” function utilizes this table to efficiently search for occurrences of the pattern within the given text. Using a main loop, the algorithm compares characters from right to left, identifying matches or mismatches. When a match is found, the algorithm prints the shift index and calculates the next shift based on the bad character heuristic. In the case of a mismatch, the algorithm intelligently shifts the pattern, ensuring positive shifts using the “max” function.

The output of this code will be:

Pattern found at shift = 10
Pattern found at shift = 16

Go through these Top 50 Data Structures Interview Questions and Answers to crack your interviews.

Good Suffix Heuristic

The Good Suffix Heuristic is a component of the Boyer-Moore algorithm that strategically considers the properties of a pattern’s suffixes and sequences of characters from the end of the pattern, to optimize shift decisions when a mismatch occurs. Its primary objective is to identify a “good suffix” within the pattern—a suffix that holds the potential to align effectively with the remaining section of the text.

To implement this strategy, the Good Suffix Heuristic relies on a data structure known as the “Good Suffix Table” or “bpos” table. This table is constructed in the preprocessing phase of the Boyer-Moore algorithm and maintains information about positions within the pattern where the longest matching suffix occurs.

During the actual pattern-matching process, when a mismatch is identified, the Good Suffix Heuristic consults the Good Suffix Table to ascertain the maximum length of the matching suffix that precedes the mismatched character. Denoted as ‘j’, this value serves as the indicator for the optimal offset by which the pattern should be shifted. This strategic shift aims to align the identified good suffix with the corresponding section of the text, enhancing the algorithm’s efficiency in locating pattern occurrences.

Given below is the code that demonstrates the implementation of the good suffix heuristic:

#include <iostream>
#include <vector>
using namespace std;
void goodSuffixHeuristic(string pattern, vector<int>& shiftArr) {
    int m = pattern.length();
    int j = 0;
    for (int i = m; i >= 1; i--) {
        while (j < m && pattern[j] != pattern[i - 1]) {
            j = shiftArr[j];
        }
        shiftArr[i - 1] = j + 1;
        j++;
    }
}
int boyerMoore(string text, string pattern) {
    int n = text.length();
    int m = pattern.length();
    vector<int> shiftArr(m);
    goodSuffixHeuristic(pattern, shiftArr);
    int i = 0;
    int j = 0;
    while (i <= n - m) {
        if (pattern[j] == text[i + j]) {
            j++;
            if (j == m) {
                return i;
            }
        } else {
            i += max(1, m - shiftArr[j]);
            j = shiftArr[j];
        }
    }
    return -1;
}
int main() {
    string text = "ABCGEFGHIJKFBCGH";
    string pattern = "FBCGH";
    int pos = boyerMoore(text, pattern);
    if (pos != -1) {
        cout << "Pattern found at position: " << pos << endl;
    } else {
        cout << "Pattern not found" << endl;
    }
    return 0;
}

This program gives the output:

Pattern found at position: 10

In the above code, the goodSuffixHeuristic() function constructs a “good suffix” shift array, which determines the amount to shift the pattern relative to the mismatched character during the search. The boyerMoore() function utilizes this shift array to perform the actual search. It iterates through the text and compares characters from the pattern, employing a two-step process:

First, if a match is found, it continues matching subsequent characters.

Second, if a mismatch occurs, it shifts the pattern intelligently based on the heuristic information. The main() function applies this algorithm to search for the pattern “FBCGH” in the text “ABCGEFGHIJKFBCGH” and outputs the position where the pattern is found or indicates if it is not present.

In the provided example, the output is “Pattern found at position: 10,” which means that the pattern “FBCGH” is present in the text at index 10.

Check out Intellipaat’s C Programming Tutorial.

Time Complexity of Boyer Moore Algorithm

The Boyer-Moore algorithm shows different time complexity characteristics across different scenarios. In the best case, when the pattern is absent from the text, the algorithm operates with a sub-linear time complexity of O(n/m), where n is the text’s length and m is the pattern’s length. This efficient behavior allows the algorithm to quickly scan the text without encountering mismatches.

Conversely, in the worst case, with a time complexity of O(mn), where m is the length of the pattern and n is the length of the text. The Boyer-Moore algorithm faces challenges when the pattern is present in the text, and its characters closely resemble those in the text. This scenario may necessitate multiple comparisons for each pattern character, which leads to a quadratic increase in the overall number of comparisons.

On average, the Boyer-Moore algorithm maintains an O(n) time complexity, indicating good performance in practical scenarios. The algorithm’s effectiveness is attributed to heuristic techniques, such as the Bad Character Heuristic and the Good Suffix Heuristic, which intelligently reduce the need for unnecessary comparisons.

Applications of Boyer Moore Algorithm

The Boyer-Moore algorithm is known for its efficiency in pattern matching and finds application in diverse fields. Some applications of this algorithm are given below:

Boyer Moore Algorithm is used in text editors and processing tools for efficient pattern identification, facilitating quick location of keywords or structures within documents.
It is employed in lexical analysis during compiler construction to recognize and classify tokens within source code, helping in the identification of syntactic elements in programming languages.
In bioinformatics it is used for analyzing DNA sequences, identifying genes, and detecting patterns related to genetic mutations or diseases.
Boyer Moore Algorithm plays a key role in information retrieval systems and search engines, efficiently matching user queries against large document collections to identify relevant documents.
In data mining applications, this algorithm is utilized to identify patterns and relationships within large datasets, facilitating the extraction of hidden patterns and classification based on specific characteristics.
This algorithm is applied in network security systems for identifying suspicious patterns or malicious code in network traffic, helping in the detection of intrusion attempts, and filtering out harmful content.
Through this algorithm we use NLP applications to identify word forms, phrases, or grammatical structures within text, enabling information extraction, sentiment analysis, and other text analysis tasks.
To provide real-time feedback on spelling and grammar errors in educational software, this algorithm helps identify incorrect word usage and improve writing proficiency.
It is utilized in plagiarism detection tools to identify similarities between documents and detect potential copyright infringements, comparing text against online sources for content originality verification.

Conclusion

The Boyer-Moore algorithm is better than other pattern search algorithms due to its preprocessing efficiency, innovative heuristics, and adaptive shift strategies. Its preprocessing step creates heuristics like the Bad Character Heuristic and the Good Suffix Heuristic, which enable informed decisions during the search phase and reduce unnecessary character comparisons.

The algorithm’s left-to-right scan optimizes the comparison process by starting from the end of the pattern, contributing to its efficiency. Boyer-Moore’s adaptability, dynamic shifts, and favorable average-case time complexity of O(n) make it a versatile and powerful tool in various applications such as text search, bioinformatics, compiler design, and information retrieval.

Particularly adept at handling large patterns, Boyer-Moore’s ability is to skip portions of the text based on heuristics sets as it apart in scenarios where other algorithms may face challenges. While Boyer-Moore stands out for its efficiency, the choice of a pattern search algorithm ultimately depends on specific data characteristics and application requirements.