Hello everyone, I have recently been reading about Text Mining and Natural Language Processing (NLP), so I will be sharing a few articles on Text Mining from now on, in the hope that you develop an interest in this beautiful field of Computer Science.
Prerequisites
To understand the main content of this article you need the following prerequisites:
- Familiarity with programming.
- Knowledge of at least one programming language.
- Basics of Python (optional but recommended).
- Some prior exposure to data structures and algorithmic techniques such as backtracking and dynamic programming.
- Some gray cells and patience.
- Passion for Computer Science.
Let's start.
The first question that should arise in your mind when you hear the term Text Mining is: what is Text Mining and what does it do? So first we will understand what Text or Data Mining is. Let's proceed.
Text Mining or Data Mining
Text mining, also referred to as text data mining and roughly equivalent to text analytics, refers to the process of deriving high-quality information from text. High-quality information is typically derived through the devising of patterns and trends through means such as statistical pattern learning.
Text mining usually involves the process of structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluation and interpretation of the output.
To explain it simply:
Data Mining is a process of locating important expressions, phrases, sentences or events in data or text, which can be processed further to get more useful information. Even though data mining tries to collect information from data, it is different from Information Retrieval, despite the fact that Data Mining, Information Retrieval and Natural Language Processing are often connected together and misunderstood as one. I will explain how they differ some other time, not here. If you want to develop an application based on text, or one where you have to deal with a large amount of data and need an intelligent system that classifies the data, sorts it into different categories or runs other processes on it, then you will need these concepts a lot.
Real-life examples of Data Mining
1. Google Search: Whenever you start writing a query, it automatically suggests possible search queries. How does it do that? It uses these concepts to derive the most probable search queries.
2. Spell Correction: You might have seen that if you type a wrong spelling, most search engines and word processors mark it with a red underline, which means you have a spelling error. MS Word and Android apps have automatic spell correction. They use these NLP and Text Mining concepts. See the image below, which shows how Google does spell correction on a query I entered; it should clear a lot of doubts.
3. Amazon or other online shopping stores: Most of you must have shopped online a lot. When you visit a site for the first time and search for some specific product, you see something labelled "recommended for you". The next time you visit the site, you might be astonished to see that the products shown on the front page are similar to what you searched for on your last visit, but more refined. So how does it work? The concepts mentioned earlier are used here too: the site collects your browsing history, processes the data and shows you the most recommended products.
There are thousands of examples; just Google for them.
I hope by now you have a rough understanding of what data or text mining is. Now we come to today's real discussion, fuzzy string searching. So let's continue.
Today's Discussion
In this section we discuss string matching, or string similarity, algorithms. We will cover the following algorithms today:
1. Damerau-Levenshtein Distance
2. Levenshtein Distance
3. Wagner-Fischer Algorithm
Strings are among the most important elements of computer science; there are literally hundreds of research papers on them. String searching and matching is one of the most important parts of text processing and text analysis, since all the data fed to intelligent systems arrives as strings. A string is nothing but a sequence of characters joined together.
Fuzzy String Searching or Fuzzy String Matching
Fuzzy string search algorithms are algorithms that match one string with another either exactly or partially. So what exactly does fuzzy mean? From the word itself we can understand that it refers to things that aren't clear-cut. Therefore, in short, when a set of n elements and a set of m elements have partially the same elements, we can say that the relation between them is fuzzy.
By fuzzy string searching (also known as approximate string matching) we identify strings that are similar to a set of other strings. A number of results can be derived from this, such as similarity coefficients and lists of similar words.
The problem of approximate string matching is typically divided into two sub-problems: finding approximate sub-string matches inside a given string and finding dictionary strings that match the pattern approximately.
Fuzzy search algorithms (also known as similarity search algorithms) serve as a base for spell-checkers and for search engines like Google, Bing and Yahoo. For example, these algorithms are used to provide the "Did you mean ..." functionality that is now very common.
Example: If you ask a machine to check two words and correct them if required, say "Boobs" and "Boobes", the machine will return true for the first string, "Boobs", since it is correct, and will return the corrected string "Boobs" for the erroneous string "Boobes".
Simple Formulation of Fuzzy String Problems :- "Find in the text or dictionary of a finite size n for all the words that match a word given as input (or starts with the given word), taking into account m possible differences (errors)."
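The formulation above can be sketched in a few lines of Python. This is a minimal, illustrative sketch, not a production implementation: the helper names (`levenshtein`, `fuzzy_search`) and the sample dictionary are mine, invented for this example, and the edit-distance function used here is of the kind developed later in this article.

```python
def levenshtein(a, b):
    # Iterative two-row dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def fuzzy_search(query, dictionary, max_errors=1):
    """Return all dictionary words within max_errors edits of the query."""
    return [w for w in dictionary if levenshtein(query, w) <= max_errors]

words = ["hello", "help", "hell", "world", "word"]
print(fuzzy_search("helo", words, max_errors=1))  # → ['hello', 'help', 'hell']
```

Here n is the dictionary size and m is `max_errors`; a real spell-checker would use smarter index structures instead of scanning the whole dictionary, but the idea is the same.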
Fuzzy search algorithms are characterized by a metric: a function of the distance between two words which provides a measure of their similarity. A strict mathematical definition of a metric includes the requirement to satisfy the triangle inequality (p is the metric function, Z is a set of words):
LaTeX code for the above equation:
p(x,y) \leq p(x,k) + p(k,y), \text{ where } x, y, k \in Z
What is Edit Distance?
Definition from Wikipedia:
Edit distance is a way of quantifying how dissimilar two strings (e.g., words) are to one another by counting the minimum number of operations required to transform one string into the other. Edit distances find applications in natural language processing, where automatic spelling correction can determine candidate corrections for a misspelled word by selecting words from a dictionary that have a low distance to the word in question.
Simply, we can say that:
The closeness of a match is measured in terms of the number of primitive operations necessary to convert the string into an exact match. This number is called the edit distance between the string and the pattern.
The primitive operations performed are insertion, deletion and substitution.
Example:
1. Mad --> Made (insertion performed)
2. Booobs --> Boobs (deletion performed)
3. Lick --> Dick (substitution performed)
Now we will move to discuss the algorithms.
Damerau-Levenshtein Distance
The Damerau-Levenshtein Distance is a distance (string metric) between two strings, say String A and String B, which gives the minimum number of edit operations needed to transform String A into String B. Damerau's original algorithm was designed for spell correction with at most one error.
There are four edit operations that can be performed under this metric: 1. insertion, 2. deletion, 3. substitution and 4. transposition. We have already seen the first three operations, so let me explain the last one, transposition.
Transposition is the swapping of two adjacent characters to bring them into a certain form.
Example:
String A := obbo
String B := boob
When the above two strings are fed to a machine that computes the Damerau-Levenshtein Distance, it finds that the edit distance is two.
So how does it operate? We have two transpositions here:
1.) obbo -> bobo
2.) bobo -> boob
Hence we get the final result, boob, and the number of edits performed is 2.
Please note that the Damerau-Levenshtein Distance doesn't satisfy the triangle inequality. You can see that in the image below:
"All Theory and No Codes Makes Psycho a Dull guy"I am giving the working function for now. You can see the Psuedocode on Wiki, its is well written.
Here's the code in Python. I would encourage you to write it by yourself too.
def damerau_levenshtein(s1, s2):
    """
    A function to calculate the Damerau-Levenshtein Distance.
    :param s1: parent string
    :param s2: pattern string
    :rtype: int -- the Damerau-Levenshtein distance
    """
    d = {}
    len_s1 = len(s1)
    len_s2 = len(s2)
    for i in range(-1, len_s1 + 1):
        d[i, -1] = i + 1
    for j in range(-1, len_s2 + 1):
        d[-1, j] = j + 1
    for i in range(len_s1):
        for j in range(len_s2):
            cost = 0 if s1[i] == s2[j] else 1
            d[i, j] = min(
                d[i - 1, j] + 1,         # deletion
                d[i, j - 1] + 1,         # insertion
                d[i - 1, j - 1] + cost,  # substitution
            )
            if i and j and s1[i] == s2[j - 1] and s1[i - 1] == s2[j]:
                d[i, j] = min(d[i, j], d[i - 2, j - 2] + cost)  # transposition
    return d[len_s1 - 1, len_s2 - 1]
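A quick sanity check, as a compact, self-contained version of the same restricted ("optimal string alignment") metric (the short name `osa` is mine, for this snippet only), confirms both the obbo/boob example above and the triangle-inequality violation mentioned above:

```python
def osa(s1, s2):
    # Compact restricted (optimal string alignment) Damerau-Levenshtein.
    d = {}
    for i in range(-1, len(s1) + 1):
        d[i, -1] = i + 1
    for j in range(-1, len(s2) + 1):
        d[-1, j] = j + 1
    for i in range(len(s1)):
        for j in range(len(s2)):
            cost = int(s1[i] != s2[j])
            d[i, j] = min(d[i - 1, j] + 1,         # deletion
                          d[i, j - 1] + 1,         # insertion
                          d[i - 1, j - 1] + cost)  # substitution
            if i and j and s1[i] == s2[j - 1] and s1[i - 1] == s2[j]:
                d[i, j] = min(d[i, j], d[i - 2, j - 2] + cost)  # transposition
    return d[len(s1) - 1, len(s2) - 1]

print(osa("obbo", "boob"))  # → 2, the two transpositions shown above

# A classic triple demonstrating the triangle-inequality violation:
print(osa("CA", "ABC"))                    # → 3 ...
print(osa("CA", "AC") + osa("AC", "ABC"))  # → 2, so p(x,y) > p(x,k) + p(k,y)
```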
You can get the complete source code here:
https://github.com/PsychoCoderHC/TextMining-Algorithms/blob/master/Python-Version/Fuzzy-String-Matching/Edit-Distance/Damerau-Lavenshtein.py
To see the code in execution:
http://code.hackerearth.com/0356acE
You can visualize how the steps of the algorithm work here:
Click Me
Levenshtein Distance
The Levenshtein distance is also a string metric that calculates the edit distance between two strings. It has found a lot of applications, and most real-world systems, including various NLP toolkits such as LingPipe, NLTK and GATE, use an efficient, modified form of the Levenshtein distance. They use different data structures that reduce the time complexity.
The only difference between the Levenshtein and Damerau-Levenshtein distances is that this algorithm doesn't support transposition: we can perform only three operations, namely insertion, deletion and substitution.
Mathematically, it can be written as:
LaTeX code for the above equation:
\operatorname{lev}_{s,t}(i,j) = \begin{cases}
\max(i,j) & \text{if } \min(i,j) = 0, \\
\min \begin{cases}
\operatorname{lev}_{s,t}(i-1,j) + 1 \\
\operatorname{lev}_{s,t}(i,j-1) + 1 \\
\operatorname{lev}_{s,t}(i-1,j-1) + 1_{(s_i \neq t_j)}
\end{cases} & \text{otherwise.}
\end{cases}
Another significant difference between the Levenshtein Distance and the Damerau-Levenshtein Distance is that the former satisfies the triangle inequality whereas the latter does not. This is visible in the image given below, where the numbers represent the edit distance between two nodes:
Okay, let's do some programming now. Following the above mathematical relation, we can formulate a recursive solution:
def recursive_levenshtein(s1, len_s1, s2, len_s2):
    """
    A function to calculate the Levenshtein Distance using recursion.
    :param s1: parent string
    :param len_s1: length of s1
    :param s2: pattern string
    :param len_s2: length of s2
    :rtype: int -- the Levenshtein distance
    """
    if len_s1 == 0:
        return len_s2
    if len_s2 == 0:
        return len_s1
    return min(
        recursive_levenshtein(s1, len_s1 - 1, s2, len_s2) + 1,  # deletion
        recursive_levenshtein(s1, len_s1, s2, len_s2 - 1) + 1,  # insertion
        recursive_levenshtein(s1, len_s1 - 1, s2, len_s2 - 1)
        + (s1[len_s1 - 1] != s2[len_s2 - 1]),                   # substitution
    )
Full source:
https://github.com/PsychoCoderHC/TextMining-Algorithms/blob/master/Python-Version/Fuzzy-String-Matching/Edit-Distance/Recursive-Levenshtein.py
See execution:
http://code.hackerearth.com/4ac2f4u
(I haven't given the visualization link here, but you can do it yourself.)
I hope you understood the process. But this recursive technique is very inefficient for larger strings: its running time grows exponentially with the string lengths, because the same subproblems are recomputed over and over, and the recursion also consumes a lot of stack space. To improve the performance we use the Wagner-Fischer algorithm.
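To see this blow-up concretely, here is a small instrumented sketch of the same recursion (the extra `counter` argument and the name `lev_counted` are my own additions for this experiment) that counts how many calls are made for identical strings of growing length:

```python
def lev_counted(s1, len_s1, s2, len_s2, counter):
    counter[0] += 1  # count every invocation
    if len_s1 == 0:
        return len_s2
    if len_s2 == 0:
        return len_s1
    return min(
        lev_counted(s1, len_s1 - 1, s2, len_s2, counter) + 1,  # deletion
        lev_counted(s1, len_s1, s2, len_s2 - 1, counter) + 1,  # insertion
        lev_counted(s1, len_s1 - 1, s2, len_s2 - 1, counter)
        + (s1[len_s1 - 1] != s2[len_s2 - 1]),                  # substitution
    )

for s in ("ab", "abc", "abcd", "abcde"):
    calls = [0]
    lev_counted(s, len(s), s, len(s), calls)
    # The call count grows roughly fivefold for every extra character.
    print(len(s), calls[0])
```

Even though the two strings are identical, the naive recursion still explores every branch, which is exactly why memoization or a bottom-up table helps so much.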
Okay, I will give you a simple task. Your task is to print the depth of the recursion and show the full stack trace. You can use any programming language you want.
Wagner-Fischer Algorithm
The Wagner-Fischer algorithm presents a dynamic-programming approach that solves the Levenshtein distance problem in quadratic time, O(m * n), where m and n are the lengths of the two input strings.
Let's get to the code already:
def wagner_fischer(s1, s2):
    """
    A function to calculate the Levenshtein Distance using the
    Wagner-Fischer algorithm.
    :param s1: parent string
    :param s2: pattern string
    :rtype: int -- the Levenshtein distance
    """
    d = {}
    # Swap so that s1 is the shorter string. (Note: this dict-based version
    # still stores the whole matrix, so the swap alone does not reduce the
    # space used; keeping only the current and previous rows would bring
    # space down to O(min(n, m)).)
    if len(s1) > len(s2):
        s1, s2 = s2, s1
    len_s1, len_s2 = len(s1), len(s2)
    for i in range(-1, len_s1 + 1):
        d[(i, -1)] = i + 1
    for j in range(-1, len_s2 + 1):
        d[(-1, j)] = j + 1
    for i in range(len_s1):
        for j in range(len_s2):
            if s1[i] == s2[j]:
                d[(i, j)] = d[(i - 1, j - 1)]
            else:
                d[(i, j)] = min(
                    d[(i - 1, j)] + 1,      # deletion
                    d[(i, j - 1)] + 1,      # insertion
                    d[(i - 1, j - 1)] + 1,  # substitution
                )
    return d[(len_s1 - 1, len_s2 - 1)]
Full source:
https://github.com/PsychoCoderHC/TextMining-Algorithms/blob/master/Python-Version/Fuzzy-String-Matching/Edit-Distance/Wagner-Fisher.py
See execution:
http://code.hackerearth.com/75dc9aJ
(I haven't given the visualization link here, but you can do it yourself.)
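As a quick check, here is a compact, self-contained version of the algorithm (compressed from the function above; the always-take-the-minimum formulation with a `cost` term is equivalent to the match shortcut) run on the worked example that follows:

```python
def wagner_fischer(s1, s2):
    # Compact dict-based Wagner-Fischer edit distance.
    d = {}
    for i in range(-1, len(s1) + 1):
        d[i, -1] = i + 1
    for j in range(-1, len(s2) + 1):
        d[-1, j] = j + 1
    for i in range(len(s1)):
        for j in range(len(s2)):
            cost = int(s1[i] != s2[j])
            d[i, j] = min(d[i - 1, j] + 1,         # deletion
                          d[i, j - 1] + 1,         # insertion
                          d[i - 1, j - 1] + cost)  # substitution
    return d[len(s1) - 1, len(s2) - 1]

print(wagner_fischer("LAD", "LLUDO"))       # → 3, the bottom-right matrix cell
print(wagner_fischer("kitten", "sitting"))  # → 3
```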
Let's see how the lookup matrix is generated for the strings "LAD" and "LLUDO".
Here's the final matrix; just follow the algorithm step by step and you should get the same one. If you don't, you have made some mistake:
+------+---+---+---+---+
| S1 → |   | L | A | D |
+------+---+---+---+---+
| S2 ↓ | 0 | 1 | 2 | 3 |
+------+---+---+---+---+
| L    | 1 | 0 | 1 | 2 |
+------+---+---+---+---+
| L    | 2 | 1 | 1 | 2 |
+------+---+---+---+---+
| U    | 3 | 2 | 2 | 2 |
+------+---+---+---+---+
| D    | 4 | 3 | 3 | 2 |
+------+---+---+---+---+
| O    | 5 | 4 | 4 | 3 |
+------+---+---+---+---+
The bottom-right element gives the Levenshtein edit distance.
This process is much faster than the recursive one.
Now my question to you is: can we improve it further? Well, I leave this to you. Find a method and write code to compute the edit distance that performs better than the present algorithm.
Using NLTK
If you're using NLTK then you can calculate it using:
import nltk
nltk.metrics.edit_distance('BoBos','Boobs')
Conclusion
So folks, we have come to the end of our first encounter with text processing. If you made it to the end, you have probably understood a nice set of algorithms. I will be posting more articles on these topics as soon as I have time. If you have doubts then please ask, but before that, go through the tutorial properly, execute the code yourself and practice.
References
http://en.wikipedia.org/wiki/Levenshtein_distance
http://en.wikipedia.org/wiki/Wagner-Fischer_algorithm
http://en.wikipedia.org/wiki/Fuzzy_string_searching
http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance
https://www8.cs.umu.se/kurser/TDBA77/VT06/algorithms/BOOK/BOOK2/NODE46.HTM
Online Tools Used
Online LaTeX: http://www.sciweavers.org/free-online-latex-equation-editor
Draw diagrams: https://www.draw.io/
Text table generator: http://www.tablesgenerator.com/text_tables
Credits to @XxGreenLanternxX for the tutorial banner image.
Published on my blog:
http://blog-psychocoder.blogspot.in/2014/05/text-processing-1-fuzzy-string-matching.html
Thank you,
Sincerely,
Psycho_Coder.