When the data we have isn’t available in a useful format, it is easy to fall into patterns of brute force processing or to implement approaches that are not optimal for manipulating or managing data.
How data is used – such as in reports, or searches completed by users – is an ongoing, ever-changing problem. Organizational requirements and available data often change as well, so it is incumbent upon developers to create strategies to work with data even if it isn’t in the most usable state.
This post will examine several strategies for working with data, rather than fighting with it. However, it will not examine issues like big data, high availability, storage, and schema design.
Consider the following scenario
To help illustrate the challenges of working with data and how the approaches discussed in this article can help, imagine the following:
Your company is building a service for teachers, publishers, and journals that will take a document – whether from a student, a journalist, or another writer – and check against a database for the possibility of plagiarism. This service will be called “Master Plagiarism Checker”.
Hashing for fast data comparison
Hashing: Mapping arbitrarily sized data to a fixed size
To first implement the “Master Plagiarism Checker,” a mechanism for comparing the contents of documents is required to be able to check for plagiarism. The primary mechanism that will be used is to break all known documents into sentences and store those sentences in a database. When a document needs to be checked for plagiarism, it will be broken into sentences as well, and each sentence will have to be compared against the sentences in the database.
Based on some extensive license negotiations the founder of your company was able to get access to over 1 billion documents which will be used to check against for plagiarism! Assuming the average document is three pages long, and there are 50 sentences in each document, there would be approximately 150 billion sentences to check against. The average sentence is between 75 and 100 characters long (at 2 bytes per character for the most common databases), so that means approximately 30 trillion bytes (30,000 GB) of comparison! How would such a comparison be handled in a timely manner?
A hash function is a mechanism that takes some form of data input and generates a fixed size output. What this means is that any input size will have an output that is the same size in bytes. For non-cryptographic hashes, this value is usually an integer.
For example, this article might be hashed to a 4-byte integer value, while using the same hash function with Leo Tolstoy’s War and Peace (over 1,200 pages, and almost a half a million words!) would result in a 4-byte integer value as well.
A simple (and a bit naive) comparison of the two hash values would tell whether this article is the same as War and Peace, and since the hash values are different, we can safely say the two documents are different. If the values are the same and the hash function is known to be a good one, it is likely the two documents are the same, but the risk of collisions of hash values (when two different sets of data have the same hash value) only allows a certain level of confidence.
Hashing is an invaluable tool for such problems:
1. Using a pair of hashes instead of full sentence comparisons, an average document will have up to a factor of 7,200 reductions in the number of bytes to process!
2. This reduction to the number of bytes to process provides many secondary benefits:
- Less memory pressure on the servers
- Smaller indices in the database than indices based on text
- Faster comparisons due to less work (in terms of bytes to process), but also in secondary aspects like less memory pressure causing less paging and cache misses with the file system
3. As long as the hash algorithm is properly chosen, it is possible to calculate the size of hash (or how many seeded hash values are needed) to avoid collisions: ensure that 2number of bits used for hash values ≥ numberpossible values. For example with two 32-bit integers for hash value, there would be coverage for 9 quintillion values for a perfect hash.
4. The costs for calculating the hashes of the existing data is paid for in advance (or when new material is added to the database) and has a minimal cost per piece of material added. (For example, an average document added to the Master Plagiarism Checker database would only require 300 KB of space.)
Continuing with the plagiarism example, the checking for exact matching sentences can be modified from checking for strings to checking against hashes.
The database table for each sentence would be changed having two additional fields associated with that sentence that might be called `Hash` and `HashSeeded.` When a document is being checked for plagiarism, it would be broken into its composite sentences, and each sentence would have the two hashes created and associated with it. Then a bulk check against the database of document strings, with only needing to compare hash values, will quickly reveal if any sentences have been lifted from another document. MasterPlagiarismChecker v1.0 is out the door!
N-grams for Fast Partial Data Comparison
N-Grams: A contiguous sequence of n-items (letters, syllables, words) of text.
Now consider feedback that might come in for version 1.0 of the plagiarism checker. Some intelligent users are swapping the order of phrases in the longer sentences and are escaping detection. Also, some intelligent users are tweaking the verb tense, altering punctuation, or plurality of the sentences and escaping detection that way. This means that exact matches are no longer sufficient check and that partial matches have to be added.
An effective mechanism for doing partial matches is using N-grams. The `N` specifies how large the groupings will be, so a trigram would be where `N` = 3, and a quadgram would be where `N` = 4. The grouping can also be arbitrary. How an object might be broken up is dependent upon what scenarios are being explored. If collocations of words were being checked, then a trigram would entail chunks of three words. For example, the sentence “The patient will recover soon, with my help.” would be broken into trigrams of “the patient will”, “patient will recover”, “will recover soon”, “recover soon with”, “soon with my”, “with my help”. Note that the letters were all lowercase and punctuation was ignored, this makes the checking process easier. It should be noted that in some cases, it may be desirable to handle the beginning, and ending of the sentence with “.” representing no word, but this approach will not be used in the examples as it makes them less easy to understand.
When a candidate sentence is provided, is “With his help, the patient will recover soon.” It would have been impossible to match that string exactly, but it is clearly a candidate for having been plagiarized. Its composite chunks would be “with his help”, “his help the”, “help the patient”, “the patient will”, “patient will recover”, and “will recover soon”. The trigrams that match the previous sample are bolded. Three out of six trigrams match, so the match might be scored as a 50% match. A similar low match score might have occurred with putting an “s” at the end of “patient” so that it would be “The patients will recover soon, with my help.”, or “The patients will soon recover, with my help.”
So it would appear that using chunks of words is not sufficient for the plagiarism checker use case. If the chunks are with groups of three letters, then the granularity of the sentence is greater, and instead of six trigrams, the sample sentence would be broken up into 39 trigrams, such as “the”, “he “, etc… The original candidate sentence would also be broken up into numerous trigrams, such as “wit”, “ith”, etc… There are 31 out 39 matches between the two sets of trigrams, or 79.5% match. This seems like a much more accurate score than 50% when only the order of phrases and one-word substitution was made. Sometimes using tricks like removing common words like “the”, often called stop words, will help give more accurate results, but this approach may not be helpful for all scenarios.
With these changes, MasterPlagiarismChecker v2.0 is out the door!
Grouping for Filtering
Filtering: Creating a hash sum, a filter compares the sum against previously defined sums in the program, allowing the code to include or exclude data as needed, automatically.
MasterPlagiarismChecker v2.0 is now a great success. Rather than users in the tens of thousands, over a million copies have been sold! But the bug reports are getting more press than the product. After careful analysis of the bug reports, over half of them relate to a single problem: papers that reference other papers are being cited for plagiarism! An urgent bug fix release is needed, and the VP of Marketing has said that even if the fix let some bad papers through, it must be fixed immediately.
The programmers determine they can quickly check all footnotes and endnotes for paper references. Then any paper references that are in the MasterPlagiarismChecker database will be checked for sentences that are essentially whitelisted.
As before, each sentence of the paper is compared with sentences from the other papers, and each potential match with another paper is collected into a list, along with the document that matched the sentence of the paper under consideration. The list of documents is compared against the documents in the paper’s references, with any explicit matches which are now thrown out. In fact, a side effect of this new check is that false plagiarism results for numerous articles are filtered out because many papers cite other papers, a sentence will show up in multiple documents and not be truly “owned” by a single document.
The VP of Marketing is happy but mentions that there is a secondary set of bug reports that mention spurious results because of one sentence matches, but not multiple. The VP’s worries are easily allayed because one of the developers quickly adds a filter to the group of documents matched requiring that at least three sentences have a match for plagiarism before the paper is flagged as plagiarized. MasterPlagiarismChecker v2.1 is now released, and the VP is now free to contemplate happier thoughts like a better name for the product.
Sometimes large quantities of data can be intimidating, but with the proper tools, the data can have a great impact on the way a business or organization works. Being able to quickly find exact matches, near matches, and groupings are simple but powerful tools that help make data become information.