ONLINE DATING POINT

Duplicate Content
I’ve read what seems like hundreds of forum posts discussing duplicate content. None of them gave the full picture, and they left me with more questions than answers, so I decided to spend some time researching what actually goes on behind the scenes. Here is what I have discovered.

Most people assume that duplicate content is judged at the page level, when in fact it is far more complex than that. Simply saying that “by changing 25 percent of the text on a page it is no longer duplicate content” is not an accurate statement. Let’s examine why. To gain some understanding we need to take a look at the k-shingle algorithm, which may or may not be in use by the major search engines (my money is that it is in use). I’ve seen the following used as an example, so let’s use it here as well. Suppose that you have a page that contains the following text:

The swift brown fox jumped over the lazy dog.

Before we get to this point the search engine has already stripped all tags and HTML from the page, leaving just this plain text behind for us to look at. The shingling algorithm essentially breaks a body of text into overlapping word groups in order to determine how unique the text is. The first thing the search engine does is strip out all stop words such as “and”, “the”, “of” and “to”. It also strips out filler words, leaving only the action words, which are considered the core of the content. Once this is done, the following “shingles” are created from the above text, here in groups of four words. (I'm going to include the stop words for simplicity.)

The swift brown fox
swift brown fox jumped
brown fox jumped over
fox jumped over the
jumped over the lazy
over the lazy dog
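The shingles above can be reproduced with a few lines of Python. This is only a minimal sketch of the sliding-window idea, not what any search engine actually runs: the function name `shingles`, the window size `k=4`, and the lowercasing and punctuation handling are all my own assumptions, and stop words are kept in, as in the example.

```python
import re

def shingles(text, k=4):
    """Return the k-word shingles of text, in the order they appear."""
    # Assumption: lowercase everything and treat runs of letters, digits,
    # and apostrophes as words, so punctuation is ignored.
    words = re.findall(r"[a-z0-9']+", text.lower())
    # Slide a window of k words across the text, one word at a time.
    return [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]

for shingle in shingles("The swift brown fox jumped over the lazy dog."):
    print(shingle)
```

A text shorter than `k` words simply produces no shingles at all.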

These shingles act like unique fingerprints that identify this block of text. The search engine can now compare these “fingerprints” to other pages in an attempt to find duplicate content. As duplicates are found, a “duplicate content” score is assigned to the page. If too many “fingerprints” match other documents, the score becomes high enough that the search engine flags the page as duplicate content, sending it to supplemental hell or, worse, deleting it from the index completely.
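To make the scoring idea concrete, here is a rough sketch under heavy assumptions. Hashing each shingle with SHA-1, defining the score as the fraction of a page's fingerprints found on other pages, and the `THRESHOLD` value are all hypothetical choices of mine. How the search engines really score pages, and where they draw the line, is not public.

```python
import hashlib
import re

def fingerprints(text, k=4):
    """Hash every k-word shingle of text into a short hex fingerprint."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    grams = (" ".join(words[i:i + k]) for i in range(len(words) - k + 1))
    # Truncated SHA-1 is an arbitrary choice; any stable hash would do here.
    return {hashlib.sha1(g.encode("utf-8")).hexdigest()[:16] for g in grams}

def duplicate_score(page, known_pages, k=4):
    """Fraction of the page's fingerprints that also occur on a known page."""
    own = fingerprints(page, k)
    if not own:
        return 0.0
    seen = set()
    for other in known_pages:
        seen |= fingerprints(other, k)
    return len(own & seen) / len(own)

THRESHOLD = 0.5  # hypothetical cutoff; the real value is unknown

page = "The swift brown fox jumped over the lazy dog."
index = ["A swift brown fox jumped over the lazy dog yesterday."]
score = duplicate_score(page, index)
print(score, "flag as duplicate" if score >= THRESHOLD else "treat as unique")
```

In this toy index, five of the page's six shingles also appear on the other page, so changing a word at each end of the sentence still leaves most of the fingerprints matching.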

Now suppose a second page contains the following text:

My old lady swears that she saw the lazy dog jump over the swift brown fox.

The above gives us the following shingles:

my old lady swears
old lady swears that
lady swears that she
swears that she saw
that she saw the
she saw the lazy
saw the lazy dog
the lazy dog jump
lazy dog jump over
dog jump over the
jump over the swift
over the swift brown
the swift brown fox

Comparing these two sets of shingles, we can see that only one matches (“the swift brown fox”). Thus it is unlikely that these two documents are duplicates of one another. No one but Google knows what the percentage match must be for two documents to be considered duplicates, but some thorough testing would sure narrow it down ;).

So what can we take away from the above examples? First and foremost, we quickly begin to realize that detecting duplicate content is far more involved than saying “document A and document B are 50 percent similar”. Second, we can see that people adding “stop words” and “filler words” to avoid duplicate content are largely wasting their time. It’s the “action” words that should be the focus. Changing action words without altering the meaning of a body of text may very well be enough to get past these algorithms. Then again, there may be other mechanisms at work that we can’t yet see, rendering that impossible as well. I suggest experimenting and finding what works for you in your situation.
The last paragraph above is the really important part when generating content. You can't simply add generic stop words here and there and expect to fool anyone. Remember, we're dealing with a computer algorithm here, not some supernatural power. Everything you do should be from the standpoint of a scientist. Think through every decision using logic and reasoning. There is no magic involved in SEO, just raw data and numbers. Always split test and perform controlled experiments.
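As a starting point for that kind of experiment, here is a small sketch that compares the two example sentences and then checks what padding a text with stop words does. The Jaccard-style `resemblance` measure and the tiny `STOP_WORDS` list are my own assumptions for illustration; nobody outside the search engines knows which measure or word list they actually use.

```python
import re

STOP_WORDS = {"a", "and", "of", "that", "the", "to"}  # illustrative list only

def shingle_set(text, k=4, drop_stop_words=False):
    """Return the set of k-word shingles, optionally stripping stop words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if drop_stop_words:
        words = [w for w in words if w not in STOP_WORDS]
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def resemblance(a, b):
    """Shared shingles divided by all distinct shingles (Jaccard measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

doc_a = "The swift brown fox jumped over the lazy dog."
doc_b = ("My old lady swears that she saw the lazy dog "
         "jump over the swift brown fox.")

a, b = shingle_set(doc_a), shingle_set(doc_b)
print(sorted(a & b))      # the shingles both documents share
print(resemblance(a, b))  # one shared shingle out of 18 distinct ones

# Pad doc_a with extra stop words, then strip stop words from both versions.
padded = "The swift and the brown fox jumped over to the lazy dog."
same = (shingle_set(doc_a, drop_stop_words=True)
        == shingle_set(padded, drop_stop_words=True))
print(same)
```

Once stop words are stripped, the padded sentence produces exactly the same shingles as the original, which is why sprinkling in filler gets you nowhere if the algorithm works this way.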
