Don’t Get Scraped: Putting An End To Web Scraping, Content Theft

Filed in Partner & Customer Updates by Bob Bardwell | June 14, 2012 10:45 am

This is a guest blog post written and contributed by Rami Essaid, founder and CEO of Distil[1], a content protection network that helps websites prevent malicious web scraping and stop content theft. Distil is a Rackspace Cloud Tools[2] partner.

Calling All Web Scrapers! Get Rich Quick By Stealing Content!

Did you know you could make a lot of money with web scraping? It is very easy to do[3]. All you have to do is leech off of other people’s websites, duplicate their content and steal their web visitors. But wait, isn’t that wrong? Surely no one really does that, right? In fact, for most legitimate businesses, web scraping is a real and persistent threat to their site traffic, revenue, brand and network resources.

Web scraping and content theft are nothing new. They have been around for many years in the form of copyright infringement – duplication of the written word, music, images, etc. However, in the age of the Internet, stealing and duplicating a website’s content has become even easier and more lucrative. The worst part of content theft is that most websites just tolerate it; they assume it is part of doing business on the Internet and think nothing can be done to stop it.

You’re A Witness

You probably witness the effects of web scraping and content theft every day. When was the last time you searched Google for information, came across a site that appeared to be a duplicate and found yourself staring at an article or text that was buried in a sea of ads? It might have been what you were looking for, but it was unclear who owned the site or who wrote the content. In most cases, you came across a scraped version of someone else’s content that had been copied and published elsewhere. This practice is commonly referred to as web scraping.

Definition: Web scraping (also called web harvesting or web data extraction) is a computer software technique of extracting information from websites.

Example: Original Vs. Scraped

Here is an example of an original article versus a scraped article:

[Image: Original Article]

[Image: Scraped Article]

Unfortunately, malicious web scraping is far too common and costs web publishers nearly $1 billion in losses and damages each year.

Industries Affected By Web Scraping:

“So What?” – The Real Impact Of Web Scraping

Surprisingly, most businesses are aware of web scraping but very few realize the full impact it has on their web traffic, SEO, brand, revenue, total network costs and, ultimately, their business. Here are just a few of the real impacts of malicious web scraping:

How To Track And Prevent Web Scraping

Most businesses don’t have tools to track and identify web scrapers, even if they wanted to. Web scrapers can be highly sophisticated and very rarely show up in traditional analytics and tracking tools. In some cases, web scrapers appear to be legitimate traffic engaging with your site.
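To make the tracking problem concrete, here is a minimal sketch in Python of one naive heuristic: flagging clients that send more requests than a chosen limit inside a sliding time window. This is an illustration only, not how Distil works. The function name, the (timestamp, client) input shape and the thresholds are all assumptions, and a sophisticated scraper that spreads its requests across many addresses would slip past it, which is exactly the limitation described above.

```python
from collections import defaultdict, deque


def flag_high_rate_clients(requests, window_seconds=60, max_requests=120):
    """Return the set of client IDs that exceed max_requests in any window.

    `requests` is an iterable of (timestamp_seconds, client_id) pairs,
    assumed to be sorted by timestamp (for example, parsed from an
    access log). All names and thresholds here are hypothetical.
    """
    recent = defaultdict(deque)  # client_id -> timestamps still in the window
    flagged = set()
    for timestamp, client_id in requests:
        window = recent[client_id]
        window.append(timestamp)
        # Discard requests that have fallen out of the sliding window.
        while timestamp - window[0] >= window_seconds:
            window.popleft()
        if len(window) > max_requests:
            flagged.add(client_id)
    return flagged


if __name__ == "__main__":
    # Hypothetical traffic: client "a" bursts, client "b" browses slowly.
    sample = [(0, "a"), (0, "b"), (1, "a"), (2, "a"), (3, "a"), (4, "a"), (100, "b")]
    print(flag_high_rate_clients(sample, window_seconds=10, max_requests=3))
```

A fixed threshold like this is easy to reason about but also easy to evade, so treat it as a starting point for investigation rather than a blocking rule.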

Reactive:
Sometimes it is too late to stop web scrapers from taking your content. For these cases, US law provides the Digital Millennium Copyright Act (DMCA). You need to search the Internet continually for duplicates of your content, file a DMCA notice, potentially hire a lawyer and wait for the duplicate content to be removed. This can take months, and by the time the duplicate content is removed, another site has duplicated your content and the process begins again.
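The "search for duplications" step can be partly automated once you have the text of a suspect page. The sketch below, in Python, compares two texts using overlapping word sequences (shingles) and Jaccard similarity, a common way to spot near-duplicate text. The function names and the shingle size are assumptions chosen for illustration; fetching the suspect page and deciding what score justifies a DMCA notice are left out.

```python
import re


def shingles(text, size=5):
    """Return the set of overlapping word sequences of length `size`."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        # Short texts produce a single shingle, or none if empty.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard_similarity(text_a, text_b, size=5):
    """Return shared shingles divided by all shingles, between 0.0 and 1.0."""
    set_a, set_b = shingles(text_a, size), shingles(text_b, size)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


if __name__ == "__main__":
    original = "The quick brown fox jumps over the lazy dog."
    suspect = "the QUICK brown fox, jumps over the lazy dog!"
    print(jaccard_similarity(original, suspect, size=3))
```

Because the comparison ignores case and punctuation, a copy that only changes formatting still scores as a match, while a score near zero suggests the pages share little text.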

Proactive:
There are hardware and software solutions that can partially address the problem. However, we spoke with companies around the world about web scraping and knew there had to be a better way. What if you could stop web scrapers before they ever accessed your content? This is where Distil[4] stepped in. A few years back, after trying to help a company find a solution to its web scraping problems, we realized there were no viable solutions that were easy to set up and inexpensive to use. So we brought together a team of engineers and created the very first content protection network to help websites identify and block malicious web scraping and content theft in real time.

Mini Case Study

The Background – We recently had a customer move onto our platform because it was seeing its content duplicated across the Internet. This particular company generated more than 100 new articles or posts on its site each day. Given the opportunity, other websites were simply duplicating this content and stealing legitimate traffic.

The Results – After moving onto our platform, the web scraping nearly vanished, and the company’s legitimate traffic increased for the first time in approximately three years. Meanwhile, its server and bandwidth expenses dropped noticeably. In some cases, we were even able to identify who was trying to scrape its content. It turned out there were legitimate businesses that were trying to access the content and were willing to pay for that access via our client’s API.

Summary – The company was able to protect its content, reduce business costs and open new revenue streams that were previously inaccessible.

End Of The Line

So yes, it’s true: if you’re willing to cheat, steal and duplicate other people’s content, you can make a lot of money from web scraping. But we’re convinced that once businesses realize there is something they can do to prevent web scraping, this lucrative line of malicious content theft will come to a very quick end.

Endnotes:
  1. Distil: http://www.distil.it/
  2. Rackspace Cloud Tools: http://www.rackspace.com/cloud/tools/
  3. It is very easy to do: http://www.distil.it/video-why-its-easy-for-someone-else-to-scrape-your-content/
  4. Distil: http://www.distil.it

Source URL: http://www.rackspace.com/blog/dont-get-scraped-putting-an-end-to-web-scraping-content-theft/