Don’t Get Scraped: Putting An End To Web Scraping, Content Theft

This is a guest blog post written and contributed by Rami Essaid, founder and CEO of Distil, a content protection network that helps websites prevent malicious web scraping and stop content theft. Distil is a Rackspace Cloud Tools partner.

Calling All Web Scrapers! Get Rich Quick By Stealing Content!

Did you know you could make a lot of money with web scraping? It is very easy to do. All you have to do is leech off of other people’s websites, duplicate their content and steal their web visitors. But wait, isn’t that wrong? Surely no one really does that, right? In fact, for most legitimate businesses, web scraping is a real and persistent threat to their site traffic, revenue, brand and network resources.

Web scraping and content theft are nothing new. They have been around for many years in the form of copyright infringement – duplication of the written word, music, images, etc. However, in the age of the Internet, stealing and duplicating a website’s content has become even easier and more lucrative. The worst part of content theft is that most websites just tolerate it; they assume it is part of doing business on the Internet and think nothing can be done to stop it.

You’re A Witness

You probably witness the effects of web scraping and content theft every day. When was the last time you searched Google for information, landed on a site that appeared to be a duplicate and found yourself staring at an article buried in a sea of ads? It might have been what you were looking for, but it was unclear who owned the site or who wrote the content. In most cases, you came across a scraped version of someone else’s content that had been copied and published elsewhere. This practice is commonly referred to as web scraping.

Definition: Web scraping (also called web harvesting or web data extraction) is a computer software technique of extracting information from websites.

Example: Original Vs. Scraped

Here is an example of an original article versus a scraped article:

[Images: Original Article; Scraped Article]

Unfortunately, malicious web scraping is far too common and costs web publishers nearly $1 billion in losses and damages each year.

Industries Affected By Web Scraping:

  • Airline and travel industry
  • Digital publishing
  • Directories and classifieds
  • Ecommerce
  • Social media and forums

“So What?” – The Real Impact Of Web Scraping

Surprisingly, most businesses are aware of web scraping but very few realize the full impact it has on their web traffic, SEO, brand, revenue, total network costs and, ultimately, their business. Here are just a few of the real impacts of malicious web scraping:

  • Loss of sales and ancillary revenue
  • Decreased traffic and visitor engagement
  • Legal fees to handle duplicated content and copyright infringement
  • Loss of readership and subscriber base
  • Decreased advertising revenue
  • Lower SEO rankings
  • Deflated brand awareness
  • Increased network and bandwidth costs
  • Poor user experience

How To Track And Prevent Web Scraping

Even businesses that want to track and identify web scrapers usually lack the tools to do so. Web scrapers can be highly sophisticated and very rarely show up in traditional analytics and tracking tools. In some cases, web scrapers appear to be legitimate traffic engaging with your site.
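To make the tracking problem concrete, here is a minimal sketch of one simple first-pass check: counting requests per client IP in a sliding time window, using data you would pull from your own server logs. The function name `flag_heavy_clients` and the thresholds are hypothetical, chosen only for illustration. This is not how Distil or any particular product works, and a sophisticated scraper that rotates IP addresses or throttles itself would slip past it.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


def flag_heavy_clients(requests, max_requests=120, window=timedelta(minutes=1)):
    """Return the set of client IPs that exceed max_requests within any window.

    requests: iterable of (ip, timestamp) pairs, in chronological order.
    max_requests and window are illustrative defaults, not recommendations.
    """
    recent = defaultdict(deque)  # ip -> timestamps still inside the window
    flagged = set()
    for ip, ts in requests:
        seen = recent[ip]
        seen.append(ts)
        # Drop timestamps that have aged out of the sliding window.
        while ts - seen[0] >= window:
            seen.popleft()
        if len(seen) > max_requests:
            flagged.add(ip)
    return flagged


if __name__ == "__main__":
    start = datetime(2013, 7, 1, 12, 0, 0)
    # Hypothetical traffic: one client hammering the site, one browsing slowly.
    burst = [("203.0.113.5", start + timedelta(seconds=i)) for i in range(200)]
    slow = [("198.51.100.7", start + timedelta(seconds=45 * i)) for i in range(5)]
    print(flag_heavy_clients(sorted(burst + slow, key=lambda r: r[1])))
```

A check like this catches only the clumsiest scrapers, which is part of why they so rarely stand out in ordinary analytics.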

Reactive:
Sometimes it is too late to stop web scrapers from taking your content. In this case, US law offers a remedy: the Digital Millennium Copyright Act (DMCA). You need to search the Internet continually for duplications of your content, file a DMCA notice, potentially hire a lawyer and wait for the duplicate content to be removed. This can take months, and by the time the duplicate content is removed, another site has often duplicated your content and the process begins again.
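If you do go the reactive route, the "search for duplications" step can be partly automated. Below is a minimal sketch, assuming you already have the text of your own article and the text of a suspect page. It compares the two by the overlap of their word sequences ("shingles") using Jaccard similarity. The function names and the shingle size are hypothetical choices for illustration, and a score is only a hint that a page deserves a closer look, not proof of infringement.

```python
import re


def shingles(text, size=5):
    """Split text into overlapping runs of `size` lowercase words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard_similarity(text_a, text_b, size=5):
    """Share of shingles the two texts have in common, from 0.0 to 1.0."""
    a, b = shingles(text_a, size), shingles(text_b, size)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    original = (
        "Web scraping and content theft are nothing new. They have been "
        "around for many years in the form of copyright infringement."
    )
    # Hypothetical scraped copy: the same text with ad copy bolted on.
    suspect = "SPONSORED: click here! " + original
    print(round(jaccard_similarity(original, suspect), 2))
```

Word shingles tolerate added ads and boilerplate around a copied article better than an exact string match would, which is why they suit this kind of check.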

Proactive:
There are hardware and software solutions that can partially address the problem. However, we spoke with companies around the world about web scraping and knew there had to be a better way. What if you could stop web scrapers before they ever accessed your content? This is where Distil stepped in. A few years back, after trying to help a company find a solution to its web scraping problems, we realized there were no viable solutions that were easy to set up and inexpensive to use. So we brought together a team of engineers and created the very first content protection network to help websites identify and block malicious web scraping and content theft in real time.

Mini Case Study

The Background – We recently had a customer move onto our platform because it was seeing its content duplicated across the Internet. This particular company generated more than 100 new articles or posts on its site each day. Given the opportunity, other websites were simply duplicating this content and stealing legitimate traffic.

The Results – After moving onto our platform, the web scraping nearly vanished, and the company’s legitimate traffic increased for the first time in approximately three years. Meanwhile, its server and bandwidth expenses dropped noticeably. In some cases, we were even able to identify who was trying to scrape its content. It turned out there were legitimate businesses that were trying to access the content and were willing to pay for that access via our client’s API.

Summary – The company was able to protect its content, reduce business costs and open new revenue streams that were previously inaccessible.

End Of The Line

So yes, it’s true: if you’re willing to cheat, steal and duplicate other people’s content, you can make a lot of money from web scraping. But we’re convinced that once businesses realize there is something they can do to prevent web scraping, this lucrative line of malicious content theft will come to a very quick end.

About the Author

This is a post written and contributed by Bob Bardwell.

Bob Bardwell is a Racker who works in Rackspace Corporate Development; his background includes financial statement and single audits. He enjoys golf, geopolitics, and networking.


4 Comments

The chart suggests that normal requests increased by the same amount as the decrease of scraper requests same time — to what is that increase attributed? One would expect the amount of total requests to drop in Sep-11 in a similar fashion.

Michael on July 29, 2013

I thought the same thing for a second. No, the chart starts by already segmenting legit traffic from bot traffic. Notice the blue line is always the legit line. There is no total traffic line. So this outlines that legit traffic was unaffected, and bot traffic dropped off completely.

JustinCrossman on November 11, 2013

Ironic to see this article published here, as most scraping suspects we’ve battled recently launch their efforts from Rackspace IP addresses.

Mike on August 19, 2013

I had read somewhere about just doing and intro ( 200-300 character ) intro followed up by the actual link to prevent robot scraping, seem to have lost the article link as I forgot to save it is this possible and how does one do it? I saw one of my youtube videos and the write up was scraped and then reposted on another site with monetised hyperlinks and it was totally word for word not even respun or new words added

Eric Schram on October 16, 2013

©2014 Rackspace, US Inc.