When content is king: LinkedIn’s battle with scrapers
For many of today’s titans of technology, the real value of their companies lies in user-generated content and connections. Take LinkedIn, for example, a company that Microsoft agreed to acquire in June 2016 for $26.2 billion. With more than 450 million members in over 200 countries around the world, the company has built a business on the collective power of its members’ data, namely resumes and professional connections.
That information, and access to it, is the lifeblood of LinkedIn and the primary reason for Microsoft’s acquisition. Any attempt to compromise that treasure trove of data can have far-reaching effects.
With that in mind, let’s take a look at LinkedIn’s recent attempt to take on the “scrapers” seeking to steal this precious commodity.
LinkedIn v. Anonymous
In August 2016, LinkedIn filed a lawsuit against 100 unknown “scrapers.” LinkedIn alleges that since December 2015, unknown persons and/or entities using automated software programs (bots) have extracted and copied data from many LinkedIn pages, an activity known as scraping. According to the complaint, the defendants accessed this information by circumventing several technical barriers that LinkedIn uses to prevent mass automated scraping, and knowingly and intentionally violated access and use restrictions in LinkedIn’s User Agreement. In so doing, LinkedIn alleges, they violated federal and state laws.
The lawsuit is a preliminary step intended to have the scrapers’ identities revealed, through the court, from their IP addresses, so that LinkedIn can subsequently take action against them. The intent of this action is to preserve LinkedIn’s most valuable intellectual property: its members’ professional profiles and networks.
While we have yet to see the outcome of the case, this is not the first time scraping has been used as a means of stealing a company’s intellectual property.
A Brazilian analogy
Principals at our own company have experienced this kind of attack firsthand in the case of Catho v. Curriculum. This case dates back to the days when LinkedIn and Monster.com were just beginning. At that time, a number of Brazilian companies were competing for the attention of job seekers and employers. Among these companies was Curriculum, which is still the largest recruitment site in Brazil today. Catho Online was founded in 2000 as a direct competitor to Curriculum. Both sites hosted job seekers’ resumes as well as job listings.
In 2002, as competition between the two companies intensified, Curriculum noticed that Catho’s content was growing exponentially while, at the same time, Curriculum’s own website was experiencing large spikes in activity. In that era, the Curriculum website might see an average of 500 resume searches per day. During the period in question, however, a single user logged over 63,000 searches in just 24 hours.
A preliminary investigation revealed strong evidence of robot activity, and the IP address of the user was traced back to a computer at Catho Online.
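To make the scale of that anomaly concrete, here is a minimal sketch of the kind of volume check that would surface such a user. It is not the method Curriculum's investigators used, which the case record summarized here does not describe. The function name, the log format (one user ID and query per search) and the 10x multiplier are all assumptions made for illustration, as is reading the 500-per-day figure as a site-wide average.

```python
from collections import Counter

def flag_heavy_searchers(search_log, site_daily_average, multiplier=10):
    """Return {user_id: search_count} for users whose searches in one day
    exceed multiplier * site_daily_average.

    search_log is a hypothetical list of (user_id, query) tuples covering
    a single 24-hour period; the multiplier is an arbitrary illustration.
    """
    counts = Counter(user_id for user_id, _query in search_log)
    threshold = site_daily_average * multiplier
    return {user: n for user, n in counts.items() if n > threshold}

# Made-up one-day log: one ordinary user and one issuing 63,000 searches.
log = [("user_a", "engineer")] * 40 + [("user_b", "analyst")] * 63000
print(flag_heavy_searchers(log, site_daily_average=500))  # {'user_b': 63000}
```

A simple count like this only flags a candidate. Tying the activity to a specific party still required tracing the IP address, as the investigation did.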
Armed with this information, Curriculum sought help from law enforcement and a warrant was issued. Forty servers and 20 workstations were seized, and 1.5 terabytes of data were acquired over three days. The digital forensics effort to analyze the body of evidence took eight months, and the final report was over 5,000 pages long. Among the evidence discovered were documents detailing how hackers were hired to “scrape” content from Curriculum and other companies, as well as executable code named “rouba” (Portuguese for “steals”). The case progressed through various courts, ultimately reaching the Brazilian Supreme Court, where a decision was made in Curriculum’s favor and a fine of BRL63 million (US$42 million) was imposed on Catho Online.
Lessons learned
When we think of cyber threats to companies, hacking in search of personally identifiable information (PII) usually springs to mind. But there are many ways to improperly and/or illegally access and compromise sensitive data. Penetrating an organization’s perimeter security to compromise customers’ personal data or log-in information is perhaps more commonplace, but theft of website content consisting of crucial competitive intellectual property can be just as damaging. LinkedIn has experienced both: a user credential database breach in 2012, and now a content access attack. It is arguable that the latter is as dangerous as the former, if not more so.
In its lawsuit, LinkedIn noted that the scraping put significant strain on its network servers, and that the company was forced to expend extra time and resources in responding to the misconduct, including the deletion of thousands of fake profiles.
Any company whose livelihood depends on the exclusivity of its content needs to be aware of the risk posed by scraping, and take strong measures to prevent any attempt to access, steal or compromise what amounts to its most valuable asset.