Protect your website by making scraping harder


Everything is visible if it leaves the server

There’s no easy solution for this. If the data is available publicly, then it can be scraped. The only thing you can do is make life more difficult for the scraper, for example by making each entry slightly unique, adding or changing the HTML without affecting the layout. That would make it harder to harvest the data with regular expressions, but it’s still not a real solution, and anyone determined enough would find a way to deal with it.
First, search engines have to scrape your site for content to be able to index it, so if you find a way to thwart that, you won’t be listed on any search engine and no one will find your site to care about it. And if you are that worried about what you publish being stolen, don’t publish it, because once it’s out there, it’s going to stay out there, if not in Google’s cache then somewhere else.

How web scraping works

Web scraping is a technique that allows you to extract data from websites using automated programs called web scrapers. It works by sending an HTTP request to the website’s server and then parsing the HTML code in the response to extract the desired data. The specific steps involved can vary depending on the website and the tools being used, but they generally look like the following (a minimal sketch follows the list):

  1. Making an HTTP request: The first step involves a web scraper requesting access to a server that has the desired data. This request is typically made using the HTTP or HTTPS protocol.
  2. Parsing the HTML: Once the server responds to the request, the web scraper receives an HTML document that contains the website’s content. The web scraper then analyzes this document and extracts the relevant data using tools such as BeautifulSoup or lxml in Python.
  3. Storing the data: Once the data has been extracted, it can be stored in a format such as a CSV or JSON file, or in a database such as MySQL, PostgreSQL or MongoDB.
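
Putting those three steps together, a minimal scraper might look like the sketch below. It assumes Python with the requests and BeautifulSoup libraries; the URL and the CSS selectors are placeholders, not real endpoints.

```python
# Minimal scraping sketch: request a page, parse it, store the results.
import csv

import requests
from bs4 import BeautifulSoup

# 1. Make an HTTP request.
response = requests.get("https://example.com/products", timeout=10)
response.raise_for_status()

# 2. Parse the HTML and pull out the desired fields.
soup = BeautifulSoup(response.text, "lxml")
rows = []
for item in soup.select(".product"):              # hypothetical class names
    name = item.select_one(".name").get_text(strip=True)
    price = item.select_one(".price").get_text(strip=True)
    rows.append({"name": name, "price": price})

# 3. Store the data (CSV here; JSON or a database would work the same way).
with open("products.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(rows)
```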

Web scraping can be used for a variety of purposes such as price monitoring, data analysis, and content aggregation. However, it is important to note that web scraping can be illegal or unethical in certain situations, particularly if it involves scraping sensitive data or violating a website’s terms of service.

Why people need web scraping

Web scraping or web content extraction involves automating the process of collecting information and data from various websites. There are many reasons why web scraping is done, some of which are:

  1. To automate tasks: Web scraping can automate repetitive tasks such as collecting and collating data from multiple sources, which can save valuable time and resources.
  2. To monitor changes: Web scraping can be used to monitor changes on specific web pages, such as stock prices or weather reports.
  3. To gain insights: Web scraping can be used to gain insights by analyzing data from various websites. This is useful for businesses that need to track customer sentiment or for social media monitoring.
  4. To conduct research: Researchers in many fields use web scraping to collect data for research purposes. For example, data scientists may use web scraping to collect data on consumer behavior to develop better marketing strategies.
  5. To build databases: Web scraping is useful for building large databases by extracting data from multiple websites. This is useful for businesses that need to keep track of competitor pricing or customer comments on social media.

Web scraping is a useful technique for collecting and analyzing data from multiple websites for various purposes. However, it is important to respect the terms of service and applicable laws when using web scraping techniques.

The legality of web scraping varies based on the specific circumstances and jurisdiction. Generally, web scraping is legal if you are accessing publicly available data and are not violating any terms of service or other legal agreements. However, there are situations where web scraping can be considered illegal, such as if you are scraping data that is protected by intellectual property laws or if you are using web scraping to engage in unlawful activities such as stealing confidential data or committing fraud. It’s important to consult with a legal expert to ensure that your web scraping activities are in compliance with applicable laws and regulations.

Reduce risk from website scraping

Do not concatenate user input into dynamic SQL. Use parameter binding instead.
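
As a minimal sketch of what binding looks like, assuming Python and the standard-library sqlite3 driver (the table and column names are illustrative; any database driver offers the same placeholder mechanism):

```python
import sqlite3

conn = sqlite3.connect("app.db")   # illustrative database file

def find_user(username: str):
    # Safe: the "?" placeholder is bound by the driver, never concatenated.
    return conn.execute(
        "SELECT id, email FROM users WHERE username = ?", (username,)
    ).fetchall()

# Unsafe equivalent (never do this):
#   conn.execute("SELECT id, email FROM users WHERE username = '" + username + "'")
```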

Require logins and restrict access to certain areas.

Block all access from cloud hosting providers, since many scrapers run from them.

Try to have levels of abstraction between your web application and your database (an n-tiered application), so that the web application does not interact with the database directly. Properly sanitize, encode, and handle all user input.
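
For the sanitize-and-encode part, here is a minimal sketch using Python’s standard-library html.escape (the render_comment helper is just an illustration):

```python
import html

def render_comment(user_comment: str) -> str:
    # Encode on output so markup in the input is displayed, not executed.
    return "<p>" + html.escape(user_comment) + "</p>"

print(render_comment('<script>alert("x")</script>'))
# -> <p>&lt;script&gt;alert(&quot;x&quot;)&lt;/script&gt;</p>
```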

Do not rely only on your own validation and sanitization; use tools that have been put together by dedicated development teams.

Use unit testing in your application; make sure it can handle all types of input and fails safely.

Ensure you are not throwing verbose error messages directly from the database.

Never send anything from the server you don’t want the user to see. If the user is not authorized to see it, don’t send it. Don’t “hide” important bits and pieces in jQuery.data() or data-attributes. Don’t squirrel things away in obfuscated JavaScript. Don’t use techniques to hide data on the page until the user logs in, etc.

Use email-verified user registration (including some form of good CAPTCHA, such as reCAPTCHA, to confound most of the bots).

Protect your server with a firewall as best you can, and make sure you don’t leave any common exploits open.

Rate limiting network traffic

Rate limit by user account, IP address, user agent, browser fingerprint etc. – this means you restrict the amount of data a particular user group can download in a certain period of time. If you detect a large amount of data being transferred, you shut down the account, browser fingerprint or IP address.
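
As an illustration only, here is a minimal sliding-window limiter keyed by IP address, sketched with Flask and an in-memory store (the limits are arbitrary; real deployments usually rate limit at the web server, a CDN, or a shared store such as Redis):

```python
import time
from collections import defaultdict, deque

from flask import Flask, abort, request

app = Flask(__name__)

WINDOW_SECONDS = 60
MAX_REQUESTS_PER_WINDOW = 120

request_log: dict[str, deque] = defaultdict(deque)

@app.before_request
def rate_limit():
    now = time.time()
    key = request.remote_addr          # could also be user account or fingerprint
    log = request_log[key]
    while log and now - log[0] > WINDOW_SECONDS:
        log.popleft()                  # drop entries outside the window
    if len(log) >= MAX_REQUESTS_PER_WINDOW:
        abort(429)                     # Too Many Requests
    log.append(now)
```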

JS rendering or dynamic content

Use JavaScript rendering/dynamic content, which adds a layer of difficulty. This is a popular technique.
Require JavaScript, to ensure the client has some resemblance to an interactive browser rather than a bare-bones spider.
Load a page shell, then retrieve the data through an AJAX call and insert it into the DOM. Use sessions to set a hash that must be passed back with the AJAX call as verification. The hash would only be valid for a certain length of time (e.g. 10 seconds). This is really just adding an extra step someone would have to jump through to get the data, but it would prevent simple page scraping.
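
A rough sketch of that flow, assuming Flask (the route names, the token format, and the 10-second window are only illustrations of the idea above):

```python
import os
import time

from flask import Flask, abort, jsonify, request, session

app = Flask(__name__)
app.secret_key = os.urandom(32)

@app.route("/")
def page_shell():
    # Serve an empty shell plus a session-bound token; the data arrives later via AJAX.
    token = "%d:%s" % (int(time.time()), os.urandom(8).hex())
    session["token"] = token
    return """
    <div id="data">loading...</div>
    <script>
      fetch("/data?token=%s")
        .then(r => r.json())
        .then(d => { document.getElementById("data").textContent = d.value; });
    </script>
    """ % token

@app.route("/data")
def data():
    token = request.args.get("token", "")
    # The token must match the one stored in the session when the shell was served.
    if not token or token != session.get("token"):
        abort(403)
    issued_at = int(token.split(":", 1)[0])
    if time.time() - issued_at > 10:     # token only valid for ~10 seconds
        abort(403)
    return jsonify(value="the protected content")
```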

Encode data as images

This is pretty intrusive for regular users, but you could encode some of your data tables or values as images instead of text, which would defeat most text parsers; it isn’t foolproof, of course.
Present the data as an image instead of HTML. This requires extra processing on the server side, but wouldn’t be hard with the graphics libraries in PHP. Alternatively, you could do this just for requests over a certain size (i.e. all of them).
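
The text above mentions PHP’s graphics libraries; as a language-neutral illustration of the same idea, here is a sketch with Python’s Pillow library (the function name and the sizing are assumptions):

```python
from io import BytesIO

from PIL import Image, ImageDraw

def value_as_png(value: str) -> bytes:
    """Render a short text value into a PNG so plain text parsers can't read it."""
    img = Image.new("RGB", (10 + 8 * len(value), 24), "white")
    draw = ImageDraw.Draw(img)
    draw.text((5, 5), value, fill="black")   # default bitmap font
    buf = BytesIO()
    img.save(buf, format="PNG")
    return buf.getvalue()

# The bytes would be returned with a Content-Type of image/png, and the HTML
# would reference the image instead of printing the value inline.
```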

Control search crawlers

Use robots meta tags to deny obvious web spiders and known robot user agents. Use basic HTTP header validation to detect and block common scraper User-Agent strings.

Common scrapers: Puppeteer, PhantomJS, Selenium WebDriver.
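
A minimal sketch of that User-Agent filtering, assuming Flask (the blocklist substrings are illustrative, and this header is trivially spoofed, so treat it only as a first filter):

```python
from flask import Flask, abort, request

app = Flask(__name__)

BLOCKED_UA_SUBSTRINGS = ["headlesschrome", "phantomjs", "selenium",
                         "python-requests", "scrapy"]

@app.before_request
def block_known_scrapers():
    ua = request.headers.get("User-Agent", "").lower()
    # An empty User-Agent or one matching a known automation tool gets a 403.
    if not ua or any(s in ua for s in BLOCKED_UA_SUBSTRINGS):
        abort(403)
```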

Data poisoning – put in decoy entries and links that nobody will want to have, which stall the download for bots that blindly collect everything.

A honeypot trap is hidden code, combined with robots.txt, that looks like it contains valuable information but is fake.
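
One way to sketch that, again assuming Flask (the trap path and the in-memory blocklist are placeholders): the trap URL is disallowed in robots.txt and hidden from human visitors, so anything that requests it gets flagged as a bot.

```python
from flask import Flask, abort, request

app = Flask(__name__)
flagged_ips: set[str] = set()

@app.route("/robots.txt")
def robots():
    # Well-behaved crawlers skip the trap; scrapers that ignore robots.txt
    # or harvest every link walk right into it.
    return ("User-agent: *\nDisallow: /do-not-follow\n",
            200, {"Content-Type": "text/plain"})

@app.route("/do-not-follow")
def honeypot():
    flagged_ips.add(request.remote_addr)
    abort(404)

@app.before_request
def reject_flagged():
    if request.remote_addr in flagged_ips:
        abort(403)
```

The page markup would also carry a link to the trap path that is hidden from humans (for example with CSS), so only harvesters that follow every link ever reach it.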

Obfuscating HTML

Make the web pages differ in ways that are not predictable each time they are loaded, with each page using unique identifiers for tags and so on. Frequently change your templates, so that scrapers may fail to find the desired content. You could change your HTML frequently, or change the HTML tag names frequently. Most screen scrapers work by using string comparisons with tag names, or regular expressions searching for particular strings. If you keep changing the underlying HTML, they will need to keep changing their software.
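
As one small illustration of per-request identifiers (the template, class names, and helper are made up for this sketch):

```python
import secrets

TEMPLATE = '<span class="{price_class}">19.99</span>'

def render_with_random_classes() -> str:
    # A throwaway class name on every load, so scrapers can't hard-code a selector.
    random_class = "c" + secrets.token_hex(4)   # e.g. "c3fa91b02"
    return TEMPLATE.format(price_class=random_class)
```

The matching CSS rule would be emitted with the same random name, so the layout is unchanged while the markup differs on every load.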

Conclusion

No engineer or company will disclose their exact techniques, for security reasons. No technology completely prevents scraping: whatever you do, determined scrapers can still figure out a way in. However, you can stop a lot of scraping by combining a few of the open techniques above, reducing the load on your network bandwidth and protecting your content.
