Protect your website by making it harder to scrape

Everything is visible if it leaves the server
There’s no easy solution for this. If the data is available publicly, then it can be scraped. The only thing you can do is make life more difficult for the scraper, for example by making each entry slightly unique, adding or changing HTML without affecting the layout. This would possibly make it more difficult for someone to harvest the data using regular expressions, but it’s still not a real solution, and I would say that anyone determined enough would find a way to deal with it.
First, search engines have to scrape your site for content to be able to index it, so if you find a way to thwart that, you won’t be listed on any search engine and no one will find your site in the first place. But if you are that worried about your content being stolen, don’t publish it. Because once it’s out there, it’s going to stay out there, if not in Google’s cache then somewhere else.
How web scraping works
Web scraping is a technique that allows you to extract data from websites using automated programs called web scrapers. Web scraping works by sending an HTTP request to the website’s server and then parsing the HTML code in the response to extract the desired data. The specific steps involved can vary depending on the website and the tools being used, but they generally look like the following (a minimal code sketch follows the list):
- Making an HTTP request: The first step involves a web scraper requesting access to a server that has the desired data. This request is typically made using the HTTP or HTTPS protocol.
- Parsing the HTML: Once the server responds to the request, the web scraper receives an HTML document that contains the website’s content. The web scraper then analyzes this document and extracts the relevant data using tools such as BeautifulSoup or lxml in Python.
- Storing the data: Once the data has been extracted, it can be stored in a format such as a CSV or JSON file, or in a database such as MySQL, PostgreSQL or MongoDB.
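As a concrete illustration of these three steps, here is a minimal sketch in Python using the requests and BeautifulSoup libraries mentioned above. The URL and the CSS selectors are placeholders, not a real target.

```python
import csv
import requests
from bs4 import BeautifulSoup

# Step 1: make an HTTP request (example.com is a placeholder URL)
response = requests.get("https://example.com/products", timeout=10)
response.raise_for_status()

# Step 2: parse the HTML and extract the desired data
soup = BeautifulSoup(response.text, "html.parser")
rows = []
for item in soup.select(".product"):          # hypothetical CSS class
    name = item.select_one(".name")
    price = item.select_one(".price")
    if name and price:
        rows.append({"name": name.get_text(strip=True),
                     "price": price.get_text(strip=True)})

# Step 3: store the extracted data, here as a CSV file
with open("products.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(rows)
```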
Web scraping can be used for a variety of purposes such as price monitoring, data analysis, and content aggregation. However, it is important to note that web scraping can be illegal or unethical in certain situations, particularly if it involves scraping sensitive data or violating a website’s terms of service.
Why people need web scraping
Web scraping or web content extraction involves automating the process of collecting information and data from various websites. There are many reasons why web scraping is done, some of which are:
- To automate tasks: Web scraping can automate repetitive tasks such as collecting and collating data from multiple sources, which can save valuable time and resources.
- To monitor changes: Web scraping can be used to monitor changes on specific web pages, such as stock prices or weather reports.
- To gain insights: Web scraping can be used to gain insights by analyzing data from various websites. This is useful for businesses that need to track customer sentiment or for social media monitoring.
- To conduct research: Researchers in many fields use web scraping to collect data for research purposes. For example, data scientists may use web scraping to collect data on consumer behavior to develop better marketing strategies.
- To build databases: Web scraping is useful for building large databases by extracting data from multiple websites. This is useful for businesses that need to keep track of competitor pricing or customer comments on social media.
Web scraping is a useful technique for collecting and analyzing data from multiple websites for various purposes. However, it is important to respect the terms of service and applicable laws when using web scraping techniques.
No law explicitly declares web scraping legal
The legality of web scraping varies based on the specific circumstances and jurisdiction. Generally, web scraping is legal if you are accessing publicly available data and are not violating any terms of service or other legal agreements. However, there are situations where web scraping can be considered illegal, such as if you are scraping data that is protected by intellectual property laws or if you are using web scraping to engage in unlawful activities such as stealing confidential data or committing fraud. It’s important to consult with a legal expert to ensure that your web scraping activities are in compliance with applicable laws and regulations.
Reduce risk from website scraping
- Do not include user input in dynamic SQL; use parameter binding instead (see the sketch after this list).
- Use logins and restrict access to certain areas.
- Block all access from cloud hosting providers.
- Add layers of abstraction between your web application and your DB (an n-tiered application), so that the web application never interacts with the database directly. Properly sanitize, encode and handle all user input.
- Do not rely on your own validation and sanitization; use tools that have been put together by dedicated dev teams.
- Use unit testing in your application, make sure it can handle all types of input, and ensure it fails safely.
- Ensure you are not returning verbose error messages straight from the database.
- Never send anything from the server you don’t want the user to see. If the user is not authorized to see it, don’t send it. Don’t “hide” important bits and pieces in jQuery.data() or data-attributes. Don’t squirrel things away in obfuscated JavaScript. Don’t use techniques to hide data on the page until the user logs in, etc.
- Use email-verified user registration (including some form of good reCAPTCHA to confound most of the bots).
- Protect your server with a firewall as best you can, and make sure you don’t leave any common exploits open.
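To illustrate the first point, binding parameters instead of concatenating user input into dynamic SQL, here is a minimal sketch using Python’s built-in sqlite3 module. The database file, table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect("app.db")  # hypothetical database file

def find_user(username: str):
    # Unsafe: builds dynamic SQL from raw input (open to SQL injection)
    # query = f"SELECT id, email FROM users WHERE name = '{username}'"

    # Safe: the value is bound as a parameter and never interpreted as SQL
    query = "SELECT id, email FROM users WHERE name = ?"
    return conn.execute(query, (username,)).fetchall()
```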
Rate limiting network traffic
Rate limit by user account, IP address, user agent, browser fingerprint, etc. This means you restrict the amount of data a particular user group can download in a certain period of time. If you detect a large amount of data being transferred, you shut down the account, browser fingerprint or IP address.
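As a rough sketch of per-client rate limiting, the following fixed-window counter keyed by IP address is written as a Flask hook. The window size, request budget and use of Flask are assumptions; a production setup would more likely use a shared store such as Redis, or rate limiting at the proxy/CDN layer.

```python
import time
from collections import defaultdict
from flask import Flask, request, abort

app = Flask(__name__)

WINDOW_SECONDS = 60        # assumed window size
MAX_REQUESTS = 100         # assumed per-IP budget within the window
hits = defaultdict(list)   # in-memory only; use Redis or similar in production

@app.before_request
def rate_limit():
    now = time.time()
    key = request.remote_addr   # could also key on account or browser fingerprint
    hits[key] = [t for t in hits[key] if now - t < WINDOW_SECONDS]
    if len(hits[key]) >= MAX_REQUESTS:
        abort(429)              # Too Many Requests
    hits[key].append(now)
```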
JS rendering or dynamic content
Render content with JavaScript or load it dynamically, which adds a layer of difficulty. This is a popular technique.
Require JavaScript, to ensure the client bears some resemblance to an interactive browser rather than a bare-bones spider.
Load a page shell, then retrieve the data through an AJAX call and insert it into the DOM. Use sessions to set a hash that must be passed back with the AJAX call as verification. The hash would only be valid for a certain length of time (e.g. 10 seconds). This is really just adding an extra step someone would have to jump through to get the data, but it would prevent simple page scraping.
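Here is a rough sketch of the page-shell plus verified AJAX idea using Flask sessions. The endpoint names, the 10-second validity window and the HMAC construction are assumptions, not a prescribed design.

```python
import hashlib
import hmac
import time
from flask import Flask, abort, jsonify, request, session

app = Flask(__name__)
app.secret_key = "change-me"   # assumed signing secret
TOKEN_TTL = 10                 # seconds a token stays valid, as suggested above

def make_token(ts: str) -> str:
    # Tie the token to the session and to the timestamp it was issued at
    msg = f"{session['sid']}:{ts}".encode()
    return hmac.new(app.secret_key.encode(), msg, hashlib.sha256).hexdigest()

@app.route("/page")
def page_shell():
    # Serve an empty shell; the real data is fetched by the AJAX endpoint below
    session.setdefault("sid", "anon")
    ts = str(int(time.time()))
    return f"<html><body data-ts='{ts}' data-token='{make_token(ts)}'>loading...</body></html>"

@app.route("/data")
def data_endpoint():
    ts = request.args.get("ts", "0")
    token = request.args.get("token", "")
    fresh = time.time() - int(ts) <= TOKEN_TTL
    if "sid" not in session or not fresh or not hmac.compare_digest(token, make_token(ts)):
        abort(403)             # expired or forged token
    return jsonify({"rows": ["..."]})
```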
Encode data as images
This is pretty intrusive to regular users, but you could encode some of your data tables or values as images instead of text, which would defeat most text parsers, though it isn’t foolproof of course.
Present the data as an image instead of HTML. This requires extra processing on the server side, but wouldn’t be hard with the graphics libs in PHP. Alternatively, you could do this just for requests over a certain size (i.e. all).
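The article mentions PHP’s graphics libraries; as a language-neutral illustration, here is a sketch of the same idea in Python with the Pillow library, rendering a value as a PNG instead of sending it as text. The sizing and the default bitmap font are assumptions.

```python
from io import BytesIO
from PIL import Image, ImageDraw

def value_as_png(text: str) -> bytes:
    """Render a short value (e.g. a price) as a PNG instead of HTML text."""
    img = Image.new("RGB", (10 * len(text) + 20, 30), "white")
    draw = ImageDraw.Draw(img)
    draw.text((10, 8), text, fill="black")   # uses Pillow's default bitmap font
    buf = BytesIO()
    img.save(buf, format="PNG")
    return buf.getvalue()

# The page then embeds something like <img src="/price/123.png">
# instead of the literal number.
png_bytes = value_as_png("$19.99")
```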
Control search crawlers
Use robots meta tags to deny obvious web spiders and known robot user agents. Apply basic HTTP header validation, and detect and block common scraper User-Agent strings.
Common scraping tools: Puppeteer, PhantomJS, Selenium WebDriver.
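As a minimal sketch of the header check described above, the following Flask hook rejects requests whose User-Agent matches a few well-known automation signatures. The list is illustrative only, and user agents are trivial to spoof, so treat this as one signal among many rather than a complete defence.

```python
from flask import Flask, abort, request

app = Flask(__name__)

# Illustrative substrings seen in common automation tools (easy to spoof)
BLOCKED_AGENT_SUBSTRINGS = (
    "headlesschrome",    # Puppeteer in headless mode
    "phantomjs",
    "selenium",
    "python-requests",
    "scrapy",
)

@app.before_request
def block_known_scrapers():
    agent = request.headers.get("User-Agent", "").lower()
    if not agent or any(s in agent for s in BLOCKED_AGENT_SUBSTRINGS):
        abort(403)
```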
Hidden link traps or honeypots
Data poisoning: include entries and links that nobody would actually want, which stall the download for bots that blindly collect everything.
A honeypot trap is hidden markup, combined with a robots.txt rule, that looks like it contains valuable information but is fake; only a client that ignores robots.txt and follows hidden links will ever request it.
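A rough sketch of the honeypot idea follows: a URL that is disallowed in robots.txt and never linked visibly, so any client that requests it is very likely a bot. The route names and the in-memory ban list are assumptions.

```python
from flask import Flask, abort, request

app = Flask(__name__)
banned_ips = set()   # in-memory for illustration; persist this in production

@app.before_request
def reject_banned():
    if request.remote_addr in banned_ips:
        abort(403)

@app.route("/robots.txt")
def robots():
    # Well-behaved crawlers will never visit the trap URL below
    body = "User-agent: *\nDisallow: /internal-catalog-full/\n"
    return body, 200, {"Content-Type": "text/plain"}

@app.route("/internal-catalog-full/")
def honeypot():
    # Only reached by bots that ignore robots.txt and follow hidden links
    banned_ips.add(request.remote_addr)
    abort(403)
```

The pages would also carry a link to the trap that real users never see, for example an anchor styled with display:none.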
Obfuscating HTML
Make the web pages differ in ways that are not predictable each time they are loaded, for example by using unique identifiers for tags on each page. Change your templates frequently so that scrapers fail to find the desired content, or change your HTML and tag names often. Most screen scrapers work by comparing strings against tag names or by running regular expressions that search for particular patterns; if you keep changing the underlying HTML, they have to keep changing their software.
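One way to make the markup differ on every load, as a sketch: generate the CSS class names used for data fields per request, so a scraper cannot rely on a fixed selector. The field names and rendering helper are hypothetical, and the per-request CSS must be generated with the same mapping.

```python
import secrets

FIELDS = ("title", "price", "rating")   # logical field names used in templates

def random_class_map() -> dict:
    """Return a fresh {logical field -> random class name} mapping per request."""
    return {field: f"c{secrets.token_hex(4)}" for field in FIELDS}

def render_row(row: dict, classes: dict) -> str:
    # The accompanying per-request stylesheet must use the same generated names
    return (
        f"<div class='{classes['title']}'>{row['title']}</div>"
        f"<div class='{classes['price']}'>{row['price']}</div>"
    )

classes = random_class_map()
html = render_row({"title": "Widget", "price": "$5"}, classes)
```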
Conclusion
No engineer or company will disclose their exact techniques, for security reasons. No technology prevents scraping completely: whatever you do, a determined scraper can still figure out a way to scrape. However, you can stop a lot of scraping with a few of the openly known techniques above, reducing the load on your servers and protecting your network bandwidth.