The Web Scraping Landscape in 2023
In 1989, British computer scientist Tim Berners-Lee invented the World Wide Web (WWW) while working at CERN. The original motivation behind this invention was to improve information sharing within the institution and with external collaborators. Tim’s creation proved to be a success and it rapidly expanded beyond academia. Fast-forward to today, and the aggregate of all web pages amounts to an immense volume of web data with approximately 1.13 billion websites on the internet.
Much of the web is optimized to be viewed by human eyes, rather than for use by automated services that could reorganize the data, extend its utility, and pave the way for innovative solutions and applications. The industry of web scraping has emerged to meet this technical need, and provide a means to add structure to otherwise unstructured web data. There are numerous companies offering robust APIs, allowing developers easy access to data without having to grapple with undue complexity. Nevertheless, developers frequently find themselves resorting to web scraping techniques to obtain the data they require.
Web scraping in action
Web scraping is nearly as old as the web itself. In essence, it’s the automated extraction of data from websites. As previously noted, the internet is filled with unstructured data. Web scraping techniques can transform this as-yet untapped value into an organized resource, suitable for a variety of new applications.
Let’s consider a practical example. Imagine you operate a large eCommerce website specializing in PC components. With thousands of items in stock, setting competitive prices to maximize profit is crucial. Prices can fluctuate due to broader economic factors (think NVIDIA graphics cards and the crypto boom) or specific events like seasonal holidays. Pricing your products too far below or above your competitors’ could put your business at a significant disadvantage. Manually checking all product data would be impractical and time-consuming. As a savvy eCommerce owner, you could instead employ a web scraper to bring all that data to your doorstep. You might source it from multiple websites or even just one, for example, Amazon.
We spoke with Erez Naveh, VP of product at Bright Data. Erez frames web scraping as follows: “How do we know what prices are set by the competition? In the physical world, a common way to do it is to send a mystery shopper who can look at the shelves and see how products are priced. Web scraping those prices online is a digital version of the same process.”
Another example comes from the travel industry, where numerous websites offer flights, hotels, and other services. Yet again, prices can fluctuate widely, and the information is often dispersed across multiple platforms. While most booking sites, such as Booking.com or Airbnb, primarily address basic user queries, such as availability of properties for specific dates in a given location, the data they hold and present has value beyond answering that single question. Access to this information can enrich the user experience through innovative travel features and also provide valuable insights for business intelligence, such as trend forecasting and alerting.
Practicalities of web scraping
Let’s delve into the technicalities of setting up a web scraping operation. Once a target for web scraping is identified, the developer faces several challenges and decisions. The first step involves understanding the website’s structure and answering key questions including: What type of data is present? How are the page elements organized? Are there discernible patterns that could streamline the scraping process? Does the site utilize pagination? While modern web development typically follows industry standards, some websites may still prove more difficult to scrape than others. Moreover, if the developer has no control over the target website’s architecture, the scraping code may require frequent updates to adapt to any changes in site structure.
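The pagination question above can be made concrete with a small sketch. It assumes a hypothetical site that marks its "next page" link with rel="next"; real sites often use different markup, so NextLinkFinder would need adjusting. The fetch function is passed in as a parameter, and here it is backed by an in-memory dictionary (FAKE_SITE, an invented stand-in) so the example runs without network access.

```python
from html.parser import HTMLParser
from typing import Callable, Iterator, Optional, Tuple
from urllib.parse import urljoin


class NextLinkFinder(HTMLParser):
    """Remembers the href of the first <a rel="next"> link on a page."""

    def __init__(self) -> None:
        super().__init__()
        self.next_href: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "a" and attr_map.get("rel") == "next" and self.next_href is None:
            self.next_href = attr_map.get("href")


def iter_pages(
    start_url: str, fetch: Callable[[str], str], max_pages: int = 50
) -> Iterator[Tuple[str, str]]:
    """Yield (url, html) pairs, following rel="next" links until they run out."""
    url: Optional[str] = start_url
    seen = set()
    # Stop on a missing link, a loop back to a visited page, or the page cap.
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        html = fetch(url)
        yield url, html
        finder = NextLinkFinder()
        finder.feed(html)
        # Resolve relative links such as "?page=2" against the current URL.
        url = urljoin(url, finder.next_href) if finder.next_href else None


# Hypothetical three-page listing used instead of real HTTP requests.
FAKE_SITE = {
    "https://shop.example/gpus?page=1": '<li>GPU A</li><a rel="next" href="?page=2">Next</a>',
    "https://shop.example/gpus?page=2": '<li>GPU B</li><a rel="next" href="?page=3">Next</a>',
    "https://shop.example/gpus?page=3": "<li>GPU C</li>",
}

visited = [url for url, _ in iter_pages("https://shop.example/gpus?page=1", FAKE_SITE.__getitem__)]
print(visited)
```

Passing the fetcher in as an argument keeps the crawling logic separate from the transport, so the same loop could be driven by an HTTP client or a headless browser.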
In the Python landscape, multiple libraries are available for various aspects of web scraping. The requests library is often used for HTTP requests to fetch web pages. For parsing HTML or XML documents, Beautiful Soup and lxml are popular choices. While Puppeteer can also be used with Python, Playwright has emerged as a popular solution too. Although it was originally built as a framework for website testing, it does a great job of automating browser tasks, which makes it useful for extracting web data.
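As a rough illustration of the fetch-then-parse workflow, here is a minimal sketch that deliberately uses only Python's standard library (urllib and html.parser) as stand-ins for requests and Beautiful Soup, so it runs without extra installs. In practice, the third-party libraries would likely make the same task shorter. The class names "product-name" and "product-price", the sample HTML, and the product data are all invented for the example.

```python
from decimal import Decimal
from html.parser import HTMLParser
from urllib.request import urlopen


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page and decode it (stdlib stand-in for an HTTP client library)."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class ProductParser(HTMLParser):
    """Collects the text of elements tagged with hypothetical product classes."""

    FIELDS = {"product-name": "name", "product-price": "price"}

    def __init__(self) -> None:
        super().__init__()
        self.products = []
        self._field = None  # field that the next text node belongs to

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for css_class, field in self.FIELDS.items():
            if css_class in classes:
                self._field = field
                if field == "name":
                    self.products.append({})  # a name starts a new product

    def handle_data(self, data):
        if self._field and self.products and data.strip():
            self.products[-1][self._field] = data.strip()
            self._field = None

    def handle_endtag(self, tag):
        self._field = None  # never carry a pending field past its element


def extract_products(html: str) -> list:
    """Return a list of {"name": str, "price": Decimal} dictionaries."""
    parser = ProductParser()
    parser.feed(html)
    for product in parser.products:
        if "price" in product:
            # Turn "$1,599.00" into Decimal("1599.00").
            product["price"] = Decimal(product["price"].replace("$", "").replace(",", ""))
    return parser.products


# Made-up markup; a real page would come from fetch_html(some_url).
SAMPLE_HTML = """
<ul>
  <li><h2 class="product-name">GPU Model A</h2><span class="product-price">$1,599.00</span></li>
  <li><h2 class="product-name">GPU Model B</h2><span class="product-price">$999.00</span></li>
</ul>
"""

products = extract_products(SAMPLE_HTML)
print(products)
```

Parsing prices into Decimal rather than float avoids rounding surprises when the values are later compared against a competitor's prices.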
Not an easy ride – challenges of web scraping
As previously mentioned, developers creating web scrapers usually have no control over the target website but are fully responsible for ensuring their scraping service runs smoothly. Here are some common challenges:
- Website structure changes: If the scraper’s functionality is closely tied to the HTML structure of the target website, even a simple change in layout can completely throw it off. There is no guarantee that the structure will stay the way it is, nor is there any assurance that the developer will be notified that something is about to change. This unpredictability can lead to both unexpected costs of upgrading the web scraper and downtime in its operation.
- Rate limiting: Websites may regulate the number of requests you can make in a given timeframe. Some of the common algorithms for rate limiting include Token Bucket and Leaky Bucket, which allow for occasional bursts of traffic but constrain the average rate of incoming requests. Rate limits can be set based on IP addresses, user sessions, or API keys. Running into a rate limit, depending on the nature of the data that is being scraped, might mean that obtaining the data will take too long unless the web scraper is using multiple proxies.
- CAPTCHA: Are you a robot? CAPTCHA is a well-known mechanism for telling humans and computers apart by presenting challenges that are computationally hard for bots to solve but relatively easy for humans. CAPTCHAs serve as a barrier against web scraping, automated form submission, and brute-force attacks. Nevertheless, they are not foolproof and can be bypassed using techniques like machine learning-based object recognition or sometimes even by employing human-solving services. CAPTCHA is relatively easy to integrate into a website by using a provider like Google’s reCAPTCHA.
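To make the Token Bucket idea from the rate-limiting item more concrete, here is a minimal sketch of the algorithm. It is written from the scraper's side, as a way to pace outgoing requests so they stay under a site's limit; it is not a description of how any particular website implements its limiter. The class name, parameters, and the FakeClock helper are all assumptions made for the example, and the fake clock exists only so the behavior can be shown without actually sleeping.

```python
import time


class TokenBucket:
    """Allows bursts of up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)  # start with a full bucket
        self._last = clock()

    def _refill(self) -> None:
        now = self._clock()
        # Add tokens for the elapsed time, never exceeding the bucket size.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now

    def try_acquire(self, tokens: float = 1.0) -> bool:
        """Spend tokens if available; return False when the caller should back off."""
        self._refill()
        if self._tokens >= tokens:
            self._tokens -= tokens
            return True
        return False

    def wait_time(self, tokens: float = 1.0) -> float:
        """Seconds until `tokens` will be available (0.0 if they already are)."""
        self._refill()
        return max(0.0, (tokens - self._tokens) / self.rate)


class FakeClock:
    """Manually advanced clock, used here instead of real time for a repeatable demo."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now


clock = FakeClock()
bucket = TokenBucket(rate=2.0, capacity=3, clock=clock)

burst = [bucket.try_acquire() for _ in range(4)]  # three succeed, the fourth is refused
delay = bucket.wait_time()                        # time until one token is refilled
clock.now += delay                                # pretend we slept for that long
after_wait = bucket.try_acquire()

print(burst, delay, after_wait)
```

In a real scraper, the default time.monotonic clock would be used and the caller would sleep for wait_time() seconds before sending the next request.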
Taking web scraping to the next level
Building a web scraper is a time-consuming process with no guarantee that the final product will be maintenance-free. From adapting to the dynamic and sometimes inventive nature of websites to overcoming obstacles designed to hinder a scraper’s effectiveness, the path to creating a reliable web scraper is often fraught with challenges.
Fortunately, solutions like Bright Data, a comprehensive, award-winning suite of web scraping tools, can significantly improve the web scraper development experience. Bright Data is not just another scraping library but a full powerhouse of functionalities, tailored web scraping templates, and proxies. Taken together, its features allow developers to abstract away the intricacies of scraping and focus on what they are actually building.
According to Erez Naveh of Bright Data: “We have customers that range from dedicated web scraping teams, to a huge e-commerce business that needs to keep track of all the prices in the market, to single developers that don’t have many resources. While large customers might already have an entire web scraping department with machine learning talent, small ones usually don’t and cannot efficiently deal with the challenges on their own. We have solutions for both of them.”
What makes Bright Data so valuable? Let’s have a look through some of the most useful features:
- Proxies: An ethically sourced proxy network, 72 million strong, which includes residential proxies, ISP proxies, and even IPs from mobile networks around the world. This extensive network not only allows your web scraper to view websites from various perspectives but also addresses many of the rate-limiting issues we discussed earlier, along with browser fingerprinting.
- Scraping Browser: A specialized automated browser designed to streamline the web scraping process. It offers a 3-in-1 solution that integrates proxy technology, automated website unblocking, and browser functionalities. Compatible with popular scraping frameworks like Puppeteer, Playwright, and Selenium, the Scraping Browser manages challenges like CAPTCHA solving, proxy rotation, and browser fingerprinting automatically. Hosted on Bright Data’s scalable infrastructure, it allows for cost-effective scaling of data scraping projects.
- Web Scraper IDE: Web Scraper IDE is an all-in-one tool for efficient and scalable web scraping. A developer can jumpstart a project with pre-made templates for popular data sources (like LinkedIn, Amazon, and YouTube) and debug the results on the fly with interactive previews. If you’re after data from search engines like Google or Bing, Bright Data also provides a SERP API that makes it easy to turn search results into actionable data insights.
- Ready datasets: If creating a web scraper is not your thing, maybe taking advantage of data that has been scraped before is a better solution? Bright Data offers fresh datasets from some of the most popular public websites. From LinkedIn to Amazon, there are a lot of ready-made solutions to choose from. It’s also cheaper than scraping the data yourself. And if analyzing the obtained data is also not your thing, you can use Bright Insights to receive actionable eCommerce market intelligence.
In 2023, web scraping remains a pivotal activity for data collection across various industries, from eCommerce to travel. However, the process is often convoluted and laden with challenges like ever-changing website structures and security mechanisms. Bright Data emerges as a comprehensive solution, offering an extensive suite of web scraping tools that streamline the process for developers. It provides a robust proxy network to navigate around rate-limiting issues at scale and a Scraping Browser to facilitate efficient data extraction. Additionally, Bright Data offers pre-scraped datasets, serving as an all-encompassing resource for both novice and experienced web scrapers.
What is coming in the future for web scraping products? While the race to overcome the challenges of accessing websites at scale continues, new technological breakthroughs like LLMs make it possible not only to scrape websites better but also to make better use of the extracted data.
Erez Naveh spoke to us about the future of web scraping and said, “We found so many useful use cases of LLMs that I believe that the next year or a couple of years will be just figuring out how to leverage it and optimize it to the benefit and value of our customers. For instance – a fun example. In the pre-collected datasets, users can press a button and add a new smart column and assign a prompt to it. The new column will be filled with data in an AI-enhanced way almost in an instant, without having to spend time training any new models.”
Full disclosure: Bright Data is a sponsor of Software Engineering Daily.