The Web Scraping Landscape in 2023

Introduction

In 1989, British computer scientist Tim Berners-Lee invented the World Wide Web (WWW) while working at CERN. The original motivation behind this invention was to improve information sharing within the institution and with external collaborators. Berners-Lee’s creation proved to be a success, and it rapidly expanded beyond academia. Fast-forward to today, and the web has grown to approximately 1.13 billion websites, which collectively hold an immense volume of data.

Much of the web is optimized to be viewed by human eyes rather than for use by automated services that could reorganize the data, extend its utility, and pave the way for innovative solutions and applications. The web scraping industry has emerged to meet this technical need, providing a means to add structure to otherwise unstructured web data. Numerous companies offer robust APIs, allowing developers easy access to data without having to grapple with undue complexity. Nevertheless, developers frequently find themselves resorting to web scraping techniques to obtain the data they require.

Web scraping in action

Web scraping is nearly as old as the web itself. In essence, it’s the automated extraction of data from websites. As previously noted, the internet is filled with unstructured data. Web scraping techniques can transform this untapped value into an organized resource, suitable for a variety of new applications.

Let’s consider a practical example. Imagine you operate a large eCommerce website specializing in PC components. With thousands of items in stock, setting competitive prices to maximize profit is crucial. Prices can fluctuate due to broader economic factors (think NVIDIA graphics cards and the crypto boom) or specific events like seasonal holidays. Pricing your products too far below or above the competition could put your business at a significant disadvantage. Manually checking all product data would be impractical and time-consuming. As a savvy eCommerce owner, instead of doing the work manually, you could employ a web scraper to bring all that data to your doorstep. You might source it from multiple websites or even just one – for example, Amazon.

We spoke with Erez Naveh, VP of product at Bright Data, who frames web scraping as follows: “How do we know what prices are set by the competition? In the physical world, a common way to do it is to send a mystery shopper who can look at the shelves and see how products are priced. Web scraping those prices online is a digital version of the same process.”

Another example comes from the travel industry, where numerous websites offer flights, hotels, and other services. Here again, prices can fluctuate widely, and the information is often dispersed across multiple platforms. While most booking sites, such as Booking.com or Airbnb, primarily address basic user queries – for example, the availability of properties for specific dates in a given location – the data they hold has value beyond answering that single question. Access to this information can enrich the user experience through innovative travel features and also provide valuable insights for business intelligence, such as trend forecasting and alerting.

Practicalities of web scraping

Let’s delve into the technicalities of setting up a web scraping operation. Once a target for web scraping is identified, the developer faces several challenges and decisions. The first step involves understanding the website’s structure and answering key questions including: What type of data is present? How are the page elements organized? Are there discernible patterns that could streamline the scraping process? Does the site utilize pagination? While modern web development typically follows industry standards, some websites may still prove more difficult to scrape than others. Moreover, if the developer has no control over the target website’s architecture, the scraping code may require frequent updates to adapt to any changes in site structure.

Expanding on the technical aspects, once the web scraper is fully configured, it mimics human browsing behavior by sending a series of HTTP requests to the target website’s servers. These requests might include GET or POST methods, depending on what data retrieval or submission is needed. The scraper may also handle cookies, session IDs, and even deal with CAPTCHAs or JavaScript-rendered content if programmed to do so. Typically, the returned data is in HTML format, which then undergoes a parsing process to extract relevant information. Parsing can be done through various methods, for example by traversing the Document Object Model (DOM). Finally, the extracted data is structured into a machine-readable format like JSON or CSV, facilitating easy integration with other applications or data analytics tools.
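To make this pipeline concrete, below is a minimal sketch in Python using the requests and Beautiful Soup libraries (both discussed further below). The URL, CSS selectors, and output fields are hypothetical placeholders and would need to match the actual structure of the target site.

    import json

    import requests
    from bs4 import BeautifulSoup

    # Hypothetical target page – adjust to the real site's structure.
    URL = "https://example.com/gpus?page=1"

    # Mimic a regular browser by sending a typical User-Agent header.
    HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

    response = requests.get(URL, headers=HEADERS, timeout=10)
    response.raise_for_status()

    # Parse the returned HTML and traverse the DOM for product data.
    soup = BeautifulSoup(response.text, "html.parser")
    products = []
    for card in soup.select("div.product-card"):  # hypothetical selector
        products.append({
            "name": card.select_one("h2.title").get_text(strip=True),
            "price": card.select_one("span.price").get_text(strip=True),
        })

    # Structure the extracted data into a machine-readable format.
    with open("products.json", "w") as f:
        json.dump(products, f, indent=2)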

Although web scraping can be implemented in nearly any modern programming language, Python and JavaScript are nowadays the go-to choices for most developers.

In the JavaScript ecosystem, web scraping is often performed using Node.js with the help of libraries such as axios for HTTP requests and cheerio for HTML parsing. For more dynamic websites that depend on client-side JavaScript rendering, Puppeteer is often the library of choice. It provides a headless browser environment, allowing for the rendering of pages, execution of JavaScript, and interaction with the web page through simulating actions like clicks. This enables the scraping of data that is populated dynamically.

Similarly, in the Python landscape, multiple libraries cover the various aspects of web scraping. The requests library is often used to fetch web pages over HTTP. For parsing HTML or XML documents, Beautiful Soup and lxml are popular choices. While Puppeteer can also be used from Python, Playwright has emerged as a popular solution as well. Although it was originally built as a framework for website testing, it does a great job of automating browser tasks, which makes it equally useful for extracting web data.
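For pages that render their content with client-side JavaScript, a headless browser is required. Here is a minimal sketch using Playwright’s synchronous Python API; the URL and selector are again hypothetical placeholders.

    # pip install playwright && playwright install chromium
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto("https://example.com/deals")  # hypothetical dynamic page

        # Wait until the JavaScript-rendered listings are attached to the DOM.
        page.wait_for_selector("div.product-card")  # hypothetical selector

        # Read text from the rendered page rather than from the raw HTML response.
        names = page.locator("div.product-card h2.title").all_inner_texts()
        browser.close()

    print(names)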

Not an easy ride – challenges of web scraping

As previously mentioned, developers creating web scrapers usually have no control over the target website but are fully responsible for ensuring their scraping service runs smoothly. Here are some common challenges:

  • Website structure changes: If the scraper’s functionality is closely tied to the HTML structure of the target website, even a simple change in layout can completely throw it off. There is no guarantee that the structure will stay the way it is, nor is there any assurance that the developer will be notified that something is about to change. This unpredictability can lead both to unexpected costs for updating the web scraper and to downtime in its operation.
  • Rate limiting: Websites may regulate the number of requests you can make in a given timeframe. Common rate-limiting algorithms include Token Bucket and Leaky Bucket, which allow occasional bursts of traffic but constrain the average rate of incoming requests (a client-side sketch of the Token Bucket idea follows this list). Rate limits can be set based on IP addresses, user sessions, or API keys. Running into a rate limit, depending on the nature of the data being scraped, might mean that obtaining the data takes too long unless the web scraper uses multiple proxies.
  • CAPTCHA: Are you a robot? CAPTCHA is a well-known mechanism for telling humans and computers apart by posing challenges that are computationally hard for bots to solve but relatively easy for humans. CAPTCHAs serve as a barrier against web scraping, automated form submission, and brute-force attacks. Nevertheless, they are not foolproof and can be bypassed using techniques like machine learning-based object recognition or sometimes even by employing human-solving services. CAPTCHA is relatively easy to integrate into a website by using a provider like Google’s reCAPTCHA.
  • Browser fingerprinting: Websites can store data in cookies and local storage to identify a user; identifying a user can be as simple as saving one piece of data with a unique identifier. But could a user still be identified and tracked without the ability to use cookies or local storage? It turns out they can – through a combination of the user-agent string, screen resolution, installed fonts, plugins, and even behavior like mouse movements or keystroke dynamics. In aggregate, these attributes can create a unique “fingerprint” for each user. From the perspective of web scraping, this poses a challenge: programmatic behavior is usually repetitive in nature, which can cause the website to flag it as potentially automated activity. While hard to circumvent, it’s not impossible using modern methods such as rotating user-agents, modifying screen dimensions, and even mimicking random mouse movements.
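To illustrate the rate-limiting side, here is a minimal client-side Token Bucket in Python – the same algorithm many servers use to cap request rates. A polite scraper can apply it to its own outgoing requests to stay under a target’s limits; the capacity and refill rate below are assumptions you would tune per site.

    import time


    class TokenBucket:
        """Allows short bursts while capping the average request rate."""

        def __init__(self, capacity: float, refill_rate: float):
            self.capacity = capacity        # maximum burst size (tokens)
            self.refill_rate = refill_rate  # tokens added per second
            self.tokens = capacity
            self.last_refill = time.monotonic()

        def acquire(self) -> None:
            """Block until one token is available, then consume it."""
            while True:
                now = time.monotonic()
                # Refill proportionally to elapsed time, up to capacity.
                elapsed = now - self.last_refill
                self.tokens = min(self.capacity,
                                  self.tokens + elapsed * self.refill_rate)
                self.last_refill = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
                # Sleep just long enough for one token to accumulate.
                time.sleep((1 - self.tokens) / self.refill_rate)


    # Hypothetical usage: ~2 requests/second on average, bursts of up to 5.
    bucket = TokenBucket(capacity=5, refill_rate=2)
    for url in ["https://example.com/p/1", "https://example.com/p/2"]:
        bucket.acquire()
        # ... fetch url here, e.g. with requests.get(url) ...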

Taking web scraping to the next level

Building a web scraper is a time-consuming process with no guarantee that the final product will be maintenance-free. From adapting to the dynamic and sometimes inventive nature of websites to overcoming obstacles designed to hinder a scraper’s effectiveness, the path to creating a reliable web scraper is often fraught with challenges.

Fortunately, solutions like Bright Data—a comprehensive, award-winning suite of web scraping tools—can significantly improve the web scraper development experience. Bright Data is not just another scraping library but a full powerhouse of functionality, tailored web scraping templates, and proxies. Together, these features allow developers to abstract away the intricacies of scraping and focus on what they are actually building.

According to Erez Naveh of Bright Data: “We have customers that range from dedicated web scraping teams, to a huge e-commerce business that needs to keep track of all the prices in the market, to single developers that don’t have many resources. While large customers might already have an entire web scraping department with machine learning talent, small ones usually don’t and cannot efficiently deal with the challenges on their own. We have solutions for both of them.”

What makes Bright Data so valuable? Let’s have a look through some of the most useful features:

  • Proxies: A 72-million-strong, ethically sourced proxy network that includes residential proxies, ISP proxies, and even IPs from mobile networks around the world. This extensive network not only allows your web scraper to view websites from various perspectives but also addresses many of the rate-limiting and browser fingerprinting issues discussed earlier (a minimal proxy-rotation sketch follows this list).
  • Scraping Browser: A specialized automated browser designed to streamline the web scraping process. It offers a 3-in-1 solution that integrates proxy technology, automated website unblocking, and browser functionalities. Compatible with popular scraping frameworks like Puppeteer, Playwright, and Selenium, the Scraping Browser manages challenges like CAPTCHA solving, proxy rotation, and browser fingerprinting automatically. Hosted on Bright Data’s scalable infrastructure, it allows for cost-effective scaling of data scraping projects.
  • Web Scraper IDE: An all-in-one tool for efficient and scalable web scraping. A developer can jumpstart a project with pre-made templates for popular data sources (like LinkedIn, Amazon, and YouTube) and debug the results on the fly with interactive previews. If you’re after data from search engines like Google or Bing, Bright Data also provides a SERP API, which makes it easy to turn search results into actionable data insights.
  • Ready datasets: If building a web scraper is not your thing, ready-made data may be a better fit. Bright Data offers fresh datasets from some of the most popular public websites, from LinkedIn to Amazon, and buying one is often cheaper than scraping the data yourself. And if analyzing the obtained data isn’t your thing either, you can use Bright Insights to receive actionable eCommerce market intelligence.
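As a rough illustration of how a proxy pool helps with the rate-limiting and fingerprinting issues above, here is a minimal rotation sketch using the requests library. The proxy URLs are placeholders; a managed network like Bright Data’s typically handles rotation for you behind a single endpoint.

    import itertools

    import requests

    # Hypothetical proxy endpoints – in practice these come from your provider.
    PROXIES = itertools.cycle([
        "http://user:pass@proxy1.example.com:8080",
        "http://user:pass@proxy2.example.com:8080",
        "http://user:pass@proxy3.example.com:8080",
    ])

    def fetch(url: str) -> requests.Response:
        """Send each request through the next proxy in the pool."""
        proxy = next(PROXIES)
        return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)

    response = fetch("https://example.com/products")
    print(response.status_code)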

Conclusion

In 2023, web scraping remains a pivotal activity for data collection across various industries, from eCommerce to travel. However, the process is often convoluted and laden with challenges like ever-changing website structures and security mechanisms. Bright Data emerges as a comprehensive solution, offering an extensive suite of web scraping tools that streamline the process for developers. It provides a robust proxy network to navigate around rate-limiting issues at scale and a Scraping Browser to facilitate efficient data extraction. Additionally, Bright Data offers pre-scraped datasets, serving as an all-encompassing resource for both novice and experienced web scrapers.

What is coming in the future for web scraping products? While the race to access websites at scale continues much as it always has, new technological breakthroughs like LLMs make it possible not only to scrape websites more effectively but also to make better use of the extracted data.

Erez Naveh spoke to us about the future of web scraping and said “We found so many useful use cases of LLMs that I believe that the next year or a couple of years will be just figuring out how to leverage it and optimize it to the benefit and value of our customers. For instance – a fun example. In the pre-collected datasets, users can press a button and add a new smart column and assign a prompt to it. The new column will be filled with data in an AI-enhanced way almost in an instant, without having to spend time training any new models.”

Full disclosure: Bright Data is a sponsor of Software Engineering Daily.

Paweł Borkowski

Paweł is a freelance software engineer with a dozen years of experience building early-stage products at startups and FTSE 100 corporations. He is currently working on flat.social (https://flat.social) and glot.space (https://glot.space). Follow Paweł on his personal website (https://pawel.io), Twitter (https://twitter.com/pawel_io), or LinkedIn (https://www.linkedin.com/in/borkowskip/).
