Web Scraping in the AI Era: Feeding the Machine Learning Beast

Jason Grad
Proxy Network Manager
November 29, 2023

In the race to develop the most adept AI models, one factor consistently emerges as critical: data. Large Language Models (LLMs) highlight this, consuming vast volumes of tokens for training. As these AI models grow in complexity and capability, web scraping, often facilitated by residential proxies, becomes an instrumental tool, ensuring access to the extensive and varied datasets they demand.

Tokens: The Basic Building Blocks

Before delving deeper, it's imperative to understand what a token is. In LLMs, tokens can represent various linguistic units, from individual characters to whole words. Think of tokens as unique pieces in a puzzle; each one holds specific information, and when they come together, they form a coherent picture, or in AI's case, a comprehensive understanding of language. Depending on the approach, a sentence like "Web scraping is essential" might be broken down into four tokens (each word as a token) or more if punctuation and smaller subword units are counted separately.
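To make that concrete, here is a minimal sketch contrasting naive word-level and character-level tokenization of that sentence. It uses plain Python string operations rather than a real tokenizer; production LLMs typically use subword schemes such as BPE, which land somewhere between these two extremes.

```python
# Minimal illustration of two naive tokenization schemes.
# Real LLMs use subword tokenizers (e.g. BPE), which fall between these extremes.
sentence = "Web scraping is essential"

# Word-level: split on whitespace
word_tokens = sentence.split()
print(word_tokens)        # ['Web', 'scraping', 'is', 'essential']
print(len(word_tokens))   # 4 tokens

# Character-level: every character (including spaces) is a token
char_tokens = list(sentence)
print(len(char_tokens))   # 25 tokens
```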

Voracious Data Appetites of LLMs

The paper titled “Training Compute-Optimal Large Language Models” explores the optimal model size and number of tokens for training a transformer language model. The researchers trained over 400 language models, ranging from 70 million to over 16 billion parameters, on 5 to 500 billion tokens [1]. They found that for compute-optimal training, model size and the number of training tokens should be scaled equally: for every doubling of model size, the number of training tokens should also double.

LLMs underscore AI's increasing hunger for data. OpenAI's 2020 model, trained on 300 billion tokens, had 175 billion parameters, implying roughly 1.7 tokens per parameter [2]. In 2022, DeepMind's investigation into the optimal ratio favored more tokens and fewer parameters, at roughly 20 tokens per parameter. A model with 70 billion parameters trained on 1.4 trillion tokens outperformed OpenAI's 175 billion parameter model while requiring less fine-tuning and incurring lower inference costs [1]. These numbers emphasize the depth and breadth of data that modern AI training demands.
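As a rough illustration of what the roughly 20-tokens-per-parameter rule of thumb implies for data requirements, here is a quick back-of-the-envelope sketch. The exact ratio varies with compute budget, so treat the numbers as indicative only.

```python
# Back-of-the-envelope: training tokens implied by the ~20 tokens/parameter
# rule of thumb from the compute-optimal scaling work. Illustrative only;
# the optimal ratio depends on the available compute budget.
TOKENS_PER_PARAM = 20

for params_billion in (7, 70, 175):
    tokens_billion = params_billion * TOKENS_PER_PARAM
    print(f"{params_billion}B parameters -> ~{tokens_billion / 1000:.1f}T training tokens")

# 7B parameters   -> ~0.1T training tokens
# 70B parameters  -> ~1.4T training tokens (matches the 70B / 1.4T setup above)
# 175B parameters -> ~3.5T training tokens
```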

The Role of Web Scraping

Here, web scraping takes center stage. It acts as a digital miner, sifting through the vast internet landscape—from e-commerce sites to forums—to extract valuable data gold. Diversity is key. For comprehensive training, models require a broad spectrum of information, making the role of web scraping even more crucial.

The Role of Residential Proxies in Web Scraping

But how do web scrapers access the sheer volume of data they need without being blocked or flagged? This is where residential proxies come in. These proxies mask the scraper's activities, making them appear as genuine user requests. By routing data extraction processes through real residential IP addresses, they lend web scrapers a cloak of legitimacy, ensuring steady, undetected access to a wide range of data sources.
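In practice, routing a request through a residential proxy usually just means pointing your HTTP client at the endpoint your provider supplies. The sketch below uses Python's requests library; the proxy host, port, and credentials are placeholders, not real values.

```python
import requests

# Hypothetical residential proxy endpoint and credentials;
# substitute the values your proxy provider supplies.
PROXY = "http://username:password@proxy.example.com:8080"

proxies = {
    "http": PROXY,
    "https": PROXY,
}

# The request exits through a residential IP, so the target site sees
# ordinary consumer traffic rather than a datacenter address.
response = requests.get("https://example.com/products", proxies=proxies, timeout=30)
print(response.status_code)
print(len(response.text), "characters of HTML retrieved")
```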

AI and Web Scraping: A Reciprocal Evolution

The relationship between AI and web scraping is also symbiotic. As digital spaces become more intricate, basic scraping tools can struggle; AI-driven algorithms navigate these challenges, identifying patterns and ensuring efficient data extraction.

Tokenization and Data Feeding

With a clearer understanding of tokens, it's evident why they're pivotal to LLMs. Web scraping tools must ensure that the data they retrieve can be efficiently tokenized and processed by AI models. Whether a token is a word or a character can influence how information is extracted and understood.
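As a rough sketch of what "tokenization-ready" can mean in practice, the snippet below strips scraped HTML down to visible text with BeautifulSoup and uses a whitespace word count as a crude stand-in for a true token count. The HTML sample is made up for illustration, and real subword tokenizers typically produce somewhat more tokens than words.

```python
from bs4 import BeautifulSoup

# Raw HTML as it might come back from a scraper (illustrative sample)
html = """
<html><body>
  <nav>Home | About</nav>
  <article><h1>Web scraping is essential</h1>
  <p>LLMs consume vast volumes of tokens for training.</p></article>
  <script>trackPageView();</script>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Drop markup that carries no training value (scripts, styles, navigation chrome)
for tag in soup(["script", "style", "nav"]):
    tag.decompose()

# Collapse whitespace into clean, tokenizer-friendly text
text = " ".join(soup.get_text(separator=" ").split())
print(text)

# Crude token estimate: word count (subword tokenizers usually yield more)
print(f"~{len(text.split())} tokens")
```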

Ethical and Responsible Data Extraction

As web scraping solidifies its role in feeding data to AI, the weight of responsibility grows heavier. Not only must scrapers operate within legal and ethical bounds, they also need to ensure the data they extract doesn't imprint biases onto AI models. The challenge deepens with LLMs: fed vast amounts of data, these models often break it down and recreate it in such intricate ways that discerning plagiarism becomes nearly impossible.

Legal ramifications are already emerging, with various court cases starting to scrutinize the fine line LLMs tread between inspiration and imitation. Stay tuned for a forthcoming article, where we'll delve deeper into the intriguing legal landscape surrounding LLMs.

In Conclusion

In the intricate dance of AI's evolution, web scraping stands out as a leading partner, sourcing and delivering the data that fuels AI's engine. As AI continues to push boundaries, the relationship between data extraction and advanced models will only grow stronger and more intertwined.


Sources

[1] https://arxiv.org/abs/2203.15556
[2] https://www.mssqltips.com/sqlservertip/7786/large-language-models-train-ai-tools-chatgpt/
