How To Make Money Data Labelling

What is the state of AI at the moment?

Have you ever sat through a child's football game, or do you remember competing in one yourself?

Then you’ll be familiar with the concept of “everyone chasing the ball”.

Kids tend to lack complete awareness of the concepts of spacing, lateral movement, and trying to see or move in a different way from the crowd to create maximum effectiveness. 

In more ways than one, the tech industry is like a group of children.

The AI football has been booted onto the pitch, and everyone is falling over themselves to get to it.

But there’s another area that I predict is going to be incredibly lucrative, and that is largely being ignored, especially by the LinkedIn influencers who claim to wake up before they’ve even gone to bed, and in some cases, can bend the laws of spacetime to their very whims.

As the saying goes, when there is a gold rush, sell shovels.

In this case, the “shovels” are the thing powering the whole industry: data.

Why do AI companies need so much data?

AI companies need vast amounts of data to train machine learning models effectively. Large datasets help models learn patterns, make accurate predictions, and generalise well across diverse scenarios, ultimately improving performance and reliability in real-world applications.

It’s no secret that truly powerful AI takes an absolutely eye-watering amount of data to train the models. GPT-4 was reportedly trained on 45 gigabytes of training data, which is a whopping 22,500,000,000 words, and there are still some people who say it is getting comparatively stupider.
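As a quick sanity check on those two figures, here is a minimal sketch of the arithmetic linking them. It assumes a decimal gigabyte (10^9 bytes) and simply takes the 45 gigabytes and 22,500,000,000 words quoted above at face value; the variable names are my own.

```python
# Assumption: 1 gigabyte = 10**9 bytes (decimal, not binary).
GIGABYTE = 10**9

dataset_bytes = 45 * GIGABYTE      # the 45 gigabytes quoted above
word_count = 22_500_000_000        # the word count quoted above

# How many bytes per word do the two figures imply?
bytes_per_word = dataset_bytes / word_count

print(f"Implied bytes per word: {bytes_per_word}")
```

The two numbers only line up if you assume two bytes per word. Plain English text usually works out at more like five or six bytes per word once you count the spaces, so treat both figures as ballpark rather than gospel.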

Most of that data is coming from web crawling or publicly available sources. 

But a sea change has occurred in relation to AI companies harvesting data this way. (I sort of pseudo-predicted it in one of my blogs, in the same way you can predict that if I drink 8 shots of tequila, I’m probably going to fall over: not super impressive, but still technically correct.)

“The Social Media Company Formerly Known As Twitter” and Reddit have led the charge, enacting quite honestly wild and anti-user policies on API usage, and making an effort to block web crawlers.

Combine that with the growing backlash against AI companies’ data usage (and AI companies in general), plus the fact that the internet is increasingly made up of AI content, meaning AI companies’ crawlers are feeding their models AI-generated content in a kind of grotesque Ouroboros, and fresh data obtained by non-controversial means has never been more in demand.

This leaves a massive opportunity for companies to utilise their data for financial gain, either by selling it to AI companies, or even better, leveraging it themselves to create industry-specific AI tools that can help their customers.

But when it comes to data, two things are going to be heavily important in the next five years if you want to capitalise on this opportunity:

Eliminating the middleman between users and data sources

There is an argument to be made that UX/UI will never be the same again after the advent of LLM chatbots. 

We could be right on the precipice of a wholesale shift from graphical user interfaces to a conversational UI system in the next five to ten years. Why would you spend hours crawling through lists and collating sources using traditional UIs and search engines, when you could just have a conversation with a chatbot that provides you with the information you’re looking for?

So where do middlemen like marketplaces and data aggregation businesses (think Experian) stand now that the nature of digital behaviour is changing? 

Well, traditionally, there was a relatively equal split in data management and UI development for these businesses. The data had to be good quality and the interface had to be intuitive and snappy. 

This is going to change.

Companies should now be putting most of their emphasis on the data side of the equation, and focusing their UI efforts on creating conversational interfaces powered by AI.

This doesn’t mean that everything will become a chatbot, but it does mean that the very structure of how digital domains work should change.

At the advent of the internet, we had static pages with no interactivity. Then in the Web2 phase, we had digital products where users could interact with the product by uploading their own content. Now, we are going to enter a phase where user interaction can also take the form of conversational input, like telling the product exactly what they want/need from it, and the product will adapt and respond in real time.

Organisations that adopt this key change and prioritise their data collection will see success akin to the companies that capitalised on the emergence of UGC at the start of Web2, i.e. Scrooge McDuck levels of coin. For example, why do you think Meta were so interested in making a text-based social media platform? Threads had 95 million posts in 24 hours, and I'd bet my bottom dollar that this data will be used to train Meta's next generation of AI models. This kind of proactive thinking will make some companies incredibly powerful. Why not be one of them?

Labelling your data can give you a massive advantage in the AI race

Companies can monetise their labelled data by selling it to third-party AI firms. These firms need high-quality, annotated datasets to train their models, so the companies can generate revenue while supporting AI development and innovation in various industries.

Generating massive amounts of data isn’t enough. Most businesses, organisations and even people already generate huge amounts of data. In fact, by one widely quoted estimate, for each second you spend online, you generate 1.7 megabytes (MB) of data.
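To put that 1.7 MB per second into perspective, here is a rough back-of-the-envelope sketch. The function name is made up for illustration, and it assumes the rate stays constant and that 1 GB is 1,000 MB.

```python
# Figure quoted above: 1.7 megabytes generated per second online.
MB_PER_SECOND = 1.7


def data_generated_mb(seconds_online: float) -> float:
    """Estimate the megabytes generated over a period, assuming a constant rate."""
    return MB_PER_SECOND * seconds_online


# One hour online is 3,600 seconds.
per_hour_mb = data_generated_mb(60 * 60)
per_hour_gb = per_hour_mb / 1000  # assumption: 1 GB = 1,000 MB

print(f"One hour online: about {per_hour_mb:.0f} MB ({per_hour_gb:.2f} GB)")
```

On those assumptions, a single hour online works out to roughly 6 GB of data per person, which gives you an idea of how much raw material is sitting around unused.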

But the sticking point is that this data is unorganised and crucially, unlabelled.

Ask any data scientist, and they’ll tell you that the first stage of any data project is cleaning and organising the data.

There is money in this stage, and a savvy, data-led organisation can complete it relatively easily.
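To make the cleaning and labelling stage a bit more concrete, here is a minimal sketch of what it might look like for a pile of customer messages. Everything in it is hypothetical: the keyword rules, the label names and the example messages are invented for illustration, and real labelling would usually involve human review or something smarter than keyword matching.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabelledRecord:
    text: str
    label: str


# Hypothetical keyword-to-label rules, invented for this example.
KEYWORD_LABELS = {
    "refund": "billing",
    "invoice": "billing",
    "password": "account",
    "login": "account",
}


def clean(raw_texts: list[str]) -> list[str]:
    """Collapse whitespace, drop empty entries and case-insensitive duplicates."""
    seen: set[str] = set()
    cleaned: list[str] = []
    for raw in raw_texts:
        text = " ".join(raw.split())
        key = text.lower()
        if text and key not in seen:
            seen.add(key)
            cleaned.append(text)
    return cleaned


def label(text: str) -> str:
    """Return the label of the first matching keyword, else 'unlabelled'."""
    lowered = text.lower()
    for keyword, label_name in KEYWORD_LABELS.items():
        if keyword in lowered:
            return label_name
    return "unlabelled"


def clean_and_label(raw_texts: list[str]) -> list[LabelledRecord]:
    return [LabelledRecord(text, label(text)) for text in clean(raw_texts)]


records = clean_and_label([
    "  Where is my   refund? ",
    "where is my refund?",
    "",
    "I forgot my password",
    "Love the new design",
])
for record in records:
    print(record)
```

Anything that falls through to "unlabelled" is the pile you would hand to a human, which is exactly where the time (and the money) goes.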

OpenAI has taken a similar approach to social media companies, outsourcing data labelling to countries with low operating costs, throwing out contracts worth hundreds of thousands of dollars a year, but is now facing a significant backlash as these workers are paid very little and sometimes asked to label distressing material.

Now imagine you can provide pre-labelled data, allowing OpenAI (or any other AI company fat off VC funding) to bypass the moral ambiguity of outsourcing, and you have a recipe for a hell of a sales pitch to one of the most funded companies in the world.

Or, if you want to take it a step further, imagine your own language model powered by every piece of marketing collateral/email campaign/ad brief you’ve ever created guiding your prospects through the entire marketing journey. Leveraging your organised data is going to be absolutely huge for companies, especially B2B, where the impacts on sales pipeline nurture will be astronomical if done correctly.

Mastering these two key concepts is going to be the hallmark of organisations that don't just survive the AI boom, but thrive during it.
