Using Content as Data to Inform Machine Learning

Processing Big Data was so 2015. 2018 is the year of using tools, algorithms, and platforms for machine learning to find adaptive solutions to complex problems.

As developers we’re tasked with applying the latest technologies to attack problems. And while it’s important to know how to implement the programs, it’s just as important to know when it’s not the right time to use the resources. This is why MassTLC is hosting a special developers’ event in applied machine learning on January 24th. The post below, authored by Katherine Bailey, Principal Data Scientist at Acquia, is an example of the type of discussion that you can expect. This post first appeared on the Acquia blog.


In the past year, few trends have grown as rapidly as machine learning (ML) and artificial intelligence (AI). At Acquia, we are using ML techniques to make digital experiences more engaging. The trick is to understand how you can use content as data. The more we understand about our users and content, the more engaging and fruitful experiences will be. This might sound challenging, but there are a variety of approaches that have been developed by the ML field that make this strategy possible.

Enterprise data is (relatively) small data

Acquia supports a variety of enterprise customers and needs to be smart about working with smaller amounts of data. There are certain techniques that can help, such as transfer learning, few-shot learning and human-in-the-loop machine learning. I’ll explain what these are in a moment, but first, what type of data are we even talking about here?

Playing to our strengths

It is often suggested that companies looking to make use of machine learning techniques should begin with the data that defines them as a business and work from there. At Acquia, we’re looking to provide ML capabilities for products used by customers whose data spans a wide variety of domains. No matter the industry, all of our customers have a lot of content. The good news is that we can treat content as data.

This means the words making up blog posts, news articles or product descriptions need to be understood by machines. The thing is, machines don’t understand words; they only understand numbers. This requires us to come up with numeric representations of the words that will work for tasks like classifying content, delivering recommendations of similar content, and identifying duplicate content.

Word Embeddings

In 2003, Bengio et al. proposed an ingenious idea to learn representations of words that capture semantic meaning. In 2013, a team of researchers at Google made this practical with the Word2Vec algorithm. Word2Vec and other recent approaches (such as GloVe from Stanford) learn from massive corpora of text, e.g. millions of Google News articles or all of Wikipedia. The representations they learn after chomping through all of that text are numeric vectors of, say, 300 dimensions. Simply put, this means that the representation for a single word is a long list of numbers. The beauty of it is that the mathematical relationship between those vectors manages to capture the semantic relationship between the words. The classic example given is king – man + woman ≈ queen: subtract the vector for “man” from the vector for “king,” add the vector for “woman,” and the word whose vector lies closest to the result is “queen.”
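
To make the vector arithmetic concrete, here is a minimal sketch in Python. The three-dimensional vectors are invented purely for illustration (real Word2Vec or GloVe vectors are learned from text and have hundreds of dimensions), and the helper names `cosine` and `analogy` are our own, not part of any library.

```python
import math

# Toy vectors made up for this example; they are NOT real embeddings.
TOY_VECTORS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c, vectors):
    """Return the word closest to vectors[a] - vectors[b] + vectors[c].

    The three input words are excluded from the candidates, which is the
    usual convention when evaluating word analogies.
    """
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


result = analogy("king", "man", "woman", TOY_VECTORS)
print(result)  # queen
```

With real pre-trained vectors the result is only approximately equal to the vector for “queen,” which is why the nearest neighbor is taken rather than testing for exact equality.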

What’s really neat is that once those representations have been learned from a massive dataset, they can then be used in other tasks like classifying content into categories.

Learn from the best, transfer to the rest.

The ML technique known as transfer learning is about knowledge gained through training on one task being reused in solving another task. Using pre-trained word embeddings is an example of that. We can take the word embeddings trained by Google or Stanford and transfer them for use in our own tasks.
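
As a rough sketch of what “transferring” embeddings can look like in practice, the Python below parses vectors stored in the plain-text layout that GloVe files use (one word per line, followed by its space-separated numbers) and keeps only the words we care about. The sample lines, the function name `load_text_embeddings` and the `site_vocabulary` set are all hypothetical stand-ins; a real file would be read from disk and would hold far larger vectors.

```python
import io


def load_text_embeddings(lines):
    """Parse 'word v1 v2 ...' lines into a dict of word -> list of floats."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


# Made-up sample standing in for a downloaded pre-trained file.
SAMPLE_FILE = io.StringIO(
    "the 0.1 0.2 0.3\n"
    "content 0.4 0.5 0.6\n"
    "article 0.7 0.8 0.9\n"
)

pretrained = load_text_embeddings(SAMPLE_FILE)

# Hypothetical vocabulary taken from our own content. Words missing from
# the pre-trained set ("zzyzx" here) simply have no vector to reuse.
site_vocabulary = {"content", "zzyzx"}
usable = {w: pretrained[w] for w in site_vocabulary if w in pretrained}
print(usable)  # {'content': [0.4, 0.5, 0.6]}
```

Nothing is trained here: the knowledge lives in the numbers someone else already learned, and our task only looks them up.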

One such task is similarity-based content recommendations. If we have numeric representations of our content that capture semantics, then we automatically have a measure of similarity between pieces of content. Even if two pieces of content are talking about the exact same topic but using different words, they will still be identified as similar due to the nature of these representations. This is not true of traditional approaches to representing words as numbers in machine learning, because those numbers are counts of particular words in documents, so two documents that share no words look completely unrelated.


This technique can understand that while these statements are worded differently, they have the same meaning.
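
The contrast between count-based and embedding-based similarity can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the two-dimensional word vectors are invented, and averaging word vectors is just one simple way to represent a document, not necessarily the method used in any particular product.

```python
import math
from collections import Counter

# Invented 2-d word vectors; real embeddings are learned, not hand-written.
TOY_WORD_VECTORS = {
    "car": [0.9, 0.1],
    "automobile": [0.85, 0.15],
    "fast": [0.7, 0.3],
    "quick": [0.72, 0.28],
    "recipe": [0.1, 0.9],
    "tasty": [0.2, 0.8],
}


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def count_vector(tokens, vocabulary):
    """Traditional representation: how often each vocabulary word occurs."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def mean_embedding(tokens, vectors):
    """Embedding representation: average the vectors of the known words."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * len(next(iter(vectors.values())))
    return [sum(column) / len(known) for column in zip(*known)]


doc_a = ["fast", "car"]
doc_b = ["quick", "automobile"]  # same topic as doc_a, different words
doc_c = ["tasty", "recipe"]      # different topic
vocabulary = sorted(set(doc_a + doc_b + doc_c))

count_sim = cosine(count_vector(doc_a, vocabulary),
                   count_vector(doc_b, vocabulary))
embed_sim_ab = cosine(mean_embedding(doc_a, TOY_WORD_VECTORS),
                      mean_embedding(doc_b, TOY_WORD_VECTORS))
embed_sim_ac = cosine(mean_embedding(doc_a, TOY_WORD_VECTORS),
                      mean_embedding(doc_c, TOY_WORD_VECTORS))

print(count_sim)                    # 0.0, because no words are shared
print(embed_sim_ab > embed_sim_ac)  # True
```

A recommender built on this idea would rank candidate pieces of content by their similarity to the piece being viewed and surface the top few.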

Learning from very few examples

You may have already heard the phrase “data is the new oil.” Someone took it a step further at the O’Reilly AI Conference in 2017 by proposing that “labeled data is the new ‘new oil.’ ” For classification tasks, few-shot learning stands in contrast to standard deep learning approaches, which require enormous quantities of labeled data.

The key to being able to learn from very few examples is having great representations of your data. For this reason, transfer learning and few-shot learning often go hand-in-hand. You transfer the knowledge from some previous task and use it to create representations of your data. Labeling just one or two examples per category then allows all the others to be labeled automatically, because each unlabeled item can take the label of the examples it sits closest to. This is our approach to automated content tagging.
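
One simple way to picture this is a nearest-centroid tagger: average the handful of labeled vectors for each tag, then give every unlabeled item the tag whose average it most resembles. The sketch below assumes the content has already been turned into vectors; the vectors, tag names and the function `tag_by_nearest_centroid` are hypothetical, and this illustrates the general idea rather than describing a specific production system.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def tag_by_nearest_centroid(labeled, unlabeled):
    """Assign each unlabeled vector the tag with the most similar centroid.

    labeled:   dict mapping tag -> list of example vectors (one or two each)
    unlabeled: dict mapping document id -> vector
    """
    centroids = {
        tag: [sum(column) / len(examples) for column in zip(*examples)]
        for tag, examples in labeled.items()
    }
    return {
        doc_id: max(centroids, key=lambda tag: cosine(vector, centroids[tag]))
        for doc_id, vector in unlabeled.items()
    }


# Invented document vectors; in practice these would come from embeddings.
labeled_examples = {
    "sports": [[0.9, 0.1], [0.8, 0.2]],
    "cooking": [[0.1, 0.9]],
}
unlabeled_docs = {
    "post-1": [0.7, 0.3],
    "post-2": [0.2, 0.8],
    "post-3": [0.95, 0.05],
}

tags = tag_by_nearest_centroid(labeled_examples, unlabeled_docs)
print(tags)  # {'post-1': 'sports', 'post-2': 'cooking', 'post-3': 'sports'}
```

The quality of the tags depends almost entirely on the quality of the vectors, which is why good representations matter more here than the amount of labeled data.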

‘Human in the Loop’ ML

A solution to the problem of lack of labeled training data is to get humans to label your data. This is called human-in-the-loop (HitL) ML, a term that may well have been coined by the founder of a company called CrowdFlower, which specializes in a crowdsourced approach to this technique. They’ll take your unlabeled data and get humans to label all of it for you. Another company, Mighty AI, is focused specifically on training data for autonomous vehicles. Anyone with an iPhone can earn a few cents a go by labeling pedestrians, lamp posts, parked cars etc. in images.

Humans can be made part of the loop in other, less straightforward ways than labeling entire training sets to feed into ML algorithms. Any application or service that explicitly asks users for feedback in the form of ratings (Netflix movie ratings, for example) can be thought of as employing HitL. The company Stitch Fix, which provides a clothing service where they send customers a regular “fix” of clothing items selected by a stylist, gets a lot of upfront data from users by asking them to rate styles through a series of photos. The more data they can get from their users up front, the less they have to infer through purchasing behavior. This is important to the success of their service because, without HitL, initial “fixes” would stand a poor chance of being purchased. Companies that use HitL understand that the UI they present to the human in their loop is of vital importance.

Recognizing where UX and Engineering play their role

In the current wave of excitement over ML, a lot of advice is being offered to companies on how to incorporate these techniques to improve their business. Depending on who’s offering it, the advice differs greatly. Those in the business of training and recruiting data scientists will tell you that you need lots of data scientists, whereas those in the business of selling “Machine Learning as a Service” (MLaaS) solutions will say you don’t need any data scientists at all. The reality, of course, is somewhere in between.

It is definitely important to have people who know how to frame your business’ problems as data science or machine learning problems and make sure the data needed to solve them is available. Simply getting your engineers to feed masses of data into Amazon’s or Google’s MLaaS is not going to achieve very much. On the other hand, data scientists alone probably can’t do everything. If you’re building a product, just one or two data scientists working with engineers and UX professionals will be far more effective than 10 data scientists. The right mix depends on what you’re trying to accomplish.

At Acquia, we’re using ML to enhance our SaaS offerings and have built a team focused specifically on this area. It includes data scientists, data engineers, front-end engineers, and back-end engineers. The team also works very closely with our UX team. Where we’re using HitL, UX is absolutely vital to ensuring we get the data we need to support our learning algorithms to make them as accurate as possible. Other efforts don’t entail a HitL aspect but require skilled engineers to ensure that services delivering ML predictions are performant and scalable.

We don’t have anyone on the team with a PhD in artificial intelligence or machine learning. Perhaps one day we will. In the meantime, we have smart people who are familiar with the types of solutions that machine learning research has developed (many of which are available in open source libraries) and the types of problems to which they are best applied. This expertise, coupled with strong engineering and UX skills, is what we need to execute our ML strategy. If we didn’t have a well-thought-out strategy on how to play to our strengths, make use of publicly available datasets and open source libraries, and incorporate the other necessary technical functions in our efforts, an AI PhD would struggle to add value.
