Introduction to Snorkel
Created at Stanford in 2016, Snorkel is a system for programmatically building and managing training datasets without the need for manual labelling. It replaces the laborious task of having humans label enough data to train accurate machine learning models, saving both time and money. Win-win! In this example we will look at how Snorkel can be used as part of a recommender system to label the relevance of different books for a user. The dataset used will be an augmented and normalised version of the Goodreads dataset, containing user-book pairings and extensive metadata on each book.
A Normalised Data Format
The dataset being used is a normalised version of the Goodreads dataset, which has been transformed from wide data, with a column for each variable, to long data with a key column and a value column. This format allows the dataset to be expanded easily in the future simply by adding new rows, without having to add new columns or tables. Below is an example of the structure of the data:
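The rows below are purely illustrative, but the columns and the average.rating and language.code keys match those used in the code throughout this post:

item_id | key            | value
b001    | average.rating | 4.21
b001    | language.code  | eng
b002    | average.rating | 1.87
b002    | language.code  | fre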
This data format makes it very easy to search for the data needed to build labelling functions. The code snippet below shows how the average rating for each book can be found using Python and PySpark:
# Find each book's average rating from the long-format metadata
avg_ratings = (meta_df
               .select('item_id', 'value')
               .filter(meta_df.key == "average.rating")
               .collect())
Labelling Functions
A core component of the labelling process is a series of labelling functions, each representing a heuristic or noisy labelling strategy that allows users to encode domain-specific information into the dataset. When it comes to labelling functions, quantity is often better than quality, and it is not essential that any one function be 100% accurate. One of the fantastic things about Snorkel is that it automatically estimates the accuracies and correlations of these functions and combines their output labels into a single high-quality set. Below is a Python snippet where we import the relevant Snorkel functions and set up the labels that will be used and their values:
from snorkel.labeling.lf import labeling_function

# Define the labels
RELEVANT = 1
NOT_RELEVANT = 0
ABSTAIN = -1
Generic Labelling Function
This first type of labelling function makes generic assumptions based on the metadata; in this example the relevance of a book will be labelled based on its average rating from all users. We are going to make the naïve assumption that any book with an average rating of four stars or more is good and will be worth reading, and any book rated two stars or fewer is bad and won't be worth reading. The first step is to find the average rating for each book based on its ID and build a dictionary that can be passed into the labelling function and used to look up the rating of any given book.
# Find ItemIDs and their average rating
avg_ratings = (meta_df
               .select('item_id', 'value')
               .filter(meta_df.key == "average.rating")
               .collect())

# Create a list of ItemIDs
ids = [row['item_id'] for row in avg_ratings]

# Create a list of average ratings, cast to float so they can be
# compared numerically (the long format stores all values as strings)
ratings = [float(row['value']) for row in avg_ratings]

# Create a dictionary where the key is the ItemID and the value is the average rating
book_to_avg_rating = dict(zip(ids, ratings))
Next the labelling function itself is created, accepting the lookup dictionary as a resource. The argument x corresponds to a single row of the user-book data to which the labelling functions are applied, so for each ItemID the function will look up the rating and assign a label based on its value.
@labeling_function(resources=dict(book_to_avg_rating=book_to_avg_rating))
def average_rating(x, book_to_avg_rating):
    # Books rated four stars or more are labelled relevant
    if book_to_avg_rating[x.item_id] >= 4:
        return RELEVANT
    # Books rated two stars or fewer are labelled not relevant
    elif book_to_avg_rating[x.item_id] <= 2:
        return NOT_RELEVANT
    # Otherwise make no claim either way
    else:
        return ABSTAIN
Using this basic structure, a number of labelling functions can be created looking at not only the rating of a book but also the number of ratings or the number of unique tags assigned to it, allowing as much domain information to be captured as needed. A future improvement would be a more intelligent approach to choosing the rating thresholds that determine relevance, such as looking at the distribution of ratings and selecting specific percentiles, as sketched below.
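A minimal sketch of that improvement, assuming meta_df is the metadata DataFrame used above and picking the 25th and 75th percentiles purely for illustration:

# Derive the relevance thresholds from percentiles of the rating
# distribution rather than hard-coding two and four
low_threshold, high_threshold = (meta_df
    .filter(meta_df.key == "average.rating")
    .select(meta_df.value.cast("float").alias("rating"))
    .approxQuantile("rating", [0.25, 0.75], 0.01))

The labelling function would then compare book_to_avg_rating[x.item_id] against these thresholds instead of the fixed values.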
User-centric Labelling Function
This next type of labelling function incorporates the other books a user has come into contact with into the labelling process. A very simple example is looking at the language the books are written in: a book is unlikely to be relevant to a user if it is written in a language they do not read, making this a fairly self-explanatory function. It is also an opportunity to look at how to pull more user information into labelling functions. The Python snippet below gives another example of creating a dictionary to look up information on a book by its ID, in this instance its publication language, along with creating a dictionary to look up the list of books a user has read by their user ID.
from pyspark.sql import functions

# Find ItemIDs and their language
lang_df = (meta_df
           .select('item_id', 'value')
           .filter(meta_df.key == "language.code")
           .collect())

# Create a list of ItemIDs
ids = [row['item_id'] for row in lang_df]

# Create a list of languages
languages = [row['value'] for row in lang_df]

# Create a dictionary where the key is the ItemID and the value is the language
book_to_language = dict(zip(ids, languages))

# Get a list of books for each UserID
user_items_df = user_df.groupBy('user_id').agg(functions.collect_list('item_id'))

# Create an empty dictionary to store the list of books for each user
user_to_books = {}

# Iterate through each user, creating an entry with the UserID as the key
# and a deduplicated list of ItemIDs as the value
for row in user_items_df.collect():
    user_to_books[row['user_id']] = list(set(row['collect_list(item_id)']))
Next the labelling function is created, this time accepting two dictionaries: the book-to-language lookup and the user-to-books lookup. For each UserID it finds the list of the user's previous books and looks up the language of each. If the language of the current book has been seen previously, the book is labelled relevant; otherwise it is labelled not relevant.
@labeling_function(resources=dict(book_to_language=book_to_language, user_to_books=user_to_books))
def shared_languages(x, book_to_language, user_to_books):
    # Get all books for the current user
    books = user_to_books[x.user_id]
    # Look up the language of each of the user's books
    languages = [book_to_language[book] for book in books]
    # Check if the language of the current book has been seen previously
    if book_to_language[x.item_id] in languages:
        return RELEVANT
    else:
        return NOT_RELEVANT
Using this structure, a whole series of labelling functions can be created, for example checking whether a book is by an author the user has previously read, or whether it belongs to a genre the user commonly reads. Again, there is no limit to the functions that can be created if we have a good understanding of the data available.
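As an illustration, the sketch below extends the pattern to authors. It assumes a book_to_author dictionary built from the metadata in the same way as book_to_language; the author.name metadata key is an assumption, as the real key does not appear in the snippets above.

# Build an ItemID-to-author lookup (the "author.name" key is hypothetical)
author_rows = (meta_df
               .select('item_id', 'value')
               .filter(meta_df.key == "author.name")
               .collect())
book_to_author = {row['item_id']: row['value'] for row in author_rows}

@labeling_function(resources=dict(book_to_author=book_to_author, user_to_books=user_to_books))
def shared_authors(x, book_to_author, user_to_books):
    # Collect the authors of every book the user has previously read
    authors = [book_to_author[book] for book in user_to_books[x.user_id]]
    # A shared author suggests relevance; abstain otherwise, since an
    # unfamiliar author does not necessarily mean the book is irrelevant
    if book_to_author[x.item_id] in authors:
        return RELEVANT
    return ABSTAIN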
Applying Labelling Functions & Training a Model
The final step is to apply the labelling functions to the unlabelled data. This results in a label matrix where each row is a data point and each column is the output of a specific labelling function. A label model is then trained on the label matrix; it automatically estimates the accuracies and correlations of the labelling functions, then reweights and combines their labels to produce a final set of training labels. The following Python snippet shows how this is done using the two labelling functions we created. If more have been created, they are simply added to the list of labelling functions and the rest of the code remains the same.
from snorkel.labeling import PandasLFApplier
from snorkel.labeling.model import LabelModel

# Create a list of the labelling functions
lfs = [average_rating, shared_languages]

# Create an applier object to apply the labelling functions
applier = PandasLFApplier(lfs)

# Apply the labelling functions to the dataset
L_ratings = applier.apply(user_df.toPandas())

# Create and train a label model
label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_ratings, n_epochs=5000, seed=123, log_freq=20, lr=0.01)

# Predict a relevance label for every data point
labelled_data = label_model.predict(L=L_ratings)
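Before relying on the combined labels, it is worth checking how often each labelling function votes and how often they disagree. Snorkel's LFAnalysis utility reports these statistics for the label matrix; a minimal sketch using the matrix produced above:

from snorkel.labeling import LFAnalysis

# Summarise each labelling function: coverage (fraction of points
# labelled), overlaps, and conflicts with other functions
print(LFAnalysis(L=L_ratings, lfs=lfs).lf_summary())

Functions with very low coverage or high conflict rates are good candidates for refinement before retraining the label model.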
Closing Thoughts
Snorkel is a fantastic tool for rapidly creating labelled datasets from unlabelled data, and it can save a huge amount of time and money that would otherwise be spent on manual labelling. There is some time investment required to create a suitable set of labelling functions, with between 10 and 100 recommended, but their format is often very similar, making them quick to produce since a large amount of code can be reused. It is worth noting that the labels are only as good as the labelling functions that produce them, so taking the time to properly understand your dataset and use case is essential.
Author
Alexander Billington