Weak supervision has become a popular technique for automatically generating labeled training data for machine learning models from multiple noisy label sources, and it powers applications used by billions of people every day, such as Gmail, AI products at Apple, and search products at Google. But existing weak supervision frameworks like Snorkel have always required a two-step process: first, you process all your unlabeled data in a single batch to learn the accuracies of the label sources and generate training labels; then, you train a powerful end model to accomplish your task. Both steps can take a lot of time, especially for data-intensive applications like video.

Unfortunately, creating label sources can be quite tricky, and the long turnaround time makes it hard to evaluate how well you’re doing until you go through the full loop. We wanted to push towards a much more interactive weak supervision loop by reducing this turnaround time from label source creation to working model. We present FlyingSquid, a first step in that direction.

FlyingSquid learns source accuracies orders of magnitude faster than previous weak supervision frameworks and in some cases obviates the need for an end model. In doing so, FlyingSquid enables interactive development cycles and makes weak supervision for applications like video and online learning easy and practical.

Our paper *Fast and Three-rious: Speeding Up Weak Supervision with Triplet Methods* is available on arXiv, and our code is available on GitHub (including a PyTorch integration for online learning)!

# Introducing FlyingSquid

We built FlyingSquid to take the lessons we learned from our previous experiences with Snorkel, and push towards a much more interactive model building loop. In weak supervision frameworks like Snorkel, the name of the game is aggregating noisy labeling function outputs to automatically label training data.

For example, someone trying to build a model to detect interviews with Bernie Sanders on TV may write some simple Python code to look for his name in the transcript, or use an off-the-shelf face detector to detect his face in the video feed. Weak supervision frameworks take a few of these labeling functions, learn their accuracies and correlations, and then generate probabilistic training data for some powerful end model like a ResNet (all without ground truth labels)!
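For illustration, here's what a couple of such labeling functions might look like in Python. The `segment` fields and both heuristics are hypothetical stand-ins for this sketch, not the ones from our experiments; each function votes +1 (interview), -1 (not an interview), or 0 (abstain):

```python
# Hypothetical labeling functions for the interview-detection task.
# Each votes +1 (interview), -1 (not an interview), or 0 (abstain).

def lf_transcript_mentions_name(segment):
    # Weak heuristic: the transcript mentions the subject by name
    if "bernie sanders" in segment["transcript"].lower():
        return 1
    return 0  # abstain when the name is absent

def lf_face_detected(segment):
    # Weak heuristic: an off-the-shelf face detector matched the subject's face
    return 1 if segment["num_matching_faces"] > 0 else -1
```

A weak supervision framework then aggregates these noisy votes into probabilistic training labels, without ever seeing ground truth.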

Unfortunately, completing this loop can be very computationally expensive. Learning the accuracy and correlation parameters of labeling functions (which we collectively refer to as a label model) often requires multiple iterations of stochastic gradient descent (SGD), and training up an expensive deep model can take a long time. This creates a very long turnaround between writing labeling functions and seeing how your models perform.

FlyingSquid drastically reduces this turnaround time to an interactive pace. Our key technical innovation is that we figured out how to learn the label model with a set of closed-form solutions, instead of relying on SGD.

This has a few key advantages:

- First, the closed-form solution runs orders of magnitude faster than existing weak supervision tools (especially for applications like image or video analysis where modeling spatial and temporal correlations was previously very slow), so training a label model is now nearly instantaneous.
- Second, removing SGD from the training loop also makes it easier to train an accurate label model with FlyingSquid, since there are fewer hyperparameters to tune (no more learning rates, momentum parameters, etc.). In some cases, the label model ends up being more accurate than the end model. This means we can remove the end model from the development loop, and it's often no worse than if we had spent hours training!
- Third, since there is an exact expression for the label model’s parameters, we are able to provide tight theoretical bounds on model performance. Even without ground truth labels on our dataset, we can train an end model that guarantees the same asymptotic rate as supervised approaches. Our result also takes potential model misspecification into account, showing that we can bound the generalization error even when our label model does not perfectly model the underlying data distribution!

In the rest of this blog post, we’ll discuss FlyingSquid’s key technical insight in latent variable estimation, and present some experimental results showing how FlyingSquid enables exciting applications in video analysis. We also show how FlyingSquid enables new online learning settings (FlyingSquid runs so fast that we can now train the label model in the training loop of a deep network). Full details in our paper on arXiv!

# Triplets Are All You Need

The key technical challenge in weak supervision is estimating the accuracies of – and potentially the correlations among – multiple noisy labeling functions without any ground truth data. There’s been a lot of research showing that using latent variable probabilistic graphical models (PGMs) to model these dependencies is a good approach for getting high performance in weak supervision (Varma et al 2019, Sala et al 2019, Ratner et al 2018). PGMs can model labeling functions as observable variables ($\lambda$’s in the figure below), with the true (unobserved) ground-truth labels as a hidden variable:

In the graph above, edges indicate correlations between variables. For example, each labeling function (represented by a $\lambda$) is correlated with the ground truth label ($Y$). Two of the labeling functions ($\lambda_1$ and $\lambda_2$) also have an additional correlation that is *not captured* by their relationship to the ground truth label (for example, they could share a sub-routine or express similar heuristics).

The nice thing about modeling labeling functions with PGMs is that we can capture a wide array of dependencies like the ones above and more. For example, we can model temporal dependencies in video tasks by saying that neighboring frames are correlated:

In the above graph, each $Y_i$ models the ground-truth label of a single frame in a video sequence, and labeling functions label individual frames.

Unfortunately, solving for the parameters of these PGMs (e.g., learning the weights of the edges) can be very difficult and computationally expensive, especially for tasks like video, where modeling more temporal dependencies results in much wider graphs. Previous approaches often rely on iterations of SGD. This can be very expensive, and also often requires tuning SGD parameters like the number of iterations, the learning rate, etc.

Our insight was that we can find a closed-form solution to the parameter estimation problem by solving for individual triplets of parameters at a time (similar in spirit to previous work in PGMs and crowdsourcing; see the paper for a detailed comparison). In particular, if we identify triplets of conditionally-independent observable variables, we can construct a system of equations based on their agreements and disagreements (their second-order moments). For example, in the above two graphs, we can construct these highlighted groups of triplets:

Since this method has a closed-form solution, it reduces the problem to making a few matrix calculations in numpy – resulting in speedups of multiple orders of magnitude!
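As a concrete illustration, here's a minimal numpy sketch of the triplet trick for a single binary task with three conditionally independent labeling functions. The accuracies and data are synthetic assumptions for this sketch, not our actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
p = [0.9, 0.8, 0.7]  # true labeling function accuracies (synthetic assumption)

# Simulate ground truth and three conditionally independent labeling functions
Y = rng.choice([-1, 1], size=n)
L = np.stack([np.where(rng.random(n) < pi, Y, -Y) for pi in p])

# Second-order moments: M[i, j] estimates E[lambda_i * lambda_j]
M = (L @ L.T) / n

# Closed-form triplet solution for the accuracies a_i = E[lambda_i * Y].
# Taking the positive square root assumes each source is better than random.
def triplet(i, j, k):
    return np.sqrt(M[i, j] * M[i, k] / M[j, k])

a_hat = [triplet(0, 1, 2), triplet(1, 0, 2), triplet(2, 0, 1)]
a_true = [2 * pi - 1 for pi in p]  # [0.8, 0.6, 0.4]
```

Once the agreement rates are in hand, each accuracy $a_i = E[\lambda_i Y]$ comes out of a single square root, so there is nothing to tune and nothing to iterate.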

We can also theoretically analyze the downstream performance of our method and prove bounds on its sampling and generalization error. In particular,

- We show that the sampling error of the parameters of the graphical model scales as $O(1/\sqrt{n})$ in the number of training samples, and prove that this bound is information-theoretically tight.
- We prove that the generalization error for the end model also scales in $O(1/\sqrt{n})$, which is the same asymptotic rate as supervised approaches.
- We show that this generalization bound holds even when the underlying data distribution cannot be represented with our PGM, and our new analysis approach quantifies these tradeoffs in model selection (more complex models may represent the data better, but also require more training data to learn).

Check out our paper on arXiv for more details on our method and analysis results!

# Applications in Video and Online Learning

Now we’ll give a short preview of some of the ways that we were able to exploit this technical advance to push towards faster and more interactive weak supervision with FlyingSquid – more details about all these experiments and applications in our paper!

We validated FlyingSquid on a number of video analysis applications, ranging from media applications like commercial detection in TV news to sports analysis applications like segmenting rallies from broadcast tennis footage. FlyingSquid runs up to 4,000 times faster than a previous weak supervision framework for sequential data, while achieving comparable or higher model performance:

**End Model Performance (F1), best in bold**

| Task | Sequential Snorkel | FlyingSquid (label model in paren.) |
| --- | --- | --- |
| Interviews | **92.0** | 91.9 (94.3) |
| Commercials | 89.8 | **92.3** (88.4) |
| Tennis Rally | 80.6 | **82.8** (87.3) |

**Label Model Training Time (s), best in bold**

| Task | Sequential Snorkel | FlyingSquid |
| --- | --- | --- |
| Interviews | 256.6 | **0.423** |
| Commercials | 265.6 | **0.067** |
| Tennis Rally | 398.4 | **0.199** |

Sometimes, you can use the label model directly to get better performance than (or comparable performance to) the end model. These tasks can often take advantage of powerful pre-trained models to express higher-level concepts that are difficult to learn directly (for example, the labeling functions for the tennis rally task use off-the-shelf object detectors that already know what people look like). In these cases, completing the weak supervision loop with FlyingSquid is instantaneous, since you don’t have to train an end model. This means you can rapidly iterate on labeling functions and immediately see how they affect your end performance. We’re really excited about what this means for interactive video analysis applications, especially in conjunction with previous work like Rekall.
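To make direct use of the label model concrete, here's a minimal naive-Bayes-style inference sketch. It assumes conditionally independent sources with symmetric noise (the actual label model also handles correlations), and the accuracies below are hypothetical:

```python
import numpy as np

def label_model_predict(L, a, prior=0.5):
    """Turn noisy votes into probabilistic labels.

    L: (m, n) array of votes in {-1, +1}
    a: estimated accuracies a_i = E[lambda_i * Y], as from the triplet method
    """
    p = (1 + np.asarray(a)) / 2         # accuracy as a probability
    w = np.log(p / (1 - p))             # per-source log-odds weight
    logits = w @ L + np.log(prior / (1 - prior))
    return 1 / (1 + np.exp(-logits))    # P(Y = +1 | votes)
```

More accurate sources get larger weights, so a single high-accuracy source can outvote several weak ones; the output probabilities can be used as training labels or as predictions directly.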

We can also exploit FlyingSquid’s speed to enable new online learning applications, where we continuously update label model and end model parameters over time. This means that we can now adapt to *distributional drift* (when the underlying data distribution changes over time). Here’s a synthetic experiment that demonstrates when this can be helpful:

In situations with little to no distributional drift (left), offline learning often works best, since learning methods can optimize over more data; but in settings with heavy distributional drift (right), online learning can continuously adapt the model to changing data streams (whereas offline learning has trouble finding a single model that can account for all the data).
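To sketch the online idea (this is not the library's actual API), we can keep an exponential moving average of the pairwise agreement rates and re-solve the closed form after every mini-batch. The decay rate and the drift scenario below are assumptions for illustration:

```python
import numpy as np

class OnlineTripletModel:
    """Track pairwise moments with an exponential moving average and
    re-solve the triplet closed form after every mini-batch."""

    def __init__(self, alpha=0.1):
        self.alpha = alpha  # decay rate: higher adapts faster to drift
        self.M = None       # running estimate of E[lambda_i * lambda_j]

    def update(self, L_batch):
        """L_batch: (3, batch_size) votes in {-1, +1}; returns accuracy estimates."""
        batch_M = (L_batch @ L_batch.T) / L_batch.shape[1]
        self.M = batch_M if self.M is None else \
            (1 - self.alpha) * self.M + self.alpha * batch_M
        M = self.M
        return np.array([
            np.sqrt(M[0, 1] * M[0, 2] / M[1, 2]),
            np.sqrt(M[0, 1] * M[1, 2] / M[0, 2]),
            np.sqrt(M[0, 2] * M[1, 2] / M[0, 1]),
        ])

# Simulate drift: the first labeling function degrades halfway through the stream
rng = np.random.default_rng(1)
model = OnlineTripletModel()
for t in range(50):
    p = [0.9 if t < 25 else 0.6, 0.8, 0.75]  # true accuracies (drift at t = 25)
    y = rng.choice([-1, 1], size=1000)
    L = np.stack([np.where(rng.random(1000) < pi, y, -y) for pi in p])
    a_hat = model.update(L)
```

Because each update is just a handful of matrix operations, re-estimating source accuracies on every mini-batch is cheap enough to run inside an end model's training loop.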

Our paper has more details on these and other experiments – we show simple proofs of concept about how we can use online learning over large video datasets, and we also validate FlyingSquid on benchmark weak supervision tasks that have been used to evaluate previous weak supervision frameworks like Snorkel. We also release a PyTorch layer that automatically integrates FlyingSquid into the end model training loop.

For more details, check out our paper on arXiv, and our code on GitHub!