Daniel Liu's Blog

Funding Open Source Bioinformatics Software

Sat, 20 Jan 2024 00:00:00 +0000

We need sustainable ways to fund quality open source bioinformatics software.

I’m not really sure what the best solution (funding model) is, so I’ll try to review some of what’s currently happening in the field. I will mainly focus on genomics and biological sequence analysis software since I’m most familiar with this area.

First, let’s motivate why open source bioinformatics is important.

Why open source software?

Exists for a long time even if the original developers move on or go bankrupt.
Benefits both academia and industry: people are not starting from zero for new projects, which leads to faster iteration speed for the entire field.
Modifiable to fit new use cases. Every lab or company seems to have their own custom data, so there’s rarely one-size-fits-all solutions.
Aids the development of new methods by making it easier to build off of, experiment, and compare against previous tools.
Easy to use off-the-shelf software for now, then invest in tuning or customizing later.
For employers: more likely to find people with existing experience.
For employees: allows work to be “transferable” across employers.

Why bioinformatics?

Most recent major scientific discoveries are tool-based (TODO: find the source for this). Bioinformatics software tools are needed to analyze the large amounts of data coming out of genomic sequencers. These tools have applications across metagenomics, pandemic monitoring, cancer biology, immunology, verifying CRISPR edits, etc. I also don’t think we are going to get “foundational models” and “AI agents” in biology without foundational bioinformatics tooling first.

Status quo

What is the state of open source bioinformatics now? Who contributes time and money? Here’s what I’ve seen:

Scientists and PhD students in their free time, mainly as a byproduct of their government grant funded research projects.
The Chan Zuckerberg Initiative (CZI) gives out grants specifically for developing and maintaining open source software. These seem to go towards well-known academic labs.
Hardware and instrument companies that want to demonstrate their devices:
- Intel, which develops mm2-fast and bwa-mem2
- Nvidia, which develops Clara Parabricks
- Sequencing companies, like Illumina, Oxford Nanopore, PacBio

Most of these companies seem to have developed more user-facing tools instead of libraries. Libraries seem to be maintained by people (across industry and academia) on their own time.

TODO: Seqera Labs, St. Jude’s genomics infrastructure team

I think academic labs do not provide the optimal environment to develop and maintain open source tools:

PhD students need to publish
Hard to hire good software engineers (maybe exception: Aaron Quinlan’s lab)
Prominent tool builders do not have enough bandwidth to maintain tools (Heng Li’s bwa-mem)

I think simply throwing more money at academic labs may not work.

Which type of funding model could work?

Grant funding specifically for open source software (government funding or CZI)
- Possible to provide relatively stable job and career development for one or more software engineers?
Open core startup
- VCs need to make their money back
- Unclear if existing open core startups in other domains (like software dev tools) will work well in the long run with this model.
- Nonzero interest rates are brutal
Consulting or bootstrapped startup
- Bioinformatics consulting: Fulcrum Genomics
- Possible to just sell a (hosted) version without splitting attention and providing consulting services?
Get big biotech companies to pay employees that focus on open source

One Weird Trick for Getting into CS Research Roles!

Wed, 07 Sep 2022 00:00:00 +0000

I’ve found the following method to somewhat work when applying to CS research teams.

1. Try to get in touch with someone on the team. Very often, just applying online is not enough. Use prior connections for this. It’s easiest if you have worked in a similar area as the team, so you know which teams and people to contact. You will also have some common ground to chat about and some credibility in the area.
2. Reimplement an algorithm that they use or that is related to what they work on. Read some papers and code up something quickly. This helps you learn in detail about the problem they are trying to solve and the algorithms they use. I find this fun to do anyways, since it’s an opportunity to quickly learn and apply that knowledge by coding something. Oftentimes, gaps in your understanding or important practical details reveal themselves when you try to implement something. This also gives useful knowledge if you get the position. Show them what you have done and the issues you faced or other open questions. This should show initiative, insight, and interest in the team’s work.
3. ???
4. Profit???

Doing steps 1 and 2 in the opposite order should also work. Pick an area, read some papers in that area, try to reimplement their algorithms, then you can get in contact with people to collaborate on research.

So you want to do some scientific research as a high school student?

Tue, 25 Feb 2020 00:00:00 +0000

It is difficult to get started doing research as a high school student without any existing connections with companies or university labs. Therefore, I have compiled a list of notable activities, programs, and awards that I have heard of. Obviously, I have not included every possible opportunity, but these should get interested and motivated students started with doing STEM research.

Link to the list: here.

Have fun doing research!

The birth of a new sub-sub-field: adversarial attacks on 3D point sets

Thu, 22 Aug 2019 00:00:00 +0000

An interesting phenomenon with neural networks is that it is incredibly easy to perturb or change the input by a little amount, and cause the network to make a completely different prediction. In general, this field of adversarial machine learning contains two main problems that everyone is slowly chipping away at:

How can we attack neural networks with a bounded amount of change/perturbation to the input, under a certain metric?
How can we (empirically or provably) defend neural networks against adversarial attacks?

Currently, this is a perpetual tug of war between the attacker and the defender. There is a lot of work on both attacking and defending, particularly because coming up with an efficient and provably robust defensive method against all attacks is very difficult.

Even in the rapidly expanding field of machine learning, there are stones left unturned. In the short timespan of one year, I was fortunate enough to contribute to the birth of a new subsubfield examining the adversarial robustness of neural networks that learn from 3D point sets. This was incredibly exciting—I got to witness the development and maturation of ideas in a field with many opportunities by participating first-hand!

In this post, I will highlight some interesting ideas born through the intersection of 3D machine learning and adversarial machine learning. This will mostly cover the two papers (first and second papers) I contributed to the field along with my mentors Ronald Yu and Hao Su, and also some work by groups all over the world. I will also provide evaluations of the plausibility and effectiveness of different ideas, especially for defending against adversarial attacks, which is prone to unintentional errors.

Name of the game

Before we dive into the ideas for attacking and defending 3D neural networks, let us examine what kind of ideas we want to extract in this field. Since we are examining 3D point clouds and neural networks built especially for them, we want to create attacks and defenses that are native to 3D point clouds. We can create attacks and defenses that are universal to all types of inputs, but then we would have to “compete” with previously proposed ideas. If we merely apply ideas that were used for 2D images, which are studied more thoroughly, to 3D space, then we are not bringing major contributions to the table. Therefore, we want to exploit the special properties of both 3D point clouds and the point cloud neural networks in our journey of attacking and defending.

As neural networks are quite easy to attack in general, we want to ensure that our attacks satisfy desirable constraints, like imperceptibility. In 3D space, there are a lot of new constraints to explore that are not present in other domains.

Now, let us go over a little background on 3D point clouds and their neural networks.

Point clouds

A 3D point cloud provides an approximation of a 3D object’s shape. This shape is only the boundary of a 3D object, so it is hollow. If we represent the true shape as an infinite set of points \(S\), then we have \(x \subset S\) for a point cloud \(x\). For simplicity, we will assume that \(x\) is evenly sampled from the true shape. When point clouds are obtained through scanning objects in the real world (LiDAR and RGB-D scans, or photogrammetry), they are usually uneven and partially occluded.

3D point clouds differ from 2D images in a few ways:

The order of points does not matter.
Shape, instead of color, is represented. Along with the assumed even sampling of points, this means that there is exploitable structure in 3D point clouds.
Randomly moving the points by a small amount, dropping a few points, and adding a few points on the true shape should still represent the same object.

Learning in 3D space

The challenge of learning in 3D space is to deal with the order and density invariance of point clouds. The general idea of the popular PointNet architecture is to apply the classical multi-layer perceptron (multiplication by weight matrices) to each point separately. From the three coordinates of each point, we use multiple layers of matrix multiplication to obtain 1024 features. Then, max pooling is applied elementwise across the 1024-dimensional feature vectors of all points. Since the max operation is symmetric, this is invariant to the order of points in a point set. Many other architectures basically just extend this idea with subsampling and other ideas.

For density invariance, the PointNet architecture only selects a set of “critical” points from the entire input point set with the max pooling operation. This set of critical points forms the skeleton of the point cloud, and the rest of the points are basically ignored by the network.
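To make the order invariance and the critical point set concrete, here is a minimal pure-Python sketch of the idea. It is not the actual PointNet implementation: the weights are random, the network is tiny (16 features instead of 1024), and all function names are made up for illustration.

```python
import random

def relu(v):
    return [max(0.0, a) for a in v]

def linear(v, weights, bias):
    # weights is a list of rows (out_dim x in_dim).
    return [sum(w * a for w, a in zip(row, v)) + b for row, b in zip(weights, bias)]

def make_layer(rng, in_dim, out_dim):
    # Random weights stand in for trained parameters (assumption for illustration).
    weights = [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [rng.uniform(-1, 1) for _ in range(out_dim)]
    return weights, bias

def point_features(point, layers):
    # The same MLP is applied to every point independently.
    h = list(point)
    for weights, bias in layers:
        h = relu(linear(h, weights, bias))
    return h

def global_feature(points, layers):
    """Return the max-pooled feature vector and the indices of the critical points."""
    feats = [point_features(p, layers) for p in points]
    pooled, critical = [], set()
    for d in range(len(feats[0])):
        best = max(range(len(points)), key=lambda i: feats[i][d])
        pooled.append(feats[best][d])
        critical.add(best)  # this point "wins" at least one feature dimension
    return pooled, critical

rng = random.Random(0)
layers = [make_layer(rng, 3, 8), make_layer(rng, 8, 16)]
cloud = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(50)]
pooled, critical = global_feature(cloud, layers)

# Shuffling the points does not change the pooled feature vector.
shuffled = cloud[:]
rng.shuffle(shuffled)
pooled_shuffled, _ = global_feature(shuffled, layers)
```

Feeding only the critical points back into `global_feature` gives the same pooled vector, which is the sense in which the remaining points are ignored.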

Basic attacks in 3D

The most straightforward attack is to use gradient descent to directly perturb the position of each point. We are essentially solving the following optimization problem for a point cloud \(x\):

\[\begin{align} &\text{maximize}& &J(f_\theta(x + \delta), y)\\ &\text{subject to}& &||\delta||_p \leq \epsilon \end{align}\]

\(J\) represents the loss function, \(y\) is the label class, and \(f_\theta\) is a neural network parameterized by \(\theta\). Notice that the perturbation \(\delta\) is bounded by \(\epsilon\) under the \(L_p\) norm, and we do not clip \(x + \delta\) because unlike 2D image color channels, the positions of 3D points are unbounded. The \(L_p\) norm also provides a way for us to measure the perceptibility of adversarial perturbations.
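A rough sketch of this optimization for \(p = 2\), using iterative gradient steps followed by projection back onto the \(\epsilon\)-ball. The `grad_fn` argument is a hypothetical callback that returns the gradient of the loss with respect to each point. Here it is a toy stand-in, since a real attack would get this from an autodiff framework.

```python
import math

def l2_norm(delta):
    # L2 norm of the whole perturbation, treated as one long vector.
    return math.sqrt(sum(c * c for vec in delta for c in vec))

def project_l2(delta, eps):
    # Scale the perturbation back onto the L2 ball of radius eps if it left the ball.
    norm = l2_norm(delta)
    if norm <= eps:
        return [list(v) for v in delta]
    scale = eps / norm
    return [[c * scale for c in v] for v in delta]

def pgd_attack(x, grad_fn, eps, step_size, steps):
    delta = [[0.0, 0.0, 0.0] for _ in x]
    for _ in range(steps):
        # No clipping of x + delta: point positions are unbounded.
        x_adv = [[a + d for a, d in zip(p, dv)] for p, dv in zip(x, delta)]
        grad = grad_fn(x_adv)
        # Step in the direction that increases the loss, then project.
        delta = [[d + step_size * g for d, g in zip(dv, gv)]
                 for dv, gv in zip(delta, grad)]
        delta = project_l2(delta, eps)
    return delta

def toy_grad(points):
    # Gradient of the toy loss J(x) = sum_i ||x_i||^2 (an assumption, not a real network).
    return [[2.0 * c for c in p] for p in points]

x = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
delta = pgd_attack(x, toy_grad, eps=0.5, step_size=0.1, steps=20)
```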

Just like with 2D images, this method works very well for generating adversarial attacks. Here are visualizations of adversarial perturbations (bounded with the \(L_2\) norm) on a car and a person from the ModelNet40 dataset:

and

The perturbed points are orange.

A simple extension to this is to ensure that each point is perturbed by the same amount by normalizing the perturbation vector for each point. Also, it is possible to add a few new points and perturb them instead of perturbing the original points in the point cloud (for example, here and here).

Defense: removing outliers

Right away, we notice that constraining the perturbations using the \(L_2\) norm results in some points being perturbed more than others, and those points become outliers in the point set. Therefore, a simple defense would be to just remove those outlier points. A common method for identifying outliers is based on each point and its \(k\)-nearest neighbors. Outliers are found by examining the distribution of distances between each point and its nearest neighbors. Afterwards, the points that are too far away from their \(k\)-nearest neighbors are removed from the point set.
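A minimal sketch of this kind of outlier removal. The specific rule used here (drop points whose mean \(k\)-nearest-neighbor distance exceeds the mean plus `alpha` standard deviations) is one common choice and an assumption on my part, and the brute-force neighbor search is only meant for small examples.

```python
import math
import statistics

def mean_knn_distances(points, k):
    # Average distance from each point to its k nearest neighbors (brute force).
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(sum(dists[:k]) / k)
    return result

def remove_outliers(points, k=2, alpha=1.0):
    d = mean_knn_distances(points, k)
    # Assumed threshold: mean + alpha * standard deviation of the kNN distances.
    threshold = statistics.mean(d) + alpha * statistics.pstdev(d)
    return [p for p, di in zip(points, d) if di <= threshold]

# A flat 3x3 grid of points plus one point floating far away from it.
grid = [[float(i), float(j), 0.0] for i in range(3) for j in range(3)]
noisy = grid + [[10.0, 10.0, 10.0]]
cleaned = remove_outliers(noisy, k=2, alpha=1.0)
```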

Removing outliers is actually very effective as a defense: it performs much better than adversarial training, a classical defense that involves teaching a neural network the correct labels for adversarial examples. In practice, removing outliers works well even if we constrain the amount of perturbation to each point by an \(\epsilon\) so that large perturbations are not possible. The perturbation of each point can even be constrained to the average distance between each point and its nearest neighbor in the clean point cloud, and outlier removal would still work.

Interestingly, this method was proposed in parallel in many different papers on defending against adversarial attacks on point clouds. I guess everyone noticed the outliers generated by adversarial perturbations. The simpler method of randomly removing points was also proposed as a defense.

Defense: removing salient points

Since we perturb points by their gradients, it makes sense to remove adversarial points by examining their gradients to hopefully restore the point cloud of an object. The idea is to first calculate the saliency of each point (at index \(i\) in \(x\)) through

\[s[i] = \max_j ||(\nabla_{x^\ast} f_\theta(x^\ast)[j])[i]||_2\]

where \(x^\ast = x + \delta\).

Then, we can sort the points by their saliencies and remove points with high saliencies. In other words, points that have large saliencies, which are the magnitudes of the gradient of each output class with respect to each point, are removed. In practice, this works well as a defense, and it performs better than adversarial training. This defense avoids the issue of being unable to identify adversarial points if there are no outliers, but we are making the assumption that points with large magnitudes of gradients are the perturbed points.
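The saliency formula and the removal step can be sketched as follows. The gradients are passed in as plain nested lists, with `grads[j][i]` being the gradient of output class \(j\) with respect to point \(i\). How they are computed (normally by an autodiff framework) and the number of points to remove are left as assumptions.

```python
import math

def saliencies(grads):
    """s[i] = max over classes j of the L2 norm of grads[j][i]."""
    num_points = len(grads[0])
    return [
        max(math.sqrt(sum(c * c for c in class_grads[i])) for class_grads in grads)
        for i in range(num_points)
    ]

def remove_salient_points(points, grads, num_remove):
    s = saliencies(grads)
    # Sort point indices from most to least salient and drop the top ones.
    order = sorted(range(len(points)), key=lambda i: s[i], reverse=True)
    drop = set(order[:num_remove])
    return [p for i, p in enumerate(points) if i not in drop]

points = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Made-up gradients for two output classes and four points.
grads = [
    [[0.0, 0.0, 0.0], [3.0, 4.0, 0.0], [0.0, 0.0, 0.1], [0.0, 0.0, 0.0]],
    [[0.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 2.0, 0.0], [0.0, 0.0, 0.0]],
]
kept = remove_salient_points(points, grads, num_remove=2)
```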

Defense: limitations

The reason why removing points performs so well is actually that these defenses rely on gradient masking. The max pooling operation in PointNet ignores a set of points, which causes them to not get any gradient flow. These points cannot be perturbed with gradient-based methods, so they can accurately represent the unperturbed shape of a point cloud once the perturbed points are removed. However, relying on masked gradients does not lead to truly robust models, as we will see in the shape attacks that are proposed later.

Minimizing the perceptibility of perturbations

So far, when we perturb a point cloud, we do not really take into account the intrinsic shape that it represents with the \(L_p\) norms. Therefore, it may prove fruitful to examine some other metrics for measuring the perturbation on 3D point clouds. One such metric is the Hausdorff distance between two sets:

\[\mathcal{H}(A, B) = \max_{a \in A} \min_{b \in B} ||b - a||_2\]

This is actually not technically a metric, since it is not symmetric, but it enables us to measure the distance between a perturbed point set \(A\) and a clean point set \(B\). In words, the Hausdorff distance is defined as the maximum distance between each point in \(A\) and its closest point in \(B\). For an adversarial point set \(x^\ast\), we have two ways of using the Hausdorff distance:

\(\mathcal{H}(x^\ast, x)\): comparing the adversarial point cloud to the clean point cloud. The advantage of this over the \(L_p\) norms is that it allows the positions of two points to be swapped, making it a much more natural metric for 3D point clouds that are order-invariant. This was proposed in this paper.
\(\mathcal{H}(x^\ast, S)\): comparing the adversarial point cloud to the true shape of the point cloud \(x\). This takes into account the shape of the point cloud, and is point density invariant. This was proposed in my papers.
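For two finite point sets, the one-sided Hausdorff distance defined above is a direct translation of the formula. This brute-force sketch only covers the point-set case \(\mathcal{H}(x^\ast, x)\), not the distance to a continuous shape \(S\).

```python
import math

def hausdorff(a, b):
    # For each point in a, find its closest point in b, then take the worst case.
    return max(min(math.dist(p, q) for q in b) for p in a)

clean = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
swapped = [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

# Swapping two points costs nothing, unlike an L_p norm on the index-wise difference.
swap_distance = hausdorff(swapped, clean)

# The distance is not symmetric, which is why it is not technically a metric.
a = [[0.0, 0.0, 0.0]]
b = [[0.0, 0.0, 0.0], [3.0, 0.0, 0.0]]
forward, backward = hausdorff(a, b), hausdorff(b, a)
```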

To maximize the loss of the neural network while ensuring that \(\mathcal{H}(x^\ast, S) \leq \epsilon\), we can use projected gradient descent to project the perturbation of each point onto the 3D object that the point clouds were sampled from. If the 3D object is unavailable, we can use a triangulation algorithm, like the alpha shapes algorithm, to infer the object shape. For faster projection speed, we can build some metric tree, like the VP-tree, on the triangular mesh, by representing each triangle as a point. I call this method the “distributional attack”, since it changes the distribution of points near the shape of a 3D object. The great thing about this method is that we can generate perturbations with \(\mathcal{H}(x^\ast, S) = 0\), which means that we only move the points around on the shape \(S\). If the 3D object is available, this creates very imperceptible perturbations, and it reaches around 25% success rate on PointNet/PointNet++. Even with a slightly higher Hausdorff distance and an approximated surface \(S\), the perturbations remain imperceptible, while the success rate of the attack becomes much higher (>80%).

Here is a visualization of the distributional attack on an approximated shape, with a small Hausdorff distance:

Here is a visualization with the true shape and a Hausdorff distance of exactly zero, with perturbed points in orange:

Although the perturbation of each point is very small and kept on the shape of the object, removing points is effective as a defense against this type of attack.

Shape attacks

In addition to constraining the Hausdorff distance, we can add another constraint to ensure a uniform density distribution of points after perturbing points. Therefore, the perturbations must change the overall shape of a point cloud, since small changes to the point distribution are not allowed. This is more realistic than having a few points suspended in mid-air, far away from the main 3D object. Also, it is easier to control the shape than the distribution of points obtained through a scanner. Shape attacks are also effective against point removal defenses, as they modify or destroy the density information that outlier removal relies on, and removing a few points from a perturbed shape does not make it clean again.

Perturbation Resampling

We can express the idea of having the points \(x^\ast\) evenly distributed on a shape \(S^\ast\) as an optimization problem maximizing the distance between points:

\[\begin{align} &\text{maximize}& &\min_{i \in \{1 \ldots N\}} \min_{j \in \{1 \ldots N\}\setminus\{i\}} ||x^\ast[j] - x^\ast[i]||_2\\ &\text{subject to}& &x^\ast \subset S^\ast \end{align}\]

However, we do not need to solve this problem exactly; a greedy approximation using farthest point sampling works fine. The perturbation resampling attack is simple: just perturb points using gradient descent, but resample a portion of the points on the estimated adversarial shape that is determined by the perturbed points.
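Farthest point sampling itself is short enough to sketch. This version picks samples from a finite pool of candidate points, which is a simplification: in the attack, the candidates would come from the estimated adversarial shape, and that step is not shown here.

```python
import math

def farthest_point_sampling(points, num_samples, start=0):
    """Greedily pick indices so that each new point is as far as possible
    from all previously chosen points."""
    chosen = [start]
    # min_dist[i] is the distance from point i to its closest chosen point.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(chosen) < num_samples:
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        chosen.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return chosen

# Nine candidate points on a line, at x = 0, 1, ..., 8.
candidates = [[float(i), 0.0, 0.0] for i in range(9)]
picked = farthest_point_sampling(candidates, num_samples=3)
```

Starting from the point at \(x = 0\), the sampler picks the far end (\(x = 8\)) and then the midpoint (\(x = 4\)), which spreads the samples out evenly.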

The final result is

Note that sometimes, the perturbation of a point is so large that the triangulation algorithm cannot include it in the triangulation, so it becomes an outlier.

Adversarial Sticks

We can also add new adversarial features to the shape of the point cloud. In this paper, they propose adding new clusters of points, or even smaller versions of other point clouds that float in mid-air, near a clean point cloud. We take a simpler and more realistic route by adding a few sticks, or line segments, that are attached to the shape of the clean point cloud. Then, the adversarial object will look like a porcupine. Conceptually, we need to figure out where to place the sticks on an object, and the length/direction of each stick. Formally, we are solving the following optimization problem:

\[\begin{align} &\text{maximize}_{\alpha, \beta}& &J(f_\theta(x \cup \mathcal{S}_\kappa(\alpha, \beta)), y)\\ &\text{subject to}& &||\beta||_2 \leq \epsilon,\\ &&&\alpha \subset S \end{align}\]

\(\alpha\) is a set of points representing where the sticks are attached to the shape \(S\), \(\beta\) is a set of vectors representing the orientation of each stick, and \(\mathcal{S}_\kappa\) returns a set of \(\kappa\) points sampled on the sticks.
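One possible way to write \(\mathcal{S}_\kappa\) is sketched below. How the \(\kappa\) points are split across sticks and spaced along each stick is not specified above, so the even split and even spacing here are assumptions, and `sample_sticks` is a made-up name.

```python
def sample_sticks(alpha, beta, kappa):
    """Sample kappa points on sticks that start at alpha[i] and extend by beta[i]."""
    num_sticks = len(alpha)
    # Split kappa as evenly as possible; earlier sticks get the remainder.
    per_stick = [kappa // num_sticks + (1 if i < kappa % num_sticks else 0)
                 for i in range(num_sticks)]
    points = []
    for base, direction, count in zip(alpha, beta, per_stick):
        for t in range(1, count + 1):
            frac = t / count  # evenly spaced, ending at the tip of the stick
            points.append([b + frac * d for b, d in zip(base, direction)])
    return points

alpha = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]  # attachment points on the shape
beta = [[0.0, 0.0, 1.0], [0.0, 2.0, 0.0]]   # stick direction and length
stick_points = sample_sticks(alpha, beta, kappa=5)
```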

Like perturbation resampling, we can just approximate this by first perturbing points using gradient descent, and then connecting those points to the closest point on the surface of a 3D object at the end. Finally, we need to sample points on the adversarial sticks. A visualization of the result is:

Adversarial Sinks

With our previous techniques, we need to resample points during gradient descent. This works, but it feels like a weird hack. So is there a fully differentiable way to perturb the shape of a point cloud?

Let us assume that we have a few guide points (\(s_f\)) for this perturbation process. We also have \(s_0 \subset x\), which are the starting point positions for \(s_f\). Then, we can perturb points on the shape by attracting them to the sink points \(s_f\). This attraction falls off over distance according to the Gaussian radial basis function:

\[\phi_{\mu'} = e^{-(\frac{r}{\mu'})^2}\]

The idea is similar to perturbing a few points separately through basic gradient descent, but since the sink points have attraction, they modify the overall shape of the point cloud. Each point is affected by the sum of the attractions of the \(\sigma\) sink points:

\[x^\ast[i] = x[i] + \tanh\big(\sum_{j = 1}^\sigma (s_f[j] - x[i]) \phi_{\mu'} (||s_0[j] - x[i]||_2)\big), \quad\forall i \in \{1 \ldots N\}\]

where \(\mu'\) is

\[\mu' = \frac{\mu}{N} \sum_{i = 1}^N \min_{j \in \{1 \ldots N\}\setminus\{i\}} ||x[j] - x[i]||_2\]

to ensure that the tunable attraction falloff relies on the distribution of points. Note that the \(\tanh\) is used to clip the perturbations. Since the entire perturbation expression is differentiable, we can use something like Adam to maximize the loss while minimizing the perturbation. A visualization of the result:

This method is my favorite out of all the shape attacks because it is fully differentiable, so it feels “clean” and not as hacky as resampling points. It was inspired by black holes due to my interest in strange physics stuff like quantum mechanics and spacetime.
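The forward pass of the sink perturbation, following the three formulas above, can be sketched in plain Python. Only the forward computation is shown. Optimizing the sink positions \(s_f\) with Adam would need an autodiff framework, and the function names are made up for illustration.

```python
import math

def mean_nn_distance(x):
    # Average distance from each point to its nearest neighbor.
    total = 0.0
    for i, p in enumerate(x):
        total += min(math.dist(p, q) for j, q in enumerate(x) if j != i)
    return total / len(x)

def apply_sinks(x, s0, sf, mu):
    """Pull every point of x towards the sink points sf, with a Gaussian
    falloff based on the distance to the sink starting positions s0."""
    mu_prime = mu * mean_nn_distance(x)
    x_adv = []
    for p in x:
        pull = [0.0, 0.0, 0.0]
        for start, sink in zip(s0, sf):
            r = math.dist(start, p)
            weight = math.exp(-((r / mu_prime) ** 2))
            for c in range(3):
                pull[c] += (sink[c] - p[c]) * weight
        # tanh clips each coordinate of the perturbation to (-1, 1).
        x_adv.append([p[c] + math.tanh(pull[c]) for c in range(3)])
    return x_adv

x = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0], [3.0, 0.0, 0.0]]
s0 = [x[0]]                # the sink starts at the first point
sf = [[0.0, 0.0, 1.0]]     # and has moved one unit up
x_adv = apply_sinks(x, s0, sf, mu=1.0)
```

In this example the first point moves the most, and the pull on the other points falls off quickly with their distance from the sink's starting position.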

For a quick run down of all the attacks proposed in my second paper, just look at this graphic:

The shape attacks all perform very well against point removal defenses, and they perform much better than the iterative gradient \(L_2\) attack, which represents a naive pointwise attack:

Removing points as an attack

Another avenue for attacking point clouds is through removing points (this and this papers). This is done by dropping points that are part of the critical point set that contribute to decreasing the loss between the model output and the correct class. The saliency (gradient) is used to find these critical points. Note that this idea is very similar to the defense method of removing salient points.

Obviously, removing more points as a defense will not help point clouds that are attacked through point removal. However, this attack is not realistic, as it is difficult to control the lost points from point clouds that are directly scanned from 3D objects.

As a side note, I did independently come up with a similar method to this a long time ago (it is still in my code), but I did not test multiple iterations of this attack, so it did not perform very well. This caused me to (unfortunately) scrap the idea. The idea of removing points is quite popular, and similar attacks have been proposed independently in multiple papers.

An upsampling network (this paper) can be used to defend against adversarially removed points, but it is quite easy to also attack the upsampling network, since it is fully differentiable anyways.

Beyond 3D point sets

Point clouds are easy to perturb because changing the numerical values in the input point set directly changes the point locations. For attacking other formats like voxels or meshes, it may be easier to first convert them to point clouds before perturbing them.

Conclusion and future directions

So far, creating truly robust defenses by using information exclusive to 3D point clouds, like point density, is still an open problem. In the beginning, we would think that 3D space is somehow more easily defensible than 2D space, but my later work showed that this is not true. 3D space and 2D space have different strengths and vulnerabilities, and there are attacks and defenses that are exclusive to one domain but not the other. As with much of the work on defenses in 2D, creating defenses that are robust in 3D space is not an easy task, since it is quite easy to attack the assumptions that are made in certain defense techniques. From my short journey through adversarial machine learning, I envision the truly robust defense methods to be mathematically provable and domain-agnostic, so they do not need to make use of domain-specific properties (like the distribution of points) that may be easily circumvented.

Coming up with new research ideas

Mon, 19 Aug 2019 00:00:00 +0000

This post contains some notes on methods for obtaining research ideas, so I can remember to apply them when I am stuck.

I believe that working efficiently is more important than blindly throwing time at a project. For my research work, I spend a lot more time thinking instead of implementing (at least for projects that are not focused on reimplementing existing ideas). Also, I believe it is important to have solid programming and problem solving (competitive math/programming) fundamentals, so the bottleneck in the research process is in coming up with new ideas, not implementing the ideas. In terms of actually coming up with ideas, here are a few methods:

1. Working forwards

Credit for this idea goes to Avi Srivastava for pointing this out to me.

The idea is to first read existing research papers to closely examine previous work. Then, attempt to find details that they miss, or “future work” that they have not done yet. There should always be something that the previous work misses, because researchers have limited time to put into their work.

The UMICollapse tool (paper) I created was based off of this idea. The insight was to speed up the slow pairwise UMI comparison step that was used in previous tools. This step is a bottleneck in the speed of many UMI collapsing algorithms for grouping UMIs, but many previous works only focused on how to accurately group UMIs, not how it can be done efficiently. To ensure that the algorithms I introduce are novel compared to previous algorithms, I adapted the previous well-known algorithms to the UMI deduplication task specifically, by using different tricks based on the UMI grouping algorithms. This is another example of the idea of working forwards from previous ideas by adapting them for a specific domain. For more information on UMI collapsing tricks, read my blog post on it.

2. Working backwards

The idea is to examine the main research problem from a completely different angle that is not yet examined in a previous work. Usually, this involves defining a new metric. For example, if the previous work claims to achieve a high score under a certain metric, then come up with a different meaningful metric and attempt to get better results under this new metric.

My work on adversarial attacks in 3D space (paper) is an example of defining new metrics. In addition to creating effective attacks against 3D neural networks, I examined alternative goals like the perceptibility of the adversarial perturbations to humans, the ease of construction of the adversarial examples in real-life, and the effectiveness of the attacks against defensive techniques. In this case, I came up with new ideas by asking myself, “what other desirable qualities of adversarial examples do we want?”

3. Diversity

When working on a research problem and its subproblems, especially through the working forwards method, try to find diverse, marginally related ideas and attempt to connect them to the research problem. Sometimes, this connection may be novel. Even if this connection proves fruitless, it may reveal an alternative approach to the problem. The marginally related ideas can be found through learning. In fact, when learning about new topics, I often think about connecting those new topics to each of the research problems I have tackled before. Instead of looking for an overarching idea, this method involves looking at the tools, and then asking what those tools can build if they are used together.

To learn different ideas, I do the following in my spare time:

Read relevant research subreddits on Reddit.
Read Hacker News.
Read tweets about blog posts and threads about new papers on Twitter.
Watch videos that summarize recent research, like Two Minute Papers on YouTube.
Skim a paper and then google things in the paper that I do not know, but seem promising. This usually involves reading Wikipedia to learn more.

Sometimes I come up with a random idea not exactly related to what I am working on, and then I google it to find if it has been done. Oftentimes, someone else has already examined this idea. In this case, I can still learn a new concept that I can use in future work. Other times, if I am lucky, then I get a new connection between two different ideas that are related to my current research, and I can get a new algorithm or analysis technique out of it.

An example of this diversity method in action is how I came up with the adversarial sinks attack for 3D points (paper). The general premise of the attack is to pull points on a 3D object towards sink points. This idea was inspired by black holes and how they attract other objects, but it was ultimately named after the source/sinks in network flow problems. I also came up with the idea of actually attempting to approximate the true shape of a 3D point set through alpha shapes, which stems from Delaunay triangulations that I considered for nearest neighbor searches in my bioinformatics work on finding similar DNA sequences.

Another example of this method is how I came up with using a domain-specific language for describing patterns while preprocessing DNA sequences (paper). I knew that the string matching ideas for finding patterns were related to matching regular expressions, so I thought, “why not go one step further and make a full language?” Sometimes, generalizing further works well.

4. Collaborations

It is often very difficult to cover all your bases (e.g., analyzing data, paper writing, etc.) when working on a research project alone. Collaborating with others helps make this easier. This also allows more ideas to be generated through conversations with someone with a slightly different point of view. For an added benefit, collaborating with others reduces the chance of them “competing” with you to solve a certain research problem.

n-grams + BK-trees: tricks for collapsing UMIs faster

Thu, 08 Aug 2019 00:00:00 +0000

Creating a new algorithm or data structure is often not just one major insight, but a collection of tricks that work together to produce better results. With that said, what kind of “better results” are we looking for? For computer scientists, this question often results in three answers: higher accuracy, faster speed, or lower memory footprint. In a recent paper (code), I utilized this answer to explore the problem of efficiently deduplicating sequenced DNA reads with Unique Molecular Identifiers (UMIs) from a computer science perspective.

A bit of background

Each UMI is a short, unique random sequence that corresponds to a certain DNA fragment. When PCR amplification is applied to DNA fragments before sequencing, each DNA fragment is duplicated multiple times. UMIs allow us to figure out which sequenced reads are duplicates by grouping reads through their UMIs. The reason why we do not group using the DNA fragments themselves is because there may be multiple copies of the same DNA fragment before PCR amplification, and we want to be able to accurately count the duplicate DNA fragments. UMIs are applied in many experiments, including single-cell RNA sequencing.

As with many bioinformatics tasks that involve processing sequenced reads, the difficulty lies in handling sequencing and PCR amplification errors. Therefore, the deduplication task amounts to grouping similar UMIs together, where similarity is defined by counting the number of mismatches between two UMI sequences. Since UMIs are used very often, this task has been thoroughly explored. Most notably, the “directional adjacency” method from UMI-tools involves first obtaining the frequency of each unique UMI, and then grouping low frequency UMIs with high frequency UMIs. The main intuition behind this idea is that UMIs that appear frequently have a very high chance of being correctly amplified and sequenced, while UMIs that appear less frequently are most likely wrong. This algorithm is discussed in more detail in this blog post. An image comparison of different grouping algorithms, from the blog post:

Consider what happens for each UMI (which we will call the “queried” UMI): we need to quickly find UMIs that are similar to the queried UMI that also have a UMI frequency lower than our threshold. Let us call this step a single “query”. After multiple queries, we build the full UMI graph as shown above by connecting the similar UMIs. Each connection is directional, so we connect higher frequency UMIs to lower frequency UMIs. Then, we need to group lower frequency UMIs together with the high frequency UMI using the graph, and we assume that the corresponding sequenced reads for the grouped UMIs all originate from the same DNA fragment.

Improvements?

Applying the three things that computer scientists care about, we know that we can basically make three improvements to this process: make it more accurate, speed it up, or lower the memory footprint. We can attempt to improve the accuracy with a better algorithm than directional adjacency, but it is difficult without biological insights and actual experience with UMI data. Therefore, we are forced to settle on attempting to improve the other two areas while ensuring that we do not make any compromises on the accuracy of the directional adjacency algorithm. Directly attempting to lower the memory footprint is kind of pointless since we have to store all of the sequenced reads in memory no matter what, and most computers have a ton of memory anyways. Thus, the only option is to improve the speed of the deduplication process while ensuring that the memory footprint does not increase significantly.

The first step is identifying the bottleneck. There are two main time-consuming procedures in the directional adjacency algorithm: building the graph of UMIs and extracting groups of UMIs. Since each UMI can only belong to one group, it is easy to see that extracting groups of UMIs from the graph only takes time linear in the number of unique UMIs. Building the graph naively requires comparisons between each pair of unique UMIs, which scales quadratically with the number of UMIs. It turns out that many tools made for grouping UMIs use the naive method for building the graph, which leaves a lot of room for improvement.
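To make the quadratic cost concrete, here is a minimal Python sketch of the naive graph construction. The function names are made up for illustration, and the `2 * count - 1` rule is an assumption about how "high frequency" and "low frequency" are decided (it is the rule commonly associated with directional adjacency); this is not the code from the paper.

```python
from itertools import combinations


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def naive_directional_edges(counts, k=1):
    """Compare every pair of unique UMIs: O(n^2) Hamming comparisons.

    counts maps each unique UMI to its frequency. An edge (high, low) is
    added when the two UMIs are within k mismatches and, as an assumed
    rule, counts[high] >= 2 * counts[low] - 1.
    """
    edges = []
    for a, b in combinations(counts, 2):
        if hamming(a, b) > k:
            continue
        if counts[a] >= 2 * counts[b] - 1:
            edges.append((a, b))
        if counts[b] >= 2 * counts[a] - 1:
            edges.append((b, a))
    return edges


edges = naive_directional_edges({"AAAA": 10, "AAAT": 2, "GGGG": 5}, k=1)
```

Every trick that follows is an attempt to avoid touching most of the pairs that this double loop visits.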

n-grams BK-trees: a medley of tricks

As a recap, our goal is to find a list of candidate UMIs that are similar to a queried UMI, and then narrow down the list to only UMIs that are low frequency. The exact frequency value of a “low frequency” UMI depends on the frequency of other UMIs, so in general we will narrow down the list of similar UMIs to only the low frequency UMIs with a fixed frequency threshold that represents the upper bound frequency. Now, we can figure out how to speed this up with different techniques.

Trick 1: n-grams

The first trick we can apply is a simple and intuitive one. The idea is to decompose each UMI into multiple “fingerprints”, and build a mapping from each fingerprint to a list of all UMIs that have that fingerprint. This fingerprint is simply a contiguous segment of the UMI, called an n-gram, and its location in the UMI. The only difficult part of this trick is how we select the segments to allow errors (mismatches) in the UMIs. It is not hard to see that if we want to allow up to \(k\) errors, then we only need to split the UMIs into \(k + 1\) n-grams: by the pigeonhole principle, \(k\) mismatches can affect at most \(k\) of the \(k + 1\) segments, so at least one segment must match exactly. Therefore, for each queried UMI, we only need to search other UMIs that share at least one n-gram with the queried UMI. This allows us to prune UMIs by the number of errors we allow. The interesting part of this trick is that as the UMI length increases, each n-gram becomes longer, and thus rarer, and more UMIs can be pruned through this method.
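As a rough sketch of this trick in Python (the names `split_ngrams`, `build_index`, and `candidates` are hypothetical, and the code assumes every UMI has the same length, which is at least \(k + 1\)):

```python
from collections import defaultdict


def split_ngrams(umi, k):
    """Split umi into k + 1 contiguous segments, each tagged with its start offset."""
    parts = k + 1
    base, extra = divmod(len(umi), parts)
    segments, start = [], 0
    for i in range(parts):
        length = base + (1 if i < extra else 0)  # spread the remainder over the first segments
        segments.append((start, umi[start:start + length]))
        start += length
    return segments


def build_index(umis, k):
    """Map each (offset, n-gram) fingerprint to the list of UMIs that contain it."""
    index = defaultdict(list)
    for umi in umis:
        for key in split_ngrams(umi, k):
            index[key].append(umi)
    return index


def candidates(index, query, k):
    """UMIs sharing at least one fingerprint with query (a superset of the true matches)."""
    found = set()
    for key in split_ngrams(query, k):
        found.update(index.get(key, ()))
    return found


index = build_index(["AAAAAA", "AAATTT", "AAAAAT", "GGGGGG"], k=1)
close = candidates(index, "AAAAAA", k=1)
```

Note that the candidate set can still contain UMIs that are too far away ("AAATTT" shares its first half with the query but has three mismatches), so each candidate still needs an explicit Hamming distance check.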

Trick 2: BK-trees

BK-trees are metric trees that partition the (in our case) space of all UMIs into shells. In other words, we pick a UMI and partition the UMI space into multiple shells of different radii that are centered around that picked UMI. The tree structure is constructed by repeatedly picking parent UMIs to partition the UMI space, and connecting each parent UMI to children UMIs that lie in each of the shells of different radii. For a more detailed walkthrough of how BK-trees work, read this blog post, which contains an excellent introduction.
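For readers who prefer code, a bare-bones BK-tree over the Hamming distance could look like the following Python sketch. The class and function names are hypothetical, and this is a textbook-style version rather than the implementation from the paper.

```python
class BKNode:
    def __init__(self, umi):
        self.umi = umi
        self.children = {}  # shell radius (distance to this node) -> child BKNode


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def bk_insert(root, umi):
    """Walk down the tree, following the shell that umi falls into at each node."""
    node = root
    while True:
        d = hamming(umi, node.umi)
        if d == 0:
            return  # already in the tree
        child = node.children.get(d)
        if child is None:
            node.children[d] = BKNode(umi)
            return
        node = child


def bk_query(root, query, k):
    """Return all UMIs within k mismatches of query."""
    results, stack = [], [root]
    while stack:
        node = stack.pop()
        d = hamming(query, node.umi)
        if d <= k:
            results.append(node.umi)
        # Triangle inequality: matches can only sit in shells with radius in [d - k, d + k].
        for radius, child in node.children.items():
            if d - k <= radius <= d + k:
                stack.append(child)
    return results


root = BKNode("AAAA")
for umi in ["AAAT", "AATT", "TTTT", "AAAC"]:
    bk_insert(root, umi)
hits = sorted(bk_query(root, "AAAA", k=1))
```

The pruning comes entirely from the shell check in `bk_query`: shells whose radius is too far from the query's distance to the current node are never visited.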

Here is an example of a BK-tree:

The main problem in the n-grams algorithm is that we have to check every single UMI that shares at least one n-gram with our queried UMI. Many of these UMIs may not be similar to our queried UMI, so it is definitely a good idea to transform the list used in the n-grams method into something else, like a BK-tree. Then, each unique n-gram maps to a BK-tree that contains all of the UMIs associated with that n-gram, and we can prune even more UMIs from our search space. Here is an example of the n-grams BK-trees algorithm:

The main advantage of the n-grams BK-trees data structure over other “brute force” type algorithms that go through all possible errors that could occur in a UMI string is that it does not directly scale exponentially as the number of errors we allow or the UMI length increases.

Trick 3: prune by frequency

A seemingly obvious property of trees is that once we know that each node in a subtree does not satisfy our criteria, we can just skip that entire subtree. If we compute the minimum frequency of all UMIs in each subtree in a BK-tree and save those values, then we can easily figure out whether a subtree contains at least one UMI with a frequency that is less than our fixed frequency threshold. With this, we can skip subtrees that only contain UMIs with frequencies greater than our threshold.

This property is important because it hints that we should prune UMIs by frequency while searching for similar UMIs to build the UMI graph. We can avoid visiting UMIs in the BK-trees that are similar to our queried UMI, but have a UMI frequency higher than our frequency threshold, by keeping track of the minimum UMI frequency of each subtree in each BK-tree.
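A small Python sketch of this idea follows. The names are hypothetical, the UMIs are assumed to be unique, and the threshold is treated as an inclusive upper bound (`max_freq`), which is a choice made here for illustration.

```python
class FreqNode:
    def __init__(self, umi, freq):
        self.umi = umi
        self.freq = freq
        self.min_freq = freq  # smallest frequency anywhere in this subtree
        self.children = {}    # shell radius -> child FreqNode


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def insert(root, umi, freq):
    node = root
    while True:
        # Every node on the insertion path gains umi in its subtree.
        node.min_freq = min(node.min_freq, freq)
        d = hamming(umi, node.umi)
        if d == 0:
            return
        child = node.children.get(d)
        if child is None:
            node.children[d] = FreqNode(umi, freq)
            return
        node = child


def query(root, q, k, max_freq):
    """UMIs within k mismatches of q whose frequency is at most max_freq."""
    results, stack = [], [root]
    while stack:
        node = stack.pop()
        if node.min_freq > max_freq:
            continue  # no UMI in this subtree is low-frequency enough
        d = hamming(q, node.umi)
        if d <= k and node.freq <= max_freq:
            results.append(node.umi)
        for radius, child in node.children.items():
            if d - k <= radius <= d + k:
                stack.append(child)
    return results


root = FreqNode("AAAA", 100)
for umi, freq in [("AAAT", 3), ("AAAC", 50), ("AATT", 2)]:
    insert(root, umi, freq)
low = query(root, "AAAA", k=1, max_freq=5)
```

In this toy tree, "AAAC" is within one mismatch of the query, but its subtree has a minimum frequency of 50, so the search skips it without computing a single distance.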

Unused trick: sorting by frequency

Since we are pruning by frequencies, why not go one step further and also sort the UMIs by frequency before adding them one-by-one into the BK-trees? This allows lower frequency UMIs to be added closer to the root of the tree and higher frequency UMIs to be added near the leaves of the tree. As subtrees often mostly include UMIs that are far away from the root, it is more likely for a subtree to contain UMIs with frequencies higher than the threshold the farther away we get from the root. That means that more subtrees are pruned overall.

In the end, we will not use this when initializing the n-grams BK-trees data structure because it requires sorting the UMIs by frequency, which actually slows down the initialization step in practice.

Trick 4: literally extract UMIs

So far, we are able to obtain the lower frequency UMIs that are similar to a queried higher frequency UMI. After querying with each UMI as the higher frequency UMI, we can build the directed graph of UMIs and group UMIs through the directional adjacency algorithm. However, notice that in the end, each UMI can only belong to one single group. Therefore, we are actually wasting time explicitly building the UMI graph, because after adding a UMI to a group in the directional adjacency algorithm, we do not ever need to revisit that UMI again. All of the extra edges leading to that UMI, which we painstakingly calculated through multiple queries, are essentially useless.

So why even build the graph in the first place if we do not use most of it? Why not just make the UMI graph implicit, so we only compute the edges we need? This actually works, and it basically means that we remove UMIs from our BK-trees after each query. Therefore, UMIs that are added to a group in a previous query are marked as removed and skipped in later queries. We can keep track of subtrees in each BK-tree where all of the UMIs are removed, and skip entire subtrees to save time. Since the UMI graph is not explicitly constructed, we need to merge the graph construction and the UMI extraction/grouping steps of the directional adjacency algorithm together. This means that we essentially interleave grouping the UMIs and modifying the BK-trees that represent the implicit UMI graph.
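The interleaving can be sketched in Python as follows. To keep the example short, a plain linear scan stands in for the BK-tree query, and a `removed` set stands in for marking tree nodes as removed. The function name, the choice to seed groups in descending frequency order, and the `2 * count - 1` rule are all assumptions made for illustration.

```python
from collections import deque


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def group_umis(counts, k=1):
    """Group UMIs without ever materializing the UMI graph."""
    order = sorted(counts, key=counts.get, reverse=True)  # assumed seeding order
    removed = set()
    groups = []
    for seed in order:
        if seed in removed:
            continue
        removed.add(seed)
        group, queue = [seed], deque([seed])
        while queue:
            high = queue.popleft()
            # One "query": this scan is a stand-in for the n-grams BK-trees lookup.
            for low in order:
                if low in removed:
                    continue  # already grouped, so its remaining edges are never computed
                if hamming(high, low) <= k and counts[high] >= 2 * counts[low] - 1:
                    removed.add(low)
                    group.append(low)
                    queue.append(low)
        groups.append(group)
    return groups


groups = group_umis({"AAAA": 10, "AAAT": 4, "AATT": 1, "GGGG": 6}, k=1)
```

Each UMI is placed into `removed` exactly once, which is what lets later queries ignore it entirely.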

Trick 5: special encoding

With two strings of nucleotides (A, T, C, or G) of length \(M\), we can compute the Hamming distance (how we measure similarity) in exactly \(O(M)\) time. At first it seems unlikely, but can we do better? The answer is actually yes!

Usually, the encoding method for nucleotides is mapping them to binary: 00, 01, 10, and 11. This is the best we can do if we optimize for size. The problem with this encoding is that we cannot easily find the Hamming distance for two arbitrary strings encoded with them. If we optimize for the speed of computing Hamming distance, we can actually get a different set of encodings: 011, 110, 101, 000. The special property of this set of encodings is that the bitwise Hamming distance between each pair of encodings is exactly 2 (try it; count the number of different bits between each pair). This means that we can easily infer the Hamming distance between two strings of nucleotides when they are encoded with that encoding by calculating the bitwise Hamming distance.

The reason why we can get faster than \(O(M)\) Hamming distance computation is because we can pack a bunch of nucleotides into a single 64-bit computer word. In fact, since computing the bitwise Hamming distance is constant-time (XOR + POPCOUNT operations), we can compute the Hamming distance between two 21 nucleotide strings in constant-time! Note that encoding a nucleotide string still takes \(O(M)\) time so this is only beneficial if multiple Hamming comparisons are made. By packing a bunch of nucleotides into one computer word, we can also make hashing and other comparison operations constant-time.
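Here is a small Python sketch of the encoding trick. Which nucleotide gets which 3-bit code is an arbitrary assignment chosen here, and the function names are hypothetical. Python integers are arbitrary precision, so the 64-bit limit only matters in languages with fixed-width words, where XOR and POPCOUNT are single instructions.

```python
# Assumed assignment of the four codes; any assignment works because
# every pair of codes differs in exactly 2 bits.
ENCODING = {"A": 0b011, "T": 0b110, "C": 0b101, "G": 0b000}


def encode(umi):
    """Pack a nucleotide string into one integer, 3 bits per nucleotide (O(M) work, done once)."""
    word = 0
    for base in umi:
        word = (word << 3) | ENCODING[base]
    return word


def packed_hamming(x, y):
    """Hamming distance between two encoded strings of the same length.

    Each mismatching nucleotide contributes exactly 2 differing bits,
    so we count the set bits of the XOR and halve the result.
    """
    return bin(x ^ y).count("1") // 2


a = encode("ACGTACGT")
b = encode("ACGAACGG")
distance = packed_hamming(a, b)
```

With 3 bits per nucleotide, 21 nucleotides take 63 bits, which is why 21 is the longest string that fits in a single 64-bit word.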

Implementation trick: reducing copies

When we split UMIs into n-grams, we can avoid copying the pieces of the UMI multiple times by using views on the UMIs. A view basically represents a contiguous segment of a UMI string with only the start and end locations of the segment, which is backed by a reference to the original UMI. We can also cache the hash for each view so we do not need to recalculate it.
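A view with a cached hash might be sketched like this in Python. The class name `UmiView` and the simple polynomial hash are made up for illustration; the point is only that equality and hashing read directly from the backing string and never build a substring.

```python
class UmiView:
    """A (start, end) window onto a UMI string, usable as a dictionary key."""

    __slots__ = ("umi", "start", "end", "_hash")

    def __init__(self, umi, start, end):
        self.umi = umi      # reference to the original string, not a copy
        self.start = start
        self.end = end
        self._hash = None   # computed lazily, then reused

    def __len__(self):
        return self.end - self.start

    def __eq__(self, other):
        if not isinstance(other, UmiView):
            return NotImplemented
        # Fingerprints include their location, so the offsets must match too.
        if self.start != other.start or len(self) != len(other):
            return False
        return all(
            self.umi[self.start + i] == other.umi[other.start + i]
            for i in range(len(self))
        )

    def __hash__(self):
        if self._hash is None:
            h = self.start
            for i in range(self.start, self.end):
                h = (h * 31 + ord(self.umi[i])) & 0xFFFFFFFFFFFFFFFF
            self._hash = h
        return self._hash


first = UmiView("ACGTAC", 0, 3)
second = UmiView("ACGGGG", 0, 3)
same_bucket = len({first, second}) == 1
```

Two views over different UMIs compare equal when they cover the same offset and the same characters, which is what the n-gram mapping needs from its keys.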

Performance in practice

It is important to remember that the tricks for speeding up UMI deduplication may degrade performance compared to other algorithms on very small datasets. This is completely expected, since there are overheads associated with using those tricks.

First, let us see how fast the n-grams BK-trees method runs compared to other methods as the number of unique UMIs increases:

The n-grams BK-trees method is able to combine the benefits of the n-grams and the BK-tree methods, as it performs better than either one individually.

If we increase the length of the UMIs, then we see that it ties with the n-grams method:

If we increase the number of errors allowed, then the n-grams BK-trees method scales much more favorably compared to other methods:

Note that in all three experiments, we use simulated (randomly generated) datasets. In practice, when there are fewer UMIs at a single alignment coordinate, the n-grams BK-trees method does not result in such dramatic gains in performance.

For a little more information about the n-grams BK-tree data structure, we can look at some statistics about the n-grams with more than 160,000 UMIs:

We can see that the n-grams method is able to prune a significant portion of the UMIs. The largest BK-tree is only built on around 140 UMIs.

Conclusion

The n-grams BK-trees method is essentially a bag of all sorts of tricks for speeding up the UMI deduplication task. However, in general, finding similar strings to a queried string is very useful in a variety of applications. It is vital in natural language processing and bioinformatics tasks that involve clustering and grouping similar strings. Perhaps the insights and tricks behind the n-grams BK-trees method can be applied to accurately finding similar strings under the Hamming distance metric for other tasks.