Daniel Liu's Blog: Programming and Research
https://blog.liudaniel.com
Daniel Liu
daniel.liu02@gmail.com
https://liudaniel.com/
So you want to do some scientific research as a high school student?
<p>It is difficult to get started doing research as a high school student without any existing connections with companies or university labs. Therefore, I have compiled a list of notable activities, programs, and awards that I have heard of. Obviously, I have not included every possible opportunity, but these should get interested and motivated students started with doing STEM research.</p>
<object style="margin: auto" data="../assets/research_list_general.pdf" type="application/pdf" width="100%" height="750px">
Link to the list: <a href="../assets/research_list_general.pdf">here</a>.
</object>
<p>Have fun doing research!</p>
Tue, 25 Feb 2020 00:00:00 +0000
https://blog.liudaniel.com/high-school-research-opportunities
The birth of a new sub-sub-field: adversarial attacks on 3D point sets
<p>An interesting phenomenon with neural networks is that it is incredibly easy to perturb the input by a small amount and cause the network to make a completely different prediction. In general, this field of adversarial machine learning contains two main problems that everyone is slowly chipping away at:</p>
<ul>
<li>How can we attack neural networks with a bounded amount of change/perturbation to the input, under a certain metric?</li>
<li>How can we (empirically or provably) defend neural networks against adversarial attacks?</li>
</ul>
<p>Currently, this is a perpetual tug of war between the attacker and the defender. There is a lot of work on both attacking and defending, particularly because coming up with an efficient and provably robust defensive method against all attacks is very difficult.</p>
<p><img src="../assets/cropped_schematics.png" alt="" width="500px" /></p>
<p>Even in the rapidly expanding field of machine learning, there are stones left unturned. In the short timespan of one year, I was fortunate enough to contribute to the birth of a new subsubfield examining the adversarial robustness of neural networks that learn from 3D point sets. This was incredibly exciting—I got to witness the development and maturation of ideas in a field with many opportunities by participating <em>first-hand</em>!</p>
<p>In this post, I will highlight some interesting ideas born through the intersection of 3D machine learning and adversarial machine learning. This will mostly cover the two papers (<a href="https://arxiv.org/abs/1901.03006">first</a> and <a href="https://arxiv.org/abs/1908.06062">second</a> papers) I contributed to the field along with my mentors Ronald Yu and Hao Su, and also some work by groups all over the world. I will also provide evaluations of the plausibility and effectiveness of different ideas, especially for defending against adversarial attacks, which is prone to unintentional errors.</p>
<h1 id="name-of-the-game">Name of the game</h1>
<p>Before we dive into the ideas for attacking and defending 3D neural networks, let us examine what kind of ideas we want to extract in this field. Since we are examining 3D point clouds and neural networks built especially for them, we want to create attacks and defenses that are native to 3D point clouds. We can create attacks and defenses that are universal to all types of inputs, but then we would have to “compete” with previously proposed ideas. If we merely apply ideas that were used for 2D images, which are studied more thoroughly, to 3D space, then we are not bringing major contributions to the table. Therefore, we want to exploit the special properties of both 3D point clouds and the point cloud neural networks in our journey of attacking and defending.</p>
<p>As neural networks are quite easy to attack in general, we want to ensure that our attacks satisfy desirable constraints, like imperceptibility. In 3D space, there are a lot of new constraints to explore that are not present in other domains.</p>
<p>Now, let us go over a little background on 3D point clouds and their neural networks.</p>
<h2 id="point-clouds">Point clouds</h2>
<p>A 3D point cloud provides an approximation of a 3D object’s shape. This shape is only the boundary of a 3D object, so it is hollow. If we represent the true shape as an infinite set of points <script type="math/tex">S</script>, then we have <script type="math/tex">x \subset S</script> for a point cloud <script type="math/tex">x</script>. For simplicity, we will assume that <script type="math/tex">x</script> is evenly sampled from the true shape. When point clouds are obtained through scanning objects in the real world (LiDAR and RGB-D scans, or photogrammetry), they are usually uneven and partially occluded.</p>
<p>3D point clouds differ from 2D images in a few ways:</p>
<ol>
<li>The order of points does not matter.</li>
<li>Shape, instead of color, is represented. Along with the assumed even sampling of points, this means that there is exploitable structure in 3D point clouds.</li>
<li>Randomly moving the points by a small amount, dropping a few points, or adding a few points on the true shape should still represent the same object.</li>
</ol>
<h2 id="learning-in-3d-space">Learning in 3D space</h2>
<p>The challenge of learning in 3D space is to deal with the order and density invariance of point clouds. The general idea of the popular <a href="https://arxiv.org/abs/1612.00593">PointNet</a> architecture is to apply the classical multi-layer perceptron (multiplication by weight matrices) to each point <em>separately</em>. From the three coordinates of each point, we use multiple layers of matrix multiplication to obtain 1024 features. Then, max pooling is applied elementwise across the 1024-dimensional feature vectors of all points. Since the max operation is symmetric, the result is invariant to the order of points in a point set. Many other architectures basically just extend this idea with subsampling and other techniques.</p>
<p>For density invariance, the PointNet architecture only selects a set of “critical” points from the entire input point set with the max pooling operation. This set of critical points forms the skeleton of the point cloud, and the rest of the points are basically ignored by the network.</p>
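<p>As a rough illustration of the two paragraphs above, here is a minimal NumPy sketch of a shared per-point perceptron followed by max pooling. The layer sizes, the random weights, and the function names (<code>pointwise_features</code>, <code>global_feature</code>) are made up for this example and are not PointNet's actual configuration. It only demonstrates that the pooled feature does not depend on point order, and that it is fully determined by the critical points.</p>

```python
import numpy as np

def pointwise_features(points, weights):
    """Apply the same stack of weight matrices (with ReLU) to every point separately."""
    h = points
    for w in weights:
        h = np.maximum(h @ w, 0.0)
    return h  # shape (N, feature_dim)

def global_feature(points, weights):
    """Max pool elementwise across points and return the indices of the critical points."""
    feats = pointwise_features(points, weights)
    pooled = feats.max(axis=0)
    critical = np.unique(feats.argmax(axis=0))  # points that win at least one feature
    return pooled, critical

rng = np.random.default_rng(0)
# Toy layer sizes (3 -> 16 -> 32), chosen for illustration only.
weights = [rng.normal(size=(3, 16)), rng.normal(size=(16, 32))]
cloud = rng.normal(size=(100, 3))
perm = rng.permutation(len(cloud))

pooled, critical = global_feature(cloud, weights)
pooled_shuffled, _ = global_feature(cloud[perm], weights)
pooled_critical_only, _ = global_feature(cloud[critical], weights)
```

<p>In this sketch, <code>pooled</code>, <code>pooled_shuffled</code>, and <code>pooled_critical_only</code> all agree, even though <code>critical</code> indexes at most 32 of the 100 points.</p>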
<h1 id="basic-attacks-in-3d">Basic attacks in 3D</h1>
<p>The most straightforward attack is by using gradient descent to directly perturb the position of each point. We are essentially solving the following optimization problem for a point cloud <script type="math/tex">x</script>:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
&\text{maximize}& &J(f_\theta(x + \delta), y)\\
&\text{subject to}& &||\delta||_p \leq \epsilon
\end{align} %]]></script>
<p><script type="math/tex">J</script> represents the loss function, <script type="math/tex">y</script> is the label class, and <script type="math/tex">f_\theta</script> is a neural network parameterized by <script type="math/tex">\theta</script>. Notice that the perturbation <script type="math/tex">\delta</script> is bounded by <script type="math/tex">\epsilon</script> under the <script type="math/tex">L_p</script> norm, and we do not clip <script type="math/tex">x + \delta</script> because unlike 2D image color channels, the positions of 3D points are unbounded. The <script type="math/tex">L_p</script> norm also provides a way for us to measure the perceptibility of adversarial perturbations.</p>
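<p>To make the optimization problem concrete, here is a minimal sketch of an iterative gradient attack with an <script type="math/tex">L_2</script> bound. It assumes a hypothetical <code>grad_fn</code> that returns the gradient of the loss with respect to the points; a toy quadratic loss stands in for a real network here, and the step size schedule is an arbitrary choice for the example.</p>

```python
import numpy as np

def project_l2(delta, eps):
    """Scale delta back onto the L2 ball of radius eps if it lies outside."""
    norm = np.linalg.norm(delta)
    return delta if norm <= eps else delta * (eps / norm)

def iterative_attack(x, grad_fn, eps, steps=10):
    """Take normalized gradient steps that increase the loss, keeping ||delta||_2 <= eps."""
    delta = np.zeros_like(x)
    step_size = eps / steps
    for _ in range(steps):
        g = grad_fn(x + delta)
        g_norm = np.linalg.norm(g)
        if g_norm == 0:
            break  # no gradient signal, so no direction to move in
        delta = project_l2(delta + step_size * g / g_norm, eps)
    return x + delta  # no clipping: point positions are unbounded

# Toy stand-in for the network loss: J(p) = ||p - target||^2 / 2, so grad J = p - target.
rng = np.random.default_rng(0)
x = rng.normal(size=(64, 3))
target = np.zeros_like(x)
toy_loss = lambda p: 0.5 * np.sum((p - target) ** 2)
toy_grad = lambda p: p - target

x_adv = iterative_attack(x, toy_grad, eps=0.5)
```

<p>With a real model, <code>toy_grad</code> would be replaced by backpropagation through <script type="math/tex">J(f_\theta(\cdot), y)</script>.</p>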
<p>Just like with 2D images, this method works very well for generating adversarial attacks. Here are visualizations of adversarial perturbations (bounded with the <script type="math/tex">L_2</script> norm) on a car and a person from the ModelNet40 dataset:</p>
<p><img src="../assets/iter_car.png" alt="" width="250px" /></p>
<p>and</p>
<p><img src="../assets/iter_person.png" alt="" width="250px" /></p>
<p>The perturbed points are orange.</p>
<p>A simple extension to this is to ensure that each point is perturbed by the same amount by normalizing the perturbation vector for each point. Also, it is possible to add a few new points and perturb them instead of perturbing the original points in the point cloud (for example, <a href="https://arxiv.org/abs/1809.07016">here</a> and <a href="https://arxiv.org/abs/1902.10899">here</a>).</p>
<h1 id="defense-removing-outliers">Defense: removing outliers</h1>
<p>Right away, we notice that constraining the perturbations using the <script type="math/tex">L_2</script> norm results in some points being perturbed more than others, and those points become outliers in the point set. Therefore, a simple defense would be to just remove those outlier points. A common method for identifying outliers is based on each point and its <script type="math/tex">k</script>-nearest neighbors. Outliers are found by examining the distribution of distances between each point and its nearest neighbors. Afterwards, the points that are too far away from their <script type="math/tex">k</script>-nearest neighbors are removed from the point set.</p>
<p>Removing outliers is actually very effective as a defense: it performs much better than adversarial training, a classical defense that involves teaching a neural network the correct labels for adversarial examples. In practice, removing outliers works well even if we constrain the amount of perturbation to each <em>point</em> by an <script type="math/tex">\epsilon</script> so that large perturbations are not possible. The perturbation of each point can even be constrained to the average distance between each point and its nearest neighbor in the clean point cloud, and outlier removal would still work.</p>
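<p>One way this kind of outlier removal might look in NumPy is sketched below. The specific rule (drop points whose mean distance to their <script type="math/tex">k</script> nearest neighbors exceeds the mean plus <code>alpha</code> standard deviations) and the values of <code>k</code> and <code>alpha</code> are assumptions for this example, not necessarily what any particular paper uses.</p>

```python
import numpy as np

def remove_outliers(points, k=10, alpha=1.1):
    """Drop points whose mean distance to their k nearest neighbors is unusually large."""
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)            # (N, N) pairwise distances
    np.fill_diagonal(dists, np.inf)                   # ignore each point's distance to itself
    knn_mean = np.sort(dists, axis=1)[:, :k].mean(axis=1)
    threshold = knn_mean.mean() + alpha * knn_mean.std()
    keep = knn_mean <= threshold
    return points[keep], keep

# 200 points on a unit sphere, plus 3 points pushed far away from the surface.
rng = np.random.default_rng(0)
sphere = rng.normal(size=(200, 3))
sphere /= np.linalg.norm(sphere, axis=1, keepdims=True)
far = np.array([[3.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, -3.0]])
cloud = np.vstack([sphere, far])

cleaned, keep = remove_outliers(cloud)
```

<p>The brute-force pairwise distance matrix is quadratic in the number of points, which is fine for a sketch but would usually be replaced by a spatial index for larger clouds.</p>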
<p>Interestingly, this method was proposed in parallel in many different papers on defending against adversarial attacks on point clouds. I guess everyone noticed the outliers generated by adversarial perturbations. The simpler method of randomly removing points was also proposed as a defense.</p>
<h1 id="defense-removing-salient-points">Defense: removing salient points</h1>
<p>Since we perturb points by their gradients, it makes sense to remove adversarial points by examining their gradients to hopefully restore the point cloud of an object. The idea is to first calculate the saliency of each point (at index <script type="math/tex">i</script> in <script type="math/tex">x</script>) through</p>
<script type="math/tex; mode=display">s[i] = \max_j ||(\nabla_{x^\ast} f_\theta(x^\ast)[j])[i]||_2</script>
<p>where <script type="math/tex">x^\ast = x + \delta</script>.</p>
<p>Then, we can sort the points by their saliencies and remove points with high saliencies. In other words, points that have large saliencies, which are the magnitudes of the gradient of each output class with respect to each point, are removed. In practice, this works well as a defense, and it performs better than adversarial training. This defense avoids the issue of being unable to identify adversarial points if there are no outliers, but we are making the assumption that points with large magnitudes of gradients are the perturbed points.</p>
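<p>The saliency formula and the removal step can be sketched as follows. The gradients are assumed to be precomputed by some autodiff framework and passed in as an array of shape (classes, points, 3); the tiny hand-written <code>grads</code> array below is purely illustrative, as are the function names.</p>

```python
import numpy as np

def point_saliency(grads):
    """grads[j, i] is the gradient of class output j w.r.t. point i; shape (C, N, 3)."""
    return np.linalg.norm(grads, axis=-1).max(axis=0)   # s[i] = max_j ||grads[j, i]||_2

def remove_salient_points(points, grads, num_remove):
    """Drop the num_remove points with the highest saliency."""
    s = point_saliency(grads)
    keep = np.sort(np.argsort(s)[: len(points) - num_remove])
    return points[keep], keep

# Hand-made gradients for 4 points and 2 classes (illustrative values only).
points = np.arange(12, dtype=float).reshape(4, 3)
grads = np.zeros((2, 4, 3))
grads[0, 1] = [3.0, 4.0, 0.0]   # norm 5
grads[1, 1] = [1.0, 0.0, 0.0]   # norm 1, so class 0 determines the saliency of point 1
grads[1, 2] = [0.0, 0.0, 2.0]   # norm 2

s = point_saliency(grads)
cleaned, keep = remove_salient_points(points, grads, num_remove=1)
```

<p>Here point 1 has the largest saliency, so it is the one that gets removed.</p>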
<h1 id="defense-limitations">Defense: limitations</h1>
<p>The reason why point removal defenses perform so well is that they rely on gradient masking. The max pooling operation in PointNet ignores a set of points, which causes them to not get any gradient flow. These points cannot be perturbed with gradient-based methods, so they can accurately represent the unperturbed shape of a point cloud once the perturbed points are removed. However, relying on masked gradients does not lead to truly robust models, as we will see in the shape attacks that are proposed later.</p>
<h1 id="minimizing-the-perceptibility-of-perturbations">Minimizing the perceptibility of perturbations</h1>
<p>So far, when we perturb a point cloud, we do not really take into account the intrinsic shape that it represents with the <script type="math/tex">L_p</script> norms. Therefore, it may prove fruitful to examine some other metrics for measuring the perturbation on 3D point clouds. One such metric is the Hausdorff distance between two sets:</p>
<script type="math/tex; mode=display">\mathcal{H}(A, B) = \max_{a \in A} \min_{b \in B} ||b - a||_2</script>
<p>This one-sided form is not technically a metric (it is not symmetric), but it enables us to measure the distance between a perturbed point set <script type="math/tex">A</script> and a clean point set <script type="math/tex">B</script>. In words, the Hausdorff distance is defined as the maximum distance between each point in <script type="math/tex">A</script> and its closest point in <script type="math/tex">B</script>. For an adversarial point set <script type="math/tex">x^\ast</script>, we have two ways of using the Hausdorff distance:</p>
<ol>
<li><script type="math/tex">\mathcal{H}(x^\ast, x)</script>: comparing the adversarial point cloud to the clean point cloud. The advantage of this over the <script type="math/tex">L_p</script> norms is that it allows the positions of two points to be swapped, making it a much more natural metric for 3D point clouds that are order-invariant. This was proposed in <a href="https://arxiv.org/abs/1809.07016">this paper</a>.</li>
<li><script type="math/tex">\mathcal{H}(x^\ast, S)</script>: comparing the adversarial point cloud to the true shape of the point cloud <script type="math/tex">x</script>. This takes into account the shape of the point cloud, and is point density invariant. This was proposed in my papers.</li>
</ol>
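<p>For finite point sets, the one-sided Hausdorff distance defined above can be computed directly. The sketch below uses a brute-force distance matrix and three tiny hand-picked point sets to show the order invariance and the asymmetry; comparing against the true shape <script type="math/tex">S</script> would require a surface representation, which is not covered here.</p>

```python
import numpy as np

def hausdorff(a, b):
    """One-sided Hausdorff distance: max over a in A of the distance to its closest b in B."""
    dists = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (|A|, |B|)
    return dists.min(axis=1).max()

clean = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
swapped = clean[[1, 0, 2]]                      # same set of points, different order
shifted = clean + np.array([0.0, 0.0, 0.5])     # every point moved by 0.5
subset = clean[:2]                              # the third point is dropped

d_swapped = hausdorff(swapped, clean)           # swapping points costs nothing
d_shifted = hausdorff(shifted, clean)
d_sub_to_clean = hausdorff(subset, clean)
d_clean_to_sub = hausdorff(clean, subset)       # differs from the line above
```

<p>Swapping two points gives a distance of zero, while an <script type="math/tex">L_p</script> norm on the difference of the two arrays would not.</p>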
<p>To maximize the loss of the neural network while ensuring that <script type="math/tex">\mathcal{H}(x^\ast, S) \leq \epsilon</script>, we can use projected gradient descent to project the perturbation of each point onto the 3D object that the point cloud was sampled from. If the 3D object is unavailable, we can use a triangulation algorithm, like the alpha shapes algorithm, to infer the object shape. For faster projection speed, we can build a metric tree, like the VP-tree, on the triangular mesh, by representing each triangle as a point. I call this method the “distributional attack”, since it changes the distribution of points near the shape of a 3D object. The great thing about this method is that we can generate perturbations with <script type="math/tex">\mathcal{H}(x^\ast, S) = 0</script>, which means that we only move the points around on the shape <script type="math/tex">S</script>. If the 3D object is available, this creates <em>very</em> imperceptible perturbations, and it reaches around 25% success rate on PointNet/PointNet++. Even with a slightly higher Hausdorff distance and an approximated surface <script type="math/tex">S</script>, the perturbations remain imperceptible, while the success rate of the attack becomes much higher (above 80%).</p>
<p>Here is a visualization of the distributional attack on an approximated shape, with a small Hausdorff distance:</p>
<p><img src="../assets/dist_lamp.png" alt="" width="250px" /></p>
<p>Here is a visualization with the true shape and a Hausdorff distance of exactly zero, with perturbed points in orange:</p>
<p><img src="../assets/grad_proj_car.png" alt="" width="250px" /></p>
<p>Although the perturbation of each point is very small and kept on the shape of the object, removing points is effective as a defense against this type of attack.</p>
<h1 id="shape-attacks">Shape attacks</h1>
<p>In addition to constraining the Hausdorff distance, we can add another constraint to ensure a uniform density distribution of points after perturbing points. Therefore, the perturbations must change the overall shape of a point cloud, since small changes to the point distribution are not allowed. This is more realistic than having a few points suspended in mid-air, far away from the main 3D object. Also, it is easier to control the shape than the distribution of points obtained through a scanner. Shape attacks are also effective against point removal defenses, as they modify or destroy the density information that outlier removal relies on, and removing a few points from a perturbed shape does not make it clean again.</p>
<h2 id="perturbation-resampling">Perturbation Resampling</h2>
<p>We can express the idea of having the points <script type="math/tex">x^\ast</script> evenly distributed on a shape <script type="math/tex">S^\ast</script> as an optimization problem maximizing the distance between points:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
&\text{maximize}& &\min_{i \in \{1 \ldots N\}} \min_{j \in \{1 \ldots N\}\setminus\{i\}} ||x^\ast[j] - x^\ast[i]||_2\\
&\text{subject to}& &x^\ast \subset S^\ast
\end{align} %]]></script>
<p>However, we do not need to solve this problem exactly; a greedy approximation using farthest point sampling works fine. The perturbation resampling attack is simple: just perturb points using gradient descent, but resample a portion of the points on the estimated adversarial shape that is determined by the perturbed points.</p>
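<p>A minimal sketch of greedy farthest point sampling is shown below. In the resampling attack, the candidates would be points sampled on the estimated adversarial shape; here a hand-made set of points along a line is used instead, and the starting index is an arbitrary choice.</p>

```python
import numpy as np

def farthest_point_sampling(points, num_samples, start=0):
    """Greedily pick points so each new pick is as far as possible from those already chosen."""
    chosen = [start]
    min_dist = np.linalg.norm(points - points[start], axis=1)  # distance to the chosen set
    for _ in range(num_samples - 1):
        nxt = int(min_dist.argmax())
        chosen.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(points - points[nxt], axis=1))
    return np.array(chosen)

# Candidates along a line: a dense cluster near 0 plus a few spread-out points.
xs = np.array([0.0, 0.01, 0.02, 0.03, 1.0, 2.0, 3.0])
candidates = np.stack([xs, np.zeros_like(xs), np.zeros_like(xs)], axis=1)
picked = farthest_point_sampling(candidates, 4)
```

<p>The four picks end up at positions 0, 1, 2, and 3 along the line, skipping the dense cluster, which is the even spacing that the optimization problem above asks for.</p>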
<p>The final result is</p>
<p><img src="../assets/resample_lamp.png" alt="" width="250px" /></p>
<p>Note that sometimes, the perturbation of a point is so large that the triangulation algorithm cannot include it in the triangulation, so it becomes an outlier.</p>
<h2 id="adversarial-sticks">Adversarial Sticks</h2>
<p>We can also add new adversarial features to the shape of the point cloud. In <a href="https://arxiv.org/abs/1809.07016">this paper</a>, they propose adding new clusters of points, or even smaller versions of other point clouds that float in mid-air, near a clean point cloud. We take a simpler and more realistic route by adding a few sticks, or line segments, that are attached to the shape of the clean point cloud. Then, the adversarial object will look like a porcupine. Conceptually, we need to figure out where to place the sticks on an object, and the length/direction of each stick. Formally, we are solving the following optimization problem:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
&\text{maximize}_{\alpha, \beta}& &J(f_\theta(x \cup \mathcal{S}_\kappa(\alpha, \beta)), y)\\
&\text{subject to}& &||\beta||_2 \leq \epsilon,\\
&&&\alpha \subset S
\end{align} %]]></script>
<p><script type="math/tex">\alpha</script> is a set of points representing where the sticks are attached to the shape <script type="math/tex">S</script>, <script type="math/tex">\beta</script> is a set of vectors representing the direction and length of each stick, and <script type="math/tex">\mathcal{S}_\kappa</script> returns a set of <script type="math/tex">\kappa</script> points sampled on the sticks.</p>
<p>Like perturbation resampling, we can just approximate this by first perturbing <em>points</em> using gradient descent, and then connecting those points to the closest point on the surface of a 3D object at the end. Finally, we need to sample points on the adversarial sticks. A visualization of the result is:</p>
<p><img src="../assets/sticks_lamp.png" alt="" width="250px" /></p>
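<p>The sampling function <script type="math/tex">\mathcal{S}_\kappa(\alpha, \beta)</script> could be sketched as below, treating each stick as the line segment from <code>alpha[i]</code> to <code>alpha[i] + beta[i]</code>. Picking a stick uniformly at random for every sample, regardless of stick length, is an assumption made for this example, and the name <code>sample_on_sticks</code> is hypothetical.</p>

```python
import numpy as np

def sample_on_sticks(alpha, beta, kappa, rng):
    """Sample kappa points on the line segments from alpha[i] to alpha[i] + beta[i]."""
    stick = rng.integers(len(alpha), size=kappa)    # which stick each sample lies on
    t = rng.uniform(0.0, 1.0, size=(kappa, 1))      # position along that stick
    return alpha[stick] + t * beta[stick]

rng = np.random.default_rng(0)
alpha = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])   # attachment points on the shape
beta = np.array([[0.0, 0.0, 0.5], [0.0, 0.3, 0.0]])    # stick direction and length
samples = sample_on_sticks(alpha, beta, kappa=50, rng=rng)
```

<p>The adversarial point cloud would then be the union of the clean points and <code>samples</code>.</p>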
<h2 id="adversarial-sinks">Adversarial Sinks</h2>
<p>With our previous techniques, we need to resample points during gradient descent. This works, but it feels like a weird hack. So is there a fully differentiable way to perturb the shape of a point cloud?</p>
<p>Let us assume that we have a few guide points (<script type="math/tex">s_f</script>) for this perturbation process. We also have <script type="math/tex">s_0 \subset x</script>, which are the starting point positions for <script type="math/tex">s_f</script>. Then, we can perturb points on the shape by attracting them to the sink points <script type="math/tex">s_f</script>. This attraction falls off with distance <script type="math/tex">r</script> according to the Gaussian radial basis function:</p>
<script type="math/tex; mode=display">\phi_{\mu'}(r) = e^{-(\frac{r}{\mu'})^2}</script>
<p>The idea is similar to perturbing a few points separately through basic gradient descent, but since the sink points have attraction, they modify the overall shape of the point cloud. Each point is affected by the sum of the attractions of the <script type="math/tex">\sigma</script> sink points:</p>
<script type="math/tex; mode=display">x^\ast[i] = x[i] + \tanh\big(\sum_{j = 1}^\sigma (s_f[j] - x[i]) \phi_{\mu'} (||s_0[j] - x[i]||_2)\big), \quad\forall i \in \{1 \ldots N\}</script>
<p>where <script type="math/tex">\mu'</script> is</p>
<script type="math/tex; mode=display">\mu' = \frac{\mu}{N} \sum_{i = 1}^N \min_{j \in \{1 \ldots N\}\setminus\{i\}} ||x[j] - x[i]||_2</script>
<p>to ensure that the tunable attraction falloff depends on the distribution of points. Note that the <script type="math/tex">\tanh</script> is used to clip the perturbations. Since the entire perturbation expression is differentiable, we can use something like Adam to maximize the loss while minimizing the perturbation. A visualization of the result:</p>
<p><img src="../assets/sinks_lamp.png" alt="" width="250px" /></p>
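<p>The forward computation of the sink perturbation can be written directly from the formulas above. The sketch below keeps the sink positions <code>sf</code> fixed instead of optimizing them with Adam, and the value <code>mu=2.0</code> and the ten-point line are arbitrary choices for the example.</p>

```python
import numpy as np

def mean_nn_distance(x):
    """Average distance from each point to its nearest neighbor."""
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    return d.min(axis=1).mean()

def apply_sinks(x, s0, sf, mu):
    """Pull every point toward the sinks sf, weighted by a Gaussian RBF centered at s0."""
    mu_prime = mu * mean_nn_distance(x)
    r = np.linalg.norm(s0[None, :, :] - x[:, None, :], axis=-1)    # (N, sigma) distances
    phi = np.exp(-(r / mu_prime) ** 2)                             # attraction falloff
    pull = ((sf[None, :, :] - x[:, None, :]) * phi[:, :, None]).sum(axis=1)
    return x + np.tanh(pull)                                       # tanh clips the perturbation

# Ten points on a line with unit spacing, and one sink lifted above the first point.
x = np.stack([np.arange(10.0), np.zeros(10), np.zeros(10)], axis=1)
s0 = x[[0]]
sf = s0 + np.array([[0.0, 0.0, 1.0]])
x_adv = apply_sinks(x, s0, sf, mu=2.0)
displacement = np.linalg.norm(x_adv - x, axis=1)
```

<p>The first point is lifted by <script type="math/tex">\tanh(1)</script>, while the point farthest from <script type="math/tex">s_0</script> barely moves. In the actual attack, <code>sf</code> would be the variable that is optimized.</p>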
<p>This method is my favorite out of all the shape attacks because it is fully differentiable, so it feels “clean” and not as hacky as resampling points. It was inspired by black holes due to my interest in strange physics stuff like quantum mechanics and spacetime.</p>
<p>For a quick run down of all the attacks proposed in my second paper, just look at this graphic:</p>
<p><img src="../assets/attacks_2d.png" alt="" width="800px" /></p>
<p>The shape attacks all perform very well against point removal defenses, and they perform much better than the iterative gradient <script type="math/tex">L_2</script> attack, which represents a naive pointwise attack:</p>
<p><img src="../assets/removing_points.png" alt="" width="700px" /></p>
<h1 id="removing-points-as-an-attack">Removing points as an attack</h1>
<p>Another avenue for attacking point clouds is through <em>removing</em> points (<a href="https://arxiv.org/abs/1812.01687">this</a> and <a href="https://arxiv.org/abs/1902.10899">this</a> papers). This is done by dropping points in the critical point set that contribute to decreasing the loss between the model output and the correct class. The saliency (gradient) is used to find these critical points. Note that this idea is very similar to the defense method of removing salient points.</p>
<p>Obviously, removing more points as a defense will not help point clouds that are attacked through point removal. However, this attack is not realistic, as it is difficult to control the lost points from point clouds that are directly scanned from 3D objects.</p>
<p>As a side note, I did independently come up with a similar method to this a long time ago (it is still in my code), but I did not test multiple iterations of this attack, so it did not perform very well. This caused me to (unfortunately) scrap the idea. The idea of removing points is quite popular, and similar attacks have been proposed independently in multiple papers.</p>
<p>An upsampling network (<a href="https://arxiv.org/abs/1812.11017">this paper</a>) can be used to defend against adversarially removed points, but it is quite easy to also attack the upsampling network, since it is fully differentiable anyways.</p>
<h1 id="beyond-3d-point-sets">Beyond 3D point sets</h1>
<p>Point clouds are easy to perturb because changing the numerical values in the input point set directly changes the point locations. For attacking other formats like voxels or meshes, it may be easier to first convert them to point clouds before perturbing them.</p>
<h1 id="conclusion-and-future-directions">Conclusion and future directions</h1>
<p>So far, creating truly robust defenses by using information exclusive to 3D point clouds, like point density, is still an open problem. In the beginning, we would think that 3D space is somehow more easily defensible than 2D space, but my later work showed that this is not true. 3D space and 2D space have different strengths and vulnerabilities, and there are attacks and defenses that are exclusive to one domain but not the other. As with much of the work on defenses in 2D, creating defenses that are robust in 3D space is not an easy task, since it is quite easy to attack the assumptions that are made in certain defense techniques. From my short journey through adversarial machine learning, I envision the truly robust defense methods to be mathematically provable and domain-agnostic, so they do not need to make use of domain-specific properties (like the distribution of points) that may be easily circumvented.</p>
Thu, 22 Aug 2019 00:00:00 +0000
https://blog.liudaniel.com/birth-of-a-new-sub-sub-field
Coming up with new research ideas
<p>This post contains some notes on methods for obtaining research ideas, so I can remember to apply them when I am stuck.</p>
<p>I believe that working efficiently is more important than blindly throwing time at a project. For my research work, I spend a lot more time <em>thinking</em> instead of <em>implementing</em> (at least for projects that are not focused on reimplementing existing ideas). Also, I believe it is important to have solid programming and problem solving (competitive math/programming) fundamentals, so the bottleneck in the research process is in coming up with new ideas, not implementing the ideas. In terms of actually coming up with ideas, here are a few methods:</p>
<h2 id="1-working-forwards">1. Working forwards</h2>
<p>Credit for this idea goes to <a href="https://twitter.com/k3yavi">Avi Srivastava</a> for pointing this out to me.</p>
<p>The idea is to first read existing research papers to closely examine previous work. Then, attempt to find details that they miss, or “future work” that they have not done yet. There should always be <em>something</em> that the previous work misses, because researchers have limited time to put into their work.</p>
<p>The UMICollapse tool (<a href="https://www.biorxiv.org/content/10.1101/648683v2">paper</a>) I created was based on this idea. The insight was to speed up the slow pairwise UMI comparison step that was used in previous tools. This step is a bottleneck in the speed of many UMI collapsing algorithms for grouping UMIs, but many previous works only focused on how to accurately group UMIs, not how it can be done efficiently. To ensure that the algorithms I introduce are novel compared to previous algorithms, I adapted the previous well-known algorithms to the UMI deduplication task specifically, by using different tricks based on the UMI grouping algorithms. This is another example of the idea of working forwards from previous ideas by adapting them for a specific domain. For more information on UMI collapsing tricks, read my <a href="https://blog.liudaniel.com/n-grams-BK-trees">blog post</a> on it.</p>
<h2 id="2-working-backwards">2. Working backwards</h2>
<p>The idea is to examine the main research problem from a completely different angle that is not yet examined in a previous work. Usually, this involves defining a new metric. For example, if the previous work claims to achieve a high score under a certain metric, then come up with a different meaningful metric and attempt to get better results under this new metric.</p>
<p>My work on adversarial attacks in 3D space (<a href="https://arxiv.org/abs/1908.06062">paper</a>) is an example of defining new metrics. In addition to creating effective attacks against 3D neural networks, I examined alternative goals like the perceptibility of the adversarial perturbations to humans, the ease of construction of the adversarial examples in real-life, and the effectiveness of the attacks against defensive techniques. In this case, I came up with new ideas by asking myself, “what other desirable qualities of adversarial examples do we want?”</p>
<h2 id="3-diversity">3. Diversity</h2>
<p>When working on a research problem and its subproblems, especially through the working forwards method, try to find diverse, marginally related ideas and attempt to connect them to the research problem. Sometimes, this connection may be novel. Even if this connection proves fruitless, it may reveal an alternative approach to the problem. The marginally related ideas can be found through learning. In fact, when learning about new topics, I often think about connecting those new topics to each of the research problems I have tackled before. Instead of looking for an overarching idea, this method involves looking at the tools, and then asking what those tools can build if they are used together.</p>
<p>To learn different ideas, I do the following in my spare time:</p>
<ul>
<li>Read relevant research subreddits on Reddit.</li>
<li>Read Hacker News.</li>
<li>Read tweets about blog posts and threads about new papers on Twitter.</li>
<li>Watch videos that summarize recent research, like <a href="https://www.youtube.com/channel/UCbfYPyITQ-7l4upoX8nvctg">Two Minute Papers</a> on YouTube.</li>
<li>Skim a paper and then google things in the paper that I do not know, but seem promising. This usually involves reading Wikipedia to learn more.</li>
</ul>
<p>Sometimes I come up with a random idea not exactly related to what I am working on, and then I google it to find out if it has been done. Often, someone else has already examined this idea. In this case, I can still learn a new concept that I can use in future work. Other times, if I am lucky, I get a new connection between two different ideas that are related to my current research, and I can get a new algorithm or analysis technique out of it.</p>
<p>An example of this diversity method in action is how I came up with the adversarial sinks attack for 3D points (<a href="https://arxiv.org/abs/1908.06062">paper</a>). The general premise of the attack is to pull points on a 3D object towards sink points. This idea was inspired by black holes and how they attract other objects, but it was ultimately named after the source/sinks in network flow problems. I also came up with the idea of actually attempting to approximate the true shape of a 3D point set through alpha shapes, which stems from Delaunay triangulations that I considered for nearest neighbor searches in my bioinformatics work on finding similar DNA sequences.</p>
<p>Another example of this method is how I came up with using a domain-specific language for describing patterns while preprocessing DNA sequences (<a href="https://peerj.com/articles/7170/">paper</a>). I knew that the string matching ideas for finding patterns were related to matching regular expressions, so I thought, “why not go one step further and make a full <em>language</em>?” Sometimes, generalizing further works well.</p>
Mon, 19 Aug 2019 00:00:00 +0000
https://blog.liudaniel.com/new-research-ideas
n-grams + BK-trees: tricks for collapsing UMIs faster
<p>Creating a new algorithm or data structure is often not just one major insight, but a collection of tricks that work together to produce better results. With that said, what kind of “better results” are we looking for? For computer scientists, this question often results in three answers: higher accuracy, faster speed, or lower memory footprint. In a recent <a href="https://www.biorxiv.org/content/10.1101/648683v2">paper</a> (<a href="https://github.com/Daniel-Liu-c0deb0t/UMICollapse">code</a>), I utilized this answer to explore the problem of efficiently deduplicating sequenced DNA reads with Unique Molecular Identifiers (UMIs) from a computer science perspective.</p>
<h1 id="a-bit-of-background">A bit of background</h1>
<p>Each UMI is a short, unique random sequence that corresponds to a certain DNA fragment. When PCR amplification is applied to DNA fragments before sequencing, each DNA fragment is duplicated multiple times. UMIs allow us to figure out which sequenced reads are duplicates by grouping reads through their UMIs. The reason why we do not group using the DNA fragments themselves is because there may be multiple copies of the same DNA fragment before PCR amplification, and we want to be able to accurately count the duplicate DNA fragments. UMIs are applied in many experiments, including single-cell RNA sequencing.</p>
<p>As with many bioinformatics tasks that involve processing sequenced reads, the difficulty lies in handling sequencing and PCR amplification errors. Therefore, the deduplication task is just somehow grouping similar UMIs together, where similarity is defined by counting the number of mismatches between two UMI sequences. Since UMIs are used very often, this task has been thoroughly explored. Most notably, the “directional adjacency” method from <a href="https://genome.cshlp.org/content/early/2017/01/18/gr.209601.116.abstract">UMI-tools</a> involves first obtaining the frequency of each unique UMI, and then grouping low frequency UMIs with high frequency UMIs. The main intuition behind this idea is that UMIs that appear frequently have a very high chance of being correctly amplified and sequenced, while UMIs that appear less frequently are most likely wrong. This algorithm is discussed in more detail in <a href="https://cgatoxford.wordpress.com/2015/08/14/unique-molecular-identifiers-the-problem-the-solution-and-the-proof/">this</a> blog post. An image comparison of different grouping algorithms, from the blog post:</p>
<p><img src="../assets/UMI_tools_grouping_methods.png" alt="" width="700px" /></p>
<p>Consider what happens for each UMI (which we will call the “queried” UMI): we need to quickly find UMIs that are similar to the queried UMI that also have a UMI frequency lower than our threshold. Let us call this step a single “query”. After multiple queries, we build the full UMI graph as shown above by connecting the similar UMIs. Each connection is directional, so we connect higher frequency UMIs to lower frequency UMIs. Then, we need to group lower frequency UMIs together with the high frequency UMI using the graph, and we assume that the corresponding sequenced reads for the grouped UMIs all originate from the same DNA fragment.</p>
<h1 id="improvements">Improvements?</h1>
<p>Applying the three things that computer scientists care about, we know that we can basically make three improvements to this process: make it more accurate, speed it up, or lower the memory footprint. We can attempt to improve the accuracy with a better algorithm than directional adjacency, but it is difficult without biological insights and actual experience with UMI data. Therefore, we are forced to settle on attempting to improve the other two areas while ensuring that we do not make any compromises on the accuracy of the directional adjacency algorithm. Directly attempting to lower the memory footprint is kind of pointless since we have to store all of the sequenced reads in memory no matter what, and most computers have a ton of memory anyways. Thus, the only option is to improve the speed of the deduplication process while ensuring that the memory footprint does not increase significantly.</p>
<p>The first step is identifying the bottleneck. There are two main time-consuming procedures in the directional adjacency algorithm: building the graph of UMIs and extracting groups of UMIs. Since each UMI can only belong to one group, it is easy to see that extracting groups of UMIs from the graph only takes time linear in the number of unique UMIs. Building the graph naively requires comparisons between each pair of unique UMIs, which scales <em>quadratically</em> with the number of UMIs. It turns out that many tools made for grouping UMIs use the naive method for building the graph, which leaves a lot of room for improvement.</p>
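<p>To make the bottleneck concrete, here is a minimal Python sketch of the naive approach: an all-pairs graph build followed by a linear group extraction. The function names are made up for illustration, and the edge rule <code>freq[u] >= 2 * freq[v] - 1</code> is assumed from the UMI-tools description of directional adjacency; this is not the code from any particular tool.</p>

```python
from collections import Counter


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def naive_directional_groups(umis: list[str], k: int = 1) -> list[list[str]]:
    """Group UMIs by comparing every pair of unique UMIs (quadratic time)."""
    freq = Counter(umis)
    # Process high frequency UMIs first; ties are broken alphabetically.
    unique = sorted(freq, key=lambda u: (-freq[u], u))

    # Quadratic step: edge u -> v when v is within k mismatches of u and
    # freq[u] >= 2 * freq[v] - 1 (assumed UMI-tools style rule).
    edges = {u: [] for u in unique}
    for u in unique:
        for v in unique:
            if u != v and hamming(u, v) <= k and freq[u] >= 2 * freq[v] - 1:
                edges[u].append(v)

    # Linear step: each UMI joins the first group that reaches it.
    seen = set()
    groups = []
    for u in unique:
        if u in seen:
            continue
        seen.add(u)
        group, stack = [], [u]
        while stack:
            cur = stack.pop()
            group.append(cur)
            for v in edges[cur]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        groups.append(group)
    return groups
```

<p>The double loop over <code>unique</code> is the part that the rest of this post tries to avoid.</p>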
<h1 id="n-grams-bk-trees-a-medley-of-tricks">n-grams BK-trees: a medley of tricks</h1>
<p>As a recap, our goal is to find a list of candidate UMIs that are similar to a queried UMI, and then narrow down the list to only UMIs that are low frequency. The exact frequency value of a “low frequency” UMI depends on the frequency of other UMIs, so in general we will narrow down the list of similar UMIs to only the low frequency UMIs with a fixed frequency threshold that represents the upper bound frequency. Now, we can figure out how to speed this up with different techniques.</p>
<h2 id="trick-1-n-grams">Trick 1: n-grams</h2>
<p>The first trick we can apply is a simple and intuitive one. The idea is to decompose each UMI into multiple “fingerprints”, and build a mapping from each fingerprint to a list of all UMIs that have that fingerprint. This fingerprint is simply a contiguous segment of the UMI, called an n-gram, and its location in the UMI. The only difficult part of this trick is how we select the segments to allow errors (mismatches) in the UMIs. It is not hard to see that if we want to allow up to <script type="math/tex">k</script> errors, then we only need to split the UMIs into <script type="math/tex">k + 1</script> n-grams: by the pigeonhole principle, <script type="math/tex">k</script> mismatches can touch at most <script type="math/tex">k</script> of the <script type="math/tex">k + 1</script> n-grams, so at least one n-gram must match exactly. Therefore, for each queried UMI, we only need to search other UMIs that share at least one n-gram with the queried UMI. This allows us to prune UMIs by the number of errors we allow. The interesting part of this trick is that as the UMI length increases, each n-gram becomes longer, and thus rarer, and more UMIs can be pruned through this method.</p>
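<p>A small Python sketch of this fingerprint index is below. The class and helper names (<code>NgramIndex</code>, <code>split_points</code>) are hypothetical, and splitting into near-equal segments is just one reasonable choice; the only requirement is that the UMI is cut into <code>k + 1</code> contiguous pieces.</p>

```python
from collections import defaultdict


def split_points(length: int, parts: int) -> list[tuple[int, int]]:
    """Split range(length) into `parts` contiguous, near-equal (start, end) spans."""
    base, extra = divmod(length, parts)
    spans, start = [], 0
    for i in range(parts):
        end = start + base + (1 if i < extra else 0)
        spans.append((start, end))
        start = end
    return spans


class NgramIndex:
    """Maps each (position, n-gram) fingerprint to the UMIs that contain it."""

    def __init__(self, umi_length: int, k: int):
        self.k = k
        self.spans = split_points(umi_length, k + 1)
        self.table = defaultdict(list)

    def add(self, umi: str) -> None:
        for start, end in self.spans:
            self.table[(start, umi[start:end])].append(umi)

    def query(self, umi: str) -> set[str]:
        # Candidates share at least one fingerprint with the queried UMI.
        candidates = set()
        for start, end in self.spans:
            candidates.update(self.table.get((start, umi[start:end]), ()))
        # Every candidate still has to be verified against the error budget.
        return {c for c in candidates
                if sum(a != b for a, b in zip(c, umi)) <= self.k}
```

<p>Note that the final verification loop touches every candidate, which is exactly the cost that the next trick tries to reduce.</p>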
<h2 id="trick-2-bk-trees">Trick 2: BK-trees</h2>
<p>BK-trees are metric trees that partition the (in our case) space of all UMIs into shells. In other words, we pick a UMI and partition the UMI space into multiple shells of different radii that are centered around that picked UMI. The tree structure is constructed by repeatedly picking parent UMIs to partition the UMI space, and connecting each parent UMI to children UMIs that lie in each of the shells of different radii. For a more detailed walkthrough of how BK-trees work, read <a href="https://signal-to-noise.xyz/post/bk-tree/">this</a> blog post, which contains an excellent introduction.</p>
<p>Here is an example of a BK-tree:</p>
<p><img src="../assets/bktree.png" alt="" width="500px" /></p>
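<p>For readers who prefer code, here is a minimal BK-tree sketch in Python over the Hamming distance. The names <code>BKNode</code> and <code>BKTree</code> are illustrative and this is a textbook-style version, not the implementation from the paper. Each child is stored under its distance to the parent (its shell), and a query only descends into shells that the triangle inequality cannot rule out.</p>

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


class BKNode:
    def __init__(self, umi):
        self.umi = umi
        self.children = {}  # shell radius (distance to this node) -> child node


class BKTree:
    def __init__(self):
        self.root = None

    def add(self, umi):
        if self.root is None:
            self.root = BKNode(umi)
            return
        node = self.root
        while True:
            d = hamming(umi, node.umi)
            if d == 0:
                return  # already stored
            child = node.children.get(d)
            if child is None:
                node.children[d] = BKNode(umi)
                return
            node = child

    def query(self, umi, k):
        """Return all stored UMIs within k mismatches of `umi`."""
        found = []
        stack = [] if self.root is None else [self.root]
        while stack:
            node = stack.pop()
            d = hamming(umi, node.umi)
            if d <= k:
                found.append(node.umi)
            # Triangle inequality: matches can only sit in shells d-k .. d+k.
            for shell, child in node.children.items():
                if d - k <= shell <= d + k:
                    stack.append(child)
        return found
```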
<p>The main problem in the n-grams algorithm is that we have to check every single UMI that shares at least one n-gram with our queried UMI. Many of these UMIs may not be similar to our queried UMI, so it is definitely a good idea to transform the list used in the n-grams method into something else, like a BK-tree. Then, each unique n-gram maps to a BK-tree that contains all of the UMIs associated with that n-gram, and we can prune even more UMIs from our search space. Here is an example of the n-grams BK-trees algorithm:</p>
<p><img src="../assets/ngrams_bktrees.png" alt="" width="500px" /></p>
<p>The main advantage of the n-grams BK-trees data structure over other “brute force” type algorithms that go through all possible errors that could occur in a UMI string is that it does not directly scale exponentially as the number of errors we allow or the UMI length increases.</p>
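<p>Putting the two tricks together might look like the following Python sketch, where each fingerprint maps to its own small BK-tree instead of a flat list. All names here are hypothetical, and a tree node is represented as a plain <code>[umi, children]</code> pair to keep the example short.</p>

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def bk_add(node, umi):
    """Insert `umi` below `node`; a node is a [umi, {distance: child}] pair."""
    while True:
        d = hamming(umi, node[0])
        if d == 0:
            return  # already stored in this tree
        if d not in node[1]:
            node[1][d] = [umi, {}]
            return
        node = node[1][d]


def bk_query(node, umi, k, out):
    """Add every UMI in the tree within k mismatches of `umi` to the set `out`."""
    d = hamming(umi, node[0])
    if d <= k:
        out.add(node[0])
    for shell, child in node[1].items():
        if d - k <= shell <= d + k:
            bk_query(child, umi, k, out)


class NgramBKTrees:
    def __init__(self, umi_length, k):
        self.k = k
        step = umi_length // (k + 1)
        # k + 1 contiguous segments; the last one absorbs any remainder.
        self.spans = [(i * step, (i + 1) * step if i < k else umi_length)
                      for i in range(k + 1)]
        self.trees = {}  # (start, n-gram) -> root of a BK-tree

    def add(self, umi):
        for start, end in self.spans:
            key = (start, umi[start:end])
            if key in self.trees:
                bk_add(self.trees[key], umi)
            else:
                self.trees[key] = [umi, {}]

    def query(self, umi):
        out = set()
        for start, end in self.spans:
            root = self.trees.get((start, umi[start:end]))
            if root is not None:
                bk_query(root, umi, self.k, out)
        return out
```

<p>Only the trees whose fingerprint matches the queried UMI are ever touched, and within each tree the shell check skips whole subtrees.</p>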
<h2 id="trick-3-prune-by-frequency">Trick 3: prune by frequency</h2>
<p>A seemingly obvious property of trees is that once we know that each node in a subtree does not satisfy our criteria, we can just skip that entire subtree. If we compute the minimum frequency of all UMIs in each subtree in a BK-tree and save those values, then we can easily figure out whether a subtree contains at least one UMI with a frequency that is less than our fixed frequency threshold. With this, we can skip subtrees that only contain UMIs with frequencies greater than our threshold.</p>
<p>This property is important because it hints that we should prune UMIs by frequency while searching for similar UMIs to build the UMI graph. We can avoid visiting UMIs in the BK-trees that are similar to our queried UMI, but have a UMI frequency higher than our frequency threshold, by keeping track of the minimum UMI frequency of each subtree in each BK-tree.</p>
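<p>As a rough sketch of this idea in Python, each node can carry a <code>min_freq</code> field that is updated on the way down during insertion and checked first during a query. The names are illustrative, each unique UMI is assumed to be inserted only once, and treating the threshold as inclusive (<code>freq &lt;= max_freq</code>) is a choice made for this example.</p>

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


class Node:
    def __init__(self, umi, freq):
        self.umi = umi
        self.freq = freq
        self.min_freq = freq  # smallest frequency anywhere in this subtree
        self.children = {}    # shell radius -> child node


def insert(root, umi, freq):
    """Insert a (unique) UMI, lowering min_freq on every node along the path."""
    node = root
    while True:
        node.min_freq = min(node.min_freq, freq)
        d = hamming(umi, node.umi)
        child = node.children.get(d)
        if child is None:
            node.children[d] = Node(umi, freq)
            return
        node = child


def query(root, umi, k, max_freq):
    """UMIs within k mismatches of `umi` whose frequency is at most max_freq."""
    found, stack = [], [root]
    while stack:
        node = stack.pop()
        if node.min_freq > max_freq:
            continue  # nothing in this subtree is low frequency enough
        d = hamming(umi, node.umi)
        if d <= k and node.freq <= max_freq:
            found.append(node.umi)
        for shell, child in node.children.items():
            if d - k <= shell <= d + k:
                stack.append(child)
    return found
```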
<h2 id="unused-trick-sorting-by-frequency">Unused trick: sorting by frequency</h2>
<p>Since we are pruning by frequencies, why not go one step further and also sort the UMIs by frequency before adding them one-by-one into the BK-trees? This allows lower frequency UMIs to be added closer to the root of the tree and higher frequency UMIs to be added near the leaves of the tree. As subtrees often mostly include UMIs that are far away from the root, it is more likely for a subtree to contain UMIs with frequencies higher than the threshold the farther away we get from the root. That means that more subtrees are pruned overall.</p>
<p>In the end, we will not use this when initializing the n-grams BK-trees data structure because it requires sorting the UMIs by frequency, which actually slows down the initialization step in practice.</p>
<h2 id="trick-4-literally-extract-umis">Trick 4: literally extract UMIs</h2>
<p>So far, we are able to obtain the lower frequency UMIs that are similar to a queried higher frequency UMI. After querying with each UMI as the higher frequency UMI, we can build the directed graph of UMIs and group UMIs through the directional adjacency algorithm. However, notice that in the end, each UMI can only belong to one single group. Therefore, we are actually wasting time explicitly building the UMI graph, because after adding a UMI to a group in the directional adjacency algorithm, we do not ever need to revisit that UMI again. All of the extra edges leading to that UMI, which we painstakingly calculated through multiple queries, are essentially useless.</p>
<p>So why do we even build the graph in the first place if we do not use most of it? Why not just make the UMI graph <em>implicit</em>, so we compute only the edges we need? This actually works, and it basically means that we remove UMIs from our BK-trees after each query. Therefore, UMIs that are added to a group in a previous query are marked as removed and skipped in later queries. We can keep track of subtrees in each BK-tree where all of the UMIs are removed, and skip entire subtrees to save time. Since the UMI graph is not explicitly constructed, we need to merge the graph construction and the UMI extraction/grouping steps of the directional adjacency algorithm together. This means that we essentially interleave grouping the UMIs and modifying the BK-trees that represent the implicit UMI graph.</p>
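<p>Here is one way the interleaving could look in Python, using a single BK-tree for brevity. This is a sketch under several assumptions: the names are hypothetical, the frequency rule <code>freq[v] &lt;= (freq[u] + 1) // 2</code> is taken from the UMI-tools description of directional adjacency, and a removed UMI is modelled by setting its frequency to infinity so that the subtree minimum doubles as the “everything here is removed” marker.</p>

```python
import math
from collections import Counter


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


class Node:
    def __init__(self, umi, freq):
        self.umi, self.freq = umi, freq
        self.min_freq = freq  # min frequency of the non-removed UMIs in this subtree
        self.children = {}


def insert(root, umi, freq):
    node = root
    while True:
        node.min_freq = min(node.min_freq, freq)
        d = hamming(umi, node.umi)
        child = node.children.get(d)
        if child is None:
            node.children[d] = Node(umi, freq)
            return
        node = child


def extract(node, umi, k, max_freq, out):
    """Remove and collect live UMIs within k of `umi` with frequency <= max_freq."""
    if node.min_freq > max_freq:
        return  # subtree is entirely removed or entirely too frequent
    d = hamming(umi, node.umi)
    if d <= k and node.freq <= max_freq:
        out.append((node.umi, node.freq))
        node.freq = math.inf  # mark as removed
    for shell, child in node.children.items():
        if d - k <= shell <= d + k:
            extract(child, umi, k, max_freq, out)
    # Refresh the subtree minimum after any removals below this node.
    node.min_freq = min([node.freq] + [c.min_freq for c in node.children.values()])


def group_umis(umis, k=1):
    freq = Counter(umis)
    unique = list(freq)
    root = Node(unique[0], freq[unique[0]])
    for u in unique[1:]:
        insert(root, u, freq[u])

    groups = []
    for seed in sorted(freq, key=lambda u: (-freq[u], u)):
        group = []
        extract(root, seed, 0, freq[seed], group)  # pull out the seed itself
        if not group:
            continue  # seed was already absorbed by an earlier group
        stack = list(group)
        while stack:
            cur, f = stack.pop()
            found = []
            extract(root, cur, k, (f + 1) // 2, found)  # only the edges we need
            group.extend(found)
            stack.extend(found)
        groups.append([name for name, _ in group])
    return groups
```

<p>No edge list is ever stored: every query both discovers the neighbours of a UMI and deletes them from the tree, so later queries never see them again.</p>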
<h2 id="trick-5-special-encoding">Trick 5: special encoding</h2>
<p>With two strings of nucleotides (A, T, C, or G) of length <script type="math/tex">M</script>, we can compute the Hamming distance (how we measure similarity) in exactly <script type="math/tex">O(M)</script> time. At first it seems unlikely, but <em>can we do better</em>? The answer is actually yes!</p>
<p>Usually, the encoding method for nucleotides is mapping them to binary: 00, 01, 10, and 11. This is the best we can do if we optimize for size. The problem with this encoding is that we cannot easily find the Hamming distance for two arbitrary strings encoded with them. If we optimize for the speed of computing Hamming distance, we can actually get a different set of encodings: 011, 110, 101, 000. The special property of this set of encodings is that the bitwise Hamming distance between each pair of encodings is exactly 2 (try it; count the number of different bits between each pair). This means that we can easily infer the Hamming distance between two strings of nucleotides when they are encoded with that encoding by calculating the bitwise Hamming distance and dividing it by 2.</p>
<p>The reason why we can get faster than <script type="math/tex">O(M)</script> Hamming distance computation is because we can pack a bunch of nucleotides into a single 64-bit computer word. In fact, since computing the bitwise Hamming distance is constant-time (XOR + POPCOUNT operations), we can compute the Hamming distance between two 21-nucleotide strings (21 nucleotides at 3 bits each take 63 bits) in constant-time! Note that encoding a nucleotide string still takes <script type="math/tex">O(M)</script> time so this is only beneficial if multiple Hamming comparisons are made. By packing a bunch of nucleotides into one computer word, we can also make hashing and other comparison operations constant-time.</p>
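<p>The encoding trick can be sketched in a few lines of Python. Which nucleotide gets which 3-bit code is an arbitrary assignment made for this example, and the function names are hypothetical. Python integers are arbitrary precision, so nothing here enforces the 64-bit limit; in a language with fixed-width words, the same XOR and popcount would be single machine operations.</p>

```python
# 3-bit codes chosen so that any two distinct codes differ in exactly 2 bits.
# The letter-to-code assignment below is an arbitrary choice for illustration.
CODES = {"A": 0b011, "T": 0b110, "C": 0b101, "G": 0b000}


def encode(umi: str) -> int:
    """Pack a nucleotide string into one integer, 3 bits per nucleotide."""
    word = 0
    for base in umi:
        word = (word << 3) | CODES[base]
    return word


def packed_hamming(x: int, y: int) -> int:
    """Hamming distance between two encoded strings of the same length."""
    # XOR leaves exactly two set bits per mismatching nucleotide,
    # so the popcount is twice the number of mismatches.
    return bin(x ^ y).count("1") // 2
```

<p>For example, <code>packed_hamming(encode("ACGT"), encode("ACGA"))</code> compares only the integers and never looks at the individual characters again.</p>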
<h2 id="implementation-trick-reducing-copies">Implementation trick: reducing copies</h2>
<p>When we split UMIs into n-grams, we can avoid copying the pieces of the UMI multiple times by using views on the UMIs. A view basically represents a contiguous segment of a UMI string with only the start and end locations of the segment, which is backed by a reference to the original UMI. We can also cache the hash for each view so we do not need to recalculate it.</p>
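<p>A view with a cached hash might be sketched like this in Python. The class name <code>UmiView</code> and the simple polynomial hash are assumptions made for this example. The start position is part of both equality and the hash, because a fingerprint is an n-gram together with its location. In Python the savings are modest, since the goal here is only to illustrate the idea.</p>

```python
class UmiView:
    """A read-only window [start, end) onto a UMI string, without copying it."""

    __slots__ = ("umi", "start", "end", "_hash")

    def __init__(self, umi: str, start: int, end: int):
        self.umi, self.start, self.end = umi, start, end
        self._hash = None  # computed lazily, then reused

    def __len__(self):
        return self.end - self.start

    def __hash__(self):
        if self._hash is None:
            h = self.start  # the location is part of the fingerprint
            for i in range(self.start, self.end):
                h = (h * 31 + ord(self.umi[i])) & 0xFFFFFFFF
            self._hash = h
        return self._hash

    def __eq__(self, other):
        return (
            isinstance(other, UmiView)
            and self.start == other.start
            and len(self) == len(other)
            and all(
                self.umi[self.start + i] == other.umi[other.start + i]
                for i in range(len(self))
            )
        )
```

<p>Because the class defines both <code>__hash__</code> and <code>__eq__</code>, views can be used directly as dictionary keys in the fingerprint table.</p>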
<h1 id="performance-in-practice">Performance in practice</h1>
<p>It is important to remember that the tricks for speeding up UMI deduplication may <em>degrade</em> performance compared to other algorithms on very small datasets. This is completely expected, since there are overheads associated with using those tricks.</p>
<p>First, let us see how fast the n-grams BK-trees method runs compared to other methods as the number of unique UMIs increases:</p>
<p><img src="../assets/umi_increase_run_time.png" alt="" width="700px" /></p>
<p>The n-grams BK-trees method is able to make use of the benefits of both the n-grams and the BK-tree methods, as it performs better than either method does individually.</p>
<p>If we increase the length of the UMIs, then we see that it ties with the n-grams method:</p>
<p><img src="../assets/umi_length_increase_run_time.png" alt="" width="500px" /></p>
<p>If we increase the number of errors allowed, then the n-grams BK-trees method scales much more favorably compared to other methods:</p>
<p><img src="../assets/umi_edits_increase_run_time.png" alt="" width="500px" /></p>
<p>Note that in all three experiments, we use simulated (randomly generated) datasets. In practice, when there are fewer UMIs at a single alignment coordinate, the n-grams BK-trees method does not result in such dramatic gains in performance.</p>
<p>For a little more information about the n-grams BK-tree data structure, we can look at some statistics about the n-grams with more than 160,000 UMIs:</p>
<p><img src="../assets/ngrams_stats.png" alt="" width="300px" /></p>
<p>We can see that the n-grams method is able to prune a significant portion of the UMIs. The largest BK-tree is only built on around 140 UMIs.</p>
<h1 id="conclusion">Conclusion</h1>
<p>The n-grams BK-trees method is essentially a bag of all sorts of tricks for speeding up the UMI deduplication task. However, in general, finding similar strings to a queried string is very useful in a variety of applications. It is vital in natural language processing and bioinformatics tasks that involve clustering and grouping similar strings. Perhaps the insights and tricks behind the n-grams BK-trees method can be applied to accurately finding similar strings under the Hamming distance metric for other tasks.</p>
Thu, 08 Aug 2019 00:00:00 +0000
https://blog.liudaniel.com//n-grams-BK-trees
https://blog.liudaniel.com/n-grams-BK-trees