1 Introduction
Data obtained with modern 3D sensors such as laser scanners is predominantly in the irregular format of point clouds or meshes. Analysis of point clouds has several useful applications such as robot manipulation and autonomous driving. In this work, we aim to develop a new neural network architecture for point cloud processing.
A point cloud consists of a sparse and unordered set of 3D points. These properties of point clouds make it difficult to use traditional convolutional neural network (CNN) architectures for point cloud processing. As a result, existing approaches that directly operate on point clouds are dominated by handcrafted features. One way to use CNNs for point clouds is by first preprocessing a given point cloud in a form that is amenable to standard spatial convolutions. Following this route, most deep architectures for 3D point cloud analysis require preprocessing of irregular point clouds into either voxel representations (e.g., [45, 37, 44]) or 2D images by view projection (e.g., [41, 34, 24, 9]). This is due to the ease of implementing convolution operations on regular 2D or 3D grids. However, transforming point cloud representation to either 2D images or 3D voxels would often result in artifacts and, more importantly, a loss in some natural invariances present in point clouds.
Recently, a few network architectures [33, 35, 48] have been developed to directly work on point clouds. One of the main drawbacks of these architectures is that they do not allow a flexible specification of the extent of spatial connectivity across points (filter neighborhood). Both [33] and [35] use max-pooling to aggregate information across points either globally [33] or in a hierarchical manner [35]. This pooling aggregation may lose surface information because the spatial layouts of points are not explicitly considered. It is desirable to capture spatial relationships in point clouds through more general convolution operations while being able to specify filter extents in a flexible manner.
In this work, we propose a generic and flexible neural network architecture for processing point clouds that alleviates some of the aforementioned issues with existing deep architectures. Our key observation is that the bilateral convolution layers (BCLs) proposed in [22, 25] have several favorable properties for point cloud processing. BCL provides a systematic way of filtering unordered points while enabling flexible specifications of the underlying lattice structure on which the convolution operates. BCL smoothly maps input points onto a sparse lattice, performs convolutions on the sparse lattice and then smoothly interpolates the filtered signal back onto the original input points. With BCLs as building blocks, we propose a new neural network architecture, which we refer to as SPLATNet (SParse LATtice Networks), that does hierarchical and spatially-aware feature learning for unordered points. SPLATNet has several advantages for point cloud processing:

SPLATNet takes the point cloud as input and does not require any preprocessing to voxels or images.

SPLATNet allows an easy specification of filter neighborhood as in standard CNN architectures.

With the use of hash tables, our network can efficiently deal with sparsity in the input point cloud by convolving only at locations where data is present.

SPLATNet computes hierarchical and spatially-aware features of an input point cloud with sparse and efficient lattice filters.

In addition, our network architecture allows an easy mapping of 2D points into 3D space and vice-versa. Following this, we propose a joint 2D-3D deep architecture that processes both the multi-view 2D images and the corresponding 3D point cloud in a single forward pass while being end-to-end learnable.
The inputs and outputs of two versions of the proposed network, SPLATNet$_{\text{3D}}$ and SPLATNet$_{\text{2D-3D}}$, are depicted in Figure 1. We demonstrate the above advantages with experiments on point cloud segmentation. Experiments on both RueMonge2014 facade segmentation [38] and ShapeNet part segmentation [46] demonstrate the superior performance of our technique compared to state-of-the-art techniques, while being computationally efficient.
2 Related Work
Below we briefly review existing deep learning approaches for 3D shape processing and explain differences with our work.
Multi-view and voxel networks.
Multi-view networks preprocess shapes into a set of 2D rendered images encoding surface depth and normals under various 2D projections [41, 34, 3, 24, 9, 20]. These networks take advantage of high resolution in the input rendered images and transfer learning through fine-tuning of 2D pre-trained image-based architectures. On the other hand, 2D projections can cause surface information loss due to self-occlusions, while viewpoint selection is often performed through heuristics that are not necessarily optimal for a given task.
Voxel-based methods convert the input 3D shape representation into a 3D volumetric grid. Early voxel-based architectures executed convolution in regular, fixed voxel grids, and were limited to low shape resolutions due to high memory and computation costs [45, 30, 34, 6, 15, 39]. Instead of using fixed grids, more recent approaches preprocess the input shapes into adaptively subdivided, hierarchical grids with denser cells placed near the surface [37, 36, 27, 44, 42]. As a result, they have much lower computational and memory overhead. On the other hand, convolutions are often still executed away from the surface, where most of the shape information resides. An alternative approach is to constrain the execution of volumetric convolutions only along the input sparse set of active voxels of the grid [16]. Our approach generalizes this idea to high-dimensional permutohedral lattice convolutions. In contrast to previous work, we do not require preprocessing points into voxels that may cause discretization artifacts and surface information loss. We smoothly map the input surface signal to our sparse lattice, perform convolutions over this lattice, and smoothly interpolate the filter responses back to the input surface. In addition, our architecture can easily incorporate feature representations originating from both 3D point clouds and rendered images within the same lattice, getting the best of both worlds.
Point cloud networks.
Qi et al. [33] pioneered another type of deep networks having the advantage of directly operating on point clouds. The networks learn spatial feature representations for each input point, then the point features are aggregated across the whole point set [33], or hierarchical surface regions [35] through max-pooling. This aggregation may lose surface information since the spatial layout of points is not explicitly considered. In our case, the input points are mapped to a sparse lattice where convolution can be efficiently formulated and spatial relationships in the input data can be effectively captured through flexible filters.
Non-Euclidean networks.
An alternative approach is to represent the input surface as a graph (e.g., a polygon mesh or point-based connectivity graph), convert the graph into its spectral representation, then perform convolution in the spectral domain [8, 19, 11, 4]. However, structurally different shapes tend to have largely different spectral bases, and thus lead to poor generalization. Yi et al. [47] proposed aligning shape basis functions through a spectral transformer, which, however, requires a robust initialization scheme. Another class of methods embeds the input shapes into 2D parametric domains and then executes convolutions within these domains [40, 28, 13]. However, these embeddings can suffer from spatial distortions or require topologically consistent input shapes. Other methods parameterize the surface into local patches and execute surface-based convolution within these patches [29, 5, 31]. Such non-Euclidean networks have the advantage of being invariant to surface deformations, yet this invariance might not always be desirable in man-made object segmentation and classification tasks where large deformations may change the underlying shape or part functionalities and semantics. We refer to Bronstein et al. [7] for an excellent review of spectral, patch and graph-based methods.
Joint 2D-3D networks.
FusionNet [18] combines shape classification scores from a volumetric and a multi-view network, yet this fusion happens at a late stage, after the final fully connected layer of these networks, and does not jointly consider their intermediate local and global feature representations. In our case, the 2D and 3D feature representations are mapped onto the same lattice, enabling end-to-end learning from both types of input representations.
3 Bilateral Convolution Layer
In this section, we briefly review the Bilateral Convolution Layer (BCL) that forms the basic building block of our SPLATNet architecture for point clouds. BCL provides a way to incorporate sparse high-dimensional filtering inside neural networks. In [22, 25], BCL was proposed as a learnable generalization of bilateral filtering [43, 2], hence the name ‘Bilateral Convolution Layer’. Bilateral filtering involves a projection of a given 2D image into a higher-dimensional space (e.g., space defined by position and color) and is traditionally limited to hand-designed filter kernels. BCL provides a way to learn filter kernels in high-dimensional spaces for bilateral filtering. BCL is also shown to be useful for information propagation across video frames [21]. We observe that BCL has several favorable properties to filter data that is inherently sparse and high-dimensional, like point clouds. Here, we briefly describe how a BCL works and then discuss its properties.
3.1 Inputs to BCL
Let $F \in \mathbb{R}^{n \times d_f}$ be the given input features to a BCL, where $n$ denotes the number of input points and $d_f$ denotes the dimensionality of input features at each point. For 3D point clouds, input features can be low-level features such as color, position, etc., and can also be high-level features such as features generated by a neural network.
One of the interesting characteristics of BCL is that it allows a flexible specification of the lattice space in which the convolution operates. This is specified as lattice features at each input point. Let $L \in \mathbb{R}^{n \times d_l}$ denote lattice features at input points with $d_l$ denoting the dimensionality of the feature space in which convolution operates. For instance, the lattice features can be point position and color ($XYZRGB$) that define a 6-dimensional filtering space for BCL. For standard 3D spatial filtering of point clouds, $L$ is given as the position ($XYZ$) of each point. Thus BCL takes input features $F$ and lattice features $L$ of input points and performs $d_l$-dimensional filtering of the points.
3.2 Processing steps in BCL
As illustrated in Figure 2, BCL has three processing steps, splat, convolve and slice, that work as follows.
Splat.
BCL first projects the input features $F$ onto the $d_l$-dimensional lattice defined by the lattice features $L$, via barycentric interpolation. Following [1], BCL uses a permutohedral lattice instead of a standard Euclidean grid for efficiency purposes. The size of lattice simplices or space between the grid points is controlled by scaling the lattice features $L\Lambda$, where $\Lambda$ is a diagonal $d_l \times d_l$ scaling matrix.
Convolve.
Once the input points are projected onto the $d_l$-dimensional lattice, BCL performs $d_l$-dimensional convolution on the splatted signal with learnable filter kernels. Just like in standard spatial CNNs, BCL allows an easy specification of filter neighborhood in the $d_l$-dimensional space.
Slice.
The filtered signal is then mapped back to the input points via barycentric interpolation. The resulting signal can be passed on to other BCLs for further processing. This step is called ‘slicing’. BCL allows slicing the filtered signal onto a different set of points other than the input points. This is achieved by specifying a different set of lattice features $L_{out}$ at output points of interest.
All the above three processing steps in BCL can be written as matrix multiplications:
$$\hat{F}_c = S_{slice} \, B_{conv} \, S_{splat} \, F_c \qquad (1)$$
where $F_c$ denotes the $c^{th}$ column/channel of the input feature $F$ and $\hat{F}_c$ denotes the corresponding filtered signal.
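The three steps can be illustrated with a small sketch on a 1D signal. This is a simplification under stated assumptions, not the permutohedral implementation used by BCL: a regular 1D grid with linear interpolation (the 1D case of barycentric interpolation) stands in for the lattice, a Python dict stands in for the sparse storage, and the function names `splat`, `convolve`, `slice_` and `bcl` are hypothetical.

```python
import math
from collections import defaultdict


def barycentric_1d(position, scale):
    """Two enclosing lattice vertices of a scaled position, with their weights."""
    x = position * scale
    lo = math.floor(x)
    w_hi = x - lo
    return [(lo, 1.0 - w_hi), (lo + 1, w_hi)]


def splat(features, positions, scale):
    """Accumulate point features onto the sparse set of lattice vertices."""
    lattice = defaultdict(float)
    for f, p in zip(features, positions):
        for vertex, w in barycentric_1d(p, scale):
            if w > 0.0:
                lattice[vertex] += w * f
    return dict(lattice)


def convolve(lattice, kernel):
    """Filter only at populated vertices; the kernel covers offsets -r..r."""
    r = len(kernel) // 2
    offsets = range(-r, r + 1)
    return {
        v: sum(k * lattice.get(v + o, 0.0) for o, k in zip(offsets, kernel))
        for v in lattice
    }


def slice_(lattice, positions, scale):
    """Interpolate the lattice signal onto (possibly different) output points."""
    return [
        sum(w * lattice.get(v, 0.0) for v, w in barycentric_1d(p, scale))
        for p in positions
    ]


def bcl(features, in_positions, out_positions, scale, kernel):
    """Splat, convolve, slice for one feature channel."""
    splatted = splat(features, in_positions, scale)
    filtered = convolve(splatted, kernel)
    return slice_(filtered, out_positions, scale)
```

Each function is linear in the signal, which is why the whole layer can be written as a product of three matrices applied to one feature channel. Passing different `out_positions` corresponds to slicing onto a different set of points.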
3.3 Properties of BCL
There are several properties of BCL that make it particularly convenient for point cloud processing. Here, we mention some of those properties:

The input points to BCL need not be ordered or lie on a grid as they are projected onto a $d_l$-dimensional grid defined by lattice features $L_{in}$.

The input and output points can be different for BCL with the specification of different input and output lattice features $L_{in}$ and $L_{out}$.

Since BCL allows separate specifications of input and lattice features, input signals can be projected into a different dimensional space for filtering. For instance, a 2D image can be projected into 3D space for filtering.

Just like in standard spatial convolutions, BCL allows an easy specification of filter neighborhood.

Since a signal is usually sparse in high dimensions, BCL uses hash tables to index the populated vertices and does convolutions only at those locations. This helps in efficient processing of sparse inputs.
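The sparsity argument in the last property can be made concrete with a small sketch. It assumes cubic cells on a regular grid rather than permutohedral simplices, uses a Python dict as the hash table, and the name `build_cell_table` and the sample points are hypothetical.

```python
import math


def build_cell_table(points, scale):
    """Map each populated integer cell to the indices of the points inside it."""
    table = {}
    for index, point in enumerate(points):
        key = tuple(math.floor(c * scale) for c in point)
        table.setdefault(key, []).append(index)
    return table


# Three points inside a unit cube, discretized at 4 cells per axis.
points = [(0.10, 0.10, 0.10), (0.12, 0.11, 0.10), (0.90, 0.90, 0.90)]
table = build_cell_table(points, scale=4)

populated = len(table)   # cells that actually hold data
dense = 4 ** 3           # cells a dense grid would allocate
```

Only the populated keys are stored and visited, so the cost of filtering scales with the number of occupied cells instead of the full grid volume.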
4 SPLATNet$_{\text{3D}}$ for Point Cloud Processing
We first introduce SPLATNet$_{\text{3D}}$, an instantiation of our proposed network architecture which operates directly on 3D point clouds and is readily applicable to many important 3D tasks. The input to SPLATNet$_{\text{3D}}$ is a 3D point cloud $P \in \mathbb{R}^{n \times d}$, where $n$ denotes the number of points and $d \geq 3$ denotes the number of feature dimensions including point locations $XYZ$. Additional features are often available either directly from 3D sensors or through preprocessing. These can be RGB color, surface normal, curvature, etc. at the input points. Note that input features $F$ of the first BCL and lattice features $L$ in the network each comprise a subset of the $d$ feature dimensions: $d_f \leq d$, $d_l \leq d$.
As output, SPLATNet$_{\text{3D}}$ produces per-point predictions. Tasks like 3D semantic segmentation and 3D object part labeling fit naturally under this framework. With simple techniques such as global pooling [33], SPLATNet$_{\text{3D}}$ can be modified to produce a single output vector and thus can be extended to other tasks such as classification.
Network architecture.
The architecture of SPLATNet$_{\text{3D}}$ is depicted in Figure 3. The network starts with a single $1 \times 1$ CONV layer followed by a series of BCLs. The $1 \times 1$ CONV layer processes each input point separately without any data aggregation. The functionality of BCLs is already explained in Section 3. For SPLATNet$_{\text{3D}}$, we use $T$ BCLs each operating on a 3D lattice ($d_l = 3$) constructed using 3D point locations $XYZ$ as lattice features, $L_{in} = L_{out} = XYZ$. We note that different BCLs can use different lattice scales $\Lambda$. Recall from Section 3 that $\Lambda$ is a diagonal matrix that controls the spacing between the grid points in the lattice. For BCLs in SPLATNet$_{\text{3D}}$, we use the same lattice scales along each of the $X$, $Y$ and $Z$ directions, i.e., $\Lambda = \lambda I_3$, where $\lambda$ is a scalar and $I_3$ denotes a $3 \times 3$ identity matrix. We start with an initial lattice scale $\Lambda_0$ for the first BCL and subsequently divide the lattice scale by a factor of 2 ($\Lambda_t = \Lambda_{t-1}/2$) for the next $T-1$ BCLs. In other words, SPLATNet$_{\text{3D}}$ with $T$ BCLs uses the following lattice scales: $(\Lambda_0, \Lambda_0/2, \ldots, \Lambda_0/2^{T-1})$. Lower lattice scales imply coarser lattices and larger receptive fields for the filters. Thus, in SPLATNet$_{\text{3D}}$, deeper BCLs have longer-range connectivity between input points compared to earlier layers. We will discuss more about the effects of different lattice spaces and their scales later. Like in standard CNNs, SPLATNet allows an easy specification of filter neighborhoods. For all the BCLs, we use filters operating on one-ring neighborhoods and refer to the supp. material for details on the number of filters per layer.
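The halving schedule described above can be written out as a short sketch. The function names and the example values (an initial scale of 64 and 5 BCLs) are illustrative assumptions, not settings taken from the text; the cell size is taken as the reciprocal of the scalar lattice scale, matching the statement that lower scales give coarser lattices.

```python
def lattice_scales(initial_scale, num_bcls):
    """Scalar lattice scale of each BCL: the initial scale, halved at every layer."""
    return [initial_scale / 2 ** t for t in range(num_bcls)]


def cell_sizes(scales):
    """Spacing between lattice points; halving the scale doubles the spacing."""
    return [1.0 / s for s in scales]


scales = lattice_scales(64.0, 5)   # hypothetical initial scale and depth
sizes = cell_sizes(scales)
```

With a fixed one-ring filter, each successive layer therefore covers twice the spatial extent of the previous one along every axis.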
The responses of the BCLs are concatenated and then passed through two additional CONV layers. Finally, a softmax layer produces point-wise class label probabilities. The concatenation operation aggregates information from BCLs operating at different lattice scales. Similar techniques of concatenating outputs from network layers at different depths have been useful in 2D CNNs [17]. All parameterized layers, except for the last CONV layer, are followed by ReLU and BatchNorm. More details about the network architecture are given in the supp. material.
Lattice spaces and their scales.
The use of BCLs in SPLATNet allows easy specifications of lattice spaces via lattice features $L$ and also lattice scales via a scaling matrix $\Lambda$.
Changing the lattice scales directly affects the resolution of the signal on which the convolution operates. This gives us direct control over the receptive fields of network layers. Figure 4 shows lattice cell visualizations for different lattice spaces and scales. Using a coarser lattice can increase the effective receptive field of a filter. Another way to increase the receptive field of a filter is by increasing its neighborhood size. But, in high dimensions, this will significantly increase the number of filter parameters. For instance, 3D filters of size on a regular Euclidean grid have parameters respectively. On the other hand, making the lattice coarser would not increase the number of filter parameters, leading to more computationally efficient network architectures.
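The growth in parameters can be checked with a few lines of arithmetic. The filter sizes 3, 5 and 7 below are illustrative assumptions chosen for the sketch, and `dense_filter_parameters` is a hypothetical helper; it counts weights per input-output channel pair for a dense filter on a regular Euclidean grid.

```python
def dense_filter_parameters(size, dims):
    """Number of weights in a dense filter with `size` taps along each of `dims` axes."""
    return size ** dims


counts_3d = [dense_filter_parameters(k, 3) for k in (3, 5, 7)]
counts_6d = [dense_filter_parameters(k, 6) for k in (3, 5, 7)]
```

Enlarging the neighborhood grows the count polynomially in the filter size, with the exponent equal to the lattice dimension, whereas halving the lattice scale enlarges the receptive field while leaving the count unchanged.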
We observe that it is beneficial to use finer lattices (larger lattice scales) earlier in the network, and then coarser lattices (smaller lattice scales) going deeper. This is consistent with the common knowledge in 2D CNNs: increasing receptive field gradually through the network can help build hierarchical representations with varying spatial extents and abstraction levels.
Although we mainly experiment with $XYZ$ lattices in this work, BCL allows for other lattice spaces such as position and color space ($XYZRGB$) or normal space. Using different lattice spaces enforces different connectivity across input points that may be beneficial to the task. In one of the experiments, we experimented with a variant of SPLATNet$_{\text{3D}}$, where we add an extra BCL with position and normal lattice features ($XYZ n_x n_y n_z$) and observed minor performance improvements.
5 Joint 2D-3D Processing with SPLATNet$_{\text{2D-3D}}$
Oftentimes, 3D point clouds are accompanied by 2D images of the same target. For instance, many modern 3D sensors capture RGBD streams and perform 3D reconstruction to obtain 3D point clouds, resulting in both 2D images and point clouds of a scene together with point correspondences between 2D and 3D. One could also easily sample point clouds along with 2D renderings from a given 3D mesh. When such aligned 2D-3D data is present, SPLATNet provides an extremely flexible framework for joint processing. We propose SPLATNet$_{\text{2D-3D}}$, another SPLATNet instantiation designed for such joint processing.
The network architecture of SPLATNet$_{\text{2D-3D}}$ is depicted in the green box of Figure 3. SPLATNet$_{\text{2D-3D}}$ encompasses SPLATNet$_{\text{3D}}$ as one of its components and adds extra computational modules for joint 2D-3D processing. Next, we explain each of these extra components of SPLATNet$_{\text{2D-3D}}$, in the order of their computations.
CNN$_1$.
BCL$_{2D \to 3D}$.
CNN$_1$ outputs features of the image pixels, whose 3D locations often do not exactly correspond to points in the 3D point cloud. We project information from the pixels onto the point cloud using a BCL with only splat and slice operations. As mentioned in Section 3, one of the interesting properties of BCL is that it allows for different input and output points by separate specifications of input and output lattice features, $L_{in}$ and $L_{out}$. Using this property, we use BCL to splat 2D features onto the 3D lattice space and then slice the 3D splatted signal on the point cloud. We refer to this BCL, without a convolution operation, as BCL$_{2D \to 3D}$ as illustrated in Figure 5. Specifically, we use 3D locations of the image pixels as input lattice features, $L_{in} = L_{2D} \in \mathbb{R}^{m \times 3}$, where $m$ denotes the number of input image pixels. In addition, we use 3D locations of points in the point cloud as output lattice features, $L_{out} = L_{3D} \in \mathbb{R}^{n \times 3}$, which are the same lattice features used in SPLATNet$_{\text{3D}}$. The lattice scale, $\Lambda_a$, controls the smoothness of the projection and can be adjusted according to the sparsity of the point cloud.
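A splat-then-slice projection with different input and output locations can be sketched as follows. This is a deliberately simplified stand-in, not the layer described above: it assigns each location to its nearest vertex on a regular grid instead of using barycentric weights on a permutohedral lattice, it averages the features that land on the same vertex (a normalization chosen for this sketch), and the names `cell_key` and `project_2d_to_3d` are hypothetical.

```python
import math


def cell_key(xyz, scale):
    """Nearest lattice vertex of a scaled 3D location (regular grid assumption)."""
    return tuple(math.floor(c * scale + 0.5) for c in xyz)


def project_2d_to_3d(pixel_features, pixel_xyz, point_xyz, scale):
    """Carry per-pixel features to 3D points through a shared sparse lattice."""
    sums, counts = {}, {}
    # Splat: accumulate pixel features at the vertex nearest to each pixel's 3D location.
    for feat, xyz in zip(pixel_features, pixel_xyz):
        key = cell_key(xyz, scale)
        if key not in sums:
            sums[key] = [0.0] * len(feat)
            counts[key] = 0
        sums[key] = [s + f for s, f in zip(sums[key], feat)]
        counts[key] += 1
    # Slice: read the averaged signal at the vertex nearest to each output point.
    channels = len(pixel_features[0])
    output = []
    for xyz in point_xyz:
        key = cell_key(xyz, scale)
        if key in sums:
            output.append([s / counts[key] for s in sums[key]])
        else:
            output.append([0.0] * channels)  # no pixel landed near this point
    return output
```

A smaller `scale` merges more pixels into each vertex and gives a smoother projection, which mirrors the role of the lattice scale described in the text. Swapping the two sets of locations gives the reverse, 3D-to-2D projection.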
2D-3D Fusion.
At this point, we have the result of CNN$_1$ projected onto 3D points and also the intermediate features from SPLATNet$_{\text{3D}}$ that exclusively operates on the input point cloud. Since both of these signals are embedded in the same 3D space, we concatenate these two signals and then use a series of CONV layers for further processing. The output of the ‘2D-3D Fusion’ module is passed on to a softmax layer to compute class probabilities at each input point of the point cloud.
BCL$_{3D \to 2D}$.
Sometimes, we are also interested in segmenting 2D images and want to leverage relevant 3D information for better 2D segmentation. For this purpose, we back-project the 3D features computed by the ‘2D-3D Fusion’ module onto the 2D images by a BCL$_{3D \to 2D}$ module. This is the reverse operation of BCL$_{2D \to 3D}$, where the input and output lattice features are swapped. Similarly, a hyperparameter $\Lambda_b$ controls the smoothness of the projection.
CNN$_2$.
We then concatenate the output from CNN$_1$, input images and the output of BCL$_{3D \to 2D}$, and pass them through another 2D CNN, CNN$_2$, to obtain refined 2D semantic predictions. In our experiments, we find that a simple 2-layered network is good enough for this purpose.
All components in this 2D-3D joint processing framework are differentiable, and can be trained end-to-end. Depending on the availability of 2D or 3D ground-truth labels, loss functions can be defined on either one of the two domains, or on both domains in a multi-task learning setting. More details of the network architecture are provided in the supp. material. We believe that this joint processing capability offered by SPLATNet$_{\text{2D-3D}}$ can result in better predictions for both 2D images and 3D point clouds. For 2D images, leveraging 3D features helps in view-consistent predictions across multiple viewpoints. For point clouds, incorporating 2D CNNs helps leverage powerful 2D deep CNN features computed on high-resolution images.
6 Experiments
We evaluate SPLATNet on two benchmark datasets, RueMonge2014 [38] and ShapeNet [46]. On RueMonge2014, we conducted experiments on the tasks of 3D point cloud labeling and multi-view image labeling. On ShapeNet, we evaluated SPLATNet on 3D part segmentation. We use the Caffe [23] neural network framework for all the experiments. Full code and trained models are publicly available on our project website (http://viswww.cs.umass.edu/splatnet).
6.1 RueMonge2014 facade segmentation
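Table 1: Results on RueMonge2014 facade segmentation. Top: point cloud labeling. Bottom: multi-view image labeling.

| Method | Average IoU | Runtime (min) |
| With only 3D data | | |
| OctNet [37] | 59.2 | - |
| Autocontext [14] | 54.4 | 16 |
| SPLATNet$_{\text{3D}}$ (Ours) | 65.4 | 0.06 |
| With both 2D and 3D data | | |
| Autocontext [14] | 62.9 | 87 |
| SPLATNet$_{\text{2D-3D}}$ (Ours) | 69.8 | 1.20 |

| Method | Average IoU | Runtime (min) |
| Autocontext [14] | 60.5 | 117 |
| Autocontext [14] | 62.7 | 146 |
| DeepLab [10] | 69.3 | 0.84 |
| SPLATNet$_{\text{2D-3D}}$ (Ours) | 70.6 | 4.34 |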
Here, the task is to assign a semantic label to every point in a point cloud and/or corresponding multi-view 2D images.
Dataset.
RueMonge2014 [38] provides a standard benchmark for 2D and 3D facade segmentation and also inverse procedural modeling. The dataset consists of 428 high-resolution and multi-view images obtained from a street in Paris. A point cloud with approximately 1M points is reconstructed using the multi-view images. A ground-truth labeling with seven semantic classes of door, shop, balcony, window, wall, sky and roof is provided for both 2D images and the point cloud. Sample point cloud sections and 2D images with their corresponding ground truths are shown in Figures 6 and 7 respectively. For evaluation, Intersection over Union (IoU) score is computed for each of the seven classes and then averaged to get a single overall IoU.
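The evaluation metric can be sketched in a few lines. This is a minimal illustration, assuming integer class labels per point (or pixel); the function name `average_iou` is hypothetical, and skipping classes that appear in neither prediction nor ground truth is a choice made for this sketch rather than something the benchmark is stated to do.

```python
def average_iou(predicted, ground_truth, num_classes):
    """Per-class Intersection over Union, averaged over the classes that occur."""
    ious = []
    for c in range(num_classes):
        pairs = list(zip(predicted, ground_truth))
        intersection = sum(1 for p, g in pairs if p == c and g == c)
        union = sum(1 for p, g in pairs if p == c or g == c)
        if union > 0:
            ious.append(intersection / union)
    return sum(ious) / len(ious)


# Toy example with two classes and four labeled points.
score = average_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```

In the toy example, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the average is 7/12.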
Point cloud labeling.
We use our SPLATNet$_{\text{3D}}$ architecture for the task of point cloud labeling on this dataset. We use 5 BCLs followed by a couple of CONV layers. Input features to the network comprise a 7-dimensional vector at each point representing RGB color, normal and height above the ground. For all the BCLs, we use lattice space () with . Experimental results with average IoU and runtime are shown in Table 1. Results show that, with only 3D data, our method achieves an IoU of 65.4, which is a considerable improvement (+6.2 IoU) over the state-of-the-art deep network, OctNet [37].
Since this dataset comes with multi-view 2D images, one could leverage the information present in 2D data for better point cloud labeling. We use SPLATNet$_{\text{2D-3D}}$ to leverage 2D information and obtain better 3D segmentations. Table 1 shows the experimental results when using both the 2D and 3D data as input. SPLATNet$_{\text{2D-3D}}$ obtains an average IoU of 69.8, outperforming the previous state-of-the-art by a large margin (+6.9 IoU), thereby setting up a new state-of-the-art on this dataset. This is also a significant improvement from the IoU obtained with SPLATNet$_{\text{3D}}$, demonstrating the benefit of leveraging 2D and 3D information in a joint framework. Runtimes in Table 1 also indicate that our SPLATNet approach is much faster compared to traditional Autocontext techniques. Sample visual results for 3D facade labeling are shown in Figure 6.
Multi-view image labeling.
As illustrated in Section 5, we extend 2D CNNs with SPLATNet$_{\text{2D-3D}}$ to obtain better multi-view image segmentation. Table 1 shows the results of multi-view image labeling on this dataset using different techniques. Using DeepLab (CNN$_1$) already outperforms existing state-of-the-art by a large margin. Leveraging 3D information via SPLATNet$_{\text{2D-3D}}$ boosts the performance to 70.6 IoU. An increase of 1.3 IoU from only using CNN$_1$ demonstrates the potential of our joint 2D-3D framework in leveraging 3D information for better 2D segmentation.
6.2 ShapeNet part segmentation
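Table 2: Results on ShapeNet part segmentation (class average mIoU, instance average mIoU, and mIoU for each category).

| Method | class avg. | instance avg. | airplane | bag | cap | car | chair | earphone | guitar | knife | lamp | laptop | motorbike | mug | pistol | rocket | skateboard | table |
| #instances | | | 2690 | 76 | 55 | 898 | 3758 | 69 | 787 | 392 | 1547 | 451 | 202 | 184 | 283 | 66 | 152 | 5271 |
| Yi et al. [46] | 79.0 | 81.4 | 81.0 | 78.4 | 77.7 | 75.7 | 87.6 | 61.9 | 92.0 | 85.4 | 82.5 | 95.7 | 70.6 | 91.9 | 85.9 | 53.1 | 69.8 | 75.3 |
| 3DCNN [33] | 74.9 | 79.4 | 75.1 | 72.8 | 73.3 | 70.0 | 87.2 | 63.5 | 88.4 | 79.6 | 74.4 | 93.9 | 58.7 | 91.8 | 76.4 | 51.2 | 65.3 | 77.1 |
| Kd-network [27] | 77.4 | 82.3 | 80.1 | 74.6 | 74.3 | 70.3 | 88.6 | 73.5 | 90.2 | 87.2 | 81.0 | 94.9 | 57.4 | 86.7 | 78.1 | 51.8 | 69.9 | 80.3 |
| PointNet [33] | 80.4 | 83.7 | 83.4 | 78.7 | 82.5 | 74.9 | 89.6 | 73.0 | 91.5 | 85.9 | 80.8 | 95.3 | 65.2 | 93.0 | 81.2 | 57.9 | 72.8 | 80.6 |
| PointNet++ [35] | 81.9 | 85.1 | 82.4 | 79.0 | 87.7 | 77.3 | 90.8 | 71.8 | 91.0 | 85.9 | 83.7 | 95.3 | 71.6 | 94.1 | 81.3 | 58.7 | 76.4 | 82.6 |
| SyncSpecCNN [47] | 82.0 | 84.7 | 81.6 | 81.7 | 81.9 | 75.2 | 90.2 | 74.9 | 93.0 | 86.1 | 84.7 | 95.6 | 66.7 | 92.7 | 81.6 | 60.6 | 82.9 | 82.1 |
| SPLATNet$_{\text{3D}}$ | 82.0 | 84.6 | 81.9 | 83.9 | 88.6 | 79.5 | 90.1 | 73.5 | 91.3 | 84.7 | 84.5 | 96.3 | 69.7 | 95.0 | 81.7 | 59.2 | 70.4 | 81.3 |
| SPLATNet$_{\text{2D-3D}}$ | 83.7 | 85.4 | 83.2 | 84.3 | 89.1 | 80.3 | 90.7 | 75.5 | 92.1 | 87.1 | 83.9 | 96.3 | 75.6 | 95.8 | 83.8 | 64.0 | 75.5 | 81.8 |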
The task of part segmentation is to assign a part category label to each point in a point cloud representing a 3D object.
Dataset.
The ShapeNet Part dataset [46] is a subset of ShapeNet, which contains 16881 objects from 16 categories, each with 2 to 6 part labels. The objects are consistently aligned and scaled to fit into a unit cube, and the ground-truth annotations are provided on sampled points on the shape surfaces. It is common to assume that the category of the input 3D object is known, narrowing the possible part labels to the ones specific to the given object category. We report standard IoU scores for evaluation of part segmentation. An IoU score is computed for each object and then averaged within the objects in a category to compute mean IoU (mIoU) for each object category. In addition to reporting mIoU score for each object category, we also report ‘class average mIoU’, which is the average mIoU across all object categories, and also ‘instance average mIoU’, which is the average mIoU across all objects.
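The difference between the two averages is easy to see in a small sketch. The function name `summarize_miou` and the toy scores are hypothetical; the input is assumed to be a mapping from category name to the list of per-object IoU scores in that category.

```python
def summarize_miou(ious_by_category):
    """Return per-category mIoU, class average mIoU and instance average mIoU."""
    per_category = {
        category: sum(scores) / len(scores)
        for category, scores in ious_by_category.items()
    }
    # Class average: every category counts once, regardless of its size.
    class_average = sum(per_category.values()) / len(per_category)
    # Instance average: every object counts once, so large categories dominate.
    all_scores = [s for scores in ious_by_category.values() for s in scores]
    instance_average = sum(all_scores) / len(all_scores)
    return per_category, class_average, instance_average


per_category, class_average, instance_average = summarize_miou(
    {"chair": [0.9, 0.9, 0.9], "cap": [0.5]}
)
```

In this toy case the class average is 0.7 while the instance average is 0.8, because the larger category contributes three of the four objects. The same effect explains why categories with many instances, such as table and chair, weigh heavily in the instance average.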
3D part segmentation.
We evaluate both SPLATNet$_{\text{3D}}$ and SPLATNet$_{\text{2D-3D}}$ for this task. First, we discuss the architecture and results with SPLATNet$_{\text{3D}}$ that uses only 3D point clouds as input. Since the category of the input object is assumed to be known, we train separate networks for each object category. The SPLATNet$_{\text{3D}}$ network architecture for this task is also composed of 5 BCLs. Point locations are used as input features as well as lattice features for all the BCLs and the lattice scale for the first BCL layer is . Experimental results are shown in Table 2. SPLATNet$_{\text{3D}}$ obtains a class average mIoU of 82.0 and an instance average mIoU of 84.6, which is on par with the best networks that only take point clouds as input (PointNet++ [35] uses surface normals as additional inputs).
We also adopt our SPLATNet$_{\text{2D-3D}}$ network, which operates on both 2D and 3D data, for this task. For the joint framework to work, we need rendered 2D views and corresponding 3D locations for each pixel in the renderings. We first render 3-channel images: Phong shading [32], depth, and height from ground. Cameras are placed on the 20 vertices of a dodecahedron from a fixed distance, pointing towards the object’s center. The 2D-3D correspondences can be generated by carrying the coordinates of 3D points into the rendering rasterization pipeline so that each pixel also acquires coordinate values from the surface point projected onto it. Results in Table 2 show that incorporating 2D information allows SPLATNet$_{\text{2D-3D}}$ to improve noticeably over SPLATNet$_{\text{3D}}$, with 1.7 and 0.8 increase in class and instance average mIoU respectively. SPLATNet$_{\text{2D-3D}}$ obtains a class average mIoU of 83.7 and an instance average mIoU of 85.4, outperforming existing state-of-the-art approaches.
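The camera placement can be sketched with the standard coordinates of a regular dodecahedron. This is an illustration under assumptions: the object is taken to be centered at the origin, the camera distance is a free parameter (no value is given in the text), and the function name `dodecahedron_cameras` is hypothetical.

```python
import itertools
import math


def dodecahedron_cameras(distance):
    """Return 20 (position, viewing direction) pairs on a dodecahedron around the origin."""
    phi = (1.0 + math.sqrt(5.0)) / 2.0  # golden ratio
    # Eight cube vertices (+-1, +-1, +-1).
    vertices = [tuple(float(c) for c in v) for v in itertools.product((-1, 1), repeat=3)]
    # Twelve vertices from cyclic permutations of (0, +-1/phi, +-phi).
    for a, b in itertools.product((-1, 1), repeat=2):
        vertices.append((0.0, a / phi, b * phi))
        vertices.append((a / phi, b * phi, 0.0))
        vertices.append((a * phi, 0.0, b / phi))
    # All 20 vertices lie at radius sqrt(3); rescale them to the requested distance.
    factor = distance / math.sqrt(3.0)
    cameras = []
    for v in vertices:
        position = tuple(factor * c for c in v)
        direction = tuple(-c / distance for c in position)  # unit vector towards the origin
        cameras.append((position, direction))
    return cameras


cameras = dodecahedron_cameras(distance=2.0)
```

Because the vertices are evenly spread over a sphere, the 20 views cover the object from all sides without a per-shape viewpoint search.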
On one Nvidia GeForce GTX 1080 Ti, SPLATNet$_{\text{3D}}$ runs at shapes/sec, while SPLATNet$_{\text{2D-3D}}$ is slower at shapes/sec due to a relatively large 2D network operating on 20 high-resolution () views, which takes up more than of the computation time. In comparison, PointNet++ runs at shapes/sec on the same hardware. (Footnote 2: We use the public implementation released by the authors (https://github.com/charlesq34/pointnet2) with settings: , , (consistent with our experiments).)
Six-dimensional filtering.
We experiment with a variant of SPLATNet$_{\text{3D}}$ where an additional BCL with 6-dimensional position and normal lattice features ($XYZ n_x n_y n_z$) is added between the last two CONV layers. This modification gave only a marginal improvement of IoU over standard SPLATNet$_{\text{3D}}$ in terms of both class and instance average mIoU scores.
7 Conclusion
In this work, we propose the SPLATNet architecture for point cloud processing. SPLATNet directly takes point clouds as input and computes hierarchical and spatially-aware features with sparse and efficient lattice filters. In addition, SPLATNet allows an easy mapping of 2D information into 3D and vice-versa, resulting in a novel network architecture for joint processing of point clouds and multi-view images. Experiments on two different benchmark datasets show that the proposed networks compare favorably against state-of-the-art approaches for segmentation tasks. In the future, we would like to explore the use of additional input features (e.g., texture) and also the use of other high-dimensional lattice spaces in our networks.
Acknowledgements
Maji acknowledges support from NSF (Grant No. 1617917). Kalogerakis acknowledges support from NSF (Grant No. 1422441 and 1617333). Yang acknowledges support from NSF (Grant No. 1149783). We acknowledge the MassTech Collaborative grant for funding the UMass GPU cluster.
References
 [1] A. Adams, J. Baek, and M. A. Davis. Fast high-dimensional filtering using the permutohedral lattice. Computer Graphics Forum, 29(2):753–762, 2010.
 [2] V. Aurich and J. Weule. Nonlinear Gaussian filters performing edge preserving diffusion. In DAGM, pages 538–545. Springer, 1995.
 [3] S. Bai, X. Bai, Z. Zhou, Z. Zhang, and L. J. Latecki. GIFT: a real-time and scalable 3D shape search engine. In Proc. CVPR, 2016.
 [4] D. Boscaini, J. Masci, S. Melzi, M. M. Bronstein, U. Castellani, and P. Vandergheynst. Learning class-specific descriptors for deformable shapes using localized spectral convolutional networks. In Proc. SGP, 2015.
 [5] D. Boscaini, J. Masci, E. Rodolà, and M. M. Bronstein. Learning shape correspondence with anisotropic convolutional neural networks. In Proc. NIPS, 2016.
 [6] A. Brock, T. Lim, J. M. Ritchie, and N. Weston. Generative and discriminative voxel modeling with convolutional neural networks. arXiv:1608.04236, 2016.
 [7] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst. Geometric deep learning: Going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.
 [8] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun. Spectral networks and locally connected networks on graphs. In Proc. ICLR, 2014.
 [9] Z. Cao, Q. Huang, and K. Ramani. 3D object classification via spherical projections. In Proc. 3DV, 2017.
 [10] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In Proc. ICLR, 2015.
 [11] M. Defferrard, X. Bresson, and P. Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. arXiv:1606.09375, 2016.
 [12] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The Pascal Visual Object Classes Challenge: A retrospective. IJCV, 111(1):98–136, Jan. 2015.
 [13] D. Ezuz, J. Solomon, V. G. Kim, and M. Ben-Chen. GWCNN: A metric alignment layer for deep shape analysis. Computer Graphics Forum, 36(5), 2017.
 [14] R. Gadde, V. Jampani, R. Marlet, and P. Gehler. Efficient 2D and 3D facade segmentation using auto-context. PAMI, 2017.
 [15] A. Garcia-Garcia, F. Gomez-Donoso, J. G. Rodríguez, S. Orts, M. Cazorla, and J. A. López. PointNet: A 3D convolutional neural network for real-time object class recognition. In Proc. IJCNN, 2016.
 [16] B. Graham and L. van der Maaten. Submanifold sparse convolutional networks. arXiv:1706.01307, 2017.
 [17] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In Proc. CVPR, pages 447–456, 2015.
 [18] V. Hegde and R. Zadeh. FusionNet: 3D object classification using multiple data representations. arXiv:1607.05695, 2016.
 [19] M. Henaff, J. Bruna, and Y. LeCun. Deep convolutional networks on graph-structured data. arXiv:1506.05163, 2015.
 [20] H. Huang, E. Kalogerakis, S. Chaudhuri, D. Ceylan, V. Kim, and E. Yumer. Learning local shape descriptors with view-based convolutional neural networks. ACM Trans. Graph., 2018.
 [21] V. Jampani, R. Gadde, and P. V. Gehler. Video propagation networks. In Proc. CVPR, 2017.
 [22] V. Jampani, M. Kiefel, and P. V. Gehler. Learning sparse high dimensional filters: Image filtering, dense CRFs and bilateral neural networks. In Proc. CVPR, 2016.
 [23] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proc. ACM Multimedia, 2014.
 [24] E. Kalogerakis, M. Averkiou, S. Maji, and S. Chaudhuri. 3D shape segmentation with projective convolutional networks. In Proc. CVPR, 2017.
 [25] M. Kiefel, V. Jampani, and P. V. Gehler. Permutohedral lattice CNNs. In ICLR workshops, May 2015.
 [26] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.
 [27] R. Klokov and V. Lempitsky. Escape from cells: Deep Kd-Networks for the recognition of 3D point cloud models. In Proc. ICCV, 2017.
 [28] H. Maron, M. Galun, N. Aigerman, M. Trope, N. Dym, E. Yumer, V. G. Kim, and Y. Lipman. Convolutional neural networks on surfaces via seamless toric covers. ACM Trans. Graph., 36(4), 2017.
 [29] J. Masci, D. Boscaini, M. Bronstein, and P. Vandergheynst. Geodesic convolutional neural networks on Riemannian manifolds. In Proc. ICCV workshops, 2015.
 [30] D. Maturana and S. Scherer. 3D convolutional neural networks for landing zone detection from LiDAR. In Proc. ICRA, 2015.
 [31] F. Monti, D. Boscaini, J. Masci, E. Rodolà, J. Svoboda, and M. M. Bronstein. Geometric deep learning on graphs and manifolds using mixture model CNNs. In Proc. CVPR, 2017.
 [32] B. T. Phong. Illumination for computer generated pictures. Commun. ACM, 18(6), 1975.
 [33] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proc. CVPR, 2017.
 [34] C. R. Qi, H. Su, M. Nießner, A. Dai, M. Yan, and L. J. Guibas. Volumetric and multi-view CNNs for object classification on 3D data. In Proc. CVPR, 2016.
 [35] C. R. Qi, L. Yi, H. Su, and L. Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In Proc. NIPS, 2017.
 [36] G. Riegler, A. O. Ulusoy, H. Bischof, and A. Geiger. OctNetFusion: Learning depth fusion from data. In Proc. 3DV, 2017.
 [37] G. Riegler, A. O. Ulusoy, and A. Geiger. OctNet: Learning deep 3D representations at high resolutions. In Proc. CVPR, 2017.

 [38] H. Riemenschneider, A. Bódis-Szomorú, J. Weissenberg, and L. Van Gool. Learning where to classify in multi-view semantic segmentation. In Proc. ECCV, 2014.
 [39] N. Sedaghat, M. Zolfaghari, E. Amiri, and T. Brox. Orientation-boosted voxel nets for 3D object recognition. In Proc. BMVC, 2017.
 [40] A. Sinha, J. Bai, and K. Ramani. Deep learning 3D shape surfaces using geometry images. In Proc. ECCV, 2016.
 [41] H. Su, S. Maji, E. Kalogerakis, and E. G. Learned-Miller. Multi-view convolutional neural networks for 3D shape recognition. In Proc. ICCV, 2015.
 [42] M. Tatarchenko, A. Dosovitskiy, and T. Brox. Octree generating networks: Efficient convolutional architectures for high-resolution 3D outputs. In Proc. ICCV, 2017.
 [43] C. Tomasi and R. Manduchi. Bilateral filtering for gray and color images. In Proc. ICCV, 1998.
 [44] P.-S. Wang, Y. Liu, Y.-X. Guo, C.-Y. Sun, and X. Tong. O-CNN: Octree-based convolutional neural networks for 3D shape analysis. ACM Trans. Graph., 36(4), 2017.
 [45] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3D ShapeNets: A deep representation for volumetric shapes. In Proc. CVPR, 2015.
 [46] L. Yi, V. G. Kim, D. Ceylan, I. Shen, M. Yan, H. Su, A. Lu, Q. Huang, A. Sheffer, L. Guibas, et al. A scalable active framework for region annotation in 3D shape collections. ACM Trans. Graph., 35(6):210, 2016.
 [47] L. Yi, H. Su, X. Guo, and L. Guibas. SyncSpecCNN: Synchronized spectral CNN for 3D shape segmentation. In Proc. CVPR, 2017.
 [48] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhutdinov, and A. J. Smola. Deep sets. In Proc. NIPS, pages 3394–3404, 2017.
Appendix A Point Cloud Density Normalization
BCL has a normalization scheme to deal with uneven point density, or more specifically, the fact that some lattice vertices are supported by more data points than others. Input signals are filtered directly with the learnable filter kernels, and are also filtered in a separate second round in which their values are replaced by 1s and the filter is a Gaussian kernel. The filter responses from the second round are then used for normalizing the responses from the first round. This is similar to using homogeneous coordinates, which are widely adopted in bilateral filtering implementations such as [1].
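The two-round scheme can be illustrated with a small NumPy sketch. For simplicity it uses a regular 1D grid and nearest-vertex splatting instead of the permutohedral lattice, and the function name `normalized_filter` and its arguments are hypothetical. It is a sketch of the idea under these assumptions, not the implementation used in the paper.

```python
import numpy as np

def normalized_filter(positions, values, kernel, gauss, n_vertices):
    """Filter splatted values and normalize by a filtered all-ones signal.

    Assumption: a 1D grid with nearest-vertex splatting stands in for the
    permutohedral lattice used by BCL.
    """
    idx = np.clip(np.rint(positions).astype(int), 0, n_vertices - 1)
    signal = np.zeros(n_vertices)
    ones = np.zeros(n_vertices)
    np.add.at(signal, idx, values)  # splat the input signal onto vertices
    np.add.at(ones, idx, 1.0)       # splat 1s: how many points support each vertex
    response = np.convolve(signal, kernel, mode="same")  # round 1: learnable kernel
    density = np.convolve(ones, gauss, mode="same")      # round 2: Gaussian on 1s
    return response / np.maximum(density, 1e-8)          # density-normalized output

# Three points fall on vertex 0, one on vertex 1, one on vertex 4.
positions = np.array([0.0, 0.2, -0.1, 1.0, 4.0])
values = np.full(5, 2.0)
gauss = np.array([0.25, 0.5, 0.25])
out = normalized_filter(positions, values, gauss, gauss, n_vertices=6)
```

When the first-round kernel equals the Gaussian and the input is constant, the normalized output recovers that constant regardless of how many points land on each vertex, which is the effect the normalization is meant to achieve.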
Appendix B RueMonge2014 Facade Segmentation
Network architecture of SPLATNet$_{3D}$.
We use 5 BCLs followed by 2 CONV layers in SPLATNet$_{3D}$ for the facade segmentation task. We omit the initial CONV layer since we find it has no effect on the overall performance. The numbers of output channels in the layers are: B64-B128-B128-B128-B64-C64-C7. Note that although written as a linear structure, the network has skip connections from all BCLs (layers starting with ‘B’) to the penultimate $1\times 1$ CONV layer. We use an initial scale $\Lambda_0=32I_3$ for scaling lattice features $XYZ$, and divide the scale in half after each BCL. The unit of raw input features is meter, with (aligned with gravity axis) having a range of meters. For all the BCLs, we use filters operating on one-ring neighborhoods on the lattice.
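As a worked example of the halving rule, the following sketch lists the lattice scale applied at each of the 5 BCLs when starting from $\Lambda_0=32I_3$. The helper name `bcl_scales` is hypothetical and only illustrates the arithmetic.

```python
import numpy as np

def bcl_scales(initial_scale, num_bcls, dim=3):
    """Return one diagonal scaling matrix per BCL, halved after each layer."""
    return [initial_scale / (2 ** t) * np.eye(dim) for t in range(num_bcls)]

scales = bcl_scales(32.0, 5)
print([float(s[0, 0]) for s in scales])  # [32.0, 16.0, 8.0, 4.0, 2.0]

# Lattice features for a point are its XYZ coordinates multiplied by the scale.
xyz = np.array([0.5, 1.0, 2.0])
lattice_features = scales[0] @ xyz
```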
Network architecture of SPLATNet$_{2D\text{-}3D}$.
We use SPLATNet$_{3D}$ as described above as the 3D component of our 2D-3D joint model. The ‘2D-3D Fusion’ component has 2 CONV layers: C64-C7. DeepLab [10] segmentation architecture is used as CNN$_1$. CNN$_2$ is a small network with 2 CONV layers: C32-C7, where the first layer has filters and 32 output channels, and the second one has filters and 7 output channels.
We use and
for 2D-3D projections with BCLs. Note that the dataset provides one-to-many mappings from 3D points to pixels. By using a very large scale, 3D unaries are directly mapped to the corresponding 2D pixel locations without any interpolation.
Training.
We randomly sample facade segments of 60k points and use a batch size of 4 when training SPLATNet. CNN$_1$ is initialized with Pascal VOC [12] pre-trained weights and fine-tuned for 2D facade segmentation. Adam optimizer [26] with an initial learning rate of is used for training both SPLATNet$_{3D}$ and SPLATNet$_{2D\text{-}3D}$. Since the training data is small, we augment point cloud training data with random rotations, translations, and small color perturbations. We also augment 2D image data with small color perturbations during training.
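A minimal sketch of this kind of point cloud augmentation is given below. The text does not state the rotation axes or the perturbation ranges, so the rotation about the vertical (z) axis and the parameters `max_shift` and `color_sigma` are assumptions chosen purely for illustration.

```python
import numpy as np

def augment_point_cloud(xyz, rgb, rng, max_shift=0.5, color_sigma=0.02):
    """Apply a random rotation, translation, and small color perturbation.

    xyz: (N, 3) point positions; rgb: (N, 3) colors in [0, 1].
    The rotation axis and all ranges are illustrative assumptions.
    """
    theta = rng.uniform(0.0, 2.0 * np.pi)
    c, s = np.cos(theta), np.sin(theta)
    rot_z = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    shift = rng.uniform(-max_shift, max_shift, size=3)
    xyz_aug = xyz @ rot_z.T + shift  # rigid motion keeps the shape intact
    noise = rng.normal(0.0, color_sigma, size=rgb.shape)
    rgb_aug = np.clip(rgb + noise, 0.0, 1.0)  # keep colors in the valid range
    return xyz_aug, rgb_aug

rng = np.random.default_rng(0)
points = rng.normal(size=(1000, 3))
colors = rng.uniform(size=(1000, 3))
points_aug, colors_aug = augment_point_cloud(points, colors, rng)
```

Because the rotation and translation form a rigid motion, distances between points are preserved and only the pose of the segment changes.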
Appendix C ShapeNet Part Segmentation
Network architecture of SPLATNet$_{3D}$.
We use a CONV layer in the beginning, followed by 5 BCLs, and then 2 CONV layers in SPLATNet$_{3D}$ for the ShapeNet part segmentation task. The numbers of output channels in the layers are: C32-B64-B128-B256-B256-B256-C128-Cx. ‘x’ in the last CONV layer denotes the number of part categories, and ranges from 2 to 6 for different object categories.
We use an initial scale
for scaling lattice features , and divide the scale in half after each BCL:
.
Network architecture of SPLATNet$_{2D\text{-}3D}$.
We use SPLATNet$_{3D}$ as described above as the 3D component of the joint model. The ‘2D-3D Fusion’ component has 2 CONV layers: C128-Cx. The same DeepLab architecture is used for CNN$_1$.
We use in BCL.
Since 2D prediction is not needed, CNN$_2$ and BCL$_{3D\to 2D}$ are omitted.
Training.
We train separate models for each object category. CNN$_1$ is initialized the same way as in the facade experiment. Adam optimizer with an initial learning rate of is used. We augment point cloud data with random rotations, translations, and scalings during training.
We train our networks until validation loss plateaus. Training SPLATNet$_{3D}$ and SPLATNet$_{2D\text{-}3D}$ takes about and days respectively. With default settings, training PointNet++ takes days on the same hardware.
Dataset labeling issues.
We observed a few types of labeling issues in the ShapeNet Part dataset:

Some object part categories are frequently labeled incorrectly. E.g., skateboard axles are often mistakenly labeled as ‘deck’ or ‘wheel’ (Figure 8(a)).

Some object parts, e.g. ‘fin’ of some rockets, have incomplete range or coverage (Figure 8(b)).

Some object part categories are labeled inconsistently between shapes. E.g., airplane landing gears are seen labeled as ‘body’, ‘engine’, or ‘wings’ (Figure 8(c)).

Some categories have parts that are labeled as ‘other’, which can be confusing for the classifier as these parts do not have clear semantic meanings or structures. E.g., in the case of earphones, anything that is not ‘headband’ or ‘earphone’ is given the same label (‘other’) (Figure 8(d)).
The first two issues make evaluations and comparisons on the benchmark less reliable, while the other two make learning ill-posed or unnecessarily hard for the networks.