DeepGCNs: Can GCNs Go as Deep as CNNs?
Typically, most state-of-the-art GCN models are no deeper than 3 or 4 layers. Showing in the title, the main purpose here is to stack more layers into GCN without the common vanishing gradient problem.
adapting residual/dense connections, and dilated
convolutions to GCN.
The equation 1 show how a gernel GCN work. GCNs extract richer features at a vertex by aggragating features of vertices from its neighborhood.
The aggregation function can be a mean aggregator, a max-pooling aggregator, an attention aggregator or an LSTM aggregator. The update function can be a multi-layer perceptron, a gated network, etc.
In this work, the vertex feature aggragation function is max-pooling and update function is MLP with batch normalization.
Recent work demonstrates that dynamic graph convolution, where the graph structure is allowed to change in each layer, can learn better graph
representations compared to GCNs with fixed graph structure. Dynamically changing neighbors in GCNs helps alleviate the over-smoothing problem and results in an effectively larger receptive field.
figure 2, 3 show that how res-connection and dilated convolution work in GCN.
Detail of Dilated Conv: A Dilated k-NN to find dilated neighbors is used after every GCN layer and construct a Dilated Graph. In particular, for an input graph G = (V, E) with Dilated k-NN and d as the dilation rate, the Dilated k-NN returns the k nearest neighbors within the k × d neighborhood region by skipping every d neighbors. The nearest neighbors are determined based on a pre-defined distance metric.
Effect of residual graph connections.
When the residual graph connections between layers are removed (i.e. in PlainGCN-28), performance dramatically degrades (-12% mIoU).
Effect of dilation
Results in Table 1 (Dilation) show that dilated graph convolutions account for a 2.85% improvement in mean IoU (row 3), motivated primarily by the expansion of the network’s receptive field.
Effect of nearest neighbors.
Results in Table 1 (Neighbors) show that a larger number of neighbors helps in general. As the number of neighbors is decreased by a factor of 2 and 4, the performance drops by 2.5% and 3.3% respectively.
HOW POWERFUL ARE GRAPH NEURAL NETWORKS?
GNNs broadly follow a recursive neighborhood aggregation (or message passing) scheme, where each node aggregates feature vectors of its neighbors to compute its new feature vector. After k iterations of aggregation, a node is represented by its transformed feature vector, which captures the structural information within the node’s k-hop neighborhood. The representation of an entire graph can then be obtained through pooling, for example, by summing the representation vectors of all nodes in the graph.
In this paper, author propose a theoretical framework to analyze the representation power of GNNs inpired from Weisfeiler-Lehman(WL) test. The WL test was used to distinguish a broad class of graphs by iteratively updating a given node’s feature vector by aggregating feature vectors of its network neighbors. What makes the WL test so powerful is its injective aggregation update that maps different node neighborhoods to different feature vectors. Thus, A GNN can have as large discriminative power as the WL test if the GNN’s aggregation scheme is highly expressive and can model injective functions.
- Showing that GNNs are at most as powerful as the WL test in distinguishing graph structures.
- establishing conditions on the neighbor aggregation and graph readout functions under which the resulting GNN is as powerful as the WL test.
- identifing graph structures that cannot be distinguished by popular GNN variants, such as GCN and GraphSAGE, and we precisely characterize the kinds of graph structures such GNN-based models can capture.
- developing a simple neural architecture, Graph Isomorphism Network (GIN), and show that its discriminative/representational power is equal to the power of the WL test.
readout for graph classification
The figure above show how the pooling aggregators would affect the expressive ability. Mean and max have trouble distinguishing graphs with nodes that have repeating features. But the mean aggregator may perform well if, for the task, the statistical and distributional information in the graph is more important than the exact structure. Only sum could capture the graph structure. Max-pooling captures neither the exact structure nor the distribution. However, it may be suitable for tasks where it is important to identify representative elements or the “skeleton”, rather than to distinguish the exact structure or distribution.