CAIR Deep Learning Research Publications

2020

Lamprecht DB, Barnard E. Using a meta-model to compensate for training-evaluation mismatches. In: Southern African Conference for Artificial Intelligence Research. South Africa; 2020. https://sacair.org.za/proceedings/.

One of the fundamental assumptions of machine learning is that learnt models are applied to data that is identically distributed to the training data. This assumption is often not realistic: for example, data collected from a single source at different times may not be distributed identically, due to sampling bias or changes in the environment. We propose a new architecture called a meta-model which predicts performance for unseen models. This approach is applicable when several ‘proxy’ datasets are available to train a model to be deployed on a ‘target’ test set; the architecture is used to identify which regression algorithms should be used as well as which datasets are most useful to train for a given target dataset. Finally, we demonstrate the strengths and weaknesses of the proposed meta-model by making use of artificially generated datasets using a variation of the Friedman method 3 used to generate artificial regression datasets, and discuss real-world applications of our approach.

@inproceedings{404,
  author = {Dylan Lamprecht and Etienne Barnard},
  title = {Using a meta-model to compensate for training-evaluation mismatches},
  abstract = {One of the fundamental assumptions of machine learning is that learnt models are applied to data that is identically distributed to the training data. This assumption is often not realistic: for example, data collected from a single source at different times may not be distributed identically, due to sampling bias or changes in the environment. We propose a new architecture called a meta-model which predicts performance for unseen models. This approach is applicable when several ‘proxy’ datasets are available to train a model to be deployed on a ‘target’ test set; the architecture is used to identify which regression algorithms should be used as well as which datasets are most useful to train for a given target dataset. Finally, we demonstrate the strengths and weaknesses of the proposed meta-model by making use of artificially generated datasets using a variation of the Friedman method 3 used to generate artificial regression datasets, and discuss real-world applications of our approach.},
  year = {2020},
  booktitle = {Southern African Conference for Artificial Intelligence Research},
  pages = {321-334},
  month = {22/02/2021 - 26/02/2021},
  address = {South Africa},
  isbn = {978-0-620-89373-2},
  url = {https://sacair.org.za/proceedings/},
}
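
The artificial regression data referred to in this abstract can be reproduced in spirit with scikit-learn's make_friedman3 generator, which implements the standard Friedman #3 benchmark. The sketch below is only illustrative: the authors' specific variation of the generator and their modelling setup are not described here, so the simulated mismatch (noise added to the 'proxy' set only), the choice of a random-forest regressor and all parameter values are assumptions.

    # Minimal sketch (not the authors' code): generate Friedman #3 regression data
    # and simulate a proxy/target mismatch by adding noise to the proxy set only.
    from sklearn.datasets import make_friedman3
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_squared_error

    X_proxy, y_proxy = make_friedman3(n_samples=2000, noise=0.5, random_state=0)
    X_target, y_target = make_friedman3(n_samples=500, noise=0.0, random_state=1)

    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(X_proxy, y_proxy)
    print("MSE on the mismatched target set:",
          mean_squared_error(y_target, model.predict(X_target)))
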
Heyns N, Barnard E. Optimising word embeddings for recognised multilingual speech. In: Southern African Conference for Artificial Intelligence Research. South Africa; 2020. https://sacair.org.za/proceedings/.

Word embeddings are widely used in natural language processing (NLP) tasks. Most work on word embeddings focuses on monolingual languages with large available datasets. For embeddings to be useful in a multilingual environment, as in South Africa, the training techniques have to be adjusted to cater for a) multiple languages, b) smaller datasets and c) the occurrence of code-switching. One of the biggest roadblocks is to obtain datasets that include examples of natural code-switching, since code switching is generally avoided in written material. A solution to this problem is to use speech recognised data. Embedding packages like Word2Vec and GloVe have default hyper-parameter settings that are usually optimised for training on large datasets and evaluation on analogy tasks. When using embeddings for problems such as text classification in our multilingual environment, the hyper-parameters have to be optimised for the specific data and task. We investigate the importance of optimising relevant hyper-parameters for training word embeddings with speech recognised data, where code-switching occurs, and evaluate against the real-world problem of classifying radio and television recordings with code switching. We compare these models with a bag of words baseline model as well as a pre-trained GloVe model.

@inproceedings{403,
  author = {Nuette Heyns and Etienne Barnard},
  title = {Optimising word embeddings for recognised multilingual speech},
  abstract = {Word embeddings are widely used in natural language processing (NLP) tasks. Most work on word embeddings focuses on monolingual languages with large available datasets. For embeddings to be useful in a multilingual environment, as in South Africa, the training techniques have to be adjusted to cater for a) multiple languages, b) smaller datasets and c) the occurrence of code-switching. One of the biggest roadblocks is to obtain datasets that include examples of natural code-switching, since code switching is generally avoided in written material. A solution to this problem is to use speech recognised data. Embedding packages like Word2Vec and GloVe have default hyper-parameter settings that are usually optimised for training on large datasets and evaluation on analogy tasks. When using embeddings for problems such as text classification in our multilingual environment, the hyper-parameters have to be optimised for the specific data and task. We investigate the importance of optimising relevant hyper-parameters for training word embeddings with speech recognised data, where code-switching occurs, and evaluate against the real-world problem of classifying radio and television recordings with code switching. We compare these models with a bag of words baseline model as well as a pre-trained GloVe model.},
  year = {2020},
  booktitle = {Southern African Conference for Artificial Intelligence Research},
  pages = {102-116},
  month = {22/02/2021 - 26/02/2021},
  address = {South Africa},
  isbn = {978-0-620-89373-2},
  url = {https://sacair.org.za/proceedings/},
}
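
The hyper-parameter tuning discussed in this abstract maps onto a handful of settings in embedding toolkits such as gensim. The sketch below shows the gensim 4.x Word2Vec parameters that typically need adjusting for small, code-switched corpora; the toy transcript sentences and every parameter value are illustrative assumptions, not the data or settings used in the paper.

    # Minimal sketch: training Word2Vec with explicitly chosen hyper-parameters
    # (gensim 4.x API). Values are placeholders showing which settings usually need
    # tuning on small, code-switched corpora.
    from gensim.models import Word2Vec

    sentences = [
        ["die", "minister", "het", "gesê", "the", "budget", "is", "approved"],
        ["izindaba", "zanamuhla", "on", "the", "radio"],
    ]  # toy recognised-speech transcripts with code-switching

    model = Word2Vec(
        sentences,
        vector_size=100,   # embedding dimensionality
        window=5,          # context window size
        min_count=1,       # keep rare words: small corpora cannot afford a high cutoff
        sg=1,              # skip-gram rather than CBOW
        negative=10,       # number of negative samples
        epochs=20,         # extra passes to compensate for limited data
        seed=1,
    )
    print(model.wv.most_similar("radio", topn=3))
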
Haasbroek DG, Davel MH. Exploring neural network training dynamics through binary node activations. In: Southern African Conference for Artificial Intelligence Research. South Africa; 2020. https://sacair.org.za/proceedings/.

Each node in a neural network is trained to activate for a specific region in the input domain. Any training samples that fall within this domain are therefore implicitly clustered together. Recent work has highlighted the importance of these clusters during the training process but has not yet investigated their evolution during training. Towards this goal, we train several ReLU-activated MLPs on a simple classification task (MNIST) and show that a consistent training process emerges: (1) sample clusters initially increase in size and then decrease as training progresses, (2) the size of sample clusters in the first layer decreases more rapidly than in deeper layers, (3) binary node activations, especially of nodes in deeper layers, become more sensitive to class membership as training progresses, (4) individual nodes remain poor predictors of class membership, even if accurate when applied as a group. We report on the detail of these findings and interpret them from the perspective of a high-dimensional clustering process.

@inproceedings{402,
  author = {Daniël Haasbroek and Marelie Davel},
  title = {Exploring neural network training dynamics through binary node activations},
  abstract = {Each node in a neural network is trained to activate for a specific region in the input domain. Any training samples that fall within this domain are therefore implicitly clustered together. Recent work has highlighted the importance of these clusters during the training process but has not yet investigated their evolution during training. Towards this goal, we train several ReLU-activated MLPs on a simple classification task (MNIST) and show that a consistent training process emerges: (1) sample clusters initially increase in size and then decrease as training progresses, (2) the size of sample clusters in the first layer decreases more rapidly than in deeper layers, (3) binary node activations, especially of nodes in deeper layers, become more sensitive to class membership as training progresses, (4) individual nodes remain poor predictors of class membership, even if accurate when applied as a group. We report on the detail of these findings and interpret them from the perspective of a high-dimensional clustering process.},
  year = {2020},
  booktitle = {Southern African Conference for Artificial Intelligence Research},
  pages = {304-320},
  month = {22/02/2021 - 26/02/2021},
  address = {South Africa},
  isbn = {978-0-620-89373-2},
  url = {https://sacair.org.za/proceedings/},
}
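
The implicit clustering of samples by shared binary activation patterns can be made concrete in a few lines of PyTorch. The sketch below uses an untrained toy network and random inputs purely to show the mechanics; the layer sizes, data and everything else are assumptions and do not reproduce the paper's MNIST experiments.

    # Minimal sketch: group samples by the binary (on/off) activation pattern of a
    # ReLU layer. The tiny untrained network and random inputs are placeholders for
    # the trained MNIST MLPs studied in the paper; only the mechanics are shown.
    from collections import Counter
    import torch

    torch.manual_seed(0)
    model = torch.nn.Sequential(torch.nn.Linear(784, 8), torch.nn.ReLU(),
                                torch.nn.Linear(8, 10))
    x = torch.randn(1000, 784)          # stand-in for flattened MNIST images

    with torch.no_grad():
        h1 = torch.relu(model[0](x))    # first hidden layer activations
    patterns = (h1 > 0)                 # binary activation pattern per sample

    clusters = Counter(tuple(p.tolist()) for p in patterns)
    print("distinct activation patterns (clusters):", len(clusters))
    print("largest cluster size:", max(clusters.values()))
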
Venter AEW, Theunissen MW, Davel MH. Pre-interpolation loss behavior in neural networks. In: Southern African Conference for Artificial Intelligence Research. South Africa: Springer; 2020. doi:https://doi.org/10.1007/978-3-030-66151-9_19.

When training neural networks as classifiers, it is common to observe an increase in average test loss while still maintaining or improving the overall classification accuracy on the same dataset. In spite of the ubiquity of this phenomenon, it has not been well studied and is often dismissively attributed to an increase in borderline correct classifications. We present an empirical investigation that shows how this phenomenon is actually a result of the differential manner by which test samples are processed. In essence: test loss does not increase overall, but only for a small minority of samples. Large representational capacities allow losses to decrease for the vast majority of test samples at the cost of extreme increases for others. This effect seems to be mainly caused by increased parameter values relating to the correctly processed sample features. Our findings contribute to the practical understanding of a common behaviour of deep neural networks. We also discuss the implications of this work for network optimisation and generalisation.

@inproceedings{401,
  author = {Arthur Venter and Marthinus Theunissen and Marelie Davel},
  title = {Pre-interpolation loss behavior in neural networks},
  abstract = {When training neural networks as classifiers, it is common to observe an increase in average test loss while still maintaining or improving the overall classification accuracy on the same dataset. In spite of the ubiquity of this phenomenon, it has not been well studied and is often dismissively attributed to an increase in borderline correct classifications. We present an empirical investigation that shows how this phenomenon is actually a result of the differential manner by which test samples are processed. In essence: test loss does not increase overall, but only for a small minority of samples. Large representational capacities allow losses to decrease for the vast majority of test samples at the cost of extreme increases for others. This effect seems to be mainly caused by increased parameter values relating to the correctly processed sample features. Our findings contribute to the practical understanding of a common behaviour of deep neural networks. We also discuss the implications of this work for network optimisation and generalisation.},
  year = {2020},
  booktitle = {Southern African Conference for Artificial Intelligence Research},
  pages = {296-309},
  month = {22/02/2021 - 26/02/2021},
  publisher = {Springer},
  address = {South Africa},
  isbn = {978-3-030-66151-9},
  doi = {https://doi.org/10.1007/978-3-030-66151-9_19},
}
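
The central observation, that average test loss can rise while most individual sample losses keep falling, is easy to check by inspecting per-sample losses rather than the aggregate. A minimal sketch follows; the model and data are random placeholders, and in practice these statistics would be computed at several checkpoints during training.

    # Minimal sketch: per-sample cross-entropy losses on a test set. A rising mean
    # with a stable median points to a small set of samples with extreme loss.
    import torch

    torch.manual_seed(0)
    model = torch.nn.Sequential(torch.nn.Linear(20, 64), torch.nn.ReLU(),
                                torch.nn.Linear(64, 5))
    x_test, y_test = torch.randn(512, 20), torch.randint(0, 5, (512,))

    with torch.no_grad():
        losses = torch.nn.functional.cross_entropy(model(x_test), y_test,
                                                   reduction="none")
    worst = losses.topk(5).values       # roughly the worst 1% of 512 samples
    print("mean loss:", losses.mean().item(), " median loss:", losses.median().item())
    print("share of total loss in worst 1%:", (worst.sum() / losses.sum()).item())
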
Myburgh JC, Mouton C, Davel MH. Tracking translation invariance in CNNs. In: Southern African Conference for Artificial Intelligence Research. South Africa: Springer; 2020. doi:https://doi.org/10.1007/978-3-030-66151-9_18.

Although Convolutional Neural Networks (CNNs) are widely used, their translation invariance (ability to deal with translated inputs) is still subject to some controversy. We explore this question using translation-sensitivity maps to quantify how sensitive a standard CNN is to a translated input. We propose the use of cosine similarity as sensitivity metric over Euclidean distance, and discuss the importance of restricting the dimensionality of either of these metrics when comparing architectures. Our main focus is to investigate the effect of different architectural components of a standard CNN on that network’s sensitivity to translation. By varying convolutional kernel sizes and amounts of zero padding, we control the size of the feature maps produced, allowing us to quantify the extent to which these elements influence translation invariance. We also measure translation invariance at different locations within the CNN to determine the extent to which convolutional and fully connected layers, respectively, contribute to the translation invariance of a CNN as a whole. Our analysis indicates that both convolutional kernel size and feature map size have a systematic influence on translation invariance. We also see that convolutional layers contribute less than expected to translation invariance, when not specifically forced to do so.

@inproceedings{400,
  author = {Johannes Myburgh and Coenraad Mouton and Marelie Davel},
  title = {Tracking translation invariance in CNNs},
  abstract = {Although Convolutional Neural Networks (CNNs) are widely used, their translation invariance (ability to deal with translated inputs) is still subject to some controversy. We explore this question using translation-sensitivity maps to quantify how sensitive a standard CNN is to a translated input. We propose the use of cosine similarity as sensitivity metric over Euclidean distance, and discuss the importance of restricting the dimensionality of either of these metrics when comparing architectures. Our main focus is to investigate the effect of different architectural components of a standard CNN on that network’s sensitivity to translation. By varying convolutional kernel sizes and amounts of zero padding, we control the size of the feature maps produced, allowing us to quantify the extent to which these elements influence translation invariance. We also measure translation invariance at different locations within the CNN to determine the extent to which convolutional and fully connected layers, respectively, contribute to the translation invariance of a CNN as a whole. Our analysis indicates that both convolutional kernel size and feature map size have a systematic influence on translation invariance. We also see that convolutional layers contribute less than expected to translation invariance, when not specifically forced to do so.},
  year = {2020},
  booktitle = {Southern African Conference for Artificial Intelligence Research},
  pages = {282-295},
  month = {22/02/2021 - 26/02/2021},
  publisher = {Springer},
  address = {South Africa},
  isbn = {978-3-030-66151-9},
  doi = {https://doi.org/10.1007/978-3-030-66151-9_18},
}
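
A translation-sensitivity map of the kind described above can be sketched by shifting the input by (dy, dx), passing both versions through the network, and recording the cosine similarity between the two feature vectors. The tiny CNN, the use of circular shifts via torch.roll and the shift range below are illustrative assumptions; the exact construction in the paper may differ.

    # Minimal sketch: a translation-sensitivity map. Each cell holds the cosine
    # similarity between features of the original image and the image shifted by
    # (dy, dx). The CNN and input are placeholders.
    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)
    cnn = torch.nn.Sequential(
        torch.nn.Conv2d(1, 8, kernel_size=3, padding=1), torch.nn.ReLU(),
        torch.nn.AdaptiveAvgPool2d(4), torch.nn.Flatten(),
    )

    image = torch.randn(1, 1, 28, 28)   # stand-in for a single MNIST digit
    shifts = range(-3, 4)
    sensitivity = torch.zeros(len(shifts), len(shifts))
    with torch.no_grad():
        ref = cnn(image).flatten()
        for i, dy in enumerate(shifts):
            for j, dx in enumerate(shifts):
                shifted = torch.roll(image, shifts=(dy, dx), dims=(2, 3))
                sensitivity[i, j] = F.cosine_similarity(ref, cnn(shifted).flatten(), dim=0)
    print(sensitivity)
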
Mouton C, Myburgh JC, Davel MH. Stride and translation invariance in CNNs. In: Southern African Conference for Artificial Intelligence Research. South Africa: Springer; 2020. doi:https://doi.org/10.1007/978-3-030-66151-9_17.

Convolutional Neural Networks have become the standard for image classification tasks, however, these architectures are not invariant to translations of the input image. This lack of invariance is attributed to the use of stride which subsamples the input, resulting in a loss of information, and fully connected layers which lack spatial reasoning. We show that stride can greatly benefit translation invariance given that it is combined with sufficient similarity between neighbouring pixels, a characteristic which we refer to as local homogeneity. We also observe that this characteristic is dataset-specific and dictates the relationship between pooling kernel size and stride required for translation invariance. Furthermore we find that a trade-off exists between generalization and translation invariance in the case of pooling kernel size, as larger kernel sizes lead to better invariance but poorer generalization. Finally we explore the efficacy of other solutions proposed, namely global average pooling, anti-aliasing, and data augmentation, both empirically and through the lens of local homogeneity.

@inproceedings{399,
  author = {Coenraad Mouton and Johannes Myburgh and Marelie Davel},
  title = {Stride and translation invariance in CNNs},
  abstract = {Convolutional Neural Networks have become the standard for image classification tasks, however, these architectures are not invariant to translations of the input image. This lack of invariance is attributed to the use of stride which subsamples the input, resulting in a loss of information, and fully connected layers which lack spatial reasoning. We show that stride can greatly benefit translation invariance given that it is combined with sufficient similarity between neighbouring pixels, a characteristic which we refer to as local homogeneity. We also observe that this characteristic is dataset-specific and dictates the relationship between pooling kernel size and stride required for translation invariance. Furthermore we find that a trade-off exists between generalization and translation invariance in the case of pooling kernel size, as larger kernel sizes lead to better invariance but poorer generalization. Finally we explore the efficacy of other solutions proposed, namely global average pooling, anti-aliasing, and data augmentation, both empirically and through the lens of local homogeneity.},
  year = {2020},
  booktitle = {Southern African Conference for Artificial Intelligence Research},
  pages = {267-281},
  month = {22/02/2021 - 26/02/2021},
  publisher = {Springer},
  address = {South Africa},
  isbn = {978-3-030-66151-9},
  doi = {https://doi.org/10.1007/978-3-030-66151-9_17},
}
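
The abstract does not define "local homogeneity" precisely, so the sketch below uses one plausible proxy, the average absolute difference between neighbouring pixel values, purely to illustrate that datasets differ in how similar adjacent pixels are. The function name, the formula and the toy data are all assumptions, not the measure used in the paper.

    # Minimal sketch: a crude proxy for "local homogeneity" -- one minus the mean
    # absolute difference between horizontally and vertically adjacent pixels.
    import torch

    def local_homogeneity(images: torch.Tensor) -> float:
        """images: (N, C, H, W) tensor with values scaled to [0, 1]."""
        dh = (images[..., :, 1:] - images[..., :, :-1]).abs().mean()
        dv = (images[..., 1:, :] - images[..., :-1, :]).abs().mean()
        return 1.0 - 0.5 * (dh + dv).item()

    smooth = torch.linspace(0, 1, 28).repeat(16, 1, 28, 1)   # smooth gradients
    noisy = torch.rand(16, 1, 28, 28)                        # i.i.d. noise
    print("smooth images:", local_homogeneity(smooth))
    print("noisy images: ", local_homogeneity(noisy))
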
Strydom RA, Barnard E. Classifying recognised speech with deep neural networks. In: Southern African Conference for Artificial Intelligence Research. South Africa; 2020.

We investigate whether word embeddings using deep neural networks can assist in the analysis of text produced by a speech-recognition system. In particular, we develop algorithms to identify which words are incorrectly detected by a speech-recognition system in broadcast news. The multilingual corpus used in this investigation contains speech from the eleven official South African languages, as well as Hindi. Popular word embedding algorithms such as Word2Vec and fastText are investigated and compared with context-specific embedding representations such as Doc2Vec and non-context specific statistical sentence embedding methods such as term frequency-inverse document frequency (TFIDF), which is used as our baseline method. These various embedding methods are then used as fixed length input representations for a logistic regression and feed forward neural network classifier. The output is used as an additional categorical input feature to a CatBoost classifier to determine whether the words were correctly recognised. Other methods are also investigated, including a method that uses the word embedding itself and cosine similarity between specific keywords to identify whether a specific keyword was correctly detected. When relying only on the speech-text data, the best result was obtained using the TFIDF document embeddings as input features to a feed forward neural network. Adding the output from the feed forward neural network as an additional feature to the CatBoost classifier did not enhance the classifier’s performance compared to using the non-textual information provided, although adding the output from a weaker classifier was somewhat beneficial.

@inproceedings{398,
  author = {Rhyno Strydom and Etienne Barnard},
  title = {Classifying recognised speech with deep neural networks},
  abstract = {We investigate whether word embeddings using deep neural networks can assist in the analysis of text produced by a speech-recognition system. In particular, we develop algorithms to identify which words are incorrectly detected by a speech-recognition system in broadcast news. The multilingual corpus used in this investigation contains speech from the eleven official South African languages, as well as Hindi. Popular word embedding algorithms such as Word2Vec and fastText are investigated and compared with context-specific embedding representations such as Doc2Vec and non-context specific statistical sentence embedding methods such as term frequency-inverse document frequency (TFIDF), which is used as our baseline method. These various embedding methods are then used as fixed length input representations for a logistic regression and feed forward neural network classifier. The output is used as an additional categorical input feature to a CatBoost classifier to determine whether the words were correctly recognised. Other methods are also investigated, including a method that uses the word embedding itself and cosine similarity between specific keywords to identify whether a specific keyword was correctly detected. When relying only on the speech-text data, the best result was obtained using the TFIDF document embeddings as input features to a feed forward neural network. Adding the output from the feed forward neural network as an additional feature to the CatBoost classifier did not enhance the classifier’s performance compared to using the non-textual information provided, although adding the output from a weaker classifier was somewhat beneficial.},
  year = {2020},
  booktitle = {Southern African Conference for Artificial Intelligence Research},
  pages = {191-205},
  month = {22/02/2021 - 26/02/2021},
  publisher = {Southern African Conference for Artificial Intelligence Research},
  address = {South Africa},
  isbn = {978-0-620-89373-2},
}
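
The TFIDF baseline with a feed-forward classifier described above maps naturally onto a small scikit-learn pipeline. The toy transcript segments, labels and hyper-parameters below are illustrative assumptions; the paper's corpus, label definition and classifier settings are not reproduced here.

    # Minimal sketch: TF-IDF document features feeding a small feed-forward
    # classifier, mirroring the baseline described above. Data and labels are toys.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.neural_network import MLPClassifier
    from sklearn.pipeline import make_pipeline

    segments = [
        "the president addressed the nation tonight",
        "die president het vanaand die nasie toegespreek",
        "weather report for the western cape",
        "izindaba zezulu namuhla ntambama",
    ]
    labels = [0, 1, 0, 1]  # toy targets, stand-ins for recognition-quality labels

    clf = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),
        MLPClassifier(hidden_layer_sizes=(64,), max_iter=1000, random_state=0),
    )
    clf.fit(segments, labels)
    print(clf.predict(["the president het die weather toegespreek"]))
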
Theunissen MW, Davel MH, Barnard E. Benign interpolation of noise in deep learning. South African Computer Journal. 2020;32(2). doi:https://doi.org/10.18489/sacj.v32i2.833.

The understanding of generalisation in machine learning is in a state of flux, in part due to the ability of deep learning models to interpolate noisy training data and still perform appropriately on out-of-sample data, thereby contradicting long-held intuitions about the bias-variance trade-off in learning. We expand upon relevant existing work by discussing local attributes of neural network training within the context of a relatively simple framework. We describe how various types of noise can be compensated for within the proposed framework in order to allow the deep learning model to generalise in spite of interpolating spurious function descriptors. Empirically, we support our postulates with experiments involving overparameterised multilayer perceptrons and controlled training data noise. The main insights are that deep learning models are optimised for training data modularly, with different regions in the function space dedicated to fitting distinct types of sample information. Additionally, we show that models tend to fit uncorrupted samples first. Based on this finding, we propose a conjecture to explain an observed instance of the epoch-wise double-descent phenomenon. Our findings suggest that the notion of model capacity needs to be modified to consider the distributed way training data is fitted across sub-units.

@article{394,
  author = {Marthinus Theunissen and Marelie Davel and Etienne Barnard},
  title = {Benign interpolation of noise in deep learning},
  abstract = {The understanding of generalisation in machine learning is in a state of flux, in part due to the ability of deep learning models to interpolate noisy training data and still perform appropriately on out-of-sample data, thereby contradicting long-held intuitions about the bias-variance trade-off in learning. We expand upon relevant existing work by discussing local attributes of neural network training within the context of a relatively simple framework. We describe how various types of noise can be compensated for within the proposed framework in order to allow the deep learning model to generalise in spite of interpolating spurious function descriptors. Empirically, we support our postulates with experiments involving overparameterised multilayer perceptrons and controlled training data noise. The main insights are that deep learning models are optimised for training data modularly, with different regions in the function space dedicated to fitting distinct types of sample information. Additionally, we show that models tend to fit uncorrupted samples first. Based on this finding, we propose a conjecture to explain an observed instance of the epoch-wise double-descent phenomenon. Our findings suggest that the notion of model capacity needs to be modified to consider the distributed way training data is fitted across sub-units.},
  year = {2020},
  journal = {South African Computer Journal},
  volume = {32},
  pages = {80-101},
  issue = {2},
  publisher = {South African Institute of Computer Scientists and Information Technologists},
  issn = {1015-7999 (print); 2313-7835 (online)},
  doi = {https://doi.org/10.18489/sacj.v32i2.833},
}
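
The controlled-noise setup discussed above can be sketched by corrupting a fraction of training labels and tracking training accuracy separately on the clean and corrupted subsets. Everything below (the synthetic rule generating clean labels, the network size, the optimiser settings) is an illustrative assumption rather than the paper's experimental configuration.

    # Minimal sketch: corrupt 20% of the training labels and watch how quickly the
    # clean and corrupted subsets are fitted. Clean labels follow a simple learnable
    # rule, so corrupted labels must be memorised.
    import torch

    torch.manual_seed(0)
    n, d, classes = 2000, 50, 10
    x = torch.randn(n, d)
    y = x[:, :classes].argmax(dim=1)                    # clean, learnable labels
    noisy_idx = torch.randperm(n)[: n // 5]
    y[noisy_idx] = torch.randint(0, classes, (len(noisy_idx),))  # corrupted labels
    clean = torch.ones(n, dtype=torch.bool)
    clean[noisy_idx] = False

    model = torch.nn.Sequential(torch.nn.Linear(d, 512), torch.nn.ReLU(),
                                torch.nn.Linear(512, classes))
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for epoch in range(301):
        opt.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(x), y)
        loss.backward()
        opt.step()
        if epoch % 100 == 0:
            with torch.no_grad():
                pred = model(x).argmax(dim=1)
                acc_clean = (pred[clean] == y[clean]).float().mean().item()
                acc_noisy = (pred[~clean] == y[~clean]).float().mean().item()
            print(f"epoch {epoch:3d}  clean acc {acc_clean:.2f}  corrupted acc {acc_noisy:.2f}")
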
Beukes JP, Davel MH, Lotz S. Pairwise networks for feature ranking of a geomagnetic storm model. South African Computer Journal. 2020;32(2). doi:https://doi.org/10.18489/sacj.v32i2.860.

Feedforward neural networks provide the basis for complex regression models that produce accurate predictions in a variety of applications. However, they generally do not explicitly provide any information about the utility of each of the input parameters in terms of their contribution to model accuracy. With this in mind, we develop the pairwise network, an adaptation to the fully connected feedforward network that allows the ranking of input parameters according to their contribution to the model output. The application is demonstrated in the context of a space physics problem. Geomagnetic storms are multi-day events characterised by significant perturbations to the magnetic field of the Earth, driven by solar activity. Previous storm forecasting efforts typically use solar wind measurements as input parameters to a regression problem tasked with predicting a perturbation index such as the 1-minute cadence symmetric-H (Sym-H) index. We re-visit the task of predicting Sym-H from solar wind parameters, with two 'twists': (i) Geomagnetic storm phase information is incorporated as model inputs and shown to increase prediction performance. (ii) We describe the pairwise network structure and training process - first validating ranking ability on synthetic data, before using the network to analyse the Sym-H problem.

@article{392,
  author = {Jacques Beukes and Marelie Davel and Stefan Lotz},
  title = {Pairwise networks for feature ranking of a geomagnetic storm model},
  abstract = {Feedforward neural networks provide the basis for complex regression models that produce accurate predictions in a variety of applications. However, they generally do not explicitly provide any information about the utility of each of the input parameters in terms of their contribution to model accuracy. With this in mind, we develop the pairwise network, an adaptation to the fully connected feedforward network that allows the ranking of input parameters according to their contribution to the model output. The application is demonstrated in the context of a space physics problem. Geomagnetic storms are multi-day events characterised by significant perturbations to the magnetic field of the Earth, driven by solar activity. Previous storm forecasting efforts typically use solar wind measurements as input parameters to a regression problem tasked with predicting a perturbation index such as the 1-minute cadence symmetric-H (Sym-H) index. We re-visit the task of predicting Sym-H from solar wind parameters, with two 'twists': (i) Geomagnetic storm phase information is incorporated as model inputs and shown to increase prediction performance. (ii) We describe the pairwise network structure and training process - first validating ranking ability on synthetic data, before using the network to analyse the Sym-H problem.},
  year = {2020},
  journal = {South African Computer Journal},
  volume = {32},
  pages = {35-55},
  issue = {2},
  publisher = {South African Institute of Computer Scientists and Information Technologists},
  issn = {1015-7999 (print); 2313-7835 (online)},
  doi = {https://doi.org/10.18489/sacj.v32i2.860},
}
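
The pairwise-network architecture itself is not described in the abstract, so rather than guess at its structure, the sketch below ranks the inputs of a trained feedforward regressor using generic permutation importance. This is explicitly not the paper's method, only a simple baseline for the same goal of ranking inputs by their contribution to the model output; the synthetic data and model settings are assumptions.

    # Minimal sketch: permutation-importance ranking of input features for a
    # trained feedforward regressor (NOT the paper's pairwise-network method).
    import numpy as np
    from sklearn.neural_network import MLPRegressor
    from sklearn.inspection import permutation_importance

    rng = np.random.default_rng(0)
    X = rng.normal(size=(2000, 5))
    y = 3 * X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=2000)  # feature 0 dominates

    model = MLPRegressor(hidden_layer_sizes=(64,), max_iter=1000, random_state=0).fit(X, y)
    result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
    ranking = np.argsort(result.importances_mean)[::-1]
    print("feature ranking (most to least important):", ranking)
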
Davel MH. Using summary layers to probe neural network behaviour. South African Computer Journal. 2020;32(2). doi:https://doi.org/10.18489/sacj.v32i2.861.

No framework exists that can explain and predict the generalisation ability of deep neural networks in general circumstances. In fact, this question has not been answered for some of the least complicated of neural network architectures: fully-connected feedforward networks with rectified linear activations and a limited number of hidden layers. For such an architecture, we show how adding a summary layer to the network makes it more amenable to analysis, and allows us to define the conditions that are required to guarantee that a set of samples will all be classified correctly. This process does not describe the generalisation behaviour of these networks, but produces a number of metrics that are useful for probing their learning and generalisation behaviour. We support the analytical conclusions with empirical results, both to confirm that the mathematical guarantees hold in practice, and to demonstrate the use of the analysis process.

@article{391,
  author = {Marelie Davel},
  title = {Using summary layers to probe neural network behaviour},
  abstract = {No framework exists that can explain and predict the generalisation ability of deep neural networks in general circumstances. In fact, this question has not been answered for some of the least complicated of neural network architectures: fully-connected feedforward networks with rectified linear activations and a limited number of hidden layers. For such an architecture, we show how adding a summary layer to the network makes it more amenable to analysis, and allows us to define the conditions that are required to guarantee that a set of samples will all be classified correctly. This process does not describe the generalisation behaviour of these networks, but produces a number of metrics that are useful for probing their learning and generalisation behaviour. We support the analytical conclusions with empirical results, both to confirm that the mathematical guarantees hold in practice, and to demonstrate the use of the analysis process.},
  year = {2020},
  journal = {South African Computer Journal},
  volume = {32},
  pages = {102-123},
  issue = {2},
  publisher = {South African Institute of Computer Scientists and Information Technologists},
  issn = {1015-7999 (print); 2313-7835 (online)},
  url = {http://hdl.handle.net/10394/36916},
  doi = {https://doi.org/10.18489/sacj.v32i2.861},
}
Davel MH, Theunissen MW, Pretorius AP, Barnard E. DNNs as layers of cooperating classifiers. In: The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20). New York; 2020.

A robust theoretical framework that can describe and predict the generalization ability of deep neural networks (DNNs) in general circumstances remains elusive. Classical attempts have produced complexity metrics that rely heavily on global measures of compactness and capacity with little investigation into the effects of sub-component collaboration. We demonstrate intriguing regularities in the activation patterns of the hidden nodes within fully-connected feedforward networks. By tracing the origin of these patterns, we show how such networks can be viewed as the combination of two information processing systems: one continuous and one discrete. We describe how these two systems arise naturally from the gradient-based optimization process, and demonstrate the classification ability of the two systems, individually and in collaboration. This perspective on DNN classification offers a novel way to think about generalization, in which different subsets of the training data are used to train distinct classifiers; those classifiers are then combined to perform the classification task, and their consistency is crucial for accurate classification.

@inproceedings{236,
  author = {Marelie Davel and Marthinus Theunissen and Arnold Pretorius and Etienne Barnard},
  title = {DNNs as layers of cooperating classifiers},
  abstract = {A robust theoretical framework that can describe and predict the generalization ability of deep neural networks (DNNs) in general circumstances remains elusive. Classical attempts have produced complexity metrics that rely heavily on global measures of compactness and capacity with little investigation into the effects of sub-component collaboration. We demonstrate intriguing regularities in the activation patterns of the hidden nodes within fully-connected feedforward networks. By tracing the origin of these patterns, we show how such networks can be viewed as the combination of two information processing systems: one continuous and one discrete. We describe how these two systems arise naturally from the gradient-based optimization process, and demonstrate the classification ability of the two systems, individually and in collaboration. This perspective on DNN classification offers a novel way to think about generalization, in which different subsets of the training data are used to train distinct classifiers; those classifiers are then combined to perform the classification task, and their consistency is crucial for accurate classification.},
  year = {2020},
  booktitle = {The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)},
  pages = {3725 - 3732},
  month = {07/02-12/02/2020},
  address = {New York},
}
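
One way to get a feel for the "discrete versus continuous" view described above is to probe a hidden layer twice, once using its raw activation values and once using only its binary on/off pattern, and compare how much class information each retains. The sketch below does this with a k-NN probe on an untrained toy network; it is a simplified illustration of the general idea, not the evaluation procedure used in the paper, and all names, sizes and data are assumptions (in practice one would probe a trained network at various points in training).

    # Minimal sketch: compare a k-NN probe on continuous hidden activations with
    # one on the binary activation pattern of the same layer.
    import torch
    from sklearn.neighbors import KNeighborsClassifier

    torch.manual_seed(0)
    mlp = torch.nn.Sequential(torch.nn.Linear(20, 128), torch.nn.ReLU(),
                              torch.nn.Linear(128, 3))
    x = torch.randn(1500, 20)
    y = x[:, :3].argmax(dim=1)          # toy 3-class rule

    with torch.no_grad():
        h = torch.relu(mlp[0](x))       # continuous hidden activations
    binary = (h > 0).float()            # discrete on/off pattern

    for name, feats in [("continuous", h), ("binary", binary)]:
        probe = KNeighborsClassifier(n_neighbors=5)
        probe.fit(feats[:1000].numpy(), y[:1000].numpy())
        acc = probe.score(feats[1000:].numpy(), y[1000:].numpy())
        print(f"{name:10s} probe accuracy: {acc:.2f}")
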
Thirion JWF, Van Heerden CJ, Giwa O, Davel MH. The South African directory enquiries (SADE) name corpus. Language Resources & Evaluation. 2020;54(1). doi:10.1007/s10579-019-09448-6.

We present the design and development of a South African directory enquiries corpus. It contains audio and orthographic transcriptions of a wide range of South African names produced by first-language speakers of four languages, namely Afrikaans, English, isiZulu and Sesotho. Useful as a resource to understand the effect of name language and speaker language on pronunciation, this is the first corpus to also aim to identify the “intended language”: an implicit assumption with regard to word origin made by the speaker of the name. We describe the design, collection, annotation, and verification of the corpus. This includes an analysis of the algorithms used to tag the corpus with meta information that may be beneficial to pronunciation modelling tasks.

@article{280,
  author = {Jan Thirion and Charl Van Heerden and Oluwapelumi Giwa and Marelie Davel},
  title = {The South African directory enquiries (SADE) name corpus},
  abstract = {We present the design and development of a South African directory enquiries corpus. It contains audio and orthographic transcriptions of a wide range of South African names produced by first-language speakers of four languages, namely Afrikaans, English, isiZulu and Sesotho. Useful as a resource to understand the effect of name language and speaker language on pronunciation, this is the first corpus to also aim to identify the “intended language”: an implicit assumption with regard to word origin made by the speaker of the name. We describe the design, collection, annotation, and verification of the corpus. This includes an analysis of the algorithms used to tag the corpus with meta information that may be beneficial to pronunciation modelling tasks.},
  year = {2020},
  journal = {Language Resources & Evaluation},
  volume = {54},
  pages = {155-184},
  issue = {1},
  publisher = {Springer},
  address = {Cape Town, South Africa},
  doi = {10.1007/s10579-019-09448-6},
}

2019

Lotz S, Beukes JP, Davel MH. A neural network based method for input parameter selection (Poster). In: Machine Learning in Heliophysics. Amsterdam, The Netherlands; 2019.

No Abstract

@inproceedings{368,
  author = {Stefan Lotz and Jacques Beukes and Marelie Davel},
  title = {A neural network based method for input parameter selection (Poster)},
  abstract = {No Abstract},
  year = {2019},
  booktitle = {Machine Learning in Heliophysics},
  address = {Amsterdam, The Netherlands},
}
Theunissen MW, Davel MH, Barnard E. Insights regarding overfitting on noise in deep learning. In: South African Forum for Artificial Intelligence Research (FAIR). Cape Town, South Africa; 2019.

The understanding of generalization in machine learning is in a state of flux. This is partly due to the relatively recent revelation that deep learning models are able to completely memorize training data and still perform appropriately on out-of-sample data, thereby contradicting long-held intuitions about generalization. The phenomenon was brought to light and discussed in a seminal paper by Zhang et al. [24]. We expand upon this work by discussing local attributes of neural network training within the context of a relatively simple and generalizable framework. We describe how various types of noise can be compensated for within the proposed framework in order to allow the global deep learning model to generalize in spite of interpolating spurious function descriptors. Empirically, we support our postulates with experiments involving overparameterized multilayer perceptrons and controlled noise in the training data. The main insights are that deep learning models are optimized for training data modularly, with different regions in the function space dedicated to fitting distinct kinds of sample information. Detrimental overfitting is largely prevented by the fact that different regions in the function space are used for prediction based on the similarity between new input data and that which has been optimized for.

@inproceedings{284,
  author = {Marthinus Theunissen and Marelie Davel and Etienne Barnard},
  title = {Insights regarding overfitting on noise in deep learning},
  abstract = {The understanding of generalization in machine learning is in a state of flux. This is partly due to the relatively recent revelation that deep learning models are able to completely memorize training data and still perform appropriately on out-of-sample data, thereby contradicting long-held intuitions about generalization. The phenomenon was brought to light and discussed in a seminal paper by Zhang et al. [24]. We expand upon this work by discussing local attributes of neural network training within the context of a relatively simple and generalizable framework. We describe how various types of noise can be compensated for within the proposed framework in order to allow the global deep learning model to generalize in spite of interpolating spurious function descriptors. Empirically, we support our postulates with experiments involving overparameterized multilayer perceptrons and controlled noise in the training data. The main insights are that deep learning models are optimized for training data modularly, with different regions in the function space dedicated to fitting distinct kinds of sample information. Detrimental overfitting is largely prevented by the fact that different regions in the function space are used for prediction based on the similarity between new input data and that which has been optimized for.},
  year = {2019},
  booktitle = {South African Forum for Artificial Intelligence Research (FAIR)},
  pages = {49-63},
  address = {Cape Town, South Africa},
}
Pretorius AP, Barnard E, Davel MH. ReLU and sigmoidal activation functions. In: South African Forum for Artificial Intelligence Research (FAIR). Cape Town, South Africa: CEUR Workshop Proceedings; 2019.

The generalization capabilities of deep neural networks are not well understood, and in particular, the influence of activation functions on generalization has received little theoretical attention. Phenomena such as vanishing gradients, node saturation and network sparsity have been identified as possible factors when comparing different activation functions [1]. We investigate these factors using fully connected feedforward networks on two standard benchmark problems, and find that the most salient differences between networks with sigmoidal and ReLU activations relate to the way that class-distinctive information is propagated through a network.

@inproceedings{279,
  author = {Arnold Pretorius and Etienne Barnard and Marelie Davel},
  title = {ReLU and sigmoidal activation functions},
  abstract = {The generalization capabilities of deep neural networks are not well understood, and in particular, the influence of activation functions on generalization has received little theoretical attention. Phenomena such as vanishing gradients, node saturation and network sparsity have been identified as possible factors when comparing different activation functions [1]. We investigate these factors using fully connected feedforward networks on two standard benchmark problems, and find that the most salient differences between networks with sigmoidal and ReLU activations relate to the way that class-distinctive information is propagated through a network.},
  year = {2019},
  booktitle = {South African Forum for Artificial Intelligence Research (FAIR)},
  pages = {37-48},
  month = {04/12-07/12},
  publisher = {CEUR Workshop Proceedings},
  address = {Cape Town, South Africa},
}
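
The kind of comparison described above can be set up by training two otherwise-identical fully connected networks that differ only in their activation function. The digits dataset, layer sizes and training settings in the sketch below are placeholder assumptions, not the benchmarks or hyper-parameters used in the paper.

    # Minimal sketch: compare ReLU and sigmoid activations in otherwise-identical
    # fully connected networks on a standard benchmark.
    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    X, y = load_digits(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

    for activation in ("relu", "logistic"):   # "logistic" is scikit-learn's sigmoid
        clf = MLPClassifier(hidden_layer_sizes=(128, 128), activation=activation,
                            max_iter=500, random_state=0)
        clf.fit(X_tr, y_tr)
        print(f"{activation:8s} test accuracy: {clf.score(X_te, y_te):.3f}")
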