CAIR Deep Learning Research Publications

2022

Heymans W, Davel MH, Van Heerden CJ. Efficient acoustic feature transformation in mismatched environments using a Guided-GAN. Speech Communication. 2022;143. doi:https://doi.org/10.1016/j.specom.2022.07.002.

We propose a new framework to improve automatic speech recognition (ASR) systems in resource-scarce environments using a generative adversarial network (GAN) operating on acoustic input features. The GAN is used to enhance the features of mismatched data prior to decoding, or can optionally be used to fine-tune the acoustic model. We achieve improvements that are comparable to multi-style training (MTR), but at a lower computational cost. With less than one hour of data, an ASR system trained on good quality data, and evaluated on mismatched audio is improved by between 11.5% and 19.7% relative word error rate (WER). Experiments demonstrate that the framework can be very useful in under-resourced environments where training data and computational resources are limited. The GAN does not require parallel training data, because it utilises a baseline acoustic model to provide an additional loss term that guides the generator to create acoustic features that are better classified by the baseline.
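The relative WER improvements quoted above are computed against the baseline system's error rate. As an illustrative sketch only (the WER values below are made up for the example, not taken from the paper):

```python
def relative_wer_improvement(baseline_wer: float, new_wer: float) -> float:
    """Relative improvement: the fraction of the baseline WER that was removed."""
    return (baseline_wer - new_wer) / baseline_wer

# Hypothetical numbers: a baseline WER of 40% reduced to 32.1%
# corresponds to a 19.75% relative improvement.
improvement = relative_wer_improvement(0.40, 0.321)
```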

@article{492,
  author = {Walter Heymans and Marelie Davel and Charl Van Heerden},
  title = {Efficient acoustic feature transformation in mismatched environments using a Guided-GAN},
  abstract = {We propose a new framework to improve automatic speech recognition (ASR) systems in resource-scarce environments using a generative adversarial network (GAN) operating on acoustic input features. The GAN is used to enhance the features of mismatched data prior to decoding, or can optionally be used to fine-tune the acoustic model. We achieve improvements that are comparable to multi-style training (MTR), but at a lower computational cost. With less than one hour of data, an ASR system trained on good quality data, and evaluated on mismatched audio is improved by between 11.5% and 19.7% relative word error rate (WER). Experiments demonstrate that the framework can be very useful in under-resourced environments where training data and computational resources are limited. The GAN does not require parallel training data, because it utilises a baseline acoustic model to provide an additional loss term that guides the generator to create acoustic features that are better classified by the baseline.},
  year = {2022},
  journal = {Speech Communication},
  volume = {143},
  pages = {10 - 20},
  month = {09/2022},
  doi = {10.1016/j.specom.2022.07.002},

}
Oosthuizen AJ, Davel MH, Helberg A. Multi-Layer Perceptron for Channel State Information Estimation: Design Considerations. In: Southern Africa Telecommunication Networks and Applications Conference (SATNAC). Fancourt, George; 2022.

The accurate estimation of channel state information (CSI) is an important aspect of wireless communications. In this paper, a multi-layer perceptron (MLP) is developed as a CSI estimator in long-term evolution (LTE) transmission conditions. The representation of the CSI data is investigated in conjunction with batch normalisation and the representational ability of MLPs. It is found that discontinuities in the representational feature space can cripple an MLP’s ability to accurately predict CSI when noise is present. Different ways in which to mitigate this effect are analysed and a solution developed, initially in the context of channels that are only affected by additive white Gaussian noise. The developed architecture is then applied to more complex channels with various delay profiles and Doppler spread. The performance of the proposed MLP is shown to be comparable with LTE minimum mean squared error (MMSE), and to outperform least square (LS) estimation over a range of channel conditions.
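The feature-space discontinuity described above can be illustrated with a toy example (not the paper's actual data pipeline): two nearly identical complex channel coefficients that straddle the ±π phase boundary are close together in a real/imaginary representation, but far apart in a magnitude/phase representation — exactly the kind of jump an MLP struggles to fit.

```python
import cmath

# Two nearly identical channel coefficients just above and below the
# negative real axis, i.e. on either side of the +/- pi phase boundary.
h1 = cmath.rect(1.0, 3.10)   # phase just below +pi
h2 = cmath.rect(1.0, -3.10)  # phase just above -pi

# Real/imaginary representation: the two points are close together.
rect_dist = abs(h1 - h2)

# Magnitude/phase representation: the phase coordinate jumps by almost 2*pi.
phase_dist = abs(cmath.phase(h1) - cmath.phase(h2))
```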

@inproceedings{491,
  author = {Andrew Oosthuizen and Marelie Davel and Albert Helberg},
  title = {Multi-Layer Perceptron for Channel State Information Estimation: Design Considerations},
  abstract = {The accurate estimation of channel state information (CSI) is an important aspect of wireless communications. In this paper, a multi-layer perceptron (MLP) is developed as a CSI estimator in long-term evolution (LTE) transmission conditions. The representation of the CSI data is investigated in conjunction with batch normalisation and the representational ability of MLPs. It is found that discontinuities in the representational feature space can cripple an MLP’s ability to accurately predict CSI when noise is present. Different ways in which to mitigate this effect are analysed and a solution developed, initially in the context of channels that are only affected by additive white Gaussian noise. The developed architecture is then applied to more complex channels with various delay profiles and Doppler spread. The performance of the proposed MLP is shown to be comparable with LTE minimum mean squared error (MMSE), and to outperform least square (LS) estimation over a range of channel conditions.},
  year = {2022},
  journal = {Southern Africa Telecommunication Networks and Applications Conference (SATNAC)},
  pages = {94 - 99},
  month = {08/2022},
  address = {Fancourt, George},
}
Modipa T, Davel MH. Two Sepedi-English code-switched speech corpora. Language Resources and Evaluation. 2022;56. doi:https://doi.org/10.1007/s10579-022-09592-6. Read here: https://rdcu.be/cO6lD.

We report on the development of two reference corpora for the analysis of Sepedi-English code-switched speech in the context of automatic speech recognition. For the first corpus, possible English events were obtained from an existing corpus of transcribed Sepedi-English speech. The second corpus is based on the analysis of radio broadcasts: actual instances of code switching were transcribed and reproduced by a number of native Sepedi speakers. We describe the process to develop and verify both corpora and perform an initial analysis of the newly produced data sets. We find that, in naturally occurring speech, the frequency of code switching is unexpectedly high for this language pair, and that the continuum of code switching (from unmodified embedded words to loanwords absorbed into the matrix language) makes this a particularly challenging task for speech recognition systems.

@article{483,
  author = {Thipe Modipa and Marelie Davel},
  title = {Two Sepedi‑English code‑switched speech corpora},
  abstract = {We report on the development of two reference corpora for the analysis of Sepedi-English code-switched speech in the context of automatic speech recognition. For the first corpus, possible English events were obtained from an existing corpus of transcribed Sepedi-English speech. The second corpus is based on the analysis of radio broadcasts: actual instances of code switching were transcribed and reproduced by a number of native Sepedi speakers. We describe the process to develop and verify both corpora and perform an initial analysis of the newly produced data sets. We find that, in naturally occurring speech, the frequency of code switching is unexpectedly high for this language pair, and that the continuum of code switching (from unmodified embedded words to loanwords absorbed into the matrix language) makes this a particularly challenging task for speech recognition systems.},
  year = {2022},
  journal = {Language Resources and Evaluation},
  volume = {56},
  publisher = {Springer},
  address = {South Africa},
  url = {https://rdcu.be/cO6lD},
  doi = {10.1007/s10579-022-09592-6},
}
Heymans W, Davel MH, Van Heerden CJ. Multi-style Training for South African Call Centre Audio. Communications in Computer and Information Science. 2022;1551. doi:https://doi.org/10.1007/978-3-030-95070-5_8.

Mismatched data is a challenging problem for automatic speech recognition (ASR) systems. One of the most common techniques used to address mismatched data is multi-style training (MTR), a form of data augmentation that attempts to transform the training data to be more representative of the testing data; and to learn robust representations applicable to different conditions. This task can be very challenging if the test conditions are unknown. We explore the impact of different MTR styles on system performance when testing conditions are different from training conditions in the context of deep neural network hidden Markov model (DNN-HMM) ASR systems. A controlled environment is created using the LibriSpeech corpus, where we isolate the effect of different MTR styles on final system performance. We evaluate our findings on a South African call centre dataset that contains noisy, WAV49-encoded audio.
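Multi-style training augments clean training audio with distortions, such as additive noise at chosen signal-to-noise ratios. A minimal sketch of one such augmentation step, with made-up signals (the paper's actual MTR styles are richer than plain additive noise):

```python
import math, random

def add_noise_at_snr(signal, noise, snr_db):
    """Scale `noise` so that mixing it with `signal` yields the target SNR in dB."""
    p_signal = sum(s * s for s in signal) / len(signal)
    p_noise = sum(n * n for n in noise) / len(noise)
    scale = math.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(signal, noise)]

random.seed(0)
clean = [math.sin(0.1 * t) for t in range(1000)]   # stand-in for clean audio
noise = [random.gauss(0.0, 1.0) for _ in range(1000)]
noisy = add_noise_at_snr(clean, noise, snr_db=10.0)
```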

@article{480,
  author = {Walter Heymans and Marelie Davel and Charl Van Heerden},
  title = {Multi-style Training for South African Call Centre Audio},
  abstract = {Mismatched data is a challenging problem for automatic speech recognition (ASR) systems. One of the most common techniques used to address mismatched data is multi-style training (MTR), a form of data augmentation that attempts to transform the training data to be more representative of the testing data; and to learn robust representations applicable to different conditions. This task can be very challenging if the test conditions are unknown. We explore the impact of different MTR styles on system performance when testing conditions are different from training conditions in the context of deep neural network hidden Markov model (DNN-HMM) ASR systems. A controlled environment is created using the LibriSpeech corpus, where we isolate the effect of different MTR styles on final system performance. We evaluate our findings on a South African call centre dataset that contains noisy, WAV49-encoded audio.},
  year = {2022},
  journal = {Communications in Computer and Information Science},
  volume = {1551},
  pages = {111 - 124},
  publisher = {Southern African Conference for Artificial Intelligence Research},
  address = {South Africa},
  doi = {10.1007/978-3-030-95070-5_8},
}
Mouton C, Davel MH. Exploring layerwise decision making in DNNs. Communications in Computer and Information Science. 2022;1551. doi:https://doi.org/10.1007/978-3-030-95070-5_10.

While deep neural networks (DNNs) have become a standard architecture for many machine learning tasks, their internal decision-making process and general interpretability is still poorly understood. Conversely, common decision trees are easily interpretable and theoretically well understood. We show that by encoding the discrete sample activation values of nodes as a binary representation, we are able to extract a decision tree explaining the classification procedure of each layer in a ReLU-activated multilayer perceptron (MLP). We then combine these decision trees with existing feature attribution techniques in order to produce an interpretation of each layer of a model. Finally, we provide an analysis of the generated interpretations, the behaviour of the binary encodings and how these relate to sample groupings created during the training process of the neural network.
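The binary encoding described above can be sketched as follows: run a fully connected ReLU layer forward and record, per node, whether it is active. The layer below uses hand-picked weights for illustration, not a trained model:

```python
def relu_layer(x, weights, biases):
    """One fully connected ReLU layer: returns the activation vector."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]

def binary_code(activations):
    """Encode a layer's activations as the binary on/off pattern of its nodes."""
    return tuple(int(a > 0.0) for a in activations)

# Toy 2-input, 3-node layer with hand-picked weights.
W = [[1.0, -1.0], [0.5, 0.5], [-1.0, -1.0]]
b = [0.0, -0.6, 0.2]
sample = [0.8, 0.3]
code = binary_code(relu_layer(sample, W, b))  # the sample's discrete "address"
```

Codes like this, collected over a dataset, are the discrete representations from which a per-layer decision tree could be extracted.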

@article{479,
  author = {Coenraad Mouton and Marelie Davel},
  title = {Exploring layerwise decision making in DNNs},
  abstract = {While deep neural networks (DNNs) have become a standard architecture for many machine learning tasks, their internal decision-making process and general interpretability is still poorly understood. Conversely, common decision trees are easily interpretable and theoretically well understood. We show that by encoding the discrete sample activation values of nodes as a binary representation, we are able to extract a decision tree explaining the classification procedure of each layer in a ReLU-activated multilayer perceptron (MLP). We then combine these decision trees with existing feature attribution techniques in order to produce an interpretation of each layer of a model. Finally, we provide an analysis of the generated interpretations, the behaviour of the binary encodings and how these relate to sample groupings created during the training process of the neural network.},
  year = {2022},
  journal = {Communications in Computer and Information Science},
  volume = {1551},
  pages = {140 - 155},
  publisher = {Southern African Conference for Artificial Intelligence Research},
  doi = {10.1007/978-3-030-95070-5_10},
}

2021

Van Wyk L, Davel MH, Van Heerden CJ. Unsupervised fine-tuning of speaker diarisation pipelines using silhouette coefficients. In: Southern African Conference for Artificial Intelligence Research. South Africa; 2021. https://2021.sacair.org.za/proceedings/.

We investigate the use of silhouette coefficients in cluster analysis for speaker diarisation, with the dual purpose of unsupervised fine-tuning during domain adaptation and determining the number of speakers in an audio file. Our main contribution is to demonstrate the use of silhouette coefficients to perform per-file domain adaptation, which we show to deliver an improvement over per-corpus domain adaptation. Secondly, we show that this method of silhouette-based cluster analysis can be used to accurately determine more than one hyperparameter at the same time. Finally, we propose a novel method for calculating the silhouette coefficient of clusters using a PLDA score matrix as input.
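The silhouette coefficient of a sample compares its mean within-cluster distance a to its mean distance b to the nearest other cluster, s = (b - a) / max(a, b). A minimal version operating on a precomputed distance matrix, as a stand-in for the paper's PLDA-score-based variant (it assumes every cluster has at least two members):

```python
def silhouette(dist, labels):
    """Mean silhouette coefficient from a precomputed distance matrix."""
    clusters = set(labels)
    scores = []
    for i in range(len(labels)):
        same = [dist[i][j] for j in range(len(labels))
                if labels[j] == labels[i] and j != i]
        a = sum(same) / len(same)  # mean intra-cluster distance
        b = min(sum(dist[i][j] for j in range(len(labels)) if labels[j] == c)
                / labels.count(c)
                for c in clusters if c != labels[i])  # nearest other cluster
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Two tight, well-separated 1-D "speaker" clusters.
points = [0.0, 0.1, 5.0, 5.1]
labels = [0, 0, 1, 1]
dist = [[abs(p - q) for q in points] for p in points]
```

Sweeping a diarisation hyperparameter and keeping the value that maximises this score is the essence of the unsupervised per-file tuning described above.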

@inproceedings{482,
  author = {Lucas Van Wyk and Marelie Davel and Charl Van Heerden},
  title = {Unsupervised fine-tuning of speaker diarisation pipelines using silhouette coefficients},
  abstract = {We investigate the use of silhouette coefficients in cluster analysis for speaker diarisation, with the dual purpose of unsupervised fine-tuning during domain adaptation and determining the number of speakers in an audio file. Our main contribution is to demonstrate the use of silhouette coefficients to perform per-file domain adaptation, which we show to deliver an improvement over per-corpus domain adaptation. Secondly, we show that this method of silhouette-based cluster analysis can be used to accurately determine more than one hyperparameter at the same time. Finally, we propose a novel method for calculating the silhouette coefficient of clusters using a PLDA score matrix as input.},
  year = {2021},
  journal = {Southern African Conference for Artificial Intelligence Research},
  pages = {202 - 216},
  month = {06/12/2021 - 10/12/2021},
  address = {South Africa},
  isbn = {978-0-620-94410-6},
  url = {https://2021.sacair.org.za/proceedings/},
}
Oosthuizen AJ, Davel MH, Helberg A. Exploring CNN-based automatic modulation classification using small modulation sets. In: Southern Africa Telecommunication Networks and Applications Conference. South Africa; 2021. https://www.satnac.org.za/proceedings.

We investigate the effect of a reduced modulation scheme pool on a CNN-based automatic modulation classifier. Similar classifiers in literature are typically used to classify sets of five or more different modulation types [1] [2], whereas our analysis is of a CNN classifier that classifies between two modulation types, 16-QAM and 8-PSK, only. While implementing the network, we observe that the network’s classification accuracy improves for lower SNR instead of reducing as expected. This analysis exposes characteristics of such classifiers that can be used to improve CNN classifiers on larger sets of modulation types. We show that presenting the SNR data as an extra data point to the network can significantly increase classification accuracy.

@inproceedings{481,
  author = {Andrew Oosthuizen and Marelie Davel and Albert Helberg},
  title = {Exploring CNN-based automatic modulation classification using small modulation sets},
  abstract = {We investigate the effect of a reduced modulation scheme pool on a CNN-based automatic modulation classifier. Similar classifiers in literature are typically used to classify sets of five or more different modulation types [1] [2], whereas our analysis is of a CNN classifier that classifies between two modulation types, 16-QAM and 8-PSK, only. While implementing the network, we observe that the network’s classification accuracy improves for lower SNR instead of reducing as expected. This analysis exposes characteristics of such classifiers that can be used to improve CNN classifiers on larger sets of modulation types. We show that presenting the SNR data as an extra data point to the network can significantly increase classification accuracy.},
  year = {2021},
  journal = {Southern Africa Telecommunication Networks and Applications Conference},
  pages = {20 - 24},
  month = {21/11/2021 - 23/11/2021},
  address = {South Africa},
  url = {https://www.satnac.org.za/proceedings},
}

2020

Lamprecht DB, Barnard E. Using a meta-model to compensate for training-evaluation mismatches. In: Southern African Conference for Artificial Intelligence Research. South Africa; 2020. https://sacair.org.za/proceedings/.

One of the fundamental assumptions of machine learning is that learnt models are applied to data that is identically distributed to the training data. This assumption is often not realistic: for example, data collected from a single source at different times may not be distributed identically, due to sampling bias or changes in the environment. We propose a new architecture called a meta-model which predicts performance for unseen models. This approach is applicable when several ‘proxy’ datasets are available to train a model to be deployed on a ‘target’ test set; the architecture is used to identify which regression algorithms should be used as well as which datasets are most useful to train for a given target dataset. Finally, we demonstrate the strengths and weaknesses of the proposed meta-model by making use of artificially generated datasets using a variation of the Friedman method 3 used to generate artificial regression datasets, and discuss real-world applications of our approach.

@inproceedings{404,
  author = {Dylan Lamprecht and Etienne Barnard},
  title = {Using a meta-model to compensate for training-evaluation mismatches},
  abstract = {One of the fundamental assumptions of machine learning is that learnt models are applied to data that is identically distributed to the training data. This assumption is often not realistic: for example, data collected from a single source at different times may not be distributed identically, due to sampling bias or changes in the environment. We propose a new architecture called a meta-model which predicts performance for unseen models. This approach is applicable when several ‘proxy’ datasets are available to train a model to be deployed on a ‘target’ test set; the architecture is used to identify which regression algorithms should be used as well as which datasets are most useful to train for a given target dataset. Finally, we demonstrate the strengths and weaknesses of the proposed meta-model by making use of artificially generated datasets using a variation of the Friedman method 3 used to generate artificial regression datasets, and discuss real-world applications of our approach.},
  year = {2020},
  journal = {Southern African Conference for Artificial Intelligence Research},
  pages = {321-334},
  month = {22/02/2021 - 26/02/2021},
  address = {South Africa},
  isbn = {978-0-620-89373-2},
  url = {https://sacair.org.za/proceedings/},
}
Heyns N, Barnard E. Optimising word embeddings for recognised multilingual speech. In: Southern African Conference for Artificial Intelligence Research. South Africa; 2020. https://sacair.org.za/proceedings/.

Word embeddings are widely used in natural language processing (NLP) tasks. Most work on word embeddings focuses on monolingual languages with large available datasets. For embeddings to be useful in a multilingual environment, as in South Africa, the training techniques have to be adjusted to cater for a) multiple languages, b) smaller datasets and c) the occurrence of code-switching. One of the biggest roadblocks is to obtain datasets that include examples of natural code-switching, since code switching is generally avoided in written material. A solution to this problem is to use speech recognised data. Embedding packages like Word2Vec and GloVe have default hyper-parameter settings that are usually optimised for training on large datasets and evaluation on analogy tasks. When using embeddings for problems such as text classification in our multilingual environment, the hyper-parameters have to be optimised for the specific data and task. We investigate the importance of optimising relevant hyper-parameters for training word embeddings with speech recognised data, where code-switching occurs, and evaluate against the real-world problem of classifying radio and television recordings with code switching. We compare these models with a bag of words baseline model as well as a pre-trained GloVe model.

@inproceedings{403,
  author = {Nuette Heyns and Etienne Barnard},
  title = {Optimising word embeddings for recognised multilingual speech},
  abstract = {Word embeddings are widely used in natural language processing (NLP) tasks. Most work on word embeddings focuses on monolingual languages with large available datasets. For embeddings to be useful in a multilingual environment, as in South Africa, the training techniques have to be adjusted to cater for a) multiple languages, b) smaller datasets and c) the occurrence of code-switching. One of the biggest roadblocks is to obtain datasets that include examples of natural code-switching, since code switching is generally avoided in written material. A solution to this problem is to use speech recognised data. Embedding packages like Word2Vec and GloVe have default hyper-parameter settings that are usually optimised for training on large datasets and evaluation on analogy tasks. When using embeddings for problems such as text classification in our multilingual environment, the hyper-parameters have to be optimised for the specific data and task. We investigate the importance of optimising relevant hyper-parameters for training word embeddings with speech recognised data, where code-switching occurs, and evaluate against the real-world problem of classifying radio and television recordings with code switching. We compare these models with a bag of words baseline model as well as a pre-trained GloVe model.},
  year = {2020},
  journal = {Southern African Conference for Artificial Intelligence Research},
  pages = {102-116},
  month = {22/02/2021 - 26/02/2021},
  address = {South Africa},
  isbn = {978-0-620-89373-2},
  url = {https://sacair.org.za/proceedings/},
}
Haasbroek DG, Davel MH. Exploring neural network training dynamics through binary node activations. In: Southern African Conference for Artificial Intelligence Research. South Africa; 2020. https://sacair.org.za/proceedings/.

Each node in a neural network is trained to activate for a specific region in the input domain. Any training samples that fall within this domain are therefore implicitly clustered together. Recent work has highlighted the importance of these clusters during the training process but has not yet investigated their evolution during training. Towards this goal, we train several ReLU-activated MLPs on a simple classification task (MNIST) and show that a consistent training process emerges: (1) sample clusters initially increase in size and then decrease as training progresses, (2) the size of sample clusters in the first layer decreases more rapidly than in deeper layers, (3) binary node activations, especially of nodes in deeper layers, become more sensitive to class membership as training progresses, (4) individual nodes remain poor predictors of class membership, even if accurate when applied as a group. We report on the detail of these findings and interpret them from the perspective of a high-dimensional clustering process.
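The sample clusters tracked above are simply groups of samples that share a layer's binary activation pattern; their sizes can be monitored during training with a counter. An illustrative sketch with made-up patterns, not MNIST activations:

```python
from collections import Counter

def cluster_sizes(patterns):
    """Sizes of the sample clusters induced by shared binary activation patterns."""
    return Counter(patterns)

# Hypothetical binary activation patterns of five samples at one layer.
layer_patterns = [(1, 0, 1), (1, 0, 1), (0, 1, 1), (1, 0, 1), (0, 1, 1)]
sizes = cluster_sizes(layer_patterns)
```

Recomputing these counts per layer and per epoch gives the cluster-size trajectories analysed in the paper.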

@inproceedings{402,
  author = {Daniël Haasbroek and Marelie Davel},
  title = {Exploring neural network training dynamics through binary node activations},
  abstract = {Each node in a neural network is trained to activate for a specific region in the input domain. Any training samples that fall within this domain are therefore implicitly clustered together. Recent work has highlighted the importance of these clusters during the training process but has not yet investigated their evolution during training. Towards this goal, we train several ReLU-activated MLPs on a simple classification task (MNIST) and show that a consistent training process emerges: (1) sample clusters initially increase in size and then decrease as training progresses, (2) the size of sample clusters in the first layer decreases more rapidly than in deeper layers, (3) binary node activations, especially of nodes in deeper layers, become more sensitive to class membership as training progresses, (4) individual nodes remain poor predictors of class membership, even if accurate when applied as a group. We report on the detail of these findings and interpret them from the perspective of a high-dimensional clustering process.},
  year = {2020},
  journal = {Southern African Conference for Artificial Intelligence Research},
  pages = {304-320},
  month = {22/02/2021 - 26/02/2021},
  address = {South Africa},
  isbn = {978-0-620-89373-2},
  url = {https://sacair.org.za/proceedings/},
}
Venter AEW, Theunissen MW, Davel MH. Pre-interpolation loss behaviour in neural networks. Communications in Computer and Information Science. 2020;1342. doi:https://doi.org/10.1007/978-3-030-66151-9_19.

When training neural networks as classifiers, it is common to observe an increase in average test loss while still maintaining or improving the overall classification accuracy on the same dataset. In spite of the ubiquity of this phenomenon, it has not been well studied and is often dismissively attributed to an increase in borderline correct classifications. We present an empirical investigation that shows how this phenomenon is actually a result of the differential manner by which test samples are processed. In essence: test loss does not increase overall, but only for a small minority of samples. Large representational capacities allow losses to decrease for the vast majority of test samples at the cost of extreme increases for others. This effect seems to be mainly caused by increased parameter values relating to the correctly processed sample features. Our findings contribute to the practical understanding of a common behaviour of deep neural networks. We also discuss the implications of this work for network optimisation and generalisation.
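The effect described above — mean test loss rising while accuracy is unchanged, driven by a small minority of samples — can be reproduced with made-up per-sample probabilities (illustrative numbers only):

```python
import math

def mean_ce_loss(correct_probs):
    """Mean cross-entropy loss from each sample's probability of its true class."""
    return sum(-math.log(p) for p in correct_probs) / len(correct_probs)

def accuracy(correct_probs):
    """A sample counts as correct if p(true class) exceeds 0.5."""
    return sum(p > 0.5 for p in correct_probs) / len(correct_probs)

# Hypothetical snapshots at two points in training: nine samples improve,
# while one already-misclassified sample collapses toward zero probability.
early = [0.7] * 9 + [0.30]
late = [0.9] * 9 + [0.01]
```

Accuracy stays at 90% in both snapshots, yet the mean loss increases, because the minority sample's loss grows faster than the majority's losses shrink.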

@article{484,
  author = {Arthur Venter and Marthinus Theunissen and Marelie Davel},
  title = {Pre-interpolation loss behaviour in neural networks},
  abstract = {When training neural networks as classifiers, it is common to observe an increase in average test loss while still maintaining or improving the overall classification accuracy on the same dataset. In spite of the ubiquity of this phenomenon, it has not been well studied and is often dismissively attributed to an increase in borderline correct classifications. We present an empirical investigation that shows how this phenomenon is actually a result of the differential manner by which test samples are processed. In essence: test loss does not increase overall, but only for a small minority of samples. Large representational capacities allow losses to decrease for the vast majority of test samples at the cost of extreme increases for others. This effect seems to be mainly caused by increased parameter values relating to the correctly processed sample features. Our findings contribute to the practical understanding of a common behaviour of deep neural networks. We also discuss the implications of this work for network optimisation and generalisation.},
  year = {2020},
  journal = {Communications in Computer and Information Science},
  volume = {1342},
  pages = {296-309},
  publisher = {Southern African Conference for Artificial Intelligence Research},
  address = {South Africa},
  isbn = {978-3-030-66151-9},
  doi = {10.1007/978-3-030-66151-9_19},
}
Myburgh JC, Mouton C, Davel MH. Tracking translation invariance in CNNs. Communications in Computer and Information Science. 2020;1342. doi:https://doi.org/10.1007/978-3-030-66151-9_18.

Although Convolutional Neural Networks (CNNs) are widely used, their translation invariance (ability to deal with translated inputs) is still subject to some controversy. We explore this question using translation-sensitivity maps to quantify how sensitive a standard CNN is to a translated input. We propose the use of cosine similarity as sensitivity metric over Euclidean distance, and discuss the importance of restricting the dimensionality of either of these metrics when comparing architectures. Our main focus is to investigate the effect of different architectural components of a standard CNN on that network’s sensitivity to translation. By varying convolutional kernel sizes and amounts of zero padding, we control the size of the feature maps produced, allowing us to quantify the extent to which these elements influence translation invariance. We also measure translation invariance at different locations within the CNN to determine the extent to which convolutional and fully connected layers, respectively, contribute to the translation invariance of a CNN as a whole. Our analysis indicates that both convolutional kernel size and feature map size have a systematic influence on translation invariance. We also see that convolutional layers contribute less than expected to translation invariance, when not specifically forced to do so.
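Translation sensitivity can be probed by comparing a representation of an input with that of a shifted copy. The paper does this over CNN feature maps; the sketch below uses raw 1-D signals as a stand-in for those features, with the cosine-similarity metric the paper advocates:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def shift(x, k):
    """Circularly translate a 1-D signal by k positions."""
    return x[-k:] + x[:-k]

signal = [0.0, 1.0, 4.0, 9.0, 4.0, 1.0, 0.0, 0.0]
# Sensitivity curve: similarity between the signal and each shifted copy.
curve = [cosine_similarity(signal, shift(signal, k)) for k in range(len(signal))]
```

A perfectly translation-invariant representation would yield a flat curve at 1.0; how quickly the curve drops quantifies the sensitivity.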

@article{485,
  author = {Johannes Myburgh and Coenraad Mouton and Marelie Davel},
  title = {Tracking translation invariance in CNNs},
  abstract = {Although Convolutional Neural Networks (CNNs) are widely used, their translation invariance (ability to deal with translated inputs) is still subject to some controversy. We explore this question using translation-sensitivity maps to quantify how sensitive a standard CNN is to a translated input. We propose the use of cosine similarity as sensitivity metric over Euclidean distance, and discuss the importance of restricting the dimensionality of either of these metrics when comparing architectures. Our main focus is to investigate the effect of different architectural components of a standard CNN on that network’s sensitivity to translation. By varying convolutional kernel sizes and amounts of zero padding, we control the size of the feature maps produced, allowing us to quantify the extent to which these elements influence translation invariance. We also measure translation invariance at different locations within the CNN to determine the extent to which convolutional and fully connected layers, respectively, contribute to the translation invariance of a CNN as a whole. Our analysis indicates that both convolutional kernel size and feature map size have a systematic influence on translation invariance. We also see that convolutional layers contribute less than expected to translation invariance, when not specifically forced to do so.},
  year = {2020},
  journal = {Communications in Computer and Information Science},
  volume = {1342},
  pages = {282-295},
  publisher = {Southern African Conference for Artificial Intelligence Research},
  isbn = {978-3-030-66151-9},
  doi = {10.1007/978-3-030-66151-9_18},
}
Mouton C, Myburgh JC, Davel MH. Stride and translation invariance in CNNs. Communications in Computer and Information Science. 2020;1342. doi:https://doi.org/10.1007/978-3-030-66151-9_17.

Convolutional Neural Networks have become the standard for image classification tasks, however, these architectures are not invariant to translations of the input image. This lack of invariance is attributed to the use of stride which subsamples the input, resulting in a loss of information, and fully connected layers which lack spatial reasoning. We show that stride can greatly benefit translation invariance given that it is combined with sufficient similarity between neighbouring pixels, a characteristic which we refer to as local homogeneity. We also observe that this characteristic is dataset-specific and dictates the relationship between pooling kernel size and stride required for translation invariance. Furthermore we find that a trade-off exists between generalization and translation invariance in the case of pooling kernel size, as larger kernel sizes lead to better invariance but poorer generalization. Finally we explore the efficacy of other solutions proposed, namely global average pooling, anti-aliasing, and data augmentation, both empirically and through the lens of local homogeneity.
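The interaction between stride and local homogeneity can be seen in a toy subsampling experiment (illustrative only, not the paper's setup): stride-2 subsampling of a one-sample-shifted input barely changes when neighbouring samples are similar, but changes completely when they are not.

```python
def subsample(x, stride=2):
    """Strided subsampling, as performed by strided convolution or pooling."""
    return x[::stride]

def max_abs_diff(u, v):
    return max(abs(a - b) for a, b in zip(u, v))

smooth = [0.1 * t for t in range(16)]            # locally homogeneous signal
alternating = [float(t % 2) for t in range(16)]  # neighbours maximally dissimilar

# Shift each signal by one sample, then compare the stride-2 subsamplings.
smooth_change = max_abs_diff(subsample(smooth), subsample([0.0] + smooth[:-1]))
alt_change = max_abs_diff(subsample(alternating), subsample([0.0] + alternating[:-1]))
```

The homogeneous signal's subsampling moves only slightly under translation, while the alternating signal's subsampling flips entirely, mirroring the dataset-dependence noted above.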

@article{486,
  author = {Coenraad Mouton and Johannes Myburgh and Marelie Davel},
  title = {Stride and translation invariance in CNNs},
  abstract = {Convolutional Neural Networks have become the standard for image classification tasks, however, these architectures are not invariant to translations of the input image. This lack of invariance is attributed to the use of stride which subsamples the input, resulting in a loss of information, and fully connected layers which lack spatial reasoning. We show that stride can greatly benefit translation invariance given that it is combined with sufficient similarity between neighbouring pixels, a characteristic which we refer to as local homogeneity. We also observe that this characteristic is dataset-specific and dictates the relationship between pooling kernel size and stride required for translation invariance. Furthermore we find that a trade-off exists between generalization and translation invariance in the case of pooling kernel size, as larger kernel sizes lead to better invariance but poorer generalization. Finally we explore the efficacy of other solutions proposed, namely global average pooling, anti-aliasing, and data augmentation, both empirically and through the lens of local homogeneity.},
  year = {2020},
  journal = {Communications in Computer and Information Science},
  volume = {1342},
  pages = {267-281},
  publisher = {Southern African Conference for Artificial Intelligence Research},
  address = {South Africa},
  isbn = {978-3-030-66151-9},
  doi = {https://doi.org/10.1007/978-3-030-66151-9_17},
}
Strydom RA, Barnard E. Classifying recognised speech with deep neural networks. In: Southern African Conference for Artificial Intelligence Research. South Africa: Southern African Conference for Artificial Intelligence Research; 2020.

We investigate whether word embeddings using deep neural networks can assist in the analysis of text produced by a speech-recognition system. In particular, we develop algorithms to identify which words are incorrectly detected by a speech-recognition system in broadcast news. The multilingual corpus used in this investigation contains speech from the eleven official South African languages, as well as Hindi. Popular word embedding algorithms such as Word2Vec and fastText are investigated and compared with context-specific embedding representations such as Doc2Vec and non-context specific statistical sentence embedding methods such as term frequency-inverse document frequency (TFIDF), which is used as our baseline method. These various embedding methods are then used as fixed-length input representations for a logistic regression and feed forward neural network classifier. The output is used as an additional categorical input feature to a CatBoost classifier to determine whether the words were correctly recognised. Other methods are also investigated, including a method that uses the word embedding itself and cosine similarity between specific keywords to identify whether a specific keyword was correctly detected. When relying only on the speech-text data, the best result was obtained using the TFIDF document embeddings as input features to a feed forward neural network. Adding the output from the feed forward neural network as an additional feature to the CatBoost classifier did not enhance the classifier’s performance compared to using the non-textual information provided, although adding the output from a weaker classifier was somewhat beneficial.
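The TFIDF baseline representation mentioned in the abstract can be built from scratch in a few lines. This is a hedged from-scratch sketch, not the authors' pipeline: the toy corpus and the `tfidf_vectors` helper are made up for illustration, and in practice a library vectoriser would be used before feeding the fixed-length vectors to a classifier.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Fixed-length TF-IDF representations for a small corpus.
    Each document is a list of tokens (e.g. one recognised utterance)."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n = len(docs)
    # Smoothed inverse document frequency per vocabulary term.
    df = {t: sum(t in doc for doc in docs) for t in vocab}
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        # Term frequency scaled by IDF, in a fixed vocabulary order.
        vectors.append([counts[t] / total * idf[t] for t in vocab])
    return vocab, vectors

# Toy multilingual "recognised speech" corpus (illustrative only).
docs = [
    "the president spoke today".split(),
    "the match was played today".split(),
    "die president het gepraat".split(),
]
vocab, vecs = tfidf_vectors(docs)
# Rare terms receive higher weights than terms shared across documents.
```

Every document maps to a vector of the same length (the vocabulary size), which is what allows these representations to feed a logistic regression or feed forward network with a fixed input dimension.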

@inproceedings{398,
  author = {Rhyno Strydom and Etienne Barnard},
  title = {Classifying recognised speech with deep neural networks},
  abstract = {We investigate whether word embeddings using deep neural networks can assist in the analysis of text produced by a speech-recognition system. In particular, we develop algorithms to identify which words are incorrectly detected by a speech-recognition system in broadcast news. The multilingual corpus used in this investigation contains speech from the eleven official South African languages, as well as Hindi. Popular word embedding algorithms such as Word2Vec and fastText are investigated and compared with context-specific embedding representations such as Doc2Vec and non-context specific statistical sentence embedding methods such as term frequency-inverse document frequency (TFIDF), which is used as our baseline method. These various embedding methods are then used as fixed-length input representations for a logistic regression and feed forward neural network classifier. The output is used as an additional categorical input feature to a CatBoost classifier to determine whether the words were correctly recognised. Other methods are also investigated, including a method that uses the word embedding itself and cosine similarity between specific keywords to identify whether a specific keyword was correctly detected. When relying only on the speech-text data, the best result was obtained using the TFIDF document embeddings as input features to a feed forward neural network. Adding the output from the feed forward neural network as an additional feature to the CatBoost classifier did not enhance the classifier’s performance compared to using the non-textual information provided, although adding the output from a weaker classifier was somewhat beneficial.},
  year = {2020},
  booktitle = {Southern African Conference for Artificial Intelligence Research},
  pages = {191-205},
  month = {22/02/2021 - 26/02/2021},
  publisher = {Southern African Conference for Artificial Intelligence Research},
  address = {South Africa},
  isbn = {978-0-620-89373-2},
}
Theunissen MW, Davel MH, Barnard E. Benign interpolation of noise in deep learning. South African Computer Journal. 2020;32(2). doi:https://doi.org/10.18489/sacj.v32i2.833.

The understanding of generalisation in machine learning is in a state of flux, in part due to the ability of deep learning models to interpolate noisy training data and still perform appropriately on out-of-sample data, thereby contradicting long-held intuitions about the bias-variance trade-off in learning. We expand upon relevant existing work by discussing local attributes of neural network training within the context of a relatively simple framework. We describe how various types of noise can be compensated for within the proposed framework in order to allow the deep learning model to generalise in spite of interpolating spurious function descriptors. Empirically, we support our postulates with experiments involving overparameterised multilayer perceptrons and controlled training data noise. The main insights are that deep learning models are optimised for training data modularly, with different regions in the function space dedicated to fitting distinct types of sample information. Additionally, we show that models tend to fit uncorrupted samples first. Based on this finding, we propose a conjecture to explain an observed instance of the epoch-wise double-descent phenomenon. Our findings suggest that the notion of model capacity needs to be modified to consider the distributed way training data is fitted across sub-units.
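The controlled label noise that such experiments rely on can be injected with a short helper like the one below. This is a hypothetical sketch, not the paper's code: `corrupt_labels`, its arguments, and the toy data are all illustrative. A fixed fraction of labels is replaced with a different, uniformly chosen class, and the flipped indices are returned so that corrupted and clean samples can be tracked separately during training.

```python
import numpy as np

def corrupt_labels(labels, noise_fraction, num_classes, seed=0):
    """Flip a fraction of labels to a different, uniformly chosen class,
    simulating controlled label noise in the training data."""
    rng = np.random.default_rng(seed)
    noisy = labels.copy()
    n = len(noisy)
    flipped = rng.choice(n, size=int(noise_fraction * n), replace=False)
    for i in flipped:
        # Draw uniformly from every class except the true one.
        wrong = [c for c in range(num_classes) if c != noisy[i]]
        noisy[i] = rng.choice(wrong)
    return noisy, flipped

y = np.zeros(100, dtype=int)          # toy dataset: 100 samples of class 0
y_noisy, flipped = corrupt_labels(y, 0.2, num_classes=10)
# Exactly 20% of the labels now disagree with the originals.
```

Keeping the flipped index set makes it possible to measure, per epoch, how quickly the clean versus the corrupted samples are fitted, which is how the "uncorrupted samples are fitted first" observation can be probed.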

@article{394,
  author = {Marthinus Theunissen and Marelie Davel and Etienne Barnard},
  title = {Benign interpolation of noise in deep learning},
  abstract = {The understanding of generalisation in machine learning is in a state of flux, in part due to the ability of deep learning models to interpolate noisy training data and still perform appropriately on out-of-sample data, thereby contradicting long-held intuitions about the bias-variance trade-off in learning. We expand upon relevant existing work by discussing local attributes of neural network training within the context of a relatively simple framework. We describe how various types of noise can be compensated for within the proposed framework in order to allow the deep learning model to generalise in spite of interpolating spurious function descriptors. Empirically, we support our postulates with experiments involving overparameterised multilayer perceptrons and controlled training data noise. The main insights are that deep learning models are optimised for training data modularly, with different regions in the function space dedicated to fitting distinct types of sample information. Additionally, we show that models tend to fit uncorrupted samples first. Based on this finding, we propose a conjecture to explain an observed instance of the epoch-wise double-descent phenomenon. Our findings suggest that the notion of model capacity needs to be modified to consider the distributed way training data is fitted across sub-units.},
  year = {2020},
  journal = {South African Computer Journal},
  volume = {32},
  pages = {80-101},
  issue = {2},
  publisher = {South African Institute of Computer Scientists and Information Technologists},
  issn = {1015-7999 (print); 2313-7835 (online)},
  doi = {https://doi.org/10.18489/sacj.v32i2.833},
}