Brain Computer Interface applications, developed for both healthy and clinical populations, critically depend on decoding brain activity in single trials.
Recent advances in neuroscience have led to an emerging interest in Brain Computer Interface (BCI) applications for both disabled and healthy populations. These applications critically depend on online decoding of brain activity in response to single events (trials), as opposed to delineation of the average response frequently studied in basic research. Electroencephalography (EEG), a noninvasive recording technique, is one of the most commonly used systems for monitoring brain activity. EEG data are collected simultaneously from a multitude of channels at a high temporal resolution, yielding high-dimensional data matrices that represent single-trial brain activity. In addition to its unsurpassed temporal resolution, EEG is noninvasive, wearable, and more affordable than other neuroimaging techniques, and is thus a prime choice for any type of practical BCI. The two other technologies used for decoding brain activity, namely functional MRI and MEG, require cumbersome, expensive, and non-mobile instrumentation; although they remain highly valuable research tools, they are unlikely to be useful for routine BCI use. Traditionally, EEG data have been averaged over trials to characterize task-related brain responses despite the ongoing, task-independent “noise” present in single-trial data. However, to allow flexible real-time feedback or interaction, task-related brain responses need to be identified in single trials and categorized into the associated brain states. Most classification methods use machine-learning algorithms to classify single-trial spatio-temporal activity matrices based on statistical properties of those matrices. These methods are based on two main components: a feature extraction mechanism for effective dimensionality reduction, and a classification algorithm.
Typical classifiers use sample data to learn a mapping rule by which other test data can be classified into one of two or more categories. Classifiers can be roughly divided into linear and non-linear methods. Non-linear classifiers, such as Neural Networks, Hidden Markov Models and k-nearest neighbors, can approximate a wide range of functions, allowing discrimination of complex data structures. While non-linear classifiers have the potential to capture complex discriminative functions, their complexity can also cause overfitting and carry heavy computational demands, making them less suitable for real-time applications.
Linear classifiers, on the other hand, are less complex and are thus more robust to overfitting. Naturally, linear classifiers perform particularly well on data that can be linearly separated. Fisher Linear Discriminant (FLD), linear Support Vector Machine (SVM) and Logistic Regression (LR) are popular examples. FLD finds a linear combination of features that maps the data of two classes onto a separable projection axis. The criterion for separation is defined as the ratio of the distance between the class means to the variance within the classes. SVM finds a separating hyperplane that maximizes the margin between the two classes. LR, as its name suggests, passes a linear combination of the features through a logistic function to estimate class membership probabilities. All linear classifiers offer fast solutions for data discrimination, and are thus most commonly applied in classification algorithms used for real-time BCI applications.
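As a rough illustration of the FLD criterion described above, the following Python sketch computes the projection direction w = S_W^{-1}(m_a - m_b) for two classes, where S_W is the pooled within-class scatter matrix and m_a, m_b are the class means. It is a toy under stated assumptions rather than the implementation of any particular BCI system: the data are invented, each sample is assumed to have exactly two features so that S_W can be inverted in closed form, and the function names (fld_direction, project) are hypothetical.

```python
def mean_vector(rows):
    """Per-feature mean of a list of equally sized samples."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]


def scatter_2d(rows, mean):
    """2x2 scatter matrix of one class around its own mean."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - mean[0], r[1] - mean[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def fld_direction(class_a, class_b):
    """Fisher direction w = S_W^{-1} (m_a - m_b) for two-feature data (assumption)."""
    m_a, m_b = mean_vector(class_a), mean_vector(class_b)
    s_a, s_b = scatter_2d(class_a, m_a), scatter_2d(class_b, m_b)
    # Pooled within-class scatter: the "variance within the classes".
    s_w = [[s_a[i][j] + s_b[i][j] for j in range(2)] for i in range(2)]
    det = s_w[0][0] * s_w[1][1] - s_w[0][1] * s_w[1][0]
    if det == 0:
        raise ValueError("within-class scatter matrix is singular")
    inv = [[s_w[1][1] / det, -s_w[0][1] / det],
           [-s_w[1][0] / det, s_w[0][0] / det]]
    # Difference of class means: the "distance between the class means".
    diff = [m_a[0] - m_b[0], m_a[1] - m_b[1]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]


def project(rows, w):
    """Project each sample onto the one-dimensional axis defined by w."""
    return [r[0] * w[0] + r[1] * w[1] for r in rows]


# Invented toy features for two classes (not real EEG data).
targets = [(2.0, 1.0), (3.0, 1.5), (2.5, 0.5), (3.5, 1.0)]
non_targets = [(0.0, 1.0), (1.0, 1.5), (0.5, 0.5), (1.5, 1.0)]

w = fld_direction(targets, non_targets)
target_scores = project(targets, w)
non_target_scores = project(non_targets, w)
```

On this toy data the projected target scores all lie above the projected non-target scores, so a single threshold on the projection axis separates the two classes. With real feature vectors of higher dimension, a general linear solver would replace the closed-form 2x2 inverse.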
Whether linear or non-linear, most classifiers require a prior stage of feature extraction. Selecting these features has become a crucial issue, as one of the main challenges in deciphering brain activity from single-trial data matrices is the high-dimensional space in which they are embedded, and the relatively small sample sizes the classifiers can rely on in their learning stage. Feature extraction is in essence a dimensionality reduction procedure that maps the original data onto a lower-dimensional space. A successful feature extraction procedure will pull out task-relevant information and attenuate irrelevant information. Some feature extraction approaches use prior knowledge, such as specific frequency bands relevant to the experiment or brain locations most likely to be involved in the specific classification problem. For instance, the literature has consistently implicated parietal scalp regions in target detection paradigms, as a specific target-related response at parietal regions, known as the P300 wave, has been repeatedly observed approximately 300-500 ms post-stimulus. Such prior-knowledge-based algorithms, in particular P300-based systems, are commonly used for a variety of BCI applications. In contrast, other methods construct an automatic process to pull out relevant features based on supervised or unsupervised learning from training data sets. Some approaches for automatic feature extraction include Common Spatial Patterns (CSP), autoregressive models (AR) and Principal Component Analysis (PCA). CSP extracts spatial weights to discriminate between two classes, by maximizing the variance of one class while minimizing the variance of the second class. AR instead focuses on temporal, rather than spatial, correlations in a signal that may contain discriminative information. Discriminative AR coefficients can be selected using a linear classifier. Other methods search for spectral features to be used for classification.
PCA is used for unsupervised feature extraction, by mapping the data onto a new, uncorrelated space where the axes are ordered by the variance of the projected data samples along the axes, and only those axes reflecting most of the variance are maintained. The result is a new representation of the data that retains maximal information about the original data yet provides effective dimensionality reduction. PCA is used in the current study and is further elaborated in the following sections. Such methodologies of single-trial EEG classification algorithms have been implemented for a variety of BCI applications, using different experimental paradigms. Most commonly, single-trial EEG classification has been used for movement-based and P300-based applications. Movement tasks, both imaginary and real, have been studied for their potential use with disabled subjects. P300 applications, based on visual or auditory oddball experiments, originally aimed at providing BCI-based communication devices for locked-in patients and can also be used for a variety of applications for healthy individuals. Emotion assessment, for example, attempts to classify emotions into categories (negative, positive and neutral) using a combination of EEG and other physiological signals, offering a potential tool for behavior prediction and monitoring.
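Returning to the PCA step described above, the following Python sketch shows one way the axis of largest variance could be found and used for dimensionality reduction. It is only an illustration under stated assumptions: the samples are invented, power iteration is swapped in for a full eigendecomposition so that the example needs nothing beyond the standard library, only the leading component is retained, and all function names are hypothetical.

```python
import math


def center(rows):
    """Subtract the per-feature mean from every sample."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(centered):
    """Sample covariance matrix of already centered data."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def leading_component(cov, iterations=200):
    """Unit vector along the axis of largest variance, found by power iteration."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


def project_onto(centered, component):
    """Reduce each sample to a single score along the retained axis."""
    return [sum(r[j] * component[j] for j in range(len(component)))
            for r in centered]


# Invented samples: the first two features co-vary, the third is near-constant.
samples = [[1.0, 1.1, 0.0], [2.0, 1.9, 0.1], [3.0, 3.2, -0.1], [4.0, 3.9, 0.0]]

centered = center(samples)
pc1 = leading_component(covariance(centered))
scores = project_onto(centered, pc1)
```

In this toy case the retained axis weights the two co-varying features almost equally and nearly ignores the third, so three features are summarized by one score per sample. Retaining several components would require deflation or a proper eigendecomposition routine.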
The aim is to implement a BCI framework for sorting large image databases into one of two categories (target images and non-targets). EEG patterns are used as markers for target-image appearance during rapid visual presentation. Subjects are instructed to search for target images (a given category out of five) within a rapid serial visual presentation (RSVP; 10 Hz). In this case, the methodological goal of the classification algorithm is to automatically identify, within a set of event-related responses, the single-trial spatio-temporal brain responses that are associated with target image detection. In addition to the common challenges faced by single-trial classification algorithms for noisy EEG data, specific challenges are introduced by the RSVP task, due to the fast presentation of stimuli and the ensuing overlap between consecutive event-related responses. Some methods have thus been constructed specifically for the RSVP task.
One such method, developed specifically for single-trial classification of RSVP data, used spatial Independent Component Analysis (ICA) to extract a set of spatial weights and obtain maximally independent spatio-temporal sources. A parallel ICA step was performed in the frequency domain to learn spectral weights for independent time-frequency components. PCA was applied separately to the spatial and spectral sources to reduce the dimensionality of the data. Each feature set was classified separately using Fisher Linear Discriminants, and the results were then combined using naive Bayes fusion (i.e., multiplication of posterior probabilities).
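The fusion step mentioned above, multiplication of posterior probabilities, can be sketched in a few lines of Python. The sketch makes assumptions that the text does not state: the products are renormalized so that the fused values sum to one, the posterior values are invented, and the function name fuse_posteriors is hypothetical.

```python
def fuse_posteriors(posteriors_per_classifier):
    """Multiply class posteriors across classifiers, then renormalize.

    Each inner list holds one classifier's posterior probabilities,
    in the same class order for every classifier.
    """
    n_classes = len(posteriors_per_classifier[0])
    fused = [1.0] * n_classes
    for posterior in posteriors_per_classifier:
        for k in range(n_classes):
            fused[k] *= posterior[k]
    total = sum(fused)
    if total == 0.0:
        raise ValueError("all fused probabilities are zero")
    # Renormalization is an assumption made here for readability of the result.
    return [p / total for p in fused]


# Invented posteriors over [target, non-target] for one trial.
spatial_posterior = [0.7, 0.3]
spectral_posterior = [0.6, 0.4]

fused = fuse_posteriors([spatial_posterior, spectral_posterior])
```

Here the products are 0.42 and 0.12, so after renormalization the fused target probability is 0.42 / 0.54, higher than either classifier reported alone because both favor the same class.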
A more general framework was proposed for single-trial classification, and was also implemented specifically for the RSVP task. The suggested framework uses a bilinear spatio-temporal projection of event-related data onto both temporal and spatial axes. These projections can be implemented in many ways. The spatial projection can be implemented, for example, as a linear transformation of EEG scalp recordings into the underlying source space or as ICA. The temporal projection can be thought of as a filter. The dual projections are applied to non-overlapping time windows of the single-trial data matrix, yielding one scalar score per window. The window scores are summed or classified to provide a classification score for the entire single trial. In addition to the choice of projections, this framework can support additional constraints on the structure of the projection matrix. One option is, for example, to learn the optimal time window for each channel separately and then train the spatial terms.
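The windowed bilinear projection described above can be sketched as follows. For a window X_k of the single-trial matrix (channels by time samples), a spatial weight vector u and a temporal weight vector v, the window score is the scalar u^T X_k v. This is a minimal illustration under stated assumptions, not the framework's actual implementation: the same temporal filter is reused for every window, the weights and trial values are invented rather than learned, and the function names are hypothetical.

```python
def window_scores(trial, spatial_weights, temporal_weights):
    """Score each non-overlapping window of a trial as u^T X_k v.

    trial: list of channels, each a list of time samples.
    spatial_weights: one weight per channel (u).
    temporal_weights: one weight per sample within a window (v);
        its length defines the window length.
    """
    window_length = len(temporal_weights)
    n_samples = len(trial[0])
    scores = []
    # Step by a full window so that consecutive windows never overlap.
    for start in range(0, n_samples - window_length + 1, window_length):
        score = 0.0
        for ch, w_s in enumerate(spatial_weights):
            for t, w_t in enumerate(temporal_weights):
                score += w_s * trial[ch][start + t] * w_t
        scores.append(score)
    return scores


def trial_score(trial, spatial_weights, temporal_weights):
    """Sum the window scores into one score for the whole trial."""
    return sum(window_scores(trial, spatial_weights, temporal_weights))


# Invented two-channel, four-sample trial (not real EEG data).
trial = [[1.0, 2.0, 3.0, 4.0],
         [0.0, 1.0, 0.0, 1.0]]
u = [1.0, -1.0]   # hypothetical spatial weights
v = [0.5, 0.5]    # hypothetical temporal filter (two-sample average)

per_window = window_scores(trial, u, v)
total = trial_score(trial, u, v)
```

Summation is only one of the two options named in the text; the per-window scores could instead be passed as a feature vector to a classifier such as FLD.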