Filename: b410053k

Back | Show only these items: | Optimising the EVA descriptor for prediction of biological activity

1
Optimising the EVA descriptor for prediction of biological activity

Type: Goal | Advantage: None | Novelty: None | ConceptID: Goa1

EVA is a multivariate molecular descriptor for use in QSAR studies.

Type: Object | Advantage: None | Novelty: New | ConceptID: Obj1

It is constructed from vibrational eigenvalues derived from either a quantum theoretical or molecular mechanical treatment of molecular structure.

Type: Object | Advantage: None | Novelty: New | ConceptID: Obj1

This paper applies the method to biological-activity data using measures of the inotropic potential of a range of Calcium channel agonists.

Type: Goal | Advantage: None | Novelty: None | ConceptID: Goa1

The performance of the descriptor, as both an explanatory and a predictive tool, is analysed in relation to the way in which it is constructed using a rigorous statistical treatment.

Type: Goal | Advantage: None | Novelty: None | ConceptID: Goa1

Its capabilities are examined in relation to those of previously published methodology which used a composite descriptor.

Type: Goal | Advantage: None | Novelty: None | ConceptID: Goa1

It is shown to have improved performance and several procedural advantages, such as ease of calculation and operation.

Type: Conclusion | Advantage: None | Novelty: None | ConceptID: Con1

It is a 3-D structural descriptor which does not require prior co-alignment of structures for a QSAR study.

Type: Object | Advantage: Yes | Novelty: New | ConceptID: Obj1

Introduction

A Quantitative Structure-Activity Relationship (QSAR) is a statistical model relating a common property of a series of chemicals (such as their responses in an assay of a particular type of biological activity) to their molecular structure.

Type: Method | Advantage: None | Novelty: Old | ConceptID: Met1

The major goal is to use this relationship to discover new chemicals with properties optimally suited to a defined application in an area such as pharmaceuticals, agrochemicals or catalysts.

Type: Method | Advantage: None | Novelty: Old | ConceptID: Met1

The process entails relating structurally significant features of a set of chemicals to activity measurements (responses) that are the complex result of integration of many physical, chemical and metabolic processes.

Type: Method | Advantage: None | Novelty: Old | ConceptID: Met1

The structural features are summarised numerically in a molecular descriptor, but it is often difficult to do so efficiently and effectively since these features may also be very complex.

Type: Method | Advantage: No | Novelty: Old | ConceptID: Met1

Furthermore, many studies are performed on data sets comprising limited numbers of compounds and associated responses and this restricts the number of variables that can be used to describe structure in the QSAR model.

Type: Motivation | Advantage: None | Novelty: None | ConceptID: Mot1

In order to maximise the value of this small number of variables, molecular descriptors that summarise efficiently a large amount of diverse chemical information are required.

Type: Motivation | Advantage: None | Novelty: None | ConceptID: Mot1

One recent example is the theoretically-based multivariate descriptor, EVA.^1–3

Type: Model | Advantage: None | Novelty: None | ConceptID: Mod1

The EVA (EigenVAlue) descriptor

The premise on which the EVA approach is based is that a molecule's normal modes of vibration encode a suitable description of chemical structure, summarising the features necessary for the production of robust, predictive, QSARs.

Type: Model | Advantage: None | Novelty: None | ConceptID: Mod1

The normal modes are readily derived, using computational techniques based on either Quantum Theoretical or Molecular Mechanical methods.

Type: Hypothesis | Advantage: None | Novelty: None | ConceptID: Hyp1

When suitably encoded they may be used as robust structural descriptors in QSAR lead-optimisation studies.

Type: Hypothesis | Advantage: None | Novelty: None | ConceptID: Hyp1

In the following section, an overview of the approach is provided; for a more complete discussion, the reader is referred to reference .¹

Type: Background | Advantage: None | Novelty: None | ConceptID: Bac1

QSAR studies demand as complete a structural description of a molecule as possible, in numerical terms.

Type: Method | Advantage: None | Novelty: Old | ConceptID: Met1

Using Quantum Mechanics, the molecular wavefunction, Ψ, provides a thorough characterisation of the nuclear and electronic properties of a molecule (and in principle should also allow for the estimation of properties such as biological activity).

Type: Method | Advantage: None | Novelty: Old | ConceptID: Met1

However, extraction of useful information for a QSAR study is difficult.

Type: Method | Advantage: No | Novelty: Old | ConceptID: Met1

Following studies with experimentally-based Infra-red Spectra, theoretically-derived normal modes of vibration were identified as a suitable molecular descriptor encoding information on the identity of the constituent atoms, their bonding and spatial relationships (hence molecular shape and size), their vibrational modes and molecular electronic structure.

Type: Model | Advantage: None | Novelty: None | ConceptID: Mod1

Intuitively, such a descriptor should constitute a robust, predictive, structural representation.

Type: Hypothesis | Advantage: None | Novelty: None | ConceptID: Hyp1

Calculation of a molecule's normal modes is achieved using the molecular electronic potential energy function, V, which is derived from the molecular wavefunction.

Type: Model | Advantage: None | Novelty: None | ConceptID: Mod2

Normal Coordinate Analysis (NCA) is performed to characterise the molecular vibrational properties.

Type: Method | Advantage: None | Novelty: Old | ConceptID: Met2

This approach involves the pointwise estimation of the potential function to determine the equations of motion, yielding both normal coordinate EigenVAlues (the normal mode frequencies of vibration) and their associated eigenvectors (the atom vibration vectors).

Type: Method | Advantage: None | Novelty: Old | ConceptID: Met2

The EVA procedure makes use of the eigenvalues, the eigenvectors being retained for final back-transformation to the molecular structures.

Type: Model | Advantage: None | Novelty: None | ConceptID: Mod1

The normal modes for a set of molecules cannot be used directly in a multivariate statistical analysis because current methods demand that the descriptor for each molecule in the training set is comprised of the same number of well-defined components, (i.e., the molecular descriptor must be a vector of fixed dimension).

Type: Method | Advantage: No | Novelty: Old | ConceptID: Met2

Since the number of normal modes varies with the number of atoms N in a molecule (actually 3N-6 for a molecule without axial symmetry) each descriptor within a series will generally be of different dimension.

Type: Method | Advantage: No | Novelty: Old | ConceptID: Met2

Data transformation is therefore required to convert, without loss of information, a varying number of vibrational eigenvalues into the descriptor EVA, which is of fixed dimension.

Type: Model | Advantage: None | Novelty: None | ConceptID: Mod1

This is done by projecting the eigenvalues (vibrational frequencies) onto a bounded frequency scale (BFS) with individual vibrations represented by points along this axis.

Type: Model | Advantage: None | Novelty: None | ConceptID: Mod1

The frequency range is chosen to be 0 cm⁻¹ to 4000 cm⁻¹ to encompass all fundamental molecular vibrations.

Type: Model | Advantage: None | Novelty: None | ConceptID: Mod1

Each of the 3N-6 vibrations is represented by an equivalent Gaussian curve, G{f(μ), σ²}, in which μ is the vibrational frequency; the area under each curve is assigned as unity.

Type: Model | Advantage: None | Novelty: None | ConceptID: Mod1

Proximate or coincident Gaussian functions are permitted to overlap and the “intensity” is summed.

Type: Model | Advantage: None | Novelty: None | ConceptID: Mod1

The choice of σ for each function defines the degree to which the vibrations overlap and is typically 10 cm⁻¹ to 20 cm⁻¹.

Type: Model | Advantage: None | Novelty: None | ConceptID: Mod1

This smoothing operation introduces, deliberately, a “fuzziness” to the spectrum of vibrations and is a key step in the definition of EVA; a value that is too small will fail to detect similarities where they exist, and one that is too large will result in overlaps that obscure significant proportions of the variation in the descriptor.

Type: Model | Advantage: None | Novelty: None | ConceptID: Mod1

The process results in a degree of serial correlation in the descriptor that causes inevitably some redundancy in the descriptor variables.

Type: Model | Advantage: None | Novelty: None | ConceptID: Mod1

However, it also enables the significance of the presence or absence of peaks to be assessed in the subsequent analysis, together with the monitoring of changes in peak position.

Type: Model | Advantage: None | Novelty: None | ConceptID: Mod1

Once a Gaussian smoothing function has been applied, the resultant spectrum is sampled across its whole width in fixed increments (L cm⁻¹).

Type: Method | Advantage: None | Novelty: Old | ConceptID: Met3

The choice of increment determines the number of variables in the EVA descriptor; thus, for example, a 2 cm⁻¹ increment results in a descriptor string of 2000 variables.

Type: Method | Advantage: None | Novelty: Old | ConceptID: Met3

Molecular vibrations are therefore depicted as “peaks” of intensity on a scale of frequency versus height.

Type: Model | Advantage: None | Novelty: None | ConceptID: Mod1

This produces the molecular descriptor, EVA, which retains the integrity of its constituent frequency data.

Type: Method | Advantage: None | Novelty: Old | ConceptID: Met3

The choice of the variables σ and L can have a marked effect on the quality of the subsequently derived QSAR/QSPR model, and this is discussed below (see section 3.2).

Type: Method | Advantage: None | Novelty: Old | ConceptID: Met3

The EVA descriptor was originally formulated at Shell Research Ltd, Sittingbourne Research Centre, and was applied successfully in a series of agrochemical lead-optimisation studies.

Type: Background | Advantage: None | Novelty: None | ConceptID: Bac2

While much of the experimental data used in the development remain proprietary and hence are not available for publication, its utility has been demonstrated by modelling experimentally observed logP values.¹

Type: Background | Advantage: None | Novelty: None | ConceptID: Bac2

The intention of the present study is to illustrate the application of the method to biological data that are within the public domain, incidentally providing a comparison of the performance of the EVA descriptor with that of other, more familiar, descriptors.

Type: Goal | Advantage: None | Novelty: None | ConceptID: Goa1

A convenient example lies in a set of calcium ion channel agonists and their associated responses in an assay of biological activity;⁴ this represents a particularly attractive set as it has been analysed previously by Davis et al^5,6. with another theoretically-based 3-D descriptor, GRID.⁷

Type: Background | Advantage: None | Novelty: None | ConceptID: Bac3

This paper compares the performance of EVA with the published results of this study.

Type: Goal | Advantage: None | Novelty: None | ConceptID: Goa1

In the previous work,^5,6 the 3D structural descriptor, GRID,⁷ was used to analyse the molecular features which account for the observed activities.

Type: Method | Advantage: None | Novelty: Old | ConceptID: Met4

The additional inclusion of selected physicochemical descriptors, and the consequences of variable scaling, was then considered in turn to assess whether the QSAR model could be improved upon.

Type: Background | Advantage: None | Novelty: None | ConceptID: Bac3

All results were cited in terms of the squared correlation coefficient of the fitted PLS regression model, r².

Type: Model | Advantage: None | Novelty: None | ConceptID: Mod3

The intention of the present studies is to conduct a parallel investigation to examine the performance of the EVA descriptor.

Type: Goal | Advantage: None | Novelty: None | ConceptID: Goa1

A particular focus will be to examine the robustness of the predictive ability of the EVA based QSAR model.

Type: Goal | Advantage: None | Novelty: None | ConceptID: Goa1

Results will therefore be presented in terms of the squared correlation coefficients of the cross-validated equations, q², rather than the fitted regression model.

Type: Model | Advantage: None | Novelty: None | ConceptID: Mod3

The parameter q² is commonly used as a measure of the predictive ability of the regression model, but in order to facilitate comparison between the two studies the values of r² will also be included as appropriate.

Type: Model | Advantage: None | Novelty: None | ConceptID: Mod3

This paper reports the results of an investigation to show how to optimise the EVA descriptor in order to increase the predictive power of a QSAR equation.

Type: Goal | Advantage: None | Novelty: None | ConceptID: Goa1

Experimental

The Calcium Channel Agonist data set

Table 1 shows the structures of the compounds used by Davis et al.^5,6

Type: Background | Advantage: None | Novelty: None | ConceptID: Bac4

The compounds were assayed in vitro for their inotropic potency, i.e., their ability to increase cardiac contractibility, using guinea pig atria paced at 1 Hz.

Type: Method | Advantage: None | Novelty: Old | ConceptID: Met4

This was expressed as the concentration of the drug that increased the tension developed in the preparation to 50% of the isoprenaline maximum (EC₅₀).

Type: Method | Advantage: None | Novelty: Old | ConceptID: Met4

Results were expressed relative to the EC₅₀ of the standard calcium channel agonist Bay K 8644 (Fig. 1) and these values are given in Table 1.

Type: Background | Advantage: None | Novelty: None | ConceptID: Bac4

Also listed in Table 1 are values of the calculated log octanol/water partition coefficient (clogP) and the calculated molar refractivity (cMR) respectively, corresponding to the various structures.

Type: Background | Advantage: None | Novelty: None | ConceptID: Bac4

These were calculated using MedChem software v.3.54 B (1989).

Type: Method | Advantage: None | Novelty: Old | ConceptID: Met5

In the corresponding table in the paper by Davis et al. the quoted values for these parameters are different, having been modified by in-house fitting to a few experimentally determined values.

Type: Background | Advantage: None | Novelty: None | ConceptID: Bac4

In the present paper we use the standard values in order to maintain external consistency.

Type: Background | Advantage: None | Novelty: None | ConceptID: Bac4

GRID descriptor

For the 3-D QSAR analysis of the data set, Davis et al. made use of the GRID descriptor.⁷

Type: Method | Advantage: None | Novelty: Old | ConceptID: Met4

The technique was developed originally as a method of probing proteins for potential binding sites and has been used successfully to predict sites of binding of small ligands in proteins,⁸ in addition to its more recent role as a structure descriptor.

Type: Method | Advantage: Yes | Novelty: Old | ConceptID: Met4

A detailed discussion of the derivation of the GRID descriptor can be found in reference .⁷

Type: Background | Advantage: None | Novelty: None | ConceptID: Bac5

The inclusion of physicochemical descriptors

A limitation of GRID is that by generating the descriptors from the calculated interaction energy of a probe molecule, only the enthalpic component of the drug–receptor interaction can be modelled.

Type: Method | Advantage: No | Novelty: Old | ConceptID: Met4

This neglects other factors which can be, potentially, highly influential, such as the entropic changes resulting from the interaction.

Type: Method | Advantage: No | Novelty: Old | ConceptID: Met4

In recognition of this limitation, Davis et al^5,6. included two physicochemical descriptors, clogP and cMR.

Type: Model | Advantage: None | Novelty: None | ConceptID: Mod4

ClogP represents a particularly useful descriptor since by accounting for lipophilicity and solvation, features that are influenced heavily by hydrogen bonding, it can give information on the entropic changes associated with a molecule's coming out of solution and binding at a receptor site.

Type: Model | Advantage: None | Novelty: None | ConceptID: Mod4

EVA descriptor generation

EVA descriptors were generated for the 36 compounds in the calcium channel agonist set.^5,6

Type: Method | Advantage: None | Novelty: Old | ConceptID: Met3

A typing error in the published table^5,6 cites incorrectly the substitution about the benzoylpyrrole backbone of compounds 8, 12, 19, 33, 35 and 36 as 2,4 rather than as .2,5⁹

Type: Background | Advantage: None | Novelty: None | ConceptID: Bac6

The correct structures were used in the present study and were created using Sybyl 6.2 from Tripos Associates¹⁰ on a Silicon Graphics Indy R4000 platform.

Type: Method | Advantage: None | Novelty: Old | ConceptID: Met6

In the previous studies, Davis et al. constrained the energy minimisation of the structures in an attempt to minimise noise in the GRID descriptors and hence optimise the subsequent QSAR investigations.^5,6

Type: Method | Advantage: None | Novelty: Old | ConceptID: Met4

In order to perform the calculation of the normal modes of vibration required for EVA, it is a prerequisite to ensure that the potential energy of a molecular conformation is at a stationary point.

Type: Method | Advantage: No | Novelty: Old | ConceptID: Met3

Slight deviations away from this energy minimum can lead to the introduction of rotational and translational modes during the determination of the equations of motion in the NCA.

Type: Method | Advantage: No | Novelty: Old | ConceptID: Met3

This in effect ‘contaminates’ the vibrational modes, resulting in the determination of vibrations that are not purely harmonic in nature and degrading significantly the quality of the calculated vibrational spectrum.

Type: Method | Advantage: No | Novelty: Old | ConceptID: Met3

A more detailed discussion of this is provided in reference .¹¹

Type: Background | Advantage: None | Novelty: None | ConceptID: Bac7

No attempt was made therefore to reproduce directly the conformational co-ordinates used by Davis et al.^5,6

Type: Method | Advantage: No | Novelty: Old | ConceptID: Met3

Molecular structures were optimised using the MOPAC 6.0 AM1 Hamiltonian¹² (keywords: SCFCRT = 1.D-12 GNORM = 0.05).

Type: Method | Advantage: None | Novelty: Old | ConceptID: Met7

Successfully minimised structures were then used to generate the NMs (additional keyword: FORCE).

Type: Method | Advantage: None | Novelty: Old | ConceptID: Met7

These were then converted into EVA descriptors in the manner described.

Type: Method | Advantage: None | Novelty: Old | ConceptID: Met3

It is acknowledged that the calculation of NMs using a semi-empirical route is less accurate than for example, by the use of Density Functional Theory methods.¹³

Type: Method | Advantage: No | Novelty: Old | ConceptID: Met7

However, the use of the MOPAC AM1 Hamiltonian has previously been shown to produce qualitatively good results.¹⁴

Type: Method | Advantage: Yes | Novelty: Old | ConceptID: Met7

Furthermore, EVA descriptors generated from the MOPAC AM1 Hamiltonian have been applied successfully in previously reported studies.¹⁴

Type: Method | Advantage: Yes | Novelty: Old | ConceptID: Met7

Consequently, it is believed that the EVA descriptors generated in this way are acceptable for the present investigation.

Type: Hypothesis | Advantage: None | Novelty: None | ConceptID: Hyp2

Statistical analysis

To generate useful QSARs between the activity, EC₅₀, and the EVA descriptor, the Partial Least Squares regression method (PLS) was adopted for the investigations; this is in line with the GRID-based studies of Davis et al.^5,6

Type: Method | Advantage: None | Novelty: Old | ConceptID: Met8

PLS is a supervised approach that provides a reduced solution of new ‘latent’ variables that are linear combinations of the original variables and are well correlated with Y. In essence, it is a projection method that defines a hyperplane through the x-descriptor space.

Type: Model | Advantage: None | Novelty: None | ConceptID: Mod5

The latent variables represent the projected ‘summaries’ of the original variables onto this plane and each PLS component represents a co-ordinate dimension of the hyperplane.

Type: Model | Advantage: None | Novelty: None | ConceptID: Mod5

A more complete discussion of the PLS approach can be obtained from references ¹⁵ and .¹⁶

Type: Background | Advantage: None | Novelty: None | ConceptID: Bac8

The PLS regression approach is an accepted technique for handling situations where over-square matrices are encountered and has some advantages over unsupervised techniques such as, for example, Principal Components analysis.

Type: Method | Advantage: Yes | Novelty: Old | ConceptID: Met8

By constructing the components in a supervised fashion, it is possible to produce fewer numbers of components correlated with Y, thereby summarising the data more efficiently.

Type: Method | Advantage: Yes | Novelty: Old | ConceptID: Met8

With matrices such as EVA and GRID this is a clear advantage.

Type: Method | Advantage: Yes | Novelty: Old | ConceptID: Met8

As has been mentioned earlier in section 1.1, by having the ability to vary the form and dimension of the EVA descriptor (i.e., σ and L), it is possible to change the manner in which the structural information is presented.

Type: Method | Advantage: Yes | Novelty: Old | ConceptID: Met3

In effect, this can be thought of as a form of data scaling, in line with log and variance scaling methods.

Type: Method | Advantage: Yes | Novelty: Old | ConceptID: Met3

This affects ultimately the regression analysis, particularly with respect to the co-variance with the activity data and the way in which the information is loaded onto the components.

Type: Hypothesis | Advantage: None | Novelty: None | ConceptID: Hyp3

Therefore an essential stage in the application of EVA is the generation of descriptors with different σ and L values and their subsequent systematic validation to identify the optimum predictive model.

Type: Hypothesis | Advantage: None | Novelty: None | ConceptID: Hyp3

100

Previous studies have shown that a σ value of between 10 and 20 cm⁻¹ is a reasonable starting point for such investigations.¹

Type: Background | Advantage: None | Novelty: None | ConceptID: Bac9

Model validation and significance testing

101

A key step in regression analysis is the validation to determine the correct dimensionality of the model, i.e., the number of significant PLS components to be included.

Type: Hypothesis | Advantage: None | Novelty: None | ConceptID: Hyp4

102

Without this validation there is a risk of ‘over-training’, leading to apparently well-fitting models with effectively little predictive capacity.

Type: Hypothesis | Advantage: None | Novelty: None | ConceptID: Hyp4

103

As the intention of the present study is to assess the predictive capability of the EVA descriptor, two forms of model validation have been incorporated.

Type: Goal | Advantage: None | Novelty: None | ConceptID: Goa1

104

In line with the study by Davis et al., Cross-Validation (CV) has been used to test each successive PLS component for significance.

Type: Method | Advantage: None | Novelty: Old | ConceptID: Met9

105

In addition, randomisation tests have been included to establish confidence limits on whether or not random correlations are being selected in place of real relationships.

Type: Method | Advantage: None | Novelty: Old | ConceptID: Met9

106

CV represents one of the most favoured methods of testing the PLS components for significance.

Type: Method | Advantage: Yes | Novelty: Old | ConceptID: Met9

107

In essence, CV randomly divides the data set into groups of compounds.

Type: Model | Advantage: None | Novelty: None | ConceptID: Mod3

108

Each group is then omitted from the data set in turn and the remaining set used to construct a new regression model, which is then used to calculate the responses for the members of the omitted set.

Type: Model | Advantage: None | Novelty: None | ConceptID: Mod3

109

From the predicted y-values for each of the omitted ‘test’ sets, the sum of squares of the differences from actual values can be calculated.

Type: Model | Advantage: None | Novelty: None | ConceptID: Mod3

110

This yields the PRESS score (Predictive Residual Sum of Squares) and by weighting this against both the inclusion of successive components and the number of cases in the set, the standard error of cross-validation, S_cv, can be calculated.

Type: Model | Advantage: None | Novelty: None | ConceptID: Mod3

111

This provides a good test of the significance of the inclusion of each component upon the overall regression model and has some advantages over other procedures such as, for example, the F-ratio test.

Type: Method | Advantage: Yes | Novelty: Old | ConceptID: Met9

112

By simulating how well the model predicts new data, it provides an estimate of the robustness of the model and potentially can highlight models that are influenced by data points with high leverage (e.g., outliers).

Type: Method | Advantage: Yes | Novelty: Old | ConceptID: Met9

113

As the underlying intention of a QSAR is to make predictions, then this is useful.

Type: Method | Advantage: Yes | Novelty: Old | ConceptID: Met9

114

Initial analysis of the EVA descriptor models was performed using Leave-One-Out Cross-Validation (LOO CV).

Type: Method | Advantage: None | Novelty: Old | ConceptID: Met10

115

To determine the optimum regression model, the S_cv scores for the first 10 PLS components were calculated and the minimum S_cv score identified.¹⁵

Type: Method | Advantage: None | Novelty: Old | ConceptID: Met10

116

Previous experience has shown that the relevant information linking the response and descriptor variables is summarised in the first few components (typically ≤ 6).¹

Type: Background | Advantage: None | Novelty: None | ConceptID: Bac9

117

The squared correlation coefficient of the cross-validated equation, q², was then cited for the inclusion of each successive component.

Type: Background | Advantage: None | Novelty: None | ConceptID: Bac9

118

In a previous publication,¹ it was shown that the LOO CV q² obtained on a training set of 135 structures, using EVA as the descriptor and logP_ow as the measured response, was a very good indicator of subsequent predictive power for logP_ow in an unrelated test set of 76 compounds.

Type: Method | Advantage: Yes | Novelty: Old | ConceptID: Met10

119

However, work by Shao¹⁷ has challenged the suitability of LOO CV for testing the predictive robustness of a QSAR model and suggests that Leave-n-Out Cross-Validation, where n is greater than 1, represents a more robust alternative.

Type: Method | Advantage: No | Novelty: Old | ConceptID: Met10

120

Consequently, in the present investigations we have compared LOO CV results with those obtained using Leave-Group-Out Cross-Validation (LGO CV) with n = 7 (see section 3.2); this is also in line with the validation approach applied by Davis et al.^5,6

Type: Method | Advantage: None | Novelty: Old | ConceptID: Met11

121

However, while LOO CV represents an easily reproducible test, i.e., each run will yield the same q² result, in LGO CV the cases assigned to each group to be left out are selected at random.

Type: Method | Advantage: No | Novelty: Old | ConceptID: Met11

122

This means that LGO CV must be repeated a large number of times (e.g., 1000) in order to obtain a realistic estimate of the internal predictivity and is hence more computationally demanding than LOO CV.

Type: Method | Advantage: No | Novelty: Old | ConceptID: Met11

123

Our work shows that with this data set and analytical approach, LOO CV gives very similar results to LGO CV and so we have adopted the simpler procedure.

Type: Method | Advantage: Yes | Novelty: Old | ConceptID: Met10

124

It must be acknowledged that neither LOO CV nor LGO CV can be regarded as a sufficient indicator of model validity; in addition, some measure of the probability that the result may be a chance occurrence is needed.

Type: Hypothesis | Advantage: None | Novelty: None | ConceptID: Hyp5

125

This is particularly important in regression studies where there are a smaller number of cases than the dimensionality of the x-descriptor, i.e., where an over-square matrix is present.

Type: Hypothesis | Advantage: None | Novelty: None | ConceptID: Hyp5

126

In such instances, there is an increased possibility of the extraction of a random correlation.

Type: Hypothesis | Advantage: None | Novelty: None | ConceptID: Hyp5

127

Consequently, in order to address this with regard to the EVA descriptor, a randomised permutation test was performed to examine the model for chance correlations; this is in line with previous studies that have applied the EVA descriptor.^2,18

Type: Method | Advantage: None | Novelty: Old | ConceptID: Met12

128

The process involves:

Type: Method | Advantage: None | Novelty: Old | ConceptID: Met12

129

Type: Method | Advantage: None | Novelty: Old | ConceptID: Met12

130

Randomly assigning observations to structures,

Type: Method | Advantage: None | Novelty: Old | ConceptID: Met12

131

Type: Method | Advantage: None | Novelty: Old | ConceptID: Met12

132

Performing a regression analysis using EVA and then evaluating q² (or r²),

Type: Method | Advantage: None | Novelty: Old | ConceptID: Met12

133

Type: Method | Advantage: None | Novelty: Old | ConceptID: Met12

134

Repeating steps 1 and 2 a large number of times (e.g., 1000 times) to ensure thorough sampling,

Type: Method | Advantage: None | Novelty: Old | ConceptID: Met12

135

Type: Method | Advantage: None | Novelty: Old | ConceptID: Met12

136

Determining the frequency distribution of q² (or r²),

Type: Method | Advantage: None | Novelty: Old | ConceptID: Met12

137

Type: Method | Advantage: None | Novelty: Old | ConceptID: Met12

138

Determining where in this frequency distribution the q² (or r²) of the real assignment lies.

Type: Method | Advantage: None | Novelty: Old | ConceptID: Met12

139

An actual result that is extreme in the upper confidence tail of the distribution is regarded as having a low probability of occurring by chance

Type: Method | Advantage: None | Novelty: Old | ConceptID: Met12

Results and discussion

140

In the following section, the results are reported for the QSAR investigation of a set of 36 calcium channel agonists (Table 1), using EVA as the structural descriptor and PLS regression as the statistical approach.

Type: Goal | Advantage: None | Novelty: None | ConceptID: Goa1

141

Except where stated otherwise, all results will be reported in terms of the squared correlation coefficient (q²) of the Leave-One-Out Cross-Validated (LOO CV) equation.

Type: Method | Advantage: None | Novelty: Old | ConceptID: Met10

142

To facilitate comparisons, Table 2 is taken directly from the work of Davis et al., detailing the PLS regression models reported for the full 36 compound data set.^5,6

Type: Background | Advantage: None | Novelty: None | ConceptID: Bac6

Determination of optimum PLS regression model

Optimum EVA descriptor model

143

Initial identification of the optimum EVA model was performed by comparing the results obtained with a series of descriptors using σ values of 10, 20, 30 and 40 cm⁻¹, respectively.

Type: Method | Advantage: None | Novelty: New | ConceptID: Met3

144

The results established a narrowed focus of attention around σ = 20 cm⁻¹ and a further series of EVA descriptors was then generated (16 ≤ σ ≤ 26 cm⁻¹).

Type: Result | Advantage: None | Novelty: None | ConceptID: Res1

145

Comparison of the predictive performance of the models identified the optimum EVA model as having σ = 24 cm⁻¹.

Type: Result | Advantage: None | Novelty: None | ConceptID: Res1

146

Throughout, the BFS start point, S, was fixed at 1 cm⁻¹ and L was defined as σ/2.

Type: Model | Advantage: None | Novelty: None | ConceptID: Mod6

147

This has been assumed previously¹ to represent a reasonable sampling frequency for a given value of σ and is justified below (section 3.2).

Type: Background | Advantage: None | Novelty: None | ConceptID: Bac2

148

The results for the full range of EVA models considered are included in Table 3.

Type: Result | Advantage: None | Novelty: None | ConceptID: Res1

Model transformations and variance scaling effects

149

As was demonstrated earlier,^5,6 PLS analysis can be highly sensitive to the relative scales of variance between the x-block and y-variable (this is especially significant when different x-descriptors, each with its own unit and scale, are included in the regression equation, although this situation does not apply to EVA).

Type: Method | Advantage: No | Novelty: Old | ConceptID: Met8

150

A series of transformations and scaling procedures was applied to both the activity and EVA variables, to investigate the impact on performance.

Type: Method | Advantage: None | Novelty: Old | ConceptID: Met13

151

Scaling and transformation tests were conducted in two parts.

Type: Method | Advantage: None | Novelty: Old | ConceptID: Met13

152

In the case of the activity variable, EC₅₀, both log scaling and variable standardisation were considered, as in the study of Davis et al.^5,6

Type: Method | Advantage: None | Novelty: Old | ConceptID: Met13

153

For the EVA descriptor, two forms of variance scaling were applied, Autoscaling and Blockscaling.

Type: Method | Advantage: None | Novelty: Old | ConceptID: Met14

154

Autoscaling standardises each individual variable column to zero mean and unit variance, while Blockscaling scales the entire descriptor block to zero mean and unit variance.

Type: Method | Advantage: None | Novelty: Old | ConceptID: Met14

155

The impact of these two scaling methods is of particular interest, because differences in column variance across the descriptor are considered to contain structurally significant information.

Type: Hypothesis | Advantage: None | Novelty: None | ConceptID: Hyp6

156

The full results are reported in Table 4.

Type: Result | Advantage: None | Novelty: None | ConceptID: Res2

157

Variance scaling with the EC₅₀ variable has no impact on the PLS regression model (4c model, q² = 0.498).

Type: Result | Advantage: None | Novelty: None | ConceptID: Res2

158

Similarly, blockscaling the EVA descriptor does not affect the performance.

Type: Result | Advantage: None | Novelty: None | ConceptID: Res2

159

However, log₁₀ transformation of the activity values does result in a significant degradation of the PLS model (1c model, q² = 0.194).

Type: Result | Advantage: None | Novelty: None | ConceptID: Res2

160

Notably, use of the autoscaled EVA descriptor yields the worst model of all.

Type: Result | Advantage: None | Novelty: None | ConceptID: Res2

EVA QSAR model testing and stability

161

As discussed above (section 2.4.1), with over-square x-matrices the PLS regression approach, potentially, can extract chance correlations with the y-variable.

Type: Method | Advantage: No | Novelty: Old | ConceptID: Met8

162

In the case of the EVA descriptor, an extreme example of an over-square matrix, then the likelihood of such an occurrence may be increased.

Type: Hypothesis | Advantage: None | Novelty: None | ConceptID: Hyp7

163

Therefore, a series of rigorous validation exercises was applied to establish confidence in the QSAR model obtained.

Type: Goal | Advantage: None | Novelty: None | ConceptID: Goa1

164

These include sensitivity tests on the effects of changes in the EVA L variable, LGO CV (in line with references ⁵ and ⁶) and randomised activity permutation tests.

Type: Method | Advantage: None | Novelty: Old | ConceptID: Met12

165

Fundamentally, two variables, namely σ (Gaussian standard deviation) and L (the BFS sampling interval), determine the form and dimension of the EVA descriptor when derived from the normal coordinate spectrum.

Type: Method | Advantage: None | Novelty: Old | ConceptID: Met3

166

Potentially, the value chosen for L can impact on the information content retained in the final descriptor model and an extreme illustration of this would occur if the sampling interval were so large that not all of the intensities under the Gaussian curves were to be sampled.

Type: Method | Advantage: No | Novelty: Old | ConceptID: Met3

167

As outlined earlier in the initial determination of the optimum EVA model, L was defined as σ/2.

Type: Model | Advantage: None | Novelty: None | ConceptID: Mod6

168

This ensures that all structural information is retained, while keeping the associated computational time within reason.

Type: Model | Advantage: None | Novelty: None | ConceptID: Mod6

169

Validation of this procedure was carried out by comparing the effects of using a range of sampling intervals (σ/4 ≤ L ≤ 4σ).

Type: Method | Advantage: None | Novelty: Old | ConceptID: Met15

170

The full set of results is shown in Table 5 and illustrated in Fig. 2.

Type: Result | Advantage: None | Novelty: None | ConceptID: Res3

171

For sampling intervals where L ≤ σ, the information content remains intrinsically stable, with no significant impact on the statistical performance of the EVA models.

Type: Result | Advantage: None | Novelty: None | ConceptID: Res3

172

However when L > σ, decreasing stability in the performance in the EVA model is observed.

Type: Result | Advantage: None | Novelty: None | ConceptID: Res3

173

The results confirm that L = σ/2 is a reasonable rule.

Type: Result | Advantage: None | Novelty: None | ConceptID: Res3

174

To test rigorously the predictive performance of the optimum EVA model (EVA{24,12}), LGO CV was used and compared with the results of LOO CV.

Type: Method | Advantage: None | Novelty: Old | ConceptID: Met11

175

In line with the LGO CV in references ⁵ and ⁶, the 36 compounds were divided randomly into 7 groups.

Type: Method | Advantage: None | Novelty: Old | ConceptID: Met11

176

Repeated runs of the LGO CV (1000 iterations) were carried out with different randomisations to ensure that the range of possibilities was thoroughly sampled.

Type: Method | Advantage: None | Novelty: Old | ConceptID: Met11

177

The results for the LGO CV study are presented in Table 5.

Type: Result | Advantage: None | Novelty: None | ConceptID: Res4

178

These show that the 4-component model occurs most frequently (560 iterations; Fig. 3a), in agreement with LOO CV.

Type: Result | Advantage: None | Novelty: None | ConceptID: Res4

179

The range of q² scores for the LGO CV 4 component model is approximately normal in distribution (Fig. 3b and 3c).

Type: Result | Advantage: None | Novelty: None | ConceptID: Res4

180

Determination of the mean q² reveals that q²_LGO compares closely with q²_LOO (Fig. 2c: q²_LGO = 0.474, q²_LOO = 0.498), which gives confidence in the use of LOO CV to test the range of EVA models.

Type: Result | Advantage: None | Novelty: None | ConceptID: Res4

181

Figs. 4 and 5 show the results of the randomised activity permutation tests on the same data set, obtained by using 1000 permutations of the response data with respect to structure.

Type: Result | Advantage: None | Novelty: None | ConceptID: Res5

182

Fig. 3 shows the frequency distribution and normalised plot of q², obtained from the LOO CVs; the actual experimental result of 0.498 lies well outside the upper 95% confidence limit (>99%).

Type: Result | Advantage: None | Novelty: None | ConceptID: Res6

183

A similar result is shown in Fig. 4 for the fitted r² distribution.

Type: Result | Advantage: None | Novelty: None | ConceptID: Res7

184

Table 6 summarises these results.

Type: Result | Advantage: None | Novelty: None | ConceptID: Res8

Augmentation of EVA {24,12} with the physicochemical descriptors clogP and cMR

185

Davis et al^5,6. showed that the 3D molecular descriptor GRID did not by itself encode sufficient information to give a high-quality QSAR regression model of −log EC₅₀ (r² = 0.42).

Type: Method | Advantage: No | Novelty: Old | ConceptID: Met4

186

However, inclusion of clogP and cMR, in linear combination with GRID, gives a significant improvement (r² = 0.86).

Type: Model | Advantage: None | Novelty: None | ConceptID: Mod4

187

It is therefore of interest to examine the influence of including clogP and cMR with EVA.

Type: Motivation | Advantage: None | Novelty: None | ConceptID: Mot2

188

The results are given in Table 6 and show that inclusion of these parameters (either singly or in combination) degrades the quality of the fit (r²) and also the predictive power (q²).

Type: Result | Advantage: None | Novelty: None | ConceptID: Res9

EVA and analysis of the data of Davis et al^5,6.

189

QSAR methods that relate biological response to molecular structure are traditionally based on linear free energy relationships, as illustrated in the pioneering work of Hansch et al.¹⁹

Type: Method | Advantage: None | Novelty: Old | ConceptID: Met1

190

Typically, these methods attempt to draw correlations between empirical descriptors which supposedly represent generalised forms of enthalpic and entropic effects associated with a drug molecule's transport to, and interaction with, a receptor or enzyme active site.

Type: Method | Advantage: None | Novelty: Old | ConceptID: Met1

191

The current research is aimed at further developing a new molecular descriptor, EVA, which encodes a large amount of diverse chemical information and can be used more effectively than the previous empirical forms.

Type: Goal | Advantage: None | Novelty: None | ConceptID: Goa1

192

Analysis of the predictive performance of a range of EVA PLS models (Table 3) identifies the optimum descriptor as having a Gaussian sigma; = 24 cm⁻¹ and a sampling interval L = 12 cm⁻¹, EVA {24,12}.

Type: Result | Advantage: None | Novelty: None | ConceptID: Res1

193

Cross-Validation (using both LOO CV and LGO CV methods) suggests that the first 4 components are significant, yielding a QSAR that is able to account for circa. 80% of the variation in the EC₅₀ activity (fitted r² = 0.801).

Type: Result | Advantage: None | Novelty: None | ConceptID: Res4

194

Rigorous testing of the internal predictivity of the EVA QSAR model (Tables 3 and 5) shows that for structures outside the training set circa. 50% of the activity variation may be predicted with confidence, an experimentally useful result (4-component models, q²_LGO = 0.474; q²_LOO = 0.498).

Type: Result | Advantage: None | Novelty: None | ConceptID: Res4

195

The close agreement between the two validation methods, illustrated in Fig. 2, gives considerable confidence in the use of LOO CV to validate the chosen EVA model and indicates that EVA is robust as a predictive QSAR tool, i.e., the model is stable.

Type: Result | Advantage: None | Novelty: None | ConceptID: Res4

196

Since Davis et al^5,6. analysed their data with log-transformation of the EC₅₀ values (i.e., the y-variate was −log(EC₅₀), Table 1) and their original data are not available to us, an exact comparison of the two sets of results is difficult.

Type: Result | Advantage: None | Novelty: None | ConceptID: Res4

197

Moreover, cross-validation tests were not reported so it is only possible to compare values of r², obtained by fitting the full-regression equation, rather than q².

Type: Background | Advantage: None | Novelty: None | ConceptID: Bac10

198

The best model of Davis et al^5,6. uses a linear combination of GRID, clogP and cMR.

Type: Model | Advantage: None | Novelty: None | ConceptID: Mod4

199

(GRID alone gives a 1-component model, r² = 0.42, clogP alone a 1-component model, r² = 0.69, and GRID in combination with clogP and cMR gives a 4-component model, r² = 0.86).

Type: Background | Advantage: None | Novelty: None | ConceptID: Bac3

200

The values of −log(EC₅₀) predicted by this model are not quoted, but a comparison with experimentally observed values is given graphically in Fig. 5 of reference ⁵; by careful measurement, it is possible to obtain reliable estimates for the predicted −log(EC₅₀) values.

Type: Background | Advantage: None | Novelty: None | ConceptID: Bac3

201

Taking antilogarithms yields values for predicted EC₅₀ and the regression statistics may be calculated and compared with those obtained for the EVA model.

Type: Hypothesis | Advantage: None | Novelty: None | ConceptID: Hyp8

202

Overall, the full EVA model (r² = 0.80, standard error = 4.1) gives a considerably better fit to the untransformed experimental data than does the GRID + clogP + cMR model (r² = 0.42, standard error = 7.0).

Type: Result | Advantage: None | Novelty: None | ConceptID: Res10

203

The results are shown in Table 7 and are further illustrated in Fig. 6 which shows plots of residuals for both treatments.

Type: Result | Advantage: None | Novelty: None | ConceptID: Res10

204

For small values of EC₅₀ (typically less than 4.1, the standard error of the EVA treatment) the two models (not surprisingly) cannot be distinguished; however, for higher values of EC₅₀ EVA consistently outperforms the other model by an impressive margin.

Type: Result | Advantage: None | Novelty: None | ConceptID: Res10

205

A model-validation test using randomised activity permutations establishes confidence intervals for both the predictive (Fig. 4) and fitted (Fig. 5) regression models and demonstrates that, although the EVA descriptor presents an over square x-matrix in the PLS regression model, random correlations with the activity data are highly unlikely (>99% probability that the correlations have not arisen by chance).

Type: Result | Advantage: None | Novelty: None | ConceptID: Res10

206

EVA contains structurally significant features that account well for the observed biological activity.

Type: Conclusion | Advantage: None | Novelty: None | ConceptID: Con1

207

It should also be emphasised that EVA is a robust descriptor of 3D molecular structure, which does not require prior alignment of structures in order for a QSAR study to be performed.

Type: Method | Advantage: Yes | Novelty: Old | ConceptID: Met3

208

This is especially useful when dealing with non congeneric series, where alignment about a common skeletal feature is not feasible.^1,3

Type: Method | Advantage: Yes | Novelty: Old | ConceptID: Met3

209

Although the relevance, multicolinearity and redundancy of the EVA descriptors have not been addressed in this study, they probably provide the most likely explanation for the relatively small values obtained for the estimates of LOO q².

Type: Result | Advantage: None | Novelty: None | ConceptID: Res11

210

These aspects will be reported in a future publication.

Type: Result | Advantage: None | Novelty: None | ConceptID: Res11

211

Clearly, a model that yielded a q² of >50% would be even more useful.

Type: Result | Advantage: None | Novelty: None | ConceptID: Res11

212

Plots of the predicted vs. actual activity for both the fitted and cross-validated regression models (Figs. 7 and 8) highlight problems with outliers, particularly those compounds with EC₅₀ values which are essentially zero; these structures can only contribute noise to the analysis, and the consequences are addressed in detail in section 3.7.

Type: Result | Advantage: None | Novelty: None | ConceptID: Res12

Data scaling

213

In studies where the x-variate function is a vector consisting of a number of variables such as molecular properties with different dimensionalities and variances, careful tailoring of the variable variances is critical in extracting an optimum QSAR model.

Type: Hypothesis | Advantage: None | Novelty: None | ConceptID: Hyp9

214

This is true, for example, of the work by Davis et al.^5,6

Type: Background | Advantage: None | Novelty: None | ConceptID: Bac3

215

In the present study however, Table 4 shows that blockscaling and autoscaling of the x-variate function does not improve the EVA QSAR model.

Type: Result | Advantage: None | Novelty: None | ConceptID: Res2

216

In particular, autoscaling (which scales columns independently to zero mean and unit variance), performs significantly worse.

Type: Result | Advantage: None | Novelty: None | ConceptID: Res2

217

In the EVA descriptor, all variable columns are presented on the same scale with the differences in variance between them intrinsically encoding structurally relevant information.

Type: Method | Advantage: None | Novelty: Old | ConceptID: Met3

218

Consequently, individual columns should not be standardised since this would result in giving columns with little variation and therefore little information, undue weighting.

Type: Method | Advantage: No | Novelty: Old | ConceptID: Met3

219

Indeed, in-house development with proprietary data sets (Shell Research Ltd., the former Sittingbourne Research Centre) has in the past demonstrated that scaling typically does not improve the performance of EVA-based QSAR models.

Type: Background | Advantage: None | Novelty: None | ConceptID: Bac11

220

Log transformation of the y-variate, EC₅₀, leads to a substantially worse EVA model.

Type: Background | Advantage: None | Novelty: None | ConceptID: Bac11

221

This contrasts markedly with the results of a similar transformation in the work of Davis et al., which gave the best result.

Type: Result | Advantage: None | Novelty: None | ConceptID: Res13

222

This is investigated further in section 3.7, below.

Type: Background | Advantage: None | Novelty: None | ConceptID: Bac11

Inclusion of the physicochemical descriptors

223

It is reasonable to ask the question as to whether or not the inclusion of the physicochemical parameters clogP and cMR would improve the EVA model.

Type: Hypothesis | Advantage: None | Novelty: None | ConceptID: Hyp10

224

The results of doing so are shown in Table 8 and provide an insight into the information content of EVA.

Type: Result | Advantage: None | Novelty: None | ConceptID: Res14

225

It is clear that the inclusion of either clogP or cMR, or both, does not lead to an improved QSAR model, even with scaling of the variables in this more complex x-variate vector.

Type: Result | Advantage: None | Novelty: None | ConceptID: Res14

226

The implication may be that the information contained within the physicochemical parameters is already encoded intrinsically within the EVA descriptor.

Type: Hypothesis | Advantage: None | Novelty: None | ConceptID: Hyp11

227

In order to test this hypothesis, namely that EVA encodes physicochemical information, regression models were constructed for clogP and cMR, respectively.

Type: Model | Advantage: None | Novelty: None | ConceptID: Mod7

228

The results (Table 9) show that while EVA accounts for the variation in cMR extremely well (r² = 0.983, q² = 0.963), it does not do so at all well for clogP (r² = 0.415, q² = 0.281).

Type: Result | Advantage: None | Novelty: None | ConceptID: Res15

229

The result with clogP is surprising, since in previous work it has been shown that EVA QSAR can explain the variation in experimentally measured logP values across a wide variety of structurally-diverse, non-related compounds and be usefully predictive (135 compounds, r² = 0.96, q² = 0.68).¹

Type: Result | Advantage: None | Novelty: None | ConceptID: Res15

230

Why should the approach be less successful with the congeneric set of structures described in Table 1?

Type: Hypothesis | Advantage: None | Novelty: None | ConceptID: Hyp12

231

The most likely reason for this anomaly is in the variance in clogP; the range of values is relatively small, i.e., only 3 log units in comparison to 6 log units in the logP study.

Type: Hypothesis | Advantage: None | Novelty: None | ConceptID: Hyp12

Investigation of outliers

232

It was noted that eight of the structures in the data set have corresponding values of EC₅₀ that were approaching zero, ranging from 0.01 to 0.09 (see Table 1).

Type: Background | Advantage: None | Novelty: None | ConceptID: Bac12

233

Given the nature of the bioassay from which they have been derived, within experimental error, they probably may not be significantly different from zero.

Type: Background | Advantage: None | Novelty: None | ConceptID: Bac12

234

Inclusion of these data has the consequence of biasing the activity data set (y-variate) heavily, and introducing noise in the form of irrelevant descriptor information (x-variate).

Type: Hypothesis | Advantage: None | Novelty: None | ConceptID: Hyp13

235

The consequence of removing these data to leave a set of 28 structures and activity scores was investigated.

Type: Goal | Advantage: None | Novelty: None | ConceptID: Goa2

236

Table 10 compares the regression results with those obtained for the original set of 36 structures.

Type: Result | Advantage: None | Novelty: None | ConceptID: Res16

237

For the untransformed EC₅₀ data, there was a slight improvement in both the fit of the full regression equation and in the predictive power, albeit with the addition of a further 2 components in the regression model (6 component model, r² = 0.939, q² = 0.527, cf. a 4-component model, r² = 0.802, q² = 0.498).

Type: Result | Advantage: None | Novelty: None | ConceptID: Res16

238

In contrast, there was a highly significant change in the performance of the regression results for the −log(EC₅₀) scores, Figs. 9 and 10 (5-component model, r² = 0.963, q² = 0.721, cf. a 1-component model, r² = 0.332, q² = 0.194).

Type: Result | Advantage: None | Novelty: None | ConceptID: Res17

239

This compares favourably with the results of a similar transformation in the work of Davis et al. (4-component model, r² = 0.86).

Type: Result | Advantage: None | Novelty: None | ConceptID: Res17

240

More importantly, the q² result would provide significant confidence in support of the application of this model in a predictive role.

Type: Conclusion | Advantage: None | Novelty: None | ConceptID: Con2

241

However, it should be recognised that chemical interpretability will be non-trivial, given the Gaussian transformation and subsequent sampling of the normal modes of vibration.

Type: Conclusion | Advantage: None | Novelty: None | ConceptID: Con2

242

The degree of improvement in the log transformed EC₅₀ was marked, however, it was not unexpected.

Type: Result | Advantage: None | Novelty: None | ConceptID: Res17

243

As was highlighted earlier, PLS analysis can be highly sensitive to the relative scales of variance between the x-block and y-variable.

Type: Method | Advantage: No | Novelty: Old | ConceptID: Met8

244

Certainly, for the log-transformed y-variable set, this variance would have been significant.

Type: Hypothesis | Advantage: None | Novelty: None | ConceptID: Hyp14

245

Furthermore, given the likely experimental error in the bioassay, it is reasonable to postulate that relative differences in the log values for these compounds would have been subject to significant error.

Type: Hypothesis | Advantage: None | Novelty: None | ConceptID: Hyp14

246

Consequently, in attempting to generate a model, these data would have had a greater bias in the log-transformed set.

Type: Hypothesis | Advantage: None | Novelty: None | ConceptID: Hyp14

247

It remains interesting to speculate on whether the sensitivity in model formation was also in part a product of the choice of an essentially linear regression model.

Type: Conclusion | Advantage: None | Novelty: None | ConceptID: Con3

248

While outside the scope of this study, it remains an appealing point for future studies to test the performance of the EVA descriptor across a series of non-linear statistical methods.

Type: Conclusion | Advantage: None | Novelty: None | ConceptID: Con3

Chemical similarity searching

249

The formulation of the EVA descriptor as a bit string opens up the possibility of using the encoded vibrational mode information for chemical similarity searching.

Type: Hypothesis | Advantage: None | Novelty: None | ConceptID: Hyp15

250

There is a substantial body of literature that describes the procedures and software available for this purpose and which points out the strengths and pitfalls of the approach.^20–22

Type: Background | Advantage: None | Novelty: None | ConceptID: Bac13

251

Current procedures are based principally on the use of chemical fragments represented as topological maps, atom counts and bond types.

Type: Background | Advantage: None | Novelty: None | ConceptID: Bac13

252

Use of EVA for similarity searches would provide useful additional information to complement and extend this structural data.

Type: Hypothesis | Advantage: None | Novelty: None | ConceptID: Hyp15

253

Applications might include selective compound acquisition²⁰ and identification of compounds with similar biological activities²¹.

Type: Background | Advantage: None | Novelty: None | ConceptID: Bac13

Conclusions

254

Within the context of the present study, EVA is well suited to the production of usefully predictive QSARs with both biological and physicochemical data.

Type: Conclusion | Advantage: None | Novelty: None | ConceptID: Con1

255

The absence of a requirement for additional descriptors demonstrates that EVA has a very high information content encompassing, intrinsically, physicochemical descriptors such as clogP and cMR.

Type: Conclusion | Advantage: None | Novelty: None | ConceptID: Con1

256

In direct comparison with a composite descriptor (GRID + clogP + cMR), EVA has been shown to have superior predictive performance.

Type: Conclusion | Advantage: None | Novelty: None | ConceptID: Con2

257

As a general tool for QSAR studies, EVA advantageously may be applied without the prior alignment of structures required for other 3-D descriptors such as COMFA and GRID.

Type: Method | Advantage: Yes | Novelty: Old | ConceptID: Met3

258

However, optimisation of the EVA descriptor by adjusting how it is constructed can influence the predictive power of a QSAR equation.

Type: Method | Advantage: No | Novelty: Old | ConceptID: Met3

259

Further studies with other sets of biological data will be necessary in order to demonstrate the robustness of these conclusions.

Type: Conclusion | Advantage: None | Novelty: None | ConceptID: Con3

Abbreviations

260

QSAR, Quantitative Structure-Activity Relationship; EVA {σ, L, S}, Eigen VAlue molecular spectral descriptor model where: σ (sigma) = standard deviation of the EVA Gaussian curve, L = sampling interval of summed intensities and S = sampling interval start point on the bounded frequency scale; BFS, Bounded Frequency Scale (0 to 4000 cm⁻¹); NCA, Normal Co-ordinate Analysis; NMs, Normal Modes (of Vibration); EC₅₀, relative force of contraction; clogP, calculated log octanol/water partition coefficient; cMR, calculated molar refractivity; PLS, Partial Least Squares regression analysis; r², squared correlation coefficient; CV, cross-validation; LOO, ‘Leave One Out’ cross-validation; LGO, ‘Leave One Group Out’ cross-validation; q², squared correlation coefficient of the cross-validation equation; PRESS, Predictive REsidual Sum of Squares; S_cv, cross-validated standard error, where ; n, number of compounds in data set; c, number of PLS components included in QSAR model; F-ratio, weighted ratio of r² to 1.0 − r², where

Type: Background | Advantage: None | Novelty: None | ConceptID: Bac14

Introduction

The EVA (EigenVAlue) descriptor

Experimental

The Calcium Channel Agonist data set

GRID descriptor

The inclusion of physicochemical descriptors

EVA descriptor generation

Statistical analysis

Model validation and significance testing

Results and discussion

Determination of optimum PLS regression model

Optimum EVA descriptor model

Model transformations and variance scaling effects

EVA QSAR model testing and stability

Augmentation of EVA {24,12} with the physicochemical descriptors clogP and cMR

EVA and analysis of the data of Davis et al5,6.

Data scaling

Inclusion of the physicochemical descriptors

Investigation of outliers

Chemical similarity searching

Conclusions

Abbreviations

EVA and analysis of the data of Davis et al^5,6.