kFOIL: Learning Simple Relational Kernels

Niels Landwehr¹, Andrea Passerini², Luc De Raedt¹, and Paolo Frasconi²
¹ Albert-Ludwigs Universität, Freiburg, Germany
² Machine Learning and Neural Networks Group, Università degli Studi di Firenze, Florence, Italy
{landwehr,deraedt}@informatik.uni-freiburg.de

A novel and simple combination of inductive logic programming with kernel methods is presented. The kFOIL algorithm integrates the well-known inductive logic programming system FOIL with kernel methods. The feature space is constructed by leveraging FOIL search for a set of relevant clauses. The search is driven by the performance obtained by a support vector machine based on the resulting kernel. In this way, kFOIL implements a dynamic propositionalization approach. Both classification and regression tasks can be naturally handled. Experiments in applying kFOIL to well-known benchmarks in chemoinformatics show the promise of the approach.

Various successes have been reported in applying inductive logic programming (ILP) techniques to challenging problems in bio- and chemoinformatics, cf. e.g. (Bratko & Muggleton 1995). These successes can, to a large extent, be explained by the use of an expressive general purpose representation formalism that allows one to deal with structured data, to incorporate background knowledge in the learning process, and to obtain hypotheses in the form of a small set of rules that are easy to interpret by domain experts.

On the other hand, support vector machines and kernel methods in general have revolutionized the theory and practice of machine learning in the past decade. These methods do not only yield highly accurate hypotheses; they are also grounded in a solid mathematical theory. However, dealing with structured data and employing background knowledge is harder, as it typically requires one to develop a novel kernel for the specific problem at hand, which is a non-trivial task. Also, the resulting hypotheses are hard to interpret by domain experts.

Given these developments, it can be no surprise that several researchers have started to combine and integrate ideas from ILP with those from support vector machines. First, there has been a significant interest in developing kernels for structured data, cf. (Gaertner 2003) for an overview, in particular for sequences, trees, graphs, and even individuals described in high-order logic (Gaertner, Lloyd, & Flach 2004). All these kernels are fixed before learning takes place and, to the best of the authors' knowledge, a kernel method that directly learns from relational representations is still missing. Second, there is the idea of static propositionalization, in which an ILP problem is turned into a propositional one by pre-computing a typically large set of features, cf. e.g. (Muggleton, Amini, & Sternberg 2005), and then using traditional SVM learning on the resulting representation. An extension of this approach transforms the relational representations into a structured one, by e.g. computing proof-trees for so-called visitor programs (Passerini, Frasconi, & De Raedt 2006). Third, as kernels are closely related to similarity measures, work on distance based relational learning (Ramon & Bruynooghe 1998; Kirsten, Wrobel, & Horváth 2001) should also be mentioned. The drawback of these approaches is that the resulting models are still complex and hard to interpret. In addition, the user typically needs to specify additional information to restrict the number of features generated in the propositionalization process or to encode the distance function, which is often a non-trivial task.

The approach taken in this paper is different. The key idea is to dynamically induce a small set of clauses using a FOIL-like covering algorithm (Quinlan 1990) and to use these as features in standard kernel methods. Applying rule-learning principles leads to a typically small set of rules or features, which are, due to the use of a relational representation, also easy to interpret. Using these features to define a kernel leads to similarity measures amongst relational examples and also allows one to directly tackle a wide variety of learning tasks including classification and regression with support vector machines. Especially the uniform treatment of classification and regression is appealing from an ILP perspective, as these typically require rather different techniques (with possibly the exception of decision trees (Kramer 1996)). In contrast to the three types of approaches mentioned earlier, the kernel or similarity measure is being learned. Also, whereas the resulting model is still a kind of propositionalization, the features are learned dynamically and not pre-computed in advance. Thus a dynamic propositionalization technique results, which is similar in spirit to the nFOIL system (Landwehr, Kersting, & De Raedt 2005), a method that combines FOIL with naïve Bayes and proved to yield significant improvements over traditional ILP methods such as Aleph (an ILP system developed by Ashwin Srinivasan¹) on a number of benchmark problems.

Copyright © 2006, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.
The above sketched idea has been incorporated in the kFOIL algorithm and has been elaborated for classification [...] evaluated experimentally on a number of well-known benchmark [...]

¹ http://web.comlab.ox.ac.uk/oucl/research/areas/machlearn/

We start from an inductive logic programming perspective and then extend it towards the use of kernels.

Traditional ILP approaches tackle the following problem:
• a background theory B, in the form of a set of definite clauses, i.e., clauses of the form h ← b1, ..., bk where h [...]
• a set of examples E in the form of ground facts of an unknown target function y; y maps examples to {+1, −1} (denoting {true, false}) in a classification setting, or alternatively to R, the reals, in a regression setting;
• a language of clauses L, which specifies the clauses that [...]
• a f(e, H, B) function, which returns the value of the hypothesis H on the example e w.r.t. the background theory B;
• a score(E, H, B) function, which specifies the quality of the hypothesis H w.r.t. the data E and the background [...]

In a classification setting, the goal typically is to find a complete and consistent concept-description, i.e., a set of clauses that cover all positive and no negative examples. This can be formalized within our framework by making the following choices for f(e, H, B) and score:
• f(e, H, B) = +1 if B ∪ H |= e (i.e., e is entailed by B ∪ H), and −1 otherwise;
• score(E, H, B) = training set accuracy.

In a regression setting, the goal is typically to find a hypothesis H that minimizes a measure such as the root mean squared error between the target y(e) and the [...]

Let us now show how kFOIL can be formulated within the above sketched definition of inductive logic programming. The notions of examples, language, hypotheses and background theory remain essentially the same. However, it is extended by a notion of similarity between pairs of examples e1, e2 that is defined, as for other kernel methods, by a kernel function. From an ILP point of view, this should take into account the hypothesis H and the background theory B. Thus kFOIL requires a kernel K of the form K(e1, e2, H, B). As the background theory B is fixed throughout the whole learning process, we will from now on omit this argument from the notation. The function K plays a role similar to that of the distances between first-order logic objects used in relational learning (Ramon & Bruynooghe 1998; Kirsten, Wrobel, & Horváth 2001). A support vector machine will then be used in combination with the kernel K to define the f(e, H, B) function.

To define K(e1, e2, H), it is convenient to first propositionalize the examples e1 and e2 using H and B and then to employ existing kernels on the resulting problem. The natural way of doing this is to map each example e onto a vector ϕ_H(e) over {0, 1}^n with n = |H|, having ϕ_H(e)_i = 1 if B ∪ {c_i} |= e for the i-th clause c_i ∈ H, and 0 otherwise.

Example 1 Consider the following background theory B, which describes the structure of molecules:

atm(m1, a1_1, c, 22, −0.11)    bond(m1, a1_1, a1_2, 7)
atm(m1, a1_26, o, 40, −0.38)   bond(m1, a1_18, a1_26, 2)
atm(m2, a2_1, c, 22, −0.11)    bond(m2, a2_1, a2_2, 7)
atm(m2, a2_26, o, 40, −0.38)   bond(m2, a2_18, a2_26, 7)

[...]

pos(X) ← atm(X, A, c, 22, C), atm(X, B, E, 22, 0.02)
[...]
pos(X) ← atm(X, A, c, 27, C), bond(X, A, B, 2)
[...]

H as a logical theory covers both examples. Clauses c [...] succeed on the first example and clauses c [...] second. Consequently, in the feature space spanned by the truth values of the clauses, the examples are represented as [...]

Let us now look at the effect of defining kernels on the propositionalized representation. A simple linear kernel K_L [...] The resulting kernel K_L can be interpreted as the number of clauses in H that succeed on both examples.

Let us formalize the linear kernel introduced in the above example:

K_L(e1, e2, H) = #ent_H(e1 ∧ e2)

where #ent_H(f) = |{c ∈ H | B ∧ {c} |= f}| denotes the number of clauses in H that together with B logically entail f. Intuitively, this implies that two examples are similar if they share many structural features. Which structural features to look at when computing similarities is encoded in the hypothesis H.

This formalism can be generalized to standard polynomial (K_P) and Gaussian (K_G) kernels. Using a polynomial kernel, the interpretation in terms of logical entailment is

K_P(e1, e2, H) = (#ent_H(e1 ∧ e2) + 1)^p,

which amounts to considering conjunctions of up to p clauses which logically entail the two examples, as can easily be shown by explicitly computing the feature space induced by the kernel. Using a Gaussian kernel turns out to [...] where the argument of ent_H can be interpreted as a kind of symmetric difference between the two examples.

[...] over examples in a propositional representation, we only need to employ them within traditional support vector machine methods to obtain effective classification and regression [...]

For instance, using the standard support vector method for classification, the f(e, H, B) function is expressed as

[...] (1)

where {e1, ..., em} are the training examples and y(ei) = 1 if ei is a positive example and y(ei) = −1 otherwise. Similarly, using support vector regression one obtains

[...] (2)

[...] obtained from the theory H using standard support vector [...]

By now, we have formally specified the learning setting addressed by kFOIL. It is the instantiation of the standard ILP problem sketched earlier with the f(e, H, B) function just defined. As scoring functions, kFOIL employs training set accuracy for classification and Pearson correlation or root mean squared error for regression. The key point is that kFOIL, like standard inductive logic programming techniques, must find the right hypothesis H that maximizes its score. Note that this approach differs significantly from the static propositionalization approaches, where H is actually pre-computed and fixed. As kFOIL learns the hypothesis H, this implies that the kernel itself is being learned.

To learn H, kFOIL employs an adaptation of the well-known FOIL algorithm (Quinlan 1990), which essentially implements a separate-and-conquer rule learning algorithm in a [...]

Algorithm 1: [...] let c′ be the c′ ∈ ρ(c) with the best score [...]

The generic FOIL algorithm is sketched in Algorithm 1. It repeatedly searches for clauses that score well with respect to the data set and the current hypothesis and adds them to the current hypothesis. The examples covered by a learned clause are removed from the training data (in the update function). In the inner loop, it greedily searches for a clause that scores well. To this aim, it employs a general-to-specific hill-climbing search strategy. Let p(X1, ..., Xn) denote the predicate that is being learned (e.g., pos(X) for a simple classification problem). Then the most general clause, which succeeds on all examples, is "p(X1, ..., Xn) ←". The set of all refinements of a clause c within the language bias is produced by a refinement operator ρ(c). For our purposes, a refinement operator just specializes a clause h ← b1, ..., bk by adding a new literal b_{k+1}, though other refinements have also been used in the literature. This type of algorithm has been successfully applied to a wide variety of problems in ILP. Many different scoring functions and stopping criteria [...]

The search in kFOIL follows the generic search strategy outlined in Algorithm 1. However, there are three key differences, which will now be outlined. First, when scoring a refined clause, a support vector machine based on the current kernel including the clause has to be built and its performance must be evaluated on the training data. This can be achieved by introducing a loss function V(y(e), f(e)) that measures the cost of predicting f(e) when the target is y(e). Thus score(E, H ∪ {c′}, B) is computed in a "wrapper" [...]

(α1, ..., αm, b) := train_svm(E, H ∪ {c′}, B)
[...]

Here train_svm(E, H, B) trains a support vector machine using the kernel defined by H, while f(e, H, B) computes the prediction according to Equation 1 or Equation 2 for the classification or regression case respectively.

Second, kFOIL cannot use a separate-and-conquer approach. Because the final model in FOIL is the logical disjunction of the learned clauses, (positive) examples that are already covered by a learned clause can be removed from the training data (in the update(E, H) function in Algorithm 1).
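The propositionalization ϕ_H and the kernels built on it, as defined earlier, can be illustrated with a small sketch. This is not the authors' implementation: deciding whether B ∪ {c_i} entails an example would normally require a Prolog-style engine, so here each clause is replaced by a hypothetical Python predicate over a dictionary of made-up structural properties. The names `propositionalize`, `linear_kernel`, `polynomial_kernel`, `gaussian_kernel`, the property keys, and the bandwidth parameter `sigma` are all assumptions introduced for illustration; the Gaussian kernel shown is the standard one applied to the propositionalized vectors.

```python
from math import exp
from typing import Callable, Dict, List, Sequence

# Stand-in for a relational example: a dict of hypothetical boolean properties.
Example = Dict[str, bool]
# Stand-in for "clause c_i together with B entails the example".
Clause = Callable[[Example], bool]


def propositionalize(example: Example, hypothesis: Sequence[Clause]) -> List[int]:
    """phi_H(e): component i is 1 if clause c_i succeeds on the example, else 0."""
    return [1 if clause(example) else 0 for clause in hypothesis]


def linear_kernel(e1: Example, e2: Example, hypothesis: Sequence[Clause]) -> int:
    """K_L: number of clauses in H that succeed on both examples."""
    phi1 = propositionalize(e1, hypothesis)
    phi2 = propositionalize(e2, hypothesis)
    return sum(a * b for a, b in zip(phi1, phi2))


def polynomial_kernel(e1: Example, e2: Example,
                      hypothesis: Sequence[Clause], p: int = 2) -> int:
    """K_P = (K_L + 1)^p."""
    return (linear_kernel(e1, e2, hypothesis) + 1) ** p


def gaussian_kernel(e1: Example, e2: Example,
                    hypothesis: Sequence[Clause], sigma: float = 1.0) -> float:
    """Standard Gaussian kernel on phi_H; the squared distance counts the
    clauses that succeed on exactly one of the two examples."""
    phi1 = propositionalize(e1, hypothesis)
    phi2 = propositionalize(e2, hypothesis)
    squared_distance = sum((a - b) ** 2 for a, b in zip(phi1, phi2))
    return exp(-squared_distance / (2 * sigma ** 2))


# Hypothetical hypothesis H with four clauses, each reduced to a property test.
H: List[Clause] = [
    lambda e: e.get("carbon_type_22", False),
    lambda e: e.get("carbon_27_with_bond_2", False),
    lambda e: e.get("oxygen_type_40_bond_7", False),
    lambda e: e.get("bond_type_2", False),
]

# Two made-up molecules described by the properties the clauses test.
m1: Example = {"carbon_type_22": True, "carbon_27_with_bond_2": True, "bond_type_2": True}
m2: Example = {"carbon_type_22": True, "oxygen_type_40_bond_7": True}

phi_m1 = propositionalize(m1, H)
phi_m2 = propositionalize(m2, H)
k_lin = linear_kernel(m1, m2, H)
k_poly = polynomial_kernel(m1, m2, H, p=2)
k_gauss = gaussian_kernel(m1, m2, H, sigma=1.0)
```

In this toy setting only the first clause succeeds on both molecules, so the linear kernel is 1 and the degree-2 polynomial kernel is 4. Adding a clause to H changes ϕ_H and therefore the kernel, which is the sense in which the kernel itself is learned during the search.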
In kFOIL, this notion of coverage is lost, and the training set is not changed between iterations. Therefore, update(E, H) returns E. Finally, FOIL stops when it fails to find a clause that covers additional positive examples. As an equally simple stopping criterion, learning in kFOIL is stopped when the improvement in score between two successive iterations falls below a given threshold.

The repeated support vector optimizations performed during the search are computationally expensive. However, the costs can be reduced with simple tabling techniques, and by exploiting the fact that the relational example space is mapped to a much simpler propositional space by ϕ_H. There, different relational examples are represented by the same vector, and can be merged to one example with a higher weight. In our experimental study, this typically reduced the time needed to learn a model by one to two orders of magnitude.

In a preliminary evaluation, we compared alternative scores to guide FOIL search, including kernel target alignment (Lanckriet et al. 2004) and various loss functions V in the wrapper-style score algorithm above (hinge loss, 0-1 loss, margin-based conditional likelihood). Kernel target alignment does not require SVM training but the speedup is marginal due to the inherent cost of FOIL and the optimizations outlined above. In addition, local optima problems occurred in conjunction with greedy search. 0-1 loss for classification and quadratic loss for regression yielded the most stable search results and were employed in the experiments reported below. These criteria are known to be associated with the risk of overfitting in the case of propositional feature selection (Kohavi & John 1997). However, the use of independent data, e.g. by using a leave-one-out estimated loss as suggested in (Reunanen 2003), would increase complexity significantly and the more efficient approach of estimating leave-one-out bounds resulted in unstable search.

[...] propositionalization approach developed in kFOIL:

(Q1) Is kFOIL competitive with state-of-the-art inductive logic programming systems for classification?
(Q2) Is kFOIL competitive with state-of-the-art inductive logic programming systems for regression?
(Q3) Is kFOIL competitive with other dynamic propositionalization approaches, in particular to nFOIL?
(Q4) Is kFOIL competitive with static propositionalization approaches?

We conducted experiments on nine benchmark datasets [...] On Mutagenesis (Srinivasan et al. 1996) the problem is to predict the mutagenicity of a set of compounds [...] We used atom and bond information only.

For Alzheimer (King, Srinivasan, & Sternberg 1995), the aim is to compare four desirable properties of drugs against Alzheimer's disease: inhibit amine reuptake (686 examples), low toxicity (886 examples), high acetyl cholinesterase inhibition (1326 examples), and good reversal of memory deficiency (642 examples).

The NCTRER dataset has been extracted from the EPA's DSSTox NCTRER Database (Fang et al. 2001). It contains structural information about a diverse set of 232 natural, synthetic and environmental estrogens and classifications with regard to their binding activity for the estrogen receptor. Again, we used atom and bond information only.

In the Biodegradability domain (Blockeel et al. 2004) the task is to predict the biodegradability of 328 chemical compounds based on their molecular structure and global molecular measurements. This is originally a regression task, but can also be transformed into a classification task by putting [...]

On Mutagenesis, Alzheimer, and NCTRER, kFOIL was compared to nFOIL, the state-of-the-art ILP system Aleph and a static propositionalization approach. We used a variant of the relational frequent query miner WARMR (Dehaspe, Toivonen, & King 1998) for static propositionalization as WARMR patterns have been shown to be effective propositionalization techniques on similar benchmarks in inductive logic programming (Ashwin Srinivasan 1999). The variant used was c-ARMR (De Raedt & Ramon 2004), which allows one to remove redundancies amongst the found patterns by focusing on so-called free patterns. c-ARMR was used to generate all free frequent patterns in the data sets where the frequency threshold was set to 20%. We used at most 5000 of the generated patterns as features to generate (binary) propositional representations of the datasets. On the propositionalized datasets, a cross-validation of a support vector machine was then performed². To evaluate the regression performance of kFOIL, we reproduced the experimental setting used in (Blockeel et al. 2004) and compared to the results obtained in that study for Tilde and S-CART.

² Note that this methodology puts this approach at a slight advantage and might yield over-optimistic results.

As the goal of the experimental study was to verify that the presented approach is competitive with other state-of-the-art techniques, and not to boost performance, we did not try to specifically optimize any parameter. For nFOIL, we used the default settings: maximum number of clauses in a hypothesis was set to 25, maximum number of literals in a clause to 10 and the threshold for the stopping criterion to 0.1%. For kFOIL, we used exactly the same parameters. For both algorithms, a beam search with beam size 5 instead of simple greedy search was performed, as in (Landwehr, Kersting, & De Raedt 2005). Furthermore, a polynomial kernel of degree 2 was used, the regularization constant C was set to 1 for classification and 0.01 for regression, and the tube parameter was set to 0.001. All SVM parameters were set identically for all datasets, and kept fixed during the search for clauses.

Table 1 shows cross-validated predictive accuracy results on Mutagenesis, Alzheimer, and NCTRER. Both kFOIL and nFOIL on average yield higher predictive accuracies than the ILP system Aleph and static propositionalization. kFOIL significantly outperforms nFOIL on two datasets, and a Wilcoxon Matched Pairs Test applied to the results of kFOIL and nFOIL on the different datasets shows that kFOIL reaches significantly higher predictive accuracy on average (p=0.05). These results affirmatively answer questions [...]

Table 1: Average predictive accuracy results on Mutagenesis, Alzheimer and NCTRER for kFOIL, nFOIL, Aleph and static propositionalization. On Mutagenesis r.u. a leave-one-out cross-validation was used (which, combined with the small size of the dataset, explains the high variance of the results), on all other datasets a 10 fold cross-validation. • indicates that the result for kFOIL is significantly better than for other method (paired two-sided t-test, p = 0.05).

Table 2 shows results for the Biodegradability dataset. For regression, we ran kFOIL with scoring based on correlation and root mean squared error, and measured the result using the corresponding evaluation criterion. The results obtained show that kFOIL is competitive with the first-order decision tree systems S-CART and Tilde for classification. For regression, it is competitive at maximizing correlation, and slightly superior at minimizing RMSE. Thus, question Q4 can be answered affirmatively as well.

Table 2: Result on the Biodegradability dataset. The results for Tilde and S-CART have been taken from (Blockeel et al. 2004). 5 runs of 10 fold cross-validation have been performed, on the same splits into training and test set as used in (Blockeel et al. 2004). For classification, average accuracy is reported, for regression, Pearson correlation and RMSE. • indicates that the result for kFOIL is significantly better than for other method (unpaired two-sided t-test, p = 0.05).

kFOIL returned between 2.8 and 22.9 clauses averaged over the folds of the cross-validation, depending on the dataset. Interestingly, the number of clauses in H was always lower than for nFOIL. On the datasets we examined, building a kFOIL model takes up to 10 minutes for classification, and up to 30 minutes for regression. This is of the same order of magnitude as the runtime for the other systems [...]

Finally, we give an example of a learned clause which is meaningful to human domain experts: on the NCTRER [...]

[...] ← atm(B, o), bd_atm(B, C, c, −), bd_atm(C, D, c, =),
      bd_atm(C, E, c, −), bd_atm(E, F, c, =),
      bd_atm(G, D, c, −), bd_atm(F, H, I, −).

It encodes an aromatic ring with a phenol group (a so-called [...] In the study presented in (Fang et al. 2001), the presence of a phenolic ring is identified by human experts as one of the main factors that determine estrogen-binding activity of [...]

We have presented the kFOIL system, which introduces a simple integration of inductive logic programming methods with support vector learning. kFOIL can be considered a propositionalization approach. Two types of propositionalization approaches have been discussed: static ones, in which a typically large set of features is pre-computed, and dynamic propositionalization, in which features are incrementally and greedily generated. As the generation of clauses is driven by the performance of the support vector machine, kFOIL performs dynamic propositionalization. Hence, kFOIL is related to Support Vector Inductive Logic Programming, which combines static propositionalization with support vector learning, and systems like SAYU (Davis et al. 2005), nFOIL, and Structural Logistic Regression (Popescul et al. 2003), which all combine dynamic propositionalization with probabilistic models. In contrast, kFOIL employs kernel based learning, which allows one to tackle classification and regression problems in a uniform framework. Also, kFOIL improved upon nFOIL in terms of predictive accuracy in our experimental study.

From a kernel machine perspective, kFOIL can also be seen as constructing the kernel based on the available data and therefore it has interesting connections to methods that attempt to learn the kernel from data. [...] (Lanckriet et al. 2004) works in the transductive setting (input portion of the test data available when training) and uses a semidefinite programming algorithm for computing the optimal kernel matrix. Algorithms for learning the kernel function include the idea of using a hyperkernel (that spans a Hilbert space of kernel functions) (Ong, Smola, & Williamson 2002) and the use of regularization functionals (Micchelli & Pontil 2005). These approaches are typically more principled than kFOIL (as they learn the kernel by solving well-posed optimization problems). However, the formulation by which the kernel is obtained as a convex combination of other kernel functions would be difficult or impossible to apply in the context of dynamic feature construction in a fully-fledged relational setting. Furthermore, to the best of the authors' knowledge, no other method proposed so far can learn kernels defined by small sets of interpretable [...]

Acknowledgements The authors would like to thank Kristian Kersting and the anonymous reviewers for valuable comments. The research was supported by the European Union IST programme, contract no. FP6-508861, Application of Probabilistic Inductive Logic Programming II.

References

Ashwin Srinivasan, Ross D. King, D. B. 1999. An Assessment of ILP-Assisted Models for Toxicology and the PTE-3 Experiment. In Proc. of ILP'99.
Blockeel, H.; Dzeroski, S.; Kompare, B.; Kramer, S.; Pfahringer, B.; and Laer, W. 2004. Experiments in Predicting Biodegradability. Appl. Art. Int. 18(2):157–181.
Bratko, I., and Muggleton, S. 1995. Applications of Inductive Logic Programming. Comm. of the ACM 38(11):65– [...]
Davis, J.; Burnside, E.; de Castro Dutra, I.; Page, D.; and Costa, V. S. 2005. An Integrated Approach to Learning Bayesian Networks of Rules. In Proc. of ECML'05, 84– [...]
De Raedt, L., and Ramon, J. 2004. Condensed Representations for Inductive Logic Programming. In Proc. of KR'04.
Dehaspe, L.; Toivonen, H.; and King, R. 1998. Finding Frequent Substructures in Chemical Compounds. In Proc. [...]
Fang, H.; Tong, W.; Shi, L.; Blair, R.; Perkins, R.; Branham, W.; Hass, B.; Xie, Q.; Dial, S.; Moland, C.; and Sheehan, D. 2001. Structure-Activity Relationships for a Large Diverse Set of Natural, Synthetic, and Environmental Estrogens. Chemical Research in Toxicology 14(3):280–294.
Gaertner, T.; Lloyd, J.; and Flach, P. 2004. [...]
Gaertner, T. 2003. A Survey of Kernels for Structured Data. SIGKDD Explorations 5(1):49–58.
King, R.; Srinivasan, A.; and Sternberg, M. 1995. Relating Chemical Activity to Structure: an Examination of ILP Successes. New Generation Computing 13(2,4):411–433.
Kirsten, M.; Wrobel, S.; and Horváth, T. 2001. Distance based approaches to relational learning and clustering. In Relational Data Mining, 213–230. Springer.
Kohavi, R., and John, G. 1997. Wrappers for feature subset selection. Art. Int. 97(1–2):273–324.
Kramer, S. 1996. Structural Regression Trees. In Proc. of [...]
Lanckriet, G. R. G.; Cristianini, N.; Bartlett, P.; Ghaoui, L. E.; and Jordan, M. I. 2004. Learning the Kernel Matrix with Semidefinite Programming. J. Mach. Learn. Res. [...]
Landwehr, N.; Kersting, K.; and De Raedt, L. 2005. nFOIL: Integrating Naïve Bayes and FOIL. In Proc. of [...]
Micchelli, C. A., and Pontil, M. 2005. Learning the Kernel Function via Regularization. J. Mach. Learn. Res. 6:1099– [...]
Muggleton, S.; Amini, A.; and Sternberg, M. 2005. Support Vector Inductive Logic Programming. [...]
Ong, C. S.; Smola, A. J.; and Williamson, R. C. 2002. Hyperkernels. In NIPS 15.
Passerini, A.; Frasconi, P.; and De Raedt, L. 2006. Kernels on prolog proof trees: Statistical learning in the ILP setting. [...]
Popescul, A.; Ungar, L.; Lawrence, S.; and Pennock, D. 2003. Statistical Relational Learning for Document Mining. In Proc. of ICDM'03, 275–282.
Quinlan, J. 1990. Learning Logical Definitions from Relations. Machine Learning 5:239–266.
Ramon, J., and Bruynooghe, M. 1998. A Framework for Defining Distances Between First-Order Logic Objects. In [...]
Reunanen, J. 2003. Overfitting in making comparisons between variable selection methods. J. Mach. Learn. Res. [...]
Srinivasan, A.; Muggleton, S.; King, R.; and Sternberg, M. 1996. Theories for Mutagenicity: a Study of First-Order and Feature-Based Induction. Art. Int. 85:277–299.

Source: http://membres-liglab.imag.fr/bisson/cours/M2INFO-AIW-ML/papers/Landwehr06.pdf
