The most used method of finding logical rules from data, inductive logic programming (ILP), has shown successful, but unfortunately not very scalable with increasing problem size. In this report a model for doing induction of logical rules, using the concepts of the potentially more scalable method of genetic algorithm, is suggested.
Five strategies of reducing the search space in the representation are suggested: pruning by logical entailment, pruning by integrity constraints, pruning by logic factorization, pruning by range restrictedness, and pruning using a heuristic fitness function on the cohesion of literals. The genetic operators suggested are applying these pruning search strategies.
The model has yet to be implemented and tried out in an experimental setting.
I would like to thank my supervisor Øystein Nytrø for his initial suggestion of a hypergraph
model for horn clauses, and for being patient enough to give me free hands in my rather
stochastic search towards what is presented in this thesis. I will also thank Geir Sørensen for help with the standard template library for C++, even if I didn't end up using it.
Chapter 3 Graphs, Hypergraphs and Horn Logic 5
3.1 Horn logic representation 5
3.4 A matrix hypergraph representation of horn logic clauses 7
Chapter 4 Mathematical programming 9
4.1 Linear programming models 9
4.2 Linear Programming Solution Methods 9
4.3 Integer Programming Models 10
4.4 Some classic IP problems and their solutions 10
Chapter 5 Evolutionary Computation 13
5.2.2 Probalistic Incremental Program Evolution 15
5.2.3 Genetic Programming and LISP 15
5.2.4 Extensions of traditional Genetic Programming 15
5.3 Populations and generations in GA and GP 16
5.4 Examples of applications of GA and GP 16
Chapter 6 Inductive Logic Programming 17
6.2 Properties of a logic program 18
6.3.3 Relative Least General Generalization (rlgg) 19
7.7 Pruning by demanding range-restrictedness 30
Figure 3.1 Example of a graph 5
Figure 3.4 Example of a hypergraph 7
Figure 3.5 Hypergraph matrix representation of clause 8
Figure 5.1 Flowchart for the traditional Genetic Algorithm 14
Figure 7.1 GILP unconstrained search space vs. the factorial function 22
Table 6.1 Background Knowledge 17
Table 6.2 Positive and negative training examples 18
Table 7.1 Effect of integrity constraint pruning of gilp(n) 28
Table 9.1 - The GILP calculation programs 35
Table 9.2 - The GILP system 35
Genetic Algorithms (GA) and Inductive Logic Programming (ILP) are both research fields of Artificial Intelligence (AI). Both fields are quite young, and have experienced considerable growth during the last decade. ILP (See Bergadano F., Gunetti D. (1996), Inductive Logic Programming - From,See Lavrac N., Dzeroski S. (1993), Inductive Logic Programming - Techniques and,See Nienhuys-Cheng S.-H., Wolf R. de (1997), Foundations of Inductive Logic) is a very specific field, aimed at inductively constructing logic programs, whereas GA (See Heitkoetter J. (1997), The Hitch-Hiker's Guide to Evolutionary Computation,,See Koza J.R. (1992), Genetic Programming - On the Programming of Computers) is a broader field aimed at
solving different types of problems inspired by evolution of species in nature.
More and more information is being stored in databases around the world. As the amount of information stored electronically rises, so does the wish to get something more out of the data than just being able to retrieve what was entered. From this wish, the fields of Knowledge Discovery in Databases (KDD) and the related field Data Mining emerged (See Fayyad U.M., Piatetsky-Shapiro G., Smyth P., Uthurusamy R. (1996), Advances in).
Central in KDD is the development of efficient computerized methods to analyze databases, and ILP can be used to do that with partial success (See Alfheim E., Tveit A. (1996), Inductive Logic Programming Applied in).
Example of results after data mining (See Palmeri C.(1997), Believe in yourself, believe in the merchandise,):
Wal-Mart knows that customers who buy Barbie dolls (it sells one every 20 seconds)
have a 60% likelihood of buying one of three types of candy bars. What does Wal-Mart
do with information like that? "I don't have a clue," says Wal-Mart's chief of
Fortunately, this is not a typical example of a data mining and its interpretations. More
useful applications of data mining include: finding patterns in molecular structures, global climate change models, and numerous of others from science and business (See Fayyad U.M., Piatetsky-Shapiro G., Smyth P., Uthurusamy R. (1996), Advances in).
One problem with ILP systems is the performance, its search strategies are to not known to efficiently utilize modern parallel computer systems. From this wish, and from the success of genetic algorithms and evolutionary methods (See Haynes T.D., Schoenfeld D.A., Wainwrigth R.L. (1996), Type Inheritance in,See Levine D. (1994), A Parallel Genetic Algorithm for the Set Partitioning Problem,,See Chu P.C., Beasley J.E. (1997), A Genetic Algorithm for the Multidimensional,See Chu P.C., Beasley J.E. (1997), Constraint Handling in Genetic Algorithms: and See Koza J.R. (1996), Future Work and Practical Applications of Genetic), the main
goal for this thesis is to create methods of solving ILP problems in GA setting.
This report assumes that the reader has a basic knowledge of logic programming,
algorithmitic complexity analysis, and artificial intelligence. For logic and logic
programming a book such as See Nilsson U., Matuszynski J. (1990), Logic, Programming and Prolog, Wiley should cover the required material. For algorithmic complexity analysis a book such as See Cormen T.H., Leiserson C.E., Rivest R.L.(1992), Introduction to Algorithms, (and possibly ) is recommended. For the artificial intelligence part, the book by Russel and Norvig (See Russel S., Norvig P.(1995), Artificial Intelligence - A modern approach,) is recommended.
Chapter 3 describes graphs, hypergraphs, and how horn clauses can be modelled by
hypergraphs with the corresponding matrix representation. Chapter 4 describes mathematical programming, and how it has been applied to solve combinatorical integer problems which regarding to constraints and variables are similar to the hypergraph model of horn clauses. Chapter 5 describes evolutionary computation and genetic algorithms and programming.Chapter 6 describes inductive logic programming and some characteristics of logic
programs. Chapter 7 describes a horn clause vector model based on the hypergraph model for horn clauses, and suggests and discusses ways of pruning the vector model's search space. Chapter 8 suggests genetic algorithms based on the results from chapter 7. Chapter 9
gives a brief overview of the implementations done. Chapter 10 presents the conclusion and suggestions to further research.
The motivation for performing the experiments in this document is threefold. First I wanted
to investigate a hypergraph matrix model for representation of horn clauses, and if possible improve it towards being applicable in a genetic algorithm (GA) setting. Second, suggest GA operators, and how to create good initial GA population, both regarding to efficient pruning of the search space. The main purpose for the GA operators is to do a stochastic search in a possible set of horn clause hypotheses and to get the best horn clause for a a set of both positive and negative examples, background knowledge and facts. This is similar to the goal of inductive logic programming . Third, implement the suggested GA methods and test them using standard problems known from inductive logic programming.
This chapter describes graphs, hypergraphs and applications of those to horn logic.
The representation of a relation like (See Sperschneider V., Antoniou G. (1991), Logic A Foundation for Computer Science,)
is called a Horn Clause (after Horn who studied them). Logic programming language implementations are usually based on Horn Clause representation of relations for efficiency reasons. Prolog is the most well known logic programming language based on Horn Logic representation, an example of syntax for a clause is shown below.
A graph is a set of points and a set of arrows, where each arrow is joining a pair of points. The points are called the vertices of the graph, and the arrows are called the arcs or the edges of the graph.
A graph G consisting of a set of vertices V and a set of edges E can be defined formally as shown below.
The adjacency matrix for the example graph in See Example of a graphis shown in See Adjancy matrix. Each row represents the start vertex, and each column the stop vertex for an edge, the edge between the vertices m and n exists if position (m,n) in the matrix is 1.
The collection of adjacency lists for the example graph in See Example of a graphis shown in See Adjancy lists. Here each list belongs to a vertex, and the list is ordered based on the vertices the vertex has an edge to.
Space usage for an adjacency matrix representation is
and the space usage for a
collection of adjacency lists is
. When the graph is dense
a matrix representation is favorable, but in a practical problem the graph is usually sparse
and a collection of adjacency lists is the only applicable way of representing the graph.
Examples of problems which can modeled using graphs: topological sorting in planning problems, minimum span tree in electronic circuit layout, shortest path problems.
A hypergraph is a generalization of a graph, allowing edges to be curves connecting more than two vertices (See Berge C. (1976), Graphs and Hypergraphs, North-Holland). A dual representation of a logic clause (See Nytrø, Ø. (1992), Aspects of structure in Horn Clause programs,) like
is a hypergraph as shown in See Example of a hypergraph. The variables are represented as vertices and the predicates as edges.
A possible hypergraph representation for the clause below could be to have a matrix with each column representing parameter positions from left to right, this example will have 13 such columns. Each row in this representation is 1 if a variable number rownum is selected for a column and 0 otherwise.
The matrix using this representation is shown in See Hypergraph matrix representation of clause. This is a very sparse1 matrix because of the natural constraint that you can not have more than one variable per parameter position in the clause. Another natural constraint is that the sum of 1s for each row is constrained by the rowlength. These constraints are formally (for a zero indexed n*n matrix A with elements aij) shown in See Hypergraph matrix constraints
To make this a complete hypergraph representation information about which vertices (variables; Xi for aij=1, i is the row and j is the column number,
) belongs to which edges is needed (predicates, i.e the predicate brothers/3 map to parameter 2, 3 and 4).
The hypergraph matrix representation of a clause is the basis representation for the
suggested GILP data model shown in See The GILP data model and its constraints.
Mathematical programming is `programming' in the sense of `planning' (See Williams H.P. (1993), Model building in mathematical programming,).
The common feature which mathematical programming models have is that they all involve optimization, either by maximizing or minimizing an objective function within certain
constraints. Mathematical programming models can be classified into linear programming models, non-linear programming models, integer programming models or combinations of those.
Due to the integer nature of the hypergraph matrix representation for horn logic, only linear and integer programming model will be described here. For a thorough description of the other types of mathematical programming see See Ravindran A., Phillips D.T., Solberg J.J. (1987), Operations Research Principles and See Williams H.P. (1993), Model building in mathematical programming,.
1. decision variables involved in the problem are nonnegative (i.e. pos. or zero)
2. The criterion for selecting the "best" values of the decision variables can be
described by a linear function of these variables, that is, mathematically function
involving only the first powers of the variables with no cross products. The criterion
function is normally referred to as the objective function.
3. The operating rules governing the process (e.g., scarcity of resources) can be
expressed as a set of linear equations or linear equalities. This set is referred to as the
An example of a general LP minimization problem is shown below
The most used method to solve LP problems is the simplex method (or variants of it)
developed by G.B. Dantzig. The simplex method is an iterative procedure for solving LP problems in standard form. To define the principles of the simplex methods two definitions are needed to describe the state of the LP system.
1. Basic variables (variables with value greater than 0) with coefficients different
from zero only exists in one of the equality constraints
2. Maximum of one such non-zero basic variable coefficient per equality constraint.
1. The objective is of the maximization or minimization type.
2. All constraints are expressed as equations.
1. Start with an initial basic feasible solution in Canonical form
2. Improve the initial solution if possible by finding another basic feasible solution
with a better objective function value. At this step the simplex method implicitly
eliminates from consideration all those basic feasible solutions whose objective
function values are worse than the presented one.
3. Continue to find a better basic feasible solutions improving the objective function
values. When a particular basic feasible solution cannot be improved any further, it
becomes an optimal solution and the simplex method terminates.
An integer programming model is a specialized LP model.
1. The requirements in See Requirements to be a Linear Programming (abbreviated LP) ([21])Themust be satisfied
2. The possible values of the decision variables must be in the domain of natural
Examples of problems which can modeled as a PIP : the set partitioning problem,
the knapsack problem, the transport problem and the assignment problem.
The transport problem is minimizing the cost of transportation from a set of suppliers to a set of users given constraints like supply limits and that the demand requirements need to be fulfilled.
This problem can be solved by applying several (heuristic) strategies (See Ravindran A., Phillips D.T., Solberg J.J. (1987), Operations Research Principles), i.e. the northwest corner rule, the least cost rule or Vogel's approximation method. These methods have all in common that they represent the cost matrix as a table, where each row corresponds to a supplier and each column to a user (demand). Each cell in the table is the cost of transport from a supplier to a user. The differences between the methods is the strategy of variable selection. These methods does not guarantee an optimal solution, optimality can be checked by applying the stepping-stone method. A method that guarantees an optimal solution is the
u-v method, this method uses optimality checks (similar to the ones in the simplex method) for each variable it takes into the basis until an optimal solution is found. If an optimal
solution doesn't exist, it will degenerate.
The assignment problem is minizing the cost of allocating a number of items of one type for a number of items of another type, i.e. allocating machines to jobs (See Ravindran A., Phillips D.T., Solberg J.J. (1987), Operations Research Principles).
This problem can be solved either by transforming it to a transport problem, or more
efficiently, by applying the hungarian method. This method is based on that there is no change in the optimal solution of a linear program by adding or subtracting a constant to the objective function, this is done by subtracting a sufficiently large number from rows and columns in the cost matrix until a solution is found by inspection.
The max flow problem is to find maximum flow in a graph from a source to a sink.
This problems maximum value has an upper bound of the minimum capacity of a cut, where the sink and the source belongs to different sides of the cut. This problem can be solved by using the Ford-Fulkerson or The Edmonds-Karp algorithm (See Cormen T.H., Leiserson C.E., Rivest R.L.(1992), Introduction to Algorithms, and See Ravindran A., Phillips D.T., Solberg J.J. (1987), Operations Research Principles).
The shortest path problem is to find the shortest path between nodes in a network (the path length can be seen as the cost).
To find the shortest path between two nodes in a network, Dijkstra's algorithm can be applied. To find the path between several nodes in sparse graph, Johnson's algorithm is
preferred (See Cormen T.H., Leiserson C.E., Rivest R.L.(1992), Introduction to Algorithms,).
The branch and bound algorithm is the most commonly used algorithm to solve IP problems in practice (See Ravindran A., Phillips D.T., Solberg J.J. (1987), Operations Research Principles, See Williams H.P. (1993), Model building in mathematical programming,).
It is based on first solving the IP problem with the simplex method as if it was a regular LP problem, then select a variable with a non-integer value and create two subproblems added new constraints for the selected variable, if x1 (=7.6) is the selected variable from an optimal LP basis, the first subproblem will get the constraint x1 <= 7, and the second
subproblem will get the constraint x1 >= 8. Then solve the new subproblems. If the
subproblems don't give an integer solution, branch on a new variable.
The first feasible integer solution is the greatest lower bound for the optimal solution, and nodes with values less than the least upper bound are fathomed, that is, any integer solution
from subnodes of these are going to be less than the greatest lower bound. Further
branching on such nodes is thereby unnecessary since a better solution can't be obtained. If a node gets an unfeasible solution, it is also fathomed, and no further subproblems are investigated. Branch and bound terminates when there are no more open nodes in the
problem tree, then the largest integer solution in the tree is the optimal one. (There is though possible to have a branch and bound tree with only non-integer solutions and unfeasible solutions, which means a feasible integer solution couldn't be found)
Selection of which variables to branch on is usually done by either selecting the variable with the largest fractional value, the variable with the relatively largest fractional value (i.e. if x1=1.3 and x3 = 4.3, x1 will be selected), or selection of the most important variable.
In addition to the LP relaxation of an IP problem solved by branch and bound, other
relaxations like lagrangean, surrogate and composite relaxations which gives tighter
constraints than the LP ones, have been used together in the branch and bound algorithm
(solving of the multidimensional knapsack problem, see See Chu P.C., Beasley J.E. (1997), A Genetic Algorithm for the Multidimensional and See Gavish B., Pirkul H. (1985), Efficient algorithms for solving multiconstraint)
Gomory's method (See Nygreen B. (1980), Heltallsproblemer (Integer Problems), Lecture book in) is similar to branch and bound since it is adding new constraints and using LP relaxation, the difference is that it is also adding new variables, thereby increasing the dimension of the basic matrix and the number of calculations per new
solution. If there is both an upper and a lower limit for the objective function, Gomory's method is proved to have convergence.
Enumerative methods are only looking at integer solutions (represented as paths in trees),
since there is an exponential number of possible integer solutions (i.e. 2^n for n binary
variables), "smart" selection and pruning operators have to be applied. I.e. in implicit
enumeration (See Nygreen B. (1980), Heltallsproblemer (Integer Problems), Lecture book in, See Ravindran A., Phillips D.T., Solberg J.J. (1987), Operations Research Principles), there are two possibility test operators and two optimality test operators used for pruning of the search tree.
Dynamic programming solves problems by combining the solution of subproblems (See Cormen T.H., Leiserson C.E., Rivest R.L.(1992), Introduction to Algorithms,). It is only applicable when subproblems are not independent, that is, they share subsub-
problems. Each time a new subproblem is solved, its solution is stored such that other subproblems sharing the stored subproblem can use the stored value instead of doing a
recalculation, thereby saving work compared to applying the divide-and-conquer principle on the same problem which would have recalculated everything.
The development of a dynamic programming algorithm is done in four steps (See Cormen T.H., Leiserson C.E., Rivest R.L.(1992), Introduction to Algorithms,)
Dynamic programming have been applied to integer problems like the knapsack problem (See Ravindran A., Phillips D.T., Solberg J.J. (1987), Operations Research Principles), the shortest-path problem (Dijkstras algorithm can be formulated as a dynamic
programming algorithm), and in speech recognition in combination with Hidden Markov Models (See Russel S., Norvig P.(1995), Artificial Intelligence - A modern approach,).
Dynamic programming has also been applied in non-linear integer programming (See Ravindran A., Phillips D.T., Solberg J.J. (1987), Operations Research Principles)
Evolutionary Computation (EC) is usually divided into five different methods (See Heitkoetter J. (1997), The Hitch-Hiker's Guide to Evolutionary Computation,); Genetic Algorithms (GA), Evolutionary Programming (EP), Evolution Strategies (ES), Classifier Systems (CFS), Probalistic Incremental Program Evolution (PIPE) and Genetic
Programming (and there also exist some less well known problem solving methods more or less connected to EC).
All the methods have their strategies motivated by Darwin's principles about the means of natural selection, survival of the fittest, and the theories of evolution. The methods don't have strategies for the simulation of mutual influence, like in that of a "society". This
strategy is though supported by the related field artificial life.
Since this thesis is to mostly related to GA and GP, only these method will be described here. (See See Heitkoetter J. (1997), The Hitch-Hiker's Guide to Evolutionary Computation, for a description of the other methods)
Genetic Algorithms is a model of machine learning inspired by some of the mechanisms of evolution in nature. It is usually applied to optimization problems, similar to integer
programming (section See Integer Programming Models). The model consists of a set of character strings, called a
population of individuals represented by chromosomes, and a set of operators for those character strings, called genetic operators. The most common chromosome type is a fixed length string of binary values. An overview of the traditional GA algorithm is shown in
See Flowchart for the traditional Genetic Algorithm(See Koza J.R. (1992), Genetic Programming - On the Programming of Computers). Each time all the chromosome individuals in the population have been used together with an operator, the population is going from one generation to the next.
The fitness measure function gives a value for a chromosome, usually scaled to a value between 0 to 1 (if possible). If solving a mathematical linear programming problem like See Example of a LP minimization problem (assuming integer variable requirements), the fitness function could be equal to the negated value of the objective function, such that it is transformed into a GA maximization problem. However, there is nothing wrong in having the objective function as the GA
fitness function, but that requires the selection criterias to be changed such that the element with the least fitness value is most likely to survive.
Depending on the fitness measure or estimate, the chromosome will survive unchanged from one generation to another, this is controlled by the reproduction operator.
Depending on fitness for three randomly selected chromosomes from the population, the two fittest of them will be combined systematically (crossed) to create a new
chromosome replacing the least fittest chromosome. There also exists other selection
criterias, i.e. one where two fit (relative to the rest of the population) are selected, then
crossovered and the least fit element of the two "parents" are selected for replacement.
This operator is contributing by combining fit elements to possibly get an even fitter one, though depending on how the crossover is done.
The mutation operator does a random change in a chromosome. This is often to "loosen up" the population in cases of premature convergence. The mutation operator is usually the only operator to do such a dramatic change, even if there can exist crossover operators with some
randomness (See Chu P.C., Beasley J.E. (1997), A Genetic Algorithm for the Multidimensional).
There can exist many problem specific operators, i.e. a repair operator (See Chu P.C., Beasley J.E. (1997), A Genetic Algorithm for the Multidimensional), which is used to guarantee feasibility even after random mutation in a chromosome.
The main differences between GA and Genetic Programming (GP) is that GP extends the representation, such that program structures and not only character strings can be
represented (See Koza J.R. (1992), Genetic Programming - On the Programming of Computers). Another difference is that GP usually doesn't allow mutation, but allows growing of the chromosome in size (that is, program size), i.e. a crossover operation in GP will usually cause the chromosome to grow in size. Since programs can be represented directly in the structure there is usually no preprocessing of inputs or postprocessing of outputs required, the inputs, intermediate results and outputs are usually expressed in the terms of the natural terminology of the problem (See Koza J.R. (1992), Genetic Programming - On the Programming of Computers).
Probabilistic Incremental Program Evolution (PIPE) is a novel method similar to GP (See Salustowicz R., Schmidhuber J. (1997), Probalistic Incremental Program,See Salustowicz R., Schmidhuber J. (1997), Probalistic Incremental Program:), except that instead of a crossover operator, it uses a "probalistic prototype tree" to combine experiences of different programs to generate better programs. On some
regression problems it has been shown to give better solutions than GP, but it also has more variance in the fitness within the population.
Genetic Programming applications frequently use LISP (See Haynes T.D., Schoenfeld D.A., Wainwrigth R.L. (1996), Type Inheritance in, See Koza J.R. (1992), Genetic Programming - On the Programming of Computers) as the main chromosome
representation language, one reason for the selection of LISP is that both programs and data have the same form, S-expressions. This makes it possible to do genetic manipulation of the program (as data), and then execute it to check fitness (as a program).
The other main reason for selection of LISP, is that LISP's S-expressions are equivalent to the parse tree for the computer program, this parse tree is needed to make syntactically
correct genetic changes in the program. In other languages this parse tree is usually not accessible, and if it were accessible it would have another syntax and representation than the language itself.
Several suggestions has been made to extend the GP model, this has mainly been motivated by the means of increasing the domain (See Koza J.R. (1996), Future Work and Practical Applications of Genetic) of programs which GP can
represent and handle, but also because of performance issues.
Traditional GP doesn't support data structures known from software engineering, i.e. queues, stacks, lists and more complex data structures. Since a human programmer is used to data structures like those, there is a gap between programs generated by GP and the ones a human programmer is used to handle. By allowing the mentioned data structures the code generated by GP will also be more compact and being able to express more information in a shorter length. William B. Langdon was able to evolve a stack structure (See Langdon W.B. (1995), Evolving Data Structures with Genetic Programming,, See Langdon W.B. (1996), Using Data Structures within Genetic Programming,) using GP where each individual consisted of five trees representing five stack operations: makenull, top, pop, push and empty. The GP primitives used to develop the operations were: the
constants 0 and 1, max(10), the variable aux, the operations Inc_Aux and Dec_Aux to increase and decrease the variable aux.
To decrease the search space and thereby possibly increase performance in GP, Strongly Type Genetic Programming (STGP) can be applied. This is done by creating inheritance hierarchies of types with a maximum tree depth (See Haynes T.D., Schoenfeld D.A., Wainwrigth R.L. (1996), Type Inheritance in), and by using these types, a
A good rule of thumb for an initial population is that it should have a broad selection of individiual chromosomes, preferebly all uniqe, but there is usually no rule on how large a
population should be, but there has been a tendency towards big populations and few
generations. Gathercole and Ross (See Gathercole C., Ross P. (1997), Small populations over Many Generations can beat) has shown that for some problems that there is an advance in having a relatively small population, but instead having more generations and apply Dynamic Subset Selection (DSS) and Limited Error Fitness (LEF) learning methods in addition to GP. One argument in which they explain the results, is that there might be a larger probability for many small increments in fitness per generation in a GP solution with a small population, compared to the probability of large increments per generation in a GP solution with a large population.
GA and GP has been shown to give good (though not guaranteed optimal) solutions on NP-complete problems like the facility layout problems (See Garces-Perez J., Schoenefeld D.A., Wainwright R.L. (1994), Solving Facility), the multidimensional knapsack problem (See Chu P.C., Beasley J.E. (1997), Constraint Handling in Genetic Algorithms:) and the set-partitioning problem (See Levine D. (1994), A Parallel Genetic Algorithm for the Set Partitioning Problem,,See Chu P.C., Beasley J.E. (1997), A Genetic Algorithm for the Multidimensional). These are problems previously solved using mathematical programming concepts like LP or IP (see chapter See Mathematical programming). In
automatic discovering of protein classification rules (the D-E-A-D box family of proteins,
see See Koza J.R. (1996), Future Work and Practical Applications of Genetic), rules discovered by GP is better than any known rules found by other methods or by human experts. GA has also been applied to the autoparallelization of algorithms (See Koza J.R. (1996), Future Work and Practical Applications of Genetic), this is a highly relevant problem, since new processors will probably rely on "smarter"
compilers to create instructions to be run in parallel (i.e. HP and Intel's planned Merced CPU, see Byte magazine december 1997). Genetic Algorithms in the form of evolvable hardware applying devices like field programmable gate arrays (FPGA), have been applied to evolve a frequency discriminator circuit and a robot controller, this field is expected an explosive growth (See Koza J.R. (1996), Future Work and Practical Applications of Genetic)
Inductive logic programming (ILP) can be seen as the intersection between logic programming and inductive machine learning (See Muggleton S., Raedt L.D.(1994), Inductive Logic Programming: Theory).
Machine learning is an area within AI which explores how computers can learn (See Midelfart, H. (1996), Knowledge Discovery in Databases applied in Inductive). One motive for developing ILP theory was the lack of expressiveness in machine learning.
There have been attempts to solve this problem by adding a 1.order logic framework from logic programming with possibilities to use substantial background knowledge and
ILP can be explained more precisely by first describing the concepts of logic programming and induction separately. The main idea of logic programming (LP) is to use a computer to draw conclusions from declarative descriptions. A logic program is a finite set of logic
formulas describing facts and relations. LP's main principle of inference is deduction, which is the derivation from facts and general rules to new facts. An algorithm for
mechanical deduction called resolution is used.
Induction is described as "the act, process, or result or an instance of reasoning from a part to a whole, from particulars to generals, or from the individual to the universal" in
Webster's dictionary. This can be summarized into that induction is the creation of general rules based on facts and examples.
ILP is divided into two major sub fields, interactive ILP and empirical ILP. Interactive ILP is most often used on relatively small data sets and the user has possibilities to participate during learning of clauses. Empirical ILP is automatic extraction of regularities from big and possibly imperfect data sets without user interaction (See Lavrac N., Dzeroski S. (1993), Inductive Logic Programming - Techniques and).
Some areas where ILP is applied: drugs' structure activity rule discoveries, learning rules for predicting protein secondary structure, automated program generation, learning rules for chess playing and in general as a tool for data mining and knowledge discovery in databases (See Lavrac N., Dzeroski S. (1993), Inductive Logic Programming - Techniques and).
Here is an example of the facts, examples and rules which inductive logic programming can start to induce from.
In See Background Knowledgethe known background information is represented, e.g., it is not possible to be a mother without being a parent. In See Positive and negative training examplesrepresents true and false observations. The
reason why ILP algorithms need negative examples is to specialize a generated rule that is inconsistent with negative examples.
The task here is to learn the relation daughter(X,Y) with the background knowledge
predicates female and parent. Using ILP methods the concluded rule, called a hypothesis is shown below
daughter(X,Y)
female(X), parent(X,Y).
This inference is not deductively sound as we have generalized beyond what was known with certainty.
The set of positive examples is in this setting called E+, and correspondingly for negative examples, E-.
A logic program P is complete (with respect to E+) iff, for all examples e
E+, P |- e.
A logic program is consistent (with respect to E-) iff, for no example e
E+, P |- e.
A horn clause C is range-restricted iff every variable in the head of C is found in
- subsumption (See Midelfart, H. (1996), Knowledge Discovery in Databases applied in Inductive)
A clause
- subsumes a clause
denoted by
.
is a
Inverse resolution (See Fayyad U.M., Piatetsky-Shapiro G., Smyth P., Uthurusamy R. (1996), Advances in) is used to produce general clauses from specific examples by
inverting the resolution rule of deductive inference. For generalization inverse substitution is used, which mean replacing constants with variables and at the same time ensuring that the original example(s) can be restored by ordinary substitution.
Least general generalization (lgg) of two clauses is such that by applying two substitutions on the same variable in the lgg, the two clauses can be regenerated, e.g.
A rlgg is the most specific generalization of two clauses relative to the given background information (See Lavrac N., Dzeroski S. (1993), Inductive Logic Programming - Techniques and).
Most ILP applications use top-down searching, starting with the most general hypothesis and it clause until it supports none negative examples.
Progol is currently the best performing ILP system, it uses top-down searching like MIS and FOIL, but searches in a lattice which is bounded both above and below and is therefore
better constrained (See Muggleton S. (1995), Inverse entailment and Progol, Oxford University). To search the lattice which is ordered by
-subsumption, Progol uses an A* - like algorithm with refinement operators (See Russel S., Norvig P.(1995), Artificial Intelligence - A modern approach,).
Golem is the a well performing ILP system based on bottom-up searching by applying
relative least generalization. Other methods for bottom-up searching are inverse resolution and inverse implication (See Bergadano F., Gunetti D. (1996), Inductive Logic Programming - From).
The GILP data model must be able to represent horn clauses in a clear and
non-ambiguous way eligible for genetic algorithm operations and representation.
In this chapter vector and clause are both used to describe a generated hypothesis.
In the hypergraph matrix representation constraints formalized in See Hypergraph matrix constraintsthe first
constraint is unnecessary since the second one makes sure there is maximum one
occurrence of the value 1 per column and there are n rows which leads to a maximum row sum of n.
If starting from the left in the matrix and generating elements in columns a new variable (that is a new row without a value 1, getting a value 1). What this row number is for a new variable doesn't have any importance, so to avoid duplicate semantics for different matrices, the least free row number will be selected when entering a new variable. This can be formulated with adding a new vector of dimension 1*n
consisting of elements
which is the minimum available row index at column i. Since it is the minimum available row number at position i, it is greater than the maximum j for akj=1 and k < i. Since the maximum number of new variables going one column right is 1, constraints between different elements in
are added. The new model is shown below.
Since there is only one variable per column in the matrix model it could be transformed into a vector model storing the variable number which is active for each column in a vector. This is done to decrease the storage requirements and the total number of constraints. The vector formulation is shown below
All investigation of the search space is based the vector constraint model presented in See GILP Vector Model Constraints.
The search space for GILP is the number of allowed vectors for a given vector length.
It is from this point called gilp(n) where n is the vector length.
The first and the second constraint in See GILP Vector Model Constraints ensures that the growth in number of
variables from position k to k+1 is less than or equal to 1 and m0 = 1
This has very high2 running time if solved in a straight order fashion. Since there are many common recursive steps there are overlaps of calculations of the various
and a program based on the dynamic programming principle was developed (see appendix A-See gilpcalc.cc)
A sub-searchspace of gilp(n) is in this context the number of vectors with a specific number of vector elements different from zero.
Since comb(n,k) = n!/(k!(n-k)!), the left side of the equality symbol is then
simplified to n/(n-k) and lim of that when k is fixed and n goes towards infinity is 1.
Since by increasing the vector length n by 1, you keep the search space for the
previous vector length but also allows more combinations, like adding all the
vectors with non-zero of length k-1 from the vectors generated for length n and
having a non-zero element in the new position allowed, thereby increasing the
overall search space. From this we can say that gilpnz(n+1,k)/gilpnz(n,k) > 1
By inspection of the data calculated for each length, an assumption of that
comb(n+1,k)/comb(n,k) > gilpnz(n+1,k)/gilpnz(n,k) seems to hold for all data and
If the previous assumption holds, applying the squeeze theorem to gilpnz(n+1,k)/
gilpnz(n,k) shows that the limit for any allowed k when n goes towards infinity is 1,
since gilpnz(n,k) is growing slower than comp(n,k) then by lemma 7.3 gilpnz(n,k) is
A variable in a gilp clause is not allowed to unify to the same values as the other
A consequence of definition See Constraints on gilp clause variable unification is that a generated clause
In See gilpnz(n,k) for k of 5, 10, 20, 40 and 100there is an example of the growth of gilpnz(n,k) for k of 5, 10, 20, 40 and 100,
as we can see from the figure, the curves are growing slower than straight lines and since the y-axis scale is logarithmic that supports the previously proved non-exponential growth of gilpnz(n,k) for any fixed k.
gilpcompfit(H,E+) is equal to the number of unique members of E+ matched by
the query Head,Body where Head and Body are the head and body part of the
Horn Clause hypothesis H , divided by the total number of positive examples in E+
gilpcompfit(H,E+) is a number between 0 and 1 (inclusive), if gilpcompfit(H,E+) is equal to 1, H is complete (see definition See Completeness of a logic program ([1])).
gilpconsfit(H,E-) is equal one minus, the number of unique members of E- matched
by the query Head,Body where Head and Body are the head and body part of the
Horn Clause hypothesis H divided by the total number of negative examples in
gilpconsfit(H,E-) is a number between 0 and 1 (inclusive), if gilpconsfit(H,E-) is equal to 1, H is consistent (see definition See Consistency of a logic program ([1])).
Because of the formulation of gilp vectors, having a gilp vector
with at least one vector element equal to zero, if adding an allowed non-zero number to that currently zero position, the new vector will be logically entailed by
. That means that
is more general than the new vector. If
is checked for completeness fitness and found to be unfit there is no need to check the new vector since it is more special and it is upper bound by the
gilpcompfit(
,E) regarding to completeness fitness.
The reason why the completeness fitness function is used is the fact that the search space gilp(n) is "very large" and this is a way to prune clauses in gilpnz(n,k2) based on failed clauses in gilpnz(n,k1) where k1 < k2 <= n and the vector in gilp(n,k2) is entailed by a vector in gilp(n,k1).
If a vector generated get a non-zero completeness fitness the consistency fitness gilpconsfit will be measured. The reason why this isn't done for every generated vector is that the
gilpcompfit() and gilpconsfit() functions are computationally heavy operations and it is
usually not of great interest in a practical problem to measure the comsistency fitness of generated vector which has a zero completeness fitness. (Though, this could be of
1. Check if
(in gilpnz(n,k2)) is entailed by investigating each tree of stored
failed hypotheses for lengths from 0 to k2 - 1, if such an element is found give
the generated vector completeness fitness of 0
2. If no entailment vector is found calculate gilpcompfit(
,E+), if it is 0, store
it in the failed-tree for gilpnz(n,k2) (tree number k2)
3. If gilpcompfit(
,E+) > 0, calculate gilpconsfit(
,E-), if it is equal to 1
(
is complete) remove the positive examples covered by
One problem with this algorithm are the storage requirements to store one tree of failed clauses for each k from 0 to n in gilpnz(n,k), the partial solution to this could be to store trees with a non-zero length of k up to a certain percentage of the vector length n.
I.e. for a template clause length of 40, an upper bound for the number of vectors with a non-zero length of 6, is gilpnz(40,6) = 6.62*10^8, which is only a 3.31*10^(-28) part of gilp(40) = 2.35*10^36.
6.62*10^8 is still a significantly high number nodes of vectors in a binary tree regarding to memory requirements3, but this number will be significantly reduced by other pruning
operations, like integrity constraints (see section See Pruning using Integrity Constraints), logic factorization (see section See Pruning using logic factorization), requirements of connectivity between predicates in a clause (see section See Pruning using heuristics on literal cohesion), and requirements that a generated clause has to be range-restricted (see section See Pruning by demanding range-restrictedness).
Another feature of the entailment algorithm is that the vectors of short non-zero lengths are the most general and when the tree is starting to fill up4 many new generated
vectors will found to be entailed and thereby removed as a possible hypothesis.
Integrity constraints(See Nytrø Ø., Østvold B.M. (1996), Pruning the specialization search space using) are restrictions on a predicate parameter variable binding constrained by previous bindings of that variable. A simple example can be for the predicate father/2, the (natural) integrity constraint for the second parameter is that it can't be bound to the same variable as the first parameter was bound to, i.e. father(X,X) is not allowed (a father can't be the father of himself), but father(X,Y) is valid by integrity constraints. There can also be integrity constraints between predicates in a template clause, in:
there are integrity constraints5 between the first parameter in father/2 and the parameter in female/1 (a father can't be a female), father/2 is the head predicate. The unconstrained search space gilp(8) = 21147, and the integrity constrained search space for this example template = 4496 (the unconstrained search space is reduced with 78.74%).
Since the number of possible variables for a position p in a gilp vector is growing with p,
there seems to be a point in positioning the predicates such that integrity constraints occur early (as much to the left as possible) in the template clause, this could possibly prune a larger part of the search space - gilp(n) than an integrity constraints further to the right in the template clause. The template clause in See father/2, brothers/2, female/1, mother/2, male/1 could be rearranged to
to get the following collision-position pairs: (3,1), (4,3), (5,3), (6,3), (7,1), (7,4), (7,5) and (7,6) instead of the previous collision-position pairs: (5,1), (5,3), (5,4), (6,1), (6,3), (6,4), (8,5) and (8,6). The new constrained search space is still the same as the old one, 4496.
By splitting up the predicates and allowing parameters for different predicates to be moved around independently the clause could be rearranged to
getting the corresponding collision-position pairs: (3,1), (4,1), (5,3), (5,4), (6,3), (6,4), (7,3) and (7,4). There gives no decrease in the constrained search space, it is still 4496. By
allowing splitting of the head predicates the clause can be rearranged to
getting the corresponding collision-position pairs: (2,1), (3,1), (4,2), (4,3), (5,2), (5,3), (6,2)
and (6,3). This last template search space is still 4496. Reconsidering the assumed pruning effect of rearranging predicates or parameter places, it seems plausible that there is no effect in the rearranging of parameter places regarding to search space, this is because moving an integrity constraint to the right will prune smaller subtrees of the search space, but it will also prune an increasing number of subtrees. Rearranging integrity constraints is only a
syntactic change, not a change of the allowed semantics of clauses. There is still a point in having the integrity constraints as far left as possible, in an iterative or recursive program for generation of clauses the running time would be better caused by fewer and larger
In See Effect of integrity constraint pruning of gilp(n)effects of pruning using integrity constraints are shown. The minimum row is the minimum search space obtained for a generated integrity constraint, the maximum row is the maximum search space obtained for a generated integrity constraint with a upper bound of gilp(n). To generate integrity constraints a program accepting 3 parameters: vector length, probability for having zero integrity constraints for a parameter position and a
ratio (between 0 and 1) limiting the maximum number of integrity constraints for a
parameter, a parameter with more than zero integrity constraints are limited to the integer value of the product fraction value times (position number - 1), i.e. with a fraction of 0.5 for position 8 there there can be a maximum of trunc(0.5*(8-1)) = 3 integrity constraints. The integrity constraint generator program can be found in in appendix A-See gilp_generator.perl and the calculation program in appendix A-See gilpcalc_ic.cc.
The reason why there is a decrease in number of trials for increasing n is the exponential growth in time it takes to search with integrity constraints, and dynamic programming can't be applied since there are dependencies both to positions to the left and right for any given
In a gilp template clause, there will typically be repeated use of predicates, and when
generating a clause using that template, there is no gain in information to have repeated
predicates with the same variables setup. I.e. if a part of a clause looks like
the last occurence of father(X,Y) is superfluous. Removing such repeated predicates is a part of doing logic factorization, which is removing repeated and superfluous predicates in a horn clause until no more predicates can be removed without altering the semantics of the clause. In logic factorization of a generated clause like
the last two father/2 predicates will be removed, but this will not be done by the
single-predicate logic factorization suggested in gilp.
The strict constraints for unification of variables in definition See Constraints on gilp clause variable unification will probably also help pruning the search space, since the completeness fitness for the clause above is the upper bound for the completeness fitness of the same clause with unification constraints.
High cohesion of literals in a clause means that there is a high number of repeated
occurrences of variables in different parameter positions in a clause. I.e.
In the extreme case of low cohesion, there are no reoccurrences of variables in a clause, this is equivalent of having independent predicates in a clause, which means that a fitness query for such a clause is the same as making the fitness query for each individual clause
consisting of one predicate in the order given by the template mesh, which is (usually) of no interest. The other extrema is the highest cohesion, where only one variable exists
throughout the parameter places of the clause, then the expressibility of the clause is almost reduced to that of a propositional logic, since the predicates "have taken the role as" atoms.
To avoid getting close to the possible cohesion extremas mentioned for a generated clause, a heuristic approach is used. Since the reoccurrences of variables in parameter places is
connected to the cohesion, the heuristic function here is estimated based on those.
The heuristic function gilpheuristic requires the following elements:
which registers variable occurrence in predicates, initially all the elements are initialized to -1, when a variable k is added to a position mapped to predicate m. if(vc(k) < maxpredicate+1) then begin if(vc(k) = -1) then vc(k) = m else if(v(k) != m) then v(k) = maxpredicate+1 end. From
three questions about a variable k can then be answered, 1. is it use? (v(k) != -1), 2. is it used in exactly one predicate? (-1<v(k) < maxpredicate+1), and 3. is it used in more than 1 predicate (v(k) = maxpredicate + 1).
needs a variable for the number of elements equal to maxpredicate+1.
of variable numbers counting number of occurences per variable, i.e. if variable 4 is created at first, vn(4) = 1, if variable 4 reoccurs, just do increase(vn(4),1).
needs a variable for the number of elements greater than 1.
of occurences of variable occurences, i.e. is if variable 3 and 5 are the only ones with 5 occurences, then ov(5) is equal to 2 (this method is inspired by "the bucket sort" algorithm, which is O(n) given its requirements, see section 9.2 of See Cormen T.H., Leiserson C.E., Rivest R.L.(1992), Introduction to Algorithms,). If a variable reoccurs, i.e. variable number 3 increases from 5 to 6 occurrences, then decrease(ov(5),1) and increase(ov(6),1) to keep
consistent with the current
.
needs a variable for the number of elements greater than a user selected value.The calculation of all the above elements has a time complexity of O(n), where n is the gilp
vector length. Going from one step to another has a time complexity of O(1).
The gilp heuristic fitness function =
w1*(number of elements in
with values greater than 1 divided by the total
number of non-zero elements in
) +
w2*(number of elements in
with the value equal to maxpredicate + 1 divided
by the number of positive elements in
, zero inclusive) +
w3*(number of elements in
with a value greater than a user limit
divided by the total number of non-zero elements in
).
To make the fitness value be between 0 and 1, let w1+w2+w3 = 1
If the creation of a gilp vector is done incrementally the heuristic fitness has a time complexity of O(1) from one step to another, which gives a total time complexity O(n) for
In the previous section the problems with low cohesion in a clause were mentioned, another example of low cohesion could be
in my opinion the above clause is practically worthless since there is no cohesion between the head and body literals of the clause, i.e. if you combine the clause with the facts:
you can't for sure get any information about the grandfather relation for those 3 atoms
(haakon, olav and harald) , except that there exists a positive example of grandfather with some atoms.
This is the motivation to only allow range-restricted clauses (see definition See Range-restricted horn clause ([49])) in gilp. This can be done iteratively such that for each new head variable created, a counter is increased, and each time a head variable is used in the body and at the same time not allready used in more than one predicate, the counter is decreased. This criteria could be checked by the
vector, assuming that there is only one head predicate. If the head variables are not necessarily in order from 0 to maximum headvar, the variable check to see if its a headvar is increased from O(1) to O(lg |HV|) where HV is the set of head variables. Summarized the check for a range-restricted clause is O(n) (and can be done iteratively simultaneously with the factorizing checks) if the head variable numbers occur in an continous range and there is only one head predicate. If the head variables are not in a continous range, the check is increased to O(n lg|HV|). If there are more than one head predicate, the
vector structure would have to be changed to uniquely represent which predicates a variable occur in, a
possible solution could be to have an associative array (implemented using binary trees), giving a time complexity of (at least) O(n (|HP| lg(|TP|) + lg|HV|)), where HP are the
number of head predicates, and TP are total number of predicates in the clause.
The methods for constraining the GILP search space in the methods suggested here, are described in the previous chapter.
An individual in the GILP population is represented by a chromosome vector of length n.
It is |P| predicates in the vector template, with |HP| predicates in the head part and |BP|
1. Randomly select the first position in the vector to be of value 0 or 1
2. If the current position is the last position in the vector, go to 7
3. Calculate the set PV of possible values for next pos in the vector based on:
3a. The integrity constraint check, if it fails go to 6
3b. The simplified logic factorization check, if it fails go to 6
4. If the set of possible values generated is empty go to 6
5a. Go to the next position vp in the vector
5b. Insert a random value from the generated set PV, and insert in position vp
5c. Check if the vector is range-restricted, if it is go to 7, otherwise go to 2
A GILP population is represented by an ordered set P of individuals. The size of the
1. If the population size |P| is equal to k go to 7
2. Create a new individual using the method in definition See GILP's algorithm for initial creation of a chromosome vector.
3a. Do the entailment check, if it fails go to 1
3b. Calculate the completeness fitness for the individual, if "low" or zero go to 5a
3c. Calculate the consistency fitness for the individual, if "low" or zero go to 6
4. Save the individual in the set of good solutions and go to 6
5a. Insert the clause in the entailment storage tree (for the correct non-zero length)
6. Insert the individual into P and go to 1
(NOTE: this is a set insert, so if it already exists in P, P will not grow in size)
7. Return the generated population
The reason why the heuristic function on literal cohesion isn't used in the creation of an
initial population, is that the primary wish for the initial population is to have individuals which are range-restricted (which is a strict constraint) but also entails many elements in gilp(n) by having a short non-zero length (see section ). Another approach for the
creation of an initial population could be to use an recursive programs like A-See gilpcalc.cc or A-See gilpcalc_ic.cc.
By prioritizing completeness fitness over consistency fitness in the the algorithm in
defintion See GILP's algorithm for the creation of a population, GILP is a top-down GA search algorithm.
Based on fitness value there is a probability pr for a survival of an individual to
the next generation. This gives no new contribution to next generations population
but most likely it will contribute to sustain or increase the average fitness of the
Based on the fitness value there is a probability pm for mutation of a vector.
Select a random position in the vector and select a random variable from that
positions set of possible variable numbers (which may have to be recalculated),
select another random position in the clause and repeat. If the vector is mutated