From the field of "Natural Computing" (simulation of natural phenomena, hardware from nature, methods borrowed from nature, etc.), the area of "Biologically Inspired Computing" is selected and described.
A systematic analysis of the literature of this field of research over the past 30 years shows that, after a boom in neural networks in the 1990s, genetic algorithms, and in particular the methods of genetic programming, have come to the foreground in the last five years.
In this heuristic procedure, computer programs are optimized in an iterative loop. In the startup phase, programs are generated randomly. In a frequently recurring cycle, the steps of program execution, evaluation of results (determination of fitness), selection, and diversification (especially crossover and mutation) are used to "grow" better programs from generation to generation.
This work presents criteria for deciding whether or not to use genetic programming. Proven and experimental methods are presented for all phases of the optimization process, together with a short survey of how closely these methods correspond to their natural role models. The thesis also addresses common problems such as bloat.
A library of methods collected by the author forms a mixture of cookbook and toolbox for use in genetic programming.
Finally, this thesis provides examples in which genetic programming has been used to create award-winning practical applications that in many cases have outperformed conventionally obtained results.
Table of Contents
1 Natural Computing
1.1 Natural Hardware
1.1.1 Quantum Computing
1.1.2 DNA Computing
1.1.3 Optical Computing
1.2 Simulation of Nature
1.2.1 Fractal Geometry
1.2.2 Chaos Theory
1.2.3 Artificial Life
1.2.4 Other methods
1.3 Natural Methods = Bio-Inspired Computing
2 Bio-Inspired Computing
2.1 Evolutionary Computation
2.2 Neural Networks
2.2.1 The Blue Brain Project
2.3 Artificial Immune Systems
2.4 Swarm Intelligence
2.5 Other Types of Bio-inspired Computing
3 Systematic Literature Analysis
3.1 Automatic Time-Series-Search with a Perl-Script
3.2 Treatment of raw data
3.3 Conclusion
4 Genetic Algorithms
4.1 Evolutionary Strategies (ES)
4.1.1 The Process of Optimization
4.2 Genetic Algorithms (GA)
4.2.1 GA compared to Real Life
4.3 Integration of ES in GA
5 Genetic Programming
5.1 When is it suggested to use GP?
5.2 Basic Algorithm of GP
5.3 Initial Population
5.3.1 Terminal and Function Set
5.3.2 Some Constraints
5.3.3 Growing Trees
5.3.4 Seeding
5.4 Execute Programs and Ascertain their Fitness
5.4.1 Measuring the Quality
5.4.2 Multi-Objective Problems
5.4.3 Achieving Fitness
5.5 Creation of next Generation
5.5.1 Reproduction
5.5.2 Mutation
5.5.3 Crossover
5.5.4 Architecture Altering
5.5.5 Automatically Defined Functions (ADF)
5.6 End of Execution
5.7 Hall of Fame
6 Summary and Conclusion
Bibliography
List of Figures
List of Tables
List of Abbreviations
Appendix A: Source Code of Search-Script
Appendix B: Diagrams of Search-Results
1 Natural Computing
Natural computing, also known as natural computation, covers everything in computation that has to do with nature. Natural computing can be divided into three branches: Natural Hardware, Simulation of Nature, and the Use of Natural Methods (i.e. Bio-inspired Computing).
1.1 Natural Hardware
1.1.1 Quantum Computing
Very small systems, typically the size of atoms or even subatomic particles, often show behavior that runs contrary to common sense. For example, a single particle can exist in two (or even more) mutually exclusive states at the same time. An electron, for instance, is able to rotate clockwise and anticlockwise simultaneously. Physicists call this the principle of superposition. Since the chirality of rotation can be interpreted as a bit, being 0 when rotating clockwise and 1 when rotating in the other direction, this physical superposition of electron spins can be interpreted as a superposition of the bit values 0 and 1. The phenomenon of superposition and other quantum effects are described in a very accessible way by S. A. Camejo (Camejo, 2006).
By combining eight independent electrons (eight quantum bits) into a quantum byte, that quantum byte is able to express all 256 possible values at the very same time. In the same way one can combine even much larger quantum objects, representing millions of billions of numbers, all at the same time. When these quantum numbers are used in a calculation, the physical probability function collapses and only one of these numbers "survives", which is one of the possible results of the calculation.
With this technique some problems can be attacked, such as the factorization of big numbers used in cryptography, which would consume, even on the fastest available conventional machines, a period of time many million times longer than the age of the universe. Quantum computers can do all the necessary calculations in one single step that takes just a fraction of a second.
Although building hardware for this kind of computer is progressing very slowly, there are already many ready-to-run programs for quantum computers that could solve some very hard problems in a very short time, if only the hardware were available.
But not every hard problem can be solved by quantum computers, as is discussed in (Aaronson, 2008).
1.1.2 DNA Computing
The genetic code of every known life form on earth is encoded in a type of molecule called DNA (deoxyribonucleic acid). This molecule is built from small groups of atoms, called "base pairs", of which one pair can store two bits of information (each position holds one of four possible pairs). By extrapolation one finds that one gram of DNA can store about 500 exabytes of information, which is the amount of information stored on 500 million standard one-terabyte hard disks (weighing 200 million kg).
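As a rough plausibility check of this extrapolation (not taken from the thesis), the following sketch assumes about 660 g/mol for a base pair and 2 bits of information per pair; depending on whether one counts per base pair or per single nucleotide, the result lands between roughly 200 and 500 exabytes per gram, i.e. in the order of magnitude quoted above.

```python
# Back-of-envelope check of the storage density quoted above.
# Assumptions (not from the thesis): 2 bits per base pair and an average
# molar mass of roughly 660 g/mol per (double-stranded) base pair.

AVOGADRO = 6.022e23          # particles per mole
MASS_PER_BP = 660.0          # g/mol, rough average for a DNA base pair
BITS_PER_BP = 2              # one of four possible pairs = 2 bits

pairs_per_gram = AVOGADRO / MASS_PER_BP          # ~9.1e20 base pairs
bits_per_gram = pairs_per_gram * BITS_PER_BP     # ~1.8e21 bits
exabytes_per_gram = bits_per_gram / 8 / 1e18     # ~230 EB

# Counting 2 bits per single nucleotide (~330 g/mol) instead would
# roughly double this figure, to about 460 EB per gram.
print(f"{exabytes_per_gram:.0f} exabytes per gram")
```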
But DNA can also be used as a medium of calculation, in a way quite similar to quantum computing. In quantum computing many different numbers are encoded in a small group of quantum objects, all of them being processed at the very same time. In DNA computing many short strings of DNA, all of them different, are dissolved in water; some enzymes multiply matching strings, while other enzymes cut into pieces those strings that do not match the result of the calculation. This happens to all DNA molecules simultaneously.
A typical example of a problem that DNA computers can solve in a short time is the traveling salesman problem, where the shortest closed path connecting a given number of nodes (towns) is sought: Each town is represented by a specific sequence of base pairs. These sequences can be synthesized, and combinations of these sequences in random order give many different DNA strings, each of them representing a possible route. When the number of synthesized strings is big enough, there is a high probability that each possible route is represented by at least one DNA string. Leonard Adleman (Adleman, 1994) showed that in a chemical reaction, over some days, all strings representing non-optimal routes are eliminated, leaving only the representation of the shortest solution. Unfortunately Adleman did not find a way to read the shortest route from this "solution", but with this proof of concept he showed that DNA computing is possible.
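The following sketch is only a crude in-silico analogy of this generate-and-filter principle, not a model of the chemistry; the town names, distances and pool size are invented for illustration.

```python
import random

random.seed(0)

# Hypothetical symmetric distance table between four "towns".
towns = ["A", "B", "C", "D"]
dist = {("A", "B"): 3, ("A", "C"): 5, ("A", "D"): 4,
        ("B", "C"): 2, ("B", "D"): 6, ("C", "D"): 1}

def d(x, y):
    return dist.get((x, y)) or dist.get((y, x))

# "Synthesize" a large pool of random strings, each a candidate route.
pool = [random.choices(towns, k=len(towns)) for _ in range(100_000)]

# The "enzymes": discard strings that do not visit every town exactly once.
valid = [r for r in pool if sorted(r) == sorted(towns)]

def tour_length(route):
    # closed tour: consecutive legs plus the way back to the start
    legs = list(zip(route, route[1:])) + [(route[-1], route[0])]
    return sum(d(a, b) for a, b in legs)

best = min(valid, key=tour_length)
print(best, tour_length(best))
```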
1.1.3 Optical Computing
Light as a medium to transport information from one computer to another has been state of the art for many years. But light as a medium of computation is still in its infancy. Engineers are working hard on building optical devices similar to electronic transistors, and they are slowly succeeding.
But the real advantage of optical computers is not imitating electronic computers. Like quantum computers and DNA computers, optical computers have the power of massively parallel computation. Unlike electric currents in electronic devices, beams of light in optical components do not interfere with other beams. It is this property that makes it so hard to build optical transistors; but on the other hand, it is a feature that makes it possible to handle millions of light beams, encoding millions of different signals, at the same time in the same piece of hardware.
So once optical computers come of age, they will be able to solve the same hard problems that today are intended for quantum computers or DNA computers.
1.2 Simulation of Nature
The second branch of Natural Computing is the Simulation of Nature. This includes many different techniques whose aim is to produce an output that might also have been produced by nature itself. The methods used to produce this output in most cases have nothing to do with the methods used by nature; it is only the result that counts.
1.2.1 Fractal Geometry
Mathematicians called these objects "monsters" when they were discovered at the end of the 19th century. But after their rediscovery in the 1970s, their similarity to natural objects like coastlines, clouds or snowflakes became evident.
The Mandelbrot set (Fig 1), published by Benoit Mandelbrot in 1975 (in French) and 1977 (in English) (Mandelbrot, 1977), became a symbol of the engagement with this topic.
Fig 1: The Mandelbrot Set.
On their own, but in most implementations in combination with probabilistic methods, fractal objects are able to simulate mountains, plants, fire and many other natural objects. These simulated objects are used in art and in the film industry, but multidimensional fractal "mountains", for example, also serve as excellent search spaces for testing a wide variety of optimization algorithms, as described in this thesis.
1.2.2 Chaos Theory
Long-term unpredictable behavior of a system resulting from the recurrent execution of simple deterministic steps is called chaos. Since there is no probabilistic element at any point in a chaotic system, these systems are deterministic in the sense that a repetition of the whole process with the same start parameters will always lead to the same results. Yet such a system is still unpredictable, since there is no shorter way to obtain the result of the process than to execute the process itself.
A typical and very well-known system that shows chaotic behavior is the weather. Chaos theory helps to understand how the weather can develop from second to second in small steps based on simple deterministic rules, while showing its evidently unpredictable behavior on time scales longer than a few days.
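A standard textbook illustration of this (not taken from the thesis) is the logistic map x_{n+1} = r * x_n * (1 - x_n): the rule is simple and fully deterministic, rerunning it with identical start values reproduces exactly the same trajectory, yet two start values that differ by only 1e-10 soon diverge completely.

```python
# Deterministic but practically unpredictable: the logistic map with r = 4.
def logistic(x, r=4.0):
    return r * x * (1.0 - x)

a, b = 0.2, 0.2 + 1e-10          # two almost identical starting values
for step in range(1, 61):
    a, b = logistic(a), logistic(b)
    if step % 10 == 0:
        print(f"step {step:2d}: a={a:.6f}  b={b:.6f}  |a-b|={abs(a - b):.2e}")
```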
1.2.3 Artificial Life
The idea of creating artificial life is as old as mankind. The Greek god and blacksmith Hephaistos (known in Roman mythology as Vulcan) is said to have created two mechanical handmaidens. Jewish folklore tells the story of Rabbi Maharal, who is said to have built a living creature, called the Golem, in the late 16th century to defend the Jewish ghetto in Prague against anti-Semitic attacks.
In modern times, computer science and the knowledge of robotics are used to manufacture creatures that try to simulate living beings. The robot dog "Aibo" from the Japanese company Sony is a well-known example, but Tamagotchis and the computer program "Creatures" also try to simulate different aspects of life.
1.2.4 Other methods
Lindenmayer Systems (L-Systems)
These systems, named after the Hungarian theoretical biologist Aristid Lindenmayer (1925-1989), use formal grammars that describe rules for transforming strings of symbols, which can be interpreted as detailed descriptions of the shape of trees. Because of these rules, L-systems have the property of self-similarity, which leads to fractal objects. L-systems are used to create fractal plants and other types of fractal objects.
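As a minimal sketch of such string rewriting, the following example uses Lindenmayer's classic "algae" grammar (A -> AB, B -> A); the thesis itself does not prescribe a particular rule set. Interpreting the symbols as drawing commands (turtle graphics) is what turns such strings into the fractal plants mentioned above.

```python
# Minimal L-system rewriting with Lindenmayer's "algae" grammar.
rules = {"A": "AB", "B": "A"}

def rewrite(axiom: str, generations: int) -> str:
    s = axiom
    for _ in range(generations):
        # apply every production rule to every symbol in parallel
        s = "".join(rules.get(symbol, symbol) for symbol in s)
    return s

for g in range(6):
    print(g, rewrite("A", g))
# 0 A
# 1 AB
# 2 ABA
# 3 ABAAB ...
```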
Cellular Automata
In a grid, in most cases spread over a two-dimensional surface, each cell can take one of a limited number of states. At each tick of a global clock, every cell changes its state depending on its own state and the states of neighboring cells. The most famous cellular automaton is called the "Game of Life": there are two possible states for each cell ("living" and "dead"); the neighborhood consists of all eight cells that share a border or a corner with the cell. A few simple transition rules let the system behave like a colony of bacteria that moves over the grid, changing its shape with amazing complexity.
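A minimal sketch of one update step of the Game of Life, using the standard rules (a live cell survives with two or three live neighbors, a dead cell comes alive with exactly three); the grid size and the "glider" start pattern are arbitrary illustration values.

```python
def step(grid):
    rows, cols = len(grid), len(grid[0])
    new = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # count the eight neighbours, wrapping around the grid edges
            n = sum(grid[(r + dr) % rows][(c + dc) % cols]
                    for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                    if (dr, dc) != (0, 0))
            # survive with 2 or 3 neighbours, come alive with exactly 3
            new[r][c] = 1 if n == 3 or (n == 2 and grid[r][c]) else 0
    return new

grid = [[0] * 8 for _ in range(8)]
for r, c in [(0, 1), (1, 2), (2, 0), (2, 1), (2, 2)]:   # a "glider"
    grid[r][c] = 1
for _ in range(4):                                      # glider moves one cell
    grid = step(grid)
print(sum(map(sum, grid)), "live cells")                # still 5: it just moved
```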
Langton’s Ant
Take the same grid as used for cellular automata, but replace the global clock by a local "ant". This ant has its own set of states and sits on one of the cells. It reads the state of this cell, changes the cell's state, changes its own state, and then moves to a neighboring cell, depending on its own state and the state of the cell. If the grid is a one-dimensional belt, then Langton's Ant is nothing other than a Turing machine. On a two-dimensional grid, with the right set of rules, Langton's Ant can create patterns that look quite similar to those produced by cellular automata, and with a very small set of rules one can create the very complex and unpredictable behavior that is typical of chaotic systems.
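A sketch of the classic two-colour instance of this scheme (turn right on a white cell, turn left on a black cell, flip the colour, move forward); the generalized multi-state version described above works analogously.

```python
def run_ant(steps=11_000):
    black = set()                     # coordinates of black cells
    x = y = 0                         # start position
    dx, dy = 0, -1                    # initial heading
    for _ in range(steps):
        if (x, y) in black:
            dx, dy = -dy, dx          # turn left on a black cell
            black.remove((x, y))      # flip the cell back to white
        else:
            dx, dy = dy, -dx          # turn right on a white cell
            black.add((x, y))         # flip the cell to black
        x, y = x + dx, y + dy
    return black

# After a long chaotic phase the ant eventually settles into a repeating
# "highway" pattern.
print(len(run_ant()), "black cells after 11000 steps")
```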
1.3 Natural Methods = Bio-Inspired Computing
The third branch of Natural Computing comprises all computational methods that are inspired by the methods of living nature. So unlike Simulation of Nature, what counts here is the way a goal is reached, not the kind of result. This branch is discussed in detail in the next chapter (chapter 2).
2 Bio-Inspired Computing
In this thesis, Bio-inspired Computing (BIC) is presented as a part of computer science, but it is also part of another fast-expanding scientific field named "Bionics". According to W. Nachtigall, bionics means the analysis of nature, the abstraction of the discovered fundamental principles, and the transfer of these fundamental principles in a technically appropriate manner:
Bionik bedeutet demnach:
1. Naturstudium
2. Abstraktion der dabei entdeckten Grundprinzipien
3. Übertragung dieser Grundprinzipien in technisch angemessener Weise
(Nachtigall, 2008, S. 55)
If, in a particular case, the technically appropriate manner is computation, then we are in the field of bio-inspired computing.
It is important to point out that bionics, and thus also bio-inspired computing, is not blind copying from nature but learning from nature. Nachtigall emphasizes several times in his book that blind copying from nature will in most cases not work. The principles and methods found in nature must be understood, and from the gist of each principle a new, somewhat different principle or method must then be created and implemented in technical projects.
Bio-inspired computing in the classical sense is also loosely (but not necessarily) related to the social and emergent behavior of interconnected networks of simple units. Colonies of ants, bees and other social insects are often named as the biological archetypes of networks of this type.
Many natural methods have by now been adapted to the needs of computation; some of the most important ones are described here:
2.1 Evolutionary Computation
Evolution, being the mother of all natural methods, is such a fundamental and important method that the whole of chapter 4 (Genetic Algorithms), and more or less the rest of this thesis, is dedicated to it.
2.2 Neural Networks
In the research on neural networks, methods of nature have been studied, adopted and implemented in computational devices, as described a few lines above, so this field of research fits perfectly into the pattern of bio-inspired computing. But it is also a simulation of nature, because artificial neural networks try to simulate biological neural networks.
As Manfred Spitzer describes in (Spitzer, 2000, S. 12-13), the switching frequency of natural neurons is at most 1 kHz; more realistic is 400 Hz. Spitzer writes (in translation): "People need only a fraction of a second for a large number of higher functions, such as reading a word or recognizing a face. Because within this time neurons can only process about a hundred steps, the programs that run on the hardware 'brain' must be structured very simply."
Menschen benötigen für eine große Zahl höherer Leistungen wie beispielsweise das Lesen eines Wortes oder das Erkennen eines Gesichts nur Bruchteile einer Sekunde. Da Neuronen innerhalb dieser Zeit nur etwa einhundert Verarbeitungsschritte ausführen können, müssen die Programme, die auf der Hardware Gehirn laufen, sehr einfach strukturiert sein.
(Spitzer, 2000, S. 13)
This problem is known as the "100 steps problem": at roughly 400 Hz, a neuron can perform only about one hundred sequential processing steps within a quarter of a second. This shows that the architecture of neural networks must be very different from that of "normal" computers, and in fact neural networks are highly parallelized machines.
Artificial neural networks try to imitate biological neural networks, and since there are different types of neural networks in a human brain, there are also different types of artificial neural networks. The nodes in a neural network are neurons: simple devices that accept electric pulses as input and generate electric pulses as output.
Types of neural networks
Churchland (Churchland, 1995) distinguishes between Feedforward Networks and Recurrent Networks:
Fig 2: Feedforward Network with two intermediate layers.
Neurons in the input layer of a feedforward network (Fig 2) receive signals from input devices like cameras, eyes, acceleration sensors, ears, microphones or any other device that can measure environmental data and transform it into electric signals. These neurons send signals on to other neurons (in the same layer or in another layer).
Neurons in the output layer send their signals to actors: devices that can transform electric signals into some kind of action in the physical world. Actors can be motors, muscles, speakers, lamps or many other devices.
Between the input and the output layer there can be one or more intermediate layers of neurons; in the neocortex of human brains, for example, there are six layers.
A feedforward network is thus a type of neural network in which information is carried in one direction only.
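A minimal sketch of such a forward pass with two intermediate layers, as in Fig 2; layer sizes, weights and the activation function are arbitrary illustration values, not taken from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
layer_sizes = [4, 5, 5, 2]          # input, two intermediate layers, output
weights = [rng.normal(size=(m, n))  # one weight matrix per layer transition
           for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(x):
    a = x
    for w in weights:
        a = np.tanh(a @ w)          # signals only ever move "forward"
    return a

sensor_input = np.array([0.1, -0.3, 0.7, 0.0])   # e.g. from a camera
print(forward(sensor_input))                     # signals for the actors
```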
Recurrent Networks
In the other type of network the output is connected back to the input, so information moves in loops through the network (Fig 3). In each loop the input pattern is modified, which leads to two important results:
First, the patterns moving through this type of network form a stable cycle. Heartbeat and breathing are stimulated by these cycling patterns, as is any other periodic activity like walking, running or chewing (Churchland, 1995, S. 119).
Fig 3: Recurrent Networks.
Fig 4: Recurrent Networks help to recognize the person in this picture.
The other characteristic of recurrent networks is the perception of patterns and causal correlations: the input pattern passes through the layers of a recurrent network and changes a small amount towards a known pattern. This new pattern is merged with the original input and processed again and again, until the result is a well-known pattern that can be transferred to the output. This is how a brain recognizes a person it sees only from behind, or the face of the person hidden in Fig 4.
Spitzer (Spitzer, 2000) describes two more types of networks:
Hopfield Networks
Netzwerke, in denen jeweils ein Neuron mit allen anderen Neuronen verbunden ist, d.h. autoassoziative Netzwerke, wurden erstmals von dem Physiker John Hopfield beschrieben und werden deshalb auch Hopfield-Netzwerke genannt. Sie bestehen aus einer einzigen Schicht, in der jedes Neuron mit jedem anderen verbunden ist.
(Spitzer, 2000, S. 184)
(Networks in which each neuron is connected to all other neurons, i.e. auto-associative networks, were first described by the physicist John Hopfield, which is why they are also called Hopfield networks. They consist of a single layer in which every neuron is connected to every other.)
Fig 5: Hopfield Network.
This type of network, shown in Fig 5, is able to store complex information and to reconstruct noisy signals, similar to recurrent networks.
Elman Networks
None of the types described before can represent a chronological sequence of events. Elman networks solve this problem. They are built like a feedforward network, but the intermediate layer is connected to an extra "context layer" (Fig 6). Information flows from the input layer to the intermediate layer and from there to the output layer, but also from the intermediate layer to the context layer. From there the transformed information simply goes back to the intermediate layer. So the intermediate layer receives "fresh" information from the input and "old" information from the context layer. This gives this type of network the ability to perceive time.
Fig 6: Elman Networks.
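A sketch of one Elman-style update step under these assumptions; dimensions and weights are arbitrary illustration values, not taken from the thesis.

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_hidden, n_out = 3, 5, 2
W_in  = rng.normal(size=(n_in, n_hidden))      # input -> intermediate layer
W_ctx = rng.normal(size=(n_hidden, n_hidden))  # context -> intermediate layer
W_out = rng.normal(size=(n_hidden, n_out))     # intermediate -> output layer

context = np.zeros(n_hidden)                   # "old" information
for t, x in enumerate([[1, 0, 0], [0, 1, 0], [0, 0, 1]]):
    # intermediate layer sees fresh input plus the stored context
    hidden = np.tanh(np.array(x) @ W_in + context @ W_ctx)
    output = hidden @ W_out
    context = hidden.copy()                    # copy state into context layer
    print(t, output)
```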
2.2.1 The Blue Brain Project
The most ambitious project in the research on artificial neural networks seems to be the "Blue Brain Project" by Henry Markram:
The Blue Brain Project aims to build biologically accurate software models of the rat, mouse, cat, primate, and eventually the Human brain by 2015. The first step focuses on reconstructing a template biologically accurate replica of an elementary network of 10,000 neurons - the rat Neocortical Column. The project addresses the computational challenges to databasing, reconstructing, simulating, analyzing, and visualizing the brain from the molecular to the whole brain levels. The models provide the framework for refinement over the years as new data, methods and computers becomes available. This bottom up reconstruction also involves a bottom up calibration of the model to systematically match emergent functional properties of the actual circuit to the in silico circuit through iterative simulations and experiments.
(Markram, 2006)
Markram started this project in June 2005. Only two years later, in November 2007, he was able to announce the completion of phase I: a "neocortical column", consisting of about 10,000 neurons, was completely simulated on his machine. Neurons in the neocortex of mammals are subdivided horizontally into six layers, in line with the model of feedforward networks described above. But there is also a vertical clustering: groups of about 10,000 neurons, spread over the six layers, form a column with a high density of connections among the neurons of the group and only loose connections to neighboring columns.
Markram managed to copy one of these columns in detail. There are hundreds of different types of neurons in such a column, and every single neuron is simulated by a single chip. He uses 10,000 dual-core CPUs to simulate the flow of chemical agents like sodium and calcium ions or neurotransmitters through thousands of ionic channels per neuron.
In 2007 a single cortical column was simulated. At the moment he is working on a simulation of a complete rat brain. The next steps are the brains of cats and monkeys, and by 2015 he wants to have built a complete artificial human brain.
2.3 Artificial Immune Systems
Artificial immune systems can be defined as computational systems inspired by theoretical immunology and observed immune functions, principles and models, which are applied to problem solving. By utilizing the basic concepts of the immune system it is possible to construct artificial immune systems that can be applied to a myriad of computational scenarios.
(de Castro & Timmis, 2002, S. 104)
Artificial immune systems (AIS) can help to protect a computer system against malicious intruders as well as against any other kind of unwanted state of such a system. But as in living systems, the immune system first has to learn to differentiate between friend and foe. So the system has to be infected before the immune system can learn how to fight that infection, and from the beginning of the infection until the end of the healing process the system is in a state of illness. Artificial immune systems are therefore not prophylactic systems but self-learning repair systems. But just as in real life: once an AIS has had contact with pathogenic "germs", it knows how to protect the system from them.
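One well-known AIS technique that follows this friend-or-foe idea is negative selection; the sketch below only illustrates that scheme with a toy bit-string "self", and is not a method described in the thesis.

```python
import random

random.seed(0)
L = 8                                     # length of the bit patterns

def matches(detector, pattern, r=5):      # r-contiguous-bits matching rule
    return any(detector[i:i + r] == pattern[i:i + r] for i in range(L - r + 1))

# "Self" = normal system behaviour (hypothetical example patterns).
self_set = {"00000000", "00001111", "11110000"}

# Negative selection: random detectors that match "self" are discarded.
detectors = []
while len(detectors) < 20:
    cand = "".join(random.choice("01") for _ in range(L))
    if not any(matches(cand, s) for s in self_set):
        detectors.append(cand)

# Anything a surviving detector matches is flagged as suspicious ("non-self").
incoming = "11111111"
print("suspicious:", any(matches(d, incoming) for d in detectors))
```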
2.4 Swarm Intelligence
Swarm intelligence (SI) is the emergent collective behavior of interconnected networks of simple, decentralized social units forming a self-organized system. The natural role models are ant colonies, bacterial growth, fish schooling, bird flocking, animal herding, and other forms of intelligent behavior emerging from groups of simple units.
The "agents", as these units are called in SI terminology, act on simple local rules and can communicate with neighboring agents of the same group.
First of all, it is clear that social insect and, more generally, natural systems can bring much insight into the design of algorithms and artificial problem-solving systems. In particular, artificial swarm-intelligent systems are expected to exhibit the features that may have made social insects so successful in the biosphere: flexibility, robustness, decentralized control, and self-organization. [...] The swarm-based approach, therefore, looks promising, in face of a world that continually becomes more complex, dynamic, and overloaded with information than ever.
(Bonabeau, Theraulaz, & Dorigo, 1999)
Particle swarm optimization is also often seen as a kind of swarm intelligence: a swarm of particles moves through a multi-dimensional space, where each particle either jumps arbitrarily to other positions in that space or follows the gradient of a scalar field, obeying very simple rules. As a result of these simple local movements of many thousands of very simple particles, the swarm is able to find a global optimum in that field. This is similar to the behavior of a swarm of flies that finds the position of a corpse in the wild. But it should be clear that this is essentially the same strategy that is followed in evolutionary methods, just seen in another context.
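A minimal sketch of the standard particle swarm update on a toy function; the parameter values are common textbook defaults, not taken from the thesis.

```python
import random

random.seed(0)

def f(x, y):                      # the "scalar field" to be minimised
    return x * x + y * y

N, STEPS, W, C1, C2 = 20, 100, 0.7, 1.5, 1.5
pos = [[random.uniform(-5, 5), random.uniform(-5, 5)] for _ in range(N)]
vel = [[0.0, 0.0] for _ in range(N)]
pbest = [p[:] for p in pos]                           # personal best positions
gbest = min(pbest, key=lambda p: f(*p))[:]            # global best position

for _ in range(STEPS):
    for i in range(N):
        for d in range(2):
            r1, r2 = random.random(), random.random()
            # inertia + pull towards personal best + pull towards global best
            vel[i][d] = (W * vel[i][d]
                         + C1 * r1 * (pbest[i][d] - pos[i][d])
                         + C2 * r2 * (gbest[d] - pos[i][d]))
            pos[i][d] += vel[i][d]
        if f(*pos[i]) < f(*pbest[i]):
            pbest[i] = pos[i][:]
    gbest = min(pbest, key=lambda p: f(*p))[:]

print("best found:", gbest, f(*gbest))
```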
2.5 Other Types of Bio-inspired Computing
Emergent Systems
This is a subtype of swarm intelligence. All systems whose complex behavior emerges from many simple components, similar to ants, termites, bees or wasps, are called emergent systems. Webber et al. give a good definition:
Emergent systems are characterised by having a behaviour that cannot be predicted from a centralised and complete description of the component units of the system. In emergent systems, the overall behaviour is the result of a great number of interactions of agents obeying very simple laws. The overall behaviour cannot be anticipated by simple reduction to individual behaviours following a logico-deductive model, which are conditioned by the immediate surroundings, like other agents and objects in the environment.
(Webber, Pesty, & Balacheff, 2002, S. 100)
Epidemic protocols
All network and group communication protocols that mimic epidemiology and the spread of disease in order to spread news are subsumed under the term epidemic protocols.
Membrane computers
Calculation models inspired by the idea of intra-membrane molecular processes in the living cell: a cell receives some input and transforms it according to that cell's rules. The result then diffuses through its membrane into the next cell.
Artificial Intelligence
This method is often seen as part of bio-inspired computing, but it does not fit very well into the schema of bionics, as which BIC must be seen. As shown in (Nachtigall, 2008), BIC means learning from nature, which is a bottom-up approach. In artificial intelligence (AI), however, a top-down approach is used:
Q. What is artificial intelligence?
A. It is the science and engineering of making intelligent machines, especially intelligent computer programs. It is related to the similar task of using computers to understand human intelligence, but AI does not have to confine itself to methods that are biologically observable.
(McCarthy, 2003, S. 2)
3 Systematic Literature Analysis
Chapter 2 gives a qualitative classification of Bio-Inspired Computing, but it is hard to tell which of its fields of research are in the focus of the scientific community and which are not. A quantitative analysis of the scientific literature therefore seems to be the best approach:
3.1 Automatic Time-Series-Search with a Perl-Script
On the internet there are several search engines that are more or less specialized in searching for scientific books, papers and articles; ACM, Google Scholar and SpringerLink are three of them. They provide web pages for advanced search, where search terms can be entered and where the search can be restricted with additional parameters, among them the year of publication.
The Perl script "suche.pl", written by the author (source code in Appendix A), uses these web pages. It sends 32 different hardcoded search terms to them, for a period of 31 years (1978 to 2008), giving 992 combinations of search term and year. These combinations are submitted to each of the three search engines, in each case in three different modes (search in title only, in abstract only, in the complete document). Since Google does not support the mode "search in abstracts only", there are not nine but only eight different search types:
- http://portal.acm.org/
- Search in title only
- Search in abstract
- Search in the complete document
- http://scholar.google.com/
- Search in title only
- Search in the complete document
- http://www.springerlink.com/
- Search in title only
- Search in abstract only
- Search in the complete document
The 32 search terms can be separated into two groups: eight reference words and 24 real search terms:
Reference words:
about, computer, many, memory, processor, such, that, where
Real search terms:
illustration not visible in this excerpt
All together, the number of search requests performed by the script is 31*32*(3*3-1) = 7936. Each request returns a web page, either containing a message stating that nothing was found, or containing the number of documents found together with short descriptions of, and links to, 10 or 20 of those documents. In some cases the page also contains error messages. Most annoying is Google's message "your query looks similar to automated requests", which typically appears after about 200 requests have been sent to Google. This was handled by modifying the script so that it stops sending requests to Google once this message has appeared. After some hours the script had to be restarted to perform the next 200 Google searches (without ACM and SpringerLink). So collecting all the data wanted from Google took more than a week.
Each HTML document returned by one of the search engines was parsed by the Perl script, and only the number of hits was extracted from it. This number (or 0 if the text "was not found" appeared in the document), together with the search parameters, was printed to stdout, which was redirected to a file. To make it possible to monitor the program, this information is also printed to stderr, and from time to time the amount of elapsed time is printed to stderr as well.
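The author's actual script is written in Perl and reproduced in Appendix A (not part of this excerpt). Purely to illustrate the loop structure described above, the following Python sketch uses placeholder URLs, query parameters and a made-up hit-count pattern; these are not the real interfaces of ACM, Google Scholar or SpringerLink.

```python
import re, sys, time, urllib.parse, urllib.request

# Illustrative sketch only: URL templates and the hit-count pattern are
# placeholders; the mode would be encoded as a further query parameter.
ENGINES = {
    "acm":      ("http://portal.acm.org/results.cfm?query={q}&year={y}&part={m}",
                 ["title", "abstract", "all"]),
    "google":   ("http://scholar.google.com/scholar?q={q}&as_ylo={y}&as_yhi={y}",
                 ["title", "all"]),                    # no abstract-only mode
    "springer": ("http://www.springerlink.com/content/?k={q}&year={y}&part={m}",
                 ["title", "abstract", "all"]),
}
TERMS = ["genetic programming", "neural network"]      # the thesis uses 32 terms
YEARS = range(1978, 2009)                              # 31 years
HITS = re.compile(r"([\d,]+)\s+results")               # made-up pattern

for term in TERMS:
    for year in YEARS:
        for engine, (template, modes) in ENGINES.items():
            for mode in modes:
                url = template.format(q=urllib.parse.quote(term), y=year, m=mode)
                try:
                    page = urllib.request.urlopen(url, timeout=30).read().decode(errors="ignore")
                    m = HITS.search(page)
                    hits = int(m.group(1).replace(",", "")) if m else 0
                except Exception:                      # network errors, rate limits
                    hits = -1
                print(engine, mode, year, term, hits)                   # result file
                print(engine, mode, year, term, hits, file=sys.stderr)  # monitoring
                time.sleep(1)                          # throttle the requests
```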
3.2 Treatment of raw data
The raw data file was imported into a spreadsheet program. The 7936 lines of data were sorted and separated into eight sheets by the combination of search engine (ACM, Google, SpringerLink) and search type (all, title, abstract); since Google-abstract does not exist, there are only eight sheets, not nine. In each sheet the data was arranged in a matrix with the two coordinates "time" (year of publication) and "search term". A macro, run on each of the eight sheets, performed the following steps:
Reference Words
Each entry in the raw data represents the absolute number of documents found for a search term in a given year. But the total number of scientific publications rises from year to year. So if the raw data says that in 1990 there were 2000 documents for a certain term and in 2005 there were 3000 documents for the same term, this does not mean that the term attracted more interest from the scientific community in 2005 than in 1990. These numbers have to be seen in relation to the total number of scientific publications of their year. Unfortunately, these total numbers cannot be found.
The solution was to select some reference words. These must be words whose relative frequency in computer science documents does not change too much over the years. The words "computer", "memory" and "processor" can be found in many computer science documents and are assumed by the author to match this criterion. The words "about", "many", "such", "that" and "where" are common English words, assumed not to change their relative frequency too much over the years.
The idea is that the number of documents found for such a reference word is the total number of documents published in that year, multiplied by a constant factor smaller than 1:
N_found(word, year) = C * N_tot(year), with a constant factor 0 < C < 1
There is no need to find an accurate value for C or N_tot. But a reference word must be found whose C is as constant as possible over the period from 1978 to 2008.
Correlation Matrix
Since the total number of publications is not available, the time series of the eight reference words are compared with each other pairwise, to find those that fit best to all the others, assuming that these will also fit best to the (unknown) total number. This is done by calculating the correlation coefficient of each pair of time series:
In Fig 7 the values for "computer" are plotted against the values for "memory". Fig 8 shows the same for the pair "where"-"many" (the values in this example are taken from "ACM-Abstract"):
Fig 7: correlation between “computer” and “memory” is 0.9437
Fig 8: correlation between “where” and “many” is 0.9988
It is easy to see that the pair "where"-"many" has a much better correlation than "computer"-"memory". In fact, the correlation coefficients are 0.9437 for "computer"-"memory" and 0.9988 for "where"-"many" (highlighted numbers in Table 1).
Calculating the correlation coefficient for each pair results in the following matrix:
Table 1: Correlation Coefficients.
The fact that all values in this matrix are close to 1 means that the selection of those eight words was not too bad. But now the best ones are wanted:
Quality Index
The eight coefficients in each column are multiplied, giving a value named the "Quality Index" that expresses how well that word is suited as a reference value; higher values mean higher quality. For ACM-Abstract, these index values are shown in Table 2:
Table 2: Quality Indices.
These values are normalized (the mean is set to 0.0 and the standard deviation to 1.0):
Table 3: Normalized Quality Indices.
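The following sketch reproduces the evaluation steps described in section 3.2 (correlation matrix, quality index, normalization) on made-up counts; the real numbers are in Tables 1-3, which are not visible in this excerpt.

```python
import numpy as np

# Each row of `counts` is a made-up 1978-2008 time series for one reference word.
words = ["about", "computer", "many", "memory",
         "processor", "such", "that", "where"]
rng = np.random.default_rng(0)
trend = np.linspace(100, 1000, 31)                  # publications grow over time
counts = np.array([trend * rng.uniform(0.2, 0.9) +
                   rng.normal(0, 20, 31) for _ in words])

corr = np.corrcoef(counts)                 # 8x8 correlation matrix, as in Table 1
quality = corr.prod(axis=0)                # product per column = quality index
normalized = (quality - quality.mean()) / quality.std()   # as in Table 3

for w, q, z in zip(words, quality, normalized):
    print(f"{w:10s} quality={q:.4f}  normalized={z:+.2f}")
print("best reference word:", words[int(np.argmax(quality))])
```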
[...]