This document discusses the implementation of the cInstruction, cGenome, and cInstLib (instruction library) classes.
This class is used to represent a single instruction within a genome. The private portion of this class consists of a single number that uniquely identifies the type of instruction, and the public section has a number of helper methods that allow us to work with that number.
class cInstruction { private: UCHAR operand; public: // Accessors... int GetOp() const { return (int) operand; } void SetOp(int in_op) { assert(in_op < 256); operand = in_op; } // Operators... void operator=(const cInstruction & inst) { operand = inst.operand; } bool operator==(const cInstruction & inst) const { return (operand == inst.operand); } bool operator!=(const cInstruction & inst) const { return !(operator==(inst)); } // Constructors and Destructor... cInstruction() { operand = 0; } cInstruction(const cInstruction & _inst) { *this = _inst; } explicit cInstruction(int in_op) { SetOp(in_op); } ~cInstruction() { ; } // Some extra methods to convert too and from alpha-numeric symbols... char GetSymbol() const; void SetSymbol(char symbol); };
As I stated above, the only private datum is a numerical value that identifies this instruction. The name "operand" is the term that is used for a command name in an assembly language. Most normal assembly languages have both an operand and arguments associated with each full command. In avida, the commands have no arguments, and hence they just consist of a single operand. The data type used for this is UCHAR, which is short for "unsigned char". A char is an 8 bit number, so when one is unsigned, it represents a number from 0 to 255. As avida is currently implemented, we are limited to 256 distinct instructions. To the outside world, we treat the instruction operand like an integer, so it would be easy to modify this class were we ever to need more than 256 instructions in the set. The only reason to limit it to 8 bits internally (rather than the 32 of an int) is to save memory when we have a very large number of instructions throughout a population. Instructions already make up over half of all the memory resources used by avida.
The public methods begin with the GetOp() and SetOp() methods, which are standard accessors. Next, we have a collection of methods that begin with the word "operator". These are used to define how the corresponding symbols should be treated when applied to objects of this class. For example, the method operator==() is called when we try to compare on object of type cInstruction to another. We have full control over the definition of this method, just like any other.
The next batch of methods we come to are the constructors and destructor. Notice that there are three different constructors, so there are multiple ways an instruction object can be created.
Finally, we have a pair of methods that convert instructions to and from alphanumeric characters (symbols). These methods are used to print instructions out in a maximally compressed format, and to load them back in. The order of symbols used are the letters 'a' through 'z' in lowercase, followed by an uppercase 'A' through 'Z' and finally the numbers '0' through '9'. If there are more than 62 possible instructions, all the rest are assigned a '?' when printed out, and cannot be read back in properly from this format.
A genome is a sequence of instructions. The following class maintains this sequence as an array, and provides a collection of methods to manipulate the total construct.
class cGenome { protected: tArray<cInstruction> genome; int active_size; public: explicit cGenome(int _size); cGenome(const cGenome & in_genome); cGenome(const cString & in_string); virtual ~cGenome(); virtual void operator=(const cGenome & other_genome); virtual bool operator==(const cGenome & other_genome) const; cInstruction & operator[](int index) { assert(index >= 0 && index < active_size); return genome[index]; } const cInstruction & operator[](int index) const { assert(index >= 0 && index < active_size); return genome[index]; } virtual void Copy(int to, int from); int GetSize() const { return active_size; } cString AsString() const; };
The protected variable genome is an array containing an object of type cInstruction at each position. The second variable, active_size denotes the number of instructions in this array that are currently being used. The fact that these variables are "protected" instead of "private" means that any class derived from cGenome will also have access to the variables. In particular, we will discuss the class cCPUMemory in a future lecture, which extends cGenome, adding methods to alter the array length and new variables to keep track of information about each instruction.
Three constructors allow for a new cGenome object to be specified by either a genome length, a previously created genome, or else a string -- a sequence of symbols representing each instruction in order.
The operators created for manipulating genomes include both assignment (setting one genome equal to another) and comparison (testing to see if two genomes are identical.) Additionally, there are two operator[] methods. These means that if you have an object of type cGenome, you can index into it to retrieve a single instruction. Thus, if the object was called "initial_genome", the statement "initial_genome[15]" would return the instruction at position fifteen in the genome. This occurs by calling one of these methods with the appropriate integer. The difference between these two operator methods is that one of them is for mutable genomes (i.e. those that can be modified) -- a reference to the instruction in question is returned allowing it to be altered. The other index operator (operator[] method) is for const genomes, which can never be changed so only the value of the instruction is returned.
The Copy() method is a shortcut to copy memory from one position in the genome to another. This method will later be overloaded (that is, replaced with a newer version) by cCPUMemory such that the proper flags will be copied, with the instruction and others will be set to indicate the copied instruction for future tests. GetSize returns the length of the genome, and AsString returns a string who has symbols in each position that correspond to the the instruction in the same position in the genome.
This class keeps track of all of the possible, legal instructions, and relates each of them to the hardware method that should be triggered when that instruction is executed.
The beginning of the inst_lib.hh file has a couple of commands required for the class definition. They are:
class cHardwareBase; typedef void (cHardwareBase::*tHardwareMethod)();
The first command basically states that there will be a class called cHardwareBase that will be fully defined at some point in the future. We need to refer to the class here, but not directly interact with any of its methods or variables, so it is sufficient to just tell the compiler that it is a class. This form of statement is called a predeclaration.
The second line is a typedef statement, which defines a new data \ type. A typedef looks just like a variable declaration, but the keyword typedef indicates that the "variable" that would have been defined is instead a new "type". For example, the command "typedef unsigned char UCHAR" creates a new type called "UCHAR" that is just an unsigned char. In the case above, we are creating a new type called tHardwareMethod, which is a pointer to a method in the class cHardwareBase. Basically it will point to a location in memory that contains a method in the yet-to-be-defined class. This is a tricky concept -- but it is the easiest way to point to a variable method that we don't know about until run time (when we load the instruction set file in).
The cInstLib class is long, so I'm going to split the code presentation into two halves, with explanation in between. Here is the first part:
class cInstLib { private: // This class gives full info about a single instruction in the library. class cInstEntry { public: cString name; // Name of this instruction. tHardwareMethod function; // Pointer to hardware method. // Some additional details to be used in the future... int redundancy; // Weight in instruction set int cost; // additional time spent to execute inst. int ft_cost; // additional time spent first time exec double prob_fail; // probability of failing to execute inst }; tArray<cInstEntry> inst_array; // The instructions tArray<int> nop_mods; // Modification table for nops // Static components... static const cInstruction inst_error; static const cInstruction inst_none; static const cInstruction inst_default;
The first thing that I do in the private section of the cInstLib definition is define another class called cInstEntry, which contains all of the information about a single entry in an instruction library. Since it is defined internally and privately, cInstEntry can only be used inside of cInstLib; we refer to it as a helper class.
cInstEntry contains only pubiic data about a single instruction. At the moment, the only portion of that data that we are using is the instruction name and the hardware method that it is supposed to call. We also have room to record how redundant this instruction is supposed to be in the full instruction set, how many CPU cycles should be spent as the cost to executing it, an additional first-time cost incurred only the first time the instruction is used by an organism, and a probability of failure, that the instruction will not be executed at all. Any other useful information about an instruction can be put here, but then it should also be made use of elsewhere in the code.
After the definition of cInstEntry, we have the declaration of an array of entries called inst_array. The position of an instruction in this array is the ID number that uniquely identifies that instruction. Thus, whichever instruction entry is placed in the array first, that instruction has ID of 0. The next one has an ID of 1, then 2, and so on.
The inst_array has the instructions ordered in the same order they are presented in the inst_lib file from which they are loaded with one caveat: all no-operation instructions must be listed first. This is because the next variable, nop_mods is an array that indicates which CPU component each nop is associated with. The size of this array determines how many nops there are, and they directly correspond to the same array positions in inst_array.
In the default instruction set, the inst_array has 26 entries contained in it, since this is the size of the instruction set, the first three of which are nop-A, nop-B, and nop-C respectively. The nop_mods array has three integers in it, one for each of the three nops.
The remainder of the private section of the cInstSet definition contains three static and constant variables that represent special instructions. They are inst_error, for an illegal instruction (it will be given the ID number 254), inst_none to represent an empty position (ID number 255), and inst_default for the default instruction to initialize genomes with, unless otherwise specified (usually ID number 0 -- nop-A in the default set -- but this can be changed).
Here is the public portion of the cInstLib definition:
public: cInstLib() { ; } cInstLib(const cInstLib & in_inst_lib); ~cInstLib() { ; } cInstLib & operator=(const cInstLib & _in); // Accessors const cString & GetName(const cInstruction & inst) const; tHardwareMethod GetFunction(const cInstruction & inst) const; int GetNopMod(const cInstruction & inst) const { return nop_mods[inst.GetOp()]; } cInstruction GetInst(const cString & in_name) const; int GetSize() const { return inst_array.GetSize(); } int GetNumNops() const { return nop_mods.GetSize(); } // Instruction Analysis. int IsNop(const cInstruction & inst) const { return (inst.GetOp() < nop_mods.GetSize()); } cString FindBestMatch(const cString & in_name) const; // Insertion of new instructions... int Add(const cString & _name, tHardwareMethod _fun, int redundancy=1, int ft_cost=0, int cost=0, double prob_fail=0.0); int AddNop(const cString & _name, tHardwareMethod _fun, int reg, int redundancy=1, int ft_cost=0, int cost=0, double prob_fail=0.0); // Static methods... static const cInstruction & GetInstDefault() { return inst_default; } static const cInstruction & GetInstError() { return inst_error; } static const cInstruction & GetInstNone() { return inst_none; } };
The constructors, destructor, and assignment operator are all rather standard for this class -- an instruction library can either be created from scratch or else copied over from a pre-existing one.
The accessors in this class allow specific information to be obtained about instructions passed in. The GetName() method returns the human-readable name for a specific instruction, GetFunction() returns a pointer to the method associated with an instruction, and GetNopMod() returns the id of the component modified by a nop-instruction passed in. Conversely, GetInst() will return an instruction object when an instruction name (as a string) is passed in. Additionally, there are the accessors GetSize() to obtain the total number of instructions in the set, and GetNumNops() will return the number of nop instructions in the set.
Two instruction analysis methods also exist; one to determine if an instruction is a nop called IsNop(), and another method called FindBestMatch() takes in an unknown instruction name, and returns the closest name match that it can find in the set. This allows Avida to make suggestions to the user in the case of a mis-spelled instruction name.
An instruction set is initially loaded by successive calls to the Add() and AddNop(), filling in the appropriate arguments. These methods are called from a utility class (cHardwareUtil) that manages the loading of the inst_set file. Next class we will be integrating the classes discussed here with the classes that directly implment the hardware.
Finally, three static methods are included to provide access to the
three pre-defined instructions.