roguelike - ARNS - Workshop40

	Sign In Sign-Up
Random Name Generation Using Regular Expressions - Brian Robinson.
This article was originally written for Darren Hebden's RLNews
Rationale

	One of the things that makes roguelike games interesting is the
use of unique items and beings.  Having Ringil by your side in Angband is
much more fun than having a sword or even a sword +1.  And its much more
satisfying to get rid of Wormtongue than another "seedy looking human."
In the case of Angband, many of the unique names were derived from the
works of J. R. Tolkien and are used to evoke the same kind of atmosphere
used in his books.  Tolkien provides a rich background for a game, and
plenty of names.  However, what do you do if you want to create your own
game world from scratch?  You could write a few novels and use the
characters and items from them, or instead you could try to use a random
name generator.
	A random name generator is a program or part of a program that
can produce names randomly.  The names can be used for anything, from
naming individual beings like monsters and shopkeepers, to naming
artifacts, to naming towns and regions.  However, some tweaking may be
neccasary for different purposes.  For instance, Brian might be a good
name for an orc, and a decent name for a city, but the amulet of Brian
just doesn't sound that great.  One characteristic that a good name
generator should have is flexibility.  It should be able to produce names
of different forms with ease.

Regular Expressions

	One way of accomplishing this is to use regular expressions.  If
you've taken a discrete structures class or are a *nix guru, you probably
already know what these are.  For those that do not, I will provide a
brief explanation.  A regular expression is a string, or a sequential
series of letters and symbols, that describes the form a word may take,
or in technical terms, a language.  One special letter used in
regular expressions is called the null symbol, or a blank.  It is usually
denoted by a greek lambda, but I will use a 0 instead.  Each letter and
each grouping of letters within a regular expression is also a regular
expression.
	Some regular expressions have only one interpretation; for
instance, the regular expression "bab" can only represent "bab".
However, we can use a number of operators to give regular expressions
more options.  The first operator we will look at is the "or" operator,
commonly written as |.  Or basically means that of the two expressions on
either side of it, only one will be selected.  So, if my regular
expression is "b|a", there are two valid words in this language, "b", and
"a".  I can also use parantheses just as in math to group letters, so
"(b|a)b" has two valid words, "bb" and "ab".  
	There are two other operators in classic regular lanugages.  One
is the "and" operator, but it is implicit in the form I am using here.
Basically, whenever an expression is placed next to another expression,
they are anded, which means that they occur together in the order they
were first written.  The other is *, but rather than using * (which can
generate infinitely long words), we will look at the use of numbers as
operators.  Consider the regular expression "a3".  This is analogous to a
to the third power in standard math, and the result of this regular
expression is "aaa".  
	One useful technique we can use with regular expressions is
substitution.  Substitution basically maps a single letter to a regular
expression.  So, if I make a substitution of the form "D=(a|b)" and then
write "Db" it would be the same as if I had written "(a|b)b".  This is
particularly applicable to random name generation.  Consider the
substitution "V=(a|e|i|o|u)".  Now whenever we use a "V" in our regular
expression what it really means is to insert a vowel.  Similarly, we can
define the consonants as "C=(b|c|d|f|g|h|j|k|l|m|n|p|q|r|s|t|v|w|x|y|z)".

Actually Making Some Names!

	Now we can do something interesting.  Lets define a new regular
expression, "CVC".  This is a very simple thing to type, and yet we can
now generate words such as "bob", "sig", "mop", "non", etc.  There are
very many combinations here!  Lets try another: "C(V2)C".  Here we use the
numerical operator.  Substituting the V,  we expand this as
"C((a|e|i|o|u)2)C", so we must select our vowel first, then place it 2
times, rather than placing 2 different vowels.  This regular expression
will create names like "baab", "look", "hiig", etc.  Lets look at one more
before moving on: "(C|0)VCV".  This regular expression can create names
like "moli" and well as "ago" because of the "C|0" part.  Using the null
along with an or operator can be a powerful tool and give you many
options.
	Now that you see how to create some basic names, lets take it a
little further.  You may have noticed that some of the output names I
listed above seem unrealistic; they don't sound like real names.  How can
we improve this?  Well, substitution is again our friend.  I'm going to
define two new substitutions, P for prefix and S for suffix.  The idea
comes from looking at names.  Take a few, like "Brian", "Scott", "Steve",
"Frank", "Casandra", and "Kristen".  These names all have letter
combinations in the beginning or end that we are used to seeing, like
"br" in "Brian" or "nk" in "Frank".  What we do is take these prefixes
and suffixes and map them into a substitution.  So we get 
"P=(br|sc|st|fr|kr)" and "S=(tt|nk|dra|sten)".
	Now lets play with our new toys.  Remember how the names from
"CVC" were not very impressive?  Well, lets see what we come up with when
we use "(C|P)V(C|S)": "lott", "brik", "stink", "wodra", "juk", "kosten".
With the exception of "stink", these are some pretty good names, and the
regular expression for generating them was pretty simple.  To eliminate
names like "stink", we can use a filter which eliminate a set of words we
don't want to appear.  I cover the implementation of such a filter below.

Implementation

	In my implementation, as I mentioned, I treat the concatenation
operator as explicit.  This simplifies things in the code, but makes it
just a little more complex for the person creating the regular
expressions.  Basically, you need to use some more parentheses than you
might expect.  Consider the example above of "C(V2)C".  You might expect
to be able to get the same effect without the parentheses.  But due to the
string implementation, without the parentheses odd things can happen, such
as both the consonant and the vowel being duplicated.
	Now, with that caveat, lets proceed.  I have a Java class that
takes a regular expression as an argument to its constructor.  Every time
you want a name in that format, you can call the getName() method, and a
random name of that type will be generated.  Most of the work in the class
takes place in a private method called parse() which is called by
getName().  I will analyse this line by line below so you can understand
what is happening.

--
      public String parse(String s)
      {
         int index;
         char current;
         StringBuffer b;
         Vector leaves;
      
         leaves = new Vector();
         b = new StringBuffer();
         index = 0;
--

	This code sets up the variables used throughout the method and
initializes them.  Index keeps track of where we are in the regular
expression string s.  Current is a temp variable to let us examine each
character seperately.  B is a buffer used for temporary string storage.
Leaves is a Vector that will contain all the possible outcomes of the or's
in this expression.  I call it leave because it is like leaves of a tree.
The expression "a|b|c" could be represented as:

                                  *
                                 /|\
                                / | \
                               a  b  c

--
         while (index < s.length())
         {
            current = s.charAt(index);
            if (Character.isLetter(current))
            {
               b.append(current);
               index++;
            }
--
	This sets up a while loop that will run until index is greater
than or equal to the length of the string.  Current is assigned the
character at index and we start figuring what we want to do.  First off,
if current is a letter, we add it to the buffer and increment the index,
then restart the loop.

--
            else if (current == '|')
            {
               leaves.add(b);
               b = new StringBuffer();
               index++;
            }

--
	If current is an or operator, we add the string in our buffer to
the leaves Vector and create a new buffer.  We increment the index and
loop.

--
            else if (current == '(')
            {
               int count = index + 1;
               int openParens = 1;
               while (openParens > 0)
               {
                  if (s.charAt(count) == '(')
                  {
                     openParens++;
                  }
                  else if (s.charAt(count) == ')')
                  {
                     openParens--;
                  }
                  count++;
               }
               count--;
               b.append(parse(s.substring(index + 1, count)));
               index = count + 1;
            }
--
	Slightly more complex than the previous parts, when we see an open
parentheses, we need to find it's matching closing parentheses.  We can't
just use the first one we find because we want to support nested
parentheses.  Once we know where the parentheses open and close, we take
whats in the parentheses and parse it just as we do the whole string.  The
result of that parsing is added to our current buffer.  We then increment
the index past the close parentheses and loop.

--
            else if (Character.isDigit(current))
            {
               int count = index + 1;
            
               if (current != '0')
               {
                  while (   (count < s.length()) 
                         && (Character.isDigit(s.charAt(count))))
                  {
                     count++;
                  }
--
	If we have a number, we want to determine what the number is.
Although I don't know how often people are going to need to use numbers
larger than 9, the support is there for anything an int can handle.  Also,
since we are using '0' as our null character, we have to take special note
of it here.  The while loop above simply counts the number of digits we
have in a row.

--
                  Integer exponent;
                  try
                  {
                     exponent = Integer.decode(s.substring(index,count));
                  }
                     catch (NumberFormatException ex)
                     {
                        exponent = new Integer(0);
                     }
--
	This is Java's way of converting a String to an int.  Its a little
more complicated than atoi(), but not so bad.

--
                  StringBuffer temp = new StringBuffer();
                  for(int i = 0; i < exponent.intValue(); i++)
                  {
                     temp.append(b);
                  }
                  b = temp;
--
	Here we simply copy our whole buffer over as many time as
specified by the exponent.  Not that this will affect everything since
either the beginning of the regular expression or since the last or
operator.  This is somewhat of a kludge, but it works if you use
parentheses to segregate exponents.

--
               }
               index = count;
            }
         }
--
	Finally, we need to set the index equal to count.  If we
encountered a '0', count will be (index + 1), so the '0' is skipped.  If
it was any other number, count will be the place in the expression where
the digits stopped.  The final end brace above closes our while loop.

--
         if (leaves.size() == 0)
         {
            return b.toString();
         }
--
	If leaves is empty, that means there was no or operator in the
expression.  We simply return the buffer because there is nothing to
choose between.

--
         else
         {
            leaves.add(b);
            int whichString = generator.nextInt(leaves.size());
            return (leaves.get(whichString)).toString();
         }
      }
--
	If leaves is not empty, we need to choose one of the possible
outcomes of the or operator.  Each possibility should be weighted equally,
but the random number generator in Java is not perfect for small numbers.
However, you will always get correct, if not uniformly distributed names
from it.  Here we add the final string from the buffer, choose a random
index of the Vector leaves, and return that index as the name we've
generated.
	There's a little more to the class, but not much.  Just a
constructor and the getName() method which simply calls parse().  You may
want to add the ability to do substitutions before returning the final
name, or you may chose to do them after the return with a special
substitution class.  That should be simple enough to implement that I need
not discuss it here.

Filtering

	Some people may be content to have names like "stink" appear
in their game, but others may not.  Specifically, parents may not
like for their children to be exposed to profanity within a game,
even if it is not used as such.  Fortunately, its relatively simple to
provide a filter in a name generator to prevent these sorts of
names from being created.  Here's some Java pseudocode of how to use the
filter:

do
{
  String name = nameGenerator.getName();
} until (!filter.isBad(name));

	Now, writing the filter is a little more difficult than
using it.  In Java, its still pretty simple.  I'll post the entire
source here minus the actual words being filtered:

   import java.util.*;
   class Filter extends Hashtable
   {
      public Filter()
      {
         String[] badWords = 
         {/* fill with bad words in lowercase */};
      
         for(int i = 0; i < badWords.length; i++)
         {
            put(badWords[i], new Object());
         }
      }
      public boolean isBad(String s)
      {
         return (get(s.toLowerCase()) != null);
      }
   }

	This class extends the built in Java Hashtable class.  It
creates an array of Strings which are all profane.  It then puts
them into the hash table.  When you want to check to see if a word
is one you wish to filter, it simply hashes it and sees if the
word exists in the table or not.  You could also employ any
storage/search combination you prefer, but this is so easy to
implement in Java it seemed it was the way to go.

Conclusion

	Hopefully you know a bit more about generating random
names after reading this article.  I have provided source for
everything I have discussed below.  If you have any questions or
comments, feel free to email me at bcr19374@pegasus.cc.ucf.edu.
Download Source.