[wplug] Detecting numeric values

Vance Kochenderfer vkochend at nyx.net
Sun Jun 13 00:24:00 EDT 2010


Thanks to everyone who came out to today's general user meeting!
I'm sorry for the last-minute nature of the presentation, which
prevented me from fully exploring the second exercise.  I'll go
over it here, and will post this and the commands from the first
exercise to the wiki <http://www.wplug.org/wiki/Meeting-20100612>.

To review, here was the (revised) problem statement:

# Determine whether a given value is numeric (decimal).
#
# Example numeric values:
#   123       45.6789   -3.4567   -0        000123
#   .01234    54321.    00000.    -0.987    -.987
#   -0123.    012       0.0       .0        -.000
#
# Example non-numeric values:
#   hello     3f        3F        AB        0xAB
#   0.0.      -0-       3.0E8     3.0e-08   .-0123
#   1.23.4    5.678-    --98      -.        a space
#   a tab
#
# As a bonus, make your command also consider a value
# numeric if it starts with a + sign instead of a - sign.

As I explained at the meeting, I wasn't able to find a way of
doing this using bc or shell arithmetic as I thought I might.  So
the fallback was to use grep and build an appropriate regular
expression (regex).

We ran out of time before I could explain the full regex, so here
is the command in all its glory:

  egrep -q '^[-+]?([0-9]+\.?|[0-9]*\.[0-9]+)$'

or, if you want to strictly conform to POSIX
<http://www.opengroup.org/onlinepubs/009695399/nframe.html>,

  grep -E '^[-+]?([0-9]+\.?|[0-9]*\.[0-9]+)$' > /dev/null

We are using an extended regex, so we have to use egrep or the -E
option to grep.  Since we want to get a true/false value, we use
the -q option or redirection to throw away the output.  This way,
we can just act based on grep's exit status.
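
As a rough sketch of acting on the exit status, the command can be
wrapped in a shell function.  The name is_numeric is made up here
for illustration, and printf is used in place of echo only because
some echo implementations treat a value such as -n as an option.

```shell
# is_numeric is a hypothetical helper name, not part of the original post.
# It returns grep's exit status: 0 if the value is numeric, 1 otherwise.
is_numeric() {
  printf '%s\n' "$1" | grep -Eq '^[-+]?([0-9]+\.?|[0-9]*\.[0-9]+)$'
}

if is_numeric "-3.4567"; then
  echo "numeric"
else
  echo "not numeric"
fi
```

Because the function simply passes along grep's exit status, it can
be used anywhere the shell expects a command, such as in if, while,
or with && and ||.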

Let's pick apart the regex to see what it does.

^
The ^ at the beginning means that our regex will only match
starting at the beginning of the line.  The anticipated use for
this command is something like 'echo "$value" | grep blah...', so
we know there won't be any extraneous stuff at the beginning.  If
you are getting the value from a file or as input from the user,
you may need to strip away whitespace from the beginning, or alter
the regex to account for that.
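
One possible way to alter the regex for leading whitespace is
sketched below.  It assumes that only spaces and tabs need to be
tolerated, and the function name is_numeric_lenient is invented for
this example.

```shell
# is_numeric_lenient is a hypothetical name.  The [[:blank:]]* prefix is an
# addition to the original regex that skips leading spaces or tabs.
is_numeric_lenient() {
  printf '%s\n' "$1" |
    grep -Eq '^[[:blank:]]*[-+]?([0-9]+\.?|[0-9]*\.[0-9]+)$'
}

is_numeric_lenient "   42" && echo "numeric"
```

A value consisting only of whitespace is still rejected, since at
least one numeral is required after the optional blanks.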

[-+]
An expression inside brackets matches a single character, as long
as that character is one of those listed inside the brackets (a
range of characters can be specified, as can special predefined
character classes, but in this case we're not using either of
those).  So this would match either a - sign or a + sign.  Note
that inside a bracket expression, + has no special meaning, and -
is taken literally (rather than marking a range) because it
appears first.

[-+]?
The question mark means "match the preceding item zero or one
times" (here, the item is the bracket expression).  This makes the
sign at the beginning of our number optional, and also disallows
multiple sign characters.
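
To see the optional sign in isolation, here is a small sketch that
pairs [-+]? with a fixed numeral.  The test values are chosen just
for this illustration.

```shell
# Only the sign handling is exercised here; the numeral is fixed at 5.
for value in 5 -5 +5 --5 +-5; do
  if printf '%s\n' "$value" | grep -Eq '^[-+]?5$'; then
    echo "$value: match"
  else
    echo "$value: no match"
  fi
done
```

The first three values match; the two with doubled signs do not.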

( ... | ... )
The parentheses are there for grouping what's inside as a single
unit.  The pipe symbolizes alternation - that is, this part of the
regex will match either the expression appearing before the pipe
symbol, or the expression appearing after it.  We need to use
alternation because while it is optional to have numbers before
the decimal point (e.g., .123) or after the decimal point (e.g.,
123.), it is not valid for both sets of numbers to be missing
(e.g., just a .).
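
To illustrate why alternation is needed, the sketch below compares
the full regex with a simpler attempt in which every part is
optional.  The "naive" pattern is not from the meeting; it is only
an assumed example of what one might try first.

```shell
# naive: digits, dot, and digits are all optional, so a lone "." slips through.
naive='^[-+]?[0-9]*\.?[0-9]*$'
# full: the regex from this post, using alternation.
full='^[-+]?([0-9]+\.?|[0-9]*\.[0-9]+)$'

for value in '.' '-.' '123.' '.123'; do
  if printf '%s\n' "$value" | grep -Eq "$naive"; then n=yes; else n=no; fi
  if printf '%s\n' "$value" | grep -Eq "$full";  then f=yes; else f=no; fi
  printf '%-5s naive=%s full=%s\n' "$value" "$n" "$f"
done
```

The naive pattern accepts all four values, while the full regex
rejects the two that contain no numerals at all.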

[0-9]
Another bracket expression, this matches any single numeral (that
is, any character in the range 0 through 9).

[0-9]+
The + sign outside a bracket expression does have special meaning.
It means "match the preceding item one or more times."  So this
will match one or more numerals, but not an empty string.

\.
The dot has the special meaning "match any character."  If we want
to literally match a period, we have to escape it with a backslash
to remove its special meaning.  Note that [.] would do the same
thing, as dot has no special meaning inside a bracket expression.
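
A quick sketch of the difference, using the made-up input 1x2,
which should never count as numeric:

```shell
# Unescaped dot: "." matches any character, so 1x2 is wrongly accepted.
printf '1x2\n' | grep -Eq '^[0-9]+.[0-9]+$' && echo "unescaped: 1x2 matches"

# Escaped dot and bracketed dot both require a literal period.
printf '1x2\n' | grep -Eq '^[0-9]+\.[0-9]+$' || echo "escaped: 1x2 rejected"
printf '1x2\n' | grep -Eq '^[0-9]+[.][0-9]+$' || echo "bracketed: 1x2 rejected"
```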

\.?
Again, question mark means to match zero or one of the preceding
character.  This makes the decimal point optional.

[0-9]+\.?
This is the entire first expression of our alternation.  It will
match an integer of any length, optionally followed by a decimal
point.  So 0, 123, 123., 00000., 000123, and 123123123123123123123
would all match, but just a decimal point would not.

[0-9]*
We've seen [0-9] before, but the * is new.  It is similar to +,
but means "match the preceding item zero or more times."  We use
* instead of + because it's valid to have nothing in front of the
decimal point (e.g., .123).

\.
This matches a decimal point again, but note there is no question
mark.  In this expression, one single decimal point is mandatory.

[0-9]+
Again, this matches one or more numerals.  Having numbers after
the decimal point is not optional here.

[0-9]*\.[0-9]+
This is the full second expression of our alternation.  It will
match any decimal value that has a fractional part, such as 0.123,
1.234, .123, 0098.6, 123.456, or
3.1415926535897932384626433832795028841971693993751.

([0-9]+\.?|[0-9]*\.[0-9]+)
This is the full non-sign part of our regex.  As discussed above
regarding alternation, it means "either an integer, optionally
followed by a decimal point, or an optional set of numerals
followed by a (mandatory) decimal point and one or more trailing
numerals."

$
This is the counterpart to the ^ character, forcing a match at the
end of the line.  Putting the expression inside ^$ forbids any
extraneous characters before or after our match.
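
With every piece explained, the whole regex can be checked against
the example values from the problem statement.  The loop below is
only a sketch of one way to do that; the variable names are
arbitrary.

```shell
regex='^[-+]?([0-9]+\.?|[0-9]*\.[0-9]+)$'
failures=0

# Values the problem statement lists as numeric: each should match.
for value in 123 45.6789 -3.4567 -0 000123 .01234 54321. 00000. \
             -0.987 -.987 -0123. 012 0.0 .0 -.000; do
  if ! printf '%s\n' "$value" | grep -Eq "$regex"; then
    echo "should match but did not: $value"
    failures=$((failures + 1))
  fi
done

# Values listed as non-numeric: none should match.
tab=$(printf '\t')
for value in hello 3f 3F AB 0xAB 0.0. -0- 3.0E8 3.0e-08 .-0123 \
             1.23.4 5.678- --98 -. ' ' "$tab"; do
  if printf '%s\n' "$value" | grep -Eq "$regex"; then
    echo "should not match but did: $value"
    failures=$((failures + 1))
  fi
done

echo "failures: $failures"
```

Adding +123 or +.5 to the first list is an easy way to check the
bonus requirement as well.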

As we talked about, we could use this regex in awk like so:
  awk '/^[-+]?([0-9]+\.?|[0-9]*\.[0-9]+)$/ {action-list}'
where the action list would be executed for each numeric value
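in input.  Note that POSIX sed does NOT support the extended
regexes understood by egrep or 'grep -E'; it only handles the
basic regexes of standard grep.  To use this regex with sed, you
need an implementation that offers extended regexes as an
extension (GNU sed's -r option or BSD sed's -E option).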
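
To try the awk form with a concrete action, here is a small sketch
that adds up only the numeric lines of its input.  The summing
action and the sample input are assumptions made for illustration.

```shell
# The action here (summing) is just an example; any action list would do.
# Lines that do not match the regex (hello, 3.0E8) are skipped.
printf '%s\n' 1.5 hello 2 -.5 3.0E8 |
  awk '/^[-+]?([0-9]+\.?|[0-9]*\.[0-9]+)$/ { sum += $0 } END { print sum }'
```

This prints 3, the sum of 1.5, 2, and -.5.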

If you are having difficulty understanding any of this, try
playing around with different input values and/or altering the
regex to see what the different parts do.  Ask questions here on
the list if you get really stuck.  Have fun!

Vance Kochenderfer        |  "Get me out of these ropes and into a
vkochend at nyx.net       |   good belt of Scotch"    -Nick Danger

