Editing Meeting-20100612

== Speaker/Presentation ==
  
[[User:Vance|Vance Kochenderfer]] will be talking a bit about the UNIX text processing utilities such as grep, sed, awk, cat, wc, and the like.

However, you don't get to just sit on your butt and listen; this is an audience-participation event.  What we're going to do is take a couple of simple tasks, and then explore how you could accomplish them using various UNIX utilities.

The goal is not solely to find the standard, quickest, or simplest solution, but to try out as many different whacked-out options as we can.  So don't stop thinking once you've got an answer, even if it's a good one - see what else you can come up with!

We'll talk over all the suggestions and how they work (or don't work), so hopefully we'll all learn something new.

Start thinking about these, and bring your ideas to the meeting:

=== EXERCISE ONE ===

You have a large text file.  Some lines contain text; others are blank.  Your goal is to figure out how many non-blank lines are in the file.

I can think of at least six ways of doing this.  How about you?

Below are the various examples I came up with, along with running times measured on an input file with 1.5 million blank lines and 3 million non-blank lines.  You can generate these statistics by preceding the command with 'time -p'.
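To try the commands yourself, you need a suitably large input file.  Here is a minimal sketch, assuming a made-up helper name (make_sample) and file name (sample.txt), that writes two non-blank lines for every blank one, the same ratio as the test file described above:

```shell
# make_sample N FILE: hypothetical helper, not part of the original exercise.
# Writes N groups of two non-blank lines followed by one blank line.
make_sample() {
    awk -v n="$1" 'BEGIN {
        for (i = 1; i <= n; i++) {
            print "line " i " a"
            print "line " i " b"
            print ""
        }
    }' > "$2"
}

make_sample 1000 sample.txt
grep -c . sample.txt    # counts the non-blank lines
```

Calling it with 1500000 instead of 1000 should give a file of the size timed here; put 'time -p' in front of any of the commands to compare them on your own system.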
 
# First example from <http://www.vectorsite.net/tsawk_3.html>
awk 'NF != 0 { ++count } END { print count }' filename
3000000
real 3.67
user 3.55
sys 0.10

awk '/./ { ++count } END { print count }' filename
3000000
real 2.96
user 2.81
sys 0.12

grep -c . filename
3000000
real 0.94
user 0.84
sys 0.08

# sed is just a slower grep here.
sed -n -e '/./p' filename | wc -l
real 6.00
user 5.64
sys 0.14
3000000

# If you REALLY love sed, you can replace wc -l, too!
sed -n -e '/./p' filename | sed -n -e '$='
real 7.43
user 5.70
sys 0.19
3000000

tr -s '\012' < filename | wc -l
real 1.21
user 0.84
sys 0.13
3000000
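One caveat with the tr approach: 'tr -s' squeezes each run of newlines down to a single newline rather than removing it, so a file that begins with a blank line still has one empty line left after squeezing, and the count comes out one too high.  A small sketch with a made-up four-line input (the file name is an assumption) shows the difference:

```shell
# Made-up input: blank, "foo", blank, "bar" (two non-blank lines).
printf '\nfoo\n\nbar\n' > leading_blank.txt

# The leading newline survives the squeeze, so wc still sees it as a line.
# (tr -d strips the padding some wc implementations add.)
tr_count=$(tr -s '\012' < leading_blank.txt | wc -l | tr -d ' ')

# grep counts only lines containing at least one character.
grep_count=$(grep -c . leading_blank.txt)

echo "tr: $tr_count  grep: $grep_count"
```

The two counts differ by one for this input, so the tr version is only safe when you know the file does not start with a blank line.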
 
# -b and -s are non-POSIX extensions to cat found on GNU and
# BSD systems.
cat -b -s filename | tail -n 2 | cut -f 1
real 1.20
user 0.63
sys 0.16
2999999
3000000

sh -c 'count=0
while read ln ; do
[ -n "$ln" ] && count=$(($count+1))
done
echo $count' < filename
3000000
real 240.14
user 214.12
sys 24.96
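These commands do not all agree on what "blank" means.  The timing file presumably contained only truly empty lines, so every method printed the same number, but a line holding nothing except whitespace is treated differently by different tools.  A sketch with a made-up input file (spaces.txt is an assumed name):

```shell
# Made-up input: "a", a line holding one space, an empty line, "b".
printf 'a\n \n\nb\n' > spaces.txt

# awk splits fields on whitespace, so the space-only line has NF == 0.
awk_nf=$(awk 'NF != 0 { ++count } END { print count }' spaces.txt)

# /./ and grep . match any character, including a space.
awk_dot=$(awk '/./ { ++count } END { print count }' spaces.txt)
grep_dot=$(grep -c . spaces.txt)

# read strips leading and trailing whitespace, so the line comes back empty.
sh_loop=$(sh -c 'count=0
while read ln ; do
[ -n "$ln" ] && count=$(($count+1))
done
echo $count' < spaces.txt)

echo "awk NF: $awk_nf  awk /./: $awk_dot  grep: $grep_dot  sh: $sh_loop"
```

The NF test and the shell loop treat the space-only line as blank, while the /./ pattern and grep count it as non-blank.  Which behavior is right depends on what you want "blank" to mean.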
 
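# Test the length of the line rather than its truth value, so that
# a line containing only "0" is still counted.
perl -e 'while (<>) { chomp; if (length $_) { ++$count } } ;
print "$count\n"' < filename
3000000
real 7.22
user 7.00
sys 0.14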
 
perl -e 'while (<>) { if (/./) { ++$count } } ;
print "$count\n"' < filename
3000000
real 8.93
user 8.78
sys 0.11

# This one displays a separate count of blank, non-blank, and
# total lines.
awk 'NF != 0 {++nonblank} NF == 0 {++blank}
END {print "Non-blank:",nonblank ; print "Blank:",blank ;
print "Total:",NR}' filename
Non-blank: 3000000
Blank: 1500000
Total: 4500000
real 5.42
user 5.28
sys 0.12

# Actually, we don't need a separate pattern and action to
# count blank lines; we can subtract from the total instead.
awk 'NF != 0 {++count}
END {print "Non-blank:",count ; print "Blank:",NR-count ;
print "Total:",NR}' filename
Non-blank: 3000000
Blank: 1500000
Total: 4500000
real 3.75
user 3.64
sys 0.09

# This does the same, but has to read the file three separate
# times.  On your system, it might be faster or slower than the
# one above, depending on whether CPU or I/O is the bottleneck.
sh -c 'printf "Non-blank: " ; grep -c . filename ;
printf "Blank: " ; grep -v -c . filename ;
printf "Total: " ; wc -l filename | cut -d " " -f 1'
Non-blank: 3000000
Blank: 1500000
Total: 4500000
real 2.25
user 1.91
sys 0.31
 
=== EXERCISE TWO ===

Determine whether a given value is numeric (decimal).

Example numeric values:
  123       45.6789   -3.4567   -0        000123
  .01234    54321.    00000.    -0.987    -.987
  -0123.    012       0.0       .0        -.000

Example non-numeric values:
  hello     3f        3F        AB        0xAB
  0.0.      -0-       3.0E8     3.0e-08   .-0123
  1.23.4    5.678-    --98      -.        a space
  a tab

As a bonus, make your command also consider a value numeric if it starts with a + instead of a -.

I haven't thought about this one as much, and only have one solution so far.  Maybe you can come up with something using bc or some other non-obvious method?

'''Answer:''' I wasn't able to find a way of doing this using bc or shell arithmetic as I thought I might.  So the fallback was to use grep and build an appropriate regular expression (regex).

We ran out of time before I could explain the full regex, so here is the command in all its glory:

  egrep -q '^[-+]?([0-9]+\.?|[0-9]*\.[0-9]+)$'

or, if you want to strictly conform to [http://www.opengroup.org/onlinepubs/009695399/nframe.html POSIX],

  grep -E '^[-+]?([0-9]+\.?|[0-9]*\.[0-9]+)$' > /dev/null

We are using an extended regex, so we have to use egrep or the -E option to grep.  Since we want to get a true/false value, we use the -q option or redirection to throw away the output.  This way, we can just act based on grep's exit status.
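Acting on grep's exit status might look like the following sketch.  The function name is_numeric is made up for illustration; the regex and the redirection are the ones shown above.

```shell
# is_numeric VALUE: hypothetical wrapper name; succeeds (exit status 0)
# when VALUE matches the regex from this section.
is_numeric() {
    printf '%s\n' "$1" |
        grep -E '^[-+]?([0-9]+\.?|[0-9]*\.[0-9]+)$' > /dev/null
}

for value in 123 -.987 3.0E8 0xAB ; do
    if is_numeric "$value" ; then
        echo "$value: numeric"
    else
        echo "$value: not numeric"
    fi
done
```

The sketch uses printf rather than echo because some versions of echo interpret backslashes or leading options in the value, which would change what grep sees.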
 
Let's pick apart the regex to see what it does.

<font color="red">'''^'''</font>[-+]?([0-9]+\.?|[0-9]*\.[0-9]+)$
;^
:The ^ at the beginning means that our regex will only match starting at the beginning of the line.  The anticipated use for this command is something like 'echo "$value" | grep blah...', so we know there won't be any extraneous stuff at the beginning.  If you are getting the value from a file or as input from the user, you may need to strip away whitespace from the beginning, or alter the regex to account for that.

^<font color="red">'''[-+]'''</font>?([0-9]+\.?|[0-9]*\.[0-9]+)$
;[-+]
:An expression inside brackets matches a single character, as long as that character is one of those listed inside the brackets (a range of characters can be specified, as can special pre-defined character classes, but in this case we're not using either of those).  So this would match either a - sign or a + sign.  Note that inside a bracket expression, + has no special meaning, and - is taken literally because it comes first rather than between two characters.

^<font color="red">'''[-+]?'''</font>([0-9]+\.?|[0-9]*\.[0-9]+)$
;[-+]?
:The question mark means "match zero or one of the preceding item," which here is the whole bracket expression.  This makes it so that having a sign at the beginning of our number is optional, and also disallows multiple sign characters.

^[-+]?<font color="red">'''([0-9]+\.?|[0-9]*\.[0-9]+)'''</font>$
;( ... | ... )
:The parentheses are there for grouping what's inside as a single unit.  The pipe symbolizes alternation - that is, this part of the regex will match either the expression appearing before the pipe symbol, or the expression appearing after it.  We need to use alternation because while it is optional to have digits before the decimal point (e.g., .123) or after the decimal point (e.g., 123.), it is not valid for both sets of digits to be missing (e.g., just a .).

^[-+]?(<font color="red">'''[0-9]'''</font>+\.?|[0-9]*\.[0-9]+)$
;[0-9]
:Another bracket expression, this matches any single numeral (that is, any character in the range 0 through 9).

^[-+]?(<font color="red">'''[0-9]+'''</font>\.?|[0-9]*\.[0-9]+)$
;[0-9]+
:The + sign outside a bracket expression does have special meaning.  It means "match one or more of the preceding item."  So this will match one or more numerals, but not an empty string.

^[-+]?([0-9]+<font color="red">'''\.'''</font>?|[0-9]*\.[0-9]+)$
;\.
:The dot has the special meaning "match any character."  If we want to literally match a period, we have to escape it with a backslash to remove its special meaning.  Note that '''[.]''' would do the same thing, as dot has no special meaning inside a bracket expression.

^[-+]?([0-9]+<font color="red">'''\.?'''</font>|[0-9]*\.[0-9]+)$
;\.?
:Again, question mark means to match zero or one of the preceding item.  This makes the decimal point optional.

^[-+]?(<font color="red">'''[0-9]+\.?'''</font>|[0-9]*\.[0-9]+)$
;[0-9]+\.?
:This is the entire first expression of our alternation.  It will match an integer of any length, optionally followed by a decimal point.  So 0, 123, 123., 00000., 000123, and 123123123123123123123 would all match, but just a decimal point would not.
 
^[-+]?([0-9]+\.?|<font color="red">'''[0-9]*'''</font>\.[0-9]+)$
;[0-9]*
:We've seen [0-9] before, but the * is new.  It is similar to +, but means "match zero or more of the preceding item."  We use * instead of + because it's valid to have nothing in front of the decimal point (e.g., .123).

^[-+]?([0-9]+\.?|[0-9]*<font color="red">'''\.'''</font>[0-9]+)$
;\.
:This matches a decimal point again, but note there is no question mark.  In this expression, one single decimal point is mandatory.

^[-+]?([0-9]+\.?|[0-9]*\.<font color="red">'''[0-9]+'''</font>)$
;[0-9]+
:Again, this matches one or more numerals.  Having numbers after the decimal point is not optional here.

^[-+]?([0-9]+\.?|<font color="red">'''[0-9]*\.[0-9]+'''</font>)$
;[0-9]*\.[0-9]+
:This is the full second expression of our alternation.  It will match any decimal value that has digits after the decimal point, such as 0.123, 1.234, .123, 0098.6, 123.456, or 3.1415926535897932384626433832795028841971693993751.

^[-+]?<font color="red">'''([0-9]+\.?|[0-9]*\.[0-9]+)'''</font>$
;([0-9]+\.?|[0-9]*\.[0-9]+)
:This is the full non-sign part of our regex.  As discussed above regarding alternation, it means "either an integer, optionally followed by a decimal point, or an optional set of numerals followed by a (mandatory) decimal point and one or more trailing numerals."

^[-+]?([0-9]+\.?|[0-9]*\.[0-9]+)<font color="red">'''$'''</font>
;$
:This is the counterpart to the ^ character, forcing a match at the end of the line.  Putting the expression inside ^$ forbids any extraneous characters before or after our match.
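To see why the alternation is needed, compare the regex with a simpler pattern that just makes every part optional.  The pattern stored in the variable naive below is an illustration only, not something from the meeting:

```shell
regex='^[-+]?([0-9]+\.?|[0-9]*\.[0-9]+)$'
# Illustrative only: the digits and the decimal point are all optional.
naive='^[-+]?[0-9]*\.?[0-9]*$'

for value in 12.5 . - ; do
    for pattern in "$naive" "$regex" ; do
        if printf '%s\n' "$value" | grep -E "$pattern" > /dev/null ; then
            echo "'$value' matches $pattern"
        else
            echo "'$value' does not match $pattern"
        fi
    done
done
```

Both patterns accept 12.5, but the naive one also accepts a lone decimal point and a lone sign (and even an empty line), because nothing in it is mandatory.  The alternation is what forces at least one digit to be present.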
 
As we talked about, we could use this regex in awk like so:
  awk '/^[-+]?([0-9]+\.?|[0-9]*\.[0-9]+)$/ {action-list}'
where the action list would be executed for each numeric value in the input.  Note that standard sed does NOT support the extended regexes understood by egrep or 'grep -E'.  It only handles the basic regexes of standard grep, so we cannot use this regex with a strictly POSIX sed.  (GNU and BSD sed accept extended regexes through the -r or -E option, but that is an extension to the standard referenced above.)
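As a sketch of what an action list could do, the following counts and adds up the numeric lines of a made-up input and skips everything else.  The input values and the variable name result are assumptions for illustration.

```shell
# Made-up input mixing numeric and non-numeric lines.
result=$(printf '1\n2.5\nabc\n-.5\n3.0E8\n' |
    awk '/^[-+]?([0-9]+\.?|[0-9]*\.[0-9]+)$/ { sum += $0 ; ++n }
         END { print n " numeric values, sum " sum }')

echo "$result"
```

Only 1, 2.5, and -.5 pass the pattern, so abc and 3.0E8 never reach the action and are left out of the total.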
 
If you are having difficulty understanding any of this, try playing around with different input values and/or altering the regex to see what the different parts do.  Ask questions [[Mailing Lists|on the main wplug mailing list]] if you get really stuck.  Have fun!

Example output:
cat numbers
123
45.6789
-3.4567
-0
000123
.01234
54321.
00000.
-0.987
-.987
-0123.
012
0.0
.0
-.000

cat nonnumbers
hello
3f
3F
AB
0xAB
0.0.
-0-
3.0E8
3.0e-08
.-0123
1.23.4
5.678-
--98
.-
 
	

cat bonusnumbers
123
45.6789
+3.4567
+0
000123
.01234
54321.
00000.
+0.987
+.987
+0123.
012
0.0
.0
+.000

cat numbers | while read value ; do echo "$value" | \
grep -E '^[-+]?([0-9]+\.?|[0-9]*\.[0-9]+)$' > /dev/null \
&& echo "Number" || echo "Not a number"; done \
| paste numbers -
123 Number
45.6789 Number
-3.4567 Number
-0 Number
000123 Number
.01234 Number
54321. Number
00000. Number
-0.987 Number
-.987 Number
-0123. Number
012 Number
0.0 Number
.0 Number
-.000 Number

cat nonnumbers | while read value ; do echo "$value" | \
grep -E '^[-+]?([0-9]+\.?|[0-9]*\.[0-9]+)$' > /dev/null \
&& echo "Number" || echo "Not a number"; done \
| paste nonnumbers -
hello Not a number
3f Not a number
3F Not a number
AB Not a number
0xAB Not a number
0.0. Not a number
-0- Not a number
3.0E8 Not a number
3.0e-08 Not a number
.-0123 Not a number
1.23.4 Not a number
5.678- Not a number
--98 Not a number
.- Not a number
  Not a number
Not a number

cat bonusnumbers | while read value ; do echo "$value" | \
grep -E '^[-+]?([0-9]+\.?|[0-9]*\.[0-9]+)$' > /dev/null \
&& echo "Number" || echo "Not a number"; done \
| paste bonusnumbers -
123 Number
45.6789 Number
+3.4567 Number
+0 Number
000123 Number
.01234 Number
54321. Number
00000. Number
+0.987 Number
+.987 Number
+0123. Number
012 Number
0.0 Number
.0 Number
+.000 Number
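One detail of the loops above: a plain 'read value' strips leading and trailing whitespace, so a value padded with spaces reaches grep without the padding, and the space-only and tab-only test lines reach it as empty strings.  A sketch of a variant that passes each line through untouched, using made-up file names (sample_values and results.txt):

```shell
# Made-up test file: a number, a space-padded number, and a lone space.
printf '42\n 42\n \n' > sample_values

# IFS= keeps surrounding whitespace; -r keeps backslashes literal.
while IFS= read -r value ; do
    if printf '%s\n' "$value" |
        grep -E '^[-+]?([0-9]+\.?|[0-9]*\.[0-9]+)$' > /dev/null
    then
        echo "[$value] Number"
    else
        echo "[$value] Not a number"
    fi
done < sample_values > results.txt

cat results.txt
```

With this variant the padded value is rejected, because the leading space is still there when the regex is applied.  Whether that is what you want depends on where the values come from, as noted in the discussion of ^ above.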
  
 
== Meeting Minutes ==
  
'''DRAFT'''

The regular monthly meeting of the Western Pennsylvania Linux Users Group was held on Saturday, June 12, 2010, at 11:06 AM, at the Wilkins School Community Center, the regular presiding officer being in the chair.  In the absence of the regular secretary, Vance Kochenderfer was elected to serve as secretary pro tem.  The [[Meeting-20091031#Meeting_Minutes|minutes of the October 31, 2009 meeting]] were approved as read.

The Treasurer reported that there is $760.80 in the checking account, $66 cash on hand in the refreshment fund, and $40 received in dues yet to be deposited.

The meeting adjourned at 11:09 AM.

Vance Kochenderfer<br />
Secretary pro tem

'''DRAFT'''
  
 
== Meeting Staff ==
