Exercises 4: Data Wrangling

Take this short interactive regex tutorial.
Find the number of words (in /usr/share/dict/words) that contain at least three as and don’t have a 's ending. What are the three most common last two letters of those words? sed’s y command, or the tr program, may help you with case insensitivity. How many of those two-letter combinations are there? And for a challenge: which combinations do not occur?
- Find the number of words (in /usr/share/dict/words) that contain at least three as and don’t have a 's ending. What are the three most common last two letters of those words?
```
 ```
   cat /usr/share/dict/words | tr "[:upper:]" "[:lower:]" | grep -E "^([^a]*a){3}.*$" | grep -v "\'s$" | wc -l
 ```

 `cat /usr/share/dict/words` prints the contents of the file `words` to the standard output (terminal).
 `tr "[:upper:]" "[:lower:]"` translates the contents of the file to lower-case using the `tr` program.
 `grep -E "^([^a]*a){3}.*$"` searches the file using extended regular expression. The regular expression matches strings for sequences of non-`a` characters (`[^a]*`) that ends in `a`, for three times. (`([^a]*a){3}`) The rest of the string could contain whatever chracter however many times. (`.*`)
 `grep -v "\'s$"` invert matches for excluding specific string `\'s`. Note `-v` tag allows for inverse matching.
 `wc -l` counts lines in file. This is equivalent to the total number of words as each words is separated by a new line.
```
- What are the three most common last two letters of those words?
```
cat /usr/share/dict/words | tr "[:upper:]" "[:lower:]" | grep -E "^([^a]*a){3}.*$" | grep -v "\'s$" | sed -E "s/.*([a-z]{2})$/\1/" | sort | uniq -c | sort -n | tail -n3
```
  sed -E "s/.*([a-z]{2})$/\1/" replaces all occurrences of the regular expression to the first capture group. The regular expression captures the last two lower-case alphabet characters.
  sort | uniq -c displays number of occurrences of each line along with that line. sort -n perfroms numeric sort, in an ascending order. tail -n3 returns the last three items, which should be the three most common last two letters.
- How many of those two-letter combinations are there?
```
cat /usr/share/dict/words | tr "[:upper:]" "[:lower:]" | grep -E "^([^a]*a){3}.*$" | grep -v "\'s$" | sed -E "s/.*([a-z]{2})$/\1/" | sort | uniq | wc -l
```
- And for a challenge: which combinations do not occur?
```
cat /usr/share/dict/words | tr "[:upper:]" "[:lower:]" | grep -E "^([^a]*a){3}.*$" | grep -v "\'s$" | sed -E "s/.*([a-z]{2})$/\1/" | sort | uniq > last_letters

source letters.sh > all_letters

diff --changed-group-format="%<" --unchanged-group-format="" all_letters last_letters
```
  Output the two-letter combinations in file words to another file last_letters.
  Use a shell script to perform nested loops on all possible lower-case two-letter combinations as in letters.sh and output the result to another file all_letters.
  Use diff program to print the difference between the two files all_letters and last_letters. You can use --changed-group-format and --unchanged-group-format options to filter required data. Following three options can use to select the relevant group for each option: %< get lines from FILE1 %> get lines from FILE2 ‘’ (empty string) for removing lines from both files. (Source)
To do in-place substitution it is quite tempting to do something like sed s/REGEX/SUBSTITUTION/ input.txt > input.txt. However this is a bad idea, why? Is this particular to sed? Use man sed to find out how to accomplish this.

sed s/REGEX/SUBSTITUTION/ input.txt > input.txt will return a blank file, since input.txt on the right hand side of the output operator > will be made blank before the left hand side of the output operator could be applied. Because the file names conicide, an empty input.txt will be output to an empty input.txt.

Use man sed to find out how to accomplish in-place substitution.
```
-i extension
         Edit files in-place, saving backups with the specified extension.
         If a zero-length extension is given, no backup will be saved.  It
         is not recommended to give a zero-length extension when in-place
         editing files, as you risk corruption or partial content in situ-
         ations where disk space is exhausted, etc.
```
It is generally a bad idea to in-place edit files as mistakes can be made when using regular expressions and without backups on the original file, it can be very difficult to recover. This is not particular to sed but in general to all forms of in-place editing files. To accomplish in-plface substitution add the -i tag.
Find your average, median, and max system boot time over the last ten boots. Use journalctl on Linux and log show on macOS, and look for log timestamps near the beginning and end of each boot. On Linux, they may look something like:
```
Logs begin at ...
```
and
```
systemd[577]: Startup finished in ...
```
On macOS, look for:
```
=== system boot:
```
and
```
Previous shutdown cause: 5
```
Output the matching logs to a separate file bootlog since it takes a while to run through a month worth of log history everytime.
```
log show | grep -E "=== system boot:|Previous shutdown cause: 5" > bootlog
```
A line of the matched log looks like this on macOS:
```
2020-04-29 09:46:04.022430+0900 0x0        Timesync    0x0                  0      0    === system boot: C4C57548-7964-483A-A0E9-8FD49A6088C6
2020-04-29 09:46:04.683152+0900 0xa9       Default     0x0                  0      0    kernel: (AppleSMC) Previous shutdown cause: 5
```
The line of interest the second column of the logs, of which their difference denotes the time how long the system took to boot.
```
cat bootlog | awk '{print $2}' | sed -E "s/^.*:(.*)/\1/" | xargs -n2 | awk '{print $2"-"$1}' | bc | R --slave -e 'x <- scan(file="stdin", quiet=TRUE); summary(x)'
```
awk '{print $2}' only prints the second column of the bootlog.
sed -E "s/^.*:(.*)/\1/" dismisses the hour and minute of the time log, and replaces the logs with the simplified version of seconds (as boot time is often around a few seconds).
xargs -n2 rearranges the logs to be shown in two columns.
awk '{print $2"-"$1}' concatenates a subtraction character “-“ between the second column and the first column. Note that this print is interpreted as a string type, not an operation between integer type values.
bc turns the previous strings to an integer operation.
R --slave -e 'x <- scan(file="stdin", quiet=TRUE); summary(x)' can be used as in the lectures to compute the average, median, and max boot time. Programming language R must be installed in order to run this command. Or install st and replace the command with st --mean --median --max. There does not seem to be a ‘vanilla’ method to perform this step.
Look for boot messages that are not shared between your past three reboots (see journalctl’s -b flag). Break this task down into multiple steps. First, find a way to get just the logs from the past three boots. There may be an applicable flag on the tool you use to extract the boot logs, or you can use sed '0,/STRING/d' to remove all lines previous to one that matches STRING. Next, remove any parts of the line that always varies (like the timestamp). Then, de-duplicate the input lines and keep a count of each one (uniq is your friend). And finally, eliminate any line whose count is 3 (since it was shared among all the boots).

Skipped (system not Linux)
Find an online data set like this one, this one. or maybe one from here. Fetch it using curl and extract out just two columns of numerical data. If you’re fetching HTML data, pup might be helpful. For JSON data, try jq. Find the min and max of one column in a single command, and the sum of the difference between the two columns in another.