Parsing Text (grep, sed, awk, find)

Finding files (find)

The most common command that you will use when looking for files on your system is the aptly named find command.

It has somewhat strange syntax, so using tldr is particularly convenient for this command. A common use of find will look like the following:

Terminal window
$ find <start-directory> -iname <file-name>
$ find ~/stuco -iname "*.txt" #case insensitive file name search

There are a bunch of other useful flags:

Terminal window
$ find . -type f # only find files of type "file"
$ find . -maxdepth 2 # only recurse 2 levels deep
$ find . -exec cat {} \; # run the following command for any file that you find, where {} is the placeholder
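These flags can also be combined into one search. A minimal sketch (the /tmp/find-demo sandbox and its files are invented for the example):

```shell
# build a tiny sandbox so the command can be run anywhere
mkdir -p /tmp/find-demo/sub
echo "hello" > /tmp/find-demo/notes.TXT
echo "world" > /tmp/find-demo/sub/deep.txt

# regular files only, at most 2 levels deep, case-insensitive name
# match, then run `cat` on every match
find /tmp/find-demo -maxdepth 2 -type f -iname "*.txt" -exec cat {} \;
```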

Parsing text

grep

I’ve used this command many times throughout this class, and it’s likely going to be the one that you use the most. grep stands for “global regular expression print”. It allows you to find text inside any of the input passed into it.

You can pass text input in using stdin:

Terminal window
$ cat file.txt | grep "search" # using pipes
$ grep "search" < file.txt # using input redirection
$ cat file1.txt | grep "search" < file2.txt # if you do both, the redirection wins: grep reads file2.txt and ignores the pipe

You can also pass in file names as input to grep:

Terminal window
# notice that we aren't using < (input redirection). This is just a file name which grep opens
$ grep "search" file.txt # notice that we aren't using < (input redirection). This is just a file name which grep opens
$ grep "search" *.md # we can also use file globbing (we will cover this more later)

-r/--recursive flag

Sometimes, though, we have no idea which file contains what we are looking for, but we know a keyword that is near it. When that is the case, we can search the contents of every file.

Terminal window
$ grep -r "search" [dir] # search the contents of every file, similar to how `find` recursively searches every directory for a file name
$ grep -r "search" --exclude="*.json" # you can, of course, reduce the search area
$ grep -r "search" --include="*.md" # in more ways than one

RegEx with grep

grep does support RegEx, but it is enabled with a flag.

In general, you will want to use grep -P (GNU grep only), as that enables PCREs, but it won’t be a 1-to-1 match with the PCRE2 that I showed in class.
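For example, \d is a PCRE class that the default grep matcher doesn’t understand. A quick sketch (the sample file and its contents are made up):

```shell
printf 'call 555-0199\nno number here\n' > /tmp/contacts.txt

# \d requires -P (PCRE); -E is the portable extended-regex alternative
grep -P '\d{3}-\d{4}' /tmp/contacts.txt
grep -E '[0-9]{3}-[0-9]{4}' /tmp/contacts.txt
```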

Additional useful flags

  • --line-number: Show the line number this match occurred on
  • --no-filename: Don’t print the filename when printing a match (this only works when passing files to grep, not via stdin)
  • --only-matching: Only print the parts of the line that were matched by the regular expression, not the whole line
  • -A/-B/-C: Print N lines after/before/around the matched line (e.g. -A 3)
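A quick sketch of these flags in action (the sample file is invented for the example):

```shell
printf 'alpha\nbeta\ngamma\n' > /tmp/flags-demo.txt

grep --line-number "beta" /tmp/flags-demo.txt   # → 2:beta
grep --only-matching "bet" /tmp/flags-demo.txt  # → bet
grep -C 1 "beta" /tmp/flags-demo.txt            # prints alpha, beta, gamma
```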

awk

awk is an entire programming language in itself; it is most often used for complex text parsing. By default, awk splits each line into fields on whitespace.

This is just an example of something I’ve personally written with awk; it reads its input from stdin (e.g. piped in from another command). Notice the length() and split() functions that are available.

Terminal window
$ awk '{ \
split($0, days, " "); \
for (i = 1; i <= length(days); i++) { \
if (days[i] ~ /[[:digit:]]-01/) \
printf "\033[1;37;44m%s\033[0m\t", days[i]; \
else if (days[i] ~ /Mon$/) \
printf "\033[1;37;43m%s\033[0m\t", days[i]; \
else printf "%s\t", days[i] \
} \
printf "\n" }'

There’s no way I can cover everything that is available, but these are the basics:

Terminal window
$ awk '{}' # the general syntax
$ awk '{print $2}' # print the second field
$ awk -F ':' '{print $4}' # change the delimiter awk uses
$ awk '{ if ($1 ~ /[[:digit:]]/) print $1 $2}' # conditional match on a line, using a regular expression
$ echo -e "1\n12\n1234\n4346\n234\n93" | awk 'BEGIN {print "this runs first"} {print $1} END {print "this runs at the end"}' # run code before or after processing the input

There are also built-in variables that you can use inside an awk program:

  • NF: the number of fields in the current line
  • $NF: the value of the last field in the current line
  • NR: the number of records (lines) read so far
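A quick sketch using these variables (the input lines are made up):

```shell
# NR numbers the lines, NF counts fields, $NF grabs the last field
printf 'a b c\nd e\n' | awk '{print NR": "NF" fields, last="$NF}'
# → 1: 3 fields, last=c
# → 2: 2 fields, last=e
```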

Modifying text (sed)

All of the commands I have shown up to this point take the text out of a file and search for something, or reorder the text in some way. What if we want to modify the contents of a file? That is what sed can be used for.

I personally don’t use sed very often, but it allows you to modify the text of a file, even in-place (with the --in-place/-i flag). The most common use is a simple find and replace of text in a file:

Terminal window
$ sed 's/foo/bar/g' file.txt # replace all instances of "foo" with "bar" (prints to stdout; the file itself is unchanged)
$ sed --in-place --regexp-extended '/^ssh [[:digit:]]/d' $FILE # delete all lines that start with "ssh \d"
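The difference between printing and rewriting is worth seeing once. A minimal sketch (GNU sed syntax; BSD/macOS sed wants -i ''; the file path and contents are made up):

```shell
printf 'foo one\nfoo two\n' > /tmp/sed-demo.txt

sed 's/foo/bar/g' /tmp/sed-demo.txt             # prints the result; file unchanged
sed --in-place 's/foo/bar/g' /tmp/sed-demo.txt  # rewrites the file on disk
cat /tmp/sed-demo.txt                           # now contains "bar one" / "bar two"
```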

JSON data (jq)

jq is an incredibly useful tool for working with JSON data. You can pass in files, or stdin data, just like with grep, and it lets you easily filter and search through the data.

Terminal window
$ curl -s 'https://jsonplaceholder.typicode.com/users' | jq ".[5].address.geo"
$ jq ".networks[0].address" config.json
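Beyond plain indexing, jq has filters like select for searching. A quick sketch (the sample JSON is invented for the example):

```shell
# iterate the array, keep objects whose id is greater than 1,
# and print each matching name as raw text (-r drops the quotes)
echo '[{"id":1,"name":"ana"},{"id":2,"name":"bob"},{"id":3,"name":"cat"}]' \
  | jq -r '.[] | select(.id > 1) | .name'
# → bob
# → cat
```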

Accessing the text in files

cat

I have used this command repeatedly in this class. cat (short for “concatenate”) takes the text from the file(s) that you provide it and writes their contents to stdout.

head and tail

head and tail show you the top and bottom of a file respectively. You can specify just how much of the file to show using the -n flag:

Terminal window
$ head -n 10 config.json
$ tail -n 3 out.log

tail -F

The -F flag is particularly useful when data is continuously being written to a file, typically by a separate long-running process that is logging to it. Unlike -f, -F will reopen the file if it is rotated or recreated.

Terminal window
$ tail -F long-running.out

diff

You don’t have to rely on git to do all of your file diffs. You can pass in two file paths, and it will do its best to find the differences between them.

Terminal window
$ diff file1.txt file2.txt

There are, of course, tons of command-line flags that you can pass in.
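Two flags worth knowing up front are shown below (GNU diffutils flag names; the sample files are made up). Note that diff exits non-zero when the files differ, which matters in scripts.

```shell
printf 'one\ntwo\nthree\n' > /tmp/a.txt
printf 'one\nTWO\nthree\n' > /tmp/b.txt

diff -u /tmp/a.txt /tmp/b.txt              # unified format, like git diff
diff --side-by-side /tmp/a.txt /tmp/b.txt  # two-column comparison view
```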

Homework

For this homework, you should be working in the docker container I made for this class. Here are some notes about how to work with the file from the website, while inside the docker container.

Writing it to a file

One way would be to just write the random data generated specifically for you to a file. You can do this either by copying and pasting it, or by using curl. I will be showing the curl version:

Terminal window
$ curl -s https://intro-to-cmdline-tools.jtledon.com/parsetext/<alias>.txt > <alias>.txt # using output redirection to write to a file (ex. jledon.txt)

If you ran that command from within the docker container, then you’re done. The issue with that is, if you are using the --rm flag when running the container, the container will disappear and you will need to remember to fetch the data and write it to a file each time you start a new one. To get around this, you could either:

  1. Stop using the --rm flag so the container is not deleted every time, and just reattach to the same container the next time. This will allow the file to persist in the container.
  2. Attach a volume to your container that allows you to pass in part of your directory into the docker container

I prefer the second option; here is a command that sets up such a volume:

Terminal window
$ docker pull jasonledon/stuco:latest && docker run -v $PWD:/passthru --rm -it jasonledon/stuco:latest /bin/bash

Explaining what this command does:

  1. docker pull jasonledon/stuco:latest: pulling the latest version of the class docker image from dockerhub
  2. docker run: run the specified image as a container
  3. -v $PWD:/passthru: Creating a volume. This essentially links your current working directory from where you run this command, to the /passthru directory in the container. Please keep in mind that any changes you make to this directory will actually affect that directory on your system. If you run rm or something, you will actually be deleting files on your machine. It’s just really useful for persistent changes made in a docker container
  4. --rm: delete this container instance once I exit
  5. -i: interactive communication with the terminal’s stdin
  6. -t: allocate a TTY
  7. /bin/bash: The entry command to run

Fetching the data from the website

Alternatively, you could skip setting up a volume and just use a network request to fetch the data each time as the start of your chain of commands. This is likely the easiest and preferred solution for this homework.

Terminal window
$ curl -s https://intro-to-cmdline-tools.jtledon.com/parsetext/<alias>.txt | grep ... | ...
$ curl -s https://intro-to-cmdline-tools.jtledon.com/parsetext/<alias>.txt |& grep ... | ... # |& also pipes stderr into the pipeline