Parsing Text (grep, sed, awk, find)
Finding files (find)
The most common command that you will use when looking for files on your system is the aptly named find command. It has somewhat strange syntax, so using tldr is particularly convenient for this command. A common use of find will look like the following:
$ find <start-directory> -iname <file-name>
$ find ~/stuco -iname "*.txt" # case-insensitive file name search
There are a bunch of other useful flags:
$ find . -type f # only find files of type "file"
$ find . -maxdepth 2 # only recurse 2 levels deep
$ find . -exec cat {} \; # run the following command for any file that you find, where {} is the placeholder
Parsing text
grep
I’ve used this command many times throughout this class, and it’s likely going to be the one that you use the most. grep stands for “Global RegEx Print”; it allows you to find text inside any of the input passed into it.
You can pass text input in using stdin:
$ cat file.txt | grep "search" # using pipes
$ grep "search" < file.txt # using input redirection
$ cat file1.txt | grep "search" < file2.txt # or both
You can also pass in file names as input to grep:
$ grep "search" file.txt # notice that we arent using < (input redirection). This is just a file name which grep opens$ grep "search" *.md # we can also using file globbing (we will cover this more later)
-r/--recursive flag
Sometimes, though, we have no idea which file the text we are looking for is in, but we know a keyword that appears near it. When that is the case, we can search the contents of every file.
$ grep -r "search" [dir] # search the contents of every file, similar to how `find` recursively searches every directory for a file name$ grep -r "search" --exclude="*.json" # you can, of course, reduce the search area$ grep -r "search" --include="*.md" # in more ways than one
RegEx with grep
grep does support RegEx, but it has to be enabled with a flag.
In general, you will want to use grep -P, as that enables PCREs, but it won't be a 1-to-1 match with the PCRE2 that I showed in class.
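A quick sketch of the difference (the input here is made up; note that -E is available everywhere, while -P is a GNU grep extension and is missing on e.g. macOS's BSD grep):

```shell
# -E enables extended regular expressions (ERE)
printf 'order-123\nnope\n' | grep -E 'order-[0-9]+'
# order-123

# -P enables PCREs (GNU grep only), so shorthand classes like \d work
printf 'order-123\nnope\n' | grep -P 'order-\d+'
# order-123
```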
Additional useful flags
- --line-number: Show the line number this match occurred on
- --no-filename: Don't print the filename when printing a match (this only works when passing in files to grep, not via stdin)
- --only-matching: Only print the parts of the line that were matched by the regular expression, not the whole line
- -A/-B/-C: Print the lines after/before/both of the matched line
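Here is a quick sketch of a few of these flags in action, using a throwaway file (the file name and contents are just for illustration):

```shell
printf 'alpha\nbeta match here\ngamma\n' > /tmp/flags-demo.txt

grep --line-number "match" /tmp/flags-demo.txt   # 2:beta match here
grep --only-matching "match" /tmp/flags-demo.txt # match
grep -C 1 "match" /tmp/flags-demo.txt            # prints all three lines: one line of context on each side
```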
awk
awk is an entire programming language in itself; it is most often used for complex text parsing. awk defaults to splitting fields on whitespace.
This is just an example of one of the things that I’ve personally written with awk. Notice the length() and split() functions that are available.
$ awk '{ \
    split($0, days, " "); \
    for (i = 1; i <= length(days); i++) { \
        if (days[i] ~ /[[:digit:]]-01/) \
            printf "\033[1;37;44m%s\033[0m\t", days[i]; \
        else if (days[i] ~ /Mon$/) \
            printf "\033[1;37;43m%s\033[0m\t", days[i]; \
        else printf "%s\t", days[i] \
    } \
    printf "\n" }'
There’s no way I can cover everything that is available, but these are the basics:
$ awk '{}' # the general syntax
$ awk '{print $2}' # print the second field
$ awk -F ':' '{print $4}' # change the delimiter awk uses
$ awk '{ if ($1 ~ /[[:digit:]]/) print $1 $2 }' # conditional match on a line, using a regular expression
$ echo -e "1\n12\n1234\n4346\n234\n93" | awk 'BEGIN {print "this runs first"} {print $1} END {print "this runs at the end"}' # run code before or after processing any/all text
There are also built-in fields that you can query:
$ NF # number of fields in this line
$ $NF # the value of the last field in this row
$ NR # number of rows seen thus far
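A small sketch showing all three at once on some throwaway input:

```shell
# For each line, print the row number, the field count, and the last field
printf 'a b c\nd e\n' | awk '{ print NR, NF, $NF }'
# 1 3 c
# 2 2 e
```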
Modifying text (sed)
All of the commands I have shown up to this point take the text out of a file, and search for something, or reorder the text in some way. What if we want to modify the contents of a file? That is what sed can be used for.
I personally don’t use sed very often, but it allows you to modify the text of a file, in-place. The most common use is a simple find and replace of text in a file:
$ sed 's/foo/bar/g' file.txt # replace all instances of "foo" with "bar"
$ sed --in-place --regexp-extended '/^ssh [[:digit:]]/d' $FILE # delete all lines that start with "ssh \d"
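The s command also supports capture groups and backreferences, which makes reordering text easy. A quick sketch on made-up input (-E is the short form of --regexp-extended):

```shell
# Swap two comma-separated fields using capture groups (\1 and \2)
printf 'Doe,Jane\n' | sed -E 's/([^,]+),([^,]+)/\2 \1/'
# Jane Doe
```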
JSON data (jq)
jq is a tool that is incredibly useful when going through JSON data. You can pass in files, or stdin data, just like with grep, and it lets you easily filter and search through JSON data.
$ curl -s 'https://jsonplaceholder.typicode.com/users' | jq ".[5].address.geo"
$ jq ".networks[0].address" config.json
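jq filters can do more than index into the structure; select() keeps only the entries matching a condition. A minimal sketch on inline data (the field names here are made up):

```shell
# Keep only the objects where .up is true, then pull out their names
echo '[{"name":"a","up":true},{"name":"b","up":false}]' \
    | jq '.[] | select(.up) | .name'
# "a"
```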
Accessing the text in files
cat
I have used this command repeatedly in this class. It just takes the file(s) that you provide it and writes their contents to stdout.
head and tail
head and tail show you the top and bottom of a file, respectively. You can specify just how much of the file to show using the -n flag:
$ head -n 10 config.json
$ tail -n 3 out.log
tail -F
The -F flag is particularly useful when data is continuously being written to a file, typically by a long-running separate process that is logging output as it runs.
$ tail -F long-running.out
diff
You don’t have to rely on git to do all of your file diffs. You can pass in two file paths, and it will do its best to find the difference between them.
$ diff file1.txt file2.txt
There are, of course, tons of command-line flags that you can pass in.
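The most common one is probably -u, which produces the unified format that git and patch use. A quick sketch with two throwaway files:

```shell
printf 'one\ntwo\n' > /tmp/a.txt
printf 'one\n2\n'   > /tmp/b.txt

# -u prints a unified diff: removed lines prefixed with -, added lines with +
diff -u /tmp/a.txt /tmp/b.txt
```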
Homework
For this homework, you should be working in the docker container I made for this class. Here are some notes about how to work with the file from the website, while inside the docker container.
Writing it to a file
One way would be to just write the random data generated specifically for you to a file. You can do this either by copying and pasting it, or by using curl. I will be showing the curl version:
$ curl -s https://intro-to-cmdline-tools.jtledon.com/parsetext/<alias>.txt > <alias>.txt # using output redirection to write to a file (ex. jledon.txt)
If you ran that command from within the docker container, then you're done. The issue with that is that if you are using the --rm flag when running the container, the file will disappear with the container, and you need to remember to fetch the data and write it to a file each time you start a new one. To get around this, you could either:
- Stop using the --rm flag, so the container isn't deleted every time, and just reattach to the same container the next time. This will allow the file to persist in the container.
- Attach a volume to your container that allows you to pass part of your directory into the docker container
I prefer to use the second option; here is a command that allows you to do that:
$ docker pull jasonledon/stuco:latest && docker run -v $PWD:/passthru --rm -it jasonledon/stuco:latest /bin/bash
Explaining what this command does:
- docker pull jasonledon/stuco:latest: pull the latest version of the class docker image from dockerhub
- docker run: run the specified image as a container
- -v $PWD:/passthru: create a volume. This essentially links your current working directory, from where you run this command, to the /passthru directory in the container. Please keep in mind that any changes you make to this directory will actually affect that directory on your system. If you run rm or something, you will actually be deleting files on your machine. It's just really useful for persisting changes made in a docker container
- --rm: delete this container instance once I exit
- -i: interactive; keep stdin open so you can type into the container
- -t: allocate a TTY
- /bin/bash: the entry command to run
Fetching the data from the website
Alternatively, you could skip setting up a volume and just use a network request to fetch the data each time as the start of your chain of commands. This is likely the easiest and preferred solution for this homework.
$ curl -s https://intro-to-cmdline-tools.jtledon.com/parsetext/<alias>.txt | grep ... | ...
$ curl -s https://intro-to-cmdline-tools.jtledon.com/parsetext/<alias>.txt |& grep ... | ...