
Linux Basics Part 4 — Text Processing and Pipes

7 min read
Linux Series (4/8)
  1. Linux Basics Part 1 — The Shell and Filesystem Structure
  2. Linux Basics Part 2 — File Permissions and Users/Groups
  3. Linux Basics Part 3 — Processes and Signals
  4. Linux Basics Part 4 — Text Processing and Pipes
  5. Linux Basics Part 5 — Network Tools
  6. Linux Basics Part 6 — Systemd and Service Management
  7. Linux Basics Part 7 — Package Management
  8. Linux Basics Part 8 — Bash Scripting Basics
Where the Unix Philosophy Shines Brightest

If you had to explain in one sentence why Unix has survived for over 50 years, it would be this: “Build small tools and connect them with pipes.” No single command tries to do everything. grep only searches, sort only sorts, wc only counts. Instead, they connect to each other via the pipe | to form pipelines, and complex tasks are accomplished by composing tools that each do one thing well.

The tools covered in this part may look simple at first glance. But when it comes to log analysis, automated config file editing, or CSV data summarization in practice, these are the tools you inevitably reach for. These are the commands practically burned into the muscle memory of DevOps engineers.

The Pipeline Mindset

The pipe | connects the standard output of the preceding command to the standard input of the following command. Spelled out, it sounds obvious, but a diagram reveals the essence.

flowchart LR
    CMD1["Command A<br/>(stdout)"] -->|pipe| CMD2["Command B<br/>(stdin)"]
    CMD2 -->|stdout| CMD3["Command C<br/>(stdin)"]
    CMD3 -->|stdout| SCREEN["Screen<br/>(default stdout)"]

Each command only knows what it does. It doesn’t care what comes before or after. That’s why they can be combined endlessly, like LEGO blocks. The following one-liner is the condensed version of this philosophy.

# Top 10 URLs that returned 404 in recent nginx access logs
tail -n 10000 /var/log/nginx/access.log \
  | awk '$9 == 404 { print $7 }' \
  | sort \
  | uniq -c \
  | sort -rn \
  | head -n 10

Each step does something simple. Extract the last 10,000 lines, filter URLs where the status code is 404, sort them, count duplicates, re-sort by count descending, and keep only the top 10. To compose these pipelines freely, you need to know each piece.

stdin, stdout, stderr — Three Channels

Every process has three standard channels by default.

flowchart LR
    KBD["Keyboard / file"] -->|"stdin (0)"| PROC["Process"]
    PROC -->|"stdout (1)"| SCREEN1["Screen / file"]
    PROC -->|"stderr (2)"| SCREEN2["Screen / file"]

Why are stdout and stderr separate? Because when connecting via pipes, it would be problematic if error messages got mixed into the next program’s input. By separating output from errors, you can log errors while passing only the output to the next stage. This separation is one of the small masterstrokes of Unix design.
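
A quick way to see this in action is to pipe a command that produces both kinds of output. In the sketch below, the error message skips the pipe entirely and lands on the terminal, so wc counts only the stdout line (the exact ls wording varies by system).

# stderr bypasses the pipe, so only stdout reaches wc
ls /etc/hostname /no/such/file | wc -l
# ls: cannot access '/no/such/file': No such file or directory   <- stderr, straight to the terminal
# 1                                                              <- wc counted only the stdout line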

Redirection — Routing I/O to Files

If pipes connect “between processes,” redirection connects “between a process and a file.”

# Redirect stdout to a file (overwrite)
echo "hello" > greeting.txt

# Append stdout to a file
echo "world" >> greeting.txt

# Read file contents as stdin
sort < names.txt

# Redirect only stderr to a file
./run.sh 2> errors.log

# Redirect stdout and stderr to different files
./run.sh > out.log 2> err.log

# Merge stdout and stderr into the same file — used frequently
./run.sh > all.log 2>&1
# Order matters! > all.log must come first

# Discard output entirely — /dev/null is a "black hole"
./noisy-script.sh > /dev/null 2>&1

# In bash, you can abbreviate with &>
./run.sh &> all.log

The 2>&1 syntax feels unfamiliar at first. Expanded, it means “send file descriptor 2 (stderr) to the same place as &1 (file descriptor 1, i.e., wherever stdout points).” This is also why order matters. ./run.sh 2>&1 > all.log results in stderr still going to the terminal and only stdout going to the file — because at the time stderr is duplicated, stdout still points to the terminal.
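
If the order still feels abstract, a throwaway sh -c snippet (purely for demonstration) makes the difference visible:

# A tiny command that writes one line to stdout and one to stderr
sh -c 'echo out; echo err >&2' > both.log 2>&1   # both lines land in both.log
sh -c 'echo out; echo err >&2' 2>&1 > only.log   # "err" prints to the terminal; only "out" is in only.log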

tee — Splitting Output in Two

Sometimes you want to pipe output to the next command while also saving it to a file. The tee command fills this role. As its name suggests, it splits the pipe into a T-shape.

# Displayed on screen and saved to a file
./deploy.sh 2>&1 | tee deploy.log

# Append mode (-a)
./deploy.sh 2>&1 | tee -a deploy.log

# Writing to a file that requires root privileges
echo "127.0.0.1 test.local" | sudo tee -a /etc/hosts

The last example is interesting. sudo echo "127.0.0.1 test.local" >> /etc/hosts does not work. This is because redirection is handled by the shell, and the shell performing the redirection runs as the original user, not as root. The solution is sudo tee. tee runs under sudo, and since tee handles the file writing, the permission issue is resolved. This is a common idiom among server administrators.
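
An alternative is to run an entire subshell under sudo so that the redirection itself happens as root:

# The >> redirection is performed by the root shell, so this also works
sudo sh -c 'echo "127.0.0.1 test.local" >> /etc/hosts'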

grep — Searching for Patterns

grep picks out lines matching a pattern from input. The name originates from the ed editor’s g/re/p command (global/regular expression/print). The name itself shows that regular expressions were part of the design from the start.

# Lines containing "error" in a file
grep "error" /var/log/app.log

# Case-insensitive (-i)
grep -i "error" /var/log/app.log

# Only non-matching lines (-v, inverse)
grep -v "DEBUG" /var/log/app.log

# Include line numbers (-n)
grep -n "TODO" src/**/*.ts

# Count matches only (-c)
grep -c "ERROR" app.log

# Search recursively through directories (-r)
grep -r "TODO" src/

# Include surrounding lines (-A after, -B before, -C both)
grep -C 3 "panic" /var/log/syslog

# Works naturally with pipe input too
ps aux | grep nginx

For searching large codebases quickly, ripgrep (rg) is often recommended over grep. It respects .gitignore, runs in parallel, and has pretty color output by default. But the standard grep is available on any Linux system, and knowing the regex and options covers most needs.
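
For comparison, the equivalent recursive search with ripgrep (if installed) is simply:

# ripgrep recurses and respects .gitignore by default
rg "TODO" src/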

grep and Regular Expressions

grep uses basic regular expressions (BRE) by default. Metacharacters like +, ?, {}, and | must be escaped with \ or you need the -E (extended regular expression, ERE) option.

# "error" or "warn"
grep -E "error|warn" app.log

# Lines starting with a number
grep -E "^[0-9]+" app.log

# Specific HTTP status codes (40x, 50x)
grep -E " [45][0-9][0-9] " access.log

# Rough email search
grep -E "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}" contacts.txt

# Perl-style regex (-P, GNU grep) — supports \d, \w, etc.
grep -P "\berror\b" app.log

The first few months of learning regular expressions can be painful, but once they’re second nature, the perceived speed of working with text changes dramatically. As a tool for search-replace-extract, they’re embedded in virtually every programming language.

sed — The Stream Editor

sed (Stream EDitor) receives input line by line and edits it. As the name says, it’s a “streaming editor.” The most common use is substitution.

# Replace "foo" with "bar" and print (original unchanged)
sed 's/foo/bar/' file.txt

# Replace all occurrences in a line (g flag)
sed 's/foo/bar/g' file.txt

# Modify the original file directly (-i, in-place)
sed -i 's/foo/bar/g' file.txt

# macOS's BSD sed requires an argument after -i (an empty string means no backup)
sed -i '' 's/foo/bar/g' file.txt    # macOS
sed -i.bak 's/foo/bar/g' file.txt   # Leave a .bak backup (works on both)

# The delimiter doesn't have to be / — useful when changing paths
sed 's|/old/path|/new/path|g' config.txt

# Delete a specific line (line 3)
sed '3d' file.txt

# Delete lines matching a pattern
sed '/DEBUG/d' app.log

# Print a specific line range (lines 100-200)
sed -n '100,200p' huge.log

sed -i is frequently used in CI scripts for masking sensitive information in logs or automatically changing specific values in config files. However, sed -i modifies the original immediately. An incorrect pattern can be difficult to recover from, so it’s safer to verify the output without -i first, then add it.
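
One safe pattern, sketched here with a hypothetical config file and substitution, is to preview the change as a diff before reaching for -i:

# 1) Preview: diff the edited stream against the original (nothing is written)
sed 's/localhost/db.internal/g' app.conf | diff app.conf -

# 2) Apply in place, keeping a .bak backup just in case
sed -i.bak 's/localhost/db.internal/g' app.conf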

awk — The World of Rows and Columns

awk is the most powerful tool covered here. It has its own built-in programming language, so with enough determination, you can handle substantial data processing with awk alone. The name comes from the initials of its three creators (Aho, Weinberger, Kernighan).

The basic mental model of awk is this: Read input line by line, split by whitespace, store in $1, $2, $3..., then execute the program block. $0 refers to the entire line.

# Print only the first field of each line
ps aux | awk '{ print $1 }'

# Only the second (PID) and eleventh (COMMAND) fields
ps aux | awk '{ print $2, $11 }'

# Condition — only processes with CPU over 5%
ps aux | awk '$3 > 5 { print $2, $3, $11 }'

# Specify delimiter (default is whitespace, use comma for CSV)
awk -F',' '{ print $1 }' data.csv

# Add line numbers — NR is the current line number
awk '{ print NR, $0 }' file.txt

# Calculate a sum
cat numbers.txt | awk '{ sum += $1 } END { print sum }'

# Multiple values — average
awk '{ sum += $1; count++ } END { print sum/count }' nums.txt

One powerful aspect: the BEGIN { ... } block runs before reading any input, and the END { ... } block runs after all input is read. This makes them perfect for calculating sums, averages, and maximums.

# Count requests by status code in nginx logs
# access.log format: ... "GET /path HTTP/1.1" 200 1234 ...
awk '{ codes[$9]++ } END { for (c in codes) print c, codes[c] }' access.log
# 200 9823
# 404 142
# 500 7

codes[$9]++ means “use the 9th field (status code) as a key and increment the count.” The END block iterates over the map and prints. This single line does what a plain shell script would take about 10 lines to accomplish.
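
As a BEGIN example, the sketch below (assuming a hypothetical comma-separated data.csv with a numeric second column) sets the field separator before any input is read and tracks a maximum, another of the uses mentioned above:

# BEGIN runs once before the first line; END once after the last
awk 'BEGIN { FS = ","; max = 0 } $2 > max { max = $2 } END { print "max:", max }' data.csv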

sort, uniq, wc — The Trio

These three nearly always appear at the end of a pipeline to organize results.

# Alphabetical sort
sort names.txt

# Numeric sort (-n), reverse (-r)
sort -n numbers.txt
sort -rn numbers.txt

# Sort by a specific column (tab or space delimited, 2nd column)
sort -k2 data.txt

# Unique lines only (deduplicate) — input must be sorted
sort names.txt | uniq

# Count occurrences of each line (-c)
sort names.txt | uniq -c

# Lines appearing only once (-u), two or more times only (-d)
sort names.txt | uniq -u
sort names.txt | uniq -d

# Line count, word count, byte count
wc file.txt
#   23  184  1025  file.txt

# Line count only (-l), word count only (-w), bytes/characters (-c/-m)
wc -l *.log

The sort | uniq -c | sort -rn combination is the gold standard for “finding the most frequently occurring lines.” It covers 80% of log analysis.

# Top 10 IPs visiting today
awk '{ print $1 }' /var/log/nginx/access.log \
  | sort \
  | uniq -c \
  | sort -rn \
  | head -n 10

head, tail — Front and Back

We briefly saw these in Part 1, but they’re especially useful when combined with pipes.

# First N lines
head -n 20 file.log

# Last N lines
tail -n 20 file.log

# First N bytes
head -c 100 binary.bin

# Real-time follow
tail -f /var/log/app.log

# Follow multiple files simultaneously (file name headers are included)
tail -f /var/log/app.log /var/log/error.log

# Keep following even when logrotate swaps the file (-F)
tail -F /var/log/app.log

The difference between tail -f and tail -F matters. When a service rotates files with logrotate, -f keeps following the original file descriptor and misses the new file, while -F periodically re-opens the file path to follow the rotated new file.
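
You can simulate a rotation by hand to watch the difference; here is a sketch using a throwaway log file:

# Terminal 1: follow by path, not by file descriptor
tail -F test.log

# Terminal 2: do what logrotate does
mv test.log test.log.1
touch test.log
echo "new line" >> test.log   # tail -F picks this up; tail -f would not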

xargs — Turning Piped Values into Command Arguments

grep, find, ls output results line by line. But sometimes you want to run a command on each of those results. Commands like rm and mv don’t read from stdin — they only accept arguments. The bridge you need here is xargs.

# Run rm on find results — executed repeatedly
find . -name "*.tmp" | xargs rm

# Prompt before executing (-p, prompt)
find . -name "*.tmp" | xargs -p rm

# Handle filenames with spaces — always pair -0 with -print0
find . -name "*.tmp" -print0 | xargs -0 rm

# Insert received values in the middle of a command (-I)
ls *.md | xargs -I {} cp {} /backup/{}

# Process N at a time (-n)
echo "a b c d e f" | xargs -n 2 echo
# a b
# c d
# e f

# Parallel processing (-P)
find . -name "*.jpg" -print0 | xargs -0 -P 4 -I {} convert {} {}.webp

The xargs and find combination is a system administrator’s go-to weapon. However, if filenames contain spaces or special characters, not using -0 (NUL-delimited) will cause accidents. Modern find also has -exec with + form, allowing similar work without xargs.

# Instead of xargs, use find -exec
find . -name "*.tmp" -exec rm {} +

The critical difference between the two approaches is parallelization. xargs -P supports parallel execution, while -exec is sequential.
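
As a rough sketch, with a hypothetical resize command standing in for any slow per-file job:

# Sequential: one resize process at a time
find . -name "*.jpg" -exec resize {} \;

# Parallel: up to 4 resize processes at once, one file each
find . -name "*.jpg" -print0 | xargs -0 -n 1 -P 4 resize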

cut, tr — Character-Level Processing

These small tools fill in the gaps of pipelines.

# cut — extract specific columns
# Default delimiter is tab. Change with -d, specify fields with -f
echo "alice,30,developer" | cut -d',' -f1
# alice

echo "alice,30,developer" | cut -d',' -f1,3
# alice,developer

# Usernames and home directories from /etc/passwd
cut -d':' -f1,6 /etc/passwd

# tr — transliterate characters
# Convert to uppercase
echo "hello" | tr 'a-z' 'A-Z'
# HELLO

# Spaces to underscores
echo "my file.txt" | tr ' ' '_'
# my_file.txt

# Delete specific characters (-d)
echo "hello123" | tr -d '0-9'
# hello

# Squeeze consecutive characters into one (-s, squeeze)
echo "hello    world" | tr -s ' '
# hello world

Reach for cut when the task is simple column extraction, and switch to awk when conditions or calculations are involved. It takes some time to develop a feel for which to use; as a rough rule, cut for plain slicing, awk for anything with logic.
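
A side-by-side illustration of that rule, assuming a hypothetical comma-separated data.csv:

# Plain column extraction: cut is enough
cut -d',' -f2 data.csv

# The moment a condition appears, switch to awk
awk -F',' '$2 > 100 { print $2 }' data.csv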

Practical Example — Log Analysis

Let’s see how the tools covered so far come together in practice. We’ll extract some meaningful statistics from a hypothetical nginx access log.

LOG=/var/log/nginx/access.log

# 1) Total requests today
wc -l $LOG

# 2) Distribution by status code
awk '{ print $9 }' $LOG | sort | uniq -c | sort -rn

# 3) Top 10 client IPs
awk '{ print $1 }' $LOG | sort | uniq -c | sort -rn | head -n 10

# 4) URLs and frequency of 5xx errors
awk '$9 ~ /^5/ { print $7 }' $LOG | sort | uniq -c | sort -rn

# 5) Average response time (assuming last field is response time)
awk '{ sum += $NF; n++ } END { print sum / n }' $LOG

# 6) Filter for a specific time window (UTC 14:00)
grep "20/Apr/2026:14" $LOG | wc -l

# 7) 10 slowest requests (again assuming the last field is response time)
awk '{ print $NF, $7 }' $LOG | sort -rn | head -n 10

Each line is a collaboration of small tools connected by pipes. Once you can spontaneously compose combinations like these in your head, you can pull out the statistics you need on the spot without any log analysis tool. Whether digging through Kubernetes log archives, finding CI failure patterns, or detecting anomalous data in DB dumps — this language works.

First Half Summary

Over four parts, we’ve covered the most fundamental pieces of Linux: the shell and filesystem structure, file permissions and users/groups, processes and signals, and now text processing with pipes.

Starting from Part 5, we move to topics directly connected to practice. Network tools, Systemd service management, package management, and Bash scripting. These are the practical tools to stack on top of the fundamentals we’ve built so far.


In the next part, we’ll cover the basic tools for diagnosing network status and communicating with the outside world on Linux.

-> Part 5: Network Tools

