What Are Regular Expressions?
A regular expression — regex for short — is a sequence of characters that defines a search pattern. Think of it as a small specialized language for describing text patterns. Instead of searching for a specific word, you can search for a pattern like "any word that starts with a capital letter and ends with ing" or "a sequence of exactly five digits."
Regular expressions appear in most programming languages, text editors, command-line tools, and search interfaces that handle text. The core syntax is largely shared, so the basics transfer from one environment to the next, although details vary by engine.
Literal Characters and Metacharacters
The simplest regex is just a literal string. The pattern `hello` matches the text "hello" wherever it appears. But the real power comes from metacharacters — characters that have special meaning in regex syntax.
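The dot `.` matches any single character except a newline. So `h.t` matches "hat," "hot," "hit," and even "h3t." The caret `^` matches the start of a line, and the dollar sign `$` matches the end. Line-oriented tools such as grep apply these anchors to each line; in most programming languages they anchor to the whole string unless multiline mode is enabled. The pattern `^Hello` only matches "Hello" when it appears at the beginning of a line.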
Square brackets `[]` define a character class — a set of characters where any single one can match. `[aeiou]` matches any vowel. `[0-9]` matches any digit. `[A-Za-z]` matches any English letter. Adding a caret inside the brackets negates the class: `[^0-9]` matches any character that is not a digit.
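To see these metacharacters in action, here is a minimal sketch using Python's built-in `re` module. The sample strings and variable names are made up for illustration, and `re.MULTILINE` is passed explicitly because Python anchors `^` to the start of the whole string by default.

```python
import re

# The dot stands for exactly one character (anything but a newline).
candidates = ["hat", "hot", "h3t", "ht", "heat"]
dot_matches = [w for w in candidates if re.fullmatch(r"h.t", w)]
# dot_matches -> ['hat', 'hot', 'h3t']

# With re.MULTILINE, ^ matches at the start of every line.
text = "Hello there\nOh, Hello again\nHello once more"
line_starts = re.findall(r"^Hello", text, flags=re.MULTILINE)
# line_starts -> ['Hello', 'Hello'] (the second line does not start with it)

# A character class matches any one character from the set.
vowels = re.findall(r"[aeiou]", "regular expression")

# A leading caret inside the brackets negates the class.
non_digits = re.findall(r"[^0-9]", "a1b22c333")
# non_digits -> ['a', 'b', 'c']
```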
Quantifiers
Quantifiers control how many times a pattern element can repeat. The asterisk `*` means "zero or more times." The plus `+` means "one or more times." The question mark `?` means "zero or one time" — making the preceding element optional.
Curly braces let you specify exact counts. `a{3}` matches exactly three consecutive "a" characters. `a{2,5}` matches between two and five. `a{3,}` matches three or more.
Combining these with character classes is where patterns become truly useful. `[0-9]{3}-[0-9]{4}` matches a pattern like "555-1234." `[A-Z][a-z]+` matches a capitalized word.
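The quantifiers above can be tried out with a short Python sketch. The input strings are invented for the example, and `colou?r` is an extra illustrative pattern that does not come from the text.

```python
import re

# Curly braces: exact and ranged repetition counts.
exactly_three = re.fullmatch(r"a{3}", "aaa") is not None   # True
too_many = re.fullmatch(r"a{3}", "aaaa") is not None       # False
two_to_five = [s for s in ["a", "aa", "aaaaa", "aaaaaa"]
               if re.fullmatch(r"a{2,5}", s)]
# two_to_five -> ['aa', 'aaaaa']

# The question mark makes the preceding element optional.
spellings = re.findall(r"colou?r", "color and colour")
# spellings -> ['color', 'colour']

# Quantifiers combined with character classes.
phone_like = re.findall(r"[0-9]{3}-[0-9]{4}", "Call 555-1234 or 555-9876.")
# phone_like -> ['555-1234', '555-9876']
capitalized = re.findall(r"[A-Z][a-z]+", "Alice met Bob in Paris")
# capitalized -> ['Alice', 'Bob', 'Paris']
```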
Common Shorthand Classes
Typing `[0-9]` repeatedly gets tedious, so regex provides shorthand classes. `\d` matches any digit (equivalent to `[0-9]` in ASCII mode; some Unicode-aware engines also match digits from other scripts). `\w` matches any "word character": letters, digits, and underscores. `\s` matches any whitespace character, including spaces, tabs, and newlines.
Each of these has an uppercase negation. `\D` matches anything that is not a digit. `\W` matches non-word characters. `\S` matches non-whitespace.
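Here is a small Python sketch of the shorthand classes on an invented sample string. The last two lines show one engine-specific detail worth knowing: in Python 3, `\d` on text strings also matches digits from other scripts unless the `re.ASCII` flag is set.

```python
import re

sample = "Order 42 shipped to unit_7\ton 2024-01-15"

digits = re.findall(r"\d+", sample)
# digits -> ['42', '7', '2024', '01', '15']

words = re.findall(r"\w+", sample)
# The underscore counts as a word character, so 'unit_7' stays whole.

fields = re.split(r"\s+", sample)
# Splits on the spaces and the tab alike.

non_word = re.findall(r"\W", "a-b c!")
# non_word -> ['-', ' ', '!']

# Python-specific: \d is Unicode-aware unless re.ASCII is given.
arabic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE
unicode_digit = re.fullmatch(r"\d", arabic_three) is not None            # True
ascii_digit = re.fullmatch(r"\d", arabic_three, re.ASCII) is not None    # False
```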
Grouping and Alternation
Parentheses `()` create groups. Groups serve two purposes: they let you apply quantifiers to multi-character sequences, and they capture matched text for later reference.
The pattern `(ha)+` matches "ha," "haha," "hahaha," and so on — the plus applies to the entire group, not just the last character. Without parentheses, `ha+` would match "ha," "haa," "haaa" — the plus would only apply to the "a."
The pipe character `|` provides alternation — an "or" operation. `cat|dog` matches either "cat" or "dog." Combined with groups, `(cat|dog) food` matches "cat food" or "dog food."
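A quick Python sketch shows both roles of a group: scoping a quantifier and capturing text. The sentence being searched is made up for the example.

```python
import re

# The plus applies to the whole group "ha".
laugh = re.fullmatch(r"(ha)+", "hahaha") is not None       # True

# Without the group, the plus applies only to "a".
stretched = re.fullmatch(r"ha+", "haaa") is not None       # True
not_repeated = re.fullmatch(r"ha+", "haha") is not None    # False

# Alternation inside a group; the group also captures what matched.
sentence = "We sell dog food and cat food."
first = re.search(r"(cat|dog) food", sentence)
first_animal = first.group(1)
# first_animal -> 'dog'

# With a single capture group, findall returns the captured text only.
animals = re.findall(r"(cat|dog) food", sentence)
# animals -> ['dog', 'cat']
```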
Practical Examples You Can Use Today
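**Email validation (basic):** `^[\w.+-]+@[\w-]+\.[a-zA-Z]{2,}$` This matches a string that starts with word characters, dots, plus signs, or hyphens, followed by @, a domain name, a dot, and a top-level domain of at least two letters. Because the domain part `[\w-]+` cannot contain dots, addresses with subdomains such as user@mail.example.com are rejected, so treat this as a quick sanity check, not full validation.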
**Phone number extraction:** `\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}` This matches phone numbers in formats like (555) 123-4567, 555-123-4567, 555.123.4567, and 5551234567.
**URL detection:** `https?://[\w\-]+(\.[\w\-]+)+[/\w\-._~:?#@!$&'()*+,;=]*` This matches HTTP and HTTPS URLs with domain names and optional paths.
**Date formats:** `\d{4}[-/]\d{2}[-/]\d{2}` This matches dates in YYYY-MM-DD or YYYY/MM/DD format. It checks only the shape: mixed separators and impossible values such as 2024-13-45 also match.
**Remove HTML tags:** `<[^>]+>` This matches any HTML tag for removal — useful for quick text extraction, though a proper HTML parser is better for complex documents.
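The patterns above can be exercised with a short Python sketch. The patterns are copied from the list as written; the constant names and the sample inputs are assumptions made for illustration.

```python
import re

EMAIL = re.compile(r"^[\w.+-]+@[\w-]+\.[a-zA-Z]{2,}$")
PHONE = re.compile(r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}")
URL = re.compile(r"https?://[\w\-]+(\.[\w\-]+)+[/\w\-._~:?#@!$&'()*+,;=]*")
DATE = re.compile(r"\d{4}[-/]\d{2}[-/]\d{2}")
TAG = re.compile(r"<[^>]+>")

# Email: accepts a simple address, rejects one with a subdomain.
email_ok = EMAIL.match("jane.doe+news@example.com") is not None       # True
email_subdomain = EMAIL.match("jane@mail.example.com") is not None    # False

# Phone: findall returns every match in the text.
phones = PHONE.findall("Call (555) 123-4567, 555.123.4567, or 5551234567.")
# phones -> ['(555) 123-4567', '555.123.4567', '5551234567']

# URL: the pattern has a capture group, so read the full match via group(0).
url_match = URL.search("See https://example.com/docs?page=2 for details.")
url = url_match.group(0)
# url -> 'https://example.com/docs?page=2'

# Dates in either separator style.
dates = DATE.findall("Released 2024-01-15, patched 2024/02/03.")
# dates -> ['2024-01-15', '2024/02/03']

# Strip tags by replacing each match with an empty string.
plain = TAG.sub("", "<p>Hello <b>world</b></p>")
# plain -> 'Hello world'
```

Compiling each pattern once with `re.compile` is a convenience here, not a requirement; the module-level functions such as `re.findall` accept the same pattern strings.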
Anchors and Boundaries
Beyond `^` and `$` for line boundaries, the word boundary `\b` is extremely useful. It matches the position between a word character and a non-word character, or between a word character and the start or end of the text. `\bcat\b` matches "cat" as a whole word but not "category" or "concatenate."
This is essential for find-and-replace operations where you need to match whole words only. Without word boundaries, replacing "he" in a document would also affect "the," "she," "here," and every other word containing those letters.
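A brief Python sketch makes the difference concrete. The sentences and the replacement word "they" are invented for the example.

```python
import re

text = "The cat sat near the category list to concatenate files."

whole_words = re.findall(r"\bcat\b", text)
# whole_words -> ['cat']

anywhere = re.findall(r"cat", text)
# Three hits: "cat", plus the "cat" inside "category" and "concatenate".

sentence = "he said the other one, she agreed"

# With boundaries, only the standalone word is replaced.
careful = re.sub(r"\bhe\b", "they", sentence)
# careful -> 'they said the other one, she agreed'

# Without boundaries, "the", "other", and "she" are mangled too.
careless = re.sub(r"he", "they", sentence)
```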
Greedy vs. Lazy Matching
By default, quantifiers are greedy: they match as much text as possible. In the string "start middle end, then another end," the pattern `start.*end` matches the entire string because `.*` grabs everything it can before giving back just enough for the last "end" to match.
Adding a question mark after a quantifier makes it lazy, so it matches as little as possible. On the same string, `start.*?end` matches only "start middle end," stopping at the first "end" rather than the last.
This distinction matters when extracting content between delimiters. If you are trying to match individual HTML tags, a greedy `<.*>` would match from the first `<` to the last `>` in the line, swallowing everything in between. A lazy `<.*?>` matches each tag individually.
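The contrast is easiest to see side by side. This Python sketch uses two made-up inputs: a sentence containing "end" twice and a short HTML fragment.

```python
import re

text = "start middle end, then another end"

greedy_span = re.search(r"start.*end", text).group(0)
# Greedy: runs to the last "end", so it covers the whole string.

lazy_span = re.search(r"start.*?end", text).group(0)
# lazy_span -> 'start middle end'

html = "<b>bold</b> and <i>italic</i>"

greedy_tags = re.findall(r"<.*>", html)
# One match, from the first "<" to the last ">".

lazy_tags = re.findall(r"<.*?>", html)
# lazy_tags -> ['<b>', '</b>', '<i>', '</i>']
```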
Tips for Writing Better Regex
Start simple and add complexity. Write a pattern that matches your target, test it, then refine to exclude false positives. Trying to write a perfect regex in one shot usually fails.
Use online regex testers. Tools like regex101.com show real-time matches, explain each part of your pattern, and highlight capture groups. They accelerate learning enormously.
Comment complex patterns. Many regex engines support a verbose mode where you can add whitespace and comments. Use it for any pattern too long to read at a glance.
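As one example of verbose mode, Python exposes it through the `re.VERBOSE` flag. The sketch below rewrites the phone number pattern from earlier in this article with one comment per part; the name `PHONE_VERBOSE` and the sample sentence are illustrative, and other engines spell the flag differently.

```python
import re

# In verbose mode, whitespace outside character classes is ignored
# and "#" starts a comment that runs to the end of the line.
PHONE_VERBOSE = re.compile(
    r"""
    \(?\d{3}\)?   # area code, optionally wrapped in parentheses
    [-.\s]?       # optional separator: hyphen, dot, or whitespace
    \d{3}         # next three digits
    [-.\s]?       # optional separator
    \d{4}         # last four digits
    """,
    re.VERBOSE,
)

found = PHONE_VERBOSE.search("Call (555) 123-4567 today").group(0)
# found -> '(555) 123-4567'
```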
Know when not to use regex. Parsing nested structures like HTML, JSON, or programming languages is generally a job for proper parsers, not regular expressions. Regex works best for flat patterns in text.