Regular expression Puzzles

Question 1

Find: "\s* \s" 

Replace: ","

Explanation: The goal of this regex is to convert whitespace-separated text into CSV format. It finds one or more spaces before and after another space and replaces them with a comma.

Question 2

Find: "(\w+), (\w+), ((.+)$)" 

Replace: "\2 \1 (\3)"

Explanation: This regex captures three groups: two words and the remaining line. The replacement reorders them and wraps the third group in parentheses.

Question 3

Find: "(\d{4})" 

Replace: "\n(\1)"

Explanation: This expression locates any sequence of four digits and replaces it with the same digits on a new line, enclosed in parentheses.

Question 4

Find: "(\d{4})\s+(.*?)(\.mp3)" 

Replace: "\2_(\1\3)"

Explanation: This regex removes a 4-digit prefix and any following whitespace from a filename. It then moves the number to the end and renames the file with the pattern: title_(year.mp3).

Question 5

Find: "(\w)(?:\w*),(\w+),(\d+(?:\.\d+)?),(\d+)" 

Replace: "\1_\2,\4"

Explanation: This regex targets rows with four comma-separated fields: two words, a decimal number, and an integer. It abbreviates the first word to its first letter, appends an underscore, retains the second word and last number, and drops the decimal value.

Question 6

Find: "(\w)(?:\w*),(\w{4})\w*,\d+(?:\.\d+)?,(\d+)"  

Replace: "\1_\2,\3"

Explanation: This pattern captures the first letter of the first word, the first four letters of the second word, and the last integer. It formats the result as: firstLetter_fourLetters,lastNumber.

Question 7

Find: "(\w{3})\w*,(\w{3})\w*,\d+(?:\.\d+)?,(\d+)" 

Replace: "\1\2,\3"

Explanation: This regex extracts the first three letters of the genus and species names, followed by the last number in each line. It rearranges the order in the replacement string.

Question 8

Data Curation Using REGEX

a. The pathogen_binary Column

I observed a number of NA values instead of 0s and 1s. I replaced these with 0 using the regex

b. The Bombus spp and Host_spp Columns

There were inconsistencies in species naming, including special characters and underscores. I used [^\w\s\r\n] to remove unwanted characters, and then removed underscores using a separate regex.

c. The bee_caste Column

In this column, drones were labeled as “male.” I used a regex to search for “male” and replaced it with “drone” for clarity and consistency.