I'm not a regular expression expert. Not even an amateur in that area, really, and the same goes for Hadoop. Sigh.
But no fear, there is always the Internet, without which I would be of diminished value, at least until I have saved enough stuff on my local hard drive and built my own search engine on top of it. Alas, the day the world ends could be the day the Internet collapses. Is that even possible? I hope not.
So, regular expressions. It all started with a new dataset, delivered in Excel format. Some cells contained ALT-ENTER line breaks, used for the wordy descriptions in some columns. Saving the sheet as a CSV file introduces all kinds of newlines: CR, LF, CRLF, etc. To see such non-printable characters, open the file in Notepad++ and go to View > Show Symbol > Show End of Line.
In R, use grep to find the unwanted stuff and gsub to get rid of it. For my particular case, I used:
# CRLF pairs: list the affected entries, then strip them
grep('\\r\\n', x=data$ITEM_DESC, value=TRUE)
gsub('\\r\\n', '', x=data$ITEM_DESC) -> data$ITEM_DESC
# doubled line feeds
grep('\\n\\n', x=data$ITEM_DESC, value=TRUE)
gsub('\\n\\n', '', x=data$ITEM_DESC) -> data$ITEM_DESC
# any remaining single line feeds
grep('\\n', x=data$ITEM_DESC, value=TRUE)
gsub('\\n', '', x=data$ITEM_DESC) -> data$ITEM_DESC
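For what it's worth, the separate passes can probably be collapsed into one. This is only a minimal sketch, assuming the goal is simply to strip every CR and LF from the column. The vector item_desc is made up for illustration and stands in for data$ITEM_DESC.

```r
# Made-up sample standing in for data$ITEM_DESC (an assumption, not the real data)
item_desc <- c("line one\r\nline two", "a\n\nb", "c\nd", "no breaks", "old mac\rstyle")

# Flag the entries that contain a carriage return or a line feed
has_break <- grepl("[\r\n]", item_desc)

# Remove every run of CR/LF characters in a single pass
cleaned <- gsub("[\r\n]+", "", item_desc)

print(item_desc[has_break])
print(cleaned)
```

Replacing with " " instead of "" would keep the words on either side of a line break from running together, which may matter for wordy descriptions.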
Then write the cleaned data to a tab-delimited text file, for future use:
write.table(data, "out.txt", sep="\t")
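If the file is going to be read back into R later, a round trip might look like the sketch below. The small data frame and the temporary file path are made up for illustration; row.names = FALSE is my own addition, there only to keep write.table from writing the row numbers as an extra unnamed column.

```r
# Made-up data frame standing in for the cleaned dataset (an assumption)
data <- data.frame(ITEM_ID = c(1L, 2L),
                   ITEM_DESC = c("first item", "second item"),
                   stringsAsFactors = FALSE)

# Hypothetical output location; the post itself writes to "out.txt"
out_file <- file.path(tempdir(), "out.txt")

# Tab-delimited output without the row-number column
write.table(data, out_file, sep = "\t", row.names = FALSE)

# Read the file back in a later session
restored <- read.delim(out_file, stringsAsFactors = FALSE)
print(restored)
```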