
Getting and Cleaning Data: week 4

4.1 editing text variables

4.2  regular expressions

4.3 working with Dates, Data Resources


4.1 editing text variables

Fixing character vectors - tolower(), toupper()

if(!file.exists("./data")){dir.create("./data")}
fileUrl <- "https://data.baltimorecity.gov/api/views/dz54-2aru/rows.csv?accessType=DOWNLOAD"
download.file(fileUrl,destfile="./data/cameras.csv",method="curl")
cameraData <- read.csv("./data/cameras.csv")
names(cameraData)
[1] "address"      "direction"    "street"       "crossStreet"  "intersection" "Location.1"  
tolower(names(cameraData))
[1] "address"      "direction"    "street"       "crossstreet"  "intersection" "location.1"  

Fixing character vectors - strsplit()

  • Good for automatically splitting variable names
  • Important parameters: x, split
splitNames = strsplit(names(cameraData),"\\.")
splitNames[[5]]
[1] "intersection"
splitNames[[6]]
[1] "Location" "1"       

Quick aside - lists

mylist <- list(letters = c("A", "b", "c"), numbers = 1:3, matrix(1:25, ncol = 5))
head(mylist)
$letters
[1] "A" "b" "c"

$numbers
[1] 1 2 3

[[3]]
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    6   11   16   21
[2,]    2    7   12   17   22
[3,]    3    8   13   18   23
[4,]    4    9   14   19   24
[5,]    5   10   15   20   25

http://www.biostat.jhsph.edu/~ajaffe/lec_winterR/Lecture%203.pdf

Quick aside - lists

mylist[1]
$letters
[1] "A" "b" "c"
mylist$letters
[1] "A" "b" "c"
mylist[[1]]
[1] "A" "b" "c"
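
The difference between single and double brackets is easy to miss in the output above, so here is a minimal sketch using the same mylist. The names sub_list and first_vec are made up for illustration.

```r
# The same list as above: two named elements and one unnamed matrix
mylist <- list(letters = c("A", "b", "c"), numbers = 1:3, matrix(1:25, ncol = 5))

sub_list  <- mylist[1]     # single brackets return a list of length 1
first_vec <- mylist[[1]]   # double brackets return the element itself

class(sub_list)                        # "list"
class(first_vec)                       # "character"
identical(first_vec, mylist$letters)   # TRUE
```

This matters for strsplit(), which returns a list: splitNames[[6]] gives the character vector, while splitNames[6] would still be a list.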

Fixing character vectors - sapply()

  • Applies a function to each element in a vector or list
  • Important parameters: X,FUN
splitNames[[6]][1]
[1] "Location"
firstElement <- function(x){x[1]}
sapply(splitNames,firstElement)
[1] "address"      "direction"    "street"       "crossStreet"  "intersection" "Location"    
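
The split-then-take-first-piece step can be reproduced without downloading the file. The sketch below hard-codes the column names printed above; camera_names, split_names and first_piece are illustrative names, and vapply() is swapped in for sapply() only because it checks the return type.

```r
# Column names copied from the names(cameraData) output above
camera_names <- c("address", "direction", "street",
                  "crossStreet", "intersection", "Location.1")

# "." is a regex metacharacter, so it is escaped as "\\."
split_names <- strsplit(camera_names, "\\.")

# Keep the first piece of every split name
first_piece <- vapply(split_names, function(x) x[1], character(1))
first_piece
# [1] "address" "direction" "street" "crossStreet" "intersection" "Location"
```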

Peer review data

fileUrl1 <- "https://dl.dropboxusercontent.com/u/7710864/data/reviews-apr29.csv"
fileUrl2 <- "https://dl.dropboxusercontent.com/u/7710864/data/solutions-apr29.csv"
download.file(fileUrl1,destfile="./data/reviews.csv",method="curl")
download.file(fileUrl2,destfile="./data/solutions.csv",method="curl")
reviews <- read.csv("./data/reviews.csv"); solutions <- read.csv("./data/solutions.csv")
head(reviews,2)
  id solution_id reviewer_id      start       stop time_left accept
1  1           3          27 1304095698 1304095758      1754      1
2  2           4          22 1304095188 1304095206      2306      1
head(solutions,2)
  id problem_id subject_id      start       stop time_left answer
1  1        156         29 1304095119 1304095169      2343      B
2  2        269         25 1304095119 1304095183      2329      C

Fixing character vectors - sub()

  • Important parameters: pattern, replacement, x
names(reviews)
[1] "id"          "solution_id" "reviewer_id" "start"       "stop"        "time_left"  
[7] "accept"     
sub("_","",names(reviews))
[1] "id"         "solutionid" "reviewerid" "start"      "stop"       "timeleft"   "accept"    

Fixing character vectors - gsub()

testName <- "this_is_a_test"
sub("_","",testName)
[1] "thisis_a_test"
gsub("_","",testName)
[1] "thisisatest"

Finding values - grep(),grepl()

grep("Alameda",cameraData$intersection)
[1]  4  5 36
table(grepl("Alameda",cameraData$intersection))

FALSE  TRUE 
   77     3 
cameraData2 <- cameraData[!grepl("Alameda",cameraData$intersection),]

More on grep()

grep("Alameda",cameraData$intersection,value=TRUE)
[1] "The Alameda  & 33rd St"   "E 33rd  & The Alameda"    "Harford \n & The Alameda"
grep("JeffStreet",cameraData$intersection)
integer(0)
length(grep("JeffStreet",cameraData$intersection))
[1] 0
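
A small sketch of why grepl() is the safer tool for subsetting. The intersections vector is hypothetical stand-in data, not the real cameraData column.

```r
# Hypothetical intersection names standing in for cameraData$intersection
intersections <- c("Caton Ave & Benson Ave", "The Alameda & 33rd St",
                   "E 33rd & The Alameda", "Pulaski Hwy & Moravia Park Dr")

hits_index   <- grep("Alameda", intersections)                # 2 3
hits_value   <- grep("Alameda", intersections, value = TRUE)  # the matching strings
hits_logical <- grepl("Alameda", intersections)               # FALSE TRUE TRUE FALSE

# When nothing matches, grep() returns integer(0); negating that index drops
# every element, while the logical vector from grepl() keeps them all
dropped_wrong <- intersections[-grep("JeffStreet", intersections)]   # character(0)
kept_right    <- intersections[!grepl("JeffStreet", intersections)]  # all 4 elements
```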

http://www.biostat.jhsph.edu/~ajaffe/lec_winterR/Lecture%203.pdf

More useful string functions

library(stringr)
nchar("Jeffrey Leek")
[1] 12
substr("Jeffrey Leek",1,7)
[1] "Jeffrey"
paste("Jeffrey","Leek")
[1] "Jeffrey Leek"

More useful string functions

paste0("Jeffrey","Leek")
[1] "JeffreyLeek"
str_trim("Jeff      ")
[1] "Jeff"

Important points about text in data sets

  • Names of variables should be
    • All lower case when possible
    • Descriptive (Diagnosis versus Dx)
    • Not duplicated
    • Not have underscores or dots or white spaces
  • Variables with character values
    • Should usually be made into factor variables (depends on application)
    • Should be descriptive (use TRUE/FALSE instead of 0/1 and Male/Female versus 0/1 or M/F)
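
The naming advice above can be bundled into one helper. This is only a sketch: clean_names is a hypothetical function, not something from the lecture or from a package.

```r
# Hypothetical helper applying the variable-name advice above
clean_names <- function(x) {
  x <- tolower(x)              # all lower case
  x <- gsub("[_. ]", "", x)    # remove underscores, dots and spaces
  if (anyDuplicated(x) > 0) {
    warning("duplicated variable names after cleaning")
  }
  x
}

cleaned <- clean_names(c("solution_id", "Location.1", "time left"))
cleaned
# [1] "solutionid" "location1"  "timeleft"
```

Inside a character class such as [_. ] the dot is a literal, so it needs no escaping there.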

4.2 regular expressions


Regular expressions

  • Regular expressions can be thought of as a combination of literals and metacharacters
  • To draw an analogy with natural language, think of literal text forming the words of this language, and the metacharacters defining its grammar
  • Regular expressions have a rich set of metacharacters

Literals

The literal “Obama” would match the following lines

  1. Politics r dum. Not 2 long ago Clinton was sayin Obama was crap n now she sez vote 4 him n unite? WTF? Screw em both + Mcain. Go Ron Paul!
  2. Clinton conceeds to Obama but will her followers listen??
  3. Are we sure Chelsea didn’t vote for Obama?
  4. thinking ... Michelle Obama is terrific!
  5. jetlag..no sleep...early mornig to starbux..Ms. Obama was moving

Regular Expressions

  • Simplest pattern consists only of literals; a match occurs if the sequence of literals occurs anywhere in the text being tested

  • What if we only want the word “Obama”? or sentences that end in the word “Clinton”, or “clinton” or “clinto”?

Regular Expressions

We need a way to express

  • whitespace word boundaries
  • sets of literals
  • the beginning and end of a line
  • alternatives (“war” or “peace”)

Metacharacters to the rescue!

Metacharacters

Some metacharacters represent positions in a line rather than characters; ^ represents the start of a line

^i think

will match the lines

  1. i think we all rule for participating
  2. i think i have been outed
  3. i think this will be quite fun actually
  4. i think i need to go to work
  5. i think i first saw zombo in 1999.

Metacharacters

$ represents the end of a line

morning$

will match the lines

  1. well they had something this morning
  2. then had to catch a tram home in the morning
  3. dog obedience school in the morning
  4. and yes happy birthday i forgot to say it earlier this morning
  5. I walked in the rain this morning
  6. good morning
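
To try the two anchors in R, a quick sketch with grepl(). The test lines are partly taken from the examples above and partly invented so that each pattern also has a non-match.

```r
lines <- c("i think we all rule for participating",
           "well i think so",
           "good morning",
           "morning coffee first")

starts_with_i_think <- grepl("^i think", lines)  # TRUE FALSE FALSE FALSE
ends_with_morning   <- grepl("morning$", lines)  # FALSE FALSE TRUE FALSE
```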

Character Classes with []

We can list a set of characters we will accept at a given point in the match

[Bb][Uu][Ss][Hh]

will match the lines

  1. The democrats are playing, "Name the worst thing about Bush!"
  2. I smelled the desert creosote bush, brownies, BBQ chicken
  3. BBQ and bushwalking at Molonglo Gorge
  4. Bush TOLD you that North Korea is part of the Axis of Evil
  5. I’m listening to Bush - Hurricane (Album Version)

Character Classes with []

^[Ii] am

will match

  1. i am so angry at my boyfriend i can’t even bear to look at him
  2. i am boycotting the apple store
  3. I am twittering from iPhone
  4. I am a very vengeful person when you ruin my sweetheart.
  5. I am so over this. I need food. Mmmm bacon...
Character Classes with []

Similarly, you can specify a range of letters [a-z] or [a-zA-Z]; notice that the order doesn’t matter

^[0-9][a-zA-Z]

will match the lines

  1. 7th inning stretch
  2. 2nd half soon to begin. OSU did just win something
  3. 3am - cant sleep - too hot still.. :(
  4. 5ft 7 sent from heaven
  5. 1st sign of starvagtion

Character Classes with []

When used at the beginning of a character class, the “^” is also a metacharacter and indicates matching characters NOT in the indicated class

[^?.]$

will match the lines

  1. i like basketballs
  2. 6 and 9
  3. dont worry... we all die anyway!
  4. Not in Baghdad
  5. helicopter under water? hmmm

That is, the pattern matches any line whose last character is something other than ? or .
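
The three character-class patterns from this part can be checked side by side. A minimal sketch; the tweets vector mixes lines from the examples above with one made-up line ("the end.").

```r
tweets <- c("7th inning stretch",
            "I am twittering from iPhone",
            "Not in Baghdad",
            "helicopter under water?",
            "the end.")

digit_then_letter <- grepl("^[0-9][a-zA-Z]", tweets)  # TRUE FALSE FALSE FALSE FALSE
starts_i_am       <- grepl("^[Ii] am", tweets)        # FALSE TRUE FALSE FALSE FALSE
no_final_punct    <- grepl("[^?.]$", tweets)          # TRUE TRUE TRUE FALSE FALSE
```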


More Metacharacters

“.” is used to refer to any character. So

9.11

will match the lines

  1. its stupid the post 9-11 rules
  2. if any 1 of us did 9/11 we would have been caught in days.
  3. NetBios: scanning ip 203.169.114.66
  4. Front Door 9:11:46 AM
  5. Sings: 0118999881999119725...3 !

More Metacharacters: |

This does not mean “pipe” in the context of regular expressions; instead it translates to “or”; we can use it to combine two expressions, the subexpressions being called alternatives

flood|fire

will match the lines

  1. is firewire like usb on none macs?
  2. the global flood makes sense within the context of the bible
  3. yeah ive had the fire on tonight
  4. ... and the floods, hurricanes, killer heatwaves, rednecks, gun nuts, etc.

More Metacharacters: |

We can include any number of alternatives...

flood|earthquake|hurricane|coldfire

will match the lines

  1. Not a whole lot of hurricanes in the Arctic.
  2. We do have earthquakes nearly every day somewhere in our State
  3. hurricanes swirl in the other direction
  4. coldfire is STRAIGHT!
  5. ’cause we keep getting earthquakes

More Metacharacters: |

The alternatives can be real expressions and not just literals

^[Gg]ood|[Bb]ad

will match the lines

  1. good to hear some good knews from someone here
  2. Good afternoon fellow american infidels!
  3. good on you-what do you drive?
  4. Katie... guess they had bad experiences...
  5. my middle name is trouble, Miss Bad News

More Metacharacters: ( and )

Subexpressions are often contained in parentheses to constrain the alternatives

^([Gg]ood|[Bb]ad)

will match the lines

  1. bad habbit
  2. bad coordination today
  3. good, becuase there is nothing worse than a man in kinky underwear
  4. Badcop, its because people want to use drugs
  5. Good Monday Holiday
  6. Good riddance to Limey
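
The effect of the parentheses shows up when the same lines are run through both patterns. A short sketch; the vector x reuses fragments of the example lines plus one invented non-match.

```r
x <- c("good to hear", "Miss Bad News", "not so good", "bad habbit")

# Without parentheses, ^ binds only to the first alternative
unanchored_bad <- grepl("^[Gg]ood|[Bb]ad", x)    # TRUE TRUE FALSE TRUE

# With parentheses, ^ applies to both alternatives
both_anchored  <- grepl("^([Gg]ood|[Bb]ad)", x)  # TRUE FALSE FALSE TRUE
```

"Miss Bad News" is the line that tells the two apart: it contains "Bad" but does not start with it.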

More Metacharacters: ?

The question mark indicates that the indicated expression is optional

[Gg]eorge( [Ww]\.)? [Bb]ush

will match the lines

  1. i bet i can spell better than you and george bush combined
  2. BBC reported that President George W. Bush claimed God told him to invade I
  3. a bird in the hand is worth two george bushes

That is, the part inside the parentheses in the middle is optional.

One thing to note...

In the following

[Gg]eorge( [Ww]\.)? [Bb]ush

we wanted to match a “.” as a literal period; to do that, we had to “escape” the metacharacter, preceding it with a backslash. In general, we have to do this for any metacharacter we want to include in our match.
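
One R-specific wrinkle: inside an R string literal the backslash itself has to be escaped, so the regex \. is typed as "\\.". A minimal sketch, where the third test string is invented to show what the unescaped dot would let through.

```r
x <- c("George W. Bush", "george bush", "George Wx Bush")

# "\\." in R source is the regex \. (a literal period)
escaped   <- grepl("[Gg]eorge( [Ww]\\.)? [Bb]ush", x)  # TRUE TRUE FALSE

# Unescaped, "." matches any character, so "Wx" is accepted too
unescaped <- grepl("[Gg]eorge( [Ww].)? [Bb]ush", x)    # TRUE TRUE TRUE
```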

More metacharacters: * and +

The * and + signs are metacharacters used to indicate repetition; * means “any number, including none, of the item” and + means “at least one of the item”

\(.*\)

will match the lines

  1. anyone wanna chat? (24, m, germany)
  2. hello, 20.m here... ( east area + drives + webcam )
  3. (he means older men)
  4. ()

More metacharacters: * and +

The * and + signs are metacharacters used to indicate repetition; * means “any number, including none, of the item” and + means “at least one of the item”

[0-9]+ (.*)[0-9]+

will match the lines

  1. working as MP here 720 MP battallion, 42nd birgade
  2. so say 2 or 3 years at colleage and 4 at uni makes us 23 when and if we fin
  3. it went down on several occasions for like, 3 or 4 *days*
  4. Mmmm its time 4 me 2 go 2 bed
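
Both repetition patterns can be tried in R. This sketch assumes the parentheses in the first pattern are meant literally and are therefore escaped, while those in the second pattern only group. The last two test strings are invented.

```r
x <- c("()",
       "(he means older men)",
       "720 MP battallion, 42nd birgade",
       "no parentheses here",
       "room 42")

# Literal parentheses with anything (or nothing) between them
has_parens <- grepl("\\(.*\\)", x)           # TRUE TRUE FALSE FALSE FALSE

# One or more digits, a space, anything, then one or more digits
two_numbers <- grepl("[0-9]+ (.*)[0-9]+", x) # FALSE FALSE TRUE FALSE FALSE
```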

More metacharacters: { and }

{ and } are referred to as interval quantifiers; they let us specify the minimum and maximum number of matches of an expression

[Bb]ush( +[^ ]+){1,5} +debate

will match the lines

  1. Bush has historically won all major debates he’s done.
  2. in my view, Bush doesn’t need these debates..
  3. bush doesn’t need the debates? maybe you are right
  4. That’s what Bush supporters are doing about the debate.
  5. Felix, I don’t disagree that Bush was poorly prepared for the debate.
  6. indeed, but still, Bush should have taken the debate more seriously.
  7. Keep repeating that Bush smirked and scowled during the debate

More metacharacters: { and }

  • {m,n} means at least m but not more than n matches
  • {m} means exactly m matches
  • {m,} means at least m matches
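
The three interval forms are easiest to compare on a toy input. The strings below are made up purely to count repetitions of "a".

```r
x <- c("ab", "aab", "aaab", "aaaab")

exactly_two  <- grepl("^a{2}b", x)    # FALSE TRUE FALSE FALSE
two_to_three <- grepl("^a{2,3}b", x)  # FALSE TRUE TRUE FALSE
at_least_two <- grepl("^a{2,}b", x)   # FALSE TRUE TRUE TRUE
```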

More metacharacters: ( and ) revisited

  • In most implementations of regular expressions, the parentheses not only limit the scope of alternatives divided by a “|”, but also can be used to “remember” text matched by the subexpression enclosed
  • We refer to the matched text with \1, \2, etc.

More metacharacters: ( and ) revisited

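So the expression below, where \1 repeats whatever the parenthesized group just matched (note that the pattern begins with a space before the first +),

 +([a-zA-Z]+) +\1 +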

will match the lines

  1. time for bed, night night twitter!
  2. blah blah blah blah
  3. my tattoo is so so itchy today
  4. i was standing all all alone against the world outside...
  5. hi anybody anybody at home
  6. estudiando css css css css.... que desastritooooo
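
In R the backreference is written "\\1", and it works both in the pattern and in the replacement of sub(). A small sketch; the second test line is an edited version of the tattoo example with the repeated word removed.

```r
x <- c("time for bed, night night twitter!",
       "my tattoo is so itchy today",
       "blah blah blah blah")

# A word, spaces, then the same word again (pattern starts with a space)
has_repeat <- grepl(" +([a-zA-Z]+) +\\1 +", x)  # TRUE FALSE TRUE

# The same group can be reused in the replacement to collapse the repeat
collapsed <- sub("([a-zA-Z]+) +\\1", "\\1", "so so itchy")
collapsed
# [1] "so itchy"
```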

More metacharacters: ( and ) revisited

The * is “greedy” so it always matches the longest possible string that satisfies the regular expression. So

^s(.*)s

matches the following lines; in each case it grabs the longest possible string

  1. sitting at starbucks
  2. setting up mysql and rails
  3. studying stuff for the exams
  4. spaghetti with marshmallows
  5. stop fighting with crackers
  6. sore shoulders, stupid ergonomics

More metacharacters: ( and ) revisited

The greediness of * can be turned off with the ?, as in

^s(.*?)s$
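
To see the difference in R, it helps to extract what was matched rather than just test for a match. In this sketch the trailing $ is left off so that the lazy version is free to stop early, and perl = TRUE is passed as a conservative choice for the lazy quantifier.

```r
x <- "sitting at starbucks"

# Greedy: .* runs to the last "s" in the string
greedy <- regmatches(x, regexpr("^s(.*)s", x))
greedy
# [1] "sitting at starbucks"

# Lazy: .*? stops at the first "s" it can
lazy <- regmatches(x, regexpr("^s(.*?)s", x, perl = TRUE))
lazy
# [1] "sitting at s"
```

With the $ kept in place both versions must reach the end of the line, so they match the same text.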

Summary

  • Regular expressions are used in many different languages; not unique to R.
  • Regular expressions are composed of literals and metacharacters that represent sets or classes of characters/words
  • Text processing via regular expressions is a very powerful way to extract data from “unfriendly” sources (not all data comes as a CSV file)
  • Used with the functions grep, grepl, sub, gsub and others that involve searching for text strings

(Thanks to Mark Hansen for some material in this lecture.)

4.3 working with Dates, Data Resources

Starting simple

d1 = date()
d1
[1] "Sun Jan 12 17:48:33 2014"
class(d1)
[1] "character"
d2 = Sys.Date()
d2
[1] "2014-01-12"
class(d2)
[1] "Date"

Formatting dates

%d = day as number (01-31), %a = abbreviated weekday, %A = unabbreviated weekday, %m = month (01-12), %b = abbreviated month, %B = unabbreviated month, %y = 2 digit year, %Y = four digit year

format(d2,"%a %b %d")
[1] "Sun Jan 12"
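
A reproducible sketch of the format codes, using a fixed date instead of today's date. The variable d is illustrative. Only the numeric codes have their output shown, because %a, %A, %b and %B print names in the session's locale.

```r
# Fixed date so the output does not depend on when the code is run
d <- as.Date("2014-01-12")

short_numeric <- format(d, "%d/%m/%y")  # "12/01/14"
year_month    <- format(d, "%Y-%m")     # "2014-01"

# Weekday and month names follow the current locale
long_form <- format(d, "%A %d %B %Y")
```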

Creating dates

x = c("1jan1960", "2jan1960", "31mar1960", "30jul1960"); z = as.Date(x, "%d%b%Y")
z
[1] "1960-01-01" "1960-01-02" "1960-03-31" "1960-07-30"
z[1] - z[2]
Time difference of -1 days
as.numeric(z[1]-z[2])
[1] -1

Converting to Julian

weekdays(d2)
[1] "Sunday"
months(d2)
[1] "January"
julian(d2)
[1] 16082
attr(,"origin")
[1] "1970-01-01"
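
The Julian value is just a count of days since the origin, which can be checked with a fixed date. A minimal sketch; the second origin (1 January 2014) is chosen only for illustration.

```r
d <- as.Date("2014-01-12")

# Days since the default origin, 1970-01-01
days_since_epoch <- as.numeric(julian(d))  # 16082

# A Date is stored as that same day count
same_count <- as.numeric(d)                # 16082

# A different origin can be supplied
days_into_year <- as.numeric(julian(d, origin = as.Date("2014-01-01")))  # 11
```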

Lubridate

library(lubridate); ymd("20140108")
[1] "2014-01-08 UTC"
mdy("08/04/2013")
[1] "2013-08-04 UTC"
dmy("03-04-2013")
[1] "2013-04-03 UTC"

http://www.r-statistics.com/2012/03/do-more-with-dates-and-times-in-r-with-lubridate-1-1-0/

This package is reported to be very pleasant to use when handling date data.

Dealing with times

ymd_hms("2011-08-03 10:15:03")
[1] "2011-08-03 10:15:03 UTC"
ymd_hms("2011-08-03 10:15:03",tz="Pacific/Auckland")
[1] "2011-08-03 10:15:03 NZST"
?Sys.timezone

Some functions have slightly different syntax

x = dmy(c("1jan2013", "2jan2013", "31mar2013", "30jul2013"))
wday(x[1])
[1] 3
wday(x[1],label=TRUE)
[1] Tues
Levels: Sun < Mon < Tues < Wed < Thurs < Fri < Sat

Notes and further resources


Open Government Sites

Gapminder is another website that has a lot of data about development, in particular human health.


Survey data from the United States

http://www.asdfree.com/

Infochimps Marketplace

http://www.infochimps.com/marketplace

