Data Cleaning and Collection, Week 4
4.1 editing text variables
4.2 regular expressions
4.3 working with dates, data resources
4.1 editing text variables
Fixing character vectors - tolower(), toupper()
[1] "address" "direction" "street" "crossStreet" "intersection" "Location.1"
[1] "address" "direction" "street" "crossstreet" "intersection" "location.1"
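The two outputs above are consistent with calling tolower() on a vector of column names. A minimal sketch, assuming a hand-typed vector `camera_names` (a hypothetical stand-in, since the notes do not show how the data set was loaded):

```r
# Hypothetical stand-in for the column names of the data set
camera_names <- c("address", "direction", "street",
                  "crossStreet", "intersection", "Location.1")

# tolower() lower-cases every element; toupper() upper-cases them
lower_names <- tolower(camera_names)
upper_names <- toupper(camera_names)

print(lower_names)
```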
Fixing character vectors - strsplit()
- Good for automatically splitting variable names
- Important parameters: x, split
[1] "intersection"
[1] "Location" "1"
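One way to obtain output like this is to split each name on a literal dot. The sketch below assumes the same hypothetical `camera_names` vector; the dot is escaped because `split` is treated as a regular expression by default.

```r
# Hypothetical stand-in for the column names of the data set
camera_names <- c("address", "direction", "street",
                  "crossStreet", "intersection", "Location.1")

# "\\." splits on a literal dot; an unescaped "." would match any character
split_names <- strsplit(camera_names, "\\.")

split_names[[5]]  # no dot in "intersection", so it comes back as one piece
split_names[[6]]  # "Location.1" is split into "Location" and "1"
```

The result is a list with one character vector per input element, which is why the next slide takes a short detour into lists.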
Quick aside - lists
$letters
[1] "A" "b" "c"

$numbers
[1] 1 2 3

[[3]]
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    6   11   16   21
[2,]    2    7   12   17   22
[3,]    3    8   13   18   23
[4,]    4    9   14   19   24
[5,]    5   10   15   20   25
http://www.biostat.jhsph.edu/~ajaffe/lec_winterR/Lecture%203.pdf
Quick aside - lists
- $letters
- [1] "A" "b" "c"
[1] "A" "b" "c"
[1] "A" "b" "c"
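A list consistent with the printed output can be built and indexed as follows. This is a sketch; the name `mylist` and the construction of the matrix are assumptions inferred from the output.

```r
# Two named elements and one unnamed 5 x 5 matrix
mylist <- list(letters = c("A", "b", "c"),
               numbers = 1:3,
               matrix(1:25, ncol = 5))

first_as_list <- mylist[1]        # single brackets: a list of length 1
first_by_name <- mylist$letters   # the character vector itself
first_by_pos  <- mylist[[1]]      # double brackets: also the vector itself
```

Single brackets keep the list wrapper, while `$` and `[[ ]]` extract the element, which matches the three outputs shown on the slide.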
Fixing character vectors - sapply()
- Applies a function to each element in a vector or list
- Important parameters: X, FUN
[1] "Location"
[1] "address" "direction" "street" "crossStreet" "intersection" "Location"
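The cleaned names above can be produced by applying a small helper to each piece of the strsplit() result. In this sketch `first_element` is a hypothetical helper name and `camera_names` is the same assumed vector as before.

```r
# Hypothetical stand-in for the column names of the data set
camera_names <- c("address", "direction", "street",
                  "crossStreet", "intersection", "Location.1")
split_names <- strsplit(camera_names, "\\.")

# Helper that keeps only the first piece of each split name
first_element <- function(x) x[1]

split_names[[6]][1]                                 # "Location"
clean_names <- sapply(split_names, first_element)   # one name per element
print(clean_names)
```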
Peer review data
  id solution_id reviewer_id      start       stop time_left accept
1  1           3          27 1304095698 1304095758      1754      1
2  2           4          22 1304095188 1304095206      2306      1
  id problem_id subject_id      start       stop time_left answer
1  1        156         29 1304095119 1304095169      2343      B
2  2        269         25 1304095119 1304095183      2329      C
Fixing character vectors - sub()
- Important parameters: pattern, replacement, x
- [1] "id" "solution_id" "reviewer_id" "start" "stop" "time_left"
- [7] "accept"
[1] "id" "solutionid" "reviewerid" "start" "stop" "timeleft" "accept"
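A sketch of the substitution, assuming the column names are typed in by hand as `review_names` (a hypothetical variable; the notes do not show the data being loaded):

```r
# Column names as printed above
review_names <- c("id", "solution_id", "reviewer_id",
                  "start", "stop", "time_left", "accept")

# sub() replaces only the first match of the pattern in each element
clean_review_names <- sub("_", "", review_names)
print(clean_review_names)
```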
Fixing character vectors - gsub()
[1] "thisis_a_test"
[1] "thisisatest"
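The difference between the two outputs is that sub() stops after the first match while gsub() replaces every match. A sketch, assuming the input string was "this_is_a_test" (inferred from the outputs, not stated in the notes):

```r
test_name <- "this_is_a_test"   # assumed input, inferred from the outputs

first_only  <- sub("_", "", test_name)    # "thisis_a_test"
all_removed <- gsub("_", "", test_name)   # "thisisatest"
```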
Finding values - grep(),grepl()
[1] 4 5 36
-
- FALSE TRUE
- 77 3
More on grep()
[1] "The Alameda & 33rd St" "E 33rd & The Alameda" "Harford \n & The Alameda"
integer(0)
[1] 0
http://www.biostat.jhsph.edu/~ajaffe/lec_winterR/Lecture%203.pdf
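The grep() family can be sketched on a small vector. The three "Alameda" strings below come from the output above; the other two entries and the variable name `intersections` are made up for illustration, so the indices differ from the `4 5 36` shown for the full data set.

```r
intersections <- c("Caton Ave & Benson Ave",          # hypothetical entry
                   "The Alameda & 33rd St",
                   "E 33rd & The Alameda",
                   "Pulaski Hwy & Moravia Park Dr",   # hypothetical entry
                   "Harford \n & The Alameda")

hits       <- grep("Alameda", intersections)                 # positions of matches
hit_values <- grep("Alameda", intersections, value = TRUE)   # the matching strings
is_alameda <- grepl("Alameda", intersections)                # one TRUE/FALSE per element
table(is_alameda)                                            # counts of FALSE and TRUE

# grepl() is convenient for subsetting
others <- intersections[!is_alameda]

# A pattern that matches nothing returns integer(0), which has length 0
no_match <- grep("JeffStreet", intersections)
length(no_match)
```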
More useful string functions
[1] 12
[1] "Jeffrey"
[1] "Jeffrey Leek"
More useful string functions
[1] "JeffreyLeek"
[1] "Jeff"
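The outputs on these two slides are consistent with the base R calls below. This is a sketch: the inputs are inferred from the outputs, and trimws() is used here as one base R way to strip surrounding whitespace, which may not be the function the lecture used.

```r
full_name <- "Jeffrey Leek"

n_chars <- nchar(full_name)             # 12 characters, counting the space
first   <- substr(full_name, 1, 7)      # "Jeffrey"
joined  <- paste("Jeffrey", "Leek")     # "Jeffrey Leek" (default separator is a space)
no_sep  <- paste0("Jeffrey", "Leek")    # "JeffreyLeek" (no separator)
trimmed <- trimws("Jeff      ")         # "Jeff" (trailing spaces removed)
```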
Important points about text in data sets
- Names of variables should be
  - All lower case when possible
  - Descriptive (Diagnosis versus Dx)
  - Not duplicated
  - Free of underscores, dots, and white space
- Variables with character values
  - Should usually be made into factor variables (depends on application)
  - Should be descriptive (use TRUE/FALSE instead of 0/1 and Male/Female versus 0/1 or M/F)
4.2 regular expressions
Regular expressions
- Regular expressions can be thought of as a combination of literals and metacharacters
- To draw an analogy with natural language, think of literal text forming the words of this language, and the metacharacters defining its grammar
- Regular expressions have a rich set of metacharacters
Literals
The literal “Obama” would match the following lines
- Politics r dum. Not 2 long ago Clinton was sayin Obama was crap n now she sez vote 4 him n unite? WTF? Screw em both + Mcain. Go Ron Paul!
- Clinton conceeds to Obama but will her followers listen??
- Are we sure Chelsea didn’t vote for Obama?
- thinking ... Michelle Obama is terrific!
- jetlag..no sleep...early mornig to starbux..Ms. Obama was moving
Regular Expressions
- The simplest pattern consists only of literals; a match occurs if the sequence of literals occurs anywhere in the text being tested
- What if we only want the word “Obama”? or sentences that end in the word “Clinton”, or “clinton” or “clinto”?
Regular Expressions
We need a way to express
- whitespace word boundaries
- sets of literals
- the beginning and end of a line
- alternatives (“war” or “peace”)
Metacharacters to the rescue!
Metacharacters
The metacharacter ^ represents the start of a line
^i think
will match the lines
- i think we all rule for participating
- i think i have been outed
- i think this will be quite fun actually
- i think i need to go to work
- i think i first saw zombo in 1999.
Metacharacters
$ represents the end of a line
morning$
will match the lines
- well they had something this morning
- then had to catch a tram home in the morning
- dog obedience school in the morning
- and yes happy birthday i forgot to say it earlier this morning
- I walked in the rain this morning
- good morning
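Both anchors can be tried with grepl(). In this sketch the first and third strings are taken from the slides, while the second and fourth are made-up counterexamples where the text appears but not at the anchored position.

```r
tweets <- c("i think we all rule for participating",
            "well i think so",         # hypothetical: "i think" not at the start
            "good morning",
            "morning has broken")      # hypothetical: "morning" not at the end

starts_with_i_think <- grepl("^i think", tweets)   # TRUE FALSE FALSE FALSE
ends_with_morning   <- grepl("morning$", tweets)   # FALSE FALSE TRUE FALSE
```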
Character Classes with []
We can list a set of characters we will accept at a given point in the match
[Bb][Uu][Ss][Hh]
will match the lines
- The democrats are playing, "Name the worst thing about Bush!"
- I smelled the desert creosote bush, brownies, BBQ chicken
- BBQ and bushwalking at Molonglo Gorge
- Bush TOLD you that North Korea is part of the Axis of Evil
- I’m listening to Bush - Hurricane (Album Version)
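A sketch of the character-class pattern in R. The last two strings are made up for illustration. The `ignore.case` argument of grepl() gives the same result here with a shorter pattern.

```r
texts <- c("Bush TOLD you that North Korea is part of the Axis of Evil",
           "BBQ and bushwalking at Molonglo Gorge",
           "BUSH league",        # hypothetical
           "no match here")      # hypothetical

# Each bracket pair accepts either case of one letter
class_match <- grepl("[Bb][Uu][Ss][Hh]", texts)

# Equivalent for this pattern: a plain literal with case ignored
flag_match <- grepl("bush", texts, ignore.case = TRUE)

print(class_match)   # TRUE TRUE TRUE FALSE
```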
Character Classes with []
^[Ii] am
will match
- i am so angry at my boyfriend i can’t even bear to look at him
- i am boycotting the apple store
- I am twittering from iPhone
- I am a very vengeful person when you ruin my sweetheart.
- I am so over this. I need food. Mmmm bacon...
Character Classes with []
Similarly, you can specify a range of letters [a-z] or [a-zA-Z]; notice that the order doesn’t matter
^[0-9][a-zA-Z]
will match the lines
- 7th inning stretch
- 2nd half soon to begin. OSU did just win something
- 3am - cant sleep - too hot still.. :(
- 5ft 7 sent from heaven
- 1st sign of starvagtion
Character Classes with []
When used at the beginning of a character class, the “^” is also a metacharacter and indicates matching characters NOT in the indicated class
[^?.]$
will match the lines
- i like basketballs
- 6 and 9
- dont worry... we all die anyway!
- Not in Baghdad
- helicopter under water? hmmm
That is, [^?.] stands for any character other than "?" and "."
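A short check of the negated class. The first two strings come from the slide; the last two are made-up lines that end in the excluded characters.

```r
texts <- c("i like basketballs",
           "helicopter under water? hmmm",
           "are you sure?",     # hypothetical: ends in "?"
           "the end.")          # hypothetical: ends in "."

# Inside [ ], "?" and "." are literal; the leading ^ negates the class
ends_ok <- grepl("[^?.]$", texts)   # TRUE TRUE FALSE FALSE
```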
More Metacharacters
“.” is used to refer to any character. So
9.11
will match the lines
- its stupid the post 9-11 rules
- if any 1 of us did 9/11 we would have been caught in days.
- NetBios: scanning ip 203.169.114.66
- Front Door 9:11:46 AM
- Sings: 0118999881999119725...3 !
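To see the effect of the unescaped dot, the sketch below compares it with an escaped one. The strings "9.11" and "911 call" are made up for illustration.

```r
texts <- c("its stupid the post 9-11 rules",
           "Front Door 9:11:46 AM",
           "9.11",          # hypothetical
           "911 call")      # hypothetical: nothing between the 9 and the 11

any_char    <- grepl("9.11", texts)     # TRUE TRUE TRUE FALSE
literal_dot <- grepl("9\\.11", texts)   # FALSE FALSE TRUE FALSE
```

The last string fails both patterns because "." must consume exactly one character between the 9 and the 11.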
More Metacharacters: |
This does not mean “pipe” in the context of regular expressions; instead it translates to “or”; we can use it to combine two expressions, the subexpressions being called alternatives
flood|fire
will match the lines
- is firewire like usb on none macs?
- the global flood makes sense within the context of the bible
- yeah ive had the fire on tonight
- ... and the floods, hurricanes, killer heatwaves, rednecks, gun nuts, etc.
More Metacharacters: |
We can include any number of alternatives...
flood|earthquake|hurricane|coldfire
will match the lines
- Not a whole lot of hurricanes in the Arctic.
- We do have earthquakes nearly every day somewhere in our State
- hurricanes swirl in the other direction
- coldfire is STRAIGHT!
- ’cause we keep getting earthquakes
More Metacharacters: |
The alternatives can be real expressions and not just literals
^[Gg]ood|[Bb]ad
will match the lines
- good to hear some good knews from someone here
- Good afternoon fellow american infidels!
- good on you-what do you drive?
- Katie... guess they had bad experiences...
- my middle name is trouble, Miss Bad News
More Metacharacters: ( and )
Subexpressions are often contained in parentheses to constrain the alternatives
^([Gg]ood|[Bb]ad)
will match the lines
- bad habbit
- bad coordination today
- good, becuase there is nothing worse than a man in kinky underwear
- Badcop, its because people want to use drugs
- Good Monday Holiday
- Good riddance to Limey
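The effect of the parentheses is easiest to see side by side: without them, ^ applies only to the first alternative. A sketch, in which "not so good" is a made-up counterexample:

```r
texts <- c("good to hear some good knews from someone here",
           "my middle name is trouble, Miss Bad News",
           "bad habbit",
           "not so good")    # hypothetical

# ^ binds only to [Gg]ood; [Bb]ad may occur anywhere in the line
loose  <- grepl("^[Gg]ood|[Bb]ad", texts)     # TRUE TRUE TRUE FALSE

# The parentheses make ^ apply to both alternatives
strict <- grepl("^([Gg]ood|[Bb]ad)", texts)   # TRUE FALSE TRUE FALSE
```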
More Metacharacters: ?
The question mark indicates that the indicated expression is optional
[Gg]eorge( [Ww]\.)? [Bb]ush
will match the lines
- i bet i can spell better than you and george bush combined
- BBC reported that President George W. Bush claimed God told him to invade I
- a bird in the hand is worth two george bushes
That is, the parenthesized group in the middle of the pattern is optional
One thing to note...
In the following
[Gg]eorge( [Ww]\.)? [Bb]ush
we wanted to match a “.” as a literal period; to do that, we had to “escape” the metacharacter, preceding it with a backslash. In general, we have to do this for any metacharacter we want to include in our match
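In R the backslash itself must be doubled inside a string literal, so the pattern is written with `\\.`. A sketch, in which "George Wx Bush" is a made-up string that shows what the escape prevents:

```r
texts <- c("i bet i can spell better than you and george bush combined",
           "President George W. Bush claimed",
           "George Wx Bush")     # hypothetical

# Escaped dot: the optional group only accepts a literal "."
escaped   <- grepl("[Gg]eorge( [Ww]\\.)? [Bb]ush", texts)   # TRUE TRUE FALSE

# Unescaped dot: "x" is accepted in place of the period
unescaped <- grepl("[Gg]eorge( [Ww].)? [Bb]ush", texts)     # TRUE TRUE TRUE
```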
More metacharacters: * and +
The * and + signs are metacharacters used to indicate repetition; * means “any number, including none, of the item” and + means “at least one of the item”
\(.*\)
will match the lines
- anyone wanna chat? (24, m, germany)
- hello, 20.m here... ( east area + drives + webcam )
- (he means older men)
- ()
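Because parentheses are themselves metacharacters, matching literal parentheses requires escaping them. The sketch below contrasts the escaped pattern with the unescaped one and with +. The string "no parentheses here" is made up for illustration.

```r
texts <- c("anyone wanna chat? (24, m, germany)",
           "()",
           "no parentheses here")    # hypothetical

literal_star <- grepl("\\(.*\\)", texts)   # TRUE TRUE FALSE
literal_plus <- grepl("\\(.+\\)", texts)   # TRUE FALSE FALSE: "()" has nothing inside
unescaped    <- grepl("(.*)", texts)       # TRUE TRUE TRUE: the group can match nothing
```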
More metacharacters: * and +
[0-9]+ (.*)[0-9]+
will match the lines
- working as MP here 720 MP battallion, 42nd birgade
- so say 2 or 3 years at colleage and 4 at uni makes us 23 when and if we fin
- it went down on several occasions for like, 3 or 4 *days*
- Mmmm its time 4 me 2 go 2 bed
More metacharacters: { and }
{ and } are referred to as interval quantifiers; they let us specify the minimum and maximum number of matches of an expression
[Bb]ush( +[^ ]+){1,5} +debate
will match the lines
- Bush has historically won all major debates he’s done.
- in my view, Bush doesn’t need these debates..
- bush doesn’t need the debates? maybe you are right
- That’s what Bush supporters are doing about the debate.
- Felix, I don’t disagree that Bush was poorly prepared for the debate.
- indeed, but still, Bush should have taken the debate more seriously.
- Keep repeating that Bush smirked and scowled during the debate
More metacharacters: { and }
- {m,n} means at least m but not more than n matches
- {m} means exactly m matches
- {m,} means at least m matches
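A sketch of the three interval forms on digit strings, followed by a "Bush ... debate" pattern with one to five intervening words. The group here ends with the word rather than with trailing spaces, because a group ending in ` +` would consume the single space that ` debate` needs when words are separated by one space. The digit strings and the last two sentences are made up for illustration.

```r
digits <- c("7", "42", "2014", "12345")

between_2_and_4 <- grepl("^[0-9]{2,4}$", digits)   # FALSE TRUE TRUE FALSE
exactly_4       <- grepl("^[0-9]{4}$", digits)     # FALSE FALSE TRUE FALSE
at_least_2      <- grepl("^[0-9]{2,}$", digits)    # FALSE TRUE TRUE TRUE

# Each repetition of the group consumes some spaces and one word
pattern <- "[Bb]ush( +[^ ]+){1,5} +debate"
texts <- c("Bush has historically won all major debates",
           "Keep repeating that Bush smirked and scowled during the debate",
           "Bush one two three four five six debate",   # hypothetical: six words
           "Bush debate")                               # hypothetical: no words
bush_debate <- grepl(pattern, texts)                    # TRUE TRUE FALSE FALSE
```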
More metacharacters: ( and ) revisited
- In most implementations of regular expressions, the parentheses not only limit the scope of alternatives divided by a “|”, but also can be used to “remember” text matched by the subexpression enclosed
- We refer to the matched text with \1, \2, etc.
More metacharacters: ( and ) revisited
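So the expression below, which begins with a space and uses \1 to require that the word captured by the parentheses appears a second time,
 +([a-zA-Z]+) +\1 +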
will match the lines
- time for bed, night night twitter!
- blah blah blah blah
- my tattoo is so so itchy today
- i was standing all all alone against the world outside...
- hi anybody anybody at home
- estudiando css css css css.... que desastritooooo
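Backreferences work in R's grepl() and gsub() as well. In this sketch the third string is a made-up non-match, and the gsub() call is an illustrative extension that collapses a doubled word; it uses `\\b` word boundaries with `perl = TRUE`, which is a choice made here rather than something from the slides.

```r
texts <- c("time for bed, night night twitter!",
           "blah blah blah blah",
           "nothing repeated here")    # hypothetical

# Leading space, a captured word, spaces, the same word again, then a space
has_repeat <- grepl(" +([a-zA-Z]+) +\\1 +", texts)   # TRUE TRUE FALSE

# \\1 can also be used in the replacement to keep one copy of the word
collapsed <- gsub("\\b([a-zA-Z]+) +\\1\\b", "\\1",
                  "my tattoo is so so itchy today", perl = TRUE)
print(collapsed)   # "my tattoo is so itchy today"
```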
More metacharacters: ( and ) revisited
The * is “greedy” so it always matches the longest possible string that satisfies the regular expression. So
^s(.*)s
matches the lines below; in each one, (.*) takes the longest string it can
- sitting at starbucks
- setting up mysql and rails
- studying stuff for the exams
- spaghetti with marshmallows
- stop fighting with crackers
- sore shoulders, stupid ergonomics
More metacharacters: ( and ) revisited
The greediness of * can be turned off with the ?, as in
^s(.*?)s$
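The difference between greedy and lazy matching shows up when the matched text is extracted. The sketch below drops the trailing $ so that the lazy version is free to stop early (with $ both versions must run to the end of the line), and uses `perl = TRUE` for the lazy quantifier.

```r
text <- "sitting at starbucks"

# Greedy: .* runs to the last "s" in the line
greedy <- regmatches(text, regexpr("^s(.*)s", text))

# Lazy: .*? stops at the first "s" that completes the match
lazy <- regmatches(text, regexpr("^s(.*?)s", text, perl = TRUE))

print(greedy)   # "sitting at starbucks"
print(lazy)     # "sitting at s"
```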
Summary
- Regular expressions are used in many different languages; not unique to R.
- Regular expressions are composed of literals and metacharacters that represent sets or classes of characters/words
- Text processing via regular expressions is a very powerful way to extract data from “unfriendly” sources (not all data comes as a CSV file)
- Used with the functions grep, grepl, sub, gsub and others that involve searching for text strings
(Thanks to Mark Hansen for some material in this lecture.)
4.3 working with dates and data resources
Starting simple
[1] "Sun Jan 12 17:48:33 2014"
[1] "character"
Formatting dates
- %d = day as number (01-31)
- %a = abbreviated weekday
- %A = unabbreviated weekday
- %m = month as number (01-12)
- %b = abbreviated month
- %B = unabbreviated month
- %y = two-digit year
- %Y = four-digit year
[1] "Sun Jan 12"
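The outputs on these two slides can be reproduced roughly as follows. This sketch uses a fixed date instead of the current date so that the result is repeatable; weekday and month names depend on the locale, so the "Sun Jan 12" form assumes an English locale.

```r
d1 <- date()       # current date and time as plain text
class(d1)          # "character"

d2 <- as.Date("2014-01-12")   # fixed date, used here instead of Sys.Date()
class(d2)                     # "Date"

weekday_month_day <- format(d2, "%a %b %d")   # e.g. "Sun Jan 12" in an English locale
numeric_form      <- format(d2, "%d/%m/%Y")   # "12/01/2014"
short_year        <- format(d2, "%y")         # "14"
```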
Creating dates
[1] "1960-01-01" "1960-01-02" "1960-03-31" "1960-07-30"
Time difference of -1 days
[1] -1
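Character strings become Date objects through as.Date() with a matching format string, after which dates can be subtracted. The sketch below uses numeric months so that it does not depend on the locale; strings with abbreviated month names, such as "1jan1960" with the format "%d%b%Y", are parsed the same way in an English locale. The input strings are assumptions inferred from the output.

```r
x <- c("01/01/1960", "02/01/1960", "31/03/1960", "30/07/1960")   # assumed inputs
z <- as.Date(x, format = "%d/%m/%Y")
print(z)

gap <- z[1] - z[2]             # a difftime object: "Time difference of -1 days"
gap_days <- as.numeric(gap)    # -1
```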
Converting to Julian
[1] "Sunday"
[1] "January"
- [1] 16082
- attr(,"origin")
- [1] "1970-01-01"
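A sketch of the three calls behind this output, assuming the date in question is 2014-01-12 (the date shown earlier in the notes). The weekday and month names depend on the locale.

```r
d2 <- as.Date("2014-01-12")

day_name   <- weekdays(d2)   # "Sunday" in an English locale
month_name <- months(d2)     # "January" in an English locale

# Number of days since the origin, 1970-01-01
days_since_origin <- julian(d2)
print(days_since_origin)
```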
Lubridate
[1] "2014-01-08 UTC"
[1] "2013-08-04 UTC"
[1] "2013-04-03 UTC"
http://www.r-statistics.com/2012/03/do-more-with-dates-and-times-in-r-with-lubridate-1-1-0/
This package is reportedly very pleasant to use when handling date data.
Dealing with times
[1] "2011-08-03 10:15:03 UTC"
[1] "2011-08-03 10:15:03 NZST"
Some functions have slightly different syntax
[1] 3
- [1] Tues
- Levels: Sun < Mon < Tues < Wed < Thurs < Fri < Sat
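The lubridate outputs above are consistent with its parsing functions ymd(), mdy(), dmy() and ymd_hms(), plus wday(). The sketch below assumes the lubridate package is installed and skips itself otherwise; the input strings are inferred from the outputs, and the printed form of the results (for example whether "UTC" is shown, or how weekday labels are abbreviated) varies between lubridate versions.

```r
if (requireNamespace("lubridate", quietly = TRUE)) {
  # The function name gives the order of year, month and day in the input
  d_ymd <- lubridate::ymd("20140108")
  d_mdy <- lubridate::mdy("08/04/2013")
  d_dmy <- lubridate::dmy("03-04-2013")

  # Date-times, optionally with an explicit time zone
  t_utc <- lubridate::ymd_hms("2011-08-03 10:15:03")
  t_nz  <- lubridate::ymd_hms("2011-08-03 10:15:03", tz = "Pacific/Auckland")

  # wday() numbers the days from Sunday = 1 by default
  new_year   <- lubridate::ymd("20130101")
  day_number <- lubridate::wday(new_year)
  day_label  <- lubridate::wday(new_year, label = TRUE)
}
```

Note the difference from base R, where weekdays() returns the day name as text rather than a number or an ordered factor.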
Notes and further resources
- More information in this nice lubridate tutorial http://www.r-statistics.com/2012/03/do-more-with-dates-and-times-in-r-with-lubridate-1-1-0/
- The lubridate vignette covers the same content http://cran.r-project.org/web/packages/lubridate/vignettes/lubridate.html
- Ultimately you want your dates and times as class "Date" or the classes "POSIXct", "POSIXlt". For more information type
?POSIXlt
Open Government Sites
- United Nations http://data.un.org/
- U.S. http://www.data.gov/
- United Kingdom http://data.gov.uk/
- France http://www.data.gouv.fr/
- Ghana http://data.gov.gh/
- Australia http://data.gov.au/
- Germany https://www.govdata.de/
- Hong Kong http://www.gov.hk/en/theme/psi/datasets/
- Japan http://www.data.go.jp/
- Many more http://www.data.gov/opendatasites
Collections by data scientists
- Hilary Mason http://bitly.com/bundles/hmason/1
- Peter Skomoroch https://delicious.com/pskomoroch/dataset
- Jeff Hammerbacher http://www.quora.com/Jeff-Hammerbacher/Introduction-to-Data-Science-Data-Sets
- Gregory Piatetsky-Shapiro http://www.kdnuggets.com/gps.html
- http://blog.mortardata.com/post/67652898761/6-dataset-lists-curated-by-data-scientists
More specialized collections
- Stanford Large Network Data
- UCI Machine Learning
- KDnuggets Datasets
- CMU StatLib
- Gene Expression Omnibus
- arXiv Data
- Public Data Sets on Amazon Web Services
Some APIs with R interfaces
- twitter and twitteR package
- figshare and rfigshare
- PLoS and rplos
- rOpenSci
- Facebook and Rfacebook
- Google Maps and RgoogleMaps