Scraping websites

%matplotlib inline

import pandas as pd

Will pandas solve my problems?

Some tables on web pages can also be read in with read_html. This works for tables that are present in the document's HTML, rather than rendered by JavaScript or some other technique. You can confirm this by inspecting the page's HTML source, though simply trying read_html is often quicker.
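One way to do that inspection programmatically is to count the `<table>` elements in the raw HTML, before any JavaScript has run. The sketch below uses only the standard library; the class name `TableCounter`, the helper `count_tables`, and the two sample strings are made up for illustration.

```python
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Count <table> start tags in raw HTML (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tables = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag names before calling this method.
        if tag == "table":
            self.tables += 1


def count_tables(html_text):
    """Return the number of <table> elements present in html_text."""
    counter = TableCounter()
    counter.feed(html_text)
    return counter.tables


# Illustrative inputs: one page ships its table in the HTML,
# the other only ships a placeholder that a script would fill in later.
static_page = "<html><body><table><tr><td>1</td></tr></table></body></html>"
scripted_page = "<html><body><div id='grid'></div><script>render()</script></body></html>"

print(count_tables(static_page))    # 1
print(count_tables(scripted_page))  # 0
```

A count of zero on the downloaded source suggests the table is built in the browser, in which case read_html will likely find nothing to parse.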

The Titanic data might have been displayed on a website like this:

Unlike the other read methods, which return a dataframe, read_html returns a list of dataframes. This is useful when a page contains more than one table. In this case there is only one table, so it returns a list with one item.

html_table_list = pd.read_html('data/titanic.html')
len(html_table_list)
1
html_table_list[0]
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S
8 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 2 347742 11.1333 NaN S
9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 0 237736 30.0708 NaN C
10 11 1 3 Sandstrom, Miss. Marguerite Rut female 4.0 1 1 PP 9549 16.7000 G6 S
11 12 1 1 Bonnell, Miss. Elizabeth female 58.0 0 0 113783 26.5500 C103 S
12 13 0 3 Saundercock, Mr. William Henry male 20.0 0 0 A/5. 2151 8.0500 NaN S
13 14 0 3 Andersson, Mr. Anders Johan male 39.0 1 5 347082 31.2750 NaN S
14 15 0 3 Vestrom, Miss. Hulda Amanda Adolfina female 14.0 0 0 350406 7.8542 NaN S
15 16 1 2 Hewlett, Mrs. (Mary D Kingcome) female 55.0 0 0 248706 16.0000 NaN S
16 17 0 3 Rice, Master. Eugene male 2.0 4 1 382652 29.1250 NaN Q
17 18 1 2 Williams, Mr. Charles Eugene male NaN 0 0 244373 13.0000 NaN S
18 19 0 3 Vander Planke, Mrs. Julius (Emelia Maria Vande... female 31.0 1 0 345763 18.0000 NaN S
19 20 1 3 Masselmani, Mrs. Fatima female NaN 0 0 2649 7.2250 NaN C
20 21 0 2 Fynney, Mr. Joseph J male 35.0 0 0 239865 26.0000 NaN S
21 22 1 2 Beesley, Mr. Lawrence male 34.0 0 0 248698 13.0000 D56 S
22 23 1 3 McGowan, Miss. Anna "Annie" female 15.0 0 0 330923 8.0292 NaN Q
23 24 1 1 Sloper, Mr. William Thompson male 28.0 0 0 113788 35.5000 A6 S
24 25 0 3 Palsson, Miss. Torborg Danira female 8.0 3 1 349909 21.0750 NaN S
25 26 1 3 Asplund, Mrs. Carl Oscar (Selma Augusta Emilia... female 38.0 1 5 347077 31.3875 NaN S
26 27 0 3 Emir, Mr. Farred Chehab male NaN 0 0 2631 7.2250 NaN C
27 28 0 1 Fortune, Mr. Charles Alexander male 19.0 3 2 19950 263.0000 C23 C25 C27 S
28 29 1 3 O'Dwyer, Miss. Ellen "Nellie" female NaN 0 0 330959 7.8792 NaN Q
29 30 0 3 Todoroff, Mr. Lalio male NaN 0 0 349216 7.8958 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
861 862 0 2 Giles, Mr. Frederick Edward male 21.0 1 0 28134 11.5000 NaN S
862 863 1 1 Swift, Mrs. Frederick Joel (Margaret Welles Ba... female 48.0 0 0 17466 25.9292 D17 S
863 864 0 3 Sage, Miss. Dorothy Edith "Dolly" female NaN 8 2 CA. 2343 69.5500 NaN S
864 865 0 2 Gill, Mr. John William male 24.0 0 0 233866 13.0000 NaN S
865 866 1 2 Bystrom, Mrs. (Karolina) female 42.0 0 0 236852 13.0000 NaN S
866 867 1 2 Duran y More, Miss. Asuncion female 27.0 1 0 SC/PARIS 2149 13.8583 NaN C
867 868 0 1 Roebling, Mr. Washington Augustus II male 31.0 0 0 PC 17590 50.4958 A24 S
868 869 0 3 van Melkebeke, Mr. Philemon male NaN 0 0 345777 9.5000 NaN S
869 870 1 3 Johnson, Master. Harold Theodor male 4.0 1 1 347742 11.1333 NaN S
870 871 0 3 Balkic, Mr. Cerin male 26.0 0 0 349248 7.8958 NaN S
871 872 1 1 Beckwith, Mrs. Richard Leonard (Sallie Monypeny) female 47.0 1 1 11751 52.5542 D35 S
872 873 0 1 Carlsson, Mr. Frans Olof male 33.0 0 0 695 5.0000 B51 B53 B55 S
873 874 0 3 Vander Cruyssen, Mr. Victor male 47.0 0 0 345765 9.0000 NaN S
874 875 1 2 Abelson, Mrs. Samuel (Hannah Wizosky) female 28.0 1 0 P/PP 3381 24.0000 NaN C
875 876 1 3 Najib, Miss. Adele Kiamie "Jane" female 15.0 0 0 2667 7.2250 NaN C
876 877 0 3 Gustafsson, Mr. Alfred Ossian male 20.0 0 0 7534 9.8458 NaN S
877 878 0 3 Petroff, Mr. Nedelio male 19.0 0 0 349212 7.8958 NaN S
878 879 0 3 Laleff, Mr. Kristo male NaN 0 0 349217 7.8958 NaN S
879 880 1 1 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) female 56.0 0 1 11767 83.1583 C50 C
880 881 1 2 Shelley, Mrs. William (Imanita Parrish Hall) female 25.0 0 1 230433 26.0000 NaN S
881 882 0 3 Markun, Mr. Johann male 33.0 0 0 349257 7.8958 NaN S
882 883 0 3 Dahlberg, Miss. Gerda Ulrika female 22.0 0 0 7552 10.5167 NaN S
883 884 0 2 Banfield, Mr. Frederick James male 28.0 0 0 C.A./SOTON 34068 10.5000 NaN S
884 885 0 3 Sutehall, Mr. Henry Jr male 25.0 0 0 SOTON/OQ 392076 7.0500 NaN S
885 886 0 3 Rice, Mrs. William (Margaret Norton) female 39.0 0 5 382652 29.1250 NaN Q
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

891 rows × 12 columns

df_html = html_table_list[0]
df_html.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

Your turn

What courses were offered last summer here?

https://www.sv.uio.no/english/research/phd/summer-school/courses-2017/

</div>

{:.input_area}
```python
courses = pd.read_html('https://www.sv.uio.no/english/research/phd/summer-school/courses-2018/')
```

{:.input_area}
```python
pd.concat(courses)
```

PhD courses 23 - 27 July 2018: PhD courses 30 July - 3 August 2018:
0 Case Study Research Methods Professor Andrew ... NaN
1 Urban Culture in Global Cities Professor Wend... NaN
2 The Political Economy of Public Policy Dr. Ch... NaN
3 The Nordic Model in a Global Context Professo... NaN
4 Introduction to Agent-based Modeling and Compu... NaN
5 Psychoanalysis is not what you think: Subject... NaN
0 NaN Elections and Democracy Professor José Antoni...
1 NaN Anthropologies and Aftermaths: Thinking, Narra...
2 NaN Collecting and Analyzing Big Data Associate P...
3 NaN Public Space: People, Power, and Political Eco...
4 NaN Mixed and Merged Methods: Toward a Methodologi...
5 NaN Comparative Policy Studies: Theories, Methods,...
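The staggered `NaN` values appear because the two tables have different column headers, so `pd.concat` keeps both columns and fills the gaps. A possible workaround, sketched here on two small stand-in dataframes (the course names, the `course` column label, and the variable names are all invented), is to give every table the same column name before concatenating:

```python
import pandas as pd

# Stand-ins for the two tables returned by read_html; each has one
# column whose header differs from the other's.
week_1 = pd.DataFrame({"PhD courses week 1": ["Course A", "Course B"]})
week_2 = pd.DataFrame({"PhD courses week 2": ["Course C"]})

# Concatenating as-is keeps both headers and pads with NaN.
stacked = pd.concat([week_1, week_2])

# Renaming each table's single column to a shared label lets the rows line up.
renamed = [
    table.rename(columns={table.columns[0]: "course"})
    for table in (week_1, week_2)
]
tidy = pd.concat(renamed, ignore_index=True)

print(stacked.shape)  # (3, 2)
print(tidy.shape)     # (3, 1)
```

This drops the information carried by the original headers (here, the week), so it may be worth storing that in a separate column first if it matters.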
If that doesn't work, see if the data is available as JSON.

https://www.indivisible.org

{:.input_area}
```python
indy = pd.read_json('https://indivisible-data.firebaseio.com/indivisible_groups.json')
```

{:.input_area}
```python
indy.head(15)
```
12029911 12029912 12029913 12029914 12029915 12029917 12029918 12029919 12029920 12029921 ... 23817638 23950129 24047864 24075701 24309894 24390351 24869547 25027188 26147098 26462557
city Fort Worth Hawley Charleston Ada Athens Kingwood Orange Long Beach Dana Point Tyler ... Clarkston Vienna Goldsboro Corpus Christi Scottsdale Missoula West Covina Shelbyville Rogersville
country NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
details NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
email NaN True True True NaN NaN True NaN True True ... True True True NaN True True True True True NaN
facebook https://www.facebook.com/indivisiblefw/ https://www.facebook.com/groups/penn10indivisi... NaN https://www.facebook.com/aheadoh/ http://www.facebook.com/100daysathens https://www.facebook.com/groups/1639853466311375/ https://www.facebook.com/ctprogressivecoalition/ NaN NaN NaN ... https://www.facebook.com/indivisibleasotincounty/ https://www.facebook.com/groups/1644467099182559/ NaN https://www.facebook.com/Corpus-Christi-Indivi... https://www.facebook.com/groups/Standindivisib... https://www.facebook.com/groups/1463471077007886 http://facebook.com/just.advocacy NaN https://www.facebook.com/groups/204843326800292/ https://www.facebook.com/indivisiblehawkinscou...
id 12029911 12029912 12029913 12029914 12029915 12029917 12029918 12029919 12029920 12029921 ... 23817638 23950129 24047864 24075701 24309894 24390351 24869547 25027188 26147098 26462557
interaction_count 2 9 0 5 2 0 0 17 2 0 ... 0 0 0 34 207 0 0 50 0 8
latitude 32.742058 41.404304 38.393184 40.784394 33.905911 30.046777 42.5773 33.756289 33.475120 32.235097 ... 46.362367 38.938421 35.381174 27.729894 33.601112 46.971063 34.066964 34.0489 35.4834 36.4066
longitude -97.381730 -75.118065 -81.595470 -83.813286 -83.323577 -95.221022 -72.3079 -118.130636 -117.705675 -95.320779 ... -117.282597 -77.275520 -78.062514 -97.385247 -111.809488 -114.111212 -117.937007 -111.094 -86.4603 -83.0063
name IndivisibleTX12 PENN8 Indivisible Appalachian Americans United AHEAD: Allen and Hardin for Election Action & ... 100+ Days of Action / Athens Indivisible TX-02 Connecticut Progressive Coalition MOVI Indivisible, Los Angeles N.O.P.E. (Not Our President Ever) Voices of East Texas ... Indivisible Asotin County Team Rise Wayne County Strong Corpus Christi Indivisible Stand Indivisible Arizona Boomer Brigade Social Justice Advocacy Project Indivisible Tohono Shelbyville Indivisible Hawkins County Indivisible
socials [{'category': 'facebook', 'url': 'https://www.... [{'category': 'facebook', 'url': 'https://www.... NaN [{'category': 'facebook', 'url': 'https://www.... [{'category': 'facebook', 'url': 'http://www.f... [{'category': 'facebook', 'url': 'https://www.... [{'category': 'facebook', 'url': 'https://www.... NaN NaN NaN ... [{'category': 'facebook', 'url': 'https://www.... [{'category': 'facebook', 'url': 'https://www.... NaN [{'category': 'facebook', 'url': 'https://www.... [{'category': 'facebook', 'url': 'https://www.... [{'category': 'facebook', 'url': 'https://www.... [{'category': 'facebook', 'url': 'http://faceb... NaN [{'category': 'facebook', 'url': 'https://www.... [{'category': 'facebook', 'url': 'https://www....
state TX PA WV OH GA TX CT CA CA TX ... WA VA NC TX AZ MT CA AZ TN Tennessee
tags [tx-12, companies import-1492081903938, red-st... [pa-10, companies import-1492081903938, margue... [wv-02, companies import-1492081903938, whitne... [oh-04, oh-05, companies import-1492081903938,... [ga-09, ga-10, companies import-1492081903938,... [tx-02, tx-08, companies import-1492081903938,... [ct-03, companies import-1492081903938, margue... [ca-47, companies import-1492081903938, ca dat... [ca-48, ca-49, companies import-1492081903938,... [tx-01, companies import-1492081903938, red-st... ... [wa-05] [va-10] NaN [tx-27] NaN NaN NaN NaN NaN NaN
twitter NaN https://twitter.com/PA10INDIVISIBLE NaN https://twitter.com/ahead_oh NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN https://twitter.com/stand_az NaN http://twitter.com/just_advocacy NaN NaN https://twitter.com/HctnIndivisible
zip 76107 18428 25302 45810 30605 77339 6477 90803 92629 75703 ... 99403 22182 27530 78411 85259 59808 91790

15 rows × 5230 columns

{:.input_area}
```python
indy.T
```
city country details email facebook id interaction_count latitude longitude name socials state tags twitter zip
12029911 Fort Worth NaN NaN NaN https://www.facebook.com/indivisiblefw/ 12029911 2 32.742058 -97.381730 IndivisibleTX12 [{'category': 'facebook', 'url': 'https://www.... TX [tx-12, companies import-1492081903938, red-st... NaN 76107
12029912 Hawley NaN NaN True https://www.facebook.com/groups/penn10indivisi... 12029912 9 41.404304 -75.118065 PENN8 Indivisible [{'category': 'facebook', 'url': 'https://www.... PA [pa-10, companies import-1492081903938, margue... https://twitter.com/PA10INDIVISIBLE 18428
12029913 Charleston NaN NaN True NaN 12029913 0 38.393184 -81.595470 Appalachian Americans United NaN WV [wv-02, companies import-1492081903938, whitne... NaN 25302
12029914 Ada NaN NaN True https://www.facebook.com/aheadoh/ 12029914 5 40.784394 -83.813286 AHEAD: Allen and Hardin for Election Action & ... [{'category': 'facebook', 'url': 'https://www.... OH [oh-04, oh-05, companies import-1492081903938,... https://twitter.com/ahead_oh 45810
12029915 Athens NaN NaN NaN http://www.facebook.com/100daysathens 12029915 2 33.905911 -83.323577 100+ Days of Action / Athens [{'category': 'facebook', 'url': 'http://www.f... GA [ga-09, ga-10, companies import-1492081903938,... NaN 30605
12029917 Kingwood NaN NaN NaN https://www.facebook.com/groups/1639853466311375/ 12029917 0 30.046777 -95.221022 Indivisible TX-02 [{'category': 'facebook', 'url': 'https://www.... TX [tx-02, tx-08, companies import-1492081903938,... NaN 77339
12029918 Orange NaN NaN True https://www.facebook.com/ctprogressivecoalition/ 12029918 0 42.5773 -72.3079 Connecticut Progressive Coalition [{'category': 'facebook', 'url': 'https://www.... CT [ct-03, companies import-1492081903938, margue... NaN 6477
12029919 Long Beach NaN NaN NaN NaN 12029919 17 33.756289 -118.130636 MOVI Indivisible, Los Angeles NaN CA [ca-47, companies import-1492081903938, ca dat... NaN 90803
12029920 Dana Point NaN NaN True NaN 12029920 2 33.475120 -117.705675 N.O.P.E. (Not Our President Ever) NaN CA [ca-48, ca-49, companies import-1492081903938,... NaN 92629
12029921 Tyler NaN NaN True NaN 12029921 0 32.235097 -95.320779 Voices of East Texas NaN TX [tx-01, companies import-1492081903938, red-st... NaN 75703
12029922 Kalamazoo NaN NaN NaN https://www.facebook.com/2020kzoosquad/ 12029922 0 42.263841 -85.617047 20/20 Kzoo Squad [{'category': 'facebook', 'url': 'https://www.... MI [mi-06, companies import-1492081903938, elena'... NaN 49008
12029923 Smithville NaN NaN True NaN 12029923 0 35.917978 -85.786903 Forward DeKalb NaN TN [tn-04, tn-06, companies import-1492081903938,... NaN 37166
12029925 Houston NaN NaN True https://www.facebook.com/groups/441401122867071/ 12029925 0 29.740970 -95.391301 Indivisible Houston Creative [{'category': 'facebook', 'url': 'https://www.... TX [tx-02, tx-18, companies import-1492081903938,... NaN 77006
12029926 Cutler NaN NaN True NaN 12029926 2 39.383437 -81.800411 Pioneer Resisters NaN OH [oh-06, companies import-1492081903938, elena'... NaN 45724
12029927 Upland NaN NaN NaN https://www.facebook.com/groups/1833432906910865 12029927 2 34.105282 -117.662035 Indivisible California District 31 [{'category': 'facebook', 'url': 'https://www.... CA [ca-27, ca-31, companies import-1492081903938,... https://twitter.com/indivisibleca31 91786
12029929 Orange NaN NaN NaN https://www.facebook.com/groups/1881639238735288/ 12029929 2 33.808450 -117.791737 Constituents of the California 45th Congressio... [{'category': 'facebook', 'url': 'https://www.... CA [ca-45, companies import-1492081903938, ca dat... NaN 92869
12029930 Orange NaN NaN NaN https://www.facebook.com/groups/45thdistrictvo... 12029930 2 33.808450 -117.791737 45th Congressional District of California Cons... [{'category': 'facebook', 'url': 'https://www.... CA [ca-45, companies import-1492081903938, ca dat... NaN 92869
12029932 New York NaN NaN True https://www.facebook.com/groups/311758132544636/ 12029932 7 40.731829 -73.989181 4hours4years [{'category': 'facebook', 'url': 'https://www.... NY [ny-16, ny-17, companies import-1492081903938,... NaN 10003
12029933 Atlanta NaN NaN True NaN 12029933 2 33.711546 -84.331796 5th District Resistance NaN GA [ga-05, companies import-1492081903938, whitne... NaN 30316
12029934 Everson NaN NaN NaN https://www.facebook.com/search/top/?q=indivis... 12029934 2 48.911829 -122.330175 Indivisible North Whatcom [{'category': 'facebook', 'url': 'https://www.... WA [wa-01, shohet wa groups import 1.3 -151501893... NaN 98247
12029935 Media NaN NaN True https://www.facebook.com/groups/383406765329243 12029935 128 39.920460 -75.416182 DelCo PA Indivisible [{'category': 'facebook', 'url': 'https://www.... PA [pa-01, pa-07, companies import-1492081903938,... NaN 19063
12029936 Salem NaN NaN True https://www.facebook.com/togethernorthshore 12029936 1 40.9009 -80.8568 Together North Shore [{'category': 'facebook', 'url': 'https://www.... MA [ma-06, companies import-1492081903938, margue... NaN 1945
12029938 Buffalo NaN NaN NaN https://www.facebook.com/groups/361760577506208/ 12029938 6 42.951932 -78.898883 Stronger Together WNY [{'category': 'facebook', 'url': 'https://www.... NY [ny-26, companies import-1492081903938, active... NaN 14207
12029939 Pleasant Hill NaN NaN True https://www.facebook.com/groups/413800492297720/ 12029939 59 37.954131 -122.076140 Indivisible Central Contra Costa County (indiv... [{'category': 'facebook', 'url': 'https://www.... CA [ca-11, companies import-1492081903938, chloe ... https://twitter.com/IndivisibleCCCC 94523
12029940 San Francisco NaN NaN NaN https://www.facebook.com/groups/1250893321638409/ 12029940 2 37.750021 -122.415201 An Action A Day [{'category': 'facebook', 'url': 'https://www.... CA [ca-12, companies import-1492081903938, chloe ... NaN 94110
12029944 San Diego NaN NaN True NaN 12029944 2 32.861727 -117.171224 Berniecrats NaN CA [ca-52, companies import-1492081903938, ca dat... NaN 92122
12029945 Oklahoma City NaN NaN NaN https://www.facebook.com/ppgpvotes 12029945 0 35.5417 -97.5649 Planned Parenthood Great Plains Votes [{'category': 'facebook', 'url': 'https://www.... OK [ok-05, companies import-1492081903938, compan... NaN NaN
12029946 Seattle NaN NaN True https://www.facebook.com/SeattleIndivisible/ 12029946 78 47.632810 -122.288511 Seattle Indivisible [{'category': 'facebook', 'url': 'https://www.... WA [wa-09, shohet wa groups import 1.3 -151501893... https://twitter.com/SEAindivisible 98112
12029947 West Warwick NaN NaN NaN https://www.facebook.com/groups/1770753343174786/ 12029947 0 41.7238 -71.4806 The Democratic Socialist Party (Facebook Group) [{'category': 'facebook', 'url': 'https://www.... RI [ri-02, companies import-1492081903938, margue... NaN 2893
12029948 Houston NaN NaN True NaN 12029948 0 29.798249 -95.416933 Progressive Happenings (Houston) NaN TX [tx-02, tx-18, tx-29, companies import-1492081... NaN 77008
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
22567286 Adams NaN NaN True NaN 22567286 0 43.886723 -89.807759 Adams County Wisconsin Indivisible NaN WI NaN NaN 53910
22707973 Belfast NaN NaN True NaN 22707973 0 44.463502 -69.037571 Indivisble Belfast NaN ME NaN NaN 04915
22718554 Highland Park NaN NaN True NaN 22718554 0 40.500795 -74.427911 Writers in League with Libraries NaN NJ [nj-06] NaN 08904
22728305 Cahokia NaN NaN True https://www.facebook.com/Indivisible-Metro-Eas... 22728305 57 38.572218 -90.166692 Indivisible Metro East [{'category': 'facebook', 'url': 'https://www.... IL [il-12, distributed fundraising - participant] https://twitter.com/IndivMetroEast 62206
22765123 San Jose NaN NaN True NaN 22765123 0 37.186141 -121.843555 Together Indivisible - South Bay NaN CA [ca-18] NaN 95120
22788401 Newton NaN NaN True https://www.facebook.com/groups/583530138665247/ 22788401 0 42.344457 -71.248617 Reach Teach Impeach [{'category': 'facebook', 'url': 'https://www.... MA [ma-04] NaN 02466
22795423 Fridley NaN NaN True NaN 22795423 0 45.096702 -93.253726 Purple People NaN MN [mn-05] NaN 55432
22796813 San Antonio NaN NaN True https://www.facebook.com/groups/SATXIndivisible/ 22796813 222 29.689580 -98.402411 SATX Indivisible [{'category': 'facebook', 'url': 'https://www.... TX NaN NaN 78261
22827251 KILMARNOCK NaN NaN True NaN 22827251 2 37.735631 -76.346113 Rappahannock Indivisible - We The People NaN VA NaN NaN 22482
22902041 Alexandria NaN NaN True NaN 22902041 0 38.771982 -77.057273 Indivisible Below the Beltway NaN VA [va-08] NaN 22307
22909792 Eugene NaN NaN True NaN 22909792 0 43.939557 -123.192759 South Hills NaN OR NaN NaN 97405
22926046 Spring Valley NaN NaN NaN NaN 22926046 0 32.726237 -116.994318 Spring Valley Vigilant Voters NaN CA NaN NaN 91977
22953832 new york NaN NaN True NaN 22953832 0 40.731829 -73.989181 Equality League NaN NY [ny-10] NaN 10003
22954822 Virginia Beach NaN NaN True https://www.facebook.com/groups/HRIndivisible/ 22954822 0 36.736543 -76.035469 Hampton Roads Indivisble Parent Action Group [{'category': 'facebook', 'url': 'https://www.... VA [va-02] https://twitter.com/IndivisibleVAHR 23456
22965111 Brooklyn NaN NaN True https://www.facebook.com/BrightestInc/ 22965111 0 40.678308 -73.919936 Brightest Stands With Indivisible [{'category': 'facebook', 'url': 'https://www.... NY [ny-09] https://twitter.com/brightest_inc 11233
23066874 Cuyahoga Falls NaN NaN True https://www.facebook.com/groups/1716405838439650/ 23066874 0 41.139266 -81.474873 Crooked River Action [{'category': 'facebook', 'url': 'https://www.... OH [oh-13] NaN 44221
23125421 NaN NaN NaN True https://www.facebook.com/ratifyERAil/ 23125421 0 40.704146 -89.417889 Ratify ERA Illinois [{'category': 'facebook', 'url': 'https://www.... IL NaN https://twitter.com/RatifyERAIL 61571
23170290 Minneapolis NaN NaN NaN NaN 23170290 0 44.938689 -93.221042 Indivisible: Lake Street Speaks NaN MN [mn-05] NaN 55406
23233495 Brenham NaN NaN NaN https://www.facebook.com/txruralvoices 23233495 0 30.215075 -96.410272 Texas Rural Voices [{'category': 'facebook', 'url': 'https://www.... TX [tx-10] NaN 77833
23379027 Titusville NaN NaN True https://www.facebook.com/groups/IndivisbleTitu... 23379027 0 28.533319 -80.792029 Indivisible Titusville [{'category': 'facebook', 'url': 'https://www.... FL [fl-08] NaN 32780
23817638 Clarkston NaN NaN True https://www.facebook.com/indivisibleasotincounty/ 23817638 0 46.362367 -117.282597 Indivisible Asotin County [{'category': 'facebook', 'url': 'https://www.... WA [wa-05] NaN 99403
23950129 Vienna NaN NaN True https://www.facebook.com/groups/1644467099182559/ 23950129 0 38.938421 -77.275520 Team Rise [{'category': 'facebook', 'url': 'https://www.... VA [va-10] NaN 22182
24047864 Goldsboro NaN NaN True NaN 24047864 0 35.381174 -78.062514 Wayne County Strong NaN NC NaN NaN 27530
24075701 Corpus Christi NaN NaN NaN https://www.facebook.com/Corpus-Christi-Indivi... 24075701 34 27.729894 -97.385247 Corpus Christi Indivisible [{'category': 'facebook', 'url': 'https://www.... TX [tx-27] NaN 78411
24309894 Scottsdale NaN NaN True https://www.facebook.com/groups/Standindivisib... 24309894 207 33.601112 -111.809488 Stand Indivisible Arizona [{'category': 'facebook', 'url': 'https://www.... AZ NaN https://twitter.com/stand_az 85259
24390351 Missoula NaN NaN True https://www.facebook.com/groups/1463471077007886 24390351 0 46.971063 -114.111212 Boomer Brigade [{'category': 'facebook', 'url': 'https://www.... MT NaN NaN 59808
24869547 West Covina NaN NaN True http://facebook.com/just.advocacy 24869547 0 34.066964 -117.937007 Social Justice Advocacy Project [{'category': 'facebook', 'url': 'http://faceb... CA NaN http://twitter.com/just_advocacy 91790
25027188 NaN NaN True NaN 25027188 50 34.0489 -111.094 Indivisible Tohono NaN AZ NaN NaN
26147098 Shelbyville NaN NaN True https://www.facebook.com/groups/204843326800292/ 26147098 0 35.4834 -86.4603 Shelbyville Indivisible [{'category': 'facebook', 'url': 'https://www.... TN NaN NaN
26462557 Rogersville NaN NaN NaN https://www.facebook.com/indivisiblehawkinscou... 26462557 8 36.4066 -83.0063 Hawkins County Indivisible [{'category': 'facebook', 'url': 'https://www.... Tennessee NaN https://twitter.com/HctnIndivisible

5230 rows × 15 columns
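The transpose is needed because the JSON maps each group id to a record, and `read_json` treats top-level keys as column labels by default. As a sketch of an alternative, `orient='index'` asks pandas to treat those keys as row labels instead. The two-record JSON string below is invented to mimic the shape of the file:

```python
from io import StringIO

import pandas as pd

# Invented sample: top-level keys are ids, values are records.
sample = (
    '{"101": {"city": "Ada", "latitude": 40.78, "state": "OH"},'
    ' "102": {"city": "Tyler", "latitude": 32.24, "state": "TX"}}'
)

# Default orientation: ids become columns, fields become rows.
by_column = pd.read_json(StringIO(sample))

# orient='index': ids become the row index, fields become columns.
by_row = pd.read_json(StringIO(sample), orient="index")

print(by_column.shape)  # (3, 2)
print(by_row.shape)     # (2, 3)
```

Reading with `orient='index'` also lets pandas infer a type per column, whereas transposing a mixed-type frame tends to leave every column as `object`.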

{:.input_area}
```python
pd.read_json?
```

{:.input_area}
```python
indy = pd.read_json('https://indivisible-data.firebaseio.com/indivisible_groups.json', orient = 'index')
```

{:.input_area}
```python
indy = indy.T
```

{:.input_area}
```python
indy.info()
```

{:.output .output_stream}
```
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5230 entries, 12029911 to 26462557
Data columns (total 15 columns):
city                 5130 non-null object
country              42 non-null object
details              3 non-null object
email                2599 non-null object
facebook             3081 non-null object
id                   5230 non-null object
interaction_count    5230 non-null object
latitude             5217 non-null object
longitude            5217 non-null object
name                 5230 non-null object
socials              3310 non-null object
state                5204 non-null object
tags                 5149 non-null object
twitter              702 non-null object
zip                  5057 non-null object
dtypes: object(15)
memory usage: 813.8+ KB
```

{:.input_area}
```python
indy.plot.scatter(y = 'latitude', x ='longitude')
```

{:.output .output_traceback_line}
```
---------------------------------------------------------------------------
```

{:.output .output_traceback_line}
```
ValueError                                Traceback (most recent call last)
```

{:.output .output_traceback_line}
```
in ()
----> 1 indy.plot.scatter(y = 'latitude', x ='longitude')
```

{:.output .output_traceback_line}
```
~/anaconda3/lib/python3.6/site-packages/pandas/plotting/_core.py in scatter(self, x, y, s, c, **kwds)
   2853         axes : matplotlib.AxesSubplot or np.array of them
   2854         """
-> 2855         return self(kind='scatter', x=x, y=y, c=c, s=s, **kwds)
   2856
   2857     def hexbin(self, x, y, C=None, reduce_C_function=None, gridsize=None,
```

{:.output .output_traceback_line}
```
~/anaconda3/lib/python3.6/site-packages/pandas/plotting/_core.py in __call__(self, x, y, kind, ax, subplots, sharex, sharey, layout, figsize, use_index, title, grid, legend, style, logx, logy, loglog, xticks, yticks, xlim, ylim, rot, fontsize, colormap, table, yerr, xerr, secondary_y, sort_columns, **kwds)
   2675                           fontsize=fontsize, colormap=colormap, table=table,
   2676                           yerr=yerr, xerr=xerr, secondary_y=secondary_y,
-> 2677                           sort_columns=sort_columns, **kwds)
   2678     __call__.__doc__ = plot_frame.__doc__
   2679
```

{:.output .output_traceback_line}
```
~/anaconda3/lib/python3.6/site-packages/pandas/plotting/_core.py in plot_frame(data, x, y, kind, ax, subplots, sharex, sharey, layout, figsize, use_index, title, grid, legend, style, logx, logy, loglog, xticks, yticks, xlim, ylim, rot, fontsize, colormap, table, yerr, xerr, secondary_y, sort_columns, **kwds)
   1900                  yerr=yerr, xerr=xerr,
   1901                  secondary_y=secondary_y, sort_columns=sort_columns,
-> 1902                  **kwds)
   1903
   1904
```

{:.output .output_traceback_line}
```
~/anaconda3/lib/python3.6/site-packages/pandas/plotting/_core.py in _plot(data, x, y, subplots, ax, kind, **kwds)
   1685         if isinstance(data, DataFrame):
   1686             plot_obj = klass(data, x=x, y=y, subplots=subplots, ax=ax,
-> 1687                              kind=kind, **kwds)
   1688         else:
   1689             raise ValueError("plot kind %r can only be used for data frames"
```

{:.output .output_traceback_line}
```
~/anaconda3/lib/python3.6/site-packages/pandas/plotting/_core.py in __init__(self, data, x, y, s, c, **kwargs)
    835             # the handling of this argument later
    836             s = 20
--> 837         super(ScatterPlot, self).__init__(data, x, y, s=s, **kwargs)
    838         if is_integer(c) and not self.data.columns.holds_integer():
    839             c = self.data.columns[c]
```

{:.output .output_traceback_line}
```
~/anaconda3/lib/python3.6/site-packages/pandas/plotting/_core.py in __init__(self, data, x, y, **kwargs)
    810             y = self.data.columns[y]
    811         if len(self.data[x]._get_numeric_data()) == 0:
--> 812             raise ValueError(self._kind + ' requires x column to be numeric')
    813         if len(self.data[y]._get_numeric_data()) == 0:
    814             raise ValueError(self._kind + ' requires y column to be numeric')
```

{:.output .output_traceback_line}
```
ValueError: scatter requires x column to be numeric
```

If that doesn't work, see if someone has already written a program for it.
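A note on the `ValueError` from the scatter plot above: `info()` reports every column as `object`, which is what transposing a mixed-type dataframe typically produces, so the plotting code finds no numeric data. A possible fix, sketched on a tiny stand-in frame (the values are invented; only the column names `latitude` and `longitude` come from the output above), is to convert the coordinate columns with `pd.to_numeric` before plotting:

```python
import pandas as pd

# Stand-in for the transposed frame: numbers stored as generic objects.
groups = pd.DataFrame(
    {"latitude": [32.74, 41.40, None], "longitude": [-97.38, -75.12, None]},
    dtype="object",
)
print(groups.dtypes.tolist())  # both columns are object

# errors='coerce' turns anything unparseable into NaN instead of raising.
for column in ["latitude", "longitude"]:
    groups[column] = pd.to_numeric(groups[column], errors="coerce")

print(groups.dtypes.tolist())  # both columns are float64
```

After the same conversion on `indy`, a call such as `indy.plot.scatter(y='latitude', x='longitude')` should have numeric columns to work with.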
scrape gofundme python

{:.input_area}
```python
from bs4 import BeautifulSoup
import requests

page = 1
links = set()
length = 0

while True:
    print("Page {}".format(page))
    gofundme = requests.get('https://www.gofundme.com/mvc.php?route=category/loadMoreTiles&page={}&term=medical-fundraising&country=GB&initialTerm='.format(page))
    soup = BeautifulSoup(gofundme.content, "html.parser")
    links.update([a['href'] for a in soup.find_all('a', href=True)])

    # Stop when no new links are found
    if len(links) == length:
        break

    length = len(links)
    page += 1

for link in sorted(links):
    print(link)
```

{:.output .output_stream}
```
Page 1
Page 2
Page 3
Page 4
```

{:.output .output_traceback_line}
```
---------------------------------------------------------------------------
```

{:.output .output_traceback_line}
```
TypeError                                 Traceback (most recent call last)
```

{:.output .output_traceback_line}
```
~/anaconda3/lib/python3.6/site-packages/urllib3/connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    379             try:  # Python 2.7, use buffering of HTTP responses
--> 380                 httplib_response = conn.getresponse(buffering=True)
    381             except TypeError:  # Python 2.6 and older, Python 3
```

{:.output .output_traceback_line}
```
TypeError: getresponse() got an unexpected keyword argument 'buffering'
```

{:.output .output_traceback_line}
```
During handling of the above exception, another exception occurred:
```

{:.output .output_traceback_line}
```
KeyboardInterrupt                         Traceback (most recent call last)
```

{:.output .output_traceback_line}
```
in ()
      8 while True:
      9     print("Page {}".format(page))
---> 10     gofundme = requests.get('https://www.gofundme.com/mvc.php?route=category/loadMoreTiles&page={}&term=medical-fundraising&country=GB&initialTerm='.format(page))
     11     soup = BeautifulSoup(gofundme.content, "html.parser")
     12     links.update([a['href'] for a in soup.find_all('a', href=True)])
```

{:.output .output_traceback_line}
```
~/anaconda3/lib/python3.6/site-packages/requests/api.py in get(url, params, **kwargs)
     70
     71     kwargs.setdefault('allow_redirects', True)
---> 72     return request('get', url, params=params, **kwargs)
     73
     74
```

{:.output .output_traceback_line}
```
~/anaconda3/lib/python3.6/site-packages/requests/api.py in request(method, url, **kwargs)
     56     # cases, and look like a memory leak in others.
     57     with sessions.Session() as session:
---> 58         return session.request(method=method, url=url, **kwargs)
     59
     60
```

{:.output .output_traceback_line}
```
~/anaconda3/lib/python3.6/site-packages/requests/sessions.py in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
    506         }
    507         send_kwargs.update(settings)
--> 508         resp = self.send(prep, **send_kwargs)
    509
    510         return resp
```

{:.output .output_traceback_line}
```
~/anaconda3/lib/python3.6/site-packages/requests/sessions.py in send(self, request, **kwargs)
    616
    617         # Send the request
--> 618         r = adapter.send(request, **kwargs)
    619
    620         # Total elapsed time of the request (approximately)
```

{:.output .output_traceback_line}
```
~/anaconda3/lib/python3.6/site-packages/requests/adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
    438                     decode_content=False,
    439                     retries=self.max_retries,
--> 440                     timeout=timeout
    441                 )
    442
```

{:.output .output_traceback_line}
```
~/anaconda3/lib/python3.6/site-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    599                                                   timeout=timeout_obj,
    600                                                   body=body, headers=headers,
--> 601                                                   chunked=chunked)
    602
    603             # If we're going to release the connection in ``finally:``, then
```

{:.output .output_traceback_line}
```
~/anaconda3/lib/python3.6/site-packages/urllib3/connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    381             except TypeError:  # Python 2.6 and older, Python 3
    382                 try:
--> 383                     httplib_response = conn.getresponse()
    384                 except Exception as e:
    385                     # Remove the TypeError from the exception chain in Python 3;
```

{:.output .output_traceback_line}
```
~/anaconda3/lib/python3.6/http/client.py in getresponse(self)
   1329         try:
   1330             try:
-> 1331                 response.begin()
   1332             except ConnectionError:
   1333                 self.close()
```

{:.output .output_traceback_line}
```
~/anaconda3/lib/python3.6/http/client.py in begin(self)
    295         # read until we get a non-100 response
    296         while True:
--> 297             version, status, reason = self._read_status()
    298             if status != CONTINUE:
    299                 break
```

{:.output .output_traceback_line}
```
~/anaconda3/lib/python3.6/http/client.py in _read_status(self)
    256
    257     def _read_status(self):
--> 258         line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
    259         if len(line) > _MAXLINE:
    260             raise LineTooLong("status line")
```

{:.output .output_traceback_line}
```
~/anaconda3/lib/python3.6/socket.py in readinto(self, b)
    584         while True:
    585             try:
--> 586                 return self._sock.recv_into(b)
    587             except timeout:
    588                 self._timeout_occurred = True
```

{:.output .output_traceback_line}
```
~/anaconda3/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py in recv_into(self, *args, **kwargs)
    278     def recv_into(self, *args, **kwargs):
    279         try:
--> 280             return self.connection.recv_into(*args, **kwargs)
    281         except OpenSSL.SSL.SysCallError as e:
    282             if self.suppress_ragged_eofs and e.args == (-1, 'Unexpected EOF'):
```

{:.output .output_traceback_line}
```
~/anaconda3/lib/python3.6/site-packages/OpenSSL/SSL.py in recv_into(self, buffer, nbytes, flags)
   1712             result = _lib.SSL_peek(self._ssl, buf, nbytes)
   1713         else:
-> 1714             result = _lib.SSL_read(self._ssl, buf, nbytes)
   1715         self._raise_ssl_error(self._ssl, result)
   1716
```

{:.output .output_traceback_line}
```
KeyboardInterrupt:
```

{:.input_area}
```python
links
```

{:.output .output_data_text}
```
{'https://www.gofundme.com/2ckfhnjd',
 'https://www.gofundme.com/338ii34',
 'https://www.gofundme.com/46snno8',
 'https://www.gofundme.com/47bd9nc',
 'https://www.gofundme.com/4mgv7cw',
 'https://www.gofundme.com/52kh49-help-save-skittles',
 'https://www.gofundme.com/58rz1k',
 'https://www.gofundme.com/882v7-medical-fundraising',
 'https://www.gofundme.com/Coleen-s-Medical-Fundraising',
 'https://www.gofundme.com/c8tn3nfw',
 'https://www.gofundme.com/chogtaa',
 'https://www.gofundme.com/cswjm-medical-fundraising',
 'https://www.gofundme.com/delinas-medical-fundraising',
 'https://www.gofundme.com/jeannette039s-medical-fundraising',
 'https://www.gofundme.com/joeyneedssomeeyes',
 'https://www.gofundme.com/medical-fundraising-for-barry-boys',
 'https://www.gofundme.com/medical-fundraising-for-leah',
 'https://www.gofundme.com/mrhueu',
 'https://www.gofundme.com/ovlee4',
 'https://www.gofundme.com/pahouasmedical',
 'https://www.gofundme.com/poes-medical-fundraising',
 'https://www.gofundme.com/pxd4u8',
 'https://www.gofundme.com/savingpaws4love',
 'https://www.gofundme.com/standupforedison',
 'https://www.gofundme.com/t8vfjrek',
 'https://www.gofundme.com/troy-alberding',
 'https://www.gofundme.com/umd642xg'}
```

If that doesn't work, see if someone has already written a general program for it.
If `from newspaper import Article` fails because the package is missing:

![](images/fail_to_install.png)

install it under its Python 3 name: `pip install newspaper3k`

{:.input_area}
```python
from newspaper import Article
```

{:.input_area}
```python
url = 'http://www.foxnews.com/tech/2018/07/31/facebook-finds-sophisticated-efforts-to-disrupt-us-politics-removes-32-accounts.html'
```

{:.input_area}
```python
article = Article(url)
```

{:.input_area}
```python
article.download()
```

{:.input_area}
```python
article.parse()
```

{:.input_area}
```python
article.authors
```

{:.output .output_data_text}
```
['Christopher Carbone']
```

{:.input_area}
```python
article.publish_date
```

{:.output .output_data_text}
```
datetime.datetime(2018, 7, 31, 0, 0)
```

{:.input_area}
```python
article.meta_data
```

{:.output .output_data_text}
```
defaultdict(dict, {'classification': '/FOX NEWS/TECH/COMPANIES/Facebook,/FOX NEWS/TECH,/FOX NEWS/NEWS EVENTS/Russia Investigation', 'classification-isa': 'facebook,tech', 'dc.creator': 'Christopher Carbone', 'dc.date': '2018-07-31', 'dc.description': 'Facebook said it has uncovered sophisticated efforts, possibly linked to Russia, to influence U.S. politics on its platforms..', 'dc.format': 'text/html', 'dc.identifier': 'ec9da88c-98b1-4f65-af33-45b67a0a1b81', 'dc.language': 'en-US', 'dc.publisher': 'Fox News', 'dc.source': 'Fox News', 'dc.title': "Facebook finds 'sophisticated' efforts to disrupt US politics, removes 32 accounts", 'dc.type': 'Text.Article', 'dcterms.abstract': 'Facebook said it has uncovered sophisticated efforts, possibly linked to Russia, to influence U.S. politics on its platforms..', 'dcterms.created': '2018-07-31 07:17:12 EDT', 'dcterms.modified': '2018-07-31 07:17:12 EDT', 'description': 'Facebook said it has uncovered sophisticated efforts, possibly linked to Russia, to influence U.S.
politics on its platforms..', 'fb': {'app_id': 113186182048399, 'pages': 15704546335}, 'og': {'description': 'Facebook said it has uncovered sophisticated efforts, possibly linked to Russia, to influence U.S. politics on its platforms..', 'image': 'http://a57.foxnews.com/media2.foxnews.com/BrightCove/694940094001/2018/07/31/0/0/694940094001_5816244313001_5816244794001-vs.jpg?ve=1', 'site_name': 'Fox News', 'title': "Facebook finds 'sophisticated' efforts to disrupt US politics, removes 32 accounts", 'type': 'article', 'url': 'http://www.foxnews.com/tech/2018/07/31/facebook-finds-sophisticated-efforts-to-disrupt-us-politics-removes-32-accounts.html'}, 'pagetype': 'article', 'prism.aggregationType': 'subsection', 'prism.channel': 'fnc', 'prism.section': 'tech', 'robots': 'noarchive, noodp', 'twitter': {'card': 'summary_large_image', 'creator': '@foxnews', 'description': 'Facebook said it has uncovered sophisticated efforts, possibly linked to Russia, to influence U.S. politics on its platforms..', 'image': 'http://a57.foxnews.com/images.foxnews.com/content/dam/fox-news/images/2018/04/04/facebook-logo-reuters.jpg.img.png/0/0/1522853765622.png?ve=1', 'site': '@foxnews', 'title': "Facebook finds 'sophisticated' efforts to disrupt US politics, removes 32 accounts", 'url': 'http://www.foxnews.com/tech/2018/07/31/facebook-finds-sophisticated-efforts-to-disrupt-us-politics-removes-32-accounts.html'}, 'viewport': 'width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no'}) ``` {:.input_area} ```python print(article.text) ``` {:.output .output_stream} ``` Facebook uncovered "sophisticated" efforts, possibly linked to Russia, to influence American politics in advance of the U.S. midterm elections. The company said in a blog post that it removed 32 accounts from Facebook and Instagram because they were involved in "coordinated" political behavior and appeared to be fake. 
Facebook did not explicitly say that the effort was aimed at influencing the midterm elections in November, but the timing of the suspicious activity would be consistent with such an attempt. The company, which said it is in the early stages of its investigation, held briefings in the House and Senate this week. The company said it doesn't know who is behind the efforts, but said there may be connections to Russia. Facebook said it has found some connections between the accounts it removed and the accounts connected to Russia's Internet Research Agency that it removed before and after the 2016 U.S. presidential elections. TWITTER BRINGS IN ANTI-TRUMP ACADEMICS TO FIGHT BIAS U.S. Senator Mark Warner, D-Va., who introduced the Honest Ads Act earlier this year to help prevent foreign interference in U.S. elections, praised Facebook's elections. “Today’s disclosure is further evidence that the Kremlin continues to exploit platforms like Facebook to sow division and spread disinformation, and I am glad that Facebook is taking some steps to pinpoint and address this activity," Warner said in a statement to Fox News. "I also expect Facebook, along with other platform companies, will continue to identify Russian troll activity and to work with Congress on updating our laws to better protect our democracy in the future,” he continued. "It’s clear that whoever set up these accounts went to much greater lengths to obscure their true identities than the Russian-based Internet Research Agency (IRA) has in the past," Facebook said in its statement. "We believe this could be partly due to changes we’ve made over the last year to make this kind of abuse much harder." FACEBOOK, TECH GIANTS CREATING 'CRISIS IN DEMOCRACY,' UK REPORT SAYS "The goal of these operations is to sow discord, distrust, and division in an attempt to undermine public faith in our institutions and our political system," Sen. Richard Burr, R-N.C., said in a statement. "The Russians want a weak America. 
There is still much that needs to be done to prevent and counter foreign interference on social media." The president has made it clear that his administration will not tolerate foreign interference into our electoral process from any nation state or other malicious actors," White House Deputy Press Secretary Hogan Gidley said in a statement late Tuesday. The earliest page was created in March 2017. Facebook says more than 290,000 accounts followed at least one of the fake pages. The most followed Facebook Pages had names such as "Aztlan Warriors," ''Black Elevation," ''Mindful Being," and "Resisters." Facebook says the pages ran about 150 ads for $11,000 on Facebook and Instagram, paid for in U.S. and Canadian dollars. The first ad was created in April 2017; the last was created in June 2018. The perpetrators used virtual private networks and internet phone services, and paid third parties to run ads on their behalf, according to Facebook. In addition, Facebook revealed that its partnership with the Atlantic Council helped it to identify the bad actors. One of the groups with roughly 4,000 members was located based on leads from U.S. Special Counsel Robert Mueller's recent indictment of 12 Russian nationals for their role in hacking and disinformation efforts during the 2016 U.S. presidential election. That group was created by Russian government figures but had been dormant since Facebook disabled its administrators last year. However, the tech company chose to remove the group to protect the privacy of its members in advance of a forthcoming report from the Atlantic Council that will analyze the Pages, profiles and accounts that Facebook disabled today. Freedom From Facebook, a group that has been pushing for the tech company to be broken up, said that today's announcement was not enough. 
“While it is good news that this latest attack on democracy via Facebook was discovered early, the fact that it happened highlights the danger posed when so much power is concentrated in a single company," said Sarah Miller, a campaigner with Freedom From Facebook. Miller continued: “We will not be safe from foreign interference -- and Facebook’s own business model of profiting off of bad actors -- until Congress and the FTC step in to break up the company and impose strong privacy rules.” Fox News' Bree Tracey and The Associated Press contributed to this report.
```

{:.input_area}
```python
def get_article_info(url):
    """Download and parse one article; return its key fields as a dict."""
    article = Article(url)
    article.download()
    article.parse()

    # One key per column of the eventual dataframe
    article_details = {'title': article.title,
                       'text': article.text,
                       'url': article.url,
                       'authors': article.authors,
                       'html': article.html,
                       'date': article.publish_date,
                       'description': article.meta_description,
                       'publisher': article.meta_data['dc.publisher']}
    return article_details
```

{:.input_area}
```python
article_details = get_article_info(url)
```

{:.input_area}
```python
article_details
```

{:.output .output_data_text}
```
{'authors': ['Christopher Carbone'], 'date': datetime.datetime(2018, 7, 31, 0, 0), 'description': 'Facebook said it has uncovered sophisticated efforts, possibly linked to Russia, to influence U.S.
politics on its platforms..', 'html': '\n\n\n<!DOCTYPE html>\n [... full page HTML omitted ...]
\n\n\n', 'publisher': 'Fox News', 'text': 'Facebook uncovered "sophisticated" efforts, possibly linked to Russia, to influence American politics in advance of the U.S. midterm elections.\n\nThe company said in a blog post that it removed 32 accounts from Facebook and Instagram because they were involved in "coordinated" political behavior and appeared to be fake.\n\nFacebook did not explicitly say that the effort was aimed at influencing the midterm elections in November, but the timing of the suspicious activity would be consistent with such an attempt.\n\nThe company, which said it is in the early stages of its investigation, held briefings in the House and Senate this week.\n\nThe company said it doesn\'t know who is behind the efforts, but said there may be connections to Russia. Facebook said it has found some connections between the accounts it removed and the accounts connected to Russia\'s Internet Research Agency that it removed before and after the 2016 U.S. presidential elections.\n\nTWITTER BRINGS IN ANTI-TRUMP ACADEMICS TO FIGHT BIAS\n\nU.S. Senator Mark Warner, D-Va., who introduced the Honest Ads Act earlier this year to help prevent foreign interference in U.S. elections, praised Facebook\'s elections.\n\n“Today’s disclosure is further evidence that the Kremlin continues to exploit platforms like Facebook to sow division and spread disinformation, and I am glad that Facebook is taking some steps to pinpoint and address this activity," Warner said in a statement to Fox News.\n\n"I also expect Facebook, along with other platform companies, will continue to identify Russian troll activity and to work with Congress on updating our laws to better protect our democracy in the future,” he continued.\n\n"It’s clear that whoever set up these accounts went to much greater lengths to obscure their true identities than the Russian-based Internet Research Agency (IRA) has in the past," Facebook said in its statement. 
"We believe this could be partly due to changes we’ve made over the last year to make this kind of abuse much harder."\n\nFACEBOOK, TECH GIANTS CREATING \'CRISIS IN DEMOCRACY,\' UK REPORT SAYS\n\n"The goal of these operations is to sow discord, distrust, and division in an attempt to undermine public faith in our institutions and our political system," Sen. Richard Burr, R-N.C., said in a statement. "The Russians want a weak America. There is still much that needs to be done to prevent and counter foreign interference on social media."\n\nThe president has made it clear that his administration will not tolerate foreign interference into our electoral process from any nation state or other malicious actors," White House Deputy Press Secretary Hogan Gidley said in a statement late Tuesday.\n\nThe earliest page was created in March 2017. Facebook says more than 290,000 accounts followed at least one of the fake pages. The most followed Facebook Pages had names such as "Aztlan Warriors," \'\'Black Elevation," \'\'Mindful Being," and "Resisters."\n\nFacebook says the pages ran about 150 ads for $11,000 on Facebook and Instagram, paid for in U.S. and Canadian dollars. The first ad was created in April 2017; the last was created in June 2018.\n\nThe perpetrators used virtual private networks and internet phone services, and paid third parties to run ads on their behalf, according to Facebook.\n\nIn addition, Facebook revealed that its partnership with the Atlantic Council helped it to identify the bad actors. One of the groups with roughly 4,000 members was located based on leads from U.S. Special Counsel Robert Mueller\'s recent indictment of 12 Russian nationals for their role in hacking and disinformation efforts during the 2016 U.S. presidential election.\n\nThat group was created by Russian government figures but had been dormant since Facebook disabled its administrators last year. 
However, the tech company chose to remove the group to protect the privacy of its members in advance of a forthcoming report from the Atlantic Council that will analyze the Pages, profiles and accounts that Facebook disabled today.\n\nFreedom From Facebook, a group that has been pushing for the tech company to be broken up, said that today\'s announcement was not enough.\n\n“While it is good news that this latest attack on democracy via Facebook was discovered early, the fact that it happened highlights the danger posed when so much power is concentrated in a single company," said Sarah Miller, a campaigner with Freedom From Facebook.\n\nMiller continued: “We will not be safe from foreign interference -- and Facebook’s own business model of profiting off of bad actors -- until Congress and the FTC step in to break up the company and impose strong privacy rules.”\n\nFox News\' Bree Tracey and The Associated Press contributed to this report.', 'title': "Facebook finds 'sophisticated' efforts to disrupt US politics, removes 32 accounts", 'url': 'http://www.foxnews.com/tech/2018/07/31/facebook-finds-sophisticated-efforts-to-disrupt-us-politics-removes-32-accounts.html'} ``` {:.input_area} ```python pd.DataFrame([article_details]) ```
| | authors | date | description | html | publisher | text | title | url |
|---|---|---|---|---|---|---|---|---|
| 0 | [Christopher Carbone] | 2018-07-31 | Facebook said it has uncovered sophisticated e... | `\n\n\n<!DOCTYPE html>\n<html lang="en">\n \...` | Fox News | Facebook uncovered "sophisticated" efforts, po... | Facebook finds 'sophisticated' efforts to disr... | http://www.foxnews.com/tech/2018/07/31/faceboo... |
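
One thing to watch when reusing `get_article_info` on other sites: the output above shows that `article.meta_data` is a `defaultdict(dict)`, so indexing a key that a page does not provide returns an empty dict instead of raising an error. The sketch below uses plain `defaultdict` objects as stand-ins for `article.meta_data`. The helper name `get_publisher` and the second example site are hypothetical, and it assumes you would rather store `None` than `{}` in the dataframe.

```python
from collections import defaultdict


def get_publisher(meta_data, key='dc.publisher'):
    """Return the publisher string, or None when the page does not provide one."""
    # .get() avoids the defaultdict side effect of creating an empty entry
    value = meta_data.get(key)
    # Only a non-empty string counts as a publisher; {} or None means missing
    if isinstance(value, str) and value.strip():
        return value.strip()
    return None


# Hypothetical stand-ins for article.meta_data from two different sites
with_publisher = defaultdict(dict, {'dc.publisher': 'Fox News'})
without_publisher = defaultdict(dict, {'og': {'site_name': 'Example Times'}})

print(get_publisher(with_publisher))      # Fox News
print(get_publisher(without_publisher))   # None
print(without_publisher['dc.publisher'])  # {} (plain indexing hides the gap)
```

Swapping `article.meta_data['dc.publisher']` for a call like `get_publisher(article.meta_data)` would be one way to keep the publisher column clean when articles come from several outlets.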

Your turn

Make a list of two article URLs from a newspaper. Then make a dataframe that contains the text and relevant metadata for each article, using `newspaper.Article`.

</div>

If that doesn't work, you have to scrape it yourself.

How I scrape a page

1. Look at the source code of a sample page.
2. Download the page.
3. Find the thing that you want, and the stuff around that thing.
4. Write a regular expression that matches what you want.
5. Write a regular expression that actually matches what you want.
6. Test that it works on one page.
7. Production!

How I could also scrape a page

1. Look at the source code of a sample page.
2. Download the page.
3. Find the thing that you want, and the stuff around that thing.
4. Parse the HTML.
5. Test that it works on one page.
6. Production!

{:.input_area}
```python
import requests
```

{:.input_area}
```python
url = 'http://mobilizationjournal.org/toc/maiq/22/2'
```

{:.input_area}
```python
# Build the same URL from its parts by concatenating strings
volume = 22
issue = 2
url = 'http://mobilizationjournal.org/toc/maiq/' + str(volume) + '/' + str(issue)
print(url)
```

{:.output .output_stream}
```
http://mobilizationjournal.org/toc/maiq/22/2
```

{:.input_area}
```python
# Same result using %-style string formatting
volume = 22
issue = 2
url = 'http://mobilizationjournal.org/toc/maiq/%s/%s' % (volume, issue)
print(url)
```

{:.output .output_stream}
```
http://mobilizationjournal.org/toc/maiq/22/2
```

{:.input_area}
```python
page = requests.get(url)
```

{:.input_area}
```python
print(page.headers)
```

{:.output .output_stream}
```
{'Server': 'AtyponWS/7.1', 'X-XSS-Protection': '1; mode=block', 'X-Content-Type-Options': 'nosniff', 'X-Frame-Options': 'SAMEORIGIN', 'Cache-Control': 'no-cache', 'Pragma': 'no-cache', 'X-Webstats-RespID': 'c7b339248d1a5f28e7c8bbd0faa5d670', 'Content-Type': 'text/html; charset=UTF-8', 'Date': 'Wed, 01 Aug 2018 09:43:50 GMT', 'Vary': 'Accept-Encoding', 'Content-Encoding': 'gzip', 'Content-Length': '9577', 'Connection': 'Keep-Alive'}
```

{:.input_area}
```python
page.status_code
```

{:.output .output_data_text}
```
200
```

{:.input_area}
```python page.text ``` {:.output .output_data_text} ``` '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n\n\n\n\n\n\n\n\n\n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n\n\n\n\n\n \n\n\n\n\n\n\n\n \n Mobilization: An International Quarterly\n -\n \n Journal - Table of Contents\n \n \n\n\n \n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n \n \n\n\n\n\n\n\n\n\n\n\n\n\n \n\n\n\n\n\n \n \n \n \n \n \n\n\n\n\n\n\n\n\n\n\n\n\n\n \n \n \n\n\n\n\n\n\n \n\n\n\n\n\n\n \n\n\n\n\n\n\n\n

 [... remainder of the raw page HTML omitted ...]
\n\r\n\r\n\r\n \r\n\n\n\n\n\n\n' ``` {:.input_area} ```python from IPython.display import HTML ``` {:.input_area} ```python HTML(page.text) ```
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> Mobilization: An International Quarterly - Journal - Table of Contents
 
    Open access Open access     Full access Full access     Partial access Partial access     No access No access
   

Articles

131
ROUTING AROUND ORGANIZATIONS: SELF-DIRECTED POLITICAL CONSUMPTION
Jennifer Earl, Lauren Copeland and Bruce Bimber
Abstract | PDF (373 KB) 
No Access
155
BRINGING THE FACTORY BACK IN: THE CRUMBLING OF CONSENT AND THE MOLDING OF COLLECTIVE CAPACITY AT WORK
René Rojas
Abstract | PDF (357 KB) 
No Access
177
RESPONDING TO THE STREET: GOVERNMENT RESPONSES TO MASS PROTESTS IN DEMOCRACIES
Alejandro Milcíades Peña and Thomas Richard Davies
Abstract | PDF (1662 KB) 
No Access
201
VOICING OUTRAGE UNEVENLY: DEMOCRATIC DISSATISFACTION, NONPARTICIPATION, AND PARTICIPATION FREQUENCY IN THE 15-M CAMPAIGN
Martín Portos and Juan Masullo
Abstract | PDF (730 KB) 
No Access
223
WHAT IS TO BE DONE? AGENCY AND THE CAUSATION OF TRANSFORMATIVE EVENTS IN IRELAND'S 1916 RISING AND 1969 LONG MARCH
Lorenzo Bosi and Donagh Davis
Abstract | PDF (307 KB) 
No Access
245
WHY ONLY SOME LIFESTYLE ACTIVISTS AVOID STATE-ORIENTED POLITICS: A CASE STUDY IN THE BELGIAN ENVIRONMENTAL MOVEMENT
Joost de Moor, Sofie Marien and Marc Hooghe
Abstract | PDF (285 KB) 
No Access

Book Reviews

265
BOOK REVIEWS
Deana A. Rohlinger
Citation | PDF (167 KB) 
No Access

Alerts for the Journal

Click to get an email alert for every new issue of

Mobilization: An International Quarterly

Journal Information

ISSN: 1086-671X
Frequency: Quarterly

{:.input_area}
```python
page_html = page.text
```

Look at the source code of a sample page.

![](images/source.png)

In Chrome:

1. Open Chrome and navigate to the web page of your choice.
2. Click the Customize and control Google Chrome icon in the upper right-hand corner of the browser window.
3. From the drop-down menu that appears, select More tools and then Developer tools.

In Microsoft Edge:

1. Open Microsoft Edge and navigate to the web page of your choice.
2. Click the More icon in the upper right-hand corner of the screen.
3. Select F12 Developer Tools from the drop-down menu that appears.

![](images/search_html.png)

So our headline is here:

`div class="art_title">ROUTING AROUND ORGANIZATIONS: SELF-DIRECTED POLITICAL CONSUMPTION<`

which fits the pattern

`div class="art_title">HEADLINE<`

Regular expressions

xkcd - 208

![](https://raw.github.com/nealcaren/workshop_2014/master/notebooks/images/regular_expressions.png)

xkcd - 1171

![](https://raw.github.com/nealcaren/workshop_2014/master/notebooks/images/perl_problems.png)

{:.input_area}
```python
import re
```

`HEADLINE` becomes `.*?`

- `.` matches any character
- `*` keeps going, matching as many characters as needed
- `?` stops at the first occurrence of whatever comes next in the pattern (here, `<`)

{:.input_area}
```python
re.findall('div class="art_title">.*?<', page_html)
```

{:.output .output_data_text}
```
['div class="art_title">ROUTING AROUND ORGANIZATIONS: SELF-DIRECTED POLITICAL CONSUMPTION<',
 'div class="art_title">BRINGING THE FACTORY BACK IN: THE CRUMBLING OF CONSENT AND THE MOLDING OF COLLECTIVE CAPACITY AT WORK<',
 'div class="art_title">RESPONDING TO THE STREET: GOVERNMENT RESPONSES TO MASS PROTESTS IN DEMOCRACIES<',
 'div class="art_title">VOICING OUTRAGE UNEVENLY: DEMOCRATIC DISSATISFACTION, NONPARTICIPATION, AND PARTICIPATION FREQUENCY IN THE 15-M CAMPAIGN<',
 'div class="art_title">WHAT IS TO BE DONE? AGENCY AND THE CAUSATION OF TRANSFORMATIVE EVENTS IN IRELAND\'S 1916 RISING AND 1969 LONG MARCH<',
 'div class="art_title">WHY ONLY SOME LIFESTYLE ACTIVISTS AVOID STATE-ORIENTED POLITICS: A CASE STUDY IN THE BELGIAN ENVIRONMENTAL MOVEMENT<',
 'div class="art_title">BOOK REVIEWS<']
```

To get just the headline, `HEADLINE` becomes `(.*?)`

- `.` matches any character
- `*` keeps going, matching as many characters as needed
- `?` stops at the first occurrence of whatever comes next in the pattern
- `()` means only the part in between the parentheses is returned

{:.input_area}
```python
re.findall('div class="art_title">(.*?)<', page_html)
```

{:.output .output_data_text}
```
['ROUTING AROUND ORGANIZATIONS: SELF-DIRECTED POLITICAL CONSUMPTION',
 'BRINGING THE FACTORY BACK IN: THE CRUMBLING OF CONSENT AND THE MOLDING OF COLLECTIVE CAPACITY AT WORK',
 'RESPONDING TO THE STREET: GOVERNMENT RESPONSES TO MASS PROTESTS IN DEMOCRACIES',
 'VOICING OUTRAGE UNEVENLY: DEMOCRATIC DISSATISFACTION, NONPARTICIPATION, AND PARTICIPATION FREQUENCY IN THE 15-M CAMPAIGN',
 "WHAT IS TO BE DONE? AGENCY AND THE CAUSATION OF TRANSFORMATIVE EVENTS IN IRELAND'S 1916 RISING AND 1969 LONG MARCH",
 'WHY ONLY SOME LIFESTYLE ACTIVISTS AVOID STATE-ORIENTED POLITICS: A CASE STUDY IN THE BELGIAN ENVIRONMENTAL MOVEMENT',
 'BOOK REVIEWS']
```

You might want to include `.*?` or `()` or any of `^$.|\+[{` as part of your search term. You can, with `\`. This tells your regular expression interpreter to treat the next character literally.

{:.input_area}
```python
re.findall('>PDF .*?<', page_html)
```

{:.output .output_data_text}
```
['>PDF (373 KB)<',
 '>PDF (357 KB)<',
 '>PDF (1662 KB)<',
 '>PDF (730 KB)<',
 '>PDF (307 KB)<',
 '>PDF (285 KB)<',
 '>PDF (167 KB)<']
```
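To see what the `?` buys you and how `\` escaping behaves, here is a small self-contained sketch. The HTML snippet and its `price` class are made up for illustration and do not come from the Mobilization page; `\d` (any digit) is one extra piece of regular expression syntax used here.

```python
import re

# Hypothetical snippet, invented for this example.
snippet = ('<div class="price">$4.99 (sale)</div>'
           '<div class="price">$12.50 (full)</div>')

# Greedy: .* runs to the LAST '<' it can reach, so both divs end up in one match.
greedy = re.findall(r'class="price">(.*)<', snippet)

# Non-greedy: .*? stops at the FIRST '<', so there is one match per div.
lazy = re.findall(r'class="price">(.*?)<', snippet)

# Escaping: \$ \. \( and \) match those characters literally.
# Two sets of () mean findall returns a tuple per match.
amounts = re.findall(r'\$(\d+\.\d+) \((.*?)\)', snippet)

print(greedy)
print(lazy)
print(amounts)
```

The pattern strings are written as raw strings (`r'...'`) so that Python passes the backslashes through to `re` untouched.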

Your turn

Find the sizes of all the linked PDFs!

Store the results somewhere, and try it on a different page.

{:.input_area}
```python
titles = re.findall('div class="art_title">(.*?)</div', page_html)
```

{:.input_area}
```python
titles
```

{:.input_area}
```python
volume = 23
issue = 2

url = 'http://mobilizationjournal.org/toc/maiq/%s/%s' % (volume, issue)
print(url)

page = requests.get(url)
page_html = page.text

titles = re.findall('div class="art_title">(.*?)</div', page_html)
print(titles)
```

Production time!!!

{:.input_area}
```python
volume = 20
for issue in range(1, 5):
    url = 'http://mobilizationjournal.org/toc/maiq/%s/%s' % (volume, issue)
    print(url)
```

{:.input_area}
```python
from time import sleep
```

{:.input_area}
```python
def get_moby_titles(page_html):
    titles = re.findall('div class="art_title">(.*?)</div', page_html)
    return titles

# Somewhere to store the titles
titles = []

volume = 20

# Loop through the four issues of one volume.
for issue in range(1, 5):
    # Construct the URL
    url = 'http://mobilizationjournal.org/toc/maiq/%s/%s' % (volume, issue)

    # Open the page and grab the HTML
    page = requests.get(url)
    page_html = page.text

    # Extract the headlines
    new_titles = get_moby_titles(page_html)

    # Add them to our headline list
    titles = titles + new_titles

    # Rest between requests
    sleep(1)
```

{:.input_area}
```python
len(titles)
```

{:.input_area}
```python
titles
```

{:.input_area}
```python
titles[:10]
```

{:.input_area}
```python
titles[-10:]
```

Time to get it out of here.

{:.input_area}
```python
import pandas as pd
pd.DataFrame(titles)
```

{:.input_area}
```python
pd.DataFrame(titles, columns=['Article Title'])
```

{:.input_area}
```python
moby_df = pd.DataFrame(titles, columns=['Article Title'])
moby_df.to_csv('moby_titles.csv', index=False)
```
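One thing a flat `titles` list loses is which issue each title came from. Below is a minimal sketch of keeping that information, assuming you already have the HTML for each issue in hand. The `pages` dictionary and its tiny HTML strings are placeholders invented for this example, and the standard library `csv` module stands in for pandas so the block runs without extra installs.

```python
import csv
import io
import re

def get_moby_titles(page_html):
    return re.findall('div class="art_title">(.*?)</div', page_html)

# Placeholder HTML, invented for illustration. In practice each value would be
# page.text from requests.get(url) for that volume and issue.
pages = {
    (20, 1): '<div class="art_title">FIRST TITLE</div>'
             '<div class="art_title">SECOND TITLE</div>',
    (20, 2): '<div class="art_title">THIRD TITLE</div>',
}

# One dictionary per title, carrying the volume and issue along with it.
rows = []
for (volume, issue), page_html in pages.items():
    for title in get_moby_titles(page_html):
        rows.append({'volume': volume, 'issue': issue, 'title': title})

# Write to an in-memory buffer here; use
# open('moby_titles.csv', 'w', newline='') instead to save to disk.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['volume', 'issue', 'title'])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

If you prefer to stay in pandas, the same list of dictionaries can be passed straight to `pd.DataFrame(rows)`.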

Your turn

Find the email addresses of the University of Oslo Sociology & Human Geography faculty.

Remember to put things in functions as soon as possible.

Does the script work for your department?

[Directory](https://www.sv.uio.no/iss/english/people/aca/?page=1)

### If you know HTML, you can also parse the page.

{:.input_area}
```python
from bs4 import BeautifulSoup
```

{:.input_area}
```python
soup = BeautifulSoup(page_html, "lxml")
```

{:.input_area}
```python
soup.find_all('div', attrs={'class':'art_title'})
```

{:.input_area}
```python
ts = soup.find_all('div', attrs={'class':'art_title'})
for title in ts:
    print(title.contents[0])
```

{:.input_area}
```python
pd.read_excel('data/groups.xlsx')
```

Family name: First name: Group
0 Berg Oddmund 1
1 Blomqvist Niklas 2
2 Dahl Espen Steinung 4
3 Danner Hannah 5
4 Donnally Sandra 3
5 Ericsson Sanna Charlotta 4
6 Gao Wenxin 2
7 Geissinger Andrea 5
8 Getik Demid 6
9 Hammerschmidt Dennis 6
10 Hovdahl Isabel 3
11 Islam Marco 1
12 Jensen Are 4
13 Jesnes Kristin 5
14 Kilman Josefin Mari 2
15 Knutsen Tora Kjærnes 5
16 Knutsson Polina 3
17 Kontareva Alina 1
18 Langørgen Erlend 2
19 Llave Marilex Rea 6
20 Lundstedt Karl Jonas Valter 4
21 Molden Birgitte Hovdan 1
22 Molden Lars Hovdan 2
23 Olme Elisabet 5
24 Oraby Tarek 6
25 Parmer Pernille 3
26 Persarvet Viktor 1
27 Pesl Jan 1
28 Samuelsen Jeanette 4
29 Schultheiss Tobias 2
30 Schwabe Henrik 2
31 Smidt Martin 5
32 Solbakken Simen Sørbøe 3
33 Strøm-Andersen Nhat 3
34 Svennevik Elisabeth M. C. 4
35 Troset Tina Løvsletten 6
36 Wu You Ola 6
37 Yaldiz Nur 1

Big Group Project

There is a file called groups.xlsx in the data folder. It assigns everyone in the class to a group, numbered 1-6. Find your group!

As a group, pick a major newspaper, such as the Washington Post or the Guardian. Your goal is to get all their articles on Brexit. Go!

Work together!!! Be prepared to present your results to the class!

{:.input_area}
```python
pd.concat?
```

{:.input_area}
```python
def load_or_get(volume, issue):
    '''
    Tries to open a Moby issue. If not found, gets it from the internet.
    '''
    file_name = 'moby_%s_%s.html' % (volume, issue)
    url = 'http://mobilizationjournal.org/toc/maiq/%s/%s' % (volume, issue)

    # First, try to find the file stored locally.
    try:
        with open(file_name, 'r', encoding='utf-8') as infile:
            page_html = infile.read()
    # If that didn't work, try getting it from the internet.
    except FileNotFoundError:
        print('Going to the internet to get %s-%s' % (volume, issue))
        page = requests.get(url)
        page_html = page.text

        # Save the file so you only go to the page once. It is polite.
        with open(file_name, 'w', encoding='utf-8') as outfile:
            outfile.write(page_html)

    # Don't forget to send the stuff back.
    return page_html
```

{:.input_area}
```python
page_html = load_or_get(8,1)
```

{:.input_area}
```python
page_html = load_or_get(10,1)
```

{:.input_area}
```python
def scrape_headlines(page_html):
    titles = re.findall('div class="art_title">(.*?)</div', page_html)
    return titles
```

{:.input_area}
```python
volume = 13
for issue in [1,2]:
    page_html = load_or_get(volume, issue)
    print(scrape_headlines(page_html))
```

{:.input_area}
```python
pd.read_html('http://www.espn.com/soccer/commentary?gameId=514949')[2]
```

### Still stuck?

Spoofing a browser

`headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'}`

`r = requests.get(url, headers=headers)`

Keep cookies

`s = requests.Session()`

`s.get('http://httpbin.org/get')`

Authentication in requests

`requests.get('https://api.github.com/user', auth=('user', 'pass'))`

### Still still stuck

2. [scrapy](https://scrapy.org/)
3. Selenium - control the browser.
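Going back to the caching idea behind `load_or_get`: here is a minimal sketch of the same pattern with the download step passed in as a function, so it can be tried without touching the network. The names `cached_fetch` and `fake_download`, the placeholder HTML, and the temporary cache directory are all assumptions made for this example.

```python
import tempfile
from pathlib import Path

def cached_fetch(url, cache_path, download):
    """Return the page at url: read cache_path if it exists, else call download(url)."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        return cache_path.read_text(encoding='utf-8')
    page_html = download(url)
    # Save the page so the site is only contacted once.
    cache_path.write_text(page_html, encoding='utf-8')
    return page_html

# Stand-in for requests.get(url).text; it records each call so we can count them.
calls = []

def fake_download(url):
    calls.append(url)
    return '<div class="art_title">PLACEHOLDER TITLE</div>'

with tempfile.TemporaryDirectory() as cache_dir:
    path = Path(cache_dir) / 'moby_20_1.html'
    url = 'http://mobilizationjournal.org/toc/maiq/20/1'
    first = cached_fetch(url, path, fake_download)   # goes to the "internet"
    second = cached_fetch(url, path, fake_download)  # served from the saved file

print(len(calls))
```

With real pages, something like `lambda u: requests.get(u).text` could be passed as `download`.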