import docx2txt
text = docx2txt.process('data/pandas_wiki.docx')
text
'In\xa0computer programming,\xa0pandas\xa0is a\xa0software library\xa0written for the\xa0Python programming language\xa0for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and\xa0time series. It is\xa0free software\xa0released under the\xa0three-clause BSD license.\xa0The name is derived from the term "panel data", an\xa0econometrics\xa0term for data sets that include observations over multiple time periods for the same individuals.\n\n\n\nLibrary features\n\nDataFrame object for data manipulation with integrated indexing.\n\nTools for reading and writing data between in-memory data structures and different file formats.\n\nData alignment and integrated handling of missing data.\n\nReshaping and pivoting of data sets.\n\nLabel-based slicing, fancy indexing, and subsetting of large data sets.\n\nData structure column insertion and deletion.\n\nGroup by engine allowing split-apply-combine operations on data sets.\n\nData set merging and joining.\n\nHierarchical axis indexing to work with high-dimensional data in a lower-dimensional data structure.\n\nTime series-functionality: Date range generation[3]\xa0and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging.\n\nThe library is highly optimized for performance, with critical code paths written in\xa0Cython\xa0or\xa0C.'
print(text)
In computer programming, pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. It is free software released under the three-clause BSD license. The name is derived from the term "panel data", an econometrics term for data sets that include observations over multiple time periods for the same individuals.
Library features
DataFrame object for data manipulation with integrated indexing.
Tools for reading and writing data between in-memory data structures and different file formats.
Data alignment and integrated handling of missing data.
Reshaping and pivoting of data sets.
Label-based slicing, fancy indexing, and subsetting of large data sets.
Data structure column insertion and deletion.
Group by engine allowing split-apply-combine operations on data sets.
Data set merging and joining.
Hierarchical axis indexing to work with high-dimensional data in a lower-dimensional data structure.
Time series-functionality: Date range generation[3] and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging.
The library is highly optimized for performance, with critical code paths written in Cython or C.
import PyPDF2
pdfFileObj = open('data/l09r01.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
print(pdfReader.numPages)
32
print(pdfReader.getPage(0).extractText())
GE.15
-
21932(E)
*1521932*
Conference of the Parties
Twenty
-
first session
Paris, 30 November
to
11
December 201
5
Agenda item
4
(
b
)
Durban Platform for Enhanced Action (decision 1/CP.17)
Adoption of a protocol, another legal instrument, or an
agreed outcome with legal force under the Convention
applicable to all Parties
ADOPTION OF THE PARIS AGREEMENT
Proposal by
the President
Draft decision
-
/CP.21
The
Conference of the Parties
,
Recalling
decision 1/CP.17 on the establishment of the Ad Hoc Working Group on
the Durban Platform for Enhanced Action,
Also
recalling
Articles 2, 3 and 4 of the Convention,
Further
recalling
relevant
decisions of the Conference of the Parties, including
decisions
1/CP.16,
2/CP.18, 1/CP.19 and 1/CP.20,
Welcoming
the
adoption
of
United Nations General Assembly resolution
A/RES/70/1,
our world: the 2030 Age
nda for Sustaina
, in
particular its goal 13, and the
adoption
of the Addis Ababa Action Agenda of the third
International Conference on Financing for Development
and the adoption of the Sendai
Framework for Disaster Risk Reduction
,
Recognizing
that climate
change represents an urgent and potentially irreversible
threat to human societies and the planet and thus requires the widest possible cooperation
by all countries, and their participation in an effective and appropriate international
response, with a vi
ew to accelerating the reduction of global greenhouse gas
emissions,
Also
r
ecognizing
that
deep reductions
in global emissions will be required in order
to achieve the ultimate objective of the Convention and emphasizing the need for urgency
in
address
ing
climate change,
Acknowledging
that climate change is a common concern of humankind,
Parties
should, when taking action to address climate change, respect
,
promote
and consider
their
respective obligations on human rights,
the right to health, the rights of indigenous peoples,
+
United Nations
FCCC
/CP/201
5
/L.
9
/Rev.1
Distr.: Limited
12
December 2015
Original: English
text = ''
for page_number in range(0,pdfReader.numPages):
text = text + pdfReader.getPage(page_number).extractText()
len(text)
116358
print(text[5135:6135])
uests
Parties to provide
notification of any such provisional
application to the Depositary;
FCCC/CP/2015/L.9
/Rev.1
3
6.
Notes
that the work of the Ad Hoc Working Group on the Durban Platform for
Enhanced Action, in accordance with decision 1/CP.17, paragraph 4, has been completed;
7.
Decides
to establish the Ad Hoc Working Group on the Paris Agreement under the
same arrangement, mutatis mutandis, as those concerning the election of officers to the
Bureau of the
Ad Hoc Working Group on the Durban Platform for Enhanced Action
;
1
8.
Also
decides
that the Ad Hoc Working Group on the Paris Agreement shall prepare
for the entry into force of the Agreement and for the convening of the first session of the
Conference of the Parties serving as the meeting of the Parties to the Paris Agreement;
9.
Furthe
r
decides
to oversee the implementation of the work programme resulting
from the relevant requests contained in this decision;
10.
Requests
t
def extract_page(page_number):
text = pdfReader.getPage(page_number).extractText()
return {'page' : page_number + 1,
'text' : text}
pages = []
for page_number in range(0,pdfReader.numPages):
pages.append(extract_page(page_number))
len(pages)
32
pages[5]
{'page': 6,
'text': 'FCCC/CP/2015/L.9\n/Rev.1\n \n6\n \n \nand the excha\nnge of information, experiences, and best practices amongst Parties to raise \ntheir resilience to these impacts\n;\n*\n \n36.\n \nInvites\n \nParties to communicate, by 2020, to the secretariat mid\n-\ncentury, long\n-\nterm \nlow greenhouse gas emission development strategies in ac\ncordance with Article 4, \nparagraph 19, of the Agreement, and \nrequests\n \nthe secretariat to publish on the UNFCCC \n\n \n37.\n \nRequests\n \nthe Subsidiary Body for Scientific and Technolo\ngical Advice to develop \nand recommend the guidance referred to under Article 6, paragraph 2, of the Agreement for \nadoption by the Conference of the Parties serving as the meeting of the Parties to the Paris \nAgreement at its first session, including guidanc\ne to ensure that double counting is avoided \non the basis of a corresponding adjustment by Parties for \nboth \nanthropogenic emissions by \nsources and removals by sinks covered by their nationally determined contributions under \nthe Agreement;\n \n38.\n \nRecommends \ntha\nt the \nConference of the Parties serving as the meeting of the Parties \nto the Paris Agreement\n \nadopt rules, modalities and procedures for the mechanism \nestablished by Article 6, paragraph 4, of the Agreement on the basis of: \n \n(a)\n \nVoluntary participation auth\norized by each Party involved;\n \n(b)\n \nReal, measurable, and long\n-\nterm benefits related to the mitigation of climate \nchange;\n \n(c)\n \nSpecific scopes of activities; \n \n(d)\n \nReductions in emissions that are additional to any that would otherwise \noccur;\n \n(e)\n \nVerification\n \nand certification of emission reductions resulting from \nmitigation activities by designated operational entities;\n \n(f)\n \nExperience gained with and lessons learned from existing mechanisms and \napproaches adopted under the Convention and its related legal ins\ntruments;\n \n39.\n \nRequests \nthe Subsidiary Body for Scientific and Technological Advice to develop \nand recommend rules, modalities and procedures for the mechanism referred to in \nparagraph 38 above for consideration and adoption by the \nConference of the Parties\n \nserving \nas the meeting of the Parties to the Paris Agreement\n \nat its first session;\n \n40.\n \nAlso\n \nr\nequests\n \nthe Subsidiary Body for Scientific and Technological Advice to \nundertake a work programme under the framework for non\n-\nmarket approaches to \nsustainable development referred to in Article 6, paragraph 8, of the Agreement, with the \nobjective of considering h\now to enhance linkages and create synergy between, inter alia, \nmitigation, adaptation, finance, t\nechnology transfer and capacity\n-\nbuilding, and how to \nfacilitate the implementation and coordination of non\n-\nmarket approaches;\n \n41.\n \nFurther requests\n \nthe Subsidia\nry Body for Scientific and Technological Advice to \nrecommend a draft decision on the work programme referred to in paragraph 40 above, \ntaking into account the views of Parties, for consideration and adoption by the \nConference \nof the Parties serving as the \nmeeting of the Parties to the Paris Agreement\n \nat its first \nsession;\n \nA\nDAPTATION\n \n \n \n \n \n*\n \n \nParagraph 35 has been deleted, and subsequent paragraph numbering and cross references to other \nparagraphs within the document will be amended at a later stage.\n \n'}
import pandas as pd
df = pd.DataFrame(pages)
df.head()
page | text | |
---|---|---|
0 | 1 | \nGE.15\n-\n21932(E)\n \n*1521932*\n \n \n \n... |
1 | 2 | FCCC/CP/2015/L.9\n/Rev.1\n \n2\n \n \nlocal co... |
2 | 3 | FCCC/CP/2015/L.9\n/Rev.1\n \n \n3\n \n6.\n \nN... |
3 | 4 | FCCC/CP/2015/L.9\n/Rev.1\n \n4\n \n \n\n-\nind... |
4 | 5 | FCCC/CP/2015/L.9\n/Rev.1\n \n \n5\n \nemission... |
df.to_csv('data/paris_accord.csv', index=False)
Paris accord looks like this