Merge pull request #93 from kermitt2/line-number
Line number and finalize 0.3
kermitt2 authored Aug 22, 2020
2 parents 9cde96a + 12a6488 commit 3216284
Showing 8 changed files with 837 additions and 359 deletions.
1 change: 0 additions & 1 deletion .gitmodules
@@ -1,4 +1,3 @@
  [submodule "xpdf-4.00"]
  path = xpdf-4.00
  url = https://github.com/kermitt2/xpdf-4.00
- branch = nonumericchanamesmapping
52 changes: 29 additions & 23 deletions Readme.md
@@ -21,29 +21,29 @@ The latest stable version is *0.2*. Working version (master) is *0.3*.
  General usage is as follow:

  ```
- pdfalto [options] <PDF-file> [<xml-file>]
- -f <int> : first page to convert
- -l <int> : last page to convert
- -verbose : display pdf attributes
- -noText : do not extract textual objects
- -noImage : do not extract Images (Bitmap and Vectorial)
- -noImageInline : do not include images inline in the stream
- -outline : create an outline file xml (i.e. a table of content) as additional file
- -annotation : create an annotations file xml as additional file
- -blocks : add blocks informations whithin the structure
- -readingOrder : blocks follow the reading order
- -fullFontName : fonts names are not normalized
- -nsURI <string> : add the specified namespace URI
- -opw <string> : owner password (for encrypted files)
- -upw <string> : user password (for encrypted files)
- -filesLimit <int> : limit of asset files be extracted to the value specified
- -q : don't print any messages or errors
- -v : print version info
- -h : print usage information
- -help : print usage information
- --help : print usage information
- -? : print usage information
- --saveconf <string> : save all command line parameters in the specified XML <file>
+ Usage: pdfalto [options] <PDF-file> [<xml-file>]
+ -f <int> : first page to convert
+ -l <int> : last page to convert
+ -verbose : display pdf attributes
+ -noImage : do not extract Images (Bitmap and Vectorial)
+ -noImageInline : do not include images inline in the stream
+ -outline : create an outline file xml
+ -annotation : create an annotations file xml
+ -noLineNumbers : do not output line numbers added in manuscript-style textual documents
+ -readingOrder : blocks follow the reading order
+ -noText : do not extract textual objects (might be useful, but non-valid ALTO)
+ -charReadingOrderAttr : include TYPE attribute to String elements to indicate right-to-left reading order (might be useful, but non-valid ALTO)
+ -fullFontName : fonts names are not normalized
+ -nsURI <string> : add the specified namespace URI
+ -opw <string> : owner password (for encrypted files)
+ -upw <string> : user password (for encrypted files)
+ -filesLimit <int> : limit of asset files be extracted
+ -q : don't print any messages or errors
+ -v : print version info
+ -h : print usage information
+ -help : print usage information
+ --help : print usage information
+ -? : print usage information
  ```

  In addition to the [ALTO](https://github.com/altoxml/documentation/wiki) file describing the PDF content, the following files are generated:
@@ -93,6 +93,12 @@ The executable `pdfalto` is generated in the root directory. Additionally, this

  # Changes

+ New in version 0.3 (apart various bug fixes):
+
+ - line number detection: line numbers (typically added for review in manuscripts/preprints) are specifically identified and not anymore mixed with the rest of text content, they will be grouped in a separate block or, optionally, not outputted in the ALTO file (`noLineNumbers` option)
+
+ - removal of `-blocks` option, the block information are always returned for ensuring ALTO validation (`<TextBlock>` element)
+
  New in version 0.2 (apart various bug fixes):

  - support Unicode composition of characters
4 changes: 3 additions & 1 deletion install_deps.sh
@@ -18,7 +18,9 @@ DEP_INSTALL_DIR=install

  LIBXML_URI=http://xmlsoft.org/sources/libxml2-2.9.8.tar.gz
  FREETYPE_URI=https://download.savannah.gnu.org/releases/freetype/freetype-2.9.tar.gz
- ICU_URI=http://download.icu-project.org/files/icu4c/62.1/icu4c-62_1-src.tgz
+ #ICU_URI=http://download.icu-project.org/files/icu4c/62.1/icu4c-62_1-src.tgz
+ ICU_URI=https://github.com/unicode-org/icu/releases/download/release-62-2/icu4c-62_2-src.tgz
+ #ICU_URI=https://github.com/unicode-org/icu/releases/download/release-66-1/icu4c-66_1-src.tgz

  mkdir -p $DEP_INSTALL_DIR

31 changes: 8 additions & 23 deletions src/Parameters.cc
@@ -34,11 +34,11 @@ void Parameters::setDisplayText(GBool text) {
  unlockGlobalParams;
  }

- void Parameters::setDisplayBlocks(GBool block) {
+ /*void Parameters::setDisplayBlocks(GBool block) {
  lockGlobalParams;
  displayBlocks = block;
  unlockGlobalParams;
- }
+ }*/

  void Parameters::setDisplayOutline(GBool outl) {
  lockGlobalParams;
@@ -83,6 +83,12 @@ void Parameters::setOcr(GBool ocrA) {
  unlockGlobalParams;
  }

+ void Parameters::setNoLineNumbers(GBool noLineNumberAttrs) {
+ lockGlobalParams;
+ noLineNumbers = noLineNumberAttrs;
+ unlockGlobalParams;
+ }
+
  void Parameters::saveToXML(const char *fileName,int firstPage,int lastPage){
  char* tmp;
  tmp=(char*)malloc(10*sizeof(char));
@@ -109,27 +115,6 @@ void Parameters::saveToXML(const char *fileName,int firstPage,int lastPage){
  xmlAddChild(tool,version);
  xmlAddChild(tool,desc);

- // * -f <int> : first page to convert<br/>
- // * -l <int> : last page to convert<br/>
- // * -verbose : display pdf attributes<br/>
- // * -noText : do not extract textual objects<br/>
- // * -noImage : do not extract images (Bitmap and Vectorial)<br/>
- // * -noImageInline : do not include images inline in the stream<br/>
- // * -outline : create an outline file xml<br/>
- // * -annots : create an annotaitons file xml<br/>
- // * -cutPages : cut all pages in separately files<br/>
- // * -blocks : add blocks informations whithin the structure<br/>
- // * -readingOrder : blocks follow the reading order<br/>
- // * -fullFontName : fonts names are not normalized<br/>
- // * -nsURI : add the specified namespace URI<br/>
- // * -q : don't print any messages or errors<br/>
- // * -v : print copyright and version information<br/>
- // * -h : print usage information<br/>
- // * -help : print usage information<br/>
- // * --help : print usage information<br/>
- // * -? : print usage information<br/>
-
-
  param = xmlNewNode(NULL,(const xmlChar*)TAG_PAR_PARAM);
  xmlNewProp(param,(const xmlChar*)"name",(const xmlChar*)"first page");
  xmlNewProp(param,(const xmlChar*)"form",(const xmlChar*)"-f");
23 changes: 19 additions & 4 deletions src/Parameters.h
@@ -31,6 +31,8 @@ class Parameters {
  /** Destructor */
  ~Parameters();

+ // getters
+
  /** Return a boolean which inform if the text is displayed
  * @return <code>true</code> if the toText option is selected, <code>false</code> otherwise
  */
@@ -39,7 +41,7 @@ class Parameters {
  /** Return a boolean which inform if blocks informations are diplayed
  * @return <code>true</code> if the blocks option is selected, <code>false</code> otherwise
  */
- GBool getDisplayBlocks() { return displayBlocks;};
+ //GBool getDisplayBlocks() { return displayBlocks;};

  /** Return a boolean which inform if the images are displayed
  * @return <code>true</code> if the noImage option is not selected, <code>false</code> otherwise
@@ -88,6 +90,13 @@ class Parameters {
  */
  int getFilesCountLimit() {return filesCountLimit;}

+ /** Return a boolean which inform if line numbers tokens are diplayed
+ * @return <code>true</code> if the noLineNumbers option is selected, <code>false</code> otherwise
+ */
+ GBool getNoLineNumbers() { return noLineNumbers;};
+
+ // setters
+
  /** Modify the boolean which inform if the images are displayed
  * @param noImage <code>true</code> if the noImage option is not selected, <code>false</code> otherwise
  */
@@ -101,7 +110,7 @@ class Parameters {
  /** Modify the boolean which inform if blocks informations are diplayed
  * @param noblock <code>true</code> if the blocks option is selected, <code>false</code> otherwise
  */
- void setDisplayBlocks(GBool noblock);
+ //void setDisplayBlocks(GBool noblock);

  /** Modify the boolean which inform if the bookmark is displayed
  * @param outline <code>true</code> if the outline option is selected, <code>false</code> otherwise
@@ -140,6 +149,11 @@ class Parameters {
  void setOcr(GBool ocrA);

  void setFilesCountLimit(int count);
+
+ /** Modify the boolean which inform if line numbers must be diplayed
+ * @param noLineNumberAttrs <code>true</code> if the noLineNumbers option is selected, <code>false</code> otherwise
+ */
+ void setNoLineNumbers(GBool noLineNumberAttrs);

  void saveToXML(const char *fileName,int firstPage,int lastPage);

@@ -150,7 +164,7 @@ class Parameters {
  /** The value of the noText option */
  GBool displayText;
  /** The value of the blocks option */
- GBool displayBlocks;
+ //GBool displayBlocks;
  /** The value of the outline option */
  GBool displayOutline;
  /** The value of the cutPages option */
@@ -167,7 +181,8 @@ class Parameters {
  GBool ocr;
  /** the count limit of files */
  int filesCountLimit;
-
+ /** PL: the value of the noLineNumbers option*/
+ GBool noLineNumbers;
  };

  #endif /*PARAMETERS_H_*/
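
The `noLineNumbers` accessors added in `src/Parameters.cc` and `src/Parameters.h` follow the same lock, assign, unlock shape as the other `Parameters` setters. The block below is a minimal, self-contained sketch of that pattern, not the project's code: `ParametersSketch`, `shouldOutputLineNumbers`, and `exampleUsage` are hypothetical names, `std::mutex` stands in for the `lockGlobalParams`/`unlockGlobalParams` macros, `bool` stands in for `GBool`, and the `false` default is an assumption. Unlike the getter in the diff, the sketch also takes the lock when reading.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical stand-in for the Parameters class touched by this commit.
// Assumptions: std::mutex replaces the lockGlobalParams/unlockGlobalParams
// macros, and bool replaces xpdf's GBool.
class ParametersSketch {
public:
    // Returns true when the noLineNumbers option is selected.
    bool getNoLineNumbers() const {
        std::lock_guard<std::mutex> guard(mutex_);
        return noLineNumbers_;
    }

    // Lock, assign, unlock: the guard releases the mutex on scope exit.
    void setNoLineNumbers(bool noLineNumbers) {
        std::lock_guard<std::mutex> guard(mutex_);
        noLineNumbers_ = noLineNumbers;
    }

private:
    mutable std::mutex mutex_;
    bool noLineNumbers_ = false;  // assumed default: line numbers are kept
};

// Hypothetical helper: line numbers are output unless the option is set.
inline bool shouldOutputLineNumbers(const ParametersSketch &params) {
    return !params.getNoLineNumbers();
}

// Usage example: toggling the option flips the helper's answer.
inline void exampleUsage() {
    ParametersSketch params;
    assert(shouldOutputLineNumbers(params));
    params.setNoLineNumbers(true);
    assert(!shouldOutputLineNumbers(params));
}
```

Using a scoped guard instead of paired lock/unlock calls keeps the mutex from staying held if a statement between them is later changed to return early or throw.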