GROBID – PDF into structured XML/TEI

GROBID – GeneRation Of BIbliographic Data.

GROBID is a machine learning library for extracting, parsing, and restructuring raw documents such as PDFs into structured XML/TEI-encoded documents, with a particular focus on technical and scientific publications.
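
To give a sense of what working with XML/TEI output can look like, here is a minimal sketch in Python that reads a title and author names from a TEI header. The sample document is hand-written and heavily simplified for illustration; it is not actual GROBID output, and the helper name `read_header` is hypothetical. Only the TEI namespace and the standard-library `xml.etree.ElementTree` calls are taken as given.

```python
import xml.etree.ElementTree as ET

# The standard TEI namespace; every element in a TEI document lives in it.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hand-written, simplified sample (an assumption for illustration only).
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title level="a" type="main">An Example Article</title></titleStmt>
      <sourceDesc>
        <biblStruct>
          <analytic>
            <author><persName><forename type="first">Ada</forename><surname>Lovelace</surname></persName></author>
            <author><persName><forename type="first">Alan</forename><surname>Turing</surname></persName></author>
          </analytic>
        </biblStruct>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
</TEI>"""


def read_header(tei_xml):
    """Return the title and author names found in a TEI header string."""
    root = ET.fromstring(tei_xml)
    title = root.findtext(".//tei:titleStmt/tei:title", default="", namespaces=TEI_NS)
    authors = []
    for pers in root.iterfind(".//tei:analytic/tei:author/tei:persName", TEI_NS):
        # Join forename and surname, skipping whichever part is missing.
        parts = [
            pers.findtext("tei:forename", default="", namespaces=TEI_NS),
            pers.findtext("tei:surname", default="", namespaces=TEI_NS),
        ]
        authors.append(" ".join(p for p in parts if p))
    return {"title": title.strip(), "authors": authors}


if __name__ == "__main__":
    print(read_header(SAMPLE_TEI))
```

Real output is considerably richer (affiliations, abstract, keywords, and so on), so the paths used here would need adjusting to the elements actually present.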

The following functionalities are available:

  • Header extraction and parsing from articles in PDF format. The extraction covers the bibliographical information (title, abstract, authors, affiliations, keywords, etc.).
  • Reference extraction and parsing from articles in PDF format. References in footnotes are supported, although this is still a work in progress. Such references are rare in technical and scientific articles but frequent in publications in the humanities and social sciences.
  • Parsing of references in isolation.
  • Extraction of patent and non-patent references in patent publications.
  • Parsing of names, in particular author names in headers and author names in references (two distinct models).
  • Parsing of affiliation and address blocks.
  • Parsing of dates.
  • Full text extraction from PDF articles, including a model for the overall document segmentation and a model for the structuring of the text body.
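
GROBID can also be run as a web service, and several of the functionalities above correspond to service endpoints. The sketch below, in Python, maps task names to endpoint paths and prepares (without sending) a request for date parsing. The endpoint paths and the `date` form field follow the GROBID service documentation as understood at the time of writing and should be checked against the installed version; the base URL `http://localhost:8070/`, the task names, and the helpers `endpoint_url` and `build_date_request` are assumptions made for illustration.

```python
from urllib.parse import urlencode, urljoin
from urllib.request import Request

# Assumed endpoint paths; verify them against the documentation
# of the GROBID version actually in use.
ENDPOINTS = {
    "header": "api/processHeaderDocument",
    "references": "api/processReferences",
    "citation": "api/processCitation",
    "header_names": "api/processHeaderNames",
    "citation_names": "api/processCitationNames",
    "affiliations": "api/processAffiliations",
    "date": "api/processDate",
    "fulltext": "api/processFulltextDocument",
}

# Hypothetical local server address (an assumption, not a fixed value).
DEFAULT_BASE_URL = "http://localhost:8070/"


def endpoint_url(task, base_url=DEFAULT_BASE_URL):
    """Return the full service URL for one of the tasks listed above."""
    if task not in ENDPOINTS:
        raise ValueError(f"unknown task: {task!r}")
    return urljoin(base_url, ENDPOINTS[task])


def build_date_request(raw_date, base_url=DEFAULT_BASE_URL):
    """Prepare a form-encoded POST request for date parsing (not sent here)."""
    body = urlencode({"date": raw_date}).encode("utf-8")
    return Request(endpoint_url("date", base_url), data=body, method="POST")


if __name__ == "__main__":
    req = build_date_request("September 16th 2001")
    print(req.get_method(), req.full_url)
```

Sending the prepared request with `urllib.request.urlopen(req)` requires a running GROBID server at the chosen address; PDF-based tasks such as header or full text extraction would instead need a multipart file upload, which this sketch leaves out.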

https://github.com/kermitt2/grobid

https://grobid.readthedocs.io/en/latest/

http://cloud.science-miner.com/grobid/

Cite this entry: "GROBID – PDF into structured XML/TEI," in Open Access Resources, September 19, 2019, https://oaresources.xyz/grobid-pdf-into-structured-xml-tei/