nesa

This project consists of a series of scripts that extract data from the Network Rail National Electronic Sectional Appendix (NESA) into a series of Route Clearance reports using PDF text extraction.

The downloadable NESA data is available here and consists of a set of route PDF files with spreadsheet and embedded TIFF image files.

Unformatted text            Per page Route Clearance TSV  Route Clearance XLSX Report
Anglia Route                Anglia Route                  Anglia Route
Kent, Sussex and Wessex     Kent, Sussex and Wessex       Kent, Sussex and Wessex
London North-Eastern        London North-Eastern          London North-Eastern
London North-Western North  London North-Western North    London North-Western North
London North-Western South  London North-Western South    London North-Western South
Scotland                    Scotland                      Scotland
Western                     Western                       Western
  • Note: South Wales data is now in Western, and North Wales data is in London North-Western North. Kent, Sussex and Wessex data is now back in the Kent-Sussex-Wessex directory.

Data Source

The PDF files for these seven routes are available here

Prerequisites

  • jq is a lightweight and flexible command-line JSON processor. On a Debian or similar apt-based Linux system:

    $ sudo apt install jq

  • poppler-utils package to decompress, extract text from and render PDF files, based on the xpdf-3.0 code base

    $ sudo apt install poppler-utils

  • ghostscript package to interpret and manipulate PostScript and PDF files

    $ sudo apt install ghostscript

python dependencies

  • Python 3.9 to run the scripts. Tested against Python 3.7, 3.8 and 3.9
  • Python pandas data processing library
  • Python pdfplumber table and visual debugging PDF data extract library
  • Python pdfminer.six PDF information extraction library
  • Python openpyxl library to write Excel 2010 xlsx files

python virtualenv package

For ease of use, manage the Python package dependencies with a local virtual environment venv using the Python virtualenv package:

$ sudo apt install virtualenv
$ virtualenv venv
$ source venv/bin/activate
$ pip install -r requirements.txt
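Before running the scripts it can help to confirm that the prerequisites are visible from the active environment. The following is a minimal sketch rather than part of the project: the helper names are hypothetical, and it assumes that poppler-utils provides the pdftotext command, that ghostscript provides gs, and that pdfminer.six is imported as pdfminer.

```python
import importlib.util
import shutil

# Python packages listed under "python dependencies" (import names).
PYTHON_MODULES = ["pandas", "pdfplumber", "pdfminer", "openpyxl"]

# Command-line tools from the prerequisites. Assumption: pdftotext comes
# from poppler-utils and gs from ghostscript.
COMMANDS = ["jq", "pdftotext", "gs"]


def missing_modules(names):
    """Return the module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def missing_commands(names):
    """Return the command names that are not on the PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    print("missing Python modules:", missing_modules(PYTHON_MODULES))
    print("missing commands:", missing_commands(COMMANDS))
```

Two empty lists suggest the environment is ready. Anything listed needs installing with apt or pip as described above.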

Creating the Route Clearance reports

The reports for the routes are created as follows:

Download the data

Download the seven route Sectional Appendix PDF files into the download directory from here

Process the PDF files

To extract the data execute the run.sh script:

$ ./run.sh

This executes a series of scripts to segment, extract and output the data, creating a series of TSV files and Excel spreadsheets in the seven route directories.
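Once run.sh has finished, the TSV output can be inspected with a few lines of Python. This is a minimal sketch under stated assumptions: the output file names are not documented here, so it simply picks up every *.tsv file in each route directory, and it assumes the first row of each file is a header. The function names are hypothetical.

```python
import csv
from pathlib import Path

# The seven route directories in the repository.
ROUTES = [
    "Anglia",
    "Kent-Sussex-Wessex",
    "London-North-Eastern",
    "London-North-Western-North",
    "London-North-Western-South",
    "Scotland",
    "Western",
]


def read_tsv(path):
    """Read one tab-separated file into a list of dicts keyed by header."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def collect_reports(base="."):
    """Map each TSV file path under the route directories to its rows."""
    reports = {}
    for route in ROUTES:
        route_dir = Path(base, route)
        if not route_dir.is_dir():
            continue  # skip routes that have not been generated yet
        for path in sorted(route_dir.glob("*.tsv")):
            reports[str(path)] = read_tsv(path)
    return reports


if __name__ == "__main__":
    for name, rows in collect_reports().items():
        print(name, len(rows), "rows")
```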

How it works

The scripts extract text from the PDF text-object elements. However, formatting issues and the use of a grey-scale background in a number of the key route-clearance tables break pdfplumber and pdfminer formatted text extraction.

To overcome this, the PDF files are converted to an uncompressed CMYK PDF/A format, and the grey background is removed by deleting the call and graphics state for the embedded grey background image. Although this seems to work, it is in no way a recommended approach.

It creates broken PDF files, as the internal PDF checksums no longer match. It assumes the background grey colour is encoded as 0.8081 g or 1 1 0 rg and rendered using the call to f*. Were the PDF rendering software used by Network Rail, Ghostscript, or qpdf to change, this would just break. YMMV
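The idea can be illustrated on a fragment of uncompressed content stream. This is a simplified sketch and not the code used by the scripts: instead of deleting the colour setting and fill call, it swaps the f* fill operator for n (end the path without painting) whenever the fill follows one of the two colour settings named above. The function name, the regular expression and the sample stream are all assumptions made for illustration, and only simple paths built from re, m, l and h are handled.

```python
import re

# A grey colour setting (values taken from the text above), followed by
# simple path construction operators, followed by the even-odd fill f*.
GREY_FILL = re.compile(
    r"(?P<keep>(?<![\d.])(?:0\.8081 g|1 1 0 rg)\s+"
    r"(?:[-\d.\s]*(?:re|m|l|h)\s+)+?)"
    r"f\*"
)


def strip_grey_fill(content):
    """Replace grey f* fills with n so the path is no longer painted."""
    return GREY_FILL.sub(r"\g<keep>n", content)


if __name__ == "__main__":
    # Hypothetical stream: one grey rectangle, then one black rectangle.
    sample = (
        "q\n0.8081 g\n10 20 100 50 re\nf*\nQ\n"
        "0 g\n10 20 100 50 re\nf*\n"
    )
    print(strip_grey_fill(sample))
```

In the sample only the grey rectangle loses its fill and the black one is left untouched. As with the approach described above, editing a stream in place changes its length, so the result is not a well-formed PDF without further repair.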

License

Network Rail are the copyright holder and retain all intellectual property rights related to the data and derived data contained within the National Electronic Sectional Appendix as set out here

The scripts and other material are provided under the terms set out in the LICENSE

Acknowledgement

The authors would like to thank Network Rail for providing this data, and all the contributors to the tools and libraries used.