Reading and parsing Excel Spreadsheet XML files with Java

Gustavo Santos — Wed, 04 Mar 2015 17:29:31 GMT

On one of our projects, we were asked to implement a process that updates a database with values from an Excel file on a daily basis. The file is located at https://github.com/jbaysolutions/excel-xml-reader/raw/master/ActiveSubstance.xls. The problem with this file is its format. It's an Excel spreadsheet XML.

Usually we parse excel files using Apache POI, but it has no support for Excel spreadsheet XML. The solution was to create our own parser.

For this specific case, we will cover the process of obtaining the substance name (column A), ADI value (column P) and ARfD value (column S).

You can get the sample project used in this post at GitHub.

Downloading the file

This class was to be used by a timer that runs once a day. Starts by downloading the file:

File file = File.createTempFile("substances", "tmp");

String excelFileUrl = "http://ec.europa.eu/sanco_pesticides/public/?event=activesubstance.exportList";
URL url = new URL(excelFileUrl);
System.out.println("downloading file from " + excelFileUrl + " ...");
FileUtils.copyURLToFile(url, file);
System.out.println("downloading finished, parsing...");

Parsing

Since the file format is XML, it can be parsed using SAX.

Since this file has only one sheet, we are keeping the parser simple, so we will only need to find the sheet rows and cells. We declare a class to represent the row:

class XmlRow {
    ArrayList cellList = new ArrayList<>();

    @Override
    public String toString() {
        return cellList.toString();
    }
}

And a SAX handler to load the rows and cell ("Data" in the XML) contents:

class SAXHandler extends DefaultHandler {

    List xmlRowList = new ArrayList<>();
    XmlRow xmlRow = null;
    String content = null;

    @Override
    //Triggered when the start of tag is found.
    public void startElement(String uri, String localName, String qName, Attributes attributes)
            throws SAXException {
        switch (qName) {
            //Create a new Row object when the start tag is found
            case "Row":
                xmlRow = new XmlRow();
                break;
        }
    }

    @Override
    public void endElement(String uri, String localName,
                           String qName) throws SAXException {
        switch (qName) {
            case "Row":
                xmlRowList.add(xmlRow);
                break;
            case "Data":
                xmlRow.cellList.add(content);
                break;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        content = String.copyValueOf(ch, start, length).trim();
    }
}

Once the file is downloaded, it's loaded to a string to be supplied to the SAX Parser:

String fileContent = IOUtils.toString(new FileInputStream(file));

SAXParserFactory parserFactor = SAXParserFactory.newInstance();
SAXParser parser = parserFactor.newSAXParser();
SAXHandler handler = new SAXHandler();

ByteArrayInputStream bis = new ByteArrayInputStream(fileContent.getBytes());

parser.parse(bis, handler);

When we try and run this the first time, and exception occurs:

Exception in thread "main" org.xml.sax.SAXParseException; lineNumber: 2; columnNumber: 14; The processing instruction target matching "[xX][mM][lL]" is not allowed.

This happens because some blank space or other visible content exists before the declaration. For this file, we have to remove the first 2 lines, so it can be parsed:


removeLineFromFile(file.getAbsolutePath(), 1, 2);

String fileContent = IOUtils.toString(new FileInputStream(file));

SAXParserFactory parserFactor = SAXParserFactory.newInstance();
SAXParser parser = parserFactor.newSAXParser();
SAXHandler handler = new SAXHandler();

ByteArrayInputStream bis = new ByteArrayInputStream(fileContent.getBytes());

parser.parse(bis, handler);

The removeLineFromFile() method can be found on the sample project source.

When we try to parse it again, another exception is thrown:

Exception in thread "main" org.xml.sax.SAXParseException; lineNumber: 1313; columnNumber: 70; The entity "nbsp" was referenced, but not declared.

This happens because the entity isn't one of XML's predefined entities. To fix it, we have to prepend this to the file before parsing it:

]>

The ´ entity was also found on the file, so we added it as well.

There was also the text "JECFA&ECCO" somewhere in a cell which would cause the parser to throw another exception.

The code now looks like this:

removeLineFromFile(file.getAbsolutePath(), 1, 2);

String fileContent = IOUtils.toString(new FileInputStream(file));
fileContent = fileContent.replaceAll("&ECCO", "&ECCO");
fileContent = "\n" +
        " \n" +
        " \n" +
        "]>" + fileContent;


SAXParserFactory parserFactor = SAXParserFactory.newInstance();
SAXParser parser = parserFactor.newSAXParser();
SAXHandler handler = new SAXHandler();

ByteArrayInputStream bis = new ByteArrayInputStream(fileContent.getBytes());

parser.parse(bis, handler);

We now have a row list, where each row has a cell list. We are going to use this to get our values:

//Parsing the substances list obtained from XML
int count = 0;
for (XmlRow subsRow : handler.xmlRowList) {
    if (subsRow.cellList.size() > 19) {
        String substance = subsRow.cellList.get(0);
        if (substance.equals("Substance")) {
            continue;
        }

        count++;
        String adi = subsRow.cellList.get(15); // column P
        String arfd = subsRow.cellList.get(18); // column S

        System.out.println("Substance name='" + substance + "', ADI='" + adi + "', ARfD='" + arfd + "'");
    }
}

In this example we only care for the substance rows, so we ignore any row that has less than 19 cells. On our project class we use these values to update a DB table, but for simplicity here we are only printing them.

If our goal was to create a regular Excel file, we could change the code to:

Workbook workbook = new HSSFWorkbook();
Sheet sheet = workbook.createSheet();

//Converts all rows to POI rows 
int rowCount = 0;
for (XmlRow subsRow : handler.xmlRowList) {
    Row row = sheet.createRow(rowCount);
    int cellCount = 0;
    for (String cellValue : subsRow.cellList) {
        Cell cell = row.createCell(cellCount);
        cell.setCellValue(cellValue);
        cellCount++;
    }
    rowCount++;
}

String fileOutPath = "/tmp/fileOut.xls";
FileOutputStream fout = new FileOutputStream(fileOutPath);
workbook.write(fout);
workbook.close();
fout.close();

As you can see, it can be easily modified depending on the purpose!

Finally, we delete the downloaded file:

if (file.exists()) {
    System.out.println("delete file-> " + file.getAbsolutePath());
    if (!file.delete()) {
        System.out.println("file '" + file.getAbsolutePath() + "' was not deleted!");
    }
}
//System.out.println(result);
System.out.println("getAndParseFile finished, processed " + count + " substances!");

Conclusion

For the purpose of reading data from an Excel spreadsheet XML file, it's relatively simple to do it.

The sample project used in this post at GitHub has two main classes. ExcelXmlReader to download, parse the file and print the values to the console and the ExcelXmlConverter class, to download, parse the file and converts it to OOXML file format.

The code can be modified to read multiple sheets or even the styles if necessary. It just wasn't needed for our use case.

Hope it helped anyone having the same issues as us!

References

Excel spreadsheet XML

SAX

Apache POI

excel 2003 - JBay Solutions - The Dev Blog

Reading and parsing Excel Spreadsheet XML files with Java

Downloading the file

Parsing

Conclusion

References