<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[excel 2003 - JBay Solutions - The Dev Blog]]></title><description><![CDATA[JBay Solutions Development Blog on Java, Android, Play2 and others]]></description><link>http://blog.jbaysolutions.com/</link><generator>Ghost 0.7</generator><lastBuildDate>Wed, 16 Oct 2024 01:15:05 GMT</lastBuildDate><atom:link href="http://blog.jbaysolutions.com/tag/excel-2003/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[Reading and parsing Excel Spreadsheet XML files with Java]]></title><description><![CDATA[<p>On one of our projects, we were asked to implement a process that updates a database with values from an Excel file on a daily basis. The file is located at <a href="https://github.com/jbaysolutions/excel-xml-reader/raw/master/ActiveSubstance.xls">https://github.com/jbaysolutions/excel-xml-reader/raw/master/ActiveSubstance.xls</a>. The problem with this file is its format. It's an</p>]]></description><link>http://blog.jbaysolutions.com/2015/03/04/parsing-excel-spreadsheet-xml/</link><guid isPermaLink="false">2c3dd67a-5ff8-48b5-8ccb-24c0f2c8bf39</guid><category><![CDATA[java]]></category><category><![CDATA[apache poi]]></category><category><![CDATA[excel]]></category><category><![CDATA[excel spreadsheet xml]]></category><category><![CDATA[excel 2003]]></category><dc:creator><![CDATA[Gustavo Santos]]></dc:creator><pubDate>Wed, 04 Mar 2015 17:29:31 GMT</pubDate><content:encoded><![CDATA[<p>On one of our projects, we were asked to implement a process that updates a database with values from an Excel file on a daily basis. The file is located at <a href="https://github.com/jbaysolutions/excel-xml-reader/raw/master/ActiveSubstance.xls">https://github.com/jbaysolutions/excel-xml-reader/raw/master/ActiveSubstance.xls</a>. The problem with this file is its format. It's an <a href="http://en.wikipedia.org/wiki/Microsoft_Office_XML_formats#Excel_XML_Spreadsheet_example">Excel spreadsheet XML</a>.</p>

<p>Usually we parse excel files using <a href="http://poi.apache.org/" title="Apache POI">Apache POI</a>, but it has no support for Excel spreadsheet XML. The solution was to create our own parser.</p>

<p>For this specific case, we will cover the process of obtaining the substance name (column A), ADI value (column P) and ARfD value (column S).</p>

<p>You can get the sample project used in this post at <a href="https://github.com/jbaysolutions/excel-xml-reader">GitHub</a>.</p>

<script async src="//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>  
<!-- Horizontal For Posts - Text Only -->  
<ins class="adsbygoogle" style="display:inline-block;width:728px;height:90px" data-ad-client="ca-pub-1311169549359552" data-ad-slot="3316155422"></ins>
<script>  
(adsbygoogle = window.adsbygoogle || []).push({});
</script>  

<p><br></p>

<h2 id="downloadingthefile">Downloading the file</h2>

<p>This class was to be used by a timer that runs once a day. Starts by downloading the file:</p>

<pre><code>File file = File.createTempFile("substances", "tmp");

String excelFileUrl = "http://ec.europa.eu/sanco_pesticides/public/?event=activesubstance.exportList";
URL url = new URL(excelFileUrl);
System.out.println("downloading file from " + excelFileUrl + " ...");
FileUtils.copyURLToFile(url, file);
System.out.println("downloading finished, parsing...");
</code></pre>

<script async src="//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>  
<!-- Horizontal For Posts - Text Only -->  
<ins class="adsbygoogle" style="display:inline-block;width:728px;height:90px" data-ad-client="ca-pub-1311169549359552" data-ad-slot="3316155422"></ins>
<script>  
(adsbygoogle = window.adsbygoogle || []).push({});
</script>  

<p><br></p>

<h2 id="parsing">Parsing</h2>

<p>Since the file format is XML, it can be parsed using <a href="http://en.wikipedia.org/wiki/Simple_API_for_XML">SAX</a>.</p>

<p>Since this file has only one sheet, we are keeping the parser simple, so we will only need to find the sheet rows and cells. We declare a class to represent the row:</p>

<pre><code>class XmlRow {
    ArrayList&lt;String&gt; cellList = new ArrayList&lt;&gt;();

    @Override
    public String toString() {
        return cellList.toString();
    }
}
</code></pre>

<p>And a SAX handler to load the rows and cell ("Data" in the XML) contents:</p>

<pre><code>class SAXHandler extends DefaultHandler {

    List&lt;XmlRow&gt; xmlRowList = new ArrayList&lt;&gt;();
    XmlRow xmlRow = null;
    String content = null;

    @Override
    //Triggered when the start of tag is found.
    public void startElement(String uri, String localName, String qName, Attributes attributes)
            throws SAXException {
        switch (qName) {
            //Create a new Row object when the start tag is found
            case "Row":
                xmlRow = new XmlRow();
                break;
        }
    }

    @Override
    public void endElement(String uri, String localName,
                           String qName) throws SAXException {
        switch (qName) {
            case "Row":
                xmlRowList.add(xmlRow);
                break;
            case "Data":
                xmlRow.cellList.add(content);
                break;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        content = String.copyValueOf(ch, start, length).trim();
    }
}
</code></pre>

<p>Once the file is downloaded, it's loaded to a string to be supplied to the SAX Parser:</p>

<pre><code>String fileContent = IOUtils.toString(new FileInputStream(file));

SAXParserFactory parserFactor = SAXParserFactory.newInstance();
SAXParser parser = parserFactor.newSAXParser();
SAXHandler handler = new SAXHandler();

ByteArrayInputStream bis = new ByteArrayInputStream(fileContent.getBytes());

parser.parse(bis, handler);
</code></pre>

<p>When we try and run this the first time, and exception occurs:</p>

<pre><code>Exception in thread "main" org.xml.sax.SAXParseException; lineNumber: 2; columnNumber: 14; The processing instruction target matching "[xX][mM][lL]" is not allowed.
</code></pre>

<p>This happens because some blank space or other visible content exists before the <?xml ?> declaration. For this file, we have to remove the first 2 lines, so it can be parsed:</p>

<pre><code>
<b>removeLineFromFile(file.getAbsolutePath(), 1, 2);</b>

String fileContent = IOUtils.toString(new FileInputStream(file));

SAXParserFactory parserFactor = SAXParserFactory.newInstance();
SAXParser parser = parserFactor.newSAXParser();
SAXHandler handler = new SAXHandler();

ByteArrayInputStream bis = new ByteArrayInputStream(fileContent.getBytes());

parser.parse(bis, handler);
</code></pre>

<p><strong>The removeLineFromFile() method can be found on the sample project source.</strong></p>

<p>When we try to parse it again, another exception is thrown:</p>

<pre><code>Exception in thread "main" org.xml.sax.SAXParseException; lineNumber: 1313; columnNumber: 70; The entity "nbsp" was referenced, but not declared.
</code></pre>

<p>This happens because the entity <code>&amp;nbsp;</code> isn't one of XML's predefined entities. To fix it, we have to prepend this to the file before parsing it:</p>

<pre><code>&lt;?xml version="1.0"?&gt;
&lt;!DOCTYPE some_name [ 
    &lt;!ENTITY nbsp "&amp;#160;"&gt; 
    &lt;!ENTITY acute "&amp;#180;"&gt;
]&gt;
</code></pre>

<p>The <code>&amp;acute;</code> entity was also found on the file, so we added it as well.</p>

<p>There was also the text "JECFA&amp;ECCO" somewhere in a cell which would cause the parser to throw another exception.</p>

<p>The code now looks like this:</p>

<pre><code>removeLineFromFile(file.getAbsolutePath(), 1, 2);

String fileContent = IOUtils.toString(new FileInputStream(file));
fileContent = fileContent.replaceAll("&amp;ECCO", "&amp;#38;ECCO");
fileContent = "&lt;?xml version=\"1.0\"?&gt;\n" +
        "&lt;!DOCTYPE some_name [ \n" +
        "&lt;!ENTITY nbsp \"&amp;#160;\"&gt; \n" +
        "&lt;!ENTITY acute \"&amp;#180;\"&gt; \n" +
        "]&gt;" + fileContent;


SAXParserFactory parserFactor = SAXParserFactory.newInstance();
SAXParser parser = parserFactor.newSAXParser();
SAXHandler handler = new SAXHandler();

ByteArrayInputStream bis = new ByteArrayInputStream(fileContent.getBytes());

parser.parse(bis, handler);
</code></pre>

<p>We now have a row list, where each row has a cell list. We are going to use this to get our values:</p>

<pre><code>//Parsing the substances list obtained from XML
int count = 0;
for (XmlRow subsRow : handler.xmlRowList) {
    if (subsRow.cellList.size() &gt; 19) {
        String substance = subsRow.cellList.get(0);
        if (substance.equals("Substance")) {
            continue;
        }

        count++;
        String adi = subsRow.cellList.get(15); // column P
        String arfd = subsRow.cellList.get(18); // column S

        System.out.println("Substance name='" + substance + "', ADI='" + adi + "', ARfD='" + arfd + "'");
    }
}
</code></pre>

<p>In this example we only care for the substance rows, so we ignore any row that has less than 19 cells. On our project class we use these values to update a DB table, but for simplicity here we are only printing them.</p>

<p>If our goal was to create a regular Excel file, we could change the code to:</p>

<pre><code>Workbook workbook = new HSSFWorkbook();
Sheet sheet = workbook.createSheet();

//Converts all rows to POI rows 
int rowCount = 0;
for (XmlRow subsRow : handler.xmlRowList) {
    Row row = sheet.createRow(rowCount);
    int cellCount = 0;
    for (String cellValue : subsRow.cellList) {
        Cell cell = row.createCell(cellCount);
        cell.setCellValue(cellValue);
        cellCount++;
    }
    rowCount++;
}

String fileOutPath = "/tmp/fileOut.xls";
FileOutputStream fout = new FileOutputStream(fileOutPath);
workbook.write(fout);
workbook.close();
fout.close();
</code></pre>

<p>As you can see, it can be easily modified depending on the purpose!</p>

<p>Finally, we delete the downloaded file:</p>

<pre><code>if (file.exists()) {
    System.out.println("delete file-&gt; " + file.getAbsolutePath());
    if (!file.delete()) {
        System.out.println("file '" + file.getAbsolutePath() + "' was not deleted!");
    }
}
//System.out.println(result);
System.out.println("getAndParseFile finished, processed " + count + " substances!");
</code></pre>

<script async src="//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>  
<!-- Horizontal For Posts - Text Only -->  
<ins class="adsbygoogle" style="display:inline-block;width:728px;height:90px" data-ad-client="ca-pub-1311169549359552" data-ad-slot="3316155422"></ins>
<script>  
(adsbygoogle = window.adsbygoogle || []).push({});
</script>  

<p><br></p>

<h2 id="conclusion">Conclusion</h2>

<p>For the purpose of reading data from an Excel spreadsheet XML file, it's relatively simple to do it.</p>

<p>The sample project used in this post at <a href="https://github.com/jbaysolutions/excel-xml-reader">GitHub</a> has two main classes. <a href="https://github.com/jbaysolutions/excel-xml-reader/blob/master/src/main/java/com/jbaysolutions/excelparser/ExcelXmlReader.java">ExcelXmlReader</a> to download, parse the file and print the values to the console and the <a href="https://github.com/jbaysolutions/excel-xml-reader/blob/master/src/main/java/com/jbaysolutions/excelparser/ExcelXmlConverter.java">ExcelXmlConverter</a> class, to download, parse the file and converts it to OOXML file format.</p>

<p>The code can be modified to read multiple sheets or even the styles if necessary. It just wasn't needed for our use case. </p>

<p>Hope it helped anyone having the same issues as us!</p>

<h2 id="references">References</h2>

<p><a href="http://en.wikipedia.org/wiki/Microsoft_Office_XML_formats#Excel_XML_Spreadsheet_example">Excel spreadsheet XML</a></p>

<p><a href="http://en.wikipedia.org/wiki/Simple_API_for_XML">SAX</a></p>

<p><a href="http://poi.apache.org/" title="Apache POI">Apache POI</a></p>]]></content:encoded></item></channel></rss>