
Extracting Data Between Two Tags In Html File

I've got a HUUUGE HTML file here saved on my system, which contains data from a product catalogue. The data is structured such that for each product record the name is between two

Solution 1:

A file of size 50 MB isn't so big that you can't just load its contents directly into MATLAB as a string, which you can do with the function FILEREAD:

strContents = fileread('yourfile.html');

Assuming the file format you have above, you can then parse the contents with the function REGEXP (using named token capture):

expr = '<(?<tag>name|prodId|color)>''([^<>]+)''</\k<tag>>';
tokens = regexp(strContents,expr,'tokens');
tokens = vertcat(tokens{:});

And the contents of tokens using your sample file contents will be:

tokens = 

    'name'      'hat'
    'prodId'    '1829493'
    'color'     'cyan'
    'name'      'shirt'
    'prodId'    '193'
    'name'      'dress'
    'prodId'    '18'
    'color'     'dark purple'
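To try the expression without the original file, here is a minimal sketch that runs it on an inline sample string. The sample is reconstructed from the output shown above, so its exact layout is an assumption, and the pattern uses the numbered backreference \1 in place of the named token; otherwise the idea is the same.

```matlab
% Sample markup (assumed layout): values are wrapped in single quotes,
% and the second record has no <color> tag.
sample = ['<name>''hat''</name><prodId>''1829493''</prodId><color>''cyan''</color>' ...
          '<name>''shirt''</name><prodId>''193''</prodId>' ...
          '<name>''dress''</name><prodId>''18''</prodId><color>''dark purple''</color>'];

% Token 1 is the tag name, token 2 is the quoted value; \1 forces the
% closing tag to match the opening tag.
expr = '<(name|prodId|color)>''([^<>]+)''</\1>';

tokens = regexp(sample, expr, 'tokens');  % 1-by-N cell, each cell a 1-by-2 cell
tokens = vertcat(tokens{:});              % stack into an N-by-2 cell array
disp(tokens)
```

Each row of the result pairs a tag with its value, which is the shape the structure-filling code expects.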

You may then want to parse the resulting N-by-2 cell array and place the contents in a structure array with fields 'name', 'prodId', and 'color'. The difficulty is that not every entry will have all three fields. Assuming each 'name' will be followed by either a 'prodId', a 'color', or both (in the order 'prodId' then 'color'), then the following code should work for you:

s = struct('name',[],'prodId',[],'color',[]);  %# Initialize structure
nTokens = size(tokens,1);                      %# Get number of tokens
nameIndex = find(strcmp(tokens(:,1),'name'));  %# Find indices of 'name'
[s(1:numel(nameIndex)).name] = deal(tokens{nameIndex,2});  %# Fill 'name' field

%# Find and fill 'prodId' that follows a 'name':
index = strcmp(tokens(min(nameIndex+1,nTokens),1),'prodId');
[s(index).prodId] = deal(tokens{nameIndex(index)+1,2});

%# Find and fill 'color' that follows a 'name':
index = strcmp(tokens(min(nameIndex+1,nTokens),1),'color');
[s(index).color] = deal(tokens{nameIndex(index)+1,2});

%# Find and fill 'color' that follows a 'prodId':
index = strcmp(tokens(min(nameIndex+2,nTokens),1),'color');
[s(index).color] = deal(tokens{min(nameIndex(index)+2,nTokens),2});

And the contents of s using your sample file contents will be:

>> s(1)

      name: 'hat'
    prodId: '1829493'
     color: 'cyan'

>> s(2)

      name: 'shirt'
    prodId: '193'
     color: []

>> s(3)

      name: 'dress'
    prodId: '18'
     color: 'dark purple'
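If the ordering assumption does not hold for your file, a plain loop is a simpler (if slower) alternative to the vectorized code. The sketch below is not part of the approach above; it only assumes that every record starts with a 'name' tag, and it defines a small tokens array inline so it can be run on its own.

```matlab
% Stand-in for the N-by-2 cell array produced by REGEXP (sample values only).
tokens = {'name','hat'; 'prodId','1829493'; 'color','cyan'; ...
          'name','shirt'; 'prodId','193'; ...
          'name','dress'; 'prodId','18'; 'color','dark purple'};

s = struct('name', {}, 'prodId', {}, 'color', {});  % empty struct array
k = 0;                                              % index of the current record
for i = 1:size(tokens, 1)
    tag = tokens{i, 1};
    if strcmp(tag, 'name')
        k = k + 1;                   % each 'name' starts a new record
    end
    if k > 0                         % skip any tags seen before the first 'name'
        s(k).(tag) = tokens{i, 2};   % dynamic field name: name, prodId or color
    end
end
```

Fields that never appear for a record, such as the color of 'shirt', are left as empty arrays, and the order of 'prodId' and 'color' within a record no longer matters.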

Solution 2:

There are two ways of solving this sort of problem: string manipulation with regexes (as suggested by gnovice) or parsing the file (or a mix of the two). Parsing is often best if your file is very well structured; regexes win for messy files.

Here's the parsing solution.

Start by downloading xml_io_tools, and calling xml_read on your file. Your example isn't completely reproducible, so here are two different versions of the data.

Save this to test1.xml:

<?xml version="1.0" encoding="utf-8"?>
<root>
  <name>'hat'</name>
  <prodId>'1829493'</prodId>
  <color>'cyan'</color>
  <name>'dress'</name>
  <prodId>'18'</prodId>
  <color>'dark purple'</color>
</root>

Save this to test2.xml:

<?xml version="1.0" encoding="utf-8"?>
<root>
  <item>
    <name>'hat'</name>
    <prodId>'1829493'</prodId>
    <color>'cyan'</color>
  </item>
  <item>
    <name>'dress'</name>
    <prodId>'18'</prodId>
    <color>'dark purple'</color>
  </item>
</root>

Now compare

x1 = xml_read('test1.xml')
x2 = xml_read('test2.xml')
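If you would rather not install a toolbox, MATLAB's built-in xmlread can parse the same file. This is a rough sketch of that alternative rather than the xml_read approach described here; it assumes test2.xml is saved in the current folder in the item-per-record layout, and it walks the Java DOM object that xmlread returns.

```matlab
doc = xmlread('test2.xml');                % built-in; returns a Java DOM document
items = doc.getElementsByTagName('item');  % one node per <item> element
fields = {'name', 'prodId', 'color'};
s = struct('name', {}, 'prodId', {}, 'color', {});

for k = 1:items.getLength()
    item = items.item(k-1);                % DOM node lists are zero-based
    for f = 1:numel(fields)
        nodes = item.getElementsByTagName(fields{f});
        if nodes.getLength() > 0
            node = nodes.item(0);
            s(k).(fields{f}) = char(node.getTextContent());  % Java string to char
        else
            s(k).(fields{f}) = [];         % tag missing for this item
        end
    end
end
```

The values keep the single quotes that appear in the file; if you do not want them, something like strrep(value, '''', '') would remove them.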
