Parsing an HTML file in ARexx

Gundam · 08 April 2014, 12:14

I want to extract information from <td> elements, e.g.:

<td>
A

</td>
<td>
B

</td>

...

How do I achieve that in ARexx?
Thanks in advance.

thomas · 08 April 2014, 13:02

How much do you know about these HTML files? For example, are <td>, the value and </td> always on separate lines? Are <td> and </td> always written in lower case?

Writing a generic HTML parser can be very difficult. But if you know that the files you read are always built the same way, it gets much easier.

Gundam · 08 April 2014, 13:55

Quote:

Originally Posted by thomas

How much do you know about these HTML files? For example, are <td>, the value and </td> always on separate lines? Are <td> and </td> always written in lower case?

Writing a generic HTML parser can be very difficult. But if you know that the files you read are always built the same way, it gets much easier.

The <td> elements (on separate lines) are within a <table class="tableclass">; the "class" attribute "tableclass" is a unique identifier.

thomas · 08 April 2014, 15:03

That wasn't what I asked.

Let's look at an example. I saved the source code of this page as board.htm. The ARexx program board.rexx extracts the posting time and user name from the table of posts. It takes a very simple approach, which means it only works on HTML files that are laid out exactly like this one.

Code:

3> rx board.rexx
Today, 12:14 Gundam
Today, 13:02 thomas
Today, 13:55 Gundam
3>

Gundam · 08 April 2014, 15:26

Quote:

Originally Posted by thomas

That wasn't what I asked.

Let's look at an example. I saved the source code of this page as board.htm. The ARexx program board.rexx extracts the posting time and user name from the table of posts. It takes a very simple approach, which means it only works on HTML files that are laid out exactly like this one.
...

That's OK!
I don't need a generic parser; I just want to extract the contents of <td> elements which are in the <table> with THAT specific identifier.
The value is on a separate line, like this:

<td>
value
</td>

The <td> tags don't have any "id" attribute, so they can't be easily identified; that's why I asked for help.

daxb · 08 April 2014, 17:23

That's easy. Use thomas's script as a template and change or reduce it to fit your needs. You just have to call readln() and check whether the line is <td>. If it is, keep reading lines until </td> is reached.
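The loop described above could be sketched roughly as follows. This is only a minimal sketch, not thomas's board.rexx: the file name page.htm and the logical file name 'in' are hypothetical, and it assumes that every tag and every value sits on its own line and that the table is opened with a line containing class="tableclass", as described earlier in the thread.

```rexx
/* tdvalues.rexx - print <td> values found inside <table class="tableclass"> */
/* Assumptions: page.htm is a hypothetical file name; each tag and each      */
/* value is on its own line; tables are not nested.                          */

if ~open('in', 'page.htm', 'R') then do
  say 'Cannot open page.htm'
  exit 10
end

intable = 0   /* 1 while inside the wanted table     */
intd    = 0   /* 1 while between <td> and </td>      */

do until eof('in')
  line = strip(readln('in'))   /* strip() removes blanks only, not tabs */
  tag  = upper(line)           /* compare tags case-insensitively       */
  select
    when pos('<TABLE', tag) = 1 & pos('CLASS="TABLECLASS"', tag) > 0 then
      intable = 1
    when tag = '</TABLE>' then do
      intable = 0
      intd = 0
    end
    when ~intable then nop     /* ignore everything outside the table   */
    when tag = '<TD>' then intd = 1
    when tag = '</TD>' then intd = 0
    when intd & (line ~= '') then say line
    otherwise nop
  end
end

call close('in')
```

Because the class value is upper-cased together with the rest of the line, the match on the class name is case-insensitive too. Lines indented with tabs, or a <td> tag that carries attributes, would need extra handling.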

Gundam · 08 April 2014, 18:20

Quote:

Originally Posted by daxb

That's easy. Use thomas's script as a template and change or reduce it to fit your needs. You just have to call readln() and check whether the line is <td>. If it is, keep reading lines until </td> is reached.

Thanks!

Lonewolf10 · 09 April 2014, 00:16

Quote:

Originally Posted by Gundam

That's OK!
I don't need a generic parser; I just want to extract the contents of <td> elements which are in the <table> with THAT specific identifier.
The value is on a separate line, like this:

<td>
value
</td>

The <td> tags don't have any "id" attribute, so they can't be easily identified; that's why I asked for help.

Just be aware that HTML tags and data are not always on separate lines. Code generated by programs usually is, but I personally write the HTML for my website by hand, as I find it more fun and easier to read that way. What I'm trying to say is that some data may be stored like this:

Code:

<TABLE>
<TR><TH>Some data</TH></TR>
<TR><TD>More data</TD></TR>
<TR><TD>Even more data</TD></TR>
<!-- and the rest of the table follows... -->
</TABLE>

<TR> marks the start of a table row and </TR> marks its end.
<TH> marks the start of a table (column) header cell and </TH> marks its end.

The above example is a one-column table; it is entirely possible to have several <TH> or <TD> entries in the same row, and several tags on the same line.
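When tags and data share a line like this, comparing the whole line against <td> no longer works, and the line has to be searched instead. A minimal sketch, assuming plain <TD> tags without attributes and cells that do not span several lines (the sample line is made up for illustration):

```rexx
/* tdinline.rexx - pull every <TD>...</TD> value out of a single line */
/* Assumptions: tags carry no attributes; a cell never spans lines.   */

rest = '<TR><TD>More data</TD><TD>Even more data</TD></TR>'

do forever
  p = pos('<TD>', upper(rest))      /* next opening tag, any case */
  if p = 0 then leave
  rest = substr(rest, p + 4)        /* skip past '<TD>'           */
  q = pos('</TD>', upper(rest))     /* matching closing tag       */
  if q = 0 then leave
  say strip(left(rest, q - 1))      /* the cell contents          */
  rest = substr(rest, q + 5)        /* continue after '</TD>'     */
end
```

In a real script, the same loop would run on each line returned by readln() instead of on a fixed string.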
