Topic: HTML to CSV

Offline ElementCoder
HTML to CSV
« on: November 07, 2012, 01:30:31 pm »
I need to convert HTML tables to a CSV file. I was thinking of something like:
1. Look for <table></table> and remove everything outside of it
2. Remove any formatting tags like <a href=> etc.
3. Somehow put every <tr> on a new line and all <td> elements within it on the same line, separated by a comma (CSV :P)
Truth be told, I have no idea how to accomplish this.
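
Those three steps could be sketched in Java roughly like this, using java.util.regex instead of a real HTML parser. The names TableToCsv, convert and quote are made up for illustration. It assumes a single, well-formed, non-nested table, treats <th> like <td>, and does not decode entities such as &amp;. Quoting fields that contain a comma is an extra that the steps above do not mention.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TableToCsv {
    // (?is) = case-insensitive, and '.' also matches line breaks.
    private static final Pattern TABLE = Pattern.compile("(?is)<table\\b[^>]*>(.*?)</table>");
    private static final Pattern ROW = Pattern.compile("(?is)<tr\\b[^>]*>(.*?)</tr>");
    private static final Pattern CELL = Pattern.compile("(?is)<t[dh]\\b[^>]*>(.*?)</t[dh]>");

    public static String convert(String html) {
        // Step 1: keep only what sits between <table> and </table>.
        Matcher table = TABLE.matcher(html);
        if (!table.find()) {
            return "";
        }
        StringBuilder csv = new StringBuilder();
        Matcher row = ROW.matcher(table.group(1));
        while (row.find()) {
            List<String> cells = new ArrayList<>();
            Matcher cell = CELL.matcher(row.group(1));
            while (cell.find()) {
                // Step 2: drop any remaining tags (<a href=...>, <b>, ...) inside the cell.
                String text = cell.group(1).replaceAll("<[^>]*>", "").trim();
                cells.add(quote(text));
            }
            // Step 3: one line per <tr>, cells separated by commas.
            csv.append(String.join(",", cells)).append("\n");
        }
        return csv.toString();
    }

    // Assumption: wrap a field in double quotes when it contains a comma, quote or newline.
    private static String quote(String field) {
        if (field.contains(",") || field.contains("\"") || field.contains("\n")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    public static void main(String[] args) {
        String html = "<p>intro</p><table><tr><td><a href=\"x\">Name</a></td>"
                + "<td>Doe, J</td></tr></table>";
        System.out.print(convert(html)); // prints: Name,"Doe, J"
    }
}
```

For messy real-world pages a proper HTML parser would likely be more robust than regular expressions, but for a simple, regular table this is probably enough.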

Some people need a high five in the face... with a chair.
~EC

Offline squidgetx
Re: HTML to CSV
« Reply #1 on: November 07, 2012, 03:26:43 pm »
I'm only kinda new to Java, but this is how I would begin thinking about it. Read in the .html file and parse it as text, moving through it until you reach <table>. From there, the table-parsing code looks through the rest and ignores everything in <> unless it's a tr or td tag: output the text inside each td on one line (when you reach </td>, output a comma), start a new line once you reach </tr>, and loop until you reach </table>.

I guess the annoying part would be writing a good tag-recognition algorithm, but if you were feeling lazy/inelegant you could just scan one character at a time and use a lot of if statements once you find a "<" character.
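
A rough sketch of that character-at-a-time scan in Java, with plain if statements once a "<" turns up. TagScanner, tableToCsv and tagName are hypothetical names. One deviation from the description above: the comma is written before every cell except the first in a row, rather than at each </td>, so rows do not end in a trailing comma. It assumes one non-nested table and does no CSV quoting.

```java
import java.util.Locale;

public class TagScanner {
    public static String tableToCsv(String html) {
        StringBuilder out = new StringBuilder();
        boolean inTable = false;   // have we passed <table> yet?
        boolean inCell = false;    // are we between <td> and </td>?
        boolean firstCell = true;  // is the next <td> the first one in its row?
        int i = 0;
        while (i < html.length()) {
            char c = html.charAt(i);
            if (c == '<') {
                int close = html.indexOf('>', i);
                if (close < 0) {
                    break; // unterminated tag: give up
                }
                String tag = tagName(html.substring(i + 1, close));
                if (tag.equals("table")) {
                    inTable = true;
                } else if (tag.equals("/table")) {
                    break; // end of the table: stop scanning
                } else if (inTable) {
                    if (tag.equals("tr")) {
                        firstCell = true;
                    } else if (tag.equals("/tr")) {
                        out.append('\n');
                    } else if (tag.equals("td")) {
                        if (!firstCell) {
                            out.append(',');
                        }
                        firstCell = false;
                        inCell = true;
                    } else if (tag.equals("/td")) {
                        inCell = false;
                    }
                    // Any other tag (<a href=...>, <b>, ...) is simply skipped.
                }
                i = close + 1; // jump past the whole tag
            } else {
                if (inCell) {
                    out.append(c); // plain text inside a cell is kept
                }
                i++;
            }
        }
        return out.toString();
    }

    // Reduces the text between '<' and '>' to its lower-case name: "A href=x" -> "a", "/TD" -> "/td".
    private static String tagName(String inside) {
        String s = inside.trim();
        int end = 0;
        while (end < s.length() && !Character.isWhitespace(s.charAt(end))) {
            end++;
        }
        return s.substring(0, end).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String html = "<p>skip</p><table><tr><td><b>a</b></td><td>b</td></tr></table>";
        System.out.print(tableToCsv(html)); // prints: a,b
    }
}
```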

This is where I think a lower-level language would be more appropriate than Java, mainly because looking for certain character patterns would be super easy in something like Axe or C (I think, anyway), whereas in Java I have no idea what the proper library/API is or how to use it for that purpose :P
« Last Edit: November 07, 2012, 03:30:20 pm by squidgetx »

Offline ElementCoder
Re: HTML to CSV
« Reply #2 on: November 14, 2012, 02:38:26 am »
Well, the only language I'm somewhat good at is Java (I'm learning C++, but that is far from complete :P), so I'll just mess around a bit and post any progress here, I guess.
[edit]
Reading the file in is quite easy with Scanner (which I found out after messing around way too long with BufferedReader, thanks BuilderBoy). The table is easily isolated too, by splitting the string at the <table> and </table> tags.
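
For anyone following along, a minimal sketch of what that could look like. The names TableIsolator, readAll and isolateTable are invented for this example and are not the actual code from this post; it assumes a UTF-8 file and keeps only the first table.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

public class TableIsolator {
    // Reads the whole file into one String using Scanner.
    public static String readAll(File file) throws FileNotFoundException {
        try (Scanner in = new Scanner(file, "UTF-8")) {
            // \A matches only at the start of the input, so the entire file is one token.
            in.useDelimiter("\\A");
            return in.hasNext() ? in.next() : "";
        }
    }

    // Returns the text between the first <table ...> and the next </table>, or "" if there is no table.
    public static String isolateTable(String html) {
        // The limit of 2 splits once only: [before the tag, everything after it].
        String[] afterOpen = html.split("(?i)<table\\b[^>]*>", 2);
        if (afterOpen.length < 2) {
            return "";
        }
        return afterOpen[1].split("(?i)</table>", 2)[0];
    }

    public static void main(String[] args) throws FileNotFoundException {
        if (args.length != 1) {
            System.err.println("usage: java TableIsolator <file.html>");
            return;
        }
        System.out.println(isolateTable(readAll(new File(args[0]))));
    }
}
```

The (?i) flag makes the split case-insensitive, so <TABLE border="1"> is matched as well.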
« Last Edit: November 15, 2012, 03:32:21 am by ElementCoder »