Tuesday, February 26, 2013

Filling the database

The alpha and omega of a website like the QetriX Particle Database is data, data, data.

My primary goal was to import all countries of the world, with their major/capital cities, and all settlements in the Czech Republic, with all their main attributes, into the database. It was quite an easy task and it filled about 60 000 rows (about 13 500 entities). I mostly used a spreadsheet (OOo Calc), because I downloaded and consolidated some Excel files.

Then I found a list of all streets in Prague and all first and last names in the Czech Republic; I gathered a list of all mobile phone models, some aircraft and cars, a lot of TV series with their episodes, and finally a list of almost all airports in the world. Altogether it was about 400 000 rows.

I used custom-made XML and TSV parsers. TSV is my favorite, because unlike commas, semicolons, pipes etc., tabs usually don't appear in the data itself. Plus, when I copy-paste data through the clipboard from Calc to the import front end, it already arrives as TSV - values separated by tabs. I had to create some custom PHP scripts as well, for more complex or unevenly structured data sets.
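A minimal sketch of such a TSV parser in PHP (illustrative only - the function name and sample data are made up, not the actual import code):

<?php
// Parse tab-separated text (e.g. pasted from Calc) into associative arrays,
// using the first row as a header. A sketch, not the real importer.
function parseTsv($tsv)
{
	$rows = array();
	$lines = preg_split('/\r\n|\r|\n/', trim($tsv));
	$header = explode("\t", array_shift($lines));
	foreach ($lines as $line) {
		if ($line === '') continue; // skip blank lines
		$values = explode("\t", $line);
		if (count($values) !== count($header)) continue; // skip malformed rows
		$rows[] = array_combine($header, $values);
	}
	return $rows;
}

// Example: a header row plus one data row, as they might arrive from the clipboard
print_r(parseTsv("name\tcountry\tpopulation\nPrague\tCZ\t1280000"));

The nice part is that one function like this can serve both uploaded files and clipboard pastes, since Calc puts TSV on the clipboard anyway.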

One day I found some good keywords for digging up all kinds of lists of Czech companies, so I put some of them into the database as well, along with some Czech POIs. But it was quite hard to find a source of POIs without a restrictive license... There are great websites for POIs, but I don't steal. Anyway, 600 000 rows.

Then I found free-to-use sources of detailed structured data on particular aircraft - serial numbers, registrations, types, build dates etc. I love flying and I've always been curious about the age of the aircraft I was about to board. Not because I was scared; it was more about its current state and what I could expect on board. Thanks to this data I was able to create a simple app just for that :) 800 000 rows.
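Just to show the idea behind that app (the table and column names here are hypothetical, not the real Particle Database schema): given a registration, look up the airframe and compute its age from the build date.

<?php
// Hypothetical aircraft-age lookup - table/column names and connection
// details are made up for illustration.
$pdo = new PDO('mysql:host=localhost;dbname=qdb', 'user', 'password');

$stmt = $pdo->prepare('SELECT type, built_date FROM aircraft WHERE registration = ?');
$stmt->execute(array('OK-ABC')); // example registration (OK = Czech prefix)

if ($row = $stmt->fetch(PDO::FETCH_ASSOC)) {
	$built = new DateTime($row['built_date']);
	$age = $built->diff(new DateTime())->y; // whole years since the build date
	echo "{$row['type']}, built {$row['built_date']}, {$age} years old\n";
}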

After that I had a lot of unfinished data sheets in Calc - a taxonomy of animal species, administrative divisions and subdivisions of countries and cities, movies, songs etc. I was confident this was a good approach, because the data would be clean, complete to a certain level and without any significant duplicates or mistakes. I also defended this position in a dispute about the amount of data the system provides. Even though I had some previous experience with web crawlers and data parsers, I didn't expect I'd have to use them here (rather, I didn't want to use them here). Boy, was I wrong! :) The more I defended my position, the more I thought about it, and the more I understood that crawlers are the only way to get a larger amount of data into the database.
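For the record, by a crawler I mean nothing fancier than this kind of sketch - fetch a page, pull the values out and emit them as TSV, so they can go through the same import front end as the hand-made sheets (the URL and the pattern are placeholders):

<?php
// Minimal crawler sketch - placeholder URL and pattern, not a real source.
// For real pages DOMDocument is more robust than a regex; this is just the idea.
$html = file_get_contents('http://example.com/list-of-items');

// Grab pairs of adjacent table cells and print them tab-separated
if (preg_match_all('~<td>([^<]+)</td>\s*<td>([^<]+)</td>~', $html, $matches, PREG_SET_ORDER)) {
	foreach ($matches as $m) {
		echo $m[1] . "\t" . $m[2] . "\n"; // name<TAB>value, ready for the TSV importer
	}
}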
