The Alpha and Omega of a website like the QetriX Particle Database is data, data, data.
My primary goal was to import all countries of the world with their
major/capital cities, plus all settlements in the Czech Republic with their
main attributes, into the database. It was quite an easy task and it filled
about 60 000 rows (about 13 500 entities). I mostly used a spreadsheet
(OOo Calc), because I had downloaded and consolidated some Excel files.
Then
I found a list of all streets in Prague and all first and last names in the
Czech Republic, I gathered a list of all mobile phone models, some aircraft and
cars, a lot of TV series with episodes, and finally I got a list of
almost all airports in the world. Altogether it was about 400 000 rows.
I used custom-made XML and TSV parsers. TSV is my favorite, because unlike commas, semicolons, pipes etc., tabs usually don't appear in the data itself. Plus, when I copy-paste data through the clipboard from Calc into the import front end, it already arrives as TSV: values separated by tabs. I also had to create some custom PHP scripts for more complex or unevenly structured data sets.
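The appeal of TSV described above can be shown in a few lines. This is only a minimal sketch (in Python rather than the original PHP scripts, which aren't shown here), and the `parse_tsv` function name and the sample clipboard data are my own illustration, not part of the original import code:

```python
def parse_tsv(text):
    """Parse clipboard-style TSV (tab-separated values) into a list of rows.

    Unlike CSV, no quoting or escaping is needed, because tabs
    rarely appear inside the values themselves.
    """
    rows = []
    for line in text.strip().splitlines():
        rows.append(line.split("\t"))
    return rows

# Hypothetical example: data pasted from Calc arrives as tab-separated lines.
clipboard = "Prague\tCZ\t1300000\nBrno\tCZ\t380000"
print(parse_tsv(clipboard))
# → [['Prague', 'CZ', '1300000'], ['Brno', 'CZ', '380000']]
```

Compare this with a CSV parser, which has to handle quoted fields, embedded delimiters and escaped quotes; that is exactly the complexity tabs avoid.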
One day I found some good keywords for finding all kinds of lists of Czech companies, so I put some of them into the database as well, along with some Czech POIs. But it was quite hard to find a source of POIs without a restrictive licence... There are great websites for POIs, but I don't steal. Anyway, 600 000 rows.
Then I found free-to-use sources of detailed structured data on individual aircraft: serial numbers, registrations, types, build dates etc. I love flying and I was always curious about the age of the aircraft I was about to board. Not because I was scared; it was more about its current state and what I could expect on board. Thanks to this data I was able to create a simple app just for that :) 800 000 rows.
After that I had a lot of unfinished data sheets in Calc: taxonomy with animal species, administrative divisions and subdivisions of countries and cities, movies, songs etc. I was confident this was a good approach, because the data would be clean, complete to a certain level, and without any significant duplicates or mistakes. I also defended this position in a dispute about the amount of data the system provides. Even though I had some previous experience with web crawlers and data parsers,
I didn't expect I'd have to use them here (rather, I didn't want to use them here). Boy, was I wrong!
:) The more I defended my position, the more I thought about it, and the more I understood that this is the only way to get a larger amount of data into the database.