Statistic’s Software Development – Some Tips

I’ve been working at a new project that requires me to save a whole lot of data into a database. Without giving too much details right now (I’m not quite ready for that yet :)) I would say a good replacement would be to think I’ve developing something like Google Analytics. So, there are a whole lot of records to be recorded in a near real-time fashion.

The problem with this kind of software is the high complexity of the queries that need to be ran and also the number of different queries that each page/report usually shows. Doing queries on this sort of data involves some joins and a lot of disk space, but as you try to optimize the database for storage space and avoid redundancy the number of joins can become quite high. On top of that, its quite easy to dismiss the real performance penalties during the development phase because you usually have a very limited set of data to work with. These are some thoughts on my current experience with the project and MySQL…

Store data correctly

The number of records will be huge. So it’s very important that you remove any data redundancy and hand pick each and every data type. Leave no loose ends to this rule.

On a medium-sized database (i.e. commerce site with lots of customers/traffic) it can make little difference to pick between a Int and MediumInt (MySQL types). With over 1 million records it represents nearly 1MB of difference which would represent very little of the total database size. But unless you really need the Int it would make a lot of difference when you have 1 billion records (nearly 1GB) it can make things harder not only on disk space but also make queries slower.

Of course, if you need 1billion records, and the column is the key column… then you really need the Int in place. Also, if you delete records in the normal application flow, you might need to still have an Int instead of MediumInt if you use the auto-increment feature. While the auto-increment makes inserting data easier because it’s hard to manage concurency issues, if you delete data, the auto-increment feature doesn’t renew old values (i.e. if you delete id 4, the number 4 will never be used again unless you explicitly insert the 4 in the insert).

Another issue is the storing of dates and times. You’ve several ways of storing dates, but you need to analyse how you’ll be using those dates on the application. Using the Google Analytics example, most reports need the date only, and a handful of them need the hour too. When picking between the MySql types available (Time/Date: 3 bytes; DateTime: 8 bytes; Timestamp: 4 bytes) I choose 1 Date and a Bit(5) to store the hour. The reason for that choice is the fact that since I just need the hour of day, and even that is rarely used, I can save space and make my queries run faster. If I would store the data using a Timestamp for example, each record would need 4 bytes (only 3 bits more) which would require 350MB of additional space with unused data if we had 1 billion records. And although 350MB is reason enough, it isn’t the stronger reason at all. When running each query for the reports, the database engine would need to read the extra data and parse it, making things slower to run, specially since most reports would be grouped by date and the engine would need special care to sort that. So, by making sure we use the right field types and separate things with the usage objectives in mind, we can archive better and easier levels of performance.

You can also save disk space and improve performance by removing redundancy of the date fields. If you need precision up to the seconds or milliseconds part, then this probably isn’t a good practice. But if you need precision up to the minute then you can use a MediumInt (3 bytes) to store your data. This is true because 60s x 24 hours x 365 days = 525.600 and a MediumInt would be able to handle dates nearly for nearly 32 years ! If on the other hand you needed precision up to the hour only, you could even use a SmallInt for 7,5 years or MediumInt for 2.000 years. Well, but what advantages do we get from this? For starters, if you use a MediumInt instead of a Timestamp you save 1 byte for each records (1GB for 1 billion records). Since there might be periods of inactivity, you can also do a reference table and a foreign key increasing the number of records your field can hold – although I don’t recommend this because it makes the next benefit harder to archive. The last benefit is all about performance, by storing dates as simple integers your queries become faster since the data base engine will have an easier time to process all those conditions by comparing integers.

Those benefits come at a cost, not only you’re reducing the time frame of your data requiring future upgrades (but I imagine that in 32 years that won’t be a problem!) but you also need to take some extra precautions using the “new” date fields. The reason the database engine is heavier when storing and parsing dates and times is that it needs to take into account that each month has a different amount of days and so does each year. The Timestamp data type is the only that makes this easier by using the same technique I describe here, it stores the number of seconds since 1 Jan 1970. Just like the Timestamp field you need to check a reference table or calculate the real date at runtime. Fortunately, most reports are based in a start date and end date, which means you can simply calculate the int representation of a date just once and store it somewhere (i.e. the session for a web application) reusing those values for each report and query.

Lastly, when trying to reduce the redundancy to the maximum, it can be easier to debug if we use foreign keys. Foreign keys are good, they help to keep things working nicely, but they’re also expensive because the database engine needs to check the data integrity before each insert is committed. Since most of the data insertion code lives in the logging/recording tools (using the Google Analytics example again), it might be a good thing to make sure the code is checking everything by itself and disable the database engine integrity checking when moving into the production environment. With MySql, innodb is the most popular table engine that supports foreign keys, but when disabling them it might be even better to consider other options and choose another table engine that has better performance for this case scenarios. Haven’t personally got to this, but Myisam seems the obvious choice.

Pre-processing data by aggregating it

If you don’t pre-process the data, you’ll end up needing to fetch all the records for a period of time each time the user views the report. This is fine for a few thousands, maybe a few millions of records, but if you’ve several users and/or billions of records (i.e. a user getting a report for a whole year), you’ll either have a hard time optimizing things to archive reasonable performance, which could be even impossible, or most probably expending too much money in hardware.

On the other hand if you analyse the data your reports really need you’ll find common patterns. Even if you don’t receive the logs in real time (i.e. some logs are not delivered to the server as soon as the event happens but instead they are somehow cached and sent in batch mode days or even weeks after) you can process the data in a more compact and usable way. Using the Google Analytics example, when dealing with the page views instead of storing each ip, browser, etc… as records, you can store the number of page view for each hour or day on 1 record and keep incrementing it as new records arrive.

Some cases aren’t just that simple and need some additional amount of thought. If you, for example, are storing the number of page views per day and you want somewhere to show just the average page views/hour but not the number of page views each hour, then you can store the number of hours next to the number of page views and use that to calculate the averages. Of course this can only work is all data is real-time or incremental (you don’t cache/delay log records) for this particular example due to the fact that you need which hours were recorded already but the same principle can be applied to several other types of statistic.

One other case is when you need to process particular things, for example, what’s the most common way a user reachs a page. In those cases what you need to do is to store this kind of information in a specific way and keep incrementing it accordingly. You can for example store the list of pages that came before a particular page with a numeric field and keep incrementing it as new data comes. At the end, you have the number of times each page was called before a particular page. Of course this takes more space and effort, but at the end of the pre-processing you can usually remove the extra data if you just want to display the top 10 or so or some fields are empty/zero.

Overall, you can improve performance and remove many queries and/or complex joins by pre-processing, aggregating and organizing data the way you’ll need it. For the aggregating tables, there isn’t much need to go to the extreme optimizing the database schema to reduce redundancy to the maximum. The amount of data will be heavily reduced to the essential, and since you might even transfer this data from the logging/recording servers into the servers that actually generate/display the reports to the users, it should be better to keep some redundancy in order to avoid unnecessary joins and be able to get the data to the end-user as fast as possible.

Still, deleting the “raw” records isn’t a good choice for 2 reasons. First, there are cases where you simply can’t pre-process and aggregate data, and obviously you simply won’t delete them. Secondly, pre-processing and aggregating data is a process directly tied to the current application needs, you’re removing data when aggregating it. If somewhere in the future you wish to add a new type of report of particular view on the data, you might need the old records and re-pre-process them again to enable your users to view that report for past dates.

Automatically generate tons of records for the development phase

During the development and testing phase, real world data is usually scarce, so you need to generate tons of random records. Not only this is a good practice during the queries development phase to see how queries will perform but using the logging tool itself to do the actual recording of the data is a big plus. You can easily detect flaws in the recording tool and the queries if you’ve enough records. How many records are enough? That actually depends on the project and real world expectations. Use a calculator or spreadsheet to get to know an estimate of how many records will exist over the first day, week, month, year and 5 years of the project per user. Those numbers aren’t exactly the number of records you need to generate for the development phase, you need to take into account the number of servers you’ll be running. So, if you expect there will be 1 billion records over the first year of the project for each user and using 100 server, you would need a database with about 10 million records. 10 million records are probably overkill for an early development phase, you can probably start with 100.000 and increase the number as you feel the application is maturing.

The generation of the random data can be tricky too. Some types of data exists more than other. You need to create an algorithm that takes this into account, but its also good to make sure it will generate lots of noise data. What I mean by noise data is to make sure things aren’t exactly 50/50, it should randomize it a bit so that there’s a chance that one generation process could even make 65-35 or 25-75. This noise data usually represents different user types, using the Google Analytics example, it could represent the differences between a blog user and an e-commerce store for some particular reports.

On the programming side of this, generating data usually means a lot of randoms, some statistics knowledge and depending on the number of records a considerable ammount of time for the code to run. Generating random dates is, in my experience, the least obvious piece, check this previous post for some tips. The time it takes to generate the data and record it using your real world logging tool can be vital to know how it will behave. Your users can handle slowness or even small bugs when checking the statistics be it online or by generating static reports – bugs can be fixed! But users will not (ever!) tolerate loss of data, so make sure your logging tool can handle even the most abnormal spikes of activity.

To test for spikes of activity, my recommendation is that your random data generator should be really fast – by fast I mean faster than the logging tool which is usually not possible – or you should generate the records and save them to something like a text file and then use the text file to feed the logging tool. Since reading files is usually something really fast, it should be able to overload the logging tool and you should be able to know how many requests/second it can handle. And how many requests/second is it good enough? Well, just like the number of random records, it depends on number of users, the amount data each user will produce and number of servers. Take that into account and make sure overall your logging tool can handle spikes of up to 1000% of the normal data flow. 1000% can seem a lot but I would go as far as saying it is too little. For a Google Analytics project I would go with a logging tool capable of handling 10.000-100.000% the normal amount of data. Since the logging itself can be really simple, it shouldn’t be hard to archive those levels of performance. Once you know you reached the max performance the tool can give, you’ll be able to calculate how many logging servers you’ll need just for those purposes based on the total number of users you want.

Post to Twitter Post to Delicious Post to Digg Post to Facebook Post to Reddit Post to StumbleUpon

2 Responses to “Statistic’s Software Development – Some Tips”

  1. Seth says:

    Ok, since you chose to leave us in the dark
    :) I’ll admit I couldn’t get my attention span around the full article. However, my browsers happily oblige the /-key to search and the single most absent word that grabbed my attention was

    Once you mention a billion records, you will *have* to think about partitioning; with a view to scalability (clustering/farming) and to avoid death by index

    Just wanted to shout it, because, you know, you just *might* not need it. Very slim chance :)

  2. Tyler says:

    When I saw a billion records with a gigabit of filespace difference
    made out to be an issue I felt nauseous. My sd-card has a gb on it. Gb’s cost pennies now. Speed wise, hardware will likely outpace your development cycle if your going to be this hard on yourself.
    But, I will say that I’m not a programmer; have done some, but ain’t one as I run away. You do have a point when your talking about speed, but you have no data to talk about how your speed will be
    affected by. If I where you I would just aim for speed and clustering if you intend to build something that requires a mainframe as you imply

Leave a Reply

For spam filtering purposes, please copy the number 5276 to the field below: