CRC question
User:
mcholste
Date: 3/14/2010 8:14 am
Date: 3/14/2010 8:14 am
Views: 1590
Rating: 0
Rating: 0
I wondered if someone on the list could help me understand something: I'm trying to
find a very cheap, deterministic function that will allow me to create
a uint32 ID for a small string. One-way hashes are fine as I will
store the original value alongside the the generated ID. I'm trying to solve problems I'm getting using auto-increment values, so any 32-bit int will do, it just has to be able to be calculated the same way by independent machines so that they don't have to sync their ID's. My current
plan is to use crc32 as the hashing function, but I'm trying to figure
out how likely collisions will be. If I understand CRC correctly,
CRC32 guarantees no collisions for 16 bits, is that right? Can you
think of a better hashing algorithm to use? MD5 and SHA are way too
lengthy to store as uint32, and I'm trying to avoid writing a custom function. I'm aware that's impossible to come up with an entirely unique id for every possible string of up to 32 chars in 32 bits. What I'm trying to come up with is something that would be good enough so that the chances of getting a collision on a set of a few thousand strings under 32 chars would be less than the chance of an asteroid destroying the earth. Is this reasonable?
Thanks,
Martin
Thanks,
Martin
Re: CRC question
User:
haarg
Date: 3/14/2010 9:38 am
Date: 3/14/2010 9:38 am
Views: 213
Rating: 0
Rating: 0
I don't really know how to compare the algorithms for your usage, but
SHA-1 is a strong enough algorithm you may be able to use it and only
take the first 32 bits and use it for your identifier.
In git, every object is identified by a SHA-1 hash. In many places it
will display shortened forms of the identifiers that are 6 or 7
characters long instead because it is still good enough to identify
the object.
SHA-1 is a strong enough algorithm you may be able to use it and only
take the first 32 bits and use it for your identifier.
In git, every object is identified by a SHA-1 hash. In many places it
will display shortened forms of the identifiers that are 6 or 7
characters long instead because it is still good enough to identify
the object.
Re: CRC question
User:
mcholste
Date: 3/14/2010 12:04 pm
Date: 3/14/2010 12:04 pm
Views: 201
Rating: 0
Rating: 0
Ok, was the list server down? I sent my original question back on Thursday and it's just popping up now...
That's a good point about not needing all of the hash to get what I need out of it. So, follow-up, would 4 chars of SHA1 be less collision-prone than the 4 chars of a CRC32? If there's no big difference, than I would go with CRC because it's orders of magnitude faster. In particular, String::CRC32's calculation is only three times slower than accessing a hash value with a hash key in a basic Perl hash. One word of warning: Digest::CRC is more than 1000x slower than String::CRC32, so don't use it unless you have to!
--Martin
That's a good point about not needing all of the hash to get what I need out of it. So, follow-up, would 4 chars of SHA1 be less collision-prone than the 4 chars of a CRC32? If there's no big difference, than I would go with CRC because it's orders of magnitude faster. In particular, String::CRC32's calculation is only three times slower than accessing a hash value with a hash key in a basic Perl hash. One word of warning: Digest::CRC is more than 1000x slower than String::CRC32, so don't use it unless you have to!
--Martin
On Sun, Mar 14, 2010 at 8:38 AM, <haaarg@gmail.com> wrote:
haarg wrote:
I don't really know how to compare the algorithms for your usage, but
SHA-1 is a strong enough algorithm you may be able to use it and only
take the first 32 bits and use it for your identifier.
In git, every object is identified by a SHA-1 hash. In many places it
will display shortened forms of the identifiers that are 6 or 7
characters long instead because it is still good enough to identify
the object.
Madison Area Perl Mongers - MadMongers
http://www.madmongers.org
Re: CRC question
User:
chrisdolan
Date: 3/14/2010 12:33 pm
Date: 3/14/2010 12:33 pm
Views: 218
Rating: 0
Rating: 0
If you expect 10^6 records in your database, then there are about 10^6 * 10^6 / 2 =~= 5 * 10^11 collision opportunities. A 32-bit hash has 4 * 10^9 possible values. So that means you should expect there to be about 100 collisions. If your database grows by another factor of 10 records, then the average number of collisions goes up by 100.
So, if you're using only 32 bits for the hash then you application needs to plan for ID collisions. On the other hand, if you stick with synchronized auto incrementing (or assigned pools of auto-increment IDs per node perhaps) then you can get the full 4 * 10^9 records in the database before you get a collision.
Chris
On Mar 14, 2010, at 12:04 PM, <madtalk@madmongers.org> <madtalk@madmongers.org> wrote:
mcholste wrote:
Ok, was the list server down? I sent my original question back on Thursday and it's just popping up now...
That's a good point about not needing all of the hash to get what I need out of it. So, follow-up, would 4 chars of SHA1 be less collision-prone than the 4 chars of a CRC32? If there's no big difference, than I would go with CRC because it's orders of magnitude faster. In particular, String::CRC32's calculation is only three times slower than accessing a hash value with a hash key in a basic Perl hash. One word of warning: Digest::CRC is more than 1000x slower than String::CRC32, so don't use it unless you have to!
--MartinOn Sun, Mar 14, 2010 at 8:38 AM, <haaarg@gmail.com> wrote:haarg wrote:
I don't really know how to compare the algorithms for your usage, but
SHA-1 is a strong enough algorithm you may be able to use it and only
take the first 32 bits and use it for your identifier.
In git, every object is identified by a SHA-1 hash. In many places it
will display shortened forms of the identifiers that are 6 or 7
characters long instead because it is still good enough to identify
the object.
Madison Area Perl Mongers - MadMongers
http://www.madmongers.org
Madison Area Perl Mongers - MadMongers
http://www.madmongers.org
Re: CRC question
User:
mcholste
Date: 3/14/2010 3:28 pm
Date: 3/14/2010 3:28 pm
Views: 198
Rating: 0
Rating: 0
Ah, good answer, that's what I was trying to figure out. Right now, I've only got 10^3 records and I only anticipate a max of 10^4, though I can't guarantee that. I developed a full transaction locking system for the auto-incrementing version, but the problem is that I couldn't solve the issue for when only when node in a cluster is up and it is brand-new (as in, it can't consult its colleagues for their previous responses). The easy answer is "don't do that" but I'm convinced that the odds of accidentally doing that in production are significantly higher than the odds of getting one of the 100 collisions likely in the CRC scenario.
So, here's the follow-up question: what do I do when I get a collision? I guess I can arbitrarily set an ID to a pre-known, static value in a "reserved" range, but that's fodder for collision as well, so it's not a perfect solution. The point is that the procedure for setting the arbitrary value must be deterministic or I lose the entire advantage of the no-comm cluster that the hashing algorithm provides. Another imperfect solution would be to append some char to the string so that it isn't the same string anymore and then re-CRC it, but that's obviously imperfect as well. On the other hand, maybe this will work:
$orig_str = $string;
my $crc = crc($string);
while ( has_collision($crc) ){
$string .= "_";
$crc = crc($string);
}
db_insert($crc, $orig_str);
That way, even though the CRC is done against a modified string, all nodes will modify it the same, predictable way, and so they will all arrive at the same conclusion and the original string still gets stored in the db. The modification will continue until a non-collision occurs, which would almost always be in the first try. Does anyone see a flaw with that plan?
So, here's the follow-up question: what do I do when I get a collision? I guess I can arbitrarily set an ID to a pre-known, static value in a "reserved" range, but that's fodder for collision as well, so it's not a perfect solution. The point is that the procedure for setting the arbitrary value must be deterministic or I lose the entire advantage of the no-comm cluster that the hashing algorithm provides. Another imperfect solution would be to append some char to the string so that it isn't the same string anymore and then re-CRC it, but that's obviously imperfect as well. On the other hand, maybe this will work:
$orig_str = $string;
my $crc = crc($string);
while ( has_collision($crc) ){
$string .= "_";
$crc = crc($string);
}
db_insert($crc, $orig_str);
That way, even though the CRC is done against a modified string, all nodes will modify it the same, predictable way, and so they will all arrive at the same conclusion and the original string still gets stored in the db. The modification will continue until a non-collision occurs, which would almost always be in the first try. Does anyone see a flaw with that plan?
On Sun, Mar 14, 2010 at 11:33 AM, <chris@chrisdolan.net> wrote:
chrisdolan wrote:
You can calculate the best-case collision likelihood by assuming a perfect hashing function and using via the Birthday Paradox: http://en.wikipedia.org/wiki/Birthday_problemIf you expect 10^6 records in your database, then there are about 10^6 * 10^6 / 2 =~= 5 * 10^11 collision opportunities. A 32-bit hash has 4 * 10^9 possible values. So that means you should expect there to be about 100 collisions. If your database grows by another factor of 10 records, then the average number of collisions goes up by 100.So, if you're using only 32 bits for the hash then you application needs to plan for ID collisions. On the other hand, if you stick with synchronized auto incrementing (or assigned pools of auto-increment IDs per node perhaps) then you can get the full 4 * 10^9 records in the database before you get a collision.ChrisOn Mar 14, 2010, at 12:04 PM, <madtalk@madmongers.org> <madtalk@madmongers.org> wrote:mcholste wrote:
Ok, was the list server down? I sent my original question back on Thursday and it's just popping up now...
That's a good point about not needing all of the hash to get what I need out of it. So, follow-up, would 4 chars of SHA1 be less collision-prone than the 4 chars of a CRC32? If there's no big difference, than I would go with CRC because it's orders of magnitude faster. In particular, String::CRC32's calculation is only three times slower than accessing a hash value with a hash key in a basic Perl hash. One word of warning: Digest::CRC is more than 1000x slower than String::CRC32, so don't use it unless you have to!
--MartinOn Sun, Mar 14, 2010 at 8:38 AM, <haaarg@gmail.com> wrote:haarg wrote:
I don't really know how to compare the algorithms for your usage, but
SHA-1 is a strong enough algorithm you may be able to use it and only
take the first 32 bits and use it for your identifier.
In git, every object is identified by a SHA-1 hash. In many places it
will display shortened forms of the identifiers that are 6 or 7
characters long instead because it is still good enough to identify
the object.
Madison Area Perl Mongers - MadMongers
http://www.madmongers.org
Madison Area Perl Mongers - MadMongers
http://www.madmongers.org
Madison Area Perl Mongers - MadMongers
http://www.madmongers.org
Re: CRC question
User:
chrisdolan
Date: 3/15/2010 6:27 pm
Date: 3/15/2010 6:27 pm
Views: 226
Rating: 0
Rating: 0
Chris
On Mar 14, 2010, at 3:28 PM, <madtalk@madmongers.org> <madtalk@madmongers.org> wrote:
mcholste wrote:
Ah, good answer, that's what I was trying to figure out. Right now, I've only got 10^3 records and I only anticipate a max of 10^4, though I can't guarantee that. I developed a full transaction locking system for the auto-incrementing version, but the problem is that I couldn't solve the issue for when only when node in a cluster is up and it is brand-new (as in, it can't consult its colleagues for their previous responses). The easy answer is "don't do that" but I'm convinced that the odds of accidentally doing that in production are significantly higher than the odds of getting one of the 100 collisions likely in the CRC scenario.
So, here's the follow-up question: what do I do when I get a collision? I guess I can arbitrarily set an ID to a pre-known, static value in a "reserved" range, but that's fodder for collision as well, so it's not a perfect solution. The point is that the procedure for setting the arbitrary value must be deterministic or I lose the entire advantage of the no-comm cluster that the hashing algorithm provides. Another imperfect solution would be to append some char to the string so that it isn't the same string anymore and then re-CRC it, but that's obviously imperfect as well. On the other hand, maybe this will work:
$orig_str = $string;
my $crc = crc($string);
while ( has_collision($crc) ){
$string .= "_";
$crc = crc($string);
}
db_insert($crc, $orig_str);
That way, even though the CRC is done against a modified string, all nodes will modify it the same, predictable way, and so they will all arrive at the same conclusion and the original string still gets stored in the db. The modification will continue until a non-collision occurs, which would almost always be in the first try. Does anyone see a flaw with that plan?
On Sun, Mar 14, 2010 at 11:33 AM, <chris@chrisdolan.net> wrote:chrisdolan wrote:
You can calculate the best-case collision likelihood by assuming a perfect hashing function and using via the Birthday Paradox: http://en.wikipedia.org/wiki/Birthday_problemIf you expect 10^6 records in your database, then there are about 10^6 * 10^6 / 2 =~= 5 * 10^11 collision opportunities. A 32-bit hash has 4 * 10^9 possible values. So that means you should expect there to be about 100 collisions. If your database grows by another factor of 10 records, then the average number of collisions goes up by 100.So, if you're using only 32 bits for the hash then you application needs to plan for ID collisions. On the other hand, if you stick with synchronized auto incrementing (or assigned pools of auto-increment IDs per node perhaps) then you can get the full 4 * 10^9 records in the database before you get a collision.ChrisOn Mar 14, 2010, at 12:04 PM, <madtalk@madmongers.org> <madtalk@madmongers.org> wrote:mcholste wrote:
Ok, was the list server down? I sent my original question back on Thursday and it's just popping up now...
That's a good point about not needing all of the hash to get what I need out of it. So, follow-up, would 4 chars of SHA1 be less collision-prone than the 4 chars of a CRC32? If there's no big difference, than I would go with CRC because it's orders of magnitude faster. In particular, String::CRC32's calculation is only three times slower than accessing a hash value with a hash key in a basic Perl hash. One word of warning: Digest::CRC is more than 1000x slower than String::CRC32, so don't use it unless you have to!
--MartinOn Sun, Mar 14, 2010 at 8:38 AM, <haaarg@gmail.com> wrote:haarg wrote:
I don't really know how to compare the algorithms for your usage, but
SHA-1 is a strong enough algorithm you may be able to use it and only
take the first 32 bits and use it for your identifier.
In git, every object is identified by a SHA-1 hash. In many places it
will display shortened forms of the identifiers that are 6 or 7
characters long instead because it is still good enough to identify
the object.
Madison Area Perl Mongers - MadMongers
http://www.madmongers.org
Madison Area Perl Mongers - MadMongers
http://www.madmongers.org
Madison Area Perl Mongers - MadMongers
http://www.madmongers.org
Madison Area Perl Mongers - MadMongers
http://www.madmongers.org
Re: CRC question
User:
mcholste
Date: 3/16/2010 1:05 pm
Date: 3/16/2010 1:05 pm
Views: 0
Rating: 0
Rating: 0
That's the exact setup I use for the primary key on the logs as they flow in. Unfortunately, that doesn't work on assigning ID's to the program string in the logs because data is flowing in a round-robin fashion, so all nodes could receive the same string that needs an ID, and the map/reduce aggregation (group by) functions require that the ID for a given string be identical across all nodes.
On Mon, Mar 15, 2010 at 6:27 PM, <chris@chrisdolan.net> wrote:
chrisdolan wrote:
Why not just segment the ID space into a number of nodes? If you anticipate 10 nodes, then be conservative and divide the 32 bits into 128 domains and each node can increment in its own domain. Unless you anticipate a huge increase in node count, or a transient worker pool, then this approach seems much simpler.ChrisOn Mar 14, 2010, at 3:28 PM, <madtalk@madmongers.org> <madtalk@madmongers.org> wrote:mcholste wrote:
Ah, good answer, that's what I was trying to figure out. Right now, I've only got 10^3 records and I only anticipate a max of 10^4, though I can't guarantee that. I developed a full transaction locking system for the auto-incrementing version, but the problem is that I couldn't solve the issue for when only when node in a cluster is up and it is brand-new (as in, it can't consult its colleagues for their previous responses). The easy answer is "don't do that" but I'm convinced that the odds of accidentally doing that in production are significantly higher than the odds of getting one of the 100 collisions likely in the CRC scenario.
So, here's the follow-up question: what do I do when I get a collision? I guess I can arbitrarily set an ID to a pre-known, static value in a "reserved" range, but that's fodder for collision as well, so it's not a perfect solution. The point is that the procedure for setting the arbitrary value must be deterministic or I lose the entire advantage of the no-comm cluster that the hashing algorithm provides. Another imperfect solution would be to append some char to the string so that it isn't the same string anymore and then re-CRC it, but that's obviously imperfect as well. On the other hand, maybe this will work:
$orig_str = $string;
my $crc = crc($string);
while ( has_collision($crc) ){
$string .= "_";
$crc = crc($string);
}
db_insert($crc, $orig_str);
That way, even though the CRC is done against a modified string, all nodes will modify it the same, predictable way, and so they will all arrive at the same conclusion and the original string still gets stored in the db. The modification will continue until a non-collision occurs, which would almost always be in the first try. Does anyone see a flaw with that plan?
On Sun, Mar 14, 2010 at 11:33 AM, <chris@chrisdolan.net> wrote:chrisdolan wrote:
You can calculate the best-case collision likelihood by assuming a perfect hashing function and using via the Birthday Paradox: http://en.wikipedia.org/wiki/Birthday_problemIf you expect 10^6 records in your database, then there are about 10^6 * 10^6 / 2 =~= 5 * 10^11 collision opportunities. A 32-bit hash has 4 * 10^9 possible values. So that means you should expect there to be about 100 collisions. If your database grows by another factor of 10 records, then the average number of collisions goes up by 100.So, if you're using only 32 bits for the hash then you application needs to plan for ID collisions. On the other hand, if you stick with synchronized auto incrementing (or assigned pools of auto-increment IDs per node perhaps) then you can get the full 4 * 10^9 records in the database before you get a collision.ChrisOn Mar 14, 2010, at 12:04 PM, <madtalk@madmongers.org> <madtalk@madmongers.org> wrote:mcholste wrote:
Ok, was the list server down? I sent my original question back on Thursday and it's just popping up now...
That's a good point about not needing all of the hash to get what I need out of it. So, follow-up, would 4 chars of SHA1 be less collision-prone than the 4 chars of a CRC32? If there's no big difference, than I would go with CRC because it's orders of magnitude faster. In particular, String::CRC32's calculation is only three times slower than accessing a hash value with a hash key in a basic Perl hash. One word of warning: Digest::CRC is more than 1000x slower than String::CRC32, so don't use it unless you have to!
--MartinOn Sun, Mar 14, 2010 at 8:38 AM, <haaarg@gmail.com> wrote:haarg wrote:
I don't really know how to compare the algorithms for your usage, but
SHA-1 is a strong enough algorithm you may be able to use it and only
take the first 32 bits and use it for your identifier.
In git, every object is identified by a SHA-1 hash. In many places it
will display shortened forms of the identifiers that are 6 or 7
characters long instead because it is still good enough to identify
the object.
Madison Area Perl Mongers - MadMongers
http://www.madmongers.org
Madison Area Perl Mongers - MadMongers
http://www.madmongers.org
Madison Area Perl Mongers - MadMongers
http://www.madmongers.org
Madison Area Perl Mongers - MadMongers
http://www.madmongers.org
Madison Area Perl Mongers - MadMongers
http://www.madmongers.org