Sunday, March 25, 2012

Case Sensitive vs Insensitive

After all the pain I've been going through with code pages and collation, I was asked how, when sql server does it's joins and predicate searches, how does it actual (internals now) know the an "A" = "a" in an insensitive search?

I didn't have the answer.

Damn, Now I really have to pick up Kelans book.I've seen tricks suchs as uppercasing or lowercasing both prior the compare. I think formatting routines such as these can be very quick because they only have lookup's to do.|||Well, the questions stems from an insensetive server

You don't need to use CASE functions in this case

"A" = "a"
"A" = "A"
"a" = "A", and
"a" = "a"

Internally each has it's on ASCII represenation.

How does an insesitive ("the big jerk") server resolve this?|||Perhaps it "ands" 32 to both characters if either are in the ASCII range 65-90? Probably a number of ways of doing it. Is this one of those questions where an IT Director thinks he has the better of you, because he found a question you can not answer?|||Perhaps it "ands" 32 to both characters if either are in the ASCII range 65-90? Probably a number of ways of doing it. Is this one of those questions where an IT Director thinks he has the better of you, because he found a question you can not answer?

No, they don't know, and because DB2 OS/390 is case sensitive, he didn't understand...and because I've got sql boxes built six ways to sunday, it's been bugging me...and that's how it arose...

But now it's bugging me as well...|||helluva question - 1st i'd smack the guy who asked.

it's just like asking how does SQL server know 1 != 2?
not a technically valid question, IMHO, but thats way the cookie crumbles sometimes, i reckon.

my understanding is that collation is a 3 legged stool
the sort order + the code page + the character set or "dictionary".

if the sort order is binary, my personal favorite,
its pretty simple: the bit pattern mapped to 'a' is not the same as 'A'. what ends up writing to disk is simply as different between 'a' and 'A' as it is between '1' and '2'. it knows 'a' is not 'A' because they are completely differnt patterns stored on the disk.

i dont think i've ever totally understood how non-binary case-sensitive sort orders know the difference - but i do know it involves 'interrogating' the character set dictionary at a very very low level to get the info it needs - which is why it is a noticably slower way to store/join/compare your data.

both binary and dictionary sorts 'technically' use bit patterns, but sometimes a particular character set stored in non-binary can have funky characters and stuff that make it not "always" work.

in either scenario, depending on the level of the person asking the question - i'd think the bit pattern explanation is close enough for gov't work, after the back-hand, of course.|||I think ascii comparisons are from the old days. If case-insensitive is set, I would think it'll look at the codepage and have the codepage decide which character is its uppercase or lowercase equivalent.|||helluva question - 1st i'd smack the guy who asked.

Now do you seriously think I'd entertain this question from someone who didn't SIGN my check?

Anyway, it'a valid question, and from what ya'll (no I'm not from the south) have mentioned, it makes a lot of sense, and was thinking in those terms...

Still gotta google some details...

Thanks|||LOL - i know what you mean.
Gotta make that mortgage payment.|||The problem is that the actual process isn't that simple. There is bitmap that shows which charcters are mapped, and a hash of arrays that allows each unicode character set to "remap" characters as needed. Each array contains the hash signature, a value for comparison, and a value for sorting. Most characters don't have lookup entries in the hash, so they use the actual binary value.

Aren't you glad that you asked?

-PatP|||hey pat -
does the 'array' you refer to only store the unicode, and special non-unicode characters for languages that use them (tilde, etc)? or all characters?

sounds as though the premise that the code page has all the info in it and SQL services communicate with its contents for sorting and comparison rules is the basic idea for dictionary sorts - or it uses the binary value directly for binary sorts.

is that your take, at a high level?|||The hash is the actual lookup mechanism. The array that is returned from the hash contains values that are used for comparisons and sorting.

You can think of it something like this: When processing a character, it gets expanded to 16 bits if it wasn't already, then that 16 bit value is used as a lookup into the bitmask to see if this character gets special processing (most don't). If it needs processing, a hash lookup is done which returns the character itself, the value to use for comparison purposes, and a value to use for sorting purposes. If the character didn't need processing, it is copied into all three values in an empty array.

-PatP|||thanks!
'splains a little better why binary sorts are mo'better, IMO.
they need no translation and/or re-sorting.

of course, some users expect charater ordering to work like it did for them in 3rd grade, but, eh, NMFP. the database is faster, that IS my problem.

i'm starting to think i'm an evil dba...

:confused:|||Once you get past the surface...it's all about the internals and undersatanding how things work...

Thanks again Pat...

(He has a signed copy of Kelans book no doubt)|||quite true.
there's so much to learn i'll be busy the rest of my life.

might be useful for this thread and/or compiling a detailed answer for your bossman:

master..syscharsets is where SQL server goes to get the binary definitions for character sets and sort orders.

peace.|||(He has a signed copy of Kelans book no doubt)That I do! If you mean Kalen's Inside Microsoft SQL Server 2000 (http://search.barnesandnoble.com/booksearch/isbnInquiry.asp?userid=UM7Mwdc9jn&isbn=0735609985&itm=1).

-PatP

No comments:

Post a Comment