Pascal Newsletter #3
INDEX
1. A FEW WORDS FROM THE EDITOR
2. FIND FILE: ADDING A CONTEXT MENU
3. PORTING ISSUES: UTF-8 STRINGS
String types in Delphi
MultiByte Character Strings (MBCS) in Windows
Length of an ANSI string
Introduction to UTF-8 (UCS Transformation Format)
UTF-8 encoding
Length of a UTF-8 string
4. LINKS
________________________________________________________________________
1. A FEW WORDS FROM THE EDITOR
If you visited our web site in the last week or two, you must have seen
the new look. If not, then take a look at
http://www.latiumsoftware.com/en/index.php
I would like to thank a friend of mine for this nice job. Please report
any trouble you have viewing it with your browser.
We have added a few articles about Delphi. Most of them have been
featured in past issues of this newsletter (and its predecessors), but
there are also new things:
Determining the short name (DOS name) of a file
http://www.latiumsoftware.com/en/delphi/00007.php
Accessing hidden properties
http://www.latiumsoftware.com/en/delphi/00008.php
Adding new methods and properties without registering new components
http://www.latiumsoftware.com/en/delphi/00009.php
We will keep you informed about new additions to the site.
Please remember that if you have doubts or questions regarding the
articles of this newsletter, or any other question about Delphi
programming, you can post them to our mailing list.
We would like to hear about your programming needs to see if we can
cover them in this newsletter.
Regards,
Ernesto De Spirito
eds2008 @ latiumsoftware.com
________________________________________________________________________
2. FIND FILE: ADDING A CONTEXT MENU
In this article we continue building our Find File application we
started in the former Delphi Newsletter. This time we are going to add
a context menu to the file list, so the user can choose to open the file
or the folder where the file is located.
Adding a context menu to a control is easy: 1) Drop a TPopupMenu
component on the form; 2) Edit its Items property adding menu items with
their corresponding OnClick event handlers; and 3) Assign the menu
object to the control by setting the PopupMenu property of the control.
This way, the context menu will pop up whenever the user right-clicks on
the control (or presses the Apps key on Windows 95 keyboards).
For our purpose we are going to make it a little bit more difficult by
skipping step 3) and calling the popup menu using the Popup method when
we need it. We have to do it "by hand" because we need to differentiate
between keyboard and mouse invocation, basically to determine which file
or folder is the one we should open. We would also like to set the
default menu option to be "Open" if the user right-clicks on a filename,
or "Open folder" if the user right-clicks on a folder.
Ok, enough introduction. Now let's work! Drop a TPopupMenu component on
the form and edit its Items property adding two menu items with the
following properties:
  1) Name    = 'Open1'           2) Name    = 'OpenFolder1'
     Caption = 'Open'               Caption = 'Open folder'
     OnClick = 'Open1Click'         OnClick = 'OpenFolder1Click'
Add the following code to the event handlers:
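  procedure TForm1.Open1Click(Sender: TObject);
  begin
    // Nothing to open if the menu was invoked without a selected item
    if SelectedItem = nil then Exit;
    // SubItems[0] is used as the folder path and Caption as the file name.
    // ShellExecute returns a value of 32 or less when it fails.
    if ShellExecute(Self.Handle, nil,
      PChar(SelectedItem.SubItems.Strings[0] + SelectedItem.Caption),
      nil, nil, SW_SHOWMAXIMIZED) <= 32 then begin
      Application.MessageBox(cstrCouldNotExecApp,
        'Error', MB_ICONEXCLAMATION);
    end; // if
  end;

  procedure TForm1.OpenFolder1Click(Sender: TObject);
  begin
    // Nothing to open if the menu was invoked without a selected item
    if SelectedItem = nil then Exit;
    // The 'explore' verb opens the folder in an Explorer window
    if ShellExecute(Self.Handle, 'explore',
      PChar(SelectedItem.SubItems.Strings[0]),
      nil, nil, SW_SHOWMAXIMIZED) <= 32 then begin
      Application.MessageBox(cstrCouldNotExecApp,
        'Error', MB_ICONEXCLAMATION);
    end; // if
  end;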
These methods open a file and a directory respectively, exactly as we
have seen in past issues. The only difference is that we assume
SelectedItem is a variable or property of type TListItem that references
the item in the TListView object (ListView1) that was selected before
calling up the menu. So, the first thing to do is declare SelectedItem.
We have declared it as a private field of the form:
  type
    TForm1 = class(TForm)
      ...
    private
      { Private declarations }
      SelectedItem: TListItem;
      ...
Now we should capture the OnMouseDown and OnKeyDown events of ListView1
to set the value of SelectedItem and invoke the popup:
  procedure TForm1.ListView1MouseDown(Sender: TObject;
    Button: TMouseButton; Shift: TShiftState; X, Y: Integer);
  var
    Col: Integer;
  begin
    Last.X := X;
    Last.Y := Y;
    if Shift = [ssRight] then begin
      // Find the item (and column) under the mouse pointer
      SelectedItem := TListViewX(ListView1).GetItemAtX(X, Y, Col);
      // Show the menu only if an item was actually hit; otherwise the
      // menu handlers would dereference a nil SelectedItem
      if SelectedItem <> nil then begin
        if Col <= 1 then
          PopupMenu1.Items[Col].Default := True;
        PopupMenu1.Popup(
          Left + ListView1.Left + X + 10,
          Top + ListView1.Top + Y + 20);
      end;
    end;
  end;
  procedure TForm1.ListView1KeyDown(Sender: TObject; var Key: Word;
    Shift: TShiftState);
  begin
    // The Apps key or Shift+F10 invokes the context menu
    if (Key = VK_APPS) or ((Shift = [ssShift]) and (Key = VK_F10))
    then begin
      SelectedItem := ListView1.ItemFocused;
      if SelectedItem <> nil then begin
        PopupMenu1.Items[0].Default := True;
        PopupMenu1.Popup(
          Left + ListView1.Left + SelectedItem.Position.X + 20,
          Top + ListView1.Top + SelectedItem.Position.Y + 35);
      end;
    end;
  end;
The Popup method expects the coordinates of the menu. These coordinates
are relative to the screen, so to the coordinates of the focused item or
the mouse position (relative to the control) we add the coordinates of
the form and the control (to make them relative to the screen), plus a
little offset. And that's it! Now you can try it...
As usual, the full source code is available at our site:
http://www.latiumsoftware.com/en/file.php?id=p03
________________________________________________________________________
3. PORTING ISSUES: UTF-8 STRINGS
This article is aimed mainly at programmers who plan to target the Linux
environment. It presents some of the differences in string processing
that will exist between Windows and Linux.
String types in Delphi
======================
A string (as you probably know by now :-) is a sequence of characters.
Delphi has three types of strings:
* Short strings
Short strings are declared using the ShortString type. This string
type comes from the old times of Turbo Pascal and is supported for
backwards compatibility. A short string variable normally uses 256
bytes in total, although its length (stored in the first byte) can
vary from 0 to 255.
For example:
  var
    s: shortstring;
  begin
    s := 'Hello!';
The string s takes 256 bytes. s[0] holds the length of the string, so
in the example its value would be #6. You should not access s[0]
directly; use Length and SetLength instead. s[1] is the first
character ('H'), s[2] is the second character ('e'), and so on.
From s[7] to s[255] the values would be undefined.
* ANSI strings
Usually called "long strings", ANSI strings are declared using the
AnsiString type. ANSI strings are actually pointers to a data
structure consisting of two integers (that hold the length of the
string and the reference count) and the sequence of bytes allocated
for the string, which can range from 1 byte to almost 2 GB (provided
you have enough memory).
For example:
  var
    s: ansistring;
  begin
    s := 'Hello!';
The variable s itself takes 4 bytes (a 32-bit pointer). The data
structure it points to takes 8 bytes for the two integers, in this
case 6 bytes for the 6 characters, and one more byte for the null
terminator that follows the characters, giving 15 bytes in total. Like
with the short string, s[1] is the first character ('H'), and so on.
* Wide strings
Wide strings, also named UNICODE strings, are special strings where
each character (of type WideChar) takes two bytes (a word). In the
UNICODE character set, the first 256 values correspond to the Latin-1
(ISO 8859-1) character set. Wide strings are pointers, like ANSI
strings, but they are not reference counted, so when you make an
assignment between two wide-string variables, the string is actually
copied (in the case of ANSI strings only the reference count is
incremented). This makes them inefficient in comparison, but the COM
and OLE APIs use this type of string, and so do ActiveX objects.
For example:
  var
    s: widestring;
  begin
    s := 'Hello!';
Here, the variable s takes 4 bytes for the pointer, and the data
structure takes 4 bytes for the length, 12 bytes for the 6
characters (2 bytes each) and 2 bytes for the null terminator, giving
18 bytes in total. s[1] is the first character ('H'), except it is of
type WideChar instead of AnsiChar and takes two bytes instead of one.
s[2] is the second character ('e') and starts in the third byte (the
first two bytes are for s[1]).
The type String is mapped by default to AnsiString. Char is mapped to
AnsiChar, and PChar is mapped to PAnsiChar.
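The sizes discussed above can be checked with a small console program.
This is only a sketch: the program and variable names are made up for
the illustration, and it assumes Delphi or Free Pascal in Delphi mode.
Note that the size of the AnsiString variable is the size of a pointer,
so it is 4 only on a 32-bit target.

```pascal
program StringSizes;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

var
  ss: ShortString;
  a: AnsiString;
  w: WideString;
begin
  ss := 'Hello!';
  a := 'Hello!';
  w := 'Hello!';
  WriteLn(SizeOf(ss));   // 256: the whole buffer, whatever the content
  WriteLn(Length(ss));   // 6: the value kept in the length byte
  WriteLn(SizeOf(a));    // size of a pointer on the target platform
  WriteLn(Length(a));    // 6: one byte per character here
  WriteLn(Length(w) * SizeOf(WideChar));  // 12: two bytes per character
end.
```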
MultiByte Character Strings in Windows (MBCS)
=============================================
When working with Ansi strings, normally we consider that each character
occupies one byte, which is true for Western European languages, but for
most Asian languages, 256 characters are simply not enough.
A possible solution is using wide strings, and another solution is
encoding some characters in one byte and others in two (DBCS: Double-
Byte Character Strings). For this to work, there must be a way to know
whether a byte in a string is a character, or is the "lead byte" of a
two byte character. Delphi defines a character set named LeadBytes that
contains the characters that are lead bytes in the current Windows
locale. For Western locales, this set is empty (there are no lead bytes
since there is an equivalence between bytes and characters), and in
general for other locales, if the value of the byte ranges from 0 to 127
it is an ASCII character, and if it is greater than 127, then it is a
lead byte and the next byte is called the "trail byte" (it may range
from 0 to 255).
For reasons of efficiency and backwards compatibility, Delphi comes with
different versions of string functions for SBCS (Single-Byte Character
Strings) and DBCS. For SBCS (one byte = one character) there is no point
in going through the overhead of checking whether each byte is a
character or a lead byte (since there are no lead bytes), so for SBCS
you can use the standard functions like Pos, LowerCase, etc., while for
DBCS you should use functions like AnsiPos, AnsiLowerCase, etc., which
take into account that some characters may be represented by more than
one byte (and thus these functions are slower).
Length of an ANSI string
========================
Indexing a DBCS string can be tricky, since s[i] represents the i-th
byte, not necessarily the i-th character, because previous characters
could have had two bytes. The number of bytes returned by the Length
function may or may not be the actual number of characters contained in
a DBCS string. To determine this number you can use a function like the
following:
  function AnsiLength(const s: string): integer;
  var
    i, n: integer;
  begin
    Result := 0;
    n := Length(s);
    i := 1;
    while i <= n do begin
      inc(Result);
      // A lead byte is followed by a trail byte: skip both
      if s[i] in LeadBytes then inc(i);
      inc(i);
    end;
  end;
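Since LeadBytes depends on the Windows locale, the function above
returns plain byte counts on a Western system. To see the effect
anywhere, here is a sketch of the same loop with the lead bytes passed
in explicitly. The names DbcsLength, TLeadByteSet, BytesToStr and
SjisLeads are made up for this example, and the lead byte ranges are
those of the Japanese Shift-JIS code page; it assumes Free Pascal in
Delphi mode or a Delphi version where character constants are single
bytes.

```pascal
program DbcsLengthDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

type
  TLeadByteSet = set of AnsiChar;

const
  // Lead byte ranges of the Shift-JIS code page
  SjisLeads: TLeadByteSet = [#$81..#$9F, #$E0..#$FC];

// Builds a string from raw byte values
function BytesToStr(const Bytes: array of Byte): AnsiString;
var
  k: Integer;
begin
  SetLength(Result, High(Bytes) + 1);
  for k := 0 to High(Bytes) do
    Result[k + 1] := AnsiChar(Bytes[k]);
end;

// Same loop as AnsiLength, with an explicit set of lead bytes
function DbcsLength(const s: AnsiString;
  const Leads: TLeadByteSet): Integer;
var
  i, n: Integer;
begin
  Result := 0;
  n := Length(s);
  i := 1;
  while i <= n do begin
    Inc(Result);
    if s[i] in Leads then Inc(i);  // skip the trail byte too
    Inc(i);
  end;
end;

var
  s: AnsiString;
begin
  // 'A' followed by two double-byte characters ($93 $FA and $96 $7B)
  s := BytesToStr([$41, $93, $FA, $96, $7B]);
  WriteLn(Length(s));                 // 5 bytes
  WriteLn(DbcsLength(s, SjisLeads));  // 3 characters
end.
```

Notice that the last trail byte ($7B) falls in the ASCII range, which
is why a DBCS string cannot be scanned reliably byte by byte.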
Introduction to UTF-8 (UCS Transformation Format)
=================================================
Windows can work with Unicode strings, as well as SBCS and DBCS, but the
Linux kernel works with UTF-8 strings, where one character may take up
to six bytes in the original definition (the current standard, RFC 3629,
limits this to four). Normally a character takes one or two bytes in
Western languages and from one to three in Asian languages. UTF-8 is a
multibyte character encoding that can accommodate all the characters of
the UCS (Universal Character Set), which contains 31-bit characters that
can represent practically all the characters of known languages living
and dead, including scripts like Hiragana, Katakana, etc. It also leaves
space for more languages, scripts and hieroglyphics, so in the future we
can expect to be able to read Klingon poetry, the Ferengi Acquisition
Rules and Bajoran prophecies in their original versions... :-)
UTF-8 has these important features:
* Variable-length encoding for UCS characters
UTF-8 can encode UCS (ISO 10646) characters in up to 6 bytes.
* Transparency and uniqueness for ASCII characters
7-bit ASCII characters (#0..#127) are encoded as plain 7-bit ASCII
(1 byte per character). All non-ASCII characters (above #127) are
represented purely with non-ASCII 8-bit values (#128..#255) so that
non-ASCII characters cannot be mistaken for ASCII characters, and
ASCII-based text processing tools can be used on UTF-8 text as long
as they pass 8-bit characters without interpretation.
* Null character
Character #0 (ASCII NULL) only appears where a NULL is intended. It
can't be a trail byte for instance.
* Self-synchronization for fast speed processing
High bit patterns disambiguate character boundaries and make it easy
to know whether a byte is a single-byte character (0xxxxxxx), a lead
byte (11xxxxxx) or a fill byte (10xxxxxx). This feature is very
important because it allows UTF-8 string processing functions to be
far more efficient than their Windows DBCS counterparts. For example,
a UTF-8 string can be parsed backwards, and a search for a multibyte
character beginning with a lead byte will never match on a fill byte
in the middle of an unwanted multibyte character. And as the lead byte
announces the length of the multibyte character, you can quickly tell
how many bytes to skip for fast forward parsing.
* Processor-friendliness
UTF-8 can be read and written quickly with simple bitmask and bitshift
operations, without any multiplication or division (which are slow CPU
operations).
* Reasonable compression
UTF-8 is not as compact as Windows DBCS, but for Western languages it
is better than Unicode, and in the worst case (Eastern languages) it
is no worse than UCS-4.
* Canonical sort-order
UTF-8 preserves the sort ordering for plain 8-bit comparison routines
like strcmp (a C standard function).
* Flag characters
The octets #$FE and #$FF never appear, so you can use them as flags
to signal a special meaning (avoiding the possibility of mistaking a
flag with a real character).
* Detectability
It's easy to detect a UTF-8 input with high probability if you see
the UTF-8 signature #$EF#$BB#$BF ('ï»¿' when viewed as Latin-1) or if
you see valid UTF-8 multibyte characters, since it is very unlikely
that they accidentally appear in ISO 8859-1 (Latin-1) text.
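To illustrate the self-synchronization feature, here is a sketch of
how a UTF-8 string could be walked backwards: from any position we step
back over fill bytes (10xxxxxx) until we reach a byte that starts a
character. PrevCharStart and BytesToStr are names made up for this
example, and the input is assumed to be well-formed UTF-8.

```pascal
program Utf8BackwardsDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

// Builds a string from raw byte values
function BytesToStr(const Bytes: array of Byte): AnsiString;
var
  k: Integer;
begin
  SetLength(Result, High(Bytes) + 1);
  for k := 0 to High(Bytes) do
    Result[k + 1] := AnsiChar(Bytes[k]);
end;

// Returns the index where the character before position i starts
// (i may be Length(s) + 1 to begin at the end of the string)
function PrevCharStart(const s: AnsiString; i: Integer): Integer;
begin
  Result := i - 1;
  // Fill bytes have the bit pattern 10xxxxxx
  while (Result > 1) and ((Ord(s[Result]) and $C0) = $80) do
    Dec(Result);
end;

var
  s: AnsiString;
  i: Integer;
begin
  // 'a' (1 byte), copyright sign (2 bytes), euro sign (3 bytes)
  s := BytesToStr([$61, $C2, $A9, $E2, $82, $AC]);
  i := Length(s) + 1;
  while i > 1 do begin
    i := PrevCharStart(s, i);
    WriteLn(i);  // prints 4, then 2, then 1
  end;
end.
```

The same walk is not possible with DBCS, where a trail byte can have
the same value as a lead byte or as an ASCII character.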
UTF-8 encoding
==============
This is the general format used to encode UCS characters in UTF-8:
  Bits  Bytes  Representation
     7      1  0xxxxxxx
    11      2  110xxxxx 10xxxxxx
    16      3  1110xxxx 10xxxxxx 10xxxxxx
    21      4  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    26      5  111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    31      6  1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
Notice that the number of leading 1 bits in the lead byte is the number
of bytes in a multibyte sequence.
The copyright sign ('©' = #169 = #$A9) in binary would be 10101001 and
since it needs 8 bits, we would have to use two bytes:
  110xxxxx 10xxxxxx
We have to fill 11 bits (x), so we add three zeroes to the left of
10101001:
     00010   101001
The UTF-8 representation for the copyright character would then be:
  11000010 10101001
It could also be represented with more bytes than needed in an overlong
sequence. For example with four bytes it would be:
  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
       000   000000   000010   101001
  -----------------------------------
  11110000 10000000 10000010 10101001
Overlong sequences are usually used to "camouflage" characters to cheat
UTF-8 substring tests. For example, if you look for the copyright sign
exactly as 11000010 10101001 (the shortest possible encoding), then you
won't find a copy of it that was encoded in an overlong form. For this
reason, decoders should treat overlong sequences as invalid.
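The encoding steps above can be sketched in code. EncodeUtf8 and ToHex
are names made up for this example; the function follows the six-row
table given in this article and always picks the shortest form. It is
an illustration under those assumptions, not a validated encoder.

```pascal
program Utf8EncodeDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Encodes one UCS character code using the shortest form in the table
function EncodeUtf8(Code: Cardinal): AnsiString;
var
  Count, i: Integer;
  Lead: Cardinal;
begin
  if Code < $80 then Count := 1             // up to 7 bits
  else if Code < $800 then Count := 2       // up to 11 bits
  else if Code < $10000 then Count := 3     // up to 16 bits
  else if Code < $200000 then Count := 4    // up to 21 bits
  else if Code < $4000000 then Count := 5   // up to 26 bits
  else Count := 6;                          // up to 31 bits
  SetLength(Result, Count);
  if Count = 1 then
    Result[1] := AnsiChar(Byte(Code))
  else begin
    // Fill bytes, right to left: 10xxxxxx with six bits each
    for i := Count downto 2 do begin
      Result[i] := AnsiChar(Byte($80 or (Code and $3F)));
      Code := Code shr 6;
    end;
    // Lead byte: Count leading 1 bits, a 0 bit, then the bits left over
    Lead := (Cardinal($FF00) shr Count) and $FF;
    Result[1] := AnsiChar(Byte(Lead or Code));
  end;
end;

// Shows the bytes of a string in hexadecimal
function ToHex(const s: AnsiString): string;
var
  i: Integer;
begin
  Result := '';
  for i := 1 to Length(s) do begin
    if i > 1 then Result := Result + ' ';
    Result := Result + IntToHex(Ord(s[i]), 2);
  end;
end;

begin
  WriteLn(ToHex(EncodeUtf8($41)));    // 41 (plain ASCII 'A')
  WriteLn(ToHex(EncodeUtf8($A9)));    // C2 A9 (copyright sign)
  WriteLn(ToHex(EncodeUtf8($20AC)));  // E2 82 AC (euro sign)
end.
```

The second line matches the 11000010 10101001 worked out by hand for
the copyright sign.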
Length of a UTF-8 string
========================
In Delphi for Linux, long strings will be in UTF-8 format, while wide
strings will remain as two-byte Unicode, although they will be reference
counted. To know the number of characters stored in a UTF-8 string we
could use a function like the following:
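  // Requires SysUtils in the uses clause (for Exception)
  function UTF8Length(const s: string): integer;
  var
    i, n: integer;
    c: byte;
  begin
    Result := 0;
    n := Length(s);
    i := 1;
    while i <= n do begin
      inc(Result);
      c := byte(s[i]);
      // The lead byte tells how many bytes the character takes
      if (c and $80) = 0 then inc(i)              // 0xxxxxxx
      else if (c and $E0) = $C0 then inc(i, 2)    // 110xxxxx
      else if (c and $F0) = $E0 then inc(i, 3)    // 1110xxxx
      else if (c and $F8) = $F0 then inc(i, 4)    // 11110xxx
      else if (c and $FC) = $F8 then inc(i, 5)    // 111110xx
      else if (c and $FE) = $FC then inc(i, 6)    // 1111110x
      else
        // A fill byte, #$FE or #$FF cannot start a character
        raise Exception.Create('Not a UTF-8 string!');
    end;
    // The last character was cut short
    if i > n + 1 then
      raise Exception.Create('Not a UTF-8 string!');
  end;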
Of course this function should be written using pointers and a bit of
assembler to improve its performance, but let's leave that for the
pros... :)
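A simpler variation relies on the fact that every character has exactly
one byte that is not a fill byte (10xxxxxx), so counting those bytes
gives the number of characters. This is only a sketch: the names
CountUtf8Chars and BytesToStr are made up for the example, and unlike
UTF8Length it performs no validation, so it assumes the string is
well-formed UTF-8.

```pascal
program Utf8CountDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

// Builds a string from raw byte values
function BytesToStr(const Bytes: array of Byte): AnsiString;
var
  k: Integer;
begin
  SetLength(Result, High(Bytes) + 1);
  for k := 0 to High(Bytes) do
    Result[k + 1] := AnsiChar(Bytes[k]);
end;

// Counts the bytes that start a character (everything but 10xxxxxx)
function CountUtf8Chars(const s: AnsiString): Integer;
var
  i: Integer;
begin
  Result := 0;
  for i := 1 to Length(s) do
    if (Ord(s[i]) and $C0) <> $80 then Inc(Result);
end;

var
  s: AnsiString;
begin
  // 'a' (1 byte), copyright sign (2 bytes), euro sign (3 bytes)
  s := BytesToStr([$61, $C2, $A9, $E2, $82, $AC]);
  WriteLn(Length(s));          // 6 bytes
  WriteLn(CountUtf8Chars(s));  // 3 characters
end.
```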
________________________________________________________________________
4. LINKS
* Torry's Delphi Pages
http://www.torry.ru
* Delphi Programming Source Code
http://ssapcs.hispeed.com/index.html
* Swiss Delphi Center
http://www.swissdelphicenter.ch
* Delphi Downloads Web Page
http://members.xoom.com/sandbrook/downloads/Download.htm
* AlphaCom, Inc.
http://alphacom.hypermart.net
* Top Delphi Sites
http://ssapcs.hispeed.com/topsites/index.html
* Advanced Delphi Developer's Guide to ADO
http://delphi.about.com/od/productreviews/l/blbr1556227582.htm
* Central Iowa Delphi Users Group
http://www.bigcreek.com/delphi
* The Delphi Cafe
http://www.geocities.com/ResearchTriangle/6201
* Natalia Elmanova
http://www.geocities.com/SiliconValley/way/9281
________________________________________________________________________
YOU CAN HELP US
We need your help to keep this newsletter going and growing. You can
help by referring the newsletter to your colleagues:
http://www.latiumsoftware.com/en/pascal/delphi-newsletter.php
Or you can help by voting for us in some or all of these rankings to
give more visibility to our web site and thus increase the number of
subscriptions to this newsletter:
http://www.programmingpages.com/?r=latiumsoftwarecomenpascal
http://top100borland.com/in.php?who=20
It's just a few seconds for you that REALLY mean a lot to us.
________________________________________________________________________
If you haven't received the full source code examples for this issue,
you can get them from http://www.latiumsoftware.com/en/file.php?id=p03
________________________________________________________________________
This newsletter is provided "AS IS" without warranty of any kind. Its
use implies the acceptance of our licensing terms and disclaimer of
warranty you can read at http://www.latiumsoftware.com/en/legal.php
where you will also find a note about legal trademarks. Articles are
copyright of their respective authors and they are reproduced here with
their permission. You can redistribute this newsletter as long as you do
it in full (including copyright notices), without changes, and gratis.
________________________________________________________________________
Main page: http://www.latiumsoftware.com/en/pascal/delphi-newsletter.php
Group home page: http://groups.yahoo.com/group/pascal-newsletter/
Subscribe/join: pascal-newsletter-subscribe@yahoogroups.com
Unsubscribe/leave: pascal-newsletter-unsubscribe@yahoogroups.com
Problems with your subscription? eds2008 @ latiumsoftware.com
________________________________________________________________________
Latium Software http://www.latiumsoftware.com/en/index.php
Copyright (c) 2000 by Ernesto De Spirito. All rights reserved.
________________________________________________________________________